Craig Danton

Self-Sovereign Data (SoDa): The New Web3 Data Economy

craig-danton@newsletter.paragraph.com (Craig Danton) — Thu, 11 Nov 2021 17:03:32 GMT

This is not another post about the metaverse. Well, not exactly… but with the Meta rebrand solidifying the concept in the 2021 zeitgeist, there is a less explored and, I believe, critical concept from the same novel, Neal Stephenson’s “Snow Crash”, that has me thinking.

In “Snow Crash”, Hiro Protagonist, the aptly named protagonist of the book, side hustles as a “Stringer for CIC”: essentially a gig economy worker who collects digital intelligence and posts it to a massive data marketplace. Users of the metaverse can then search this library for any information they want.

This concept of a community “dataverse” where people or corporations freely share data is almost universal in sci-fi but is conspicuously missing from the real world. While the early internet gave us Wikipedia, it falls far short of a structured, real-time, global database of collective intelligence. All of which raises the question: if this is a prerequisite of any sufficiently advanced sci-fi culture, why isn’t there a billion-dollar company built around exactly this paradigm today?

The cynical answer is that this is not how mega-caps have become mega-caps to date. In fact, to the contrary, fortunes have been made by building the richest gated internal dataverses possible. What I’d like to posit as a slightly more balanced hypothesis is that we have lacked the economic incentive to fund it, the organizational structures to manage it, and the technology to build it.

An Introduction to SoDa

The precursor to any dataverse would be a vibrant “data economy”, a fluid market where data is exchanged. Today, internal spending on data dwarfs the market for external data, and most data-related value is concentrated in a few monopolies. The data economy is lagging far behind the rest of the data industry. Over the past few months, as I have descended the rabbit hole of blockchain and “Web3”, it has become increasingly clear that this is about to change.

Web3, the decentralized internet built on the back of blockchain technology, shifts ownership and power back to users and away from monolithic platforms (e.g., Facebook, Google, and Amazon). It is a world where people own portions of the platforms they use, earn money fairly for content they generate, and are no longer held hostage via their data. This newly freed data will also give rise to a new data economy, which I am calling Self-Sovereign Data, or SoDa for short.

SoDa begins with users claiming more rights over their data from large platforms, but also extends to the ways organizations will monetize and build around newly available data. While there are already many efforts that allow you to “get paid” for your data, I believe SoDa goes far beyond this very literal application. Data will emerge as the third major category of assets, next to physical and financial assets. It has the potential to bring the next wave of capital and users over to Web3 and represents the next “killer application” of blockchain, much in the way that DeFi (Decentralized Finance) was the first.

What does decentralization do for data?

There are three ways blockchain/Web3 could transform how we build a vibrant data economy:

1. **Ownership:** Data, like all digital assets, is incredibly hard to value given that it can be replicated at almost zero cost. Yet fundamental to any economy is a widely accepted mechanism for valuing an asset and tracking ownership. Fortunately, enforced digital scarcity is core to what made Bitcoin, and later every cryptocurrency, viable. By tokenizing access to data we can track ownership and lineage, and create open markets to help determine fair value.
2. **Organization:** One of the most striking achievements of Web3 has been the massive shift in how communities self-organize. Building on patterns pioneered in open source software, Web3 organizations are able to rapidly enlist communities to contribute in small increments. This is driven by tokens, which serve as both payment and shares of ownership and directly align incentives without the need for formal employment or contracts. These organizations could be immensely valuable for creating shared data protocols that incentivize the curators, maintainers, and contributors necessary for success, without creating the threat of lock-in.
3. **Technology:** One of the largest technical issues with data is that it is siloed. Siloed data has a tendency to diverge in structure, quality, and standards. The internet is the ultimate example of this: a networked array of data silos that leaves information trapped within each application and company. A better model would be a shared database, with tight paradigms for access control, that we all could draw from and contribute to. This would require large buy-in and incentives (see 1 and 2) and a massively distributed database. As Balaji Srinivasan’s essay “Yes, You May Need a Blockchain” explains, blockchains can behave as exactly that. While there are practical problems around scaling, it is a pattern that could change data engineering dramatically.

We’ll explore examples and applications of these forces later as we dive into the existing SoDa landscape, but let’s first look at how this could work.

An internet-native data layer

Until Bitcoin, payments and cash were not native to the internet. Instead, companies like Stripe built infrastructure to bridge the gap between banks and the internet. Financial applications were still beholden to the underlying infrastructure, which was slow, costly, and wrapped in a complex layer of regulation and bureaucracy. Ethereum introduced programmable contracts (smart contracts) with its own token (internet money), which created a modern, composable financial infrastructure. This unlocked an explosion of DeFi applications in the summer of 2020, dubbed “DeFi Summer”, which offered much more attractive financial options and drove massive global user adoption.

While an internet-native financial economy is an absolute prerequisite for a new internet-native data economy, it is not a complete solution in and of itself.

So what then are the building blocks of an internet-native data layer?

In his vision for Ocean Tokens, “From Money Legos to Data Legos”, Trent McConaghy, co-founder of Ocean Protocol, provides an incredible overview of how data can build on top of a structure similar to DeFi. Ocean and other early leaders in the space are beginning to converge around a number of key elements that compose this data layer:

1. **Self-Sovereign IDs:** A shared universal system for the identification of people, organizations and devices.
2. **Data Wallets:** Interfaces for the secure management of personal data assets.
3. **Protocols for Tokenization & Data Exchanges:** Agreed upon ways of allowing configurable access to data through tokens and a listing marketplace for those tokens.
4. **Secure Data Enclaves:** Neutral compute zones, which allow access for machines to run processes without transferring data or exposing its contents.
5. **Data Oracles:** The equivalent of data APIs for developers to access data on the blockchain.
6. **Data Unions (DAOs):** Decentralized autonomous organizations governing a contributory data network.

It’s useful to consider what a theoretical application of these elements working together might look like.

Snow Crash: A Theoretical Case Study for SoDa

YoursTruly, or “YT” for short, is a courier. She delivers high-value packages around the city, so when she broke her smart skateboard on her last delivery, she knew she needed a new one, ASAP.

She logs on to Sk8!, a skateboard e-commerce site.

The site pulls her universal public profile (1): her delivery addresses, language preferences, and “dark mode” HTML customizations. She overrides the standard delivery option from “next day” to “next hour”. The site requests access, through her wallet (2), to her relevant browsing history and financial transactions in order to give her board recommendations and financing options. She accepts, and the information is provisioned to a secure enclave (4), where Sk8! can run its recommendation algorithms and underwriting models without possessing the underlying data. She buys the board and accepts the financing. When she receives it, she connects its live location data to an oracle network (5), which allows competing delivery apps to see where she is and connect her with the next client. YT is also a member of the CourierDAO (6), a data union that pools movement data from couriers around the world and rents access to the dataset on a data exchange (3). Companies like Sk8! license the data to train their AI-driven R&D efforts, and YT gets her cut of that license revenue.
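To make the last step of that story concrete, here is a minimal sketch of how a data union like the fictional CourierDAO might split license revenue among its members. The pro-rata rule, the `split_revenue` function, and all member names and figures are assumptions made up for illustration; a real data union would define its own payout rules, most likely in a smart contract.

```python
def split_revenue(revenue_cents: int, contributions: dict) -> dict:
    """Split license revenue pro rata by records contributed (hypothetical rule).

    Amounts are integer cents. Any cents lost to rounding down are given to
    the largest contributor so the payouts always sum to the revenue.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("no contributions to pay out against")
    # Integer division rounds each member's share down to a whole cent.
    payouts = {
        member: revenue_cents * count // total
        for member, count in contributions.items()
    }
    remainder = revenue_cents - sum(payouts.values())
    top_contributor = max(contributions, key=contributions.get)
    payouts[top_contributor] += remainder
    return payouts


# Hypothetical month: Sk8! pays $1,000.00 to license the pooled movement data.
payouts = split_revenue(100_000, {"YT": 600, "courier_b": 300, "courier_c": 100})
print(payouts)
```

Working in integer cents avoids floating point drift, which matters once many small payouts have to add up to exactly the amount received.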

If that virtuous cycle seems too “out of this world”, consider that the Indian government began an ambitious project over a decade ago which has already begun to make a domestic ecosystem like this a reality.

“India Stack”: A Real World Case Study

In 2009, only 17% of India’s massive population was participating in the formal financial system. The hurdles to opening bank accounts and enrolling in digital payment ecosystems or debt markets were too great. The government saw this as a massive limit on its development potential and, to solve the problem, began implementing one of the most ambitious state-led digital transformations of all time.

“India Stack” is a three-layer network: identity, payments, and data sharing. While the data sharing layer is still in the early stages of its rollout, the identity system is responsible for bringing hundreds of millions of individuals into the banking system. To date, this has grown the percentage of adults with bank accounts to 80%! Progress that might otherwise have taken decades has happened in 9 years.

For the data sharing system, they envision a time when consumers will be able to provision their data to a new bank or service provider for a limited time to make a decision, after which access will be revoked. “The Internet Country” by Aaryaman Vir and Rahul Sanghi gives an in-depth view of how this came about and of the future of the system.
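The idea of provisioning data for a limited time and then revoking access can be sketched in a few lines. This is an illustration only and not how India Stack is actually implemented: the `AccessGrant` class, the `grant_access` helper, and the provider and dataset names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AccessGrant:
    """A consumer's consent for one provider to read one dataset (hypothetical)."""
    provider: str
    dataset: str
    expires_at: datetime
    revoked: bool = False

    def is_active(self, now: datetime) -> bool:
        # Access holds only until expiry and only if the owner has not revoked it.
        return not self.revoked and now < self.expires_at


def grant_access(provider: str, dataset: str, now: datetime, ttl: timedelta) -> AccessGrant:
    """Create a grant that expires `ttl` after `now`."""
    return AccessGrant(provider, dataset, expires_at=now + ttl)


now = datetime(2021, 11, 11, tzinfo=timezone.utc)
grant = grant_access("new_bank", "bank_statements", now, ttl=timedelta(hours=24))

print(grant.is_active(now + timedelta(hours=1)))   # inside the 24-hour window
print(grant.is_active(now + timedelta(hours=25)))  # after the window has closed
```

The point of the design is that the provider never holds a permanent copy of the permission: every read has to pass the `is_active` check, so expiry and revocation take effect without the provider's cooperation.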

While the original instantiation of India Stack was not based on blockchains, the government is now developing a strategy to do just that. To do this at a global scale would likely require a decentralized solution from the beginning.

India Stack is first and foremost a financial platform rather than a new data economy, but it is a clear demonstration of how intrinsically linked these concepts are and foreshadows the new wave of applications that may be built as DeFi becomes more widely adopted. This real world example also sets a precedent for learning to “rent” data, rather than storing and owning everything — representing a fundamental shift in development architectures.

So where are we today when it comes to SoDa?

SoDa Landscape

Inspired by Matt Turck’s data landscapes, whose annual releases serve as a visual survey for the growth and evolution of the data industry, I’ve compiled a landscape for the organizations, technologies and products that I believe comprise SoDa today. Inclusion is not based on the use of data alone, nor does data need to be the only product or value proposition. Instead, SoDa organizations (in addition to being a blockchain-based technology) fall into one or many of the following groups:

Tooling essential for the use and collection of data in Web3 applications.
Protocols for facilitating the portability of user data between applications.
Blockchains which make data privacy a primary differentiator.
Applications which utilize data beyond blockchain metrics (token prices, etc.).
Applications focused on the monetization of data enabled by blockchain tokenization.

Landscape Notes: 1. Some organizations could be placed in multiple locations based on their products/features; I have chosen where I believe they are strongest. 2. Identity could be a wide landscape in itself; these are a selection of projects which I believe have promise or a focus on data in particular.

Today SoDa has its roots in DeFi, and therefore the largest players (e.g., Chainlink) concentrate on tooling and infrastructure for it. But we are already beginning to see a shift. Insurance applications that were originally targeted at fellow DeFi applications have grown to include more traditional lines like weather (Arbol) or travel (Koala). With the launch of projects like Sign in with Ethereum from Spruce, users will begin to keep their data in wallets rather than storing it in-application. This will allow users to monetize their data through data unions, or to get improved prices and services in the application layer, without needing technical skills. Similarly, while most of the data being put on oracle networks today is DeFi-oriented, demand for traditional third-party data companies to provide more “real world” data (weather, traffic, commerce, movement) will grow and open new revenue streams for these companies. Infrastructure like Helium will also allow low-cost IoT devices to help collect data and transact with each other directly on chain.

Dataverse: So when moon?

We are still extremely early, and the infrastructure in this space feels like the internet of the ’90s. Network effects are difficult to break, and getting users to adopt Web3 social platforms or e-commerce won’t be easy. Yet as consumers feel what it is like to have a stake in the networks they use, my intuition is that progress will compound rapidly. With some of the largest companies in the world reliant on captive data for their moats, there will be no shortage of companies competing to change the game.

The “dataverse” is a big, audacious goal, and our current patterns of thinking lead us to believe that big problems lead to big companies. Even Stephenson conceived of it as being controlled by a centralized entity: CIC was the CIA merged with the Library of Congress. But just as few of us believe that Meta will control the metaverse, the dataverse will emerge not from a centralized roadmap but from an ecosystem of projects. It is unlikely to manifest in the way we imagine or to be a one-for-one replacement of a company today; it will be something brand new, and that’s what’s most exciting.

Wagmi!

I hope this is a useful introduction to the world of the Web3 data economy. If you’re interested in the space, have projects that we should add to the landscape, or would just like to connect, you can find me on Twitter (@craig_danton).

Thank you to @I_F_H and @Chris_AA for the help getting this out there.

Data Buying: What is Data Really Worth?

craig-danton@newsletter.paragraph.com (Craig Danton) — Thu, 11 Nov 2021 00:17:57 GMT

This was originally published on Medium 6/1/2020

In May 2017, the cover of The Economist pronounced data “The World’s Most Valuable Resource”. Companies the world over agreed, spending more than $45B with external data companies in 2018 alone. Yet unlike oil or other commodities, there are few ways to value data. Instead, data buyers are at the mercy of opaque claims made by data vendors and find themselves overspending on inferior sources, with little clue of how to calculate an ROI. I collect, buy, and sell data professionally, and it shocks me that an industry built on the so-called value of analytics offers so little in the way of a quantitative measure of value.

Google, Amazon, and Facebook dominate their market segments and in doing so have built an unrivaled data asset, representative not just of their customers but arguably the market as a whole. Unlike these data monopolies, most companies instead look to “external data” to augment their knowledge about their customers and compete. Yet the starting point of what to buy and how much to spend remains more of a dark art than a science.

Most of the data industry is used to buying and selling data at the “database level”: think one or multiple tables packaged together. High-quality data sources range from thousands to hundreds of thousands of dollars or more. But despite these substantial price tags, most customers complain that they use less than 20% of the data they buy. Most of the data they buy either is irrelevant to their domain, lacks signal (low information), or can’t be fully leveraged due to limitations in technology.

Putting aside the cost implications, in my experience more sophisticated buyers such as hedge funds or ad tech companies will shop for data at the “column level”. For example, rather than looking for a whole database of consumer attributes, they will look for one attribute (or column) that answers their precise need, such as the likelihood of purchasing a good online in the next 30 days. With a limited amount of analytical evaluation, we can then piece together the “best of” columns from multiple vendors and build a superior dataset.
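As a rough sketch of that “best of” assembly, suppose each vendor’s version of an attribute has already been scored against a labeled holdout set. The attribute names, vendor names, scores, and the `pick_best_columns` helper below are all made up for illustration; how the scores are produced is the real work and is not shown.

```python
# Hypothetical evaluation scores (for example, accuracy on a labeled holdout
# set) for each vendor's version of each attribute. All values are made up.
scores = {
    "purchase_likelihood_30d": {"vendor_a": 0.71, "vendor_b": 0.78},
    "household_income_band": {"vendor_a": 0.83, "vendor_c": 0.80},
}


def pick_best_columns(scores: dict) -> dict:
    """Return {attribute: vendor}, choosing the top-scoring vendor per attribute."""
    return {
        attribute: max(by_vendor, key=by_vendor.get)
        for attribute, by_vendor in scores.items()
    }


best = pick_best_columns(scores)
print(best)
```

The resulting dataset draws one column from each winning vendor, which is only possible if the vendors’ records can be joined on a shared identifier.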

What if we take this one step further? **APIs allow us to buy data one row or even one cell at a time**, but they do not account for the difference in value between these cells. It is safe to assume that in a free market each cell wouldn’t be equally valuable. But where the data industry is today, we are far from being able to put a precise value on a set of data, let alone a cell. If we are going to challenge the dominance of data monopolies and start to allow for data marketplaces where sellers looking to monetize data assets can meet customers looking to find the next attribute, we must first establish a fair price. The sections below introduce a framework for how I think about valuing data and explore some of its strengths and limitations in a real-world use case.

Information is inherently valuable, data is not.

While there is a lot of talk about the size of a dataset, its granularity, or its refresh rate, all of these qualities are just a piece of the puzzle to determine its value. Instead, I believe the value of a data source is proportional to 5 main properties:

Value ∝ F(I, N, T, U) - C

I: the amount of information (or insight) about an entity (event, person, place, company, etc.),

N: the number of applicable entities this information pertains to,

T: the length of the predictive time horizon,

U: the uniqueness of this data source,

C: the cost of acquiring the data, which reduces the total value.

Let’s break each of these down a little further using Weather Data as an example.

(I) It’s pretty convenient to say that the value of a dataset is proportional to the information within it, but what is information and how can you measure it? Claude Shannon, an American mathematician and the “father of information theory”, put forth a concept of “information entropy” in 1948 that serves as a starting point. It proposes that a data source that predicts a low-probability event produces more information than one that predicts a high-probability event. Put another way, information is the degree of “surprise” you have when you see it (Bishop). In terms of weather, a dataset that predicts a 90F sunny day in the middle of January in NYC (pretty surprising!) is providing more “information” than one that is predicting a cold cloudy day.
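Shannon’s measure of surprise for a single event is its information content, -log2(p), measured in bits. A minimal sketch follows; the two probabilities are made-up illustrative values, not real climate statistics.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Shannon information content of an event, in bits: -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Made-up probabilities for a January day in NYC, for illustration only.
p_hot_sunny = 0.001
p_cold_cloudy = 0.5

print(f"90F and sunny:   {surprisal_bits(p_hot_sunny):.2f} bits")
print(f"cold and cloudy: {surprisal_bits(p_cold_cloudy):.2f} bits")
```

Under these assumed probabilities the rare forecast carries roughly ten times as many bits as the common one, which is the intuition behind weighting a data source by how surprising its correct predictions are.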

(N) Next, we consider the number of “entities” — in this case, locations in the world — that the data source is relevant to. A data source accurately forecasting one location might be extremely valuable for someone getting married there but to most of us, it has limited value compared to a data source predicting the weather the world over.

(T) How far in advance does this source have predictive value? Our ability to react, strategize, and implement a plan to capitalize on information is tied to how much notice we have. For instance, moving your wedding to a sunnier location two days before is pretty infeasible, versus, say, 12 months before. To measure this, we generally conduct a “backtest” (more on this below) by looking at copies of data from the past to see if they contained predictive information about events that we now know have come to pass.

(U) How rare is this source of information? At this stage predicting weather 10 days or so in advance has become commoditized and so to command a high value, a new data source would have to be a significant improvement on this.

(C) How expensive is this source? Some sources of data can be purchased, others are collected by legions of people or users, and many need to be mined by countless computers. Each of these has a measurable cost. In the case of our new weather source, it may need to be collected by expensive land stations. We will largely ignore cost going forward, as it is far easier for most data users to measure and should simply be subtracted from the value the data creates.

OK, so this is a framework for considering the relative value of a data source, but it falls short of putting a precise dollar value on it. To go that far, we need a solid understanding of how much value we create from novel information or, in other words, how efficient our business is at converting insight into dollars. For many of us, this is an even harder question to answer, and the lack of an answer is probably a reason why data is still dramatically underutilized today. If we can’t measure and monetize insight, why would we mine data for it?

Why hedge funds generate dollars from information and we can’t

Over 78% of hedge funds use external or “alternative” data to trade. The ability to use data has become a key competitive differentiator for funds, due in part to the fact that a data source’s ROI can be quantitatively measured and tested.

For the purpose of this greatly oversimplified example, let’s imagine a use case of buying and selling US stocks (a long/short U.S. equities strategy). The goal is to predict what will be on the earnings report of a public company as early as possible and then take a position in this company that reflects our belief. Let’s also assume that it is relatively trivial to calculate the impact of information (such as sales or profitability) on the price per share using a valuation model.

When a hedge fund evaluates a potential new data source, it conducts a process of “backtesting”, whereby the models used to calculate the value of the company are re-run for historical dates over a period of several years. The amount by which the data actually improved the model (i.e., how closely it predicted the future value of the company at earnings time) then gives the fund a quantified upper bound on the potential information entropy (I) contained in the source. These backtests can then be conducted on each of the entities of interest (N), in this case any of the approximately 3,500 stocks listed in the US or categorical ETFs. If the fund then factors in the size of the investment it could have made in the stock (a factor related to the size of the company, the volumes it trades in, and the risk tolerance of the fund), it can approach a precise figure for how much money could have been made if it had had this source at some time in the past. Information tends to become more available as you get closer to earnings and things become “priced in” to the stock price, so the amount of time (T) in advance of the earnings date and the rareness (U) of the information, meaning how few institutions have it, also affect how much this upper bound needs to be revised down to an actual value. While there are many assumptions throughout that could lead a hedge fund astray, the approach above gives it a value range that it can then use to decide whether to acquire this data source. YES, we did it!
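The arithmetic of that evaluation can be sketched as follows. This is a toy illustration, not a description of any fund’s actual method: the post leaves the function F unspecified, so the multiplicative discounts for T and U, the `BacktestResult` fields, the tickers, and every number here are assumptions chosen only to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class BacktestResult:
    """Outcome of re-running a valuation model on one stock (hypothetical fields)."""
    ticker: str
    error_without: float      # forecast error of the model without the new data
    error_with: float         # forecast error of the same model with the new data
    dollars_per_point: float  # assumed profit per unit of error removed, given position size


def upper_bound_value(results: list) -> float:
    """Sum the hypothetical profit from error reduction (I) across all entities (N)."""
    total = 0.0
    for r in results:
        # A source that makes the model worse contributes nothing, not a negative.
        improvement = max(r.error_without - r.error_with, 0.0)
        total += improvement * r.dollars_per_point
    return total


def estimated_value(results: list, timing_discount: float,
                    uniqueness_discount: float, cost: float) -> float:
    """Revise the upper bound down for T and U (assumed multiplicative), then subtract C."""
    return upper_bound_value(results) * timing_discount * uniqueness_discount - cost


# Made-up backtest results for two fictional tickers.
results = [
    BacktestResult("AAA", error_without=0.10, error_with=0.06, dollars_per_point=1_000_000),
    BacktestResult("BBB", error_without=0.08, error_with=0.09, dollars_per_point=500_000),
]
upper = upper_bound_value(results)
value = estimated_value(results, timing_discount=0.5, uniqueness_discount=0.8, cost=10_000)
print(f"upper bound: ${upper:,.0f}, estimated value: ${value:,.0f}")
```

Even in this toy form, the structure shows why the exercise is tractable for a fund: every input is either observed in the backtest or a discount the fund can argue about explicitly.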

However, there are a few pieces of market infrastructure that further simplify the measurement of I, N, T, and U in the example above and that make quantitative trading quite unusual. If we are going to apply this framework to other industries, we will need to proactively seek approximations in our own use cases that help us put precise values where one is not provided by a market.

At Enigma (my current company) we cover Small and Medium Businesses (SMBs), which present their own unique challenges. In italics below each example are some of the solutions we considered to adjust for shortcomings in our domain in comparison to the more mature equities market.

1. Stock Prices & Benchmarks: A transparent, real-time, and agreed-upon price for a good and the ability to transact at that price. This creates the opportunity to value information (I) precisely in comparison to benchmarks, which provide a consensus for value.

*There is no stock market for SMBs to agree upon a fair price, nor audited financials to find the truth. At Enigma we decided that building a “golden record” by hand, based on empirical observation, was the only way to get this baseline.*

2. Finite Equities List: A defined universe of entities that we care about, each with unique tickers or identifiers. That helps us size and name the applicable number of entities (N) and their comparative value.

*There are over 30 million businesses in the U.S., many more than in the stock example above, and they don’t have tickers. At Enigma we invested heavily in building an ID for each business to allow them to be identified. Auren Hoffman, in his SIMPLE acronym, explains the importance of these linking keys to data products in general.*

3. Quarterly Public Earnings Calls: Defined times when companies share the “truth” about their earnings. This provides the necessary historical data to predict against, and distinct time periods over which we are interested in predicting (T).

*SMBs apply for loans or services at random intervals throughout their lifecycle, and are dramatically more prone to episodes of exponential growth or decline than mature public businesses. To counteract these effects, Enigma focused on reducing the lag on each of its data sources and recalculating its estimates every week rather than quarterly to build the best view of a business at each point in time.*

While we are still short of putting a price on a “cell” of data, we have taken strides that allow us to make smart decisions about the value of each source we buy, build, or acquire, and we come closer each day to a true ROI on data spend. As I alluded to at the beginning of this post, the largest and fastest-growing companies in the world design their products to capture data about their users; they then use this data to reinforce their market position by creating responsive, intelligent experiences for their customers. I believe that if we can begin to collaborate and share data across industries in a manner that is privacy-centric and secure, companies that lack the scale of FAANG can build experiences that compete, while businesses that collect data as the exhaust of their existing business lines may find alternative revenue streams. But all of this begins with valuing this precious and opaque asset.

If you found this framework useful, have others to share that I should look at, or would like help thinking through how to approach your own data challenges, I would love to hear about them. You can reach me on Twitter @craig_danton or leave a comment.