Exploring algorithmic reputation and governance. replabs.xyz

Subscribe to Oliver Klingefjord

"Data is the new oil" is an old saying at this point, referring to how the effectiveness of machine learning models scales with the size of their training data. But the analogy is severely misguided. It views data as a raw resource that can be extracted, traded and used indiscriminately. It presumes that data is, in its original form, free from bias and distortion. But data is not a resource; it is an observation. Any datapoint is a small descriptor of reality, and just as there is no "raw phenomenon", only observations of a phenomenon, there is no raw data.

Data is never "raw"; it is inherently biased, just as "facts" are. For each datapoint you capture, you have to decide what to include and what to exclude. That decision cannot be unbiased unless you choose to include everything everywhere all at once, or nothing.
If you give a user a choice between hotdogs, hamburgers and pizza, and the user chooses pizza, does that mean the user likes pizza? Or that the user thought pizza was the lesser of three evils? Or that they just clicked the first option they saw without thinking? Perhaps they would greatly prefer a healthy salad but had no way to express that desire given the menu. The "pizza-like" is thus a biased datapoint and should be treated accordingly.
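The point can be made concrete with a small sketch. The utilities below are invented for illustration: the same observed "pizza" choice is produced whether or not pizza is the user's actual favorite, because the menu never offered the favorite at all.

```python
# Hypothetical per-user utilities, invented for illustration.
true_utilities = {"salad": 0.9, "pizza": 0.6, "hamburger": 0.4, "hotdog": 0.2}

# The menu the user was actually shown - salad was never an option.
menu = ["hotdog", "hamburger", "pizza"]

# The "revealed preference" is just the best option within the menu...
observed_choice = max(menu, key=true_utilities.get)

# ...which differs from the user's true favorite over all options.
actual_favorite = max(true_utilities, key=true_utilities.get)

print(observed_choice)   # pizza
print(actual_favorite)   # salad
```

The observed choice is a function of both the preference and the menu; logging only the choice silently discards the menu's influence.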
In society at large, we tend to over-capture and over-value these kinds of "revealed preferences". For good reason: it is easy to track your behavior as you browse the web, buy products and like videos. It is hard to retroactively ask, en masse, which articles you were glad you read, which products brought you joy, and which videos you would recommend to a friend.
One way to do so is through surveys. Sometimes survey responses are combined with behavioral data, and models are trained to predict the responses from the behavior. I have heard surveys criticized on the grounds that users lie when asked about their behavior, whereas the behavior itself is hard truth. But given how much choice is constrained by the options on offer, one might ask: how much of a "ground truth" is revealed preference, really?
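A minimal sketch of that survey-prediction setup, with entirely synthetic data and invented feature names: behavioral logs become features, the survey answer becomes the label, and a simple classifier learns how far behavior alone explains the stated response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented behavioral features per user: [watch_time, clicks, likes].
X = rng.normal(size=(200, 3))

# Invented survey label: "were you glad you watched this?" (1 = yes).
# Here the answer depends mostly on watch_time, weakly on clicks, and
# not at all on likes - behavior only partially explains it, plus noise.
true_w = np.array([1.5, 0.3, 0.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Fit a logistic regression by gradient descent to predict survey answers.
w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))          # predicted P(glad)
    w -= 0.1 * X.T @ (p - y) / len(y)       # average gradient step

accuracy = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

The residual error is the interesting part: it is the slice of the stated preference that the behavioral log, however exhaustive, never captured.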
Of course, there is no "ground truth" about behavior just as there is no ground truth of physical reality (unless you are an adamant defender of the classical model of physics). Capturing any datapoint, or making any observation, necessitates taking a stand about what is relevant to capture. Being a "passive observer" is an oxymoron.