The deception that lurks in our data-driven world

By Alexis Madrigal | October 6, 2015 | 4:24pm

I start each day with a lie.

I get up, walk into the bathroom, and weigh myself. The data streams from the Chinese scale to an app on my phone and into an Apple data server, my permanent record in the cloud.

I started this ritual because I thought it would keep me honest. It would keep me from deluding myself into thinking my clothes didn’t fit because of an overzealous dryer rather than beer and cheese. The data would be real and fixed in a way that my subjective evaluations are not. The scale could not lie.

And of course, the number that shows on the scale isn’t, technically, a lie. It is my exact weight at that exact moment. If I were an ingredient in a cake recipe or cargo for a rocketship, this is the number you’d want to believe.

But one thing you learn weighing yourself a lot—or wrestling in high school—is that one’s weight, this number that determines whether you’re normal or obese, skinny or fat, is susceptible to manipulation. (This is the warning embedded in the pithy title of NYU professor Lisa Gitelman’s 2013 book: “Raw Data Is An Oxymoron.”)

If I want to weigh in light, I go running and sweat out some water before getting on the scale. If I’m worried that my fitness resolve is slipping and I need to scare myself back into healthy eating, I’ll weigh myself a little later, after some food and plenty of water, and watch my weight spike upwards.

Sure, the difference in all of these measurements is only plus or minus five pounds, but for someone with my own psychology—and maybe some of you—those differences are enough to make me this guy.

Or this guy.

You might like to think that this is just one man’s data deception. That the data out in the rest of the world, like the stuff that gets published in science journals, is less susceptible to human manipulation.

But then you see studies like the one that recently came out in Science, America’s leading scientific journal, that subjected 100 supposedly high-quality psychology papers to a large-scale replication study. When new research groups replicated the experiments in the papers to see if they’d get the same results, they were only able to do so 36% of the time. Almost two-thirds of the papers’ effects couldn’t be replicated by other careful, professional researchers.

“This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know,” concluded the authors of the Science paper.

In many fields of research right now, scientists collect data until they see a pattern that appears statistically significant, and then they use that tightly selected data to publish a paper. Critics have come to call this p-hacking, and the practice uses a quiver of little methodological tricks that can inflate the statistical significance of a finding. As enumerated by one research group, the tricks can include:

“conducting analyses midway through experiments to decide whether to continue collecting data,”
“recording many response variables and deciding which to report postanalysis,”
“deciding whether to include or drop outliers postanalyses,”
“excluding, combining, or splitting treatment groups postanalysis,”
“including or excluding covariates postanalysis,”
“and stopping data exploration if an analysis yields a significant p-value.”
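
The first trick on that list, checking the results midway and stopping once they look significant, is easy to simulate. What follows is a minimal sketch, not the method of any study mentioned here: the function names, the batch size of 10, the cap of 100 samples, and the 0.05 threshold are all assumptions made for illustration. Every data point is pure noise, so any "significant" result is a false positive.

```python
import math
import random


def two_sided_p(samples):
    """p-value of a z-test for mean 0, assuming known unit variance."""
    n = len(samples)
    z = sum(samples) / math.sqrt(n)  # equals mean * sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def run_experiment(rng, peek, start=10, step=10, max_n=100, alpha=0.05):
    """Return True if the experiment ends with a 'significant' result.

    There is no real effect: every sample is drawn from N(0, 1).
    All default parameters are illustrative assumptions.
    """
    data = [rng.gauss(0, 1) for _ in range(start)]
    while True:
        # The p-hacker tests after every batch and stops on the first "hit".
        if peek and two_sided_p(data) < alpha:
            return True
        # The honest researcher tests exactly once, at the planned size.
        if len(data) >= max_n:
            return two_sided_p(data) < alpha
        data.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(peek, trials=2000, seed=0):
    """Fraction of no-effect experiments that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(run_experiment(rng, peek) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(peek=False))
    print("peeking every 10: ", false_positive_rate(peek=True))
```

Under these assumptions, the experimenter who peeks should report a false positive noticeably more often than the one who fixes the sample size in advance, even though neither has anything real to find.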

Add it all up, and you have a significant problem in the way our society produces knowledge.

When fed into the Facebook-driven digital media ecosystem, a small p-hacked study can reverberate across the world without much skepticism. An average person scrolling through a newsfeed won’t realize that much of the shit that “science” or “a study” says wouldn’t hold up on closer examination, even if it was published in a journal.

And that’s the professional science! It’s to say nothing of all the data-driven decision-making that’s happening in business right now.

There’s this amazing book called Seeing Like a State, which shows how governments and other big institutions try to reduce the vast complexity of the world into a series of statistics that their leaders use to try to comprehend what’s happening.

The author, James C. Scott, opens the book with an extended anecdote about the Normalbaum. In the second half of the 18th century, Prussian rulers wanted to know how many “natural resources” they had in the tangled woods of the country. So, they started counting. And they came up with these huge tables that would let them calculate how many board-feet of wood they could pull from a given plot of forest. All the rest of the forest, everything it did for the people and the animals and the general ecology of the place, was discarded from the analysis.

But the world proved too unruly. Their data wasn’t perfect. So they started creating new forests, the Normalbaum, planting all the trees at the same time, and monoculturing them so that there were no trees in the forest that couldn’t be monetized for wood. “The fact is that forest science and geometry, backed by state power, had the capacity to transform the real, diverse, and chaotic old-growth forest into a new, more uniform forest that closely resembled the administrative grid of its techniques,” Scott wrote.

The spreadsheet became the world! They even planted the trees in rows, like a grid.

German foresters got very scientific with their fertilizer applications and management practices. And the scheme really worked—at least for a hundred years. Pretty much everyone across the world adopted their methods.

Then the forests started dying.

“In the German case, the negative biological and ultimately commercial consequences of the stripped-down forest became painfully obvious only after the second rotation of conifers had been planted,” Scott wrote.

The complex ecosystem that underpinned the growth of these trees through generations, with all its microbial and inter-species relationships, was torn apart by the rigor of the Normalbaum. The nutrient cycles were broken. Resilience was lost. The hidden underpinnings of the world were revealed only when they were gone. The Germans, like they do, came up with a new word for what happened: Waldsterben, or forest death.

Sometimes, when I look out at our world—at the highest level—in which thin data have come to stand in for huge complex systems of human and biological relationships, I wonder if we’re currently deep in the Normalbaum phase of things, awaiting the moment when Waldsterben sets in.

Take the ad-supported digital media ecosystem. The idea is brilliant: capture data on people all over the web and then use what you know to show them relevant ads, ads they want to see. Not only that, but because it’s all tracked, unlike broadcast or print media, an advertiser can measure what they’re getting more precisely. And certainly the digital advertising market has grown, taking share from most other forms of media. The spreadsheet makes a ton of sense—which is one reason for the growth predictions that underpin the massive valuations of new media companies.

But scratch the surface, like Businessweek recently did, and the problems are obvious. A large percentage of the traffic to many stories and videos consists of software pretending to be human.

“The art is making the fake traffic look real, often by sprucing up websites with just enough content to make them appear authentic,” Businessweek says. “Programmatic ad-buying systems don’t necessarily differentiate between real users and bots, or between websites with fresh, original work, and Potemkin sites camouflaged with stock photos and cut-and-paste articles.”

Of course, that’s not what high-end media players are doing. But the cheap programmatic ads, fueled by fake traffic, drive down the prices across the digital media industry, making it harder to support good journalism. Meanwhile, users of many sites are rebelling against the business model by installing ad blockers.

The advertisers and ad-tech firms just wanted to capture user data to show them relevant ads. They just wanted to measure their ads more effectively. But placed into the real world, the system that grew up around these desires has reshaped the media landscape in unpredictable ways.

We’ve deceived ourselves into thinking data is a camera, but it’s really an engine. Capturing data about something changes the way that something works. Even the mere collection of stats is not a neutral act, but a way of reshaping the thing itself.

That is to say, I don’t weigh myself to know my weight, I weigh myself to change my mind. And it’s always a useful lie.