By Kyle Johnson
These are just two of many examples of spurious correlations floating around in the media and research journals. Since the rise of big data, it has been easier than ever for researchers to test huge data sets and find correlated variables. In many cases, however, these correlations are meaningless: the product of confounding factors or simple chance.
Professor Mark White, who teaches in the International Economics department at SAIS, recently spoke with the SAIS Review about these implications of big data.
The biggest problem, White said, is that the rise of big data has drastically increased the “number of false positives in the literature” without producing a corresponding increase in re-testing or retractions. Researchers make their mark by finding and publishing correlations, not by reporting their absence.
“There’s not much of an incentive for professional researchers in any field to go out there and spend their time disproving other people’s ideas,” he said.
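The false-positive problem White describes follows directly from the statistics of multiple testing: at a conventional 5% significance level, roughly one in twenty truly unrelated pairs of variables will still test as “significant” by chance. The sketch below (an illustration, not from the interview) generates 40 completely independent random series and tests every pair for correlation; the parameters and the 0.197 significance cutoff are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_obs = 40, 100
# 40 independent random series -- no real relationships exist among them
data = rng.standard_normal((n_vars, n_obs))

r = np.corrcoef(data)                # 40x40 matrix of pairwise correlations
iu = np.triu_indices(n_vars, k=1)    # indices of the 780 unique pairs
pairs = r[iu]

# |r| > ~0.197 corresponds to p < 0.05 (two-sided) with n = 100 observations
significant = int(np.sum(np.abs(pairs) > 0.197))
print(pairs.size)    # 780 pairwise tests
print(significant)   # expect roughly 0.05 * 780 = ~39 spurious "hits"
```

Even though every series is pure noise, a few dozen pairs clear the usual significance bar, which is exactly the kind of result that, published without re-testing, becomes a false positive in the literature.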
Nate Silver’s The Signal and the Noise illustrates how greater access to data can lead researchers astray: the U.S. government publishes roughly 45,000 economic statistics. Testing every unique pair of these statistics yields roughly one billion hypotheses.
“But the number of meaningful relationships in the data – those that speak to causality rather than correlation and testify to how the world really works – is orders of magnitude smaller… There isn’t any more truth in the world than there was before the Internet or the printing press. Most of the data is just noise, as most of the universe is filled with empty space.” [The Signal and the Noise, pp. 249-250]
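Silver’s back-of-the-envelope figure can be checked directly. The number of unique pairs among n items is n(n − 1)/2, which Python’s standard library computes with `math.comb`:

```python
from math import comb

n = 45_000  # economic statistics published by the U.S. government
# Unique pairs: n * (n - 1) / 2
print(comb(n, 2))  # 1_012_477_500 -- roughly one billion hypotheses
```

At a 5% significance level, chance alone would flag on the order of fifty million of those billion pairs as “significant,” dwarfing the far smaller number of genuinely meaningful relationships Silver describes.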
With so much “noise” out there, what should consumers of statistical research do? First, White said, they should bring a “healthy skepticism” to any result.
Where did the data come from? If it is based on subjective criteria, check that the researcher’s methodology is appropriate. Is the hypothesis re-testable? If the population is small, there may be no other data available to confirm or reject the researcher’s hypothesis.
In particular, the source of the research should be closely examined, White said.
An example of good research illustrates this point further. Scientists hypothesized that the Higgs boson existed and designed an experiment (in this case, building the Large Hadron Collider) to test their idea. The testing confirmed the hypothesis.
“If I do that and I find that the data supports the idea, I’m a long ways towards establishing causality because I started with causality in mind and presumably I designed my experiment to pick that up,” White said. “I’m a lot closer to establishing causality there than if I do a correlation or if I use self-reported data.”
Data mining to discover correlations – such as the link between a country’s chocolate consumption and its number of Nobel laureates – is an example of bad research. Here the problem is compounded by the fact that the theory cannot be re-tested on a different data set.
“It’s not like the correlation is not there, because it is. People will find it repeatedly now that they’re looking for it, because you can’t change the historical chocolate consumption and you can’t change the number of prizewinners, but it means nothing.”
Finally, readers should always keep in mind the pitfalls of “correlative thinking.” Instead of assuming that correlation implies causation – to return to the first example, that safer playground equipment makes children more obese – readers should simply note that there is a relationship between the two variables.
“The important question is not how correlated a relationship is, but why individual data points are not on the trend line,” White said.
Kyle Johnson is a second-year MA candidate at SAIS, concentrating in Korean Studies and International Economics. He is an assistant editor at the SAIS Review.