A few weeks ago, Chris Anderson, editor of Wired and author of The Long Tail, predicted the end of the scientific method (“hypothesize-model-test”). He argued that, with data analysis and the advent of the petabyte age, in which it is possible to store and analyze an unimaginable quantity of data, always-false models will be replaced with data crunching. “With enough data, the numbers speak for themselves,” he confidently asserted.
While acknowledging that correlation was, hitherto, not causation, he nevertheless claimed, on the basis of the gene-sequencing experience, that:
Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
The comments below the article and elsewhere on the web (here in French, there in English, for instance) are much more interesting than the post itself, whose main interest is to warn against the drift toward overconfidence in data analysis. The discussions of the problem of inductivism, the true meaning of correlation, the context and applications of models, etc., are quite interesting, especially when they resurrect Friedman’s idea that prediction is what matters or display overtones of the Koopmans-Vining debate.
Chris Anderson was trained in physics, his biography says. I wonder whether such an article could have been written by an economist, if only because of the economist’s approach to data.
“Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age,”
So Anderson begins. But it seems to me that economists are not even done with the first step, “mak[ing] information readable.” The history of econometrics can be read as a long and difficult quest to cope with missing data, outliers, spurious correlation, endogeneity, etc. And the economics articles I most admire today are those in which shrewd proxies for missing data sets are designed, such as the (controversial) use of streams as a proxy for the number of school districts to test the benefits of public school choice, or the use of nineteenth-century Prussian census data to test the Weberian relation between Protestantism and prosperity (thanks, Mathieu).