Big data: correlation is no substitute for causality - By visiting Prof. Nick Bailey
Publication date 21-05-2015
‘Big data’ has become one of those buzzwords we can’t escape in the urban field these days. There is a whole new industry of big data analytics companies, selling their services to cities that want to be seen as ‘smart’, promising insights that will lead to more efficient or effective services. In academia, the terms ‘big data’ and ‘smart cities’ are dominating funding calls and research initiatives.
One of the claims of the ‘big data’ movement is that its techniques are based on a new paradigm for research. Using approaches which are unfamiliar to most social scientists, its practitioners apply automated analytical strategies (machine learning) to vast quantities of data from diverse sources. These approaches are said to succeed where traditional approaches would be overwhelmed by the volume and complexity of the data. And they do so by the application of brute computing power (or, rather, sophisticated programming), instead of relying on theory to target limited research resources: “With enough data, the numbers speak for themselves” as one well-known comment has put it (Anderson 2008).
The numbers ‘speak’ through the patterns of correlations which these automated analytics reveal. The new techniques, it is claimed, have moved beyond the need to worry about theory or causality. The sheer volume of data is a sufficient guarantee of the importance of the relationships. It is true that simple correlations are enough for some purposes. One commonly cited illustration is the work in New York to understand the links between building subdivisions and fire risks. The city authorities had limited resources and inspectors were frequently called out to premises where they found minimal risks. The data scientists built a model to predict which callouts were likely to be associated with a significant fire risk so they could target the inspection resources more effectively. And they seem to have succeeded (Mayer-Schönberger and Cukier 2013).
For many other purposes, however, correlations are not enough. Consider another widely-cited case study – a hospital seeking to reduce readmission rates, also discussed by Mayer-Schönberger and Cukier (2013). Data scientists identified an unexpected risk factor for readmission that clinicians had supposedly overlooked – depression. On the basis of this study of correlations, the hospital introduced a policy of screening patients for depression and offering additional counselling to those with symptoms, and readmission rates fell as a result.
Before we chalk this up as another success for the big data approach, however, let us unpack this a bit further. It is true that, at the stage of the data scientists’ analysis, the study is simply based on correlations – which factor or factors co-occur with readmission. Once the hospital moves to intervention, however, it shifts to being a claim about causation: counselling is introduced on the assumption that the relationship is causal – otherwise, it would be pointless. And that is what it turns out to be – celebrations all round. But it is also possible that depression could have turned out not to be the causal factor. For example, it might have been that the causal factor in readmissions was poor housing. This could cause both a higher incidence of readmission, because it places greater stress on subjects’ physical health, and a higher incidence of depression. In this case, depression would still have been correlated with readmission, but counselling would have had no benefit. The insight from big data was only useful because it was subsequently shown to be causal.
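The poor-housing scenario can be made concrete with a small simulation. What follows is only an illustrative sketch: the probabilities are invented for the purpose, not taken from the hospital study, and the `simulate` function and its `counselling` switch are hypothetical names. It also assumes, generously, that counselling relieves depression completely. In this made-up world, poor housing raises the chance of both depression and readmission, while depression itself has no effect on readmission.

```python
import random


def simulate(n=100_000, counselling=False, seed=1):
    """Simulate n patients in a world where poor housing is the only cause
    of readmission. All probabilities are invented for illustration."""
    rng = random.Random(seed)
    # screened as depressed? -> [number of patients, number of readmissions]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        poor_housing = rng.random() < 0.3
        # Poor housing makes depression more likely...
        depressed = rng.random() < (0.5 if poor_housing else 0.1)
        screened_depressed = depressed
        if counselling:
            # Assumption: counselling fully relieves depression.
            depressed = False
        # ...and makes readmission more likely. Note that `depressed`
        # plays no part in this line: it is not a cause of readmission.
        readmitted = rng.random() < (0.4 if poor_housing else 0.1)
        counts[screened_depressed][0] += 1
        counts[screened_depressed][1] += readmitted
    return {
        "depressed": counts[True][1] / counts[True][0],
        "not_depressed": counts[False][1] / counts[False][0],
        "overall": (counts[True][1] + counts[False][1]) / n,
    }


before = simulate(counselling=False)
after = simulate(counselling=True)

print("Readmission rate, depressed patients:    ", round(before["depressed"], 3))
print("Readmission rate, non-depressed patients:", round(before["not_depressed"], 3))
print("Overall rate without counselling:        ", round(before["overall"], 3))
print("Overall rate with counselling:           ", round(after["overall"], 3))
```

An analyst looking only at the first two figures would see depression as a strong predictor of readmission, and as a predictor it is genuine. The last two figures show why that is not enough to justify the intervention: because both runs use the same random seed, the overall readmission rate is identical with and without counselling.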
We should not dismiss big data approaches. They have much to offer the social sciences. They offer us a real challenge to try to exploit types of data that we are not used to working with, and to learn from their sophisticated analytical techniques. But social scientists need to ensure that concerns about causality are kept at the centre of the picture, and that in turn requires the traditional social science expertise of good theory and good design.
Nick Bailey is Professor in Urban Studies, based in the School of Social & Political Sciences at the University of Glasgow. He has recently provided a lecture on the ‘fiscal crisis’ and the consequences of welfare state restructuring in the UK. Currently, he is Associate Director of the Urban Big Data Centre, Glasgow where he leads on the creation of a service for the analysis of confidential data. He has a particular interest in issues of equity in access to public services.
Anderson, C. (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired Magazine, 16 July. [http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory]
Mayer-Schönberger, V. and Cukier, K. (2013) Big Data: a revolution that will transform how we live, work and think. Boston: Eamon Dolan.