Data. A really short history

“One man’s noise is another man’s signal.” Edward Ng

 

Data are “givens”. They are artefacts. They don’t pre-exist in the world, but come into the world because we, the human beings, want them to. If you imagine an intimate relationship between humans and data, you would say that data do not exist independently of humans. Hunters and gatherers collect the discrete objects of their desire, put them in order, administer them, store them, use them, throw them away, forget where they were stored and at times uncover those forgotten places again, for example during a blithe snowball fight. At that time data were clearly divided from non-data – edible things were separated from inedible ones, helpful items from uninteresting ones. This selectivity may have benefited from the dependency on shifting availability of resources in a wider space.

Later on, when mankind left the forests, things became much more complicated. With the advent of agriculture and especially of trade, data became more meaningful. Furthermore, rulers became interested in knowledge about their sheep and therefore asked some of their accountants to keep records of their subordinates – the oldest census dates back to 700 B.C. When mathematics became a bit more complicated, from the 17th (probability) and 19th (statistics) centuries onwards, data about people, agriculture, and trade began to pile up, and it became more and more difficult to distinguish between relevant data (“signal”) and irrelevant data (“noise”). The distinction between the two simply became a matter of the question with which the data at hand were consulted.

With the advent of the industrial age, the concept of mechanical objectivity was introduced, and the task of data creation was delegated to machines constructed to collect the items in which humans were interested. Now data were available in huge amounts, and the need to organize and order them became even more pressing. It is here that powerful schemes came into force: selection processes, categorizations, classifications, standards; variables prioritized as signal over others reduced to noise, thus creating systems of measurement and normativity intended to govern societies. They have been insightfully investigated in the book “Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life.”*

It was only later, at the beginning of the post-industrial age, that an alternative to this scheme- and signal-oriented approach was developed: simply piling up everything that may be of any interest, a task also often delegated to machines because of their patient effortlessness. The agglomeration of masses presupposes that storing is not a problem, neither in spatial nor in temporal terms. The result of such an approach is nowadays called “Big Data” – the accumulation of masses of (mostly observational) data for no specific purpose. Collecting noise in the hope of signal, without defining what is noise and what is signal. Fabricating incredibly large haystacks and assuming there are needles in them. Data as the output of a completely economised process with its classic capitalistic division of labour, including the alienation from their sources.

What is termed “culture” often resembles these haystacks. Archives are haystacks. The German Historical Museum in Berlin is located in the “Zeughaus”, literally the “house of stuff”,** stuffed with the remnants of history, a hayloft of history. Libraries are haystacks as well; if you are not bound to the indexes and registers called catalogues, if you are free to wander around the shelves of books and pick them out at random, you might get lost in an ocean of thoughts, in countless imaginary worlds, in intellectual universes. This is the fun of the explorers, conquerors and researchers: Never get bored through routine, always discover something new which feeds your curiosity. And it is here, within this flurry of noise and signal, within the richness of multisensory, kinetic and synthetic modes of access, where it becomes tangible that in culture noise and signal cannot be thought without the environment out of which they were taken, that both are the product of human endeavours, and that data are artefacts that cannot be understood without the context in which they were created.

 

*Martha Lampland, Susan Leigh Star (Eds.), Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life. Ithaca: Cornell Univ. Press 2009.
** And yes, it should be translated correctly as “armoury”.

Dr Jennifer Edmond on RTÉ Brainstorm

A piece by Dr. Jennifer Edmond, Principal Investigator of the KPLEX project, Director of Strategic Projects in the Faculty of Arts, Humanities and Social Sciences at Trinity College Dublin and co-director of the Trinity Centre for Digital Humanities, has been published on the RTÉ Brainstorm website. In this insightful piece, Dr. Edmond addresses some of the challenges of talking about Big Data, an important theme currently being explored by the KPLEX project team.

You can read the full article here: https://www.rte.ie/eile/brainstorm/2018/0110/932290-the-problem-with-talking-about-big-data/

Big Data and Biases

Big Data – these are per se data which are too big to be inspected by humans. Their bigness has consequences: they are so large that the typical applications used to store, compute and analyse data are inadequate. Often processing them is too much for a single computer; thus, a cluster of computers has to be used in parallel. Or the data are first made tractable by mapping an unstructured dataset into a dataset whose individual elements are key-value pairs; the values belonging to each key are then reduced to a summary, and mathematical analyses can be performed on this reduced set of key-value pairs (“MapReduce”). Even though Big Data are not collected in response to a specific research question, their sheer size (millions of observations x of y variables) promises to provide answers relevant for a large part of a society’s population. From a statistical point of view, large sample sizes boost significance: with millions of observations even tiny differences become statistically significant, so the effect size is the more important measure. On the other hand, large does not mean all; one has to be aware of the universe covered by the data. Statistical inference – conclusions drawn from data about the population as a whole – cannot easily be applied, because the datasets are not established in a way that ensures that they are representative. Ironically, therefore, bias in Big Data may come from missing data, e.g. on those parts of the population which are not captured in the data collection process.
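The MapReduce idea mentioned above can be illustrated with a toy word count. This is only a minimal single-machine sketch, not how a real cluster framework is implemented; the function names and the two sample records are invented for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: turn one unstructured record (a line of text) into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single summary value."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample records, standing in for an unstructured dataset.
records = ["big data big claims", "data are givens"]
counts = map_reduce(records)
print(counts)
```

In a real deployment the map and reduce steps would run in parallel on different machines, each handling only a slice of the records or keys; the sketch only shows the shape of the data at each step.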

But biases may also arise in the process of analysing Big Data. This, too, has to do with the substantial size of the datasets; standard software may be unable to handle them. Beyond parallel computing and MapReduce, the use of machine learning seems to provide solutions. Machine learning designates algorithms that can learn from and make predictions on data by building a model from sample inputs. It is a type of artificial intelligence in which the system learns from lots of examples; results – such as patterns or clusters – become stronger with more evidence. It is for this reason that Big Data and machine learning seem to go hand in hand. Machine learning can roughly be divided into A) analytic techniques which use stochastic data models, most often classification and regression in supervised learning; and B) predictive approaches, where the data mechanism is unknown, as is the case with neural nets and deep learning. In both cases biases may be the result of the processing of Big Data.

A) The goal of statistical modelling is to find a model which allows one to draw quantitative conclusions from data. It has the advantage that the data model is transparent and comprehensible to the analyst. However, what sounds objective (since it is ‘based on statistics’) need neither be correct (if the model is a poor emulation of reality, the conclusions may be wrong) nor fair: the algorithms may simply not be written in a manner which defines fairness or an even distribution as a goal of the problem-solving procedure. Machine learning then commits disparate mistreatment: the algorithm optimizes the discrimination for the whole population, but it is not looking for a fair distribution of this discrimination. ‘Objective decisions’ in machine learning can therefore be objectively unfair. This is the reason why Cathy O’Neil has called an algorithm “an opinion formalized in code”[1] – it does not simply provide objectivity, but works towards the (unfair) goals for which it was written. But there is a remedy; it is possible to develop mechanisms for fair algorithmic decision making. See for example the publications of Krishna P. Gummadi from the Max Planck Institute for Software Systems.
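One way to make the uneven distribution of errors visible is to compare error rates per group instead of looking only at overall accuracy. The sketch below does this for the false positive rate; it assumes that reading of "disparate mistreatment", and the group labels, outcomes and predictions are invented toy data, not results from any real system.

```python
def false_positive_rate(y_true, y_pred):
    """Share of truly negative cases that were wrongly predicted positive."""
    negatives = [pred for true, pred in zip(y_true, y_pred) if true == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0

def rates_by_group(groups, y_true, y_pred):
    """Compute the false positive rate separately for every group label."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return result

# Invented toy data: two groups of six people each.
groups = ["a"] * 6 + ["b"] * 6
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]  # actual outcome
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]  # classifier decision

rates = rates_by_group(groups, y_true, y_pred)
print(rates)  # group "b" is wrongly flagged three times as often as group "a"
```

A classifier tuned only for overall accuracy has no reason to keep these two numbers close; a fairness constraint would have to state that goal explicitly.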

Example of an algorithm, taken from: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Boston 2006, p. 164.

B) In recent years, powerful new tools for Big Data analysis have been developed: neural nets and deep learning algorithms. The goal of these tools is predictive accuracy; they are hardware-hungry and data-hungry, but have their strength in complex prediction problems where it is obvious that stochastic data models are not applicable. The approach is therefore designed differently: what is observed is a set of x’s that go in and a subsequent set of y’s that come out. The challenge is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. The focus does not lie with the model by which the input x is transformed into the output y; it does not have to be a stochastic data model. Rather, the model is unknown, complex and mysterious; and irrelevant. This is the reason why accurate prediction methods are addressed as complex “black boxes”; at least with neural nets, ‘algorithm’ is seen as a synecdoche for “black box”. Unlike with stochastic models, the goal is not interpretability, but accurate prediction. And it is here, on the basis of an opaque data model, that neural nets and deep learning extract features from Big Data and identify patterns or clusters which have been invisible to the human analyst. It is fascinating to see that humans don’t decide what those features are. The predictive analysis of Big Data can identify and magnify patterns hidden in the data. This is the case with many recent studies, for example the facial recognition system recognizing ethnicity which has been developed by the company Kairos, or the Stanford study inferring sexual orientation by analysing people’s faces. What becomes apparent here is that the automatic feature extraction amplifies human bias. A lot of the talk about “biased algorithms” is a result of these findings.

But are the algorithms really to blame for the bias, especially in the case of machine learning systems with a non-transparent data model?

This question leads us back again to Big Data. There are at least two possible ways in which the data used predetermine the outcomes: The first is Big Data with built-in bias which is then amplified by the algorithm. Simply go to Google image search and search for the words “CEO” or “cleaner”. The second is the difference between the data sets used as training data for the algorithm and the data analysed subsequently. If you don’t have, for example, African American faces in a training set for facial recognition, you simply don’t know how the algorithm will behave when applied to images of African American faces. Therefore, the appropriateness and the coverage of the data set are crucial.
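The second point, the gap between training data and the data analysed later, can be shown with a deliberately tiny example. The sketch assumes a one-dimensional feature and a simple cut-off rule; the numbers are invented, and real facial recognition systems are of course far more complex.

```python
def fit_threshold(xs, ys):
    """Learn a cut-off halfway between the mean feature value of each class."""
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2

def accuracy(threshold, xs, ys):
    """Share of cases where 'feature above threshold' matches the true label."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

# Group A is the only group present in the (hypothetical) training set.
a_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
a_y = [0, 0, 0, 1, 1, 1]

# Group B has the same labels, but its feature values are shifted.
b_x = [4.0, 5.0, 6.0, 10.0, 11.0, 12.0]
b_y = [0, 0, 0, 1, 1, 1]

threshold = fit_threshold(a_x, a_y)
acc_a = accuracy(threshold, a_x, a_y)
acc_b = accuracy(threshold, b_x, b_y)
print(threshold, acc_a, acc_b)
```

The rule is perfect on the group it was trained on and makes mistakes on the group it never saw, although nothing in the algorithm itself changed.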

The other point lies with data models and the modelling process. Models are always contextual, be they stochastic models with built-in assumptions about how the world works; or be they charged with context during the modelling process. This is why we should reflect on the social and historical contexts in which Big Data sets have been established; and the way our models and modelling processes are being shaped. And maybe it is also timely to reflect on the term “bias”, and to recollect that it implies an impossible unbiased ideal …

 

[1] Cathy O’Neil, Weapons of Math Destruction. How Big Data increases inequality and threatens Democracy, New York: Crown 2016, p.53.

The K-PLEX project on the European Big Data Value Forum 2017

Mike Priddy (DANS, 2nd from right in the image) represented the K-PLEX project at the European Big Data Value Forum 2017 Conference in a panel on privacy-preserving technologies. Read here the statements he made in answer to the three questions posed.

Question 1: There is an apparent trade-off between big data exploitation and privacy – do you agree or not?

  • Privacy is only one part of Identity. There needs to be respect for the individual’s right to build their identity upon a rich and informed basis.
  • The right not to know should also be considered. People have a right to their own levels of knowledge.
  • Privacy is broader than the individual. Confidential data exists in and can affect: family, community, & company/organisations. The self is relational, it is not individual, it produces social facts and consequences.
  • Trust in data use & third party use – where should the accountability be?
  • There is the challenge of transparency versus accountability; just making all data available may obfuscate the accountability.
  • Accountability versus responsibility? Where does the ethical responsibility lie with human & non-human actors?
  • Anonymisation is still an evolving ‘science’ – the effectiveness of anonymising processes is not always well and broadly understood. Aggregation may not give the results that users want or can use; it may protect the individual, but not necessarily a community or family.
  • Anonymity may be an illusion; we don’t understand how minimal the data may need to be in order to expose identity. DoB, Gender & Region is enough to be disclosive for the majority of a population.
  • Individuals, in particular young or vulnerable individuals, may not be in a position to defend themselves.
  • This means that big data may need to exclude communities & people with niche problems.
  • Black boxes of ML & NNets don’t allow people to understand or track use or misuse or misinformation – wrong assertions being made: you cannot give informed consent under these conditions.
  • IOT and other technologies (facial recognition) mean that there is possibly no point at which informed consent can be given.
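The point made above about date of birth, gender and region can be checked on any table by counting how many records are unique on those attributes. The following is a minimal sketch on five invented records; the field names are assumptions and the figures say nothing about a real population.

```python
from collections import Counter

def unique_share(records, quasi_identifiers):
    """Share of records that are the only one with their combination of attributes."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

# Invented records without names, i.e. nominally "anonymised".
records = [
    {"dob": "1980-03-14", "gender": "f", "region": "north"},
    {"dob": "1980-03-14", "gender": "m", "region": "north"},
    {"dob": "1975-11-02", "gender": "f", "region": "south"},
    {"dob": "1975-11-03", "gender": "f", "region": "south"},
    {"dob": "1991-07-30", "gender": "m", "region": "south"},
]

by_region = unique_share(records, ["region"])
by_gender_region = unique_share(records, ["gender", "region"])
by_all_three = unique_share(records, ["dob", "gender", "region"])
print(by_region, by_gender_region, by_all_three)
```

Each added attribute shrinks the groups in which a person can hide; in this toy table every record becomes unique once all three attributes are combined.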

Strategies for meeting these issues:

  • There are well established strategies to deal with disclosure of confidential data in the Social Sciences and Official Statistics: such as output checking, off the grid access, remote execution (with testable data), secure rooms etc. Checks and balances are needed (a pause) before it goes out – this is a part of oversight and governance.
  • Individuals should be able to see when these processes are triggered, and decide if it is disclosive and whether that is appropriate.
  • More information about how data is used, shared and processed must be made available to the data creator (in a way they can use it).
  • Meeting the ISO 27001 standard in your data handling and procedures within your organisation is a good start.

Question 2: Regarding the level of development, privacy preserving big data technologies still have a long way to go – do you agree or not?

  • Biases are baked in. There isn’t enough differentiation between kinds of data: mine, yours, raw, cleaned, input, output – data is seen as just data and processed without narrative or context. We do not need privacy by design; we need humanity at the centre of design and respect for human agency.
  • Too often we only are concerned about privacy when it becomes a problem: privacy/confidentiality is NOT an obsolete concept.

Question 3: Individual application areas differ considerably with regard to the difficulty of meeting the privacy requirements – do you agree or not?

  • The problem is the way the question is formulated. By looking at application areas we are basically saying the problem is superficial. It is not. It is fundamental.
  • It has become very hard to opt out of everything. We cannot cut all of our social ties because of network effects.
  • Technology is moving faster than society can cope with, and faster than society can understand how data is being used. This is not a new phenomenon; we can see similar challenges in the historical record.
  • Privacy needs to be understood as a public good; there must be the right to be forgotten, but also right not to be recorded.
  • Data citizenship is needed: Citizens need to be involved enough & to be able to make better decisions about providing confidential/personal data & what happens to their data: what it means and what happens when you fill in that form.

 

Ways of Being in a Digital Age.

I’m just back from a few days in Liverpool, where I attended the “Ways of Being in a Digital Age” review conference at the University of Liverpool. This was the TCD working group’s first trip abroad for KPLEX dissemination, and my first ever trip to Liverpool.

UoL’s “Ways of Being in a Digital Age” was a massive scoping review on how digital technology mediates our lives, and how technology and social change co-evolve and impact on each other. This conference came at the conclusion of the project, and invited participation in the form of papers that addressed any of the seven project domains: Citizenship and politics, Communities and identities, Communication and relationships, Health and wellbeing, Economy and sustainability, Data and representation, Governance and security. Naturally we hopped right into the “Data and Representation” category.

I was presenting a paper I co-wrote with Jennifer (Edmond, KPLEX’s PI) and, like most of my KPLEX activities thus far, I also used the platform as an opportunity to include as many funny data memes as I could reasonably fit into a 20-minute PowerPoint presentation. Which, by the way, is A LOT.

Our paper was titled “Digitising Cultural Complexity: Representing Rich Cultural Data in a Big Data environment” and in it we drew on many of the issues we’ve discussed thus far in the blog: data definitions; the problems brought about by having so many different types of data, all classified using the same term (data); data and surveillance; data and the humanities; and the “aura” of big data, that is, how the possibilities of big data are manipulated and bumped up so that it seems like an infallible “cure all,” when in fact it is anything but. And most importantly, why complexity matters, and what happens when we allow alternative facts to take precedence over data.

The most exciting thing (from my perspective) was that we got to talk about some of our initial findings, findings based on interviews I conducted with a group of computer scientists who very generously gave me some of their time over the summer, and on a more recent data mining project that is still underway, but that is producing some really exciting results. After all this desk research and talk about data over the last 9 months or so, the KPLEX team as a unit are in the midst of collecting data of our own, and that’s exciting. Below is a photo montage of my experience of the WOBDA conference, which mainly consists of all the different types of coffee I drank while there, along with some colossal pancakes I had for breakfast. I also acquired a new moniker, from the hipster barista in a very lovely coffee shop that I frequented twice during my two-day stay. On the receipt, the note she wrote to find me was “Glasses lady.” 🙂

To err is human – but also computers can make mistakes

Imagine an automated rating of CVs in order to decide whom to grant a scholarship or which job candidate to hire. This is not science-fiction. Big private companies increasingly rely on such mechanisms to make hiring decisions. They analyse the data of their employees to find the underlying patterns for success. The characteristics of job candidates are then matched with those of successful employees and the system recommends those candidates with most similar characteristics. Much less time and effort is needed to choose the “right” candidates from a pool of promising applicants. Certainly, the human resources department has to reflect on what characteristics to choose and how to combine and weight them, but the recommendations based on the analysis of big data seem to be very efficient.
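The matching step described above can be sketched in a few lines. This is a hypothetical illustration only: the three profile features, the candidate names and the use of a simple distance to the average successful employee are all assumptions, not a description of any company's actual system.

```python
import math

def centroid(profiles):
    """Average profile of the employees regarded as successful."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def distance(a, b):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_candidates(candidates, successful):
    """Order candidate names from most to least similar to the average success profile."""
    target = centroid(successful)
    return sorted(candidates, key=lambda name: distance(candidates[name], target))

# Hypothetical features: years of experience, job changes, career breaks.
successful = [[5, 1, 0], [6, 2, 0], [4, 3, 0]]
candidates = {
    "career_changer": [1, 5, 2],
    "conventional": [5, 2, 0],
}

ranking = rank_candidates(candidates, successful)
print(ranking)
```

Whoever resembles past successes ends up at the top of the list; a profile that diverges from them is ranked low regardless of its actual potential.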

Making automatic, algorithm-produced predictions about one individual person by analyzing the information from many people is nevertheless problematic in several ways. First of all, it requires, among other things, large datasets to avoid bias. Second, the demand for standardized CVs implies that non-normative ways of presenting oneself are excluded a priori. Third, assume that the characteristics of high achievers change over time. The system will continue (at least for some time) to formulate recommendations based on past experiences. The static model of prediction will be unable to detect potentially innovative candidates who have divergent experiences and skills. It thus discriminates against individuals with non-standard backgrounds and motivations. A last problem is that all the data available to the model are based on the people who have been accepted in the first place and who have proven successful or unsuccessful thereafter. Nothing is known about the career paths of the applicants who had been rejected.

“In this context, data – long-term, comparable, and interoperable – become a sort of actor, shaping and reshaping the social worlds around them” (Ribes and Jackson 2013: 148). Taken from an ecological research project on stream chemistry data, this statement applies equally to the problem of automatic recommendation systems.

Even worse, not only the CV but also the footprint one leaves in cyberspace might serve as the basis of decision-making. The company Xerox used data it had mined from its (former) employees to define the criteria for hiring new staff for its 55,000 call-centre positions. The applicants’ data gained from the screening test were compared with the significant, but sometimes unexpected, criteria detected so far. In the case of Xerox, for example, “employees who are members of one or two social networks were found to stay in their job for longer than those who belonged to four or more social networks”.

Whether the social consequences of these new developments can be attributed to humans or also computers is highly controversial. Luciano Floridi (2013) makes the point that we should differentiate between the accountability of (artificial) agents and the responsibility of (human) agents. Does the algorithm discussed above qualify as an agent? Floridi would argue yes, because “artificial agents could have acted differently had they chosen differently, and they could have chosen differently because they are interactive, informed, autonomous and adaptive” (ibid. 149). So even if “it would be ridiculous to praise or blame an artificial agent for its behavior, or charge it with a moral accusation” (ibid. 150), we must acknowledge that artificial agents as transition systems interact with their environment, that they can change their state independently and that they are even able to adopt new transition rules by which they change their state.

The difference between accountability and responsibility should be kept in mind, so that attempts to delegate responsibility to artificial agents can be uncovered. In case of artificial agents malfunctioning, the engineers who designed them are requested to re-engineer them to make sure they no longer cause evil. And in the case of recruitment decisions companies should be very careful about how to proceed. There is no single recipe for success.


Floridi, Luciano (2013) The Ethics of Information. Oxford: Oxford University Press.

Ribes, David/ Jackson, Steven J. (2013) Data Bite Man: The Work of Sustaining a Long-Term Study. In: Lisa Gitelman, “Raw Data” Is an Oxymoron. Cambridge: The MIT Press. 147-166.

Featured image was taken from: https://cdn.static-economist.com/sites/default/files/images/print-edition/20170610_FND000_0.jpg

On the aura of Big Data

People who know very little about technology seem to attribute an aura of “objectivity” and “impartiality” to Big Data and analyses based on them. Statistics and predictive analytics give the impression, to the outside observer, of being able to reach objective conclusions based on massive samples. But why exactly is that so? How has it come about that societal discourse ascribes this aura to Big Data analyses?

Since most people conceive of Big Data as tables filled with numbers which have been collected by machines observing human behavior, there are at least two points intermingled in this peculiar aura of Big Data: The belief that numbers are impartial and preinterpretive, and the conviction that there exists something like mechanical objectivity. Both concepts have a long history, and it is therefore wise to consult cultural historians and historians of science.

With respect to the claim that numbers are theory-free and value-free, one can consult the book “A History of the Modern Fact”[1] by Mary Poovey. Poovey traces the history of the modern epistemological assumption that numbers are free of an interpretive dimension, and she tells the story of how description came to seem separate from interpretation. In analyzing historical debates about induction and by studying authors such as Adam Smith, Thomas Malthus, and William Petty, Poovey points out that “Separating numbers from interpretive narrative, that is, reinforced the assumption that numbers were different in kind from the analytic accounts that accompanied them.” (XV) If nowadays many members of our societies imagine that observation can be separated from analysis and that numbers guarantee value-free description, this is the result of the long historical process examined by Poovey. But seen from an epistemological point of view this is not correct, because numbers are interpretive – they embody theoretical assumptions about what should be counted, they depend on categories, entities and units of measurement established before counting has begun, and they contain assumptions on how one should understand material reality.

The second point, mechanical objectivity, has been treated by Lorraine Daston and Peter Galison in their book on “Objectivity”; it contains a chapter of the same name.[2] Daston and Galison focus on photography as a primary metaphor for the objectivity ascribed to a machine. Alongside this example, they describe mechanical objectivity as “the insistent drive to repress the willful intervention of the artist-author, and to put in its stead a set of procedures that would, as it were, move nature to the page through a strict protocol, if not automatically.” (121) Both authors see two intertwined processes at work: On the one hand the separation of the development and activities of machines from the human beings who conceived them, with the result that machines were attributed freedom from the willful interventions that had come to be seen as the most dangerous aspects of subjectivity. And on the other hand the development of an ethics of objectivity, which called for a morality of self-restraint in order to keep researchers from interventions and interferences like interpretation, aestheticization, and theoretical overreaching. Thus machines – be they cameras, sensors or electronic devices – have become emblematic of the elimination of human agency.

If the aura of Big Data is based on these conceptions of an “impartiality” of numbers and data collected by “objectively” working machines, there remains little space for human agency. But this aura proves to be a false consciousness, the consequences of which can easily be seen: if analyses based on Big Data are taken as ground truth, it is no wonder that no space is opened up for public discussion, for decisions made independently by citizens, and for a democratically organized politics in which the processes where Big Data play an important role are actively shaped.

[1] Mary Poovey, A History of the Modern Fact. Problems of Knowledge in the Sciences of Wealth and Society, Chicago / London: The University of Chicago Press 1998.

[2] Lorraine Daston, Peter Galison, Objectivity, New York: Zone Books 2007.

New Wine into Old Wineskins

The communication of scientific outputs or in other words the way narratives relate to data has received much attention in previous KPLEX posts. Questions such as “Do the narratives elaborate on the data, create narrative from the data, or do the narratives reveal the latent richness inherent in the data?” have been raised. These fundamental questions touch upon the very heart of the scientific enterprise. How do we try to grasp the complexity of a phenomenon and how do we translate our insights and findings into clear language?

Debates in anthropology might be revealing in this regard. The “reflexive turn” since the 1970s led anthropologists to ask themselves if it was possible to create an objective study of a culture when their own biases and epistemologies were inherently involved. Until then they had produced some kind of “realist tale”, with a focus on the regular and on the joining of observations with standard categories. Only those data had been allowed that supported the analysis; the underanalyzed or problematic had been left out. Anthropologists’ worries and efforts had revolved around the possible criticism that their work was “‘opinion, not fact’, ‘taste, not logic’, or ‘art, not science’” (Van Maanen 1988: 68). Then a plenitude of new tales emerged that openly discussed the accuracy, breadth, typicality, or generality of their own cultural representations and interpretations. Put under headings like “confessional tale”, “impressionist tale”, “critical tale”, “literary tale” or “jointly told tale”, all these kinds of narratives open up new ways of description and explanation. Issues such as serendipity, errors, misgivings and limiting research roles, for example, are taken up by confessional writers (Karlan and Appel 2016).

How far has this discussion progressed in the sciences? A differentiation analogous to the one between “real events (Geschichte) and the narrative strategies (Historie) used to represent, capture, communicate, and render these events meaningful” has to some extent taken place. Still, the process of constructing scientific facts often happens in a black box and is not revealed to the reader, as the famous study by Bruno Latour and Steve Woolgar has shown. In the sciences, as in the humanities, different kinds of statements – ranging from taken-for-granted facts through tentative suggestions and claims to conjectures – contribute to the establishment of “truth”. The combination of researchers, machines, “inscription devices”, skills, routines, research programs, etc. leads to the “stabilisation” of statements: “At the point of stabilisation, there appears to be both objects and statements about these objects. Before long, more and more reality is attributed to the object and less and less to the statement about the object. Consequently, an inversion takes place: the object becomes the reason why the statement was formulated in the first place. At the onset of stabilisation, the object was the virtual image of the statement; subsequently, the statement becomes the mirror image of the reality ‘out there’” (Latour and Woolgar 1979: 176 f.).

So with regard to the sciences, the questions that had been raised before, “Is it possible for a narrative to become data-heavy or data-saturated? Does this impede the narrative from being narrative?”, have to be answered in the negative. Discursive representations are always implicated when conveying some version of truth. In terms of reflexivity there is still some room for improvement, e.g. by putting the focus not only on the communication of startling facts but also on non-significant results. This would certainly help the practitioners of science to better understand the scope and explanatory power of their disciplinary methods and theories. The hope remains that, unlike in the parable of new wine in old wineskins, the sciences will withstand these changes and not burst.


Karlan, Dean S./ Appel, Jacob (2016) Failing in the field: What we can learn when field research goes wrong. Princeton: Princeton University Press.

Latour, Bruno/ Woolgar, Steve (1979) Laboratory Life: The Construction of Scientific Facts. Princeton: Princeton University Press.

Van Maanen, John (1988) Tales of the Field: On Writing Ethnography. Chicago: The University of Chicago Press.

Featured image was taken from: https://i.pinimg.com/originals/22/bc/0a/22bc0a8c4573701d9df70f51a971388a.jpg

Statistical Modeling: The Two Cultures

In this article Leo Breiman describes two approaches in statistics: One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

Statisticians in applied research consider data modeling as the template for statistical analysis and, within their range of multivariate analysis tools, focus on discriminant analysis and logistic regression for classification and on multiple linear regression for regression. This approach has the advantage that it produces a simple and understandable picture of the relationship between the input variables and the response. But the assumption that the data model is an emulation of nature is not necessarily right and can lead to wrong conclusions.

The algorithmic approach uses neural nets and decision trees, with predictive accuracy as the criterion for judging the quality of the results of the analysis. This approach does not apply data models to explain the relationship between input variable x and output variable y, but treats this relationship as a black box. Hence the focus is on finding an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. While this approach has driven major advances in machine learning, it lacks interpretability of the relationship between predictor and response variables.
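The contrast between the two cultures can be made concrete with a toy comparison. The sketch below is not taken from Breiman's article: it fits a straight line (a simple data model) and a 1-nearest-neighbour predictor (standing in for the algorithmic culture, in place of neural nets or decision trees) to invented data whose true relationship is quadratic, and compares both on a test set.

```python
def fit_line(xs, ys):
    """Data-model culture: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

def predict_nearest(train, x):
    """Algorithmic culture: return the y of the closest training x, no model of nature."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def mse(predictions, ys):
    """Mean squared error, the measure of predictive accuracy used here."""
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

# Invented data: the true mechanism is y = x squared, unknown to both methods.
train_x = [-3, -2, -1, 0, 1, 2, 3]
train_y = [x * x for x in train_x]
test_x = [-2.6, -0.4, 1.4, 2.6]
test_y = [x * x for x in test_x]

line_model = fit_line(train_x, train_y)
train_pairs = list(zip(train_x, train_y))

mse_line = mse([predict_line(line_model, x) for x in test_x], test_y)
mse_nn = mse([predict_nearest(train_pairs, x) for x in test_x], test_y)
print(line_model, mse_line, mse_nn)
```

The fitted line is easy to read (a slope and an intercept) but rests on a wrong assumption about the data; the nearest-neighbour predictor explains nothing, yet predicts better on the test set. That is the trade-off between interpretability and predictive accuracy in miniature.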

This article was published in 2001, when the term “Big Data” was not yet on everybody’s lips. But by describing two different cultures of analyzing data and weighing the pros and cons of each approach, it makes the differences between big data analysis and stochastic data models understandable even to laypeople.

Leo Breiman, Statistical Modeling: The Two Cultures. In: Statistical Science, Vol. 16 (2001), No. 3, 199-231. Freely available online here.