Statistical Modeling: The Two Cultures

In this article Leo Breiman describes two approaches in statistics: One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

Statisticians in applied research regard data modeling as the template for statistical analysis. Within their range of multivariate analysis tools, they focus on discriminant analysis and logistic regression for classification and on multiple linear regression for regression. This approach has the advantage that it produces a simple and understandable picture of the relationship between the input variables and the response. But the assumption that the data model is an emulation of nature is not necessarily right and can lead to wrong conclusions.

The algorithmic approach uses methods such as neural nets and decision trees, and it takes predictive accuracy as the criterion for judging the quality of an analysis. This approach does not apply data models to explain the relationship between input variable x and output variable y, but treats this relationship as a black box. Hence the focus is on finding an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. While this approach has driven major advances in machine learning, it offers little insight into the relationship between predictor and response variables.
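The contrast between the two cultures can be made concrete with a small sketch. It is not taken from Breiman's article: the quadratic "nature", the sample sizes, and the choice of a nearest-neighbour predictor as the algorithmic model are all assumptions made for illustration. Both models are judged in the way the algorithmic culture proposes, by their error on a test set of future x.

```python
import random


def fit_linear(xs, ys):
    """Data modeling culture: assume y = a + b*x + noise and estimate a, b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def fit_knn(xs, ys, k=5):
    """Algorithmic culture: no model of the mechanism, average the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def mse(f, xs, ys):
    """Mean squared prediction error of f on the given data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Hypothetical 'nature': quadratic with noise, unknown to both models."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

linear_mse = mse(fit_linear(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"linear model test MSE: {linear_mse:.2f}")
print(f"k-NN test MSE:         {knn_mse:.2f}")
```

In this toy setting the linear model gives a readable slope and intercept but predicts poorly, because its assumed data model does not match the mechanism; the nearest-neighbour predictor predicts well but says nothing about how x and y are related.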

This article was published in 2001, when the term “Big Data” was not yet on everybody’s lips. But by outlining two different cultures of data analysis and weighing the pros and cons of each approach, it makes the differences between big data analysis and stochastic data models understandable even to laymen.

Leo Breiman, Statistical Modeling: The Two Cultures. In: Statistical Science, Vol. 16 (2001), No. 3, 199-231. Freely available online.

On assistance and partnership with artificial intelligence

After mobile devices and touchscreens, personal assistants will be “the next big thing” in the tech industry. Amazon’s Alexa, Microsoft’s Cortana, Google Home, Apple’s HomePod – all these voice-controlled systems come to your home, driven by artificial intelligence, ready for dialogues. Artificial intelligence currently works best when provided with clear tasks, where clear goals are defined. That can be a goal defined by the user (“Please turn down the music, Alexa!”), but generally the goals of the companies offering personal assistants dominate: They want to sell. And there you find the differences between these personal assistants: Alexa sells the whole range of products marketed by Amazon, Cortana eases access to Microsoft’s soft- and hardware, Google Home has its strengths with smart home devices and the internet of things, and Apple’s HomePod … well, urges you into the hall of mirrors which has been created by Apple’s Genius and other flavour enhancers.

Beyond well-defined tasks, artificial intelligence is bad at chatting and assisting. If you are looking for a partner, for someone to talk to, the predefined goals are missing. AI lacks the world knowledge needed for such a task; nor is it capable of ensuring the appropriateness of answers in a conversation or the similarity of mindsets which is the basis of friendship.

But this is exactly what is promised by the emotion robot “Pepper”. This robot saves the emotions it collects from its human interaction partners on a shared server, a cloud. All of the existing Pepper robots are connected to this cloud. This way the robots, which are already autonomous, collectively “learn” how to improve their emotional reactions. Their developers also work with these data.

If you think through the idea of “Pepper”, you have to ask yourself what end this robot should serve – as a replacement for a human partner, caring for people’s emotional well-being? How is this conceived? How does a robot know how to contribute to human well-being? Imagine a human couple in which he is choleric and her role is to constantly calm him down (contrapuntal approach). Or another couple which is constantly quarrelling: he shouts at her, and she yells back; this couple judges this to be the normal state and their quarrelling an expression of well-being (homeopathic approach). Can a robot decide which ‘approach’ is the best? Simply imagine what would happen in a scenario where a person – we call him ‘Donald’ – buys a new emotion robot – which we call ‘Kim’. Certainly this is not the kind of world we’re looking for, is it?

With personal assistants, it seems to be a choice between the devil and the deep blue sea: Either you are reduced to a consumer, or you are confronted with some strange product without openly defined goals, with which you cannot interact at eye level. So the best choices we have are either to abstain from using these AIs or to participate in civil society dialogues with tech companies and in policy debates about the use of AI.

Analogue secrets

Annual leave and holidays seem to encourage regressive impulses: Almost involuntarily one stops in front of a souvenir shop to inspect rotating displays of postcards. They definitely have a charm of their own – small pieces of cardboard with the aura of authenticity, indicating in a special way that “I was here”, as if there weren’t enough selfies and WhatsApp posts to support that claim. Quickly written and quickly sent, these relics of a faraway time in which messages were handwritten seem to provide proof that someone has really been far away and has happily come back, enriched by experiences of alterity.

This aura of the analogue, which provides a brief insight into the inner life of a person far away on a journey, has been exploited by an art project for many years. The project provides post office boxes in many countries of the world, to which people can send a postcard anonymously revealing a personal secret. A simple and ingenious idea: One can utter something which cannot be told by word of mouth; nor could it be written down, because one would need an addressee who would draw his own conclusions. But the project makes this secret public, and maybe the cathartic effect (and affect) of finally getting rid of a heavy burden can be enjoyed in anonymity.

Can we conceive of a digital counterpart to this analogue art project? Only if there are still users who believe in anonymity on the net. But literally every activity on the net leaves traces – so why should I reveal secrets there and thus become in one way or another susceptible to blackmail?

Praise for the good old hand-written postcard of confidence: Every mailing is an original, every self-made postcard is unique.

Patina pleases

Everything new is blank. New things display their integrity and an undestroyed, immaculate surface. Design objects are icons of youth and timelessness. They demand attention in a special way, because their newness is always at risk. Traces like scratches rob them of their stainlessness and indicate a loss of aura. The ageing of these objects triggers a certain horror in their owners, because it shows the passing of newness and a loss of control. Consumer electronics is a particularly instructive example here, and this is in spite of a miraculous ability of electronic content – the ability NOT to age.

But ageing is precious. Aged objects, which can be found at flea markets, in museums, or in antique shops, are often seen as precious. They represent the past, time, and history, and we appreciate the peculiarities of their surface, which has grown over decades. We are not disturbed by dust, grease, grime, wax, scratches, or cracks – on the contrary, the patina of objects represents their depth and their ability to exhibit their own ageing process. The attention of the observer focuses on the materiality of the object; the patina of the surface and the deformations gained over time mark its singularity and individuality. To be precise: The fascination of the observer focuses on the signs of ageing rather than on the object itself.


Schoolbag from a flea market

In the digital world, we don’t find comparable qualities. Ageing would mean that data have become corrupt, unreadable, unusable and therefore worthless. For objects digitized by cultural heritage institutions, this is a catastrophe; it means loss. With consumer electronics, it’s similar: A smartphone is devalued by scratches. No gain in singularity is noted. With software, it is even worse. Software ageing is a phenomenon in which the software itself keeps its integrity and functionality, but if the environment in which this software works, or the operating system, changes, the software will cause problems and errors, and sooner or later it will become dysfunctional. Ageing here means an inability to adapt to the digital environment, and surprisingly this happens without wear or deterioration, since it is not data corruption which causes this ageing. In this respect, the process of software ageing can be compared to the ageing of human beings: They lag behind the times or become uncoupled from the social world they live in; they lose connectivity, the ability to update themselves, and the skill to interact with their environment.

With analogue objects, this is not the case. They provoke sensual pleasures without reminding the observer of the negative aspects of human ageing. Even if they have become dysfunctional and useless, they keep the dignity and aura of time, inscribed into their body and surface. They keep their observers at a beneficial distance, which opens up space for imagination and empathy. The observer is free to picture the history of these objects and their capability to endure long spans of time without vanishing – certainly a faculty which human beings do not possess. What remains are the characteristics of dignified ageing. While the nasty implications of ageing are buried in oblivion, analogue objects evoke a beauty of ageing.

Why Messiness Matters

As always in a field where different conceptions are present, there exist differing understandings of what ‘tidy’ or ‘clean’ data on the one hand, and ‘messy’ data on the other might be.

On a very basic level, and coming from the notion of data arranged in tables, Hadley Wickham has defined ‘tidy data’ as those where each variable is a column, each observation is a row, and each type of observational unit is a table. Data not following these rules are understood as messy. Moreover, data cleaning is a routine procedure which statisticians and private companies apply to the data they gather: judging the number and importance of missing values, correcting obvious inconsistencies and errors, normalization, deduplication etc. Common examples are non-existent or incorrect postal codes, a number which wrongly indicates an age (in comparison to the birth date), the conversion of names from “Mr. Smith” to “John Smith”, etc.
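Both ideas from this paragraph can be sketched in a few lines. The tables, column names, and the reference date below are invented for illustration; the reshaping follows the tidy rules quoted above, and the cleaning step shows two of the routine checks mentioned (deduplication, and an age that contradicts the birth date).

```python
from datetime import date

# A "messy" wide table: the variable "year" is hidden in the column headers.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]


def to_tidy(rows, id_col, var_name, value_name):
    """Reshape so that each variable is a column and each observation is a row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy_rows.append({id_col: row[id_col], var_name: int(key), value_name: value})
    return tidy_rows


tidy = to_tidy(wide, "country", "year", "cases")


def age_on(birth, today):
    """Age in full years on a given day."""
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))


people = [
    {"name": "John Smith", "birth": date(1980, 5, 17), "age": 37},
    {"name": "John Smith", "birth": date(1980, 5, 17), "age": 37},  # exact duplicate
    {"name": "Jane Doe", "birth": date(1990, 1, 2), "age": 72},     # contradicts birth date
]


def clean(records, today):
    """Drop duplicate records and flag names whose stated age contradicts the birth date."""
    seen, kept, flagged = set(), [], []
    for rec in records:
        key = (rec["name"], rec["birth"])
        if key in seen:
            continue
        seen.add(key)
        if rec["age"] != age_on(rec["birth"], today):
            flagged.append(rec["name"])
        kept.append(rec)
    return kept, flagged


kept, flagged = clean(people, today=date(2017, 9, 1))
print(tidy)
print(len(kept), flagged)
```

Note that the sketch only flags the inconsistent age; deciding whether the age or the birth date is the error remains a judgment call for whoever cleans the data.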

In the sciences, the understanding of ‘messy’ data largely depends on what is understood as ‘signal’ and what is seen as ‘noise’. Instruments collect a lot of data, and only the initiated are able to distinguish between them. Cleaning data and thus peeling out the relevant information in order to obtain scientific facts is also a standard procedure here. In settings where the data structure is human-made rather than technically determined, ‘messy’ data remind scientists to reassess whether the relationship between the (research) question and the object of research is appropriate; whether the classifications used to form variables are properly conceived or whether they inappropriately limit the phenomenon in question; and what kind of data can be found in that garbage category of the ‘other’.

In the humanities, this is not as easily the case as it is in the sciences. Two recent publications provide examples. Julia Silge’s and David Robinson’s book “Text Mining with R” (2017) bears the subtitle “A tidy approach”. Very much like Wickham, they define ‘tidy text’ as “a table with one-token-per-row”. Brandon Walsh and Sarah Horowitz present in their “Introduction to Text Analysis” (2016) a more differentiated approach to an understanding of what ‘cleaning’ of ‘messy’ data might look like. They introduce their readers to the usual problems of cleaning dirty OCR; the standardization and disambiguation of names (their well-chosen example is Arthur Conan Doyle, who was born as Arthur Doyle, but used one of his given names as an addendum to his last name); and the challenges posed by metadata standards. All that seems easy stuff at first glance, but think of Gothic types (there exist more than 3,000 of them), pseudonyms or heteronyms, or camouflage printings published during the Inquisition or under political repression. Now you can imagine how hard it can be to keep your data ‘clean’.
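To make the ‘one-token-per-row’ definition tangible, here is a minimal sketch in Python rather than in R, the language of Silge and Robinson's book. The sample sentence, the column names, and the deliberately crude tokenizer (lower-case ASCII letters and straight apostrophes only) are assumptions for illustration; they also hint at how much gets discarded before a text counts as ‘tidy’.

```python
import re
from collections import Counter


def tidy_text(documents):
    """Return a table (list of rows) with one token per row."""
    rows = []
    for doc_id, text in documents.items():
        # Crude tokenizer: punctuation, capitalisation and non-ASCII letters are lost here.
        tokens = re.findall(r"[a-z']+", text.lower())
        for position, token in enumerate(tokens):
            rows.append({"doc": doc_id, "position": position, "token": token})
    return rows


docs = {"doyle": "The game is afoot. The game!"}
rows = tidy_text(docs)
counts = Counter(row["token"] for row in rows)

for row in rows:
    print(row)
print(counts.most_common(2))
```

Once the text is in this shape, counting, filtering, and joining work exactly as they do for any other table, which is the attraction of the tidy approach.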

And there is, last but not least, another conception of ‘messiness’ in the humanities. It lies in the specific cultural richness, or polysemy, or voluptuousness of the data (or the sources, the research material) in question: A text, an image, a video, a theater play or any other object of research can be interpreted from a range of viewpoints. Humanists are well aware of the fact that the choice of a theory or a methodological approach – the ‘grid’ which provides order to the chaos of what is being examined – never provides an exhaustive interpretation. It is the ‘messiness’ of the data under consideration which provides the foundation of alternative approaches and research results, which is responsible for the resistance to interpretation (and, with Paul de Man, to theory) – and which continuously demands an openness towards seeing things in another way.

Critical Data Studies: A dialog on data and space

This article was published by Craig M Dalton, Linnet Taylor, and Jim Thatcher. ‘Critical Data Studies: A dialog on data and space’. In: Big Data & Society, Vol 3 (2016), Issue 1, pp. 1-9.

In light of recent technological innovations and discourses around data and algorithmic analytics, scholars of many stripes are attempting to develop critical agendas and responses to these developments (Boyd and Crawford 2012). In this mutual interview, three scholars discuss the stakes, ideas, responsibilities, and possibilities of critical data studies. The resulting dialog seeks to explore what kinds of critical approaches to these topics, in theory and practice, could open and make available such approaches to a broader audience.

The article is available online at:

On digital oblivion

Knowledge is made by oblivion.
Sir Thomas Browne; in: Sir Thomas Browne’s Works: Including His Life and Correspondence, vol.2, p.177.

Remembrance, memory and oblivion have a peculiar relationship to each other. Remembrance is the internal realisation of the past, memory (in the sense of memorials, monuments and archives) its exteriorised form. Oblivion supplements these two to a trinity, in which memory and oblivion work as complementary modes of remembrance.

The formation of remembrance can be seen as an elementary function in the development of personal and cultural identity; oblivion, on the other hand, ‘befalls’, it happens, it is non-intentional and can therefore be seen as a threat to identity. Cultural heritage institutions – such as galleries, libraries, archives, and museums (GLAM) – are thus not only the places where objects are being collected, preserved, and organized; they also form bodies of memory, invaluable for our collective identity.

There is a direct line in these cultural heritage institutions from analogue documentation to digital practice: Online catalogues and digitized finding aids present metadata in a publicly accessible way. But apart from huge collections of digitized books, the material under question is mostly not available in digital formats. This is the reason why cultural heritage – and especially unique copies like the material stored in archives – can be seen as ‘hidden data’. What can be accessed are metadata: the most formal description of what is available in cultural heritage institutions. This structure works in a two-fold way towards oblivion: On the one hand, the content of archives and museums is present and existing, but not in digital formats and thus ‘invisible’. On the other hand, the century-long practice of documentation in the form of catalogues and finding aids has been carried over into digital information architectures; but even though these metadata are accessible, they hide more than they reveal if the content they refer to is not available online. We all have to rely on the information given and ordered by cultural heritage institutions, their classifications, taxonomies, and ontologies, to be able to access our heritage and explore what has formed our identities. Is Thomas Browne right in pointing to the structured knowledge gained from oblivion?

This depends on our attitude to the past and the formation of identity. It is possible to appreciate oblivion as a productive force. In antiquity, amnesty was the Siamese twin of amnesia; the word amnesty is derived from the Greek word ἀμνηστία (amnestia), or “forgetfulness, passing over”. Here it is oblivion which works for the generation of identity and unity: let bygones be bygones. In more recent times, it was Nietzsche who underlined the necessity to relieve oneself from the burdens of the past. It was Freud who identified the inability to forget as a mental disorder, earlier called melancholia, nowadays depression. And it was also Freud who introduced the differentiation between benign oblivion and malign repression.

But it is certainly not the task of GLAM institutions to provide for oblivion. Their function is the provision of memory. Monuments, memorials, and the contents of archives serve a double purpose: They keep objects in memory; and at the same time this exteriorisation neutralizes and serves oblivion insofar as it relieves us from the affect of mourning; to erect monuments and memorials and to preserve the past in archives is in this sense a cultural technique for the elimination of meaning. To let go of what is no longer present by preserving what is available – in this relation the complementarity of memory and oblivion becomes visible; they don’t work against each other, but jointly. From this point of view remembrance – the internal realization of the past – is the task of the individual.


Detail of the front of the Jewish Museum Berlin
By No machine-readable author provided. Stephan Herz assumed (based on copyright claims). [Public domain], via Wikimedia Commons

It is not the individual who decides what should sink into oblivion. Societies and cultures decide in a yet unexplored way which events, persons, or places (the lieux de mémoire) are kept in memory and which are not. If it is too big a task to change the documentation practices of GLAM institutions, the information architecture, and the metadata they produce, the actual usage of archival content could provide an answer to the question of what is of interest to a society and what is not: The digital documentation of what is being searched for in online catalogues and digitized finding aids, as well as of which material is being ordered in archives, clearly indicates the users’ preferences. If this documentation were collected as data and analysed with algorithms, we could learn from it not only about what is being kept in memory; we could also learn about what falls into oblivion. And that is a kind of information historians rarely have at their disposal.

Promise and Paradox: Accessing Open Data in Archaeology

This article has been published by Huggett, Jeremy. ‘Promise and Paradox: Accessing Open Data in Archaeology’. In: Clare Mills, Michael Pidd and Esther Ward. Proceedings of the Digital Humanities Congress 2012. Studies in the Digital Humanities. Sheffield: HRI Online Publications, 2014.

Increasing access to open data raises challenges, amongst the most important of which is the need to understand the context of the data that are delivered to the screen. Data are situated, contingent, and incomplete: they have histories which relate to their origins, their purpose, and their modification. These histories have implications for the subsequent appropriate use of those data but are rarely available to the data consumer. This paper argues that just as data need metadata to make them discoverable, so they also need provenance metadata as a means of seeking to capture their underlying theory-laden, purpose-laden and process-laden character.

The article is available online at:

On the humanities and “Big Data”

Most humanists get a bit frightened when the term “big data” is mentioned in a conversation; they are often simply not aware of big datasets provided by huge digital libraries like Google Books and Europeana, the USC Shoah Foundation Visual History Archive, or historical census and economic data, to name a few examples. Furthermore, there is a certain oblivion of historical precursors of ‘digital methods’ and the continuities between manual and computer-aided methods. Who in the humanities is aware, for example, of the fact that as early as 1851 a certain Augustus de Morgan put forward the idea of identifying an author by the average length of his words?
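The idea attributed to de Morgan above can be done by hand with a pencil or, today, in a few lines of code, which illustrates the continuity between manual and computer-aided methods. The following is a toy sketch only: the author profiles and the "closest average" rule are hypothetical simplifications, not a description of de Morgan's own procedure or of modern stylometry, which relies on many more features.

```python
import re


def average_word_length(text):
    """Mean number of letters per word; digits and punctuation are ignored."""
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        raise ValueError("text contains no words")
    return sum(len(word) for word in words) / len(words)


def closest_author(profiles, disputed_text):
    """Pick the author whose average word length is nearest to that of the disputed text."""
    target = average_word_length(disputed_text)
    return min(profiles, key=lambda author: abs(profiles[author] - target))


# Hypothetical writing samples standing in for two authors' known works.
profiles = {
    "A": average_word_length("a bb cc"),
    "B": average_word_length("lengthy elaborate vocabulary"),
}

print(average_word_length("It is a truth"))
print(closest_author(profiles, "it is so"))
```

With real texts the averages of different authors lie much closer together than in this example, so long samples are needed before a single number like this can carry any weight.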

The disconcertment in the humanities with regard to ‘big data’ is certainly understandable, since it points to the premises of the related disciplines and their long-practiced methodologies. Hermeneutics, for example, can be understood as a holistic penetration through individualistic comprehension. It is an iterative, self-reflexive process, which aims at interpretation. Humanists work on the exemplary and are not necessarily interested in generalization and exact quantities.


Big Data, by contrast, provide considerable quantities and point to larger structures, stratification and the characteristics of larger groups rather than the individual. But this seeming opposition does not mean that there are no bridges between humanities’ and big data approaches. Humanists can just as well expand their research questions with respect to the ‘longue durée’ or larger groups of people. To give an example from text studies: A humanist can develop her comprehension of single authors; her comprehension of the relationship between these single authors and the intellectual contexts they live in; and her comprehension of these intellectual contexts, i.e. the larger environment of individuals, the discursive structures prevailing at the time which is being researched, and the socio-economic conditions determining these structures.

That is not to say that humanities and big data could easily blend into each other. A technique like ‘predictive analysis’, for example, is more or less alien to humanistic research. And there is the critical reservation of the humanities with respect to big data: No computing will ever replace the estimation and interpretation of the humanist.

It is therefore both a reminder and an appeal to humanists to bring back the ‘human’ to big data: The critical perspective of the humanities is able to compensate for the loss of reality which big data inevitably entails; to situate big data regimes in time and space; to critically appraise the categories of supposedly neutral and transparent datasets and tools, which obscure their relationship to human beings; and to reveal the issues of trust, privacy, dignity, and consent inextricably connected to big data.

The strengths of the humanities – the hermeneutic circle, precise observation of details, complex syntheses, long chains of argumentation, the ability to handle alterity, and the creation of meaning – are capacities which have yet to be exploited in the handling of big data.

Has anyone ever analyzed big data classifications for their political or cultural implications?

If we think of big data, we don’t think of unstructured text data. Mostly we don’t even think of (weakly) structured data like XML – we conceive of big data as being organized in tables, with columns (variables) and rows (observations); lists of numbers with labels attached. But where do the variables come from? Variables are classifiers; and a classification, as Geoffrey Bowker and Susan Leigh Star have put it in their inspiring book “Sorting Things Out”, is “a spatial, temporal, or spatio-temporal segmentation of the world. A ‘classification system’ is a set of boxes (metaphorical or literal) into which things can be put to then do some kind of work – bureaucratic or knowledge production. In an abstract, ideal sense, a classification system exhibits the following properties: 1. There are consistent unique classificatory principles in operation. […] 2. The categories are mutually exclusive. […] 3. The system is complete.”[1] Bowker and Star describe classification systems as invisible, erased by their naturalization into the routines of life; in them, conflict, contradiction, and multiplicity are often buried beneath layers of obscure representations.
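Two of the three ideal properties quoted above, mutual exclusivity and completeness, can be checked mechanically once a classification is written down as a set of rules. The sketch below does this for a hypothetical set of age bands; the categories and their boundaries are invented for illustration. The first property, consistent classificatory principles, is a matter of interpretation and is not something this kind of check can establish.

```python
def check_classification(categories, items):
    """Test a classification against a list of items.

    categories maps a category name to a predicate. Returns the items that fall
    into no category (the system is not complete) and the items that fall into
    several categories (the categories are not mutually exclusive).
    """
    unclassified, ambiguous = [], []
    for item in items:
        hits = [name for name, belongs in categories.items() if belongs(item)]
        if not hits:
            unclassified.append(item)
        elif len(hits) > 1:
            ambiguous.append((item, hits))
    return unclassified, ambiguous


# Hypothetical age bands with two deliberate flaws.
age_bands = {
    "child": lambda age: 0 <= age <= 12,
    "teen": lambda age: 12 <= age <= 19,   # overlaps with "child" at 12
    "adult": lambda age: 20 <= age <= 64,  # nothing covers 65 and above
}

unclassified, ambiguous = check_classification(age_bands, [5, 12, 30, 70])
print("falls through the system:", unclassified)
print("fits several boxes:", ambiguous)
```

In a real table, the item that falls through the system would typically end up in a residual "other" column, and the item that fits several boxes would silently be assigned to one of them; the check makes both decisions visible.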

Humanists are well accustomed to this kind of approach in general and to classifications and their consequences in particular; classification is, so to say, the administrative part of philosophy and, also, of disciplines like history or anthropology. Any humanist who has occupied himself with Saussurean structuralism or Derrida’s deconstruction knows that each term (and potential classifier) is inseparably linked to other terms: “the proper name was never possible except through its functioning within a classification and therefore within a system of differences,” Derrida writes in “Of Grammatology”.[2] Proper names, as indivisible units, thus form residual categories. In the language of data-as-tables, proper names would correspond to variables, and where they don’t form a variable, they would be sorted into a ‘garbage category’ – the infamous and ubiquitous “other” (“if it is not that and that and that, then it is – other”). Garbage categories are those columns where things get put that you do not know what to do with. But in principle, garbage categories are okay; they can signal uncertainty at the level of data collection. As Derrida’s work reminds us, we must be aware of exclusions, even if they may be explicit.

The most famous scholar who has occupied himself with implicit exclusions in historical perspective is certainly Michel Foucault. In examining the formative rules of powerful discourses as exclusion mechanisms, he analyzed how power constitutes the “other”, and how standards and classifications are suffused with traces of political and social work. In the chapter “The Birth of the Asylum” of his book “Madness and Civilization”,[3] for example, he describes how the categories of ‘normal’ and ‘deviant’, and classifications of forms of ‘deviance’, go hand in hand with forms of treatment and control. Something which is termed ‘normal’ is linked to modes of conduct, standards of action and behavior, and to judgments on what is acceptable and unacceptable. In his conception, the ‘other’ or ‘deviant’ is to be found outside of the discourse of power and therefore does not take part in the communication between the powerful and the ‘others’. Categories and classifications created by the powerful justify diverse forms of treatment by individuals in professional capacities, such as physicians, and in political offices, such as gaolers or legislators. The equivalent to the historical setting analyzed by Foucault can nowadays be found in classification systems like the International Classification of Diseases (ICD) or the Diagnostic and Statistical Manual (DSM). Doctors, epidemiologists, statisticians, and medical insurance companies work with these classification systems, and certainly there are equivalent ‘big data’ containing these classifications as variables. Not only are the ill excluded from taking part in the establishment of classifications, but also other medical cultures and their systems for classifying diseases. Traditional Chinese medicine is a prominent example here, and the historical (and later psychoanalytic) conception of hysteria had to undergo several major revisions until it found its present-day entries as “dissociation” (F44) and “Histrionic personality disorder” (F60.4) in the ICD-10. Here Foucault’s work reminds us of the fact that power structures are implicit and thus invisible. We are admonished to render classifications retrievable, and to include the public in policy participation.

A third example that may come to the humanist’s mind when thinking of classifications is the anthropologist Mary Douglas. In her influential work “Purity and Danger”, she outlines the inseparability of seemingly distinct categories. One of her examples is the relation of sacredness and pollution: “It is their nature [of religious entities, J.L.] always to be in danger of losing their distinctive and necessary character. The sacred needs to be continually hedged in with prohibitions. The sacred must always be treated as contagious because relations with it are bound to be expressed by rituals of separation and demarcation and by beliefs in the danger of crossing forbidden boundaries.”[4] Sacredness and pollution are thus in permanent tension with each other and create their distinctiveness out of a permanent process of exchange. This process undermines classification – with Douglas, classifiers have to run over into each other (as is the case with the 5-point Likert scale), or classification systems have to be conceived of as heterogeneous lists or parallel different lists. Douglas’ work thus reminds us of the need for the incorporation of ambiguity and of leaving certain terms open for multiple definitions.

Seen from a humanist’s point of view, big data classifications feign objectivity, insofar as they help us to forget their political, cultural, moral, or social origins, as well as their constructedness. It lies beyond the tasks of the KPLEX project to analyze big data classifications for their implications; but if anybody is aware of relevant studies, references are most welcome. What we all can do, however, is to constantly remind ourselves that there is no such thing as an unambiguous, uniform classification system implemented in big data.

[1] Geoffrey C. Bowker, Susan Leigh Star, Sorting Things Out. Classification and Its Consequences, Cambridge, MA / London, England: The MIT Press 1999, p. 10/11.

[2] Jacques Derrida, Of Grammatology. Translated by Gayatri Chakravorty Spivak, Baltimore and London: The Johns Hopkins University Press 1998, chapter “The Battle of Proper Names”, p. 120.

[3] Michel Foucault, Madness and Civilization: History of Insanity in the Age of Reason, translated by Richard Howard, New York: Vintage Books 1988 (first print 1965), pp. 244ff.

[4] Mary Douglas, Purity and Danger. An Analysis of the concepts of pollution and taboo, London and New York: Routledge 2001 (first print 1966), p. 22.