On digital oblivion

Knowledge is made by oblivion.
Sir Thomas Browne, in: Sir Thomas Browne’s Works: Including His Life and Correspondence, vol. 2, p. 177.

Remembrance, memory and oblivion have a peculiar relationship to each other. Remembrance is the internal realisation of the past, memory (in the sense of memorials, monuments and archives) its exteriorised form. Oblivion completes these two into a trinity in which memory and oblivion work as complementary modes of remembrance.

The formation of remembrance can be seen as an elementary function in the development of personal and cultural identity; oblivion, on the other hand, ‘befalls’, it happens, it is non-intentional and can therefore be seen as a threat to identity. Cultural heritage institutions – such as galleries, libraries, archives, and museums (GLAM) – are thus not only the places where objects are collected, preserved, and organized; they also form bodies of memory, invaluable for our collective identity.

There is a direct line in these cultural heritage institutions from analogue documentation to digital practice: Online catalogues and digitized finding aids present metadata in a publicly accessible way. But apart from huge collections of digitized books, the material in question is mostly not available in digital formats. This is why cultural heritage – and especially unique items like the material stored in archives – can be seen as ‘hidden data’. What can be accessed are metadata: the most formal description of what is available in cultural heritage institutions. This structure works towards oblivion in a two-fold way: On the one hand, the content of archives and museums is present and existing, but not in digital formats and thus ‘invisible’. On the other hand, the century-long practice of documentation in the form of catalogues and finding aids has been carried over into digital information architectures; but even though these metadata are accessible, they hide more than they reveal if the content they refer to is not available online. We all have to rely on the information given and ordered by cultural heritage institutions – their classifications, taxonomies, and ontologies – to be able to access our heritage and explore what has formed our identities. Is Thomas Browne right in pointing to the structured knowledge gained from oblivion?

This depends on our attitude to the past and to the formation of identity. It is possible to appreciate oblivion as a productive force. In antiquity, amnesty was the Siamese twin of amnesia; the word amnesty is derived from the Greek word ἀμνηστία (amnestia), or “forgetfulness, passing over”. Here it is oblivion which works for the generation of identity and unity: let bygones be bygones. In more recent times, it was Nietzsche who underlined the necessity of relieving oneself of the burdens of the past. It was Freud who identified the inability to forget as a mental disorder, earlier called melancholia, nowadays depression. And it was also Freud who introduced the differentiation between benign oblivion and malign repression.

But it is certainly not the task of GLAM-institutions to provide for oblivion. Their function is the provision of memory. Monuments, memorials, and the contents of archives serve a double function: They keep objects in memory; and at the same time this exteriorisation neutralizes and serves oblivion insofar as it relieves us of the affect of mourning. To erect monuments and memorials and to preserve the past in archives is in this sense a cultural technique for the elimination of meaning. To let go of what is no longer present by preserving what is available – in this relation the complementarity of memory and oblivion becomes visible; they don’t work against each other, but jointly. From this point of view, remembrance – the internal realization of the past – is the task of the individual.

Detail of the front of the Jewish Museum Berlin
By Stephan Herz (public domain), via Wikimedia Commons

It is not the individual who decides what should sink into oblivion. Societies and cultures decide, in ways still largely unexplored, which events, persons, or places (the lieux de mémoire) are kept in memory and which are not. If it is too big a task to change the documentation practices of GLAM-institutions, their information architecture, and the metadata they produce, the actual usage of archival content could provide an answer to the question of what is of interest to a society and what is not: The digital documentation of what is being searched for in online catalogues and digitized finding aids, as well as of which material is being ordered in archives, clearly indicates users’ preferences. Collected as data and analysed with algorithms, this documentation could teach us not only about what is being kept in memory; we could also learn about what falls into oblivion. And that is a kind of information historians rarely have at their disposal.

Promise and Paradox: Accessing Open Data in Archaeology

This article was published as: Huggett, Jeremy, ‘Promise and Paradox: Accessing Open Data in Archaeology’, in: Clare Mills, Michael Pidd and Esther Ward (eds.), Proceedings of the Digital Humanities Congress 2012. Studies in the Digital Humanities. Sheffield: HRI Online Publications, 2014.

Abstract:
Increasing access to open data raises challenges, amongst the most important of which is the need to understand the context of the data that are delivered to the screen. Data are situated, contingent, and incomplete: they have histories which relate to their origins, their purpose, and their modification. These histories have implications for the subsequent appropriate use of those data but are rarely available to the data consumer. This paper argues that just as data need metadata to make them discoverable, so they also need provenance metadata as a means of seeking to capture their underlying theory-laden, purpose-laden and process-laden character.

The article is available online at: https://www.hrionline.ac.uk/openbook/chapter/dhc2012-huggett

On the humanities and “Big Data”

Most humanists get a bit frightened when the term “big data” is mentioned in a conversation; they are often simply not aware of the big datasets provided by huge digital libraries like Google Books and Europeana, the USC Shoah Foundation Visual History Archive, or historical census and economic data, to name a few examples. Furthermore, there is a certain oblivion of the historical precursors of ‘digital methods’ and of the continuities between manual and computer-aided methods. Who in the humanities is aware, for example, of the fact that as early as 1851 a certain Augustus De Morgan brought forward the idea of identifying an author by the average length of his words?

The disconcertment in the humanities with regard to ‘big data’ is certainly understandable, since it touches the premises of the related disciplines and their long-practiced methodologies. Hermeneutics, for example, can be understood as a holistic penetration of its object through individual comprehension. It is an iterative, self-reflexive process which aims at interpretation. Humanists work on the exemplary and are not necessarily interested in generalization and exact quantities.

The hermeneutic circle

Big Data, by contrast, provide considerable quantities and point to larger structures, stratification, and the characteristics of larger groups rather than of the individual. But this seeming opposition does not mean that there are no bridges between humanities’ approaches and big data approaches. Humanists can just as well expand their research questions with respect to the ‘longue durée’ or to larger groups of people. To give an example from text studies: A humanist can develop her comprehension of single authors; her comprehension of the relationship between these single authors and the intellectual contexts they live in; and her comprehension of these intellectual contexts, i.e. the larger environment of individuals, the discursive structures prevailing at the time being researched, and the socio-economic conditions determining these structures.

That is not to say that humanities and big data could easily blend into each other. A technique like ‘predictive analysis’, for example, is more or less alien to humanistic research. And there is the critical reservation of the humanities with respect to big data: No computing will ever replace the estimation and interpretation of the humanist.

It is therefore both a reminder and an appeal to humanists to bring the ‘human’ back to big data: The critical perspective of the humanities is able to compensate for the loss of reality which big data inevitably entail; to situate big data regimes in time and space; to critically appraise the categories of supposedly neutral and transparent datasets and tools, which obscure their relationship to human beings; and to reveal the issues of trust, privacy, dignity, and consent inextricably connected to big data.

The strengths of the humanities – the hermeneutic circle, precise observation of details, complex syntheses, long chains of argumentation, the ability to handle alterity, and the creation of meaning – are capacities which have yet to be exploited in the handling of big data.

Has anyone ever analyzed big data classifications for their political or cultural implications?

If we think of big data, we don’t think of unstructured text data. Mostly we don’t even think of (weakly) structured data like XML – we conceive of big data as being organized in tables, with columns (variables) and rows (observations); lists of numbers with labels attached. But where do the variables come from? Variables are classifiers; and a classification, as Geoffrey Bowker and Susan Leigh Star have put it in their inspiring book “Sorting Things Out”, is “a spatial, temporal, or spatio-temporal segmentation of the world. A ‘classification system’ is a set of boxes (metaphorical or literal) into which things can be put to then do some kind of work – bureaucratic or knowledge production. In an abstract, ideal sense, a classification system exhibits the following properties: 1. There are consistent unique classificatory principles in operation. […] 2. The categories are mutually exclusive. […] 3. The system is complete.”[1] Bowker and Star describe classification systems as invisible, erased by their naturalization into the routines of life; in them, conflict, contradiction, and multiplicity are often buried beneath layers of obscure representations.

Humanists are well accustomed to this kind of approach in general and to classifications and their consequences in particular; classification is, so to speak, the administrative part of philosophy and also of disciplines like history or anthropology. Any humanist who has occupied himself with Saussurean structuralism or Derrida’s deconstruction knows that each term (and potential classifier) is inseparably linked to other terms: “the proper name was never possible except through its functioning within a classification and therefore within a system of differences,” Derrida writes in “Of Grammatology”.[2] Proper names, as indivisible units, thus form residual categories. In the language of data-as-tables, proper names would correspond to variables, and where they don’t form a variable, they would be sorted into a ‘garbage category’ – the infamous and ubiquitous “other” (“if it is not that and that and that, then it is – other”). Garbage categories are those columns where things get put when you do not know what to do with them. But in principle, garbage categories are fine; they can signal uncertainty at the level of data collection. As Derrida’s work reminds us, we must be aware of exclusions, even when they are explicit.
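To make the table metaphor concrete, here is a minimal sketch in Python (with invented records and category labels) of how such a ‘garbage category’ comes about in practice:

import pandas as pd

# Invented records and category labels, for illustration only.
known_categories = ["archive", "library", "museum", "gallery"]
records = pd.Series([
    "archive", "museum", "private collection", "library",
    "church treasury", "gallery", "archive",
])

# Whatever does not fit the classification is swept into "other" --
# the garbage category that signals uncertainty at the point of data collection.
classified = records.where(records.isin(known_categories), other="other")
print(classified.value_counts())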

The most famous scholar to have occupied himself with implicit exclusions in historical perspective is certainly Michel Foucault. In examining the formative rules of powerful discourses as exclusion mechanisms, he analyzed how power constitutes the “other”, and how standards and classifications are suffused with traces of political and social work. In the chapter “The Birth of the Asylum” of his book “Madness and Civilization”,[3] for example, he describes how the categories of ‘normal’ and ‘deviant’, and classifications of forms of ‘deviance’, go hand in hand with forms of treatment and control. Something termed ‘normal’ is linked to modes of conduct, standards of action and behavior, and to judgments on what is acceptable and unacceptable. In his conception, the ‘other’ or ‘deviant’ is to be found outside the discourse of power and therefore takes no part in the communication between the powerful and the ‘others’. Categories and classifications created by the powerful justify diverse forms of treatment by individuals in professional capacities, such as physicians, and in political offices, such as gaolers or legislators. The equivalent of the historical setting analyzed by Foucault can nowadays be found in classification systems like the International Classification of Diseases (ICD) or the Diagnostic and Statistical Manual (DSM). Doctors, epidemiologists, statisticians, and medical insurance companies work with these classification systems, and certainly there are equivalent ‘big data’ containing these classifications as variables. Not only are the ill excluded from taking part in the establishment of classifications, but so are other medical cultures and their systems for classifying diseases. Traditional Chinese medicine is a prominent example here, and the historical (and later psychoanalytic) conception of hysteria had to undergo several major revisions before it found its present entries as “dissociative disorders” (F44) and “histrionic personality disorder” (F60.4) in the ICD-10. Here Foucault’s work reminds us that power structures are implicit and thus invisible. We are admonished to render classifications retrievable, and to include the public in policy participation.

A third example that may come to the humanist’s mind when thinking of classifications is the anthropologist Mary Douglas. In her influential work “Purity and Danger”, she outlines the inseparability of seemingly distinct categories. One of her examples is the relation of sacredness and pollution: “It is their nature [of religious entities, J.L.] always to be in danger of losing their distinctive and necessary character. The sacred needs to be continually hedged in with prohibitions. The sacred must always be treated as contagious because relations with it are bound to be expressed by rituals of separation and demarcation and by beliefs in the danger of crossing forbidden boundaries.”[4] Sacredness and pollution are thus in permanent tension with each other and create their distinctiveness out of a permanent process of exchange. This process undermines classification – with Douglas, classifiers have to run over into each other (as is the case with the 5-point Likert scale), or classification systems have to be conceived of as heterogeneous lists or as parallel, differing lists. Douglas’ work thus reminds us of the need to incorporate ambiguity and to leave certain terms open for multiple definitions.
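A small sketch (again in Python, with hypothetical labels) of the difference Douglas points to: mutually exclusive categories versus categories that may overlap or remain empty.

# Hypothetical labels, for illustration only.
mutually_exclusive = {
    "relic":  "sacred",
    "refuse": "polluted",
}

overlapping = {
    "relic":     {"sacred", "polluted"},  # Douglas: the sacred is treated as contagious
    "refuse":    {"polluted"},
    "threshold": set(),                   # fits no category -- left open
}

for item, label in mutually_exclusive.items():
    print("exclusive:  ", item, "->", label)
for item, labels in overlapping.items():
    print("overlapping:", item, "->", labels or "unclassified")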

Seen from a humanist’s point of view, big data classifications feign an objectivity they do not possess, insofar as they help us forget their political, cultural, moral, or social origins, as well as their constructedness. It lies beyond the tasks of the KPLEX project to analyze big data classifications for their implications; but if anybody is aware of relevant studies, references are most welcome. What we all can do, however, is constantly remind ourselves that there is no such thing as an unambiguous, uniform classification system implemented in big data.

[1] Geoffrey C. Bowker, Susan Leigh Star, Sorting Things Out. Classification and Its Consequences, Cambridge, MA / London, England: The MIT Press 1999, pp. 10-11.

[2] Jacques Derrida, Of Grammatology. Translated by Gayatri Chakravorty Spivak, Baltimore and London: The Johns Hopkins University Press 1998, chapter “The Battle of Proper Names”, p. 120.

[3] Michel Foucault, Madness and Civilization: A History of Insanity in the Age of Reason, translated by Richard Howard, New York: Vintage Books 1988 (first print 1965), pp. 244ff.

[4] Mary Douglas, Purity and Danger. An Analysis of the concepts of pollution and taboo, London and New York: Routledge 2001 (first print 1966), p. 22.

Data and their visualization

When did data enter mankind’s history? If you asked an archaeologist or a historian, he might answer: About 35,000 years ago, with cave paintings. The reason why they are held to be data points to their symbolic dimension: Cave paintings may not be mere representations of what the people who produced them had observed, but may have been designed for religious or ceremonial purposes. Depictions of men dancing around a slain animal can thus be interpreted not only as human beings performing a spontaneous dance of victory; they can also be seen as illustrations of rituals and cosmologies. Sacred knowledge written on the ceilings of places not inhabited by men.

Statisticians (and data scientists) see these things differently. For them, data are clearly bound to abstract signs. In spite of the much older presence of hieroglyphs – this peculiar Egyptian mixture of pictograms and abstractions – they would point to Phoenician letters providing recipes for beer, with numbers given for amounts of barley; or to Sumerian calculations scratched into wet clay, subsequently fired and preserved as tablets. Knowledge about intoxicating beverages and bargaining. Data, in this understanding, are connected to numerical values and not to the symbolic values which make up the surplus of the cave paintings. Moreover: While the figurative drawings found in caves are visualizations in themselves, a ‘modern’ understanding of data points to maps as the beginning of data visualization (400 BC), to star charts (150 BC), and to a mechanical knowledge diagram drawn in Spain in 1305. Knowledge about the world and about knowledge itself. The Canadian researcher Michael Friendly, who has charted the history of data visualization in his Milestones project, sees Michael Florent van Langren’s 1644 representation of longitudes as the first (known) statistical graph.

Michael Florent van Langren’s 1644 graph of longitude determinations

We see here not only differing conceptions of data, but also of what a visualization of data might be. And if we follow the traces laid out by those writing on the history of data visualization (like Michael Friendly, Edward Tufte or Daniel Rosenberg), we soon note that there seems to be no straightforward evolution in data visualization. On the one hand, data visualization depended on the possibilities painters and printers could offer data specialists; on the other hand, the development of abstract thinking took its own routes. The indices and registers found as annexes in printed books form the data basis for networks of persons or concepts, but their visualization came later. Visualizations became more common in the 19th century (John Snow’s map of cholera-contaminated London is one of the more popular examples here), and data visualization was taken out of the hands of the experts (who paid the graphic designers and could afford costly printing) only in the 20th century, with the mass distribution of Excel and PowerPoint. A new phase started with the Internet providing visualization libraries like D3.js or p5.js. But these different phases also point to what Foucault wrote in his “Archéologie du savoir”: when history shifted from one view to another, the past became virtually incomprehensible to the present. Visualizations become outdated and are no longer easy to understand. Thus, returning to our starting point: Are cave paintings not simply different visualizations of data and knowledge than Phoenician beer recipes? Is it a failure of the different disciplines not to be able to reach an agreement about what terms like ‘knowledge’ and ‘data’ mean? Doesn’t our conception of ‘knowledge’ and ‘data’ determine what is being termed ‘visualization’?

How can use be made of NAs?

In elementary school, the little data scientist learns that NAs are nasty. Numerical data are nice, clean, and complete when collected by a sensor. NAs, on the other hand, are the kind of data that result from manual data entry or improperly filled-in surveys; or they are missing values which could not be gathered for unknown reasons. If they were NULLs, the case would be much clearer – with a NULL, you can calculate. But NA, that is really terra incognita, and because of this the data have to be checked for bias and skewness. NA thus becomes, for the little data scientist, an expression of desperation: not applicable, not available, or no answer.
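A minimal sketch in Python/pandas, on invented survey data, of what this first lesson amounts to in practice: counting the NAs per variable and per respondent before calculating anything.

import numpy as np
import pandas as pd

# Invented survey fragment: manual data entry leaves holes.
survey = pd.DataFrame({
    "age":     [34, 57, np.nan, 23, 71],
    "emotion": ["anger", np.nan, np.nan, "hope", "fear"],
})

print(survey.isna().sum())        # missing values per variable
print(survey.isna().sum(axis=1))  # missing values per respondent

# NA is not a number you can calculate with; pandas simply skips it.
print(survey["age"].mean())       # mean of the observed ages only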

For humanists, the case is easier. Missing values are part of their data life. Only rarely do machines support the process of data collection, and experience shows that people, when asked, rarely respond in a complete way; leaving out part of the answer is an all-too-human strategy. Working with this kind of data, researchers have become creative; one can even ask whether there is a pattern behind the NAs, or whether a systematic explanation for them can be found. In a recently published reader on methodologies in emotion research, there is a contribution on how to deal with missing values. The researcher, Dunya van Troost, collected data from persons committed to political protest all over Europe; the survey contained, amongst others, four items on emotions. By counting the number of missing values across the four emotion items, three groups could be identified: respondents with complete records (85 percent), respondents who answered selectively (13 percent), and respondents who answered none of the four emotion items (2 percent). For inferential statistics, it is important to understand the reasons why certain data are missing. Van Troost turned the tables, in a way: she conducted a regression analysis to see how the other data available, here both country and demographic characteristics, influenced the number of missing values provided by her 15,272 respondents. She found that the unanswered items were not missing completely at random. The regression further showed that the generation of protesters born before 1945 had missing values on the emotion items more frequently than younger generations. The same applies to the level of education – the higher the level of education, the more completely the surveys were filled in. Female respondents had more missing values than male ones. Finally, respondents from the southern European countries Spain and Italy had a relatively high rate of missing values.

It is obvious that not all people will answer items on emotions with the same willingness; and it is also plausible that the differences in willingness are related to age, gender, and cultural socialization. It would be interesting to have a longitudinal study on this topic, to validate and cross-check van Troost’s findings. But this example illuminates a handling of information in the social sciences different from that of information science: even NAs can be useful.
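For illustration only, here is a sketch in Python of this kind of analysis; the invented data, the column names, and the choice of an ordinary least squares model are assumptions, not van Troost’s actual data or code.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-in for the survey: four emotion items plus demographics.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "anger": rng.choice([1.0, 2.0, 3.0, np.nan], size=n),
    "fear":  rng.choice([1.0, 2.0, 3.0, np.nan], size=n),
    "hope":  rng.choice([1.0, 2.0, 3.0, np.nan], size=n),
    "pride": rng.choice([1.0, 2.0, 3.0, np.nan], size=n),
    "birth_cohort": rng.choice(["pre-1945", "1945-1970", "post-1970"], size=n),
    "gender": rng.choice(["female", "male"], size=n),
    "education": rng.integers(1, 8, size=n),
    "country": rng.choice(["ES", "IT", "NL", "SE"], size=n),
})

# Outcome: how many of the four emotion items a respondent left blank.
emotion_items = ["anger", "fear", "hope", "pride"]
df["n_missing"] = df[emotion_items].isna().sum(axis=1)

# Do cohort, gender, education and country predict missingness?
model = smf.ols("n_missing ~ C(birth_cohort) + C(gender) + education + C(country)",
                data=df).fit()
print(model.summary())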

Dunya Van Troost, Missing values: surveying protest emotions, in: Helena Flam, Jochen Kleres (eds.), Methods of Exploring Emotions, London / New York: Routledge 2015, pp. 294-305.

Ethics and epistemics in biomedical big data research

This recent article explores issues in big data approaches with regard to

a) ethics – such as obtaining consent from large numbers of research participants across a large number of institutions; protecting confidentiality; privacy concerns; optimal methods for de-identification; and the limited capacity of the public, and even of experts, to interpret and question research findings;

and b) epistemics – such as personalized (or precision) treatments that rely on extending concepts which have largely failed or have very high error rates; deficiencies of observational studies that do not disappear with big data; challenges of big data approaches due to their overpowered analysis settings; minor noise, due to errors or low-quality information, being easily translated into false signals; and problems with the view that big data are somehow “objective”, including that this view obscures the fact that all research questions, methods, and interpretations are value-laden.

The article closes with a list of recommendations which consider the tight links between epistemology and ethics in relation to big data in biomedical research.

Wendy Lipworth, Paul H. Mason, Ian Kerridge, John P. A. Ioannidis, Ethics and Epistemology in Big Data Research, in: Journal of Bioethical Inquiry 2017, DOI 10.1007/s11673-017-9771-3 [Epub ahead of print].

Beyond a binary conception of data

An awful lot of data exists in tables, the columns (variables) and rows (observations) filled with numerical values. Numerical values are always binary: Either they have a certain value or they do not (in the latter case they are NAs or NULLs). Computability is based on information stored in a binary fashion, coded as either 0 or 1. Statistics is based on this binary logic, even where nominal data are in use. Nominal variables such as gender, race, color, and profession can be measured only in terms of whether the individual items belong to certain distinctively different categories. More precisely: They either belong to these categories or they do not.
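A short sketch in Python/pandas (with invented data) of how a nominal variable is pressed into this binary logic: each category becomes a column to which an observation either belongs (1) or does not (0).

import pandas as pd

# Invented data: a nominal variable, one observation per row.
people = pd.DataFrame({"profession": ["archivist", "curator", "librarian", "curator"]})

# Each category becomes its own 0/1 column: an observation belongs to it or not.
binary = pd.get_dummies(people["profession"], dtype=int)
print(binary)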

A large amount of data, especially on the Internet, consists of unstructured text data (emails, WhatsApp messages, WordPress posts, tweets, etc.). Can texts – or other cultural data like images, works of art, or music – be adapted to a binary logic? In principle yes, as the examples of nominal variables show: either a word (or an image …) belongs to a certain category or it does not. Quite a good part of Western thinking follows a binary logic; classical structuralism, for example, is fond of structuring oppositions: good – bad, pure – dirty, raw – cooked, orthodox – heretic, avantgarde – arrièregarde, up-to-date – old-fashioned, etc. The point here is that one has to be careful about which domain these binary oppositions belong to: Good – bad is not the same as good – evil.

But the habit of thinking in binary terms narrows the perspective; using data according to a binary logic means a reduction. This is particularly evident with respect to texts: The meaning of individual words changes with their context. Another example: A smile can be an expression of sympathy or of uncertainty in the Western world, while in other cultures it may signal aggression, confusion, sadness, or social distancing from the other. What looks like a ‘smile’ in monkeys is most often an expression of fear – a baring of the canines.

Indian logic provides an example of how to go beyond a binary logic, in a figure called the “tetralemma”. While binary systems are based on calculations with 0 and 1 and therefore formulate a dilemma, the tetralemma provides four possible answers to any logical proposition: Beyond the logic of 0 and 1, there is both 0 and 1, and neither 0 nor 1. One can even conceive of a fifth answer: none of these. Put as a graph and expressed mathematically, the tetralemma would look like this:

The tetralemma

The word “dawn” is an example of what is at stake in the tetralemma: Depending on how you define its meaning, it can be a category of its own; it can fit into no category (because of its ambivalent character); it can be both day and night; and it can be neither sunshine nor darkness.
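Purely as an illustration, a minimal Python sketch of how the four (or five) answers of the tetralemma could be represented in code, in contrast to a plain Boolean:

from enum import Enum

class Tetralemma(Enum):
    IS = "it is"                          # 1
    IS_NOT = "it is not"                  # 0
    BOTH = "it both is and is not"        # both 0 and 1
    NEITHER = "it neither is nor is not"  # neither 0 nor 1
    NONE_OF_THESE = "none of these"       # the optional fifth answer

# "Is dawn day?" -- a Boolean forces a dilemma; the tetralemma does not.
answer = Tetralemma.BOTH   # dawn is both day and night
print(answer.value)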

One of the few philosophers to point to the narrowing of logic to binary oppositions (TRUE / FALSE) and to underline the many possibilities of language games was Jean-François Lyotard, in his main work “Le Différend” (Paris: Minuit 1983). In information science, it is only more recently that complex approaches have been developed beyond binary systems which allow for an adequate coding of culture, emotions, or human communication. The best examples are ontologies; they can be understood as networks of entities in which the relations between these entities can be defined in multiple ways. A person can at the same time be a colleague in a team, a partner in a company, and the father of another person working in the same company (a visualization of the “friend of a friend” ontology can be found here). The datafication of human signs, be they linguistic, artistic, or part of the historical record, therefore exposes the challenges of data production in a particularly evident way.
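A small sketch in Python (names and relations invented) of this idea of an ontology as a network in which the same entities are linked by several differently defined relations:

# Invented names and relations, illustrating the structure only.
triples = [
    ("Anna",  "colleague_of", "Ben"),
    ("Anna",  "partner_in",   "Acme Ltd."),
    ("Anna",  "parent_of",    "Clara"),
    ("Clara", "works_for",    "Acme Ltd."),
]

# The same entity stands in several differently defined relations at once.
for subject, relation, obj in triples:
    if subject == "Anna":
        print(f"Anna --{relation}--> {obj}")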

Is there an identity crisis of statistics?

It is not without irony that statistics currently seems to be living through an identity crisis, since this discipline is often called the “science of uncertainty”. If there is an identity crisis at all, in what way can it be conceived of? And why did this crisis come now – is there a nexus between the rise of ‘big data’ and algorithms, and the development of the discipline? Three developments can be identified that make the hypothesis of an identity crisis of statistics more tangible.

  • A crisis of legitimacy: Statistics’ findings are supposed to result from value-free and transparent methods; but this is nothing (and maybe never was) that can easily be communicated to the broad public. Hence, in the age of ‘big data’, a loss of legitimacy: the more complex the collection of data, the statistical methods, and the presented results are, the more disorienting an average person may find them, especially if there is a steep increase in the sheer quantity of statistical findings. Even for citizens who are willing to occupy themselves with data, statistics, and probability, Nobel prize laureate Daniel Kahneman has underlined the counterintuitive and intellectually challenging character of statistics (see his book “Thinking, Fast and Slow” or the classic article “Judgment under Uncertainty”). These peculiarities lower the general public’s trust in the discipline: “Whom should I believe?”
  • A crisis of expertise: Statistics has become part of a broad range of scientific disciplines, far beyond mathematics. But the acquisition of competence in statistics quite obviously has its limits. As Gerd Gigerenzer pointed out already 13 years ago, “mindless statistics” has become a custom and ritual in sciences such as psychology. In recent years, this crisis of expertise has been termed the crisis of reproducibility (for data from a previous publication) or replicability (for data from an experiment); the renowned journal “Nature” devoted an article series to this problem in 2014, focusing, for example, on the use of p-values in scientific arguments (a minimal simulation of that problem follows after this list). The report of the 2013 London Workshop on the Future of the Statistical Sciences is outspoken on this problem, and there is even a Wikipedia article on the crisis of replicability. Statisticians defend themselves by pointing to these scientists’ lack of training in statistics and computation [see Jeff Leek’s recent article here], but quite obviously this crisis of expertise undermines the credibility of scientists as experts.
  • A crisis of the societal function of the discipline: Statistics as a scientific discipline established itself alongside the rise of nation-states; hence its close connection to national economies and to data collected across large populations. As explained in a “Guardian” article posted earlier on this blog, statistics served as the basis of “evidence-based policy”, and statisticians were seen by politicians as the caste of scientific pundits and consultants. But this has changed completely: Nowadays big data are the assets of globalised companies which act across the borders of nation-states. This points to a shift in the core societal function of statistics, no longer serving politics and hence the nation, but global companies and their interests: Statistics leaves representative democracy behind, and it has become unclear how the benefits of digital analytics might ever be offered to the public. Even if the case is still obscure, the possible role of “Cambridge Analytica” in the U.S. presidential election campaign shows that the privatisation of expertise can be turned against the interests of a nation’s citizens.
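As an aside to the crisis of expertise described above, here is a minimal simulation in Python of what ‘mindless’ significance testing produces: even on pure noise, roughly five percent of tests come out ‘significant’ at p < 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_obs = 1000, 30

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=n_obs)  # two samples drawn from the SAME distribution
    b = rng.normal(size=n_obs)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests came out 'significant' on pure noise")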