Blog

ACDH LECTURE 4.1- What can Big Data Research Learn from the Humanities?

Jennifer Edmond
Director of the Trinity College Dublin Centre for Digital Humanities and Principal Investigator on the KPLEX Project

One of the major terminological forces driving ICT development today is that of ‘big data.’ While the phrase may sound inclusive and integrative, in fact, ‘big data’ approaches are highly selective, excluding, as they do, any input that cannot be effectively structured, represented, or, indeed, digitised. Data of this messy, dirty sort is precisely the kind that humanities and cultural researchers deal with best, however.  In particular, knowledge creation and information management approaches from the humanities shed light on gaps such as: the manner in which data that are not digitised or shared become ‘hidden’ from aggregation systems; the fact that data are human created, and lack the objectivity often ascribed to the term; and the subtle ways in which data that are complex almost always become simplified before they can be aggregated. Humanities insight also exposes the problematic discursive strategies that big data research deploys, strategies that can be seen reflected not only in the research outputs of the field, but also in many of the urgent challenges our digitised society faces.

The lecture is available to view here: https://www.youtube.com/watch?v=E2vdFBo9wB4

The Future of History

If you go to an archive today and look for a personal heritage, what would you expect? Notebooks, letters, photographs, calendars, the documentation of printed publications, drafts of articles or books with hand-written comments, and the like. But how about born-digital contents? In the best case, you will find a backup of the hard disk drive of the person you are researching. The letters of earlier times have changed to emails, SMS and WhatsApp messages, Facebook entries, Twitter posts; publications may have changed to online articles, PDF files, and blog contributions distributed all over the net. And that’s the point: what may once have been part of somebody’s personal heritage in paper format may nowadays have become part of Big Data. Yes, Big Data: they do not consist only of incredibly large tables, with variables and columns filled with numbers; a good part of Big Data consists simply of text files with social media contents (Facebook, Twitter, blogs, and so on).

At first glance, this sounds astonishing. The ‘private’ character of personal heritage seems to have vanished, the proportion of content available in the public sphere seems to have grown. It seems. But this is not surprising; we are reminded of one of the most influential studies on this topic, Jürgen Habermas’ “Structural Transformation of the Public Sphere”. What Habermas analyses here are the constant changes and shifts of the border between private and public. His examination starts in the late 18th and early 19th century, with the formation of an ideal type of bourgeois discourse marked by what Habermas calls “Räsonnement”. This reasoning aims at arguing, but also, in its pejorative form, at grumbling. The study begins with bourgeois members of the public meeting in salons, coffee houses, and literary round tables, pursuing reasoned exchange by contributing to journals, and practising subjectivity, individualism, and sentimentalism by writing letters and diaries destined either to be published (think of Gellert, Gleim, and Goethe) or to become part of personal heritage. Habermas draws long lines into the 20th century, where his book ends with the opposition between public and private characteristic of that time: employment is part of the public space, while leisure time is dedicated to private activities; letters and reading have become much less important, only the high bourgeoisie keeps its own libraries; mass media enhance the passivity of consumerism. This can also be read from personal heritages: the functional differentiation of a modern society created presumed experts for Räsonnement, like journalists, politicians, and publicists, who deliver opinion formation as a service, while editors and scientists professionalise the critique of politics. Habermas is openly critical of the mass media and their potential for manipulation, since they reduce citizens to recipients without agency.

The last edition of Habermas’ book was printed in 1990. Since that time, a lot has changed, especially with the emergence of the internet. The border between public and private has been moved, and the societal-political commitment of citizens has changed. Social media grant citizens incredible agency and empower them. Hyperdigital hipsters are working in cafés, co-working spaces or start-ups, without having the private leisure time characteristic of the 20th century. Digital media network people across large spaces and form new transnational collectives. Anthropologist Arjun Appadurai has spoken of “diasporic public spheres” in this respect – small groups of people discussing face to face in pubs have been transformed into “communities of sentiment” grumbling at politics. Formerly silent recipients have mutated into counter publics, the sentimental bourgeois has become an enraged citizen. Habermas wouldn’t have liked this development, since his ideal type of Räsonnement doesn’t fit with current realities, and what he overlooked – the existence of large parts of society consisting of people who don’t participate in mass media discourses because they don’t want to – nowadays informs, for example, right-wing populism.

The Facebook network as a new public sphere

This latest transformation of the public sphere has consequences for archivists as well as historians. Archivists should regard social media contents as part of personal heritages and thus have to struggle with data management and storage problems. Historians (at least historians of the future) have to become familiar with quantitative analysis in order to, for example, examine Twitter networks and determine the impact of the Alt-Right movement on the presidential election in the U.S. Born-digital contents can therefore be seen as valuable parts of personal heritage. And from this point of view, there is certainly a lot that historians can contribute to discussions on Big Data.

 

Jürgen Habermas, The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity 1989.

Arjun Appadurai, Modernity At Large: Cultural Dimensions of Globalization. Minneapolis: University of Minnesota Press 1996.

 

What the stories around data tell us

The strange thing about data is that they don’t speak for themselves. They need to be embedded into an interpretation to become palatable and understandable. This interpretation may be an analytic account like the narrative synthesis told after performing a regression. It may also be a story in a more conventional sense, something like a success story of conquest, mastery, submission, or revelation of the kind that the usual storytelling in marketing produces.

The funny thing about data and stories is that it is easier to create a story out of data than to extract data out of a story. It is easy to conceive of narratives built on top of data. Companies like Narrative Science create market reports or sports reports automatically out of the data they receive on a daily basis. On the other hand, it is difficult to imagine extracting statistical data about a soccer game out of the up-to-the-minute scores of the same game presented on a website.

But data form a peculiar basis for stories. Think of the data which are collected when you visit a website – a typical basis of Big Data. These websites collect data on where you go, where you click, how long you stay there and so on; typically, they are behavioural data. What data scientists can get out of them are correlations; these data don’t allow us to grasp the causal mechanisms behind the observed behaviour. They don’t show the whole person behind the behaviour; thus they cannot tell, for example, what customers feel during their visit to the website, why they reacted as they did – or where the visitors see value in the offers they are presented with. It is thus a reduction of perspective which comes along with stories based on data, a reduction which perhaps can’t be avoided, since data are in themselves a product of a process of estrangement typical of capitalism. The narrative might attenuate or conceal the limitations of the data, but it will not be able to reach far beyond the restrictions imposed.

But there is more to data presented in narratives than a mere reduction of perspective: unlike data collected in scientific disciplines like psychology and anthropology, which might enable representative statements about population groups, the results of Big Data analyses bring about a shift in perspective. By performing classifications, grouping people according to their preferences, assessing the creditworthiness of customers, etc., Big Data allow human beings to be viewed from the perspective of a market. And in their ability to shape the offers presented on a website in real time and adapt the pricing mechanisms according to the IP address from which the website is being accessed, and in their ability to build up systems of gratification rewarding those actions of users which are seen as opportune by the infrastructure, data grant a point of view onto customers which further strengthens commodification and economic governance. The fact that this point of view is equivalent to the perspective of the market becomes especially visible in the narratives accompanying these data.

Identities and Big Data

One of the amenities of Big Data (and the neural nets used to mine them) is their potential to reveal patterns of which we were not aware before. “Identity” in Big Data might be a variable contained in the dataset. This implies classification, and a feature which cannot easily be determined (like some cis-, trans-, inter-, and alia-identities) might have been put into the ‘garbage’ category of the “other”. Or identity might arise from a combination of features that was unknown beforehand. This was the case in a study which claimed that neural networks are able to detect sexual orientation from facial images. The claim did not go unanswered; a recent examination of this study by Google researchers showed that differences in culture, rather than facial structures, were the features responsible for the result. Features that can easily be changed – like makeup, eyeshadow, facial hair, or glasses – had thus been underestimated by the authors of the study.

The debate between these data analysts exposes insights well known to humanists and social scientists. Identities differ with context; depending on the situation in which she or he is, a person may say “I am a mother of three children”, “I am a vegan”, or “I am a Muslim”. In fields marked by strong tensions and polarizations, identity statements can come close to confessions, and it might be wise to deliberate carefully about whether it is opportune to provide the one or the other answer: “I am British”, “I am a European”.

It is not without irony that it is easy to list several identities that have been important throughout the past few centuries. Clan, tribe, nation, heritage, sect, kinship, blood, race – this is the typical stuff ethnographers and historians work on. Beyond the ones just named, identities like family, religion, culture, and gender are currently intensely debated in our postmodern, globalised world. Think of the discussions about cultural identities in a world characterized by migration; and think of gender identity as one of the examples which only recently has split itself into new forms and created a new vocabulary that tries to grasp the new changeableness: genderfluid, cisgender, transgender, agender, genderqueer, non-binary, two-spirit, etc.

Androgynous portrayal of Akhenaten, the first ruler to introduce monotheism in Ancient Egypt

It is obvious that identity is not a stable category. This is the irony of identity – the promise of an identical self dissolves itself in time and space, and any attempt to isolate, fix and homogenise an identity is doomed to failure. In postmodernity, identities are rather constructed out of interactions with the social environment – through constant negotiations, distinction and recognition, exclusion and belonging. Mutations and transformations are the expression of the tensions between vital elements that characterize identities. The path from Descartes’ “Cogito, ergo sum” to Rimbaud’s “I is another” is one of the best-known examples of such a transformative process.

Humanists and social scientists are experts in providing thick descriptions and large contexts in which identities can be located. They are used to relating to each other the different resources which feed into identities, and they are capable of building the contexts in which the relevant features and the relationships between them are embedded. In view of these potentials it is astonishing that citizens with such an academic background do not speak with confidence of the power of their methods and join Big Data debates about such elusive concepts as “identities”.

Data. A really short history

“One man’s noise is another man’s signal.” Edward Ng

 

Data are “givens”. They are artefacts. They don’t pre-exist in the world, but come to the world because we, the human beings, want them to do so. If you were to imagine an intimate relationship between humans and data, you would say that data do not exist independently of humans. Hunters and gatherers collect the discrete objects of their desire, put them in order, administer them, store them, use them, throw them away, forget about where they were stored and at times uncover those forgotten places again, for example during a blithe snowball fight. At that time data were clearly divided from non-data – edible things were separated from inedible ones, helpful items from uninteresting ones. This selectivity may have benefited from the dependency on shifting availability of resources in a wider space.

Later on, when mankind left the forests, things became much more complicated. With the advent of agriculture and especially of trade, data became more meaningful. Furthermore, rulers became interested in knowledge about their sheep and therefore asked some of their accountants to keep records of their subordinates – the oldest census dates back to 700 B.C. When mathematics became a bit more complicated, from the 17th (probability) and 19th (statistics) centuries onwards, data about people, agriculture, and trade began to pile up and it became more and more difficult to distinguish between relevant data (“signal”) and irrelevant ones (“noise”). The distinction between the two simply became a matter of the question with which the data at hand were consulted.

With the advent of the industrial age, the concept of mechanical objectivity was introduced, and the task of data creation was delegated to machines which were constructed to collect the items in which humans were interested. Now data were available in huge amounts, and the need for organizing and ordering them became even more pressing. It is here that powerful schemes came into force: selection processes, categorizations, classifications, standards; variables prioritized as signal over others reduced to noise, thus creating systems of measurement and normativity intended to govern societies. They have been insightfully investigated in the book “Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life.”*

It was only later, at the beginning of the post-industrial age, that an alternative to this scheme- and signal-oriented approach was developed by simply piling up everything that may be of any interest, a task also often delegated to machines because of their patient effortlessness. The agglomeration of masses presupposes that storing is not a problem, neither in spatial nor in temporal terms. The result of such an approach is nowadays called “Big Data” – the accumulation of masses of (mostly observational) data for no specific purpose. Collecting noise in the hope of signal, without defining what is noise and what is signal. Fabricating incredibly large haystacks and assuming there are needles in them. Data as the output of a completely economised process with its classic capitalistic division of labour, including the alienation from their sources.

What is termed “culture” often resembles these haystacks. Archives are haystacks. The German Historical Museum in Berlin is located in the “Zeughaus”, literally the “house of stuff”,** stuffed with the remnants of history, a hayloft of history. Libraries are haystacks as well; if you are not bound to the indexes and registers called catalogues, if you are free to wander around the shelves of books and pick them out at random, you might get lost in an ocean of thoughts, in countless imaginary worlds, in intellectual universes. This is the fun of the explorers, conquerors and researchers: Never get bored through routine, always discover something new which feeds your curiosity. And it is here, within this flurry of noise and signal, within the richness of multisensory, kinetic and synthetic modes of access, where it becomes tangible that in culture noise and signal cannot be thought without the environment out of which they were taken, that both are the product of human endeavours, and that data are artefacts that cannot be understood without the context in which they were created.

 

*Martha Lampland, Susan Leigh Star (Eds.), Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life. Ithaca: Cornell Univ. Press 2009.
** And yes, it should be translated correctly as “armoury”.

Dr Jennifer Edmond on RTÉ Brainstorm

A piece by Dr. Jennifer Edmond, Principal Investigator of the KPLEX project, Director of Strategic Projects in the Faculty of Arts, Humanities and Social Sciences at Trinity College Dublin and co-director of the Trinity Centre for Digital Humanities, has been published on the RTÉ Brainstorm website. In this insightful piece, Dr. Edmond addresses some of the challenges of talking about Big Data, an important theme currently being explored by the KPLEX project team.

You can read the full article here: https://www.rte.ie/eile/brainstorm/2018/0110/932290-the-problem-with-talking-about-big-data/

Big Data and Biases

Big Data – these are per se data which are too big to be inspected by humans. Their bigness has consequences: they are so large that the typical applications used to store, compute and analyse data are inappropriate. Processing them is often a challenge for a single computer; thus, a cluster of computers has to be used in parallel. Alternatively, the amount of data has to be reduced by mapping an unstructured dataset onto a dataset whose individual elements are key-value pairs; the values belonging to each key are then reduced to a summary on which mathematical analyses can be performed (“MapReduce”). Even though Big Data are not collected in response to a specific research question, their sheer largeness (millions of observations of many variables) promises to provide answers relevant for a large part of a society’s population. From a statistical point of view, large sample sizes boost significance: with millions of observations even negligible differences become statistically significant, which is why the effect size is the more important measure. On the other hand, large does not mean all; one has to be aware of the universe covered by the data. Statistical inference – conclusions drawn from data about the population as a whole – cannot easily be applied, because the datasets are not established in a way that ensures that they are representative. Ironically, bias in Big Data may therefore come from missing data, e.g. on those parts of the population which are not captured in the data collection process.
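To make the “MapReduce” idea a little more tangible, here is a minimal single-machine sketch in Python. It is only an illustration under assumptions: the function names and the toy posts are hypothetical, and a real cluster framework would distribute the three steps across many computers.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: turn each unstructured text into key-value pairs, here (word, 1)."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group all values that share a key (a cluster framework does this step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse the list of values of each key into a single number."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical toy input standing in for a pile of social media posts.
posts = ["big data is big", "data are artefacts"]
counts = reduce_phase(shuffle(map_phase(posts)))
print(counts)
```

The point of the design is that neither the map step nor the reduce step needs to see the whole dataset at once, which is what makes it possible to spread the work over a cluster.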

But biases may also arise in the process of analysing Big Data. This, too, has to do with the substantial size of the datasets; standard software may be inappropriate to handle them. Beyond parallel computing and MapReduce, the use of machine learning seems to provide solutions. Machine learning designates algorithms that can learn from and make predictions on data by building a model from sample inputs. It is a type of artificial intelligence in which the system learns from lots of examples; results – such as patterns or clusters – become stronger with more evidence. It is for this reason that Big Data and machine learning seem to go hand in hand. Machine learning can roughly be divided into A) analytic techniques which use stochastic data models, most often classification and regression in supervised learning; and B) predictive approaches, where the data mechanism is unknown, as is the case with neural nets and deep learning. In both cases biases may be the result of the processing of Big Data.

A) The goal of statistical modelling is to find a model which allows quantitative conclusions to be drawn from data. It has the advantage that the data model is transparent and comprehensible to the analyst. However, what sounds objective (since it is ‘based on statistics’) need neither be correct (if the model is a poor emulation of reality, the conclusions may be wrong) nor fair: the algorithms may simply not be written in a manner which defines fairness or an even distribution as a goal of the problem-solving procedure. Machine learning then commits disparate mistreatment: the algorithm optimizes the discrimination for the whole population, but it is not looking for a fair distribution of this discrimination. ‘Objective decisions’ in machine learning can therefore be objectively unfair. This is the reason why Cathy O’Neil has called an algorithm “an opinion formalized in code”[1] – it does not simply provide objectivity, but works towards the (unfair) goals for which it was written. But there is a remedy; it is possible to develop mechanisms for fair algorithmic decision making. See for example the publications of Krishna P. Gummadi from the Max Planck Institute for Software Systems.
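How a decision rule can look good for the whole population and still treat groups unequally can be shown with a few lines of Python. The sketch assumes that disparate mistreatment is read as unequal error rates between groups; the groups, labels and decisions are entirely hypothetical and serve only as an illustration.

```python
def accuracy(records):
    """Share of all cases in which the decision matches the actual outcome."""
    return sum(r["predicted"] == r["actual"] for r in records) / len(records)

def false_positive_rate(records):
    """Share of truly negative cases that were wrongly flagged as positive."""
    negatives = [r for r in records if r["actual"] == 0]
    if not negatives:
        return 0.0
    return sum(r["predicted"] for r in negatives) / len(negatives)

# Hypothetical decisions of a classifier for two groups, A and B.
decisions = (
    [{"group": "A", "actual": 0, "predicted": 0}] * 8
    + [{"group": "A", "actual": 1, "predicted": 1}] * 2
    + [{"group": "B", "actual": 0, "predicted": 0}] * 2
    + [{"group": "B", "actual": 0, "predicted": 1}] * 2
    + [{"group": "B", "actual": 1, "predicted": 1}] * 2
)

overall = accuracy(decisions)
fpr_by_group = {
    g: false_positive_rate([r for r in decisions if r["group"] == g])
    for g in sorted({r["group"] for r in decisions})
}
print(overall)       # accuracy over the whole population
print(fpr_by_group)  # wrongly flagged share, per group
```

In this toy case the overall accuracy is high, yet all wrongly flagged persons belong to group B. A procedure that only optimizes the overall figure has no reason to notice this.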

Example of an algorithm, taken from: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Boston 2006, p. 164.

B) In recent years, powerful new tools for Big Data analysis have been developed: neural nets and deep learning algorithms. The goal of these tools is predictive accuracy; they are hardware-hungry and data-hungry, but have their strength in complex prediction problems where it is obvious that stochastic data models are not applicable. The approach is therefore designed differently: what is observed is a set of x’s that go in and a subsequent set of y’s that come out. The challenge is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. The focus does not lie with the model by which the input x is transformed into the output y; it does not have to be a stochastic data model. Rather, the model is unknown, complex and mysterious; and irrelevant. This is the reason why accurate prediction methods are addressed as complex “black boxes”; at least with neural nets, ‘algorithm’ is used as a synecdoche for “black box”. Unlike with stochastic models, the goal is not interpretability, but accurate information. And it is here, on the basis of an opaque data model, that neural nets and deep learning extract features from Big Data and identify patterns or clusters which have been invisible to the human analyst. It is fascinating to see that humans don’t decide what those features are. The predictive analysis of Big Data can identify and magnify patterns hidden in the data. This is the case with many recent studies, for example the facial recognition system recognizing ethnicity which has been developed by the company Kairos, or the Stanford study inferring sexual orientation by analysing people’s faces. What emerges here is that automatic feature extraction amplifies human bias. A lot of the talk about “biased algorithms” is a result of these findings.
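The purely predictive setting described above (x’s go in, y’s come out, and f(x) is judged only on a test set) can be sketched in Python. Note the assumptions: a simple nearest-neighbour rule stands in for the far more complex neural nets, sharing with them only the property that no stochastic data model is specified; all numbers are hypothetical.

```python
def fit_nearest_neighbour(train):
    """Return a predictor f built only from (x, y) examples, without a data model."""
    def f(x):
        # Predict the y of the training example whose x is closest to the query.
        nearest_x, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return f

# Hypothetical observations: x goes in, y comes out, the mechanism is unknown.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
test = [(1.5, 0), (2.5, 0), (6.5, 1), (9.0, 1)]

f = fit_nearest_neighbour(train)

# The only yardstick is how often f(x) matches y on examples not used for fitting.
held_out_accuracy = sum(f(x) == y for x, y in test) / len(test)
print(held_out_accuracy)
```

Nothing in this procedure says why an x leads to a y; the predictor is evaluated on accuracy alone, which is exactly what makes the data it was fitted on so decisive.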
But are the algorithms really to blame for the bias, especially in the case of machine learning systems with a non-transparent data model?

This question leads us back again to Big Data. There are at least two possible ways in which the data used predetermine the outcomes: the first is Big Data with built-in bias which is then amplified by the algorithm. Simply go to the Google image search and perform a search for either of the words “CEO” or “cleaner”. The second is the difference between the data sets used as training data for the algorithm and the data analysed subsequently. If you don’t have, for example, African American faces in a training set on facial recognition, you simply don’t know how the algorithm will behave when applied to images with African American faces. Therefore, the appropriateness and the coverage of the data set are crucial.
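Whether a training set covers the data it will later be applied to can at least be checked roughly before any model is built. The following Python sketch assumes that group annotations exist for both data sets, which is often not the case in practice; the labels, the proportions and the 5 % threshold are hypothetical.

```python
from collections import Counter

def group_shares(labels):
    """Relative frequency of each group label in a data set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def uncovered_groups(train_labels, target_labels, min_share=0.05):
    """Groups present in the target data but (almost) absent from the training data."""
    train = group_shares(train_labels)
    return sorted(g for g in set(target_labels) if train.get(g, 0.0) < min_share)

# Hypothetical group annotations for a training set and for the data analysed later.
train_labels = ["a"] * 97 + ["b"] * 3
target_labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10

gaps = uncovered_groups(train_labels, target_labels)
print(gaps)  # groups for which the behaviour of the algorithm is unknown
```

Such a check does not remove bias; it merely makes visible for which parts of the population the training data say little or nothing.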

The other point lies with data models and the modelling process. Models are always contextual, be they stochastic models with built-in assumptions about how the world works; or be they charged with context during the modelling process. This is why we should reflect on the social and historical contexts in which Big Data sets have been established; and the way our models and modelling processes are being shaped. And maybe it is also timely to reflect on the term “bias”, and to recollect that it implies an impossible unbiased ideal …

 

[1] Cathy O’Neil, Weapons of Math Destruction. How Big Data Increases Inequality and Threatens Democracy, New York: Crown 2016, p. 53.