Will Big Data render theory dispensable?

Scientific theories have been crucial for the development of the humanities and social sciences. Metatheories such as classical social evolution, cultural diffusion, functionalism or structuralism, for example, guided early anthropologists in their research process. Postmodern theorists rightly criticized their predecessors for, among other things, their deterministic theoretical models. Their criticism, however, was still based on theoretical reflections, although many tried to reduce their theoretical bias by combining several perspectives and theories (cf. theory triangulation).

Whereas it was common in the humanities to keep track of “disproven or refuted theories”, there could be a trend among proponents of a new scientific realism to put the blinkers on and focus solely on progress towards a universal, objective and true account of the physical world. Even worse, theory could be discarded altogether. Big data might revolutionise the scientific landscape. From the point of view of the physicist Chris Anderson, the “end of theory” is near: “Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves”.


This approach towards science might gain ground. Digital humanists are said to be “nice”, due to their concern with method rather than theory. Methodological debates in the digital humanities seem to circumnavigate more fundamental epistemological debates on principles. But big data is not self-explanatory. Explicitly or implicitly, theory plays a role when it comes to collecting, organizing and interpreting information: “So it is one thing to establish significant correlations, and still another to make the leap from correlations to causal attributes” (Bollier 2010). Theory matters for making the semantic shift from information to meaning.

In order to understand the process of knowledge production we must keep an eye on the mutually constitutive spheres of theory and practice. In the era of big data, Bruno Latour’s conclusion, “Change the instruments, and you will change the entire social theory that goes with them”, is hence more important than ever.


Data and their visualization

When did data enter mankind’s history? If you asked an archaeologist or a historian, they might answer: about 35,000 years ago, with cave paintings. The reason these are regarded as data lies in their symbolic dimension: cave paintings may not be mere representations of what their makers had observed; they may instead have been designed for religious or ceremonial purposes. Depictions of men dancing around a slain animal can thus be interpreted not only as human beings performing a spontaneous dance of victory, but also as illustrations of rituals and cosmologies. Sacred knowledge written on ceilings of locations not inhabited by men.

Statisticians (and data scientists) see these things differently. For them, data are clearly bound to abstract signs. In spite of the much older presence of hieroglyphs – this peculiar Egyptian mixture of pictograms and abstractions – they would point to Phoenician letters providing recipes for beer, with numbers given for amounts of barley; or to Sumerian calculations scratched into wet clay, which was subsequently fired and preserved as ostraca. Knowledge about intoxicating beverages and bargaining. Data, in this understanding, are connected to numerical values and not to the symbolic values which make up the surplus of the cave paintings. Moreover: While figurative drawings found in caves are visualizations in themselves, a ‘modern’ understanding of data points to maps as the beginning of data visualization (400 BC), star charts (150 BC), and a mechanical knowledge diagram drawn in Spain in 1305. Knowledge about the world and about knowledge itself. Canadian researcher Michael Friendly, who charted the history of data visualization in his Milestones project, sees Michael Florent van Langren’s 1644 representation of longitudes as the first (known) statistical graph.


We see here not only differing conceptions of data, but of what a visualization of data might be. And, if we follow the traces laid out by people writing on the history of data visualization (like Michael Friendly, Edward Tufte or Daniel Rosenberg), we soon note that there does not seem to be a straightforward evolution in data visualization. On the one hand, data visualization depended on the possibilities painters and printers provided for data specialists; on the other hand, the development of abstract thinking took its own routes. The indices and registers found as annexes in printed books form the data basis for networks of persons or conceptions, but their visualization came later. Visualizations became more common in the 19th century (John Snow’s map of cholera-contaminated London is one of the more popular examples here), and data visualization was taken out of the hands of the experts (who paid the graphic designers and could afford costly printing) only in the 20th century, with the mass distribution of Excel and PowerPoint. A new phase started with the Internet providing libraries for visualization like D3.js or P5.js. But these different phases also point to what Foucault had written in his “archéologie du savoir”: when history shifted from one view to another, the past became virtually incomprehensible to the present. Visualizations become outdated and are no longer easy to understand. Thus, returning to our starting point: Are cave paintings not simply other visualizations of data and knowledge than Phoenician beer recipes? Is it a failure of different disciplines not to be able to find an agreement about what terms like ‘knowledge’ and ‘data’ mean? Doesn’t our conception of ‘knowledge’ and ‘data’ determine what is being termed ‘visualization’?

The Revolution of the McWord; or, why difference and complexity are necessary.

“It’s a beautiful thing, the destruction of words.”—George Orwell, 1984.

One of my earliest memories of being reprimanded happened when I was in Junior or Senior Infants at Primary School. During a French lesson I needed to use the bathroom (tiny humans often do). We had been told we could only communicate in French, so I sat there attempting to gather and translate my toilet-related thoughts into something suitably Francophone. Eventually I put up my hand, got the teacher’s attention, pointed at my chest and said “Moi,” pointed at the door that led to the bathrooms and said “toilette?” The teacher snapped and said “No, Georgina, you are not a toilet!”

A little harsh perhaps, especially considering I was a four-year-old, three-foot-high mini-human, but still, I haven’t forgotten it, and now I’m fluent in French. My effort at breaking down a language barrier caused someone to snap and (it seems) be insulted by my tiny human attempt at French.

So certainly I agree with Jennifer’s point from her recent blog article here that “Building intimacy (for this is what I take the phrase “brings you closer” to mean) is not about having a rough idea of what someone is saying, it is about understanding the nuance of every gesture, every reference and resonance.” This is part of the (many) reason(s) why people on the autism spectrum, for example, find social interaction so difficult, because of a difficulty understanding these very gestural nuances that are so central to human communications. And this lack of understanding often brings with it frustration, isolation, loneliness, and pain. The point is: it’s not just about the words; it’s about how they are said, the tone, the gesture, the contexts. These are things a translation program cannot understand or impart, and it is arrogant to suggest that such facets of communication are by-passable or expendable when so many people struggle with them on a day-to-day basis. Moreover, they are facets of human communication that cannot be erased or eliminated from speech-exchanges with a view to making these exchanges “simpler” or “doubleplusgood.” That brings us right into 1984 territory.

From the perspective of Eugene Jolas, author of the “Revolution of the Word” manifesto published in transition magazine, a modernist periodical active in Paris throughout the 1920s and 1930s whose contributors included the likes of James Joyce, Gertrude Stein, and Samuel Beckett, language was not complex enough:

“Tired of the spectacle of short stories, novels, poems and plays still under the hegemony of the banal word, monotonous syntax, static psychology, descriptive naturalism, and desirous of crystallizing a viewpoint…” the manifesto declares that “Narrative is not mere anecdote, but the projection of a metamorphosis of reality” and that “The literary creator has the right to disintegrate the primal matter of words imposed on him by textbooks and dictionaries.”[1]

So, language, or rather languages (Jolas was fluent in several and often wrote in an amalgamation that, he felt, better reflected his hybridic Franco-German (Alsatian) and American identity[2]), was not complex enough to fully express the totality of reality.

Mark Zuckerberg proposes something of a devolution, a de-creation, a simplifying of difference, a reintegration and amalgamation of the facets that distinguish us from others. But while it might appear useful (Esperanto, anyone?), is the experience going to lead to richer conversations? To a demolition of barriers? Or will it result in something akin to Point It, the highly successful so-called “Traveller’s Language Kit” that contains no language at all, but rather an assortment of pictures that allow one to leaf through the book and point at the item one wants?

So, with a sensitive interlocutor, one could perhaps intuit that “Me *points at* Coca-Cola” means “Hi, I would like a Coca-Cola.” Or that “Me *points at* toilet” would likely mean “Hi, I desperately need to use your bathroom, could you be so kind as to point me in the right direction?”

After my childhood French toilet incident, I myself could never overcome the mortification involved in using Point It. But even if I did, would the success of an exchange wherein “Me *points at* Coca-Cola” results in my being handed a Coca-Cola give me the same satisfaction as my first successful exchange with someone in another language did? The first time I managed to say something in French to a French person in France and be met with a response in French as opposed to a confused look or (worse) a response in English. Would Zuckerberg’s own much-lauded trip to China in 2014 where he was interviewed and responded in Mandarin have received as much positive press if he had worked through an interpreter, or used a pioneering neural network translation platform? I don’t think so.

Reducing or eliminating language difference also creates hierarchies, and this is dangerous. What language will we agree to communicate in? Why one language and not another? What facets of my individuality are accented in my native language that are perhaps left out or lost in another?

In short, there is an ethical element to this, and one that must be acknowledged and addressed. It’s similar to the argument Todd Presner articulates in “The Ethics of the Algorithm” when he notes the negative effect of reducing human experience (in this case, the testimonies of Holocaust survivors) to keywords so that their experiences become “searchable”: “it abstracts and reduces the human complexity of the victims’ lives to quantized units and structured data. In a word, it appears to be de-humanizing.”[3]

We have to resist what Presner calls “the impulse to quantify, modularize, distantiate, technify, and bureaucratize the subjective individuality of human experience,”[4] even if this impulse is driven by a desire to facilitate communications across perceived borders. Finding, maintaining, and celebrating the individual matters in an era that is putting increasing pressure on us to separate the “in-” from the “-dividual” for the sake of facility. That pressure will lead (and has perhaps already led, if we can refer back to Don DeLillo’s 1985 observation that “You are the sum total of your data.”) to the era of the “dividual”[5], where instead of championing individuality, people are reduced to their component data sets, or rather the facets of their personhood that can be assigned to data sets, with the rest, the enigmatic “in-” that makes up an individual, deemed unnecessary, a “barrier” to facile communications.

Rather than working to fractalise language, as Eugene Jolas did, universal translation (which is itself a misnomer: all translations are, to a degree, inexact and entail a degree of intuition or creativity to render one word in or through another word) simplifies that which cannot, and should not, be simplified. This would be doubleplusungood.

Complexity matters.

KPLEX matters.

[1] Eugene Jolas, “Revolution of the Word,” transition 16/17, 1929.

[2] Born in New Jersey, Jolas moved to Europe, to the bilingual Alsace-Lorraine region, as a young child, and later spent key formative years in the United States.

[3] Presner, in “The Ethics of the Algorithm: Close and Distant Listening to the Shoah Foundation Visual History Archive” in Fogu, Claudio, Kansteiner, Wulf, and Presner, Todd, Probing the Ethics of Holocaust Culture.

[4] Presner, in ibid.

[5] Deleuze, “Postscript on the Societies of Control.”

A recipe for intimacy?

Mark Zuckerberg posted the following statement on his Facebook feed:
“Today we’re publishing research on how AI can deliver better language translations. With a new neural network, our AI research team was able to translate more accurately between languages, while also being nine times faster than current methods.

Getting better at translation is important to connect the world. We already perform over 2 billion translations in more than 45 languages on Facebook every day, but there’s still a lot more to do. You should be able to read posts or watch videos in any language, but so far the technology hasn’t been good enough.

Throughout human history, language has been a barrier to communication. It’s amazing we get to live in a time when technology can change that. Understanding someone’s language brings you closer to them, and I’m looking forward to making universal translation a reality. To help us get there faster, we’re sharing our work publicly so that all researchers can use it to build better translation tools.”

Key messages: taking time to understand people is for fools, and language is the problem.
When did language become a barrier to communication? Would we not be hard pressed to communicate much at all without it? Doesn’t machine translation have the potential to create as much distance as ‘understanding’? Building intimacy (for this is what I take the phrase “brings you closer” to mean) is not about having a rough idea of what someone is saying, it is about understanding the nuance of every gesture, every reference and resonance. Isn’t the joy of encountering a new culture tied up in the journey of discovery we make on the road to understanding?
I salute Facebook for making their research and software open, but a bit of humility in the face of the awesome and varied systems of signs and significations we humans have built could make this so much better news.

Data before the (alternative) facts.

Ambiguity and uncertainty, cultural richness, idiosyncrasy and subjectivity are par for the course when it comes to humanities research, but one mistake that can be made when we approach computational technologies that proffer access to greatly increased quanta of data is to assume that these technologies are objective. They are not, and just like regular old-fashioned analogue research, they are subject to subjectivities, prejudice, and error, and can display or reflect the human prejudices of the user, the software technician, or the organisation responsible for designing the information architecture; for designing this supposedly “objective” machine or piece of software for everyday use by regular everyday humans.

In ‘Raw Data’ is an Oxymoron Daniel Rosenberg argues that data is rhetorical.[1] Christine Borgman elaborates on this stating that “Data are neither truth nor reality. They may be facts, sources of evidence, or principles of an argument that are used to assert truth or reality.”[2] But one person’s truth is, or can be, in certain cases, antithetical to another person’s reality. We saw an example of this in Trump aide Kellyanne Conway’s assertion that the Trump White House was offering “alternative facts” regarding the size of the crowd that attended Trump’s inauguration. Conway’s comments refer to the “alternative facts” offered by White House Press Secretary Sean Spicer, who performed an incredible feat of rhetorical manipulation (of data) by arguing that “this was the largest audience to ever witness an inauguration, period.”

While the phrase “alternative facts” may have been coined by Conway, it’s not necessarily that new as a concept;[3] at least not when it comes to data and how big data is, or can be, used and manipulated. So what exactly am I getting at? Large data can be manipulated and interpreted to reflect different users’ interpretations of “truth.” This means that data is vulnerable (in a sense). This is particularly applicable when it comes to the way we categorise data, and so it concerns the epistemological implications of architecture such as the “search” function, and the categories we decide best assist the user in translating the data into information, knowledge and the all-important wisdom (à la the “data, information, knowledge and wisdom” (DIKW) system). Johanna Drucker noted this back in 2011, and her observations are still being cited:

It is important to recognize that when we organize data into categories (according to population, gender, nation, etc.), these categories tend to be treated as if they were discrete and fixed, when in fact they are interpretive expressions (Drucker 2011).[4]

This correlates with Rosenberg’s idea of data as rhetorical, and also with Rita Raley’s argument that data is “performative.”[5] Like rhetoric and performance then, this means the material can be fashioned to reflect the intentions of the user, that it can be used to falsify truth, or to insist that perceived truths are in fact false; that news is “fake news.” Data provided the foundations for the arguments voiced by each side over the ratio of Trump inauguration attendees to Obama inauguration attendees, with each side insisting their interpretation of the data allowed them to present facts.

This is one of the major concerns surrounding big data research, and Rosenthal uses it to draw a nice distinction between notions of falsity as they are presented in the sciences and the humanities respectively. The sciences maintain a high awareness of what are referred to as “false-positive results”[6], whereas in the humanities so-called “false-positives”, or theories that are later disproven or refuted (insofar as it is possible to disprove a theory in the humanities), are not discarded, but rather incorporated into the critical canon, becoming part of the bibliography of critical works on a given topic: “So while scientific truth is falsifiable, truth in the humanities never really cancels out what came before.”[7]

But perhaps a solution of sorts is to be found in the realm of the visual and the evolution of the visual (as opposed to the linguistic, which is vulnerable to rhetorical manipulations) as a medium through which we can attempt to study data. Rosenthal argues that visual experiences of data representations trigger different cognitive responses:

When confronted with the visual instead of the linguistic, our thought somehow becomes more innocent, less tempered by experience. If data is, as Rosenberg suggests, a rhetorical function that marks the end of the chain of analysis, then in data-driven literary criticism the visualization allows us to engage with that irreducible element in a way that language would not. As Edward R. Tufte (2001, 14), a statistician and author on data visualization, puts it, in almost Heideggerian language, ‘Graphics reveal data.’[8]

So “Graphics reveal data,” and it was to graphics that people turned in response to Conway and Spicer’s assertions regarding the “alternative facts” surrounding Trump’s inauguration crowd. Arguably, these graphics (below) express the counterargument to Conway and Spicer’s assertions in a more innately understandable way than any linguistic rebuttal, irrespective of how eloquent that rebuttal may be.

But things don’t end there (and it’s also important to remember that visualisations can themselves be manipulated). Faced with graphic “alternative-alternative facts” in the aftermath of both his own assertions and Conway’s defense of these assertions, Spicer countered the visual sparsity of the graphic image by retroactively clarifying that this “largest audience” assertion incorporated material external to the datasets used by the news media to compile their figures, taking in viewers on YouTube, Facebook, and those watching over the internet the world over. The “largest audience to ever witness an inauguration, period” is, apparently, “unquestionable,” and it was obtained by manipulating data.

[1] Daniel Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” is an Oxymoron, 18.

[2] Borgman, “Big Data, Little Data, No Data,” 17.

[3] We can of course refer to Orwell’s 1984 and the “2+2=5” paradigm here.

[4] Karin van Es, Nicolàs López Coombs & Thomas Boeschoten, “Towards a Reflexive Digital Data Analysis” in Mirko Tobias Schäfer and Karin van Es, The Datafied Society. Studying Culture through Data, 176.

[5] Rita Raley, “Dataveillance and Countervailance” in Gitelman, “Raw Data” is an Oxymoron, 128.

[6] Rosenthal, “Introduction: Narrative Against Data in the Victorian Novel,” 6.

[7] Ibid., 7.

[8] Ibid., 5.

How can use be made of NAs?

In elementary school, the little data scientist learns that NAs are nasty. Numerical data are nice, clean, and complete when collected by a sensor. NAs, on the other hand, are the kind of data that result from manually entered data or improperly filled-in surveys; or they are missing values which could not be gathered for unknown reasons. If they were NULLs, the case would be much clearer – with a NULL, you can calculate. But NA, that’s really terra incognita, and because of this the data have to be checked for bias and skewness. NA thus becomes for the little data scientist an expression of desperation: not applicable, not available, or no answer.

For humanists, the case is easier. Missing values are part of their data life. Only rarely do machines support the process of data collection, and experience shows that people, when asked, rarely respond in a complete way; leaving out part of the answer is an all-too-human strategy. Working with this kind of data, researchers have become creative; one can even ask whether there is a pattern behind NAs, or whether a systematic explanation for them can be found. In a recently published reader on methodologies in emotion research, there is a contribution on how to deal with missing values. The researcher, Dunya van Troost, collected data from persons committed to political protest all over Europe; the survey contained, amongst others, four items on emotions. By counting the number of missing values across the four emotion items, three groups could be identified: (1) respondents with complete records (85 percent), (2) respondents who answered selectively (13 percent), and (3) respondents who answered none of the four emotion items (2 percent). For inferential statistics, it is important to understand the reasons why certain data are missing. Van Troost somehow turned the tables: she conducted a regression analysis to see how the other data available, here both country and demographic characteristics, influenced the number of missing values among her 15,272 respondents. She found that the unanswered items were not missing completely at random. The regression further showed that the generation of protesters born before 1945 more frequently had missing values on the emotion items than younger generations. The same applies to the level of education: the higher the level of education, the more completely the surveys had been filled in. Female respondents had more missing values than male ones. Finally, respondents from the southern European countries Spain and Italy had a relatively high rate of missing values.
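The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names and the four toy respondents are invented, and nothing here reproduces van Troost’s data or her regression model. It merely shows how counting missing answers per respondent yields the three groups.

```python
from collections import Counter

# Hypothetical item names; the source only says there were four emotion items.
EMOTION_ITEMS = ["anger", "fear", "frustration", "worry"]

# Invented toy respondents. None marks a missing answer (an NA).
respondents = [
    {"anger": 4, "fear": 2, "frustration": 5, "worry": 3},
    {"anger": 3, "fear": None, "frustration": 4, "worry": None},
    {"anger": None, "fear": None, "frustration": None, "worry": None},
    {"anger": 5, "fear": 1, "frustration": 2, "worry": 2},
]


def count_missing(row):
    """Number of emotion items this respondent left unanswered."""
    return sum(1 for item in EMOTION_ITEMS if row.get(item) is None)


def classify(row):
    """Assign a respondent to one of the three groups named in the text."""
    missing = count_missing(row)
    if missing == 0:
        return "complete"
    if missing == len(EMOTION_ITEMS):
        return "none answered"
    return "selective"


# Group sizes and their share of all respondents.
groups = Counter(classify(row) for row in respondents)
shares = {group: n / len(respondents) for group, n in groups.items()}
print(groups)
print(shares)
```

The per-respondent count returned by `count_missing` is the kind of quantity that could then serve as the outcome variable in a regression on country and demographic characteristics.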

It is obvious that not all people are going to answer items on emotions with the same willingness; and it is also plausible that the differences in willingness are related to age, gender, cultural socialization. It would be interesting to have a longitudinal study on this topic, to validate and cross-check van Troost’s findings. But this example illuminates a different handling of information in the social sciences from that of information science: Even NAs can be useful.

Dunya Van Troost, Missing values: surveying protest emotions, in: Helena Flam, Jochen Kleres (eds.), Methods of Exploring Emotions, London / New York: Routledge 2015, pp. 294-305.

The rule and question of the eggs: how do we decide what data is the right data?

The rule and question of the eggs.

A young maiden beareth eggs to the market for to sell and her meetheth a young man that would play with her in so much that he overthroweth and breaketh the eggs every one, and will not pay for them. The maid doth him to be called afore the judge. The judge condemneth him to pay for the eggs/ but the judge knoweth not how many eggs there were. And that he demandeth of the maid/ she answereth that she is but young, and cannot well count.[1]

“The rule and question of the eggs” is an arithmetic problem contained within the earliest English-language printed treatise on arithmetic, An Introduction for to Learn to Reckon with the Pen.[2] In addition to being flat-out joyous to read, “The rule and question of the eggs” (along with the other examples Travis D. Williams discusses in his brilliant essay on early modern arithmetic in ‘Raw Data’ is an Oxymoron) raises some important issues that are relevant on a multidisciplinary level in terms of how we approach the archives of our respective fields of study with a view to making them available on digital platforms; furthermore, the “rule and question of the eggs” raises interesting questions in terms of how nuanced cultural artefacts (of which this is but one example) are to be satisfactorily yolked together (see what I did there?) and represented en masse in the form of big data.

It is not easy to define what facets of this piece of hella dodgy arithmetic merit particular attention, or to anticipate what facets of the piece other readers (present day or future readers) may find interesting. If we were to go about entering this into an online archive or database, what keywords would we use? What aspects of this problem are of particular importance? What information is essential, what is inessential? For practitioners of maths, is it the bizarre non-existent non-workable formula that seemingly asks the student to figure out how many eggs the girl was carrying in her basket, despite the volume of the basket being left out and the girl herself professing not to know how to count? For historians of mathematics and early modern approaches to the pedagogy of mathematics, is it the language used to frame the problem? For sociologists, feminists, or historians (etc.), is it the unsettling reference to a “young man that would play with” a young woman? Is it the fact that the crime being reported is the destruction of eggs as opposed to the public assault? Or is it the fact that the young girl seemingly countered his unwelcome advances by committing fraud, because she claimed she had been carrying 721 (!) eggs?[3] Simply put, where or what is the data here? And is the data you pull from this piece the same as the data I pull from it, or the same as the data a reader in fifty years’ time will pull from it?

Houston, we have a problem: if we go to categorise this example, we risk leaving out any one of the many essential facets that make the “rule and question of the eggs” such a rich historical document. But categorise it we must. And, just like moving between languages (English-to-French or vice versa), when we move from rich, nuanced, ambiguous and complex linguistic narratives to the language of the database, the dataset, the encoded set of assigned unambiguous values readable to computers, we expose the material to a translation act that imposes delimiting interpretations on that material by creating datasets that drastically simplify the material. Someone decides what is worth translating, what is incidental, and what should be left out here. Someone converts the information as it stands in one language (linguistic narrative), into supposedly comparable or equivalent information in another language (computer narrative).

Further still, by translating this early modern piece into information that is workable within the sphere of contemporary digital archival practices, we create a scenario wherein the material is no longer read as it was intended to be read. We impose our thinking and understanding about maths (or women’s rights, say, because the “play with” part really irks me) onto their text by taking it and making of it a series of functional datasets relevant to our particular scholarly interests. As Williams points out, “a data set is already interpreted by the fact that it is a set: some elements are privileged by inclusion; while others are denied relevance through exclusion.”[4]
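To make the point about inclusion and exclusion concrete, here is a minimal sketch in Python. Everything in it is a hypothetical cataloguing choice: the field names, the two “schemas”, and the way the eggs problem is described are invented for illustration and do not correspond to any real archive’s metadata standard. It shows only that the same text, passed through two different sets of categories, comes out as two different datasets.

```python
# One possible description of the eggs problem. Every field is an
# interpretive choice made for this illustration, not an archival standard.
full_description = {
    "genre": "arithmetic problem",
    "language": "early modern English",
    "actors": ["maid", "young man", "judge"],
    "event": "eggs broken on the way to market",
    "legal_outcome": "young man condemned to pay",
    "stated_answer": 721,
}

# Two hypothetical schemas, reflecting two scholarly interests.
maths_schema = {"genre", "stated_answer"}
social_history_schema = {"actors", "event", "legal_outcome"}


def encode(record, schema):
    """Keep only the fields that the schema privileges by inclusion."""
    return {key: value for key, value in record.items() if key in schema}


def excluded(record, schema):
    """List the fields the schema denies relevance through exclusion."""
    return sorted(set(record) - schema)


print(encode(full_description, maths_schema))
print(excluded(full_description, maths_schema))
print(encode(full_description, social_history_schema))
print(excluded(full_description, social_history_schema))
```

Neither encoding is wrong, but each silently drops what the other keeps, and the starting description was itself already a selection.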

In analysing these pieces Williams establishes four terminologies: “our reading and their reading, our rigor and their rigor.”[5] He elaborates:

Our reading is a practice of interpretation that seeks to understand the appearance and function of texts within their original historical and cultural milieus. Our reading thus incorporates the need to understand with nuance their reading: why and how contemporaneous readers would read the texts produced by their cultures.[6]

That’s all well and good when approached from an analogue-dependent research environment where one is tackling these early modern maths problems one by one. After all, this is merely one maths problem within an entire book containing maths problems. But what if we were to take it to a Borgesian level, to a big data level wherein this “rule and question of the eggs” is merely one maths problem within an entire book containing maths problems, a book contained within a library containing books that contain only maths problems; a library that was in fact the ur-library of maths books, containing every maths book and every maths problem ever written?

When we amp up the scale to the realm of big data and this one tiny problem becomes one tiny problem within an entire ur-library of information, how do we stay cognisant of the fact that every entry in a given dataset, no matter how seemingly incidental or minute, could be as detailed and nuanced as our enigmatic rule and question of the eggs?

[1] Quoted in Travis D. Williams “Procrustean Marxism and Subjective Rigor,” Gitelman, “Raw Data” is an Oxymoron, 45.

[2] I am indebted to Travis D. Williams’s essay “Procrustean Marxism and Subjective Rigor: Early Modern Arithmetic and Its Readers” (to be found in “Raw Data” is an Oxymoron (2013)) for bringing these incredible examples to light.

[3] Williams notes that the “correct” answer (or rather the answer recorded in the arithmetic book as the “correct” answer) is 721 eggs. But this would mean that the young maiden was carrying roughly 36 kilos (yes, I’ve done the math) of egg, which seems unlikely. Williams “Procrustean Marxism and Subjective Rigor,” ibid.

[4] Travis D. Williams “Procrustean Marxism and Subjective Rigor,” 41.

[5] Travis D. Williams “Procrustean Marxism and Subjective Rigor,” Gitelman, “Raw Data” is an Oxymoron, 42.

[6] Travis D. Williams “Procrustean Marxism and Subjective Rigor,” ibid.

Featured image was taken from

Ethics and epistemics in biomedical big data research

This recent article explores issues in big data approaches with regard to

a) ethics – such as obtaining consent from large numbers of research participants across a large number of institutions; protecting confidentiality; privacy concerns; optimal methods for de-identification; and the limitation of the capacity for the public, and even experts, to interpret and question research findings

and b) epistemics – such as personalized (or precision) treatments that rely on extending concepts that have largely failed or have very high error rates; deficiencies of observational studies that do not get eliminated with big data; challenges of big data approaches due to their overpowered analysis settings; minor noise due to errors or low quality information being easily translated into false signals; and problems with the view that big data is somehow “objective,” including that this obscures the fact that all research questions, methods, and interpretations are value-laden.

The article closes with a list of recommendations which consider the tight links between epistemology and ethics in relation to big data in biomedical research.

Wendy Lipworth, Paul H. Mason, Ian Kerridge, John P. A. Ioannidis, Ethics and Epistemology in Big Data Research, in: Journal of Bioethical Inquiry 2017, DOI 10.1007/s11673-017-9771-3 [Epub ahead of print].

Tinfoil hats, dataveillance, and panopticons.

When I started my work with KPLEX, I was not expecting to encounter so many references to literature. Specifically, to works of fiction I have read in my capacity as an erstwhile undergraduate and graduate student of literature who had (and still has) a devout personal interest in the very particular, paranoid postmodern fictions that crawled out of the Americas (North and South) like twitchy angst-ridden spiders in the mid-to-latter half of the 20th century. The George Orwell references did not surprise me all that much; after all, everyone loves to reference 1984. But Jorge Luis Borges, Thomas Pynchon, and Don DeLillo? These guys produced (the latter two are still producing) the kind of paranoiac post-Orwellian literature that could be nicely summed up by the Nirvana line “Just because you’re paranoid/ Don’t mean they’re not after you,” which is itself a slightly modified lift straight out of Joseph Heller’s Catch-22.

It seems, however, that when it comes to outlining, theorising and speculating over the state, uses, and value of data in 21st century society, the paranoid tinfoil hat wearing Americans and their close predecessor, the Argentinian master of the labyrinth, got there first.

We are all by now familiar with, or have at least likely heard reference to, the surveillance system in operation in 1984: a two-way screen that captures image and sound so that the inhabitants of Orwell’s world are always potentially being watched and listened to. In a post-Snowden era this all-seeing all-hearing panoptic Orwellian entity has already been referenced to death, and indeed, as Rita Raley points out, Orwell’s two-way screen has long been considered inferior to the “disciplinary and control practice of monitoring, aggregating, and sorting data.”[1] In other words, to the practice of “dataveillance.”[2] But Don DeLillo’s vision of the role data would play in our future was somewhat different, more nuanced, and, most importantly, less overtly classifiable as dystopian; in fact, it reads rather like a description of an assiduous Google Search, yet it is to be found in the pages of a book first published in 1985:

It’s what we call a massive data-base tally. Gladney, J.A.K. I punch in the name, the substance, the exposure time and then I tap into your computer history. Your genetics, your personals, your medicals, your psychologicals, your police-and-hospitals. It comes back pulsing stars. This doesn’t mean anything is going to happen to you as such; at least not today or tomorrow. It just means you are the sum total of your data. No man escapes that.[3]

Dataveillance is interesting because its function is not just to record and monitor, but also to speculate, to predict, and maybe even to prescribe. As a result, as Raley points out, its future value is speculative: “it awaits the query that would produce its value.”[4] By value Raley is referring to the economic value this data may have in terms of its potential to influence people to buy and sell things; and so, we have a scenario wherein data is traded in a manner akin to shares or currency, where “data is the new oil of the internet”:[5]

Data speculation means amassing data so as to produce patterns, as opposed to having an idea for which one needs to collect supporting data. Raw data is the material for informational patterns to come, its value unknown or uncertain until it is converted into the currency of information. And a robust data exchange, with so-termed data handlers and data brokers, has emerged to perform precisely this work of speculation. An illustrative example is BlueKai, “a marketplace where buyers and sellers trade high-quality targeting data like stocks,” more specifically, an auction for the near-instant circulation of user intent data (keyword searches, price searching and product comparison, destination cities from travel sites, activity on loan calculators).[6]

This environment of highly sophisticated, near-constant amassing of data leads us back to DeLillo and his observation, made back in 1985, that “you are the sum total of your data.” And this is perhaps the very environment that leads Geoffrey Bowker to declare, in his provocative afterword to the collection of essays ‘Raw Data’ is an Oxymoron (2013), that we as humans are “entering into”, are “being entered” into, “the dataverse.”[7] Within this dataverse, Bowker—who is being self-consciously hyperbolic—claims it is possible to “take the unnecessary human out of the equation,” envisioning a scenario wherein “our interaction with the world and each other is being rendered epiphenomenal to these data-program-data cycles” and one where, ultimately, “if you are not data, you don’t exist.”[8] But this is precisely where we must be most cautious, particularly when it comes to the nascent dataverse of humanities researchers. Because while we might tentatively make the claim to be within a societal dataverse now, the alignment of data with existence and experience is still far from total. We cannot yet fully capture the entirety of the white noise of selfhood.

And this is where things start to get interesting, because what is perhaps dystopian from a contemporaneous perspective (that is, the presence somewhere out there of near infinitesimal quanta of data pertaining to you, your preferences, your activities), a scenario that might reasonably lead us to reach for those tinfoil hats, is, conversely, a desirable one from the perspective of historians and other humanities researchers. A data sublime, a “single database fantasy”[9] wherein one could access everything, where nothing is hidden, and where the value, the intellectual, historical, and cultural value of the raw data is always speculative, always potentially of value to the researcher, and thus amassed and maintained with the same reverence associated with high value data traded today on platforms such as BlueKai. Because as it is, the amassing of big data for humanities researchers, particularly when it comes to converting extant analogue archives and collections, subjects the material to a hierarchising process wherein items of potential future value (speculative value) are left out or hidden, greatly diminishing their accessibility and altering the richness or fertility of the research landscape available to future scholars. After all, “if you are not data, you don’t exist.”[10] But if you don’t exist then, to paraphrase Raley, you cannot be subjected to the search or query of future scholars and researchers, the search or query that would determine your value.

As we move towards these data sublime scenarios, it is important not to lose sight of the fact that that which is considered data now, this steadily accumulating catalogue of material pertaining to us as individuals or humans en masse, still does not capture everything. And if this is true now then it is doubly true (ability to resist Orwellian doublespeak at this stage in blogpost = zero) of our past selves and the analogue records that constitute the body of humanities research. How do we incorporate the “not data” in an environment where data is currency?

Happy Day of DH!

[1] Raley, “Dataveillance and Countervailance” in Gitelman ed., “Raw Data” is an Oxymoron, 124.

[2] Roger Clarke, quoted in ibid.

[3] Don DeLillo, White Noise, quoted in Gitelman ed., “Raw Data” is an Oxymoron, 121, emphasis in original.

[4] Raley, “Dataveillance and Countervailance” in ibid., 123–4.

[5] Julia Angwin, “The Web’s New Gold Mine: Your Secrets,” quoted in ibid., 123.

[6] Raley, “Dataveillance and Countervailance” in ibid., 123.

[7] Geoffrey Bowker, “Data Flakes: An Afterword to ‘Raw Data’ Is an Oxymoron” in ibid., 167.

[8] Bowker, in ibid., 170.

[9] Raley, “Dataveillance and Countervailance” in ibid., 128.

[10] Bowker, in ibid., 170.

Featured image is a still taken from the film version of 1984.