Blog

Data before the (alternative) facts.

Ambiguity and uncertainty, cultural richness, idiosyncrasy and subjectivity are par for the course when it comes to humanities research. But one mistake we can make when approaching computational technologies that proffer access to greatly increased quanta of data is to assume that these technologies are objective. They are not. Just like regular old-fashioned analogue research, they are subject to subjectivity, prejudice and error, and can display or reflect the human prejudices of the user, the software technician, or the organisation responsible for designing the information architecture; for designing this supposedly “objective” machine or piece of software for everyday use by regular, everyday humans.

In “Raw Data” Is an Oxymoron, Daniel Rosenberg argues that data is rhetorical.[1] Christine Borgman elaborates on this, stating that “Data are neither truth nor reality. They may be facts, sources of evidence, or principles of an argument that are used to assert truth or reality.”[2] But one person’s truth is, or can be, in certain cases, antithetical to another person’s reality. We saw an example of this in Trump aide Kellyanne Conway’s assertion that the Trump White House was offering “alternative facts” regarding the size of the crowd that attended Trump’s inauguration. Conway’s comments refer to the “alternative facts” offered by White House Press Secretary Sean Spicer, who performed an incredible feat of rhetorical manipulation (of data) by arguing that “this was the largest audience to ever witness an inauguration, period.”

While the phrase “alternative facts” may have been coined by Conway, the concept is not necessarily that new;[3] at least not when it comes to data and how big data is, or can be, used and manipulated. So what exactly am I getting at? Big data can be manipulated and interpreted to reflect different users’ interpretations of “truth.” This means that data is vulnerable (in a sense). This is particularly applicable when it comes to the way we categorise data, and so it concerns the epistemological implications of architecture such as the “search” function, and the categories we decide best assist the user in translating data into information, knowledge and the all-important wisdom (à la the “data, information, knowledge and wisdom” (DIKW) hierarchy). Johanna Drucker noted this back in 2011, and her observations are still being cited:

It is important to recognize that when we organize data into categories (according to population, gender, nation, etc.), these categories tend to be treated as if they were discrete and fixed, when in fact they are interpretive expressions (Drucker 2011).[4]

This correlates with Rosenberg’s idea of data as rhetorical, and also with Rita Raley’s argument that data is “performative.”[5] Like rhetoric and performance, then, data can be fashioned to reflect the intentions of the user; it can be used to falsify truth, or to insist that perceived truths are in fact false, that news is “fake news.” Data provided the foundations for the arguments voiced by each side over the ratio of Trump inauguration attendees to Obama inauguration attendees, with each side insisting their interpretation of the data allowed them to present facts.

This is one of the major concerns surrounding big data research, and Rosenthal uses it to draw a nice distinction between notions of falsity as they are presented in the sciences and the humanities respectively: the sciences maintain a high awareness of what are referred to as “false-positive results,”[6] whereas in the humanities, so-called “false positives,” theories that are later disproven or refuted (insofar as it is possible to disprove a theory in the humanities), are not discarded but rather incorporated into the critical canon, becoming part of the bibliography of critical works on a given topic: “So while scientific truth is falsifiable, truth in the humanities never really cancels out what came before.”[7]

But perhaps a solution of sorts is to be found in the realm of the visual and its evolution (as opposed to the linguistic, which is vulnerable to rhetorical manipulation) as a medium through which we can attempt to study data. Rosenthal argues that visual experiences of data representations trigger different cognitive responses:

When confronted with the visual instead of the linguistic, our thought somehow becomes more innocent, less tempered by experience. If data is, as Rosenberg suggests, a rhetorical function that marks the end of the chain of analysis, then in data-driven literary criticism the visualization allows us to engage with that irreducible element in a way that language would not. As Edward R. Tufte (2001, 14), a statistician and author on data visualization, puts it, in almost Heideggerian language, ‘Graphics reveal data.’[8]

So “Graphics reveal data,” and it was to graphics that people turned in response to Conway and Spicer’s assertions regarding the “alternative facts” surrounding Trump’s inauguration crowd. Arguably, these graphics (below) express the counterargument to Conway and Spicer’s assertions in a more innately understandable way than any linguistic rebuttal, irrespective of how eloquent that rebuttal may be.

But things don’t end there (and it’s also important to remember that visualisations can themselves be manipulated). Faced with graphic “alternative-alternative facts” in the aftermath of both his own assertions and Conway’s defence of them, Spicer countered the visual sparsity of the graphic image by retroactively clarifying that his “largest audience” assertion incorporated material external to the datasets used by the news media to compile their figures, taking in viewers on YouTube, on Facebook, and those watching over the internet the world over. The “largest audience to ever witness an inauguration, period” is, apparently, “unquestionable,” and it was obtained by manipulating data.

[1] Daniel Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” Is an Oxymoron, 18.

[2] Borgman, “Big Data, Little Data, No Data,” 17.

[3] We can of course refer to Orwell’s 1984 and the “2+2=5” paradigm here.

[4] Karin van Es, Nicolàs López Coombs & Thomas Boeschoten, “Towards a Reflexive Digital Data Analysis” in Mirko Tobias Schäfer and Karin van Es, The Datafied Society. Studying Culture through Data, 176.

[5] Rita Raley, “Dataveillance and Countervailance” in Gitelman, “Raw Data” is an Oxymoron, 128.

[6] Rosenthal, “Introduction: Narrative Against Data in the Victorian Novel,” 6.

[7] Ibid., 7.

[8] Ibid., 5.

What use can be made of NAs?

In elementary school, the little data scientist learns that NAs are nasty. Numerical data are nice, clean, and complete when collected by a sensor. NAs, on the other hand, are the kind of data that result from manual data entry or improperly filled-in surveys; or they are missing values that could not be gathered for unknown reasons. If they were NULLs, the case would be much clearer – with a NULL, you can calculate. But NA, that’s really terra incognita, and because of this the data have to be checked for bias and skewness. NA thus becomes for the little data scientist an expression of desperation: not applicable, not available, or no answer.

For humanists, the case is easier. Missing values are part of their data life. Only rarely do machines support the process of data collection, and experience shows that people, when asked, rarely respond in a complete way; leaving out part of the answer is an all-too-human strategy. Working with this kind of data, researchers have become creative; one can even ask whether there is a pattern behind NAs, or whether a systematic explanation for them can be found. In a recently published reader on methodologies in emotion research, there is a contribution on how to deal with missing values. The researcher, Dunya van Troost, collected data from persons committed to political protest all over Europe; the survey contained, among other items, four on emotions. By counting the number of missing values across the four emotion items, three groups could be identified: respondents with complete records (85 percent), respondents who answered selectively (13 percent), and respondents who answered none of the four emotion items (2 percent). For inferential statistics, it is important to understand the reasons why certain data are missing. Van Troost somehow turned the tables: she conducted a regression analysis to see how the other data available, here both country and demographic characteristics, influenced the number of missing values provided by her 15,272 respondents. She found that the unanswered items were not missing completely at random. The regression further showed that the generation of protesters born before 1945 more frequently had missing values on the emotion items than younger generations. The same applies to the level of education – the higher the level of education, the more completely the surveys were filled in. Female respondents had more missing values than male ones. Finally, respondents from the southern European countries Spain and Italy had a relatively high rate of missing values.
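The counting-and-grouping step van Troost starts from is easy to reproduce in code. A minimal sketch in Python with pandas, using a handful of invented example records (the column names and values here are hypothetical illustrations, not her dataset):

```python
import pandas as pd

# Hypothetical survey slice: four emotion items per respondent (None = not answered).
df = pd.DataFrame({
    "anger": [3, None, 2, None],
    "fear":  [1, 4,    None, None],
    "pride": [2, None, 5, None],
    "shame": [4, 2,    1, None],
})

emotion_items = ["anger", "fear", "pride", "shame"]

# The number of missing values itself becomes the variable of interest.
df["n_missing"] = df[emotion_items].isna().sum(axis=1)

# Three groups, as in van Troost's analysis: complete records,
# selective answers, and no emotion items answered at all.
def answer_group(n_missing, n_items=len(emotion_items)):
    if n_missing == 0:
        return "complete"
    if n_missing < n_items:
        return "selective"
    return "none"

df["group"] = df["n_missing"].apply(answer_group)
print(df[["n_missing", "group"]])
```

From here, `n_missing` could serve as the dependent variable in a regression on country and demographic characteristics, which is the table-turning move described above.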

It is obvious that not all people are going to answer items on emotions with the same willingness; and it is also plausible that the differences in willingness are related to age, gender, and cultural socialization. It would be interesting to have a longitudinal study on this topic, to validate and cross-check van Troost’s findings. But this example illuminates a handling of information in the social sciences that differs from that of information science: even NAs can be useful.

Dunya van Troost, “Missing Values: Surveying Protest Emotions,” in: Helena Flam, Jochen Kleres (eds.), Methods of Exploring Emotions, London / New York: Routledge 2015, pp. 294–305.

The rule and question of the eggs: how do we decide what data is the right data?

The rule and question of the eggs.

A young maiden beareth eggs to the market for to sell and her meetheth a young man that would play with her in so much that he overthroweth and breaketh the eggs every one, and will not pay for them. The maid doth him to be called afore the judge. The judge condemneth him to pay for the eggs/ but the judge knoweth not how many eggs there were. And that he demandeth of the maid/ she answereth that she is but young, and cannot well count.[1]

“The rule and question of the eggs” is an arithmetic problem contained within the earliest English-language printed treatise on arithmetic, An Introduction for to Learn to Reckon with the Pen.[2] In addition to being flat-out joyous to read, “The rule and question of the eggs” (along with the other examples Travis D. Williams discusses in his brilliant essay on early modern arithmetic in “Raw Data” Is an Oxymoron) raises some important issues that are relevant on a multidisciplinary level in terms of how we approach the archives of our respective fields of study with a view to making them available on digital platforms. Furthermore, the “rule and question of the eggs” raises interesting questions in terms of how nuanced cultural artefacts (of which this is but one example) are to be satisfactorily yolked together (see what I did there?) and represented en masse in the form of big data.

It is not easy to define what facets of this piece of hella dodgy arithmetic merit particular attention, or to anticipate what facets of the piece other readers (present-day or future) may find interesting. If we were to go about entering this into an online archive or database, what keywords would we use? What aspects of this problem are of particular importance? What information is essential, what inessential? For practitioners of maths, is it the bizarre, non-existent, unworkable formula that seemingly asks the student to figure out how many eggs the girl was carrying in her basket, despite the volume of the basket being left out and the girl herself professing not to know how to count? For historians of mathematics and of early modern approaches to the pedagogy of mathematics, is it the language used to frame the problem? For sociologists, feminists, or historians (etc.), is it the unsettling reference to a “young man that would play with” a young woman? Is it the fact that the crime being reported is the destruction of eggs as opposed to the public assault? Or is it the fact that the young girl seemingly countered his unwelcome advances by committing fraud, claiming she had been carrying 721 (!) eggs?[3] Simply put, where or what is the data here? And is the data you pull from this piece the same as the data I pull from it, or the same as the data a reader in fifty years’ time will pull from it?

Houston, we have a problem: if we go to categorise this example, we risk leaving out any one of the many essential facets that make the “rule and question of the eggs” such a rich historical document. But categorise it we must. And, just like moving between languages (English to French or vice versa), when we move from rich, nuanced, ambiguous and complex linguistic narratives to the language of the database, the dataset, the encoded set of assigned unambiguous values readable to computers, we expose the material to a translation act that imposes delimiting interpretations on it, creating datasets that drastically simplify it. Someone decides what is worth translating, what is incidental, and what should be left out. Someone converts the information as it stands in one language (linguistic narrative) into supposedly comparable or equivalent information in another language (computer narrative).

Further still, by translating this early modern piece into information that is workable within the sphere of contemporary digital archival practices, we create a scenario wherein the material is no longer read as it was intended to be read. We impose our thinking and understanding about maths (or women’s rights, say, because the “play with” part really irks me) onto their text by taking it and making of it a series of functional datasets relevant to our particular scholarly interests. As Williams points out, “a data set is already interpreted by the fact that it is a set: some elements are privileged by inclusion; while others are denied relevance through exclusion.”[4]

In analysing these pieces Williams establishes four terminologies: “our reading and their reading, our rigor and their rigor.”[5] He elaborates:

Our reading is a practice of interpretation that seeks to understand the appearance and function of texts within their original historical and cultural milieus. Our reading thus incorporates the need to understand with nuance their reading: why and how contemporaneous readers would read the texts produced by their cultures.[6]

That’s all well and good when approached from an analogue-dependent research environment where one is tackling these early modern maths problems one by one. After all, this is merely one problem within an entire book of maths problems. But what if we were to take it to a Borgesian level, to a big data level, wherein this “rule of the eggs” is merely one problem within an entire book of maths problems, a book contained within a library of books that contain only maths problems; a library that was in fact the ur-library of maths books, containing every maths book and every maths problem ever written?

When we amp up the scale to the realm of big data and this one tiny problem becomes one tiny problem within an entire ur-library of information, how do we stay cognisant of the fact that every entry in a given dataset, no matter how seemingly incidental or minute, could be as detailed and nuanced as our enigmatic rule and question of the eggs?

[1] Quoted in Travis D. Williams “Procrustean Marxism and Subjective Rigor,” Gitelman, “Raw Data” is an Oxymoron, 45.

[2] I am indebted to Travis D. Williams’s essay “Procrustean Marxism and Subjective Rigor: Early Modern Arithmetic and Its Readers” (to be found in “Raw Data” Is an Oxymoron (2013)) for bringing these incredible examples to light.

[3] Williams notes that the “correct” answer (or rather the answer recorded in the arithmetic book as the “correct” answer) is 721 eggs. But this would mean that the young maiden was carrying roughly 36 kilos (yes, I’ve done the math) of eggs, which seems unlikely. Williams, “Procrustean Marxism and Subjective Rigor,” ibid.

[4] Travis D. Williams “Procrustean Marxism and Subjective Rigor,” 41.

[5] Travis D. Williams “Procrustean Marxism and Subjective Rigor,” Gitelman, “Raw Data” is an Oxymoron, 42.

[6] Travis D. Williams “Procrustean Marxism and Subjective Rigor,” ibid.

Featured image was taken from http://www.flickr.com

Ethics and epistemics in biomedical big data research

This recent article explores issues in big data approaches with regard to

a) ethics – such as obtaining consent from large numbers of research participants across a large number of institutions; protecting confidentiality; addressing privacy concerns; finding optimal methods for de-identification; and the limited capacity of the public, and even experts, to interpret and question research findings;

and b) epistemics – such as personalized (or precision) treatments that rely on extending concepts that have largely failed or have very high error rates; deficiencies of observational studies that are not eliminated by big data; challenges of big data approaches due to their overpowered analysis settings; minor noise, due to errors or low-quality information, being easily translated into false signals; and problems with the view that big data is somehow “objective,” including that this view obscures the fact that all research questions, methods, and interpretations are value-laden.

The article closes with a list of recommendations which consider the tight links between epistemology and ethics in relation to big data in biomedical research.

Wendy Lipworth, Paul H. Mason, Ian Kerridge, John P. A. Ioannidis, Ethics and Epistemology in Big Data Research, in: Journal of Bioethical Inquiry 2017, DOI 10.1007/s11673-017-9771-3 [Epub ahead of print].

Tinfoil hats, dataveillance, and panopticons.

When I started my work with KPLEX, I was not expecting to encounter so many references to literature. Specifically, to works of fiction I have read in my capacity as an erstwhile undergraduate and graduate student of literature who had (and still has) a devout personal interest in the very particular, paranoid postmodern fictions that crawled out of the Americas (North and South) like twitchy, angst-ridden spiders in the mid-to-latter half of the 20th century. The George Orwell references did not surprise me all that much; after all, everyone loves to reference 1984. But Jorge Luis Borges, Thomas Pynchon, and Don DeLillo? These guys produced (the latter two are still producing) the kind of paranoiac post-Orwellian literature that could be nicely summed up by the Nirvana line “Just because you’re paranoid/ Don’t mean they’re not after you,” which is itself a slightly modified lift straight out of Joseph Heller’s Catch-22.

It seems, however, that when it comes to outlining, theorising and speculating over the state, uses, and value of data in 21st century society, the paranoid tinfoil hat wearing Americans and their close predecessor, the Argentinian master of the labyrinth, got there first.

We are all by now familiar with—or have at least likely heard reference to—the surveillance system in operation in 1984: a two-way screen that captures image and sound so that the inhabitants of Orwell’s world are always potentially being watched and listened to. In a post-Snowden era this all-seeing, all-hearing panoptic Orwellian entity has already been referenced to death, and indeed, as Rita Raley points out, Orwell’s two-way screen has long been considered inferior to the “disciplinary and control practice of monitoring, aggregating, and sorting data.”[1] In other words, to the practice of “dataveillance.”[2] But Don DeLillo’s vision of the role data would play in our future was somewhat different, more nuanced, and, most importantly, less overtly classifiable as dystopian; in fact, it reads rather like a description of an assiduous Google search, yet it is to be found in the pages of a book first published in 1985:

It’s what we call a massive data-base tally. Gladney, J.A.K. I punch in the name, the substance, the exposure time and then I tap into your computer history. Your genetics, your personals, your medicals, your psychologicals, your police-and-hospitals. It comes back pulsing stars. This doesn’t mean anything is going to happen to you as such; at least not today or tomorrow. It just means you are the sum total of your data. No man escapes that.[3]

Dataveillance is interesting because its function is not just to record and monitor, but also to speculate, to predict, and maybe even to prescribe. As a result, as Raley points out, its future value is speculative: “it awaits the query that would produce its value.”[4] By value Raley is referring to the economic value this data may have in terms of its potential to influence people to buy and sell things; and so, we have a scenario wherein data is traded in a manner akin to shares or currency, where “data is the new oil of the internet”:[5]

Data speculation means amassing data so as to produce patterns, as opposed to having an idea for which one needs to collect supporting data. Raw data is the material for informational patterns to come, its value unknown or uncertain until it is converted into the currency of information. And a robust data exchange, with so-termed data handlers and data brokers, has emerged to perform precisely this work of speculation. An illustrative example is BlueKai, “a marketplace where buyers and sellers trade high-quality targeting data like stocks,” more specifically, an auction for the near-instant circulation of user intent data (keyword searches, price searching and product comparison, destination cities from travel sites, activity on loan calculators).[6]

This environment of highly sophisticated, near-constant amassing of data leads us back to DeLillo and his observation, made back in 1985, that “you are the sum total of your data.” And this is perhaps the very environment that leads Geoffrey Bowker to declare, in his provocative afterword to the collection of essays ‘Raw Data’ is an Oxymoron (2013), that we as humans are “entering into”, are “being entered” into, “the dataverse.”[7] Within this dataverse, Bowker—who is being self-consciously hyperbolic—claims it is possible to “take the unnecessary human out of the equation,” envisioning a scenario wherein “our interaction with the world and each other is being rendered epiphenomenal to these data-program-data cycles” and one where, ultimately, “if you are not data, you don’t exist.”[8] But this is precisely where we must be most cautious, particularly when it comes to the nascent dataverse of humanities researchers. Because while we might tentatively make the claim to be within a societal dataverse now, the alignment of data with existence and experience is still far from total. We cannot yet fully capture the entirety of the white noise of selfhood.

And this is where things start to get interesting, because what is perhaps dystopian from a contemporary perspective—that is, the presence somewhere out there of near-infinite quanta of data pertaining to you, your preferences, your activities—a scenario that might reasonably lead us to reach for those tinfoil hats, is, conversely, a desirable one from the perspective of historians and other humanities researchers. A data sublime, a “single database fantasy”[9] wherein one could access everything, where nothing is hidden, and where the value, the intellectual, historical, and cultural value, of the raw data is always speculative, always potentially of value to the researcher, and thus amassed and maintained with the same reverence accorded to high-value data traded today on platforms such as BlueKai. Because as it is, the amassing of big data for humanities researchers, particularly when it comes to converting extant analogue archives and collections, subjects the material to a hierarchising process wherein items of potential future (speculative) value are left out or hidden, greatly diminishing their accessibility and altering the richness or fertility of the research landscape available to future scholars. After all, “if you are not data, you don’t exist.”[10] But if you don’t exist, then, to paraphrase Raley, you cannot be subjected to the search or query of future scholars and researchers, the search or query that would determine your value.

As we move towards these data sublime scenarios, it is important not to lose sight of the fact that that which is considered data now, this steadily accumulating catalogue of material pertaining to us as individuals or humans en masse, still does not capture everything. And if this is true now then it is doubly true (ability to resist Orwellian doublespeak at this stage in blogpost = zero) of our past selves and the analogue records that constitute the body of humanities research. How do we incorporate the “not data” in an environment where data is currency?

Happy Day of DH!

[1] Raley, “Dataveillance and Countervailance” in Gitelman ed., “Raw Data” is an Oxymoron, 124.

[2] Roger Clarke, quoted in ibid.

[3] Don DeLillo, White Noise, quoted in Gitelman ed., “Raw Data” is an Oxymoron, 121, emphasis in original.

[4] Raley, “Dataveillance and Countervailance” in ibid., 123–4.

[5] Julia Angwin, “The Web’s New Gold Mine: Your Secrets,” quoted in ibid., 123.

[6] Raley, “Dataveillance and Countervailance” in ibid., 123.

[7] Geoffrey Bowker, “Data Flakes: An Afterword to ‘Raw Data’ Is an Oxymoron” in ibid., 167.

[8] Bowker, in ibid., 170.

[9] Raley, “Dataveillance and Countervailance” in ibid., 128

[10] Bowker, in ibid., 170.

Featured image is a still taken from the film version of 1984.

Beyond a binary conception of data

An awful lot of data exist in tables, the columns (variables) and rows (observations) filled with numerical values. Numerical values are always binary: Either they have a certain value or not (in the latter case they are NAs or NULLs). Computability is based on information stored in a binary fashion coded as either 0 or 1. Statistics is based on this binary logic, even where nominal data are in use. Nominal variables such as gender, race, color, and profession can be measured only in terms of whether the individual items belong to some distinctively different categories. More precisely: They belong to these categories or not.
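This belongs-or-not logic is exactly what statistical software makes explicit when it “dummy-codes” a nominal variable: every category becomes its own yes/no column. A minimal sketch in Python with pandas (the profession values are invented for illustration):

```python
import pandas as pd

# A nominal variable: each observation belongs to exactly one category.
people = pd.DataFrame({"profession": ["baker", "teacher", "baker", "smith"]})

# One-hot ("dummy") encoding makes the binary logic explicit:
# for every category, each row either belongs to it or does not.
dummies = pd.get_dummies(people["profession"], prefix="is")
print(dummies)

# Because each row belongs to exactly one category, each row sums to 1.
print(dummies.sum(axis=1).tolist())  # → [1, 1, 1, 1]
```

The encoding measures nothing but membership; any ordering or similarity between “baker” and “smith” is lost, which is precisely the reduction discussed below.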

A large amount of data, especially on the Internet, consists of unstructured text data (emails, WhatsApp messages, WordPress posts, tweets, etc.). Can texts – or other cultural data like images, works of art, or music – be adapted to a binary logic? In principle, yes, as the example of nominal variables shows; either a word (or an image …) belongs to a certain category or not. Quite a good part of Western thinking follows a binary logic; classical structuralism, for example, is fond of structuring oppositions: good – bad, pure – dirty, raw – cooked, orthodox – heretic, avant-garde – arrière-garde, up-to-date – old-fashioned, etc. The point here is that one has to be careful about which domain these binary oppositions belong to: good – bad is not the same as good – evil.

But the habit of thinking in binary terms narrows the perspective; the use of data according to a binary logic means a reduction. This is particularly evident with respect to texts: the meaning of individual words changes with their context. Another example: a smile can be an expression of sympathy or of uncertainty in the Western world, while in other cultures it may express aggression, confusion, sadness or a social distancing from the other. What can be seen as a ‘smile’ in monkeys is most often an expression of fear – a showing of the canines.

Indian logic provides an example of how to go beyond a binary logic with a figure called the “tetralemma.” While binary systems are based on calculations with 0 and 1 and therefore formulate a dilemma, the tetralemma provides four possible answers to any logical proposition: beyond the logics of 0 and 1, there is both 0 and 1, and neither 0 nor 1. One can even conceive of a fifth answer: none of these at all. Put as a graph and expressed mathematically, the tetralemma would look like this:

[Figure: the tetralemma as a graph]

The word “dawn” is an example of what is at stake in the tetralemma: depending on how you define its meaning, it can be a category of its own; it can fail to fit into any category (because of its ambivalent character); it is both day and night; and it is neither sunshine nor darkness.
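One way to make the four answers concrete is as a small four-valued type, in the spirit of Belnap-style many-valued logics. This Python sketch (the names and the negation table are my own illustration, not a formalization of the Indian tetralemma) shows how “both” and “neither” sit outside the 0/1 dilemma:

```python
from enum import Enum

class Tetra(Enum):
    """Four possible answers to a proposition: 0, 1, both, neither."""
    FALSE = "0"
    TRUE = "1"
    BOTH = "both 0 and 1"
    NEITHER = "neither 0 nor 1"

def tetra_not(v: Tetra) -> Tetra:
    """Negation swaps TRUE and FALSE; BOTH and NEITHER are fixed points."""
    table = {
        Tetra.TRUE: Tetra.FALSE,
        Tetra.FALSE: Tetra.TRUE,
        Tetra.BOTH: Tetra.BOTH,        # still both day and night
        Tetra.NEITHER: Tetra.NEITHER,  # still neither sunshine nor darkness
    }
    return table[v]

# "Dawn" under the proposition "it is day":
dawn = Tetra.BOTH
print(tetra_not(dawn).value)  # → "both 0 and 1"
```

The point of the sketch is that negation no longer flips everything: a value like “dawn” survives negation unchanged, which a two-valued system cannot express.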

One of the few philosophers to point to the narrowing of logic to binary oppositions (TRUE / FALSE) and to underline the many possibilities in language games was Jean-François Lyotard, in his main work Le Différend (Paris: Minuit 1983). In information science, it is only more recently that complex approaches have been developed beyond binary systems which allow for an adequate coding of culture, emotions or human communication. The best examples are ontologies; they can be understood as networks of entities, where the relations between these entities can be defined in multiple ways. A person can be at the same time a colleague in a team, a partner in a company, and a father of another person working in the same company (a visualization of the “friend of a friend” ontology can be found here). Datafication of human signs, be they linguistic, artistic, or part of the historical record, therefore exposes the challenges of data production in particularly evident ways.

Big Data, Little Data, Fabulous Data.

Suzanne Briet, in What Is Documentation?, says that “Documentography is the enumeration and description of diverse documents.”[1] Slightly modified, and paired with a nice little neologism (who can resist neologisms?), I could describe the work I am doing at this stage in the KPLEX Project as “Datamentography” – datamentography, of course, meaning the enumeration and description of diverse data. I’m working to establish what it is we talk about when we talk about data: the established conceptions of what data is among different communities, and the why, how and where that led to the development of the various understandings and conceptions of data active today. Once this has been established, we can use these findings to move towards a new conceptualisation of data.

One of the other passages in Briet that I really like (because it is quite poetic and this is perhaps a little unexpected in a text about documentation standards) is as follows:

Is a star a document? Is a pebble rolled by a torrent a document? Is a living animal a document? No. But the photographs and the catalogues of stars, the stones in a museum of mineralogy, and the animals that are catalogued and shown in a zoo, are documents.[2]

So, this nice descriptive passage outlines a key distinction: the thing itself is not a document, but the material traces of its interactions with humans are; the photos, the specimens, the catalogues, the records (visual, audio, and so on). But how do we capture the richness of these items in a computerised environment? A pebble is one thing, as is a stone in a museum of mineralogy. How to capture the pebble rolling in a torrent? And how to do so in a manner that does not subject the material to an interpretative act that alters how future scholars and researchers approach these records? If all the future scholar can “see” in the online repository is what the person responsible for compiling the repository considered to be important (or codeable), then their interpretative sphere is corrupted (if that is not too dramatic a word) from the outset.

Choice emerges as an implicit facet in this distinction; irrespective of how objective we think we are being, the act of collating information is implicitly subjective. What one person identifies as important (as worthy of documenting, as data) may appear wholly unimportant to someone else, and vice versa. Taste and preference are fine in day-to-day life ("You say tomato, I say tomahto… You say potato, I say vodka…"), but when these inherently human and therefore unavoidable subjective tendencies are let loose on humanities repositories, a hierarchy is imposed on knowledge that reflects the subjective choices of the person who has classified or codified them.

Further still, encoding the thing-ness of things is difficult. In a society that increasingly values and prioritises codified data, if what is readily codeable is prioritised without concordant measures taken to account for the facets of human records and experiences that do not lend themselves so readily to codification, we encounter a scenario wherein that which is not as readily codeable is left out, neglected or even forgotten.

Now, people have gone about defining data in a number of different ways, and almost all are at least a little problematic. Christine Borgman, in her book chapter "What Are Data?" from Big Data, Little Data, No Data, uses an example from the great Argentinian writer Jorge Luis Borges to explain why defining data by example is unsatisfactory. In his essay "The Analytical Language of John Wilkins" Borges presents us with a taxonomy of animals in the form of a Chinese encyclopaedia, the Celestial Emporium of Benevolent Knowledge. In this taxonomy we encounter the following classifications:

a) belonging to the emperor, b) embalmed, c) tame, d) sucking pigs, e) sirens, f) fabulous, g) stray dogs, h) included in the present classification, i) frenzied, j) innumerable, k) drawn with a very fine camelhair brush, l) et cetera, m) having just broken the water pitcher, n) that from a long way off look like flies.

Of course, this list is somewhat absurd, and its absurdity is what makes it funny and what makes Borges so brilliant; but this absurdity should not belie the critique of taxonomic practices that lies at the heart of this so-called "emporium of benevolent knowledge." Let's take a closer look.

Embalmed animals are included because someone once identified them as worthy of embalming, and the act of being embalmed somehow signified something that was worth documenting (in the form of putting the sad creature in a jar of formaldehyde; or rather, of making a record of the fact that this creature has been stored in formaldehyde). Similarly, some animals are included merely because they are already in the system ("included in the present classification"): simply because they are already there and it is easy to carry them over and keep them incorporated. In this way long-established practices are maintained, simply because they are long established and not necessarily because they are effective (hello metadata, you cheeky old fox).

In What is documentation? Briet charts the sad odyssey of an “antelope of a new kind […] encountered in Africa by an explorer who has succeeded in capturing an individual that is then brought back to Europe for our Botanical Garden”[3]:

A press release makes the event known by newspaper, by radio, and by newsreels. The discovery becomes the topic of an announcement at the Academy of Sciences. A professor at the Museum discusses it in his courses. The living animal is placed in a cage and catalogued (zoological garden). Once it is dead, it will be stuffed and preserved (in the Museum). It is loaned to an Exposition. It is played on a soundtrack at the cinema. Its voice is recorded on a disk. The first monograph serves to establish part of a treatise with plates, then a special encyclopaedia (zoological), then a general encyclopaedia. The works are catalogued in a library; after having been announced at publication [et cetera].[4]

So we have the genesis here of the thing itself—a "pure or natural object with an essence of [its] own"[5]—from its capture and discovery through to its death, when it is stuffed (akin to Borges's embalming) and a process is initiated wherein it is catalogued and subjected to extensive "documentography" according to the established taxonomy of "Homo documentator."[6] "Homo documentator" creates detailed portraits of the creature (perhaps using a very fine camelhair brush, as in Borges's encyclopaedia) for inclusion in the classificatory system, records its unique markings, the sound of its voice, whatever aspects of the creature's essence can be readily captured. Once in the system, whether by means of artistic plates outlining specifics of the species, or in the form of photographs and sound recordings, and so on, it becomes a de facto document (de facto data), and its documentability is exhausted only when the taxonomical system employed by the documentalist has itself been exhausted.

But who has designed this taxonomical system? Who is responsible for deciding which facets of the antelope are important and which are not? Are items perhaps ever considered important solely because they are facile to document? And who is to say that this same sad and now stuffed antelope could not also be classified as fabulous (or, once, fabulous, had you encountered it in the wild)? Further still, surely this creature, like all creatures when viewed from a certain perspective, could be included in Borges's category for animals that "from a long way off look like flies." The point is that the system of taxonomy is not objective, and our conceptions of the facets that are important or unimportant can be, and have been, influenced by the hierarchies imposed upon them by the person or persons responsible for compiling them.

Borgman, in "What Are Data?", refers us to the Open Archival Information System (OAIS) for a definition of data that, once again, uses examples:

Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.[7]

This, like most definitions of data, seems relatively reasonable at first: naturally the characters on a page are going to qualify as data, and so, if they do, they are or can be encoded as such. But what about the page itself? What about the materials on the page that do not qualify as characters? What about doodles? Pen-tests? Scribbles, drawings, additions and other contextually specific paralipomena? How do we encode these? And, if we decide not to, why do we decide not to, and who—if anyone—holds us accountable for that decision? Because such a decision, inconsequential though it may seem at first, could affect and limit future scholars.

And this is what I am attempting to tease out as part of my contribution to the KPLEX Project: why and how did certain conceptions of data become acceptable or dominant in certain circles, and, going forward, as we move towards bigger data infrastructures for the humanities, is there a way for us to ensure that the thing itself, in all its complex idiosyncratic fabulousness, remains visible, and available to the researcher?

[1] Briet et al., What Is Documentation?, 24.

[2] Ibid., 10.

[3] Ibid., 10.

[4] Ibid.

[5] Borgman, “Big Data, Little Data, No Data,” 18.

[6] Briet et al., What Is Documentation?, 29.

[7] Borgman, “Big Data, Little Data, No Data,” 20, emphasis in original.

The Featured Image was borrowed from Natascha Schwarz’s illustrated edition of Borges’s Book of Imaginary Beings: https://www.behance.net/gallery/10823485/Jorge-Luis-Borges-Book-of-Imaginary-Beings

Is there an identity crisis of statistics?

It is not without irony that statistics currently seems to be living through an identity crisis, since this discipline is often named the "science of uncertainty". If there is an identity crisis at all, in what way can it be conceived of? And why did this crisis come now – is there a nexus between the rise of 'big data' and algorithms and the development of the discipline? Three developments can be identified that make the hypothesis of an identity crisis of statistics more tangible.

  • A crisis of legitimacy: Statistics' findings are supposed to result from value-free and transparent methods; but this is not (and maybe never has been) something that can easily be communicated to the broad public. Hence, in the age of 'big data', a loss of legitimacy: the more complex the collection of data, the statistical methods and the presented results are, the more disorienting an average person may find them, especially given a steep increase in the sheer quantity of statistical findings. Even for citizens who are willing to occupy themselves with data, statistics, and probability, Nobel laureate Daniel Kahneman has underlined the counterintuitive and intellectually challenging character of statistics (see his book "Thinking, Fast and Slow" or the classic article "Judgment under Uncertainty"). These peculiarities of the discipline lower the general public's trust in it: "Whom should I believe?"
  • A crisis of expertise: Statistics has become part of a broad range of scientific disciplines, far beyond mathematics. But the acquisition of competence in statistics quite obviously has its limits. As Gerd Gigerenzer pointed out 13 years ago, "mindless statistics" has become a custom and ritual in sciences such as psychology. In recent years, this crisis of expertise has been termed the crisis of reproducibility (for data from a previous publication) or replicability (for data from an experiment); the renowned journal "Nature" devoted an article series to this problem in 2014, with a focus on, for example, the use of p-values in scientific arguments. The report of the 2013 London Workshop on the Future of the Statistical Sciences is outspoken on this problem, and there is even a Wikipedia article on the crisis of replicability. Statisticians defend themselves by pointing to these scientists' lack of training in statistics and computation [see Jeff Leek's recent article here], but quite obviously this crisis of expertise undermines the credibility of scientists as experts.
  • A crisis of the societal function of the discipline: Statistics as a scientific discipline established itself alongside the rise of nation-states; hence its close connection to national economies and the data collected across large populations. As has been explained in a "Guardian" article posted earlier in this blog, statistics served as the basis of "evidence-based policy", and statisticians were seen by politicians as a caste of scientific pundits and consultants. But this has changed completely: nowadays big data are the assets of globalised companies which act across the borders of nation-states. This points to a shift in the core societal function of statistics: no longer serving politics and hence the nation, but global companies and their interests. Statistics leaves representative democracy behind, and it has become unclear how the benefits of digital analytics might ever be offered to the public. Even if the case is still obscure, the possible role of "Cambridge Analytica" in the U.S. presidential election campaign shows that the privatisation of expertise can be turned against the interests of a nation's citizens.

Data scales in applied statistics – are nominal data poor in information?

An earlier blog post by Jennifer Edmond on "Voluptuousness: The Fourth 'V' of Big Data?" focused on cultural data steeped in meaning; proverbs, haikus and poetic language are amongst the best examples of this kind of data.

But computers are not (yet) good at understanding human languages. Nor is statistics – it simply conceives of words as nominal variables. Quite in contrast to an understanding of cultural data as described by Jennifer Edmond, applied statistics regards nominal variables as the ones with the least density of information. This becomes obvious when these data are classified amongst other variables in scales of measurement. The scale (or level) of measurement refers to assigning numbers to a characteristic according to a defined rule. The particular scale of measurement a researcher uses determines what types of statistical procedures (or algorithms) are appropriate. It is important to understand the nominal, ordinal, interval, and ratio scales; for a short introduction to the terminology, follow this link. Seen in this context, nominal variables belong to qualitative data and are classified into distinctively different categories; in contrast to ordinal variables, these categories cannot be ranked, let alone quantified. The functions of the different scales are shown in the following graph:

[Figure: the four scales of measurement (nominal, ordinal, interval, ratio) and their functions]

Here it becomes visible that words are classified as nominal variables: they belong to distinctively different categories, but those categories cannot be quantified or even ranked; there is no meaningful order among the choices.
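The distinction between the four scales can be sketched in code. This is a minimal, illustrative sketch (the variables and values are invented, not from the post): each scale is shown together with the operations that are, and are not, meaningful on it.

```python
# Nominal: categories only. We can test equality and count frequencies,
# but any ordering (e.g. alphabetical) is arbitrary, not meaningful.
colours = ["red", "blue", "red", "green"]
frequencies = {c: colours.count(c) for c in set(colours)}  # counting is fine

# Ordinal: categories with a meaningful rank, but no meaningful distances.
RANKS = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
assert RANKS["good"] > RANKS["fair"]  # comparison is meaningful
# (RANKS["good"] - RANKS["fair"]) is NOT a meaningful quantity.

# Interval: equal distances, but no true zero (e.g. temperature in Celsius).
temp_diff = 30 - 20  # a 10-degree difference is meaningful,
# but "30 °C is 1.5 times as hot as 20 °C" is not.

# Ratio: a true zero, so ratios are meaningful (e.g. word counts).
ratio = 300 / 150  # one text really can be twice as long as another
```

Words, as the post notes, sit on the bottom rung of this ladder: only the nominal operations (equality, counting) apply to them.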

This has the consequence that, in order to be able to compute with words, numeric values are attributed to them. In Sentiment Analysis, for example, the word "good" can receive the value +1, while the word "bad" receives a -1. Now they have become computable; words are thus transformed into values, and it is exactly this process which reduces their "voluptuousness" and robs them of their polysemy; just binary values remain.
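A minimal lexicon-based sentiment scorer makes this reduction concrete. The lexicon and its scores here are invented for illustration: every word collapses to a single number, and everything outside the lexicon collapses to zero.

```python
# Hypothetical sentiment lexicon: each word is reduced to a single value.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def sentiment_score(text: str) -> int:
    """Sum the lexicon values of the words in `text`; unknown words score 0."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

# "good" (+1) and "awful" (-1) cancel out; irony, nuance and polysemy
# have already been discarded by the time the arithmetic happens.
score = sentiment_score("The plot was good but the acting was awful")
```

This is exactly the trade described above: the words become computable, but only because their meaning has first been flattened into binary values.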

To my knowledge, the most significant endeavour to provide a more complex measurement of linguistic data has been undertaken in the field of psychology: in their book "The Measurement of Meaning", Charles E. Osgood, George J. Suci, and Percy H. Tannenbaum developed the "semantic differential", a scale used for measuring the meaning of things and concepts in a multitude of dimensions (for a contemporary approach see, for example, http://www.semanticdifferential.com/). But imagine each word of a single language measured – with all its connotations and denotations, each in its different functions, in the context of the other words around it … not to speak of figurative language such as metaphors, irony, and sarcasm.
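The semantic-differential idea can be sketched as follows. Here a word is rated on several bipolar adjective scales (a -3 to +3 range, as in Osgood et al.), yielding a multi-dimensional profile instead of a single binary value; the scales chosen and the ratings are invented for illustration.

```python
import math

# Hypothetical bipolar scales, each rated from -3 to +3.
SCALES = ["good-bad", "strong-weak", "active-passive"]

# Invented example profiles: a word is a vector, not a single number.
ratings = {
    "storm":  [-1,  3,  3],   # slightly bad, very strong, very active
    "breeze": [ 2, -2,  1],   # fairly good, fairly weak, slightly active
}

def semantic_distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between two profiles: smaller = closer in meaning."""
    return math.dist(ratings[word_a], ratings[word_b])

distance = semantic_distance("storm", "breeze")
```

Even this richer representation only hints at the problem the post closes on: a real language would need such a profile for every word, in every context, before figurative language is even considered.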

This is why words – and cultural, voluptuous data in a broader sense – are so difficult to compute; and why they are the next big computational challenge.