Has anyone ever analyzed big data classifications for their political or cultural implications?

If we think of big data, we don’t think of unstructured text data. Mostly we don’t even think of (weakly) structured data like XML – we conceive of big data as being organized in tables, with columns (variables) and rows (observations); lists of numbers with labels attached. But where do the variables come from? Variables are classifiers; and a classification, as Geoffrey Bowker and Susan Leigh Star have put it in their inspiring book “Sorting Things Out”, is “a spatial, temporal, or spatio-temporal segmentation of the world. A ‘classification system’ is a set of boxes (metaphorical or literal) into which things can be put to then do some kind of work – bureaucratic or knowledge production. In an abstract, ideal sense, a classification system exhibits the following properties: 1. There are consistent unique classificatory principles in operation. […] 2. The categories are mutually exclusive. […] 3. The system is complete.”[1] Bowker and Star describe classification systems as invisible, erased by their naturalization into the routines of life; in them, conflict, contradiction, and multiplicity are often buried beneath layers of obscure representations.
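To make the second and third of these ideal properties tangible, here is a minimal sketch in Python. The age bands and the `audit` helper are invented purely for illustration and come from no real dataset or library; the point is only that an innocent-looking set of boxes can quietly fail to be mutually exclusive or complete.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical classification system: each category is a rule (predicate)
# that decides whether an item belongs in that "box".
Category = Callable[[int], bool]


def classify(item: int, categories: Dict[str, Category]) -> List[str]:
    """Return the names of every category the item falls into."""
    return [name for name, belongs in categories.items() if belongs(item)]


def audit(
    items: Iterable[int], categories: Dict[str, Category]
) -> Tuple[Dict[int, List[str]], List[int]]:
    """Check mutual exclusivity and completeness over the given items.

    overlaps: items that landed in more than one box (not mutually exclusive)
    gaps:     items that landed in no box at all (system not complete)
    """
    overlaps: Dict[int, List[str]] = {}
    gaps: List[int] = []
    for item in items:
        boxes = classify(item, categories)
        if len(boxes) > 1:
            overlaps[item] = boxes
        elif not boxes:
            gaps.append(item)
    return overlaps, gaps


# Assumed, deliberately sloppy age bands: 65 is both "adult" and "senior".
age_bands: Dict[str, Category] = {
    "child": lambda age: age < 18,
    "adult": lambda age: 18 <= age <= 65,
    "senior": lambda age: age >= 65,
}

overlaps, gaps = audit([5, 18, 65, 90], age_bands)
print("overlapping:", overlaps)
print("unclassified:", gaps)
```

An audit like this can only test the boxes against the items it is given; it says nothing about who drew the boundaries or why, which is the question Bowker and Star raise.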

Humanists are well accustomed to this kind of approach in general and to classifications and their consequences in particular; classification is, so to say, the administrative part of philosophy and, also, of disciplines like history or anthropology. Any humanist who has occupied himself with Saussurean structuralism or Derrida’s deconstruction knows that each term (and potential classifier) is inseparably linked to other terms: “the proper name was never possible except through its functioning within a classification and therefore within a system of differences,” Derrida writes in “Of Grammatology”.[2] Proper names, as indivisible units, thus form residual categories. In the language of data-as-tables, proper names would correspond to variables, and where they don’t form a variable, they would be sorted into a ‘garbage category’ – the infamous and ubiquitous “other” (“if it is not that and that and that, then it is – other”). Garbage categories are those columns where things get put that you do not know what to do with. But in principle, garbage categories are okay; they can signal uncertainty at the level of data collection. As Derrida’s work reminds us, we must be aware of exclusions, even if they may be explicit.
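The ‘garbage category’ can be sketched in a few lines of Python. The occupation list and the responses below are hypothetical, made up only to show the mechanism: whatever the coding scheme does not name is folded into “other”, and the size of that box is itself a signal worth recording.

```python
from collections import Counter

# Hypothetical coding scheme: the only occupations the table "knows".
KNOWN = {"farmer", "teacher", "nurse"}


def recode(value: str) -> str:
    """Map a free-text answer to a known category, else to 'other'."""
    cleaned = value.strip().lower()
    return cleaned if cleaned in KNOWN else "other"


# Invented free-text responses, as they might arrive from a survey.
responses = ["Farmer", "midwife", "teacher", "carer", "poet", "Nurse "]

counts = Counter(recode(r) for r in responses)
other_share = counts["other"] / len(responses)

# Keep a record of what the garbage category swallowed, so the
# exclusion stays visible instead of disappearing into a label.
swallowed = sorted(r.strip().lower() for r in responses if recode(r) == "other")

print(counts)
print(f"share coded as 'other': {other_share:.0%}")
print("swallowed by 'other':", swallowed)
```

Keeping the `swallowed` list alongside the counts is one simple way of making an exclusion explicit rather than silent.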

The most famous scholar who has occupied himself with implicit exclusions in historical perspective is certainly Michel Foucault. In examining the formative rules of powerful discourses as exclusion mechanisms, he analyzed how power constitutes the “other”, and how standards and classifications are suffused with traces of political and social work. In the chapter “The Birth of the Asylum” of his book “Madness and Civilization”,[3] for example, he describes how the categories of ‘normal’ and ‘deviant’, and classifications of forms of ‘deviance’, go hand in hand with forms of treatment and control. Something which is termed ‘normal’ is linked to modes of conduct, to standards of action and behavior, and to judgments on what is acceptable and unacceptable. In his conception, the ‘other’ or ‘deviant’ is to be found outside of the discourse of power and therefore does not take part in the communication between the powerful and the ‘others’. Categories and classifications created by the powerful justify diverse forms of treatment by individuals in professional capacities, such as physicians, and in political offices, such as gaolers or legislators. The equivalent to the historical setting analyzed by Foucault can nowadays be found in classification systems like the International Classification of Diseases (ICD) or the Diagnostic and Statistical Manual (DSM). Doctors, epidemiologists, statisticians, and medical insurance companies work with these classification systems, and certainly there are equivalent ‘big data’ containing these classifications as variables. Not only are the ill excluded from taking part in the establishment of classifications; so are other medical cultures and their systems for classifying diseases. Traditional Chinese medicine is a prominent example here, and the historical (and later psychoanalytic) conception of hysteria had to undergo several major revisions before it found its present entries as “dissociation” (F44) and “Histrionic personality disorder” (F60.4) in the ICD-10. Here Foucault’s work reminds us of the fact that power structures are implicit and thus invisible. We are admonished to render classifications retrievable, and to include the public in policy participation.

A third example that may come to the humanist’s mind when thinking of classifications is the anthropologist Mary Douglas. In her influential work “Purity and Danger”, she outlines the inseparability of seemingly distinct categories. One of her examples is the relation of sacredness and pollution: “It is their nature [of religious entities, J.L.] always to be in danger of losing their distinctive and necessary character. The sacred needs to be continually hedged in with prohibitions. The sacred must always be treated as contagious because relations with it are bound to be expressed by rituals of separation and demarcation and by beliefs in the danger of crossing forbidden boundaries.”[4] Sacredness and pollution are thus in permanent tension with each other and create their distinctiveness out of a permanent process of exchange. This process undermines classification – with Douglas, classifiers have to run over into each other (as is the case with the 5-point Likert scale), or classification systems have to be conceived of as heterogeneous lists or parallel different lists. Douglas’ work thus reminds us of the need for the incorporation of ambiguity and of leaving certain terms open for multiple definitions.

Seen from a humanist’s point of view, big data classifications feign objectivity, insofar as they help us to forget their political, cultural, moral, or social origins, as well as their constructedness. It lies beyond the tasks of the KPLEX project to analyze big data classifications for their implications; but if anybody is aware of relevant studies, references are most welcome. What we can all do, however, is constantly remind ourselves that there is no such thing as an unambiguous, uniform classification system implemented in big data.

[1] Geoffrey C. Bowker, Susan Leigh Star, Sorting Things Out. Classification and Its Consequences, Cambridge, MA / London, England: The MIT Press 1999, p. 10/11.

[2] Jacques Derrida, Of Grammatology. Translated by Gayatri Chakravorty Spivak, Baltimore and London: The Johns Hopkins University Press 1998, chapter “The Battle of Proper Names”, p. 120.

[3] Michel Foucault, Madness and Civilization: History of Insanity in the Age of Reason, translated by Richard Howard, New York: Vintage Books 1988 (first print 1965), pp. 244ff.

[4] Mary Douglas, Purity and Danger. An Analysis of the concepts of pollution and taboo, London and New York: Routledge 2001 (first print 1966), p. 22.

Ode to Spot: We need to talk about Data (and narrative).

We need to talk about data. And narrative. In fact, data and narrative need to talk to each other, work some issues out, attend relationship counselling, try to recapture that “spark,” that “special something” that kept bringing them together, that has made them, at times, seem inseparable, but also led to some pretty fiery clashes.

So, what’s the deal? What is the relationship between data and narrative? What role does narrative play in our use of data? What role does data play in our fashioning of narrative? How much of what we have to say about each is determined by pre-established notions we have about either one of these entities? Why did I instinctively opt, for example, while writing the previous two sentences, to refer to data as something that is “used” and narrative as something that is “fashioned”? And further still is it correct to refer to them as wholly distinct? Can we have a narrative that is bereft of data? And are data or datasets wholly bereft of narrative?

Data and narrative are presented by some as being irreconcilable or antithetical. Lev Manovich presents them as “natural enemies”[1] whereas Jesse Rosenthal, speaking in the context of the relationship between fictional narratives and data, observes how “the novel form itself is consistently adept at expressing the discomfort that data can produce: the uncertainty in the face of a central part of modern existence that seems to resist being brought forward into understanding.”[2] Todd Presner argues that data and narrative exist in a proto-antagonistic relationship wherein narrative begat data, and data begat narrative. I use antagonistic here in the sense of musculature, with the relationship between narrative and data being analogous to why you’re not able to flex your biceps and your triceps at the same time, because for one to flex, the other must relax or straighten.

Presner situates database and narrative as being at odds or somehow irreconcilable

because databases are not narratives […] they are formed from data (such as keywords) arranged in relational tables which can be queried, sorted, and viewed in relation to tables of other data. The relationships are foremost paradigmatic or associative relations […] since they involve rules that govern the selection or suitability of terms, rather than the syntagmatic, or combinatory elements, that give rise to narrative. Database queries are, by definition, algorithms to select data according to a set of parameters.[3]

So databases are not narratives, and while narratives can contain (or be built on) data, they are not data-proper. This means there is a continual transference between data and narrative in either direction, a transference that is all the more explicit and controversial in the transition from an analogue to a digital environment. This transition, the extraction of data from narrative, or the injection of data into narrative, is a process that has significant ethical and epistemological implications:

The effect is to turn the narrative into data amenable to computational processing. Significantly, this process is exactly the opposite of what historians usually do, namely to create narratives from data by employing source material, evidence, and established facts into a narrative.[4]

Rosenthal also presents data and narrative as operating in this interrelated but tiered manner, with narrative being built on data, or data serving as the building blocks of narrative. And while Rosenthal focuses on fictional narratives, this is the case irrespective of whether the narrative in question is fictional or non-fictional because, after all, non-fictional narrative is still narrative.[5] Whereas Presner focuses on the complications surrounding the relationship between narrative and data in digital environments, Rosenthal’s engagement is more open to and acknowledging of the established and dynamic nature of the relationship between narrative and data in literature: “Narrative and data play off against each other, informing each other and broadening our understanding of each.”[6]

Data and narrative could be said to exist in a dynamic, dyadic relationship then. Indeed, Kathryn Hayles argues that data and narrative are symbiotic and should be seen as “natural symbionts.”[7] So their relationship is symbiotic, rather than antagonistic; they intermingle and their relationship is mutually beneficial, with data perhaps adding credence to narrative (fictional or otherwise) and narrative helping us understand data by making clear to us what the data is saying, or has the capacity to say (in the eyes of the person working with it). That said, if they are symbionts, what is the ratio of their intermingling? Is it possible for a narrative to become data-heavy or data-saturated? Does this impede the narrative from being narrative? Would a data-driven narrative read something along the lines of Data’s poem to his pet cat Spot from Star Trek: The Next Generation (TNG) Season Six, Episode Five:

Ode To Spot

Felis catus is your taxonomic nomenclature,

An endothermic quadruped, carnivorous by nature;

Your visual, olfactory, and auditory senses

Contribute to your hunting skills and natural defenses.

I find myself intrigued by your subvocal oscillations,

A singular development of cat communications

That obviates your basic hedonistic predilection

For a rhythmic stroking of your fur to demonstrate affection.

A tail is quite essential for your acrobatic talents;

You would not be so agile if you lacked its counterbalance.

And when not being utilized to aid in locomotion,

It often serves to illustrate the state of your emotion.

O Spot, the complex levels of behaviour you display

Connote a fairly well-developed cognitive array.

And though you are not sentient, Spot, and do not comprehend,

I nonetheless consider you a true and valued friend.

This “ode” is an example of a piece of writing so data-laden[8] that the momentum of the narrative is hampered, or rather the lyricism necessary to bring Data’s sweet ode to his cat into Shakespeare territory is seriously lacking. And what do I mean by lyricism? Well, the answer to that is relatively simple, just take a look at Shakespeare’s “Sonnet 18”:

Shall I compare thee to a summer’s day?

Thou art more lovely and more temperate.

Rough winds do shake the darling buds of May,

And summer’s lease hath all too short a date.

Sometime too hot the eye of heaven shines,

And often is his gold complexion dimmed;

And every fair from fair sometime declines,

By chance, or nature’s changing course, untrimmed;

But thy eternal summer shall not fade,

Nor lose possession of that fair thou ow’st,

Nor shall death brag thou wand’rest in his shade,

When in eternal lines to Time thou grow’st.

     So long as men can breathe, or eyes can see,

     So long lives this, and this gives life to thee.

In contrast to the Data-ode, and to the lyrical Shakespearean sonnet, would a narrative that is almost entirely bereft of data (and arguably also bereft of narrative, but let’s not go there) read something like a Trump rally speech?

A few days ago I called the fake news the enemy of the people. And they are. They are the enemy of the people.

(APPLAUSE)

Because they have no sources, they just make ’em up when there are none. I saw one story recently where they said, “Nine people have confirmed.” There’re no nine people. I don’t believe there was one or two people. Nine people.

And I said, “Give me a break.” Because I know the people, I know who they talk to. There were no nine people.

But they say “nine people.” And somebody reads it and they think, “Oh, nine people. They have nine sources.” They make up sources.

They’re very dishonest people. In fact, in covering my comments, the dishonest media did not explain that I called the fake news the enemy of the people. The fake news. They dropped off the word “fake.” And all of a sudden the story became the media is the enemy.

They take the word “fake” out. And now I’m saying, “Oh no, this is no good.” But that’s the way they are.

So I’m not against the media, I’m not against the press. I don’t mind bad stories if I deserve them.

And I tell ya, I love good stories, but we don’t go…

(LAUGHTER)

I don’t get too many of them.

But I am only against the fake news, media or press. Fake, fake. They have to leave that word.

I’m against the people that make up stories and make up sources.

They shouldn’t be allowed to use sources unless they use somebody’s name. Let their name be put out there. Let their name be put out.

(APPLAUSE)

“A source says that Donald Trump is a horrible, horrible human being.” Let ’em say it to my face.

(APPLAUSE)

Let there be no more sources.[9]

And now, by way of apology and for some brief respite, I offer you a meme of Data.

But are these our only options? Are narrative and data really at odds in this way? Is there a way to reconcile narrative and the database? Perhaps it is time to stop thinking of data and narrative as being at odds with each other; perhaps it is necessary to break down this dyad and facilitate better integration?

Traditionally, narrative-driven criticism took the form of “retelling,” what Rosenthal calls an “artful,” or “opinionated reshaping” of the underlying evidence (aka the data), whereas more contemporary data-driven criticism largely takes the form of visualisations that attempt to, as Rosenthal puts it, “let the data speak for itself, without mediation.”[10] This turn to the visual is driven by a hermeneutic belief akin to Ellen Gruber Garvey’s assertion that “Data will out.”[11] But this is something of a contradiction in terms, considering that elsewhere we are told (by Daniel Rosenberg) that data has a “pre-analytical, pre-factual status,”[12] that data is an entity “that resists analysis,”[13] but can also be “rhetorical,”[14] that “False data is data nonetheless”[15] and that “Data has no truth. Even today, when we speak about data, we make no assumptions about veracity.”[16] Borgman elaborates, stating that “Data are neither truth nor reality. They may be facts, sources of evidence, or principles of an argument that are used to assert truth or reality.”[17] That’s a lot of different data on data.

Fictional narratives can be built on supposedly reputable data; this helps readers to suspend their disbelief and “believe in” the fictions they encounter within the narrative. Supposedly non-fictional narratives, such as presidential speeches, can be based on tenuously obtained or fabricated data, or can make reference to data that is not proffered and may not even exist, rather like dressing a corgi up in a suit and asking it for a political manifesto.

What we’ve looked at today concerns our evolving discomfort with, and our difficulty in outlining, the relationship between narrative and data. In the interplay between analogue and digital, different factions emerge regarding this relationship, with narrative and data variously presented as being anathematic, antagonistic, or symbiotic, with data presented as something one can be either “for” or “against”, and with distinct preferences for one or the other (either narrative or data) being shown on a discipline-specific, researcher-specific, or author-specific level. At the same time, irrespective of which of these positions you adopt, it is clear that data and narrative are intricately linked and deeply necessary to each other. The question is then, how can one best facilitate and elucidate the other in a digital environment?

[1] Lev Manovich, The Language of New Media, 2002, 228.

[2] Jesse Rosenthal, “Introduction: ‘Narrative against Data,’” Genre 50, no. 1 (April 1, 2017): 2, doi:10.1215/00166928-3761312.

[3] Presner, in Fogu, Claudio, Kansteiner, Wulf, and Presner, Todd, Probing the Ethics of Holocaust Culture, History Unlimited (Cambridge: Harvard University Press, 2015), http://www.hup.harvard.edu/catalog.php?isbn=9780674970519.

[4] Presner, in ibid.

[5] “Yet the narrative relies for its coherence on our unexamined belief that a preexisting series of events underlies it. While data depends on a sense of irreducibility, narrative relies on a fiction that it is a retelling of something more objective. […] The coherence of the novel form, then, depends on making us believe that there is something more fundamental than narrative.” Rosenthal, “Introduction,” 2–3.

[6] Ibid., 4.

[7] N. Katherine Hayles, “Narrative and Database: Natural Symbionts,” PMLA 122, no. 5 (2007): 1603.

[8] And of course it’s data-laden, it was composed by Data, a Soong-type android, so basically a walking computer, albeit a mega-advanced one.

[9] “Transcript of President Trump’s CPAC Speech,” http://time.com/4682023/cpac-donald-trump-speech-transcript/

[10] Rosenthal, “Introduction,” 4.

[11] Ellen Gruber Garvey, “‘facts and FACTS:’ Abolitionists’ Database Innovations,” Gitelman, “Raw Data” Is an Oxymoron, 90.

[12] Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” is an Oxymoron, 18.

[13] Rosenthal, “Introduction,” 1.

[14] Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” Is an Oxymoron, 18.

[15] Rosenberg, “Data before the Fact,” in ibid.

[16] Rosenberg, “Data before the Fact,” in ibid., 37.

[17] Christine L. Borgman, “Big Data, Little Data, No Data,” MIT Press, 17, accessed April 7, 2017, https://mitpress.mit.edu/big-data-little-data-no-data. 

Will Big Data Render Theory Dispensable? (Part II)

There can be no doubt that the availability of big data and the emergence of big data analytics challenge established epistemologies across the academy. Debates about the datafication of humanities’ research materials – including debates about whether digitisation projects should be countenanced at all – are disputes about the shape of our culture which challenge our understanding of human knowledge and of knowledge production systems.

In a recent article on the “big data” turn in literary studies, Jean-Gabriel Ganascia addresses the epistemological status of Digital Humanities, asking if Digital Humanities are “Sciences of Nature” or “Sciences of Culture?” [1] Regarding similar concerns, Matthew L. Jockers, a proponent of techniques of quantitative text analysis of the literary record, writes, “(e)nough is enough. The label does not matter. The work is what matters, the outcomes and what they tell us about… are what matters.” [2] But labels do matter; they matter a great deal, and Ganascia’s question is an important one, the answer to which reveals much about the underlying epistemology of Digital Humanities research.

As the basis of all information and knowledge, data is often defined as fact that exists irrespective of our state of knowing. Rob Kitchin has addressed the explosion of big data and an emergent fourth paradigm in science that follows from such definitions of data as the zero sum of all information and knowledge. This fourth paradigm, labelled New Empiricism or Empiricism Reborn, claims that Big Data and new data analytics usher in a new era of knowledge production free from theory, human bias or framing and accessible to anyone with sufficient visual literacy. This is the position advocated by Chris Anderson, former editor-in-chief at Wired magazine.[3]

However, as Kitchin observes, data does not arise in a vacuum: “all data provide oligoptic views of the world: views from certain vantage points, using particular tools, rather than an all-seeing, infallible God’s eye view”. Whilst the end of theory may appear attractive to some, the empiricist model of knowledge creation is based on fundamental misconceptions about the formation and analysis of computational data.

There is an alternative to new forms of empiricism, labelled “data-driven science”, which Kitchin describes as follows: “more open to using a hybrid combination of abductive, inductive and deductive approaches to advance the understanding of a phenomenon… it forms a new mode of hypothesis generation before a deductive approach is employed. Nor does the process of induction arise from nowhere, but is situated and contextualized within a highly evolved theoretical domain.”

So, when Jockers writes that those who study texts must follow the example of Science and scale their methods according to the realities of big data, which method (or label) is he referring to?  One might infer from his reference in the opening pages of his influential Macroanalysis: Digital Methods & Literary History to Anderson’s 2008 article that it is the former empiricist model.  However, in the article referred to in the opening lines of this post, Jockers refers to his approach as “data-driven”.  Are we to assume that this is a reference to data-driven scientific method? Rightly or wrongly, knowledge production is not impartial and we must always be mindful of the epistemologies to which we align ourselves as they will often reveal as much about our research as the information or data contained therein.

If Digital Humanists are said to be “nice” because we eschew theoretical debates in favour of the more easily resolved methodological ones, then I’d rather not be thought of as a “nice” Digital Humanist.  I’d rather be responsible, argumentative or even wrong.  But nice? Biscuits are nice.

[1] Ganascia, J., (2015) “The Logic of the Big Data Turn in Digital Literary Studies.” In Frontiers in Digital Humanities, 02 Dec. 2015 

[2] Jockers, M., (2015) “Computing Ireland’s Place in the Nineteenth-Century Novel: A Macroanalysis.” In Breac: A Digital Journal of Irish Studies, 07 Oct. 2015.

[3] Kitchin, R., (2014) “Big Data, New Epistemologies and Paradigm Shifts.” In Big Data & Society, April—June 2014: 1-12. 

Will Big Data render theory dispensable?

Scientific theories have been crucial for the development of the humanities and social sciences. Metatheories such as classical social evolution, cultural diffusion, functionalism or structuralism, for example, guided early anthropologists in their research process. Postmodern theorists rightly criticized their predecessors for, among other things, their deterministic theoretical models. Their criticism, however, was still based on theoretical reflections, although many tried to reduce their theoretical bias by combining several perspectives and theories (cf. theory triangulation).

Whereas it was common in the humanities to keep track of “disproven or refuted theories” there could be a trend among proponents of a new scientific realism to put the blinkers on and solely focus on progress towards a universal, objective and true account of the physical world. Even worse, theory could be discarded altogether. Big data might revolutionise the scientific landscape. From the point of view of the physicist Chris Anderson the “end of theory” is near: “Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves”.

This approach towards science might gain ground. Digital humanists are said to be “nice”, due to their concern with method rather than theory. Methodological debates in the digital humanities seem to circumnavigate more fundamental epistemological debates on principles. But big data is not self-explanatory. Explicitly or implicitly, theory plays a role when it comes to collecting, organizing and interpreting information: “So it is one thing to establish significant correlations, and still another to make the leap from correlations to causal attributes” (Bollier 2010). Theory matters for making the semantic shift from information to meaning.

In order to understand the process of knowledge production we must keep an eye on the mutually constitutive spheres of theory and practice. In the era of big data Bruno Latour’s conclusion: “Change the instruments, and you will change the entire social theory that goes with them” is hence more important than ever.


Data and their visualization

When did data enter mankind’s history? If you asked an archaeologist or a historian, they might answer: about 35,000 years ago, with cave paintings. The reason they are regarded as data lies in their symbolic dimension: cave paintings may be not mere representations of what was observed by the people who produced them; they may also have been designed for religious or ceremonial purposes. Depictions of men dancing around a slain animal can thus be interpreted not only as human beings performing a spontaneous dance of victory but also as illustrations of rituals and cosmologies. Sacred knowledge written on ceilings of locations not inhabited by men.

Statisticians (and data scientists) see these things differently. For them, data are clearly bound to abstract signs. In spite of the much older presence of hieroglyphs – this peculiar Egyptian mixture of pictograms and abstractions –, they would point to Phoenician letters providing recipes for beer, with numbers provided for amounts of barley; or to Sumerian calculations scratched into wet clay, which was subsequently fired and preserved as ostraca. Knowledge about intoxicating beverages and bargaining. Data, in this understanding, is connected to numerical values and not to the symbolic values which make up the surplus of the cave paintings. Moreover: while figurative drawings found in caves are visualizations in themselves, a ‘modern’ understanding of data points to maps as the beginning of data visualization (400 BC), then to star charts (150 BC) and to a mechanical knowledge diagram drawn in Spain in 1305. Knowledge about the world and about knowledge itself. Canadian researcher Michael Friendly, who charted the history of data visualization in his Milestones project, sees Michael Florent van Langren’s 1644 representation of longitudes as the first (known) statistical graph.

We see here not only differing conceptions of data, but of what a visualization of data might be. And, if we follow the traces laid out by people writing on the history of data visualization (like Michael Friendly, Edward Tufte or Daniel Rosenberg), we soon note that there seems to be no straightforward evolution in data visualization. On the one hand, data visualization depended on the possibilities painters and printers provided for data specialists; on the other hand, the development of abstract thinking took its own routes. The indices and registers found as annexes in printed books form the data basis for networks of persons or conceptions, but their visualization came later. Visualizations became more common in the 19th century (John Snow’s map of cholera-contaminated London is one of the more popular examples here), and data visualization was taken out of the hands of the experts (who paid the graphic designers and could afford costly printing) only in the 20th century, with the mass distribution of Excel and PowerPoint. A new phase started with the Internet providing libraries for visualization like D3.js or p5.js. But these different phases also point to what Foucault had written in his “archéologie du savoir” – when history shifted from one view to another, the past became virtually incomprehensible to the present. Visualizations become outdated and are no longer easy to understand. Thus, returning to our starting point: Are cave paintings not simply a different kind of visualization of data and knowledge from Phoenician beer recipes? Is it a failure of different disciplines not to be able to find an agreement about what terms like ‘knowledge’ and ‘data’ mean? Doesn’t our conception of ‘knowledge’ and ‘data’ determine what is being termed ‘visualization’?

The Revolution of the McWord; or, why difference and complexity is necessary.

“It’s a beautiful thing, the destruction of words.”—George Orwell, 1984.

One of my earliest memories of being reprimanded happened when I was in Junior or Senior Infants at Primary School. During a French lesson I needed to use the bathroom (tiny humans often do). We had been told we could only communicate in French, so I sat there attempting to gather and translate my toilet-related thoughts into something suitably Francophone. Eventually I put up my hand, got the teacher’s attention, pointed at my chest and said “Moi,” pointed at the door that led to the bathrooms and said “toilette?” The teacher snapped and said “No Georgina you are not a toilet!”

A little harsh perhaps, especially considering I was a four-year-old, three-foot-high mini-human, but still, I haven’t forgotten it, and now I’m fluent in French. My effort at breaking down a language barrier caused someone to snap and (it seems) be insulted by my tiny human attempt at French.

So certainly I agree with Jennifer’s point from her recent blog article here that “Building intimacy (for this is what I take the phrase “brings you closer” to mean) is not about having a rough idea of what someone is saying, it is about understanding the nuance of every gesture, every reference and resonance.” This is part of the (many) reason(s) why people on the autism spectrum, for example, find social interaction so difficult, because of a difficulty understanding these very gestural nuances that are so central to human communications. And this lack of understanding often brings with it frustration, isolation, loneliness, and pain. The point is: it’s not just about the words; it’s about how they are said, the tone, the gesture, the contexts. These are things a translation program cannot understand or impart, and it is arrogant to suggest that such facets of communication are by-passable or expendable when so many people struggle with them on a day-to-day basis. Moreover, they are facets of human communication that cannot be erased or eliminated from speech-exchanges with a view to making these exchanges “simpler” or “doubleplusgood.” That brings us right into 1984 territory.

From the perspective of Eugene Jolas, author of the “Revolution of the Word” manifesto published in transition magazine, a modernist periodical active in Paris throughout the 1920s and 1930s whose contributors included the likes of James Joyce, Gertrude Stein, and Samuel Beckett, language was not complex enough:

“Tired of the spectacle of short stories, novels, poems and plays still under the hegemony of the banal word, monotonous syntax, static psychology, descriptive naturalism, and desirous of crystallizing a viewpoint… Narrative is not mere anecdote, but the projection of a metamorphosis of reality” and that “The literary creator has the right to disintegrate the primal matter of words imposed on him by textbooks and dictionaries.”[1]

So, language, or rather languages (Jolas was fluent in several and often wrote in an amalgamation that, he felt, better reflected his hybridic Franco-German (Alsatian) and American identity[2]), was not complex enough to fully express the totality of reality.

Mark Zuckerberg proposes something of a devolution, a de-creation, a simplifying of difference, a reintegration and amalgamation of the facets that distinguish us from others. But while it might appear useful (Esperanto, anyone?), is the experience going to lead to richer conversations? To a demolition of barriers? Or will it result in something akin to Point It, the highly successful so-called “Travellers Language Kit” that contains no language at all, but rather an assortment of pictures that allow one to leaf through the book and point at the item one wants?

So, with a sensitive interlocutor, one could perhaps intuit that “Me *points at* Coca-Cola” means “Hi, I would like a Coca-Cola.” Or that “Me *points at* toilet” would likely mean “Hi, I desperately need to use your bathroom, could you be so kind as to point me in the right direction?”

After my childhood French toilet incident, I myself could never overcome the mortification involved in using Point It. But even if I did, would the success of an exchange wherein “Me *points at* Coca-Cola” results in my being handed a Coca-Cola give me the same satisfaction as my first successful exchange with someone in another language did? The first time I managed to say something in French to a French person in France and be met with a response in French as opposed to a confused look or (worse) a response in English. Would Zuckerberg’s own much-lauded trip to China in 2014 where he was interviewed and responded in Mandarin have received as much positive press if he had worked through an interpreter, or used a pioneering neural network translation platform? I don’t think so.

Reducing or eliminating language difference also creates hierarchies, and this is dangerous. What language will we agree to communicate in? Why one language and not another? What facets of my individuality are accented in my native language that are perhaps left out or lost in another?

In short, there is an ethical element to this, and one that must be acknowledged and addressed. It’s similar to the argument Todd Presner articulates in “The Ethics of the Algorithm” when he notes the negative effect of reducing human experience (in this case, the testimonies of Holocaust survivors) to keywords so that their experiences become “searchable”: “it abstracts and reduces the human complexity of the victims’ lives to quantized units and structured data. In a word, it appears to be de-humanizing.”[3]

We have to resist what Presner calls “the impulse to quantify, modularize, distantiate, technify, and bureaucratize the subjective individuality of human experience,”[4] even if this impulse is driven by a desire to facilitate communications across perceived borders. Finding, maintaining, and celebrating the individual is vital in an era that is putting increasing pressure on us to separate the “in-” from the “-dividual” for the sake of facility. Giving in to that pressure will lead (and has perhaps already led, if we can refer back to Don DeLillo’s 1985 observation that “You are the sum total of your data”) to the era of the “dividual,”[5] where instead of championing individuality, people are reduced to their component data sets, or rather the facets of their personhood that can be assigned to data sets, with the rest (the enigmatic “in-” that makes up an individual) deemed unnecessary, a “barrier” to facile communications.

Rather than working to fractalise language, as Eugene Jolas did, universal translation (which is itself a misnomer, since all translations are, to a degree, inexact and entail a degree of intuition or creativity to render one word in or through another word) simplifies that which cannot, and should not, be simplified. This would be doubleplusungood.

Complexity matters.

KPLEX matters.

[1] Eugene Jolas, “Revolution of the Word,” transition 16/17, 1929.

[2] Born in New Jersey, Jolas moved to the bilingual Alsace-Lorraine region of Europe as a young child, and later spent key formative years in the United States.

[3] Presner, in “The Ethics of the Algorithm: Close and Distant Listening to the Shoah Foundation Visual History Archive” in Fogu, Claudio, Kansteiner, Wulf, and Presner, Todd, Probing the Ethics of Holocaust Culture.

[4] Presner, in ibid.

[5] Deleuze, “Postscript on the Societies of Control.”

A recipe for intimacy?

Mark Zuckerberg posted the following statement on his Facebook feed:
“Today we’re publishing research on how AI can deliver better language translations. With a new neural network, our AI research team was able to translate more accurately between languages, while also being nine times faster than current methods.

Getting better at translation is important to connect the world. We already perform over 2 billion translations in more than 45 languages on Facebook every day, but there’s still a lot more to do. You should be able to read posts or watch videos in any language, but so far the technology hasn’t been good enough.

Throughout human history, language has been a barrier to communication. It’s amazing we get to live in a time when technology can change that. Understanding someone’s language brings you closer to them, and I’m looking forward to making universal translation a reality. To help us get there faster, we’re sharing our work publicly so that all researchers can use it to build better translation tools.”

Key messages: taking time to understand people is for fools, and language is the problem.
When did language become a barrier to communication?  Would we not be hard pressed to communicate much at all without it?  Doesn’t machine translation have the potential to create as much distance as ‘understanding?’   Building intimacy (for this is what I take the phrase “brings you closer” to mean) is not about having a rough idea of what someone is saying, it is about understanding the nuance of every gesture, every reference and resonance.  Isn’t the joy of encountering a new culture tied up in the journey of discovery we make on the road to understanding?
I salute Facebook for making their research and software open, but a bit of humility in the face of the awesome and varied systems of signs and significations we humans have built could make this so much better news.

Data before the (alternative) facts.

Ambiguity and uncertainty, cultural richness, idiosyncrasy and subjectivity are par for the course when it comes to humanities research, but one mistake that can be made when we approach computational technologies that proffer access to greatly increased quanta of data is to assume that these technologies are objective. They are not. Just like regular old-fashioned analogue research, they are subject to subjectivity and error, and they can display or reflect the human prejudices of the user, the software technician, or the organisation responsible for designing the information architecture; for designing this supposedly “objective” machine or piece of software for everyday use by regular everyday humans.

In ‘Raw Data’ is an Oxymoron Daniel Rosenberg argues that data is rhetorical.[1] Christine Borgman elaborates on this stating that “Data are neither truth nor reality. They may be facts, sources of evidence, or principles of an argument that are used to assert truth or reality.”[2] But one person’s truth is, or can be, in certain cases, antithetical to another person’s reality. We saw an example of this in Trump aide Kellyanne Conway’s assertion that the Trump White House was offering “alternative facts” regarding the size of the crowd that attended Trump’s inauguration. Conway’s comments refer to the “alternative facts” offered by White House Press Secretary Sean Spicer, who performed an incredible feat of rhetorical manipulation (of data) by arguing that “this was the largest audience to ever witness an inauguration, period.”

While the phrase “alternative facts” may have been coined by Conway, it’s not necessarily that new as a concept;[3] at least not when it comes to data and how big data is, or can be, used and manipulated. So what exactly am I getting at? Large data can be manipulated and interpreted to reflect different users’ interpretations of “truth.” This means that data is vulnerable (in a sense). This is particularly applicable when it comes to the way we categorise data, and so it concerns the epistemological implications of architecture such as the “search” function, and the categories we decide best assist the user in translating the data into information, knowledge and the all-important wisdom (à la the “data, information, knowledge and wisdom” (DIKW) system). Johanna Drucker noted this back in 2011, and her observations are still being cited:

It is important to recognize that when we organize data into categories (according to population, gender, nation, etc.), these categories tend to be treated as if they were discrete and fixed, when in fact they are interpretive expressions (Drucker 2011).[4]

This correlates with Rosenberg’s idea of data as rhetorical, and also with Rita Raley’s argument that data is “performative.”[5] Like rhetoric and performance then, this means the material can be fashioned to reflect the intentions of the user, that it can be used to falsify truth, or to insist that perceived truths are in fact false; that news is “fake news.” Data provided the foundations for the arguments voiced by each side over the ratio of Trump inauguration attendees to Obama inauguration attendees, with each side insisting their interpretation of the data allowed them to present facts.

This is one of the major concerns surrounding big data research and Rosenthal uses it to draw a nice distinction between notions of falsity as they are presented in the sciences and the humanities respectively; with the first discipline maintaining a high awareness of what are referred to as  “false-positive results”[6] whereas in the humanities, so-called “false-positives” or theories that are later disproven or refuted (insofar as it is possible to disprove a theory in the humanities) are not discarded, but rather incorporated into the critical canon, becoming part of the bibliography of critical works on a given topic: “So while scientific truth is falsifiable, truth in the humanities never really cancels out what came before.”[7]

But perhaps a solution of sorts is to be found in the realm of the visual and the evolution of the visual (as opposed to the linguistic, which is vulnerable to rhetorical manipulations) as a medium through which we can attempt to study data. Rosenthal argues that visual experiences of data representations trigger different cognitive responses:

When confronted with the visual instead of the linguistic, our thought somehow becomes more innocent, less tempered by experience. If data is, as Rosenberg suggests, a rhetorical function that marks the end of the chain of analysis, then in data-driven literary criticism the visualization allows us to engage with that irreducible element in a way that language would not. As Edward R. Tufte (2001, 14), a statistician and author on data visualization, puts it, in almost Heideggerian language, ‘Graphics reveal data.’[8]

So “Graphics reveal data,” and it was to graphics that people turned in response to Conway and Spicer’s assertions regarding the “alternative facts” surrounding Trump’s inauguration crowd. Arguably, these graphics (below) express the counterargument to Conway and Spicer’s assertions in a more innately understandable way than any linguistic rebuttal, irrespective of how eloquent that rebuttal may be.

But things don’t end there (and it’s also important to remember that visualisations can themselves be manipulated). Faced with graphic “alternative-alternative facts” in the aftermath of both his own assertions and Conway’s defense of these assertions, Spicer countered the visual sparsity of the graphic image by retroactively clarifying that this “largest audience” assertion incorporated material external to the datasets used by the news media to compile their figures, taking in viewers on YouTube, Facebook, and those watching over the internet the world over. The “largest audience to ever witness an inauguration, period” is, apparently, “unquestionable,” and it was obtained by manipulating data.

[1] Daniel Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” is an Oxymoron, 18.

[2] Borgman, “Big Data, Little Data, No Data,” 17.

[3] We can of course refer to Orwell’s 1984 and the “2+2=5” paradigm here.

[4] Karin van Es, Nicolàs López Coombs & Thomas Boeschoten, “Towards a Reflexive Digital Data Analysis” in Mirko Tobias Schäfer and Karin van Es, The Datafied Society. Studying Culture through Data, 176.

[5] Rita Raley, “Dataveillance and Countervailance” in Gitelman, “Raw Data” is an Oxymoron, 128.

[6] Rosenthal, “Introduction: Narrative Against Data in the Victorian Novel,” 6.

[7] Ibid., 7.

[8] Ibid., 5.

How can use be made of NAs?

In elementary school, the little data scientist learns that NAs are nasty. Numerical data are nice, clean, and complete when collected by a sensor. NAs, on the other hand, are the kind of data that result from manually entered data or surveys that were not properly filled in; or they are missing values which could not be gathered for unknown reasons. If they were NULLs, the case would be much clearer – with a NULL, you can calculate. But NA, that’s really terra incognita, and because of this the data have to be checked for bias and skewness. NA thus becomes for the little data scientist an expression of desperation: not applicable, not available, or no answer.

For humanists, the case is easier. Missing values are part of their data life. Only rarely do machines support the process of data collection, and experience shows that people, when asked, rarely respond in a complete way; leaving out part of the answer is an all-too-human strategy. Working with this kind of data, researchers have become creative; one can even ask whether there is a pattern behind NAs, or whether a systematic explanation for them can be found. In a recently published reader on methodologies in emotion research, there is a contribution on how to deal with missing values. The researcher, Dunya van Troost, collected data from persons committed to political protest all over Europe; the survey contained, amongst others, four items on emotions. By counting the number of missing values across the four emotion items, three groups could be identified: (1) respondents with complete records (85 percent), (2) respondents who answered selectively (13 percent), and (3) respondents who answered none of the four emotion items (2 percent). For inferential statistics, it is important to understand the reasons why certain data are missing. Van Troost somehow turned the tables: she conducted a regression analysis to see how the other data available, here both country and demographic characteristics, influenced the number of missing values among her 15,272 respondents. She found that the unanswered items were not missing completely at random. The regression further showed that the generation of protesters born before 1945 more frequently had missing values on the emotion items than younger generations. The same applies to the level of education – the higher the level of education, the more completely the surveys had been filled in. Female respondents had more missing values than male ones. Finally, respondents from the southern European countries Spain and Italy had a relatively high rate of missing values.
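The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not van Troost’s actual procedure: the item names, the cohort labels and the four toy respondents are all invented, and the comparison of mean missing counts per cohort is a much simpler stand-in for the regression analysis she ran.

```python
from collections import Counter

# Hypothetical item names: the source only says there were four emotion items.
EMOTION_ITEMS = ("anger", "fear", "frustration", "worry")


def count_missing(record):
    """Count the emotion items left unanswered (None marks a missing value)."""
    return sum(1 for item in EMOTION_ITEMS if record.get(item) is None)


def response_group(record):
    """Sort a respondent into one of the three groups named in the text."""
    missing = count_missing(record)
    if missing == 0:
        return "complete"
    if missing == len(EMOTION_ITEMS):
        return "none answered"
    return "selective"


def group_shares(records):
    """Share of respondents falling into each of the three groups."""
    counts = Counter(response_group(r) for r in records)
    return {g: counts[g] / len(records)
            for g in ("complete", "selective", "none answered")}


def mean_missing_by(records, key):
    """Average number of missing emotion items per value of `key`.

    A plain group comparison, not a regression: it does not control
    for any other characteristic of the respondents.
    """
    totals, sizes = Counter(), Counter()
    for r in records:
        totals[r[key]] += count_missing(r)
        sizes[r[key]] += 1
    return {k: totals[k] / sizes[k] for k in sizes}


# Invented toy data, for illustration only.
respondents = [
    {"cohort": "before 1945", "anger": 4, "fear": None, "frustration": 3, "worry": None},
    {"cohort": "before 1945", "anger": None, "fear": None, "frustration": None, "worry": None},
    {"cohort": "after 1945", "anger": 5, "fear": 2, "frustration": 4, "worry": 1},
    {"cohort": "after 1945", "anger": 3, "fear": 3, "frustration": None, "worry": 2},
]

shares = group_shares(respondents)
by_cohort = mean_missing_by(respondents, "cohort")
print(shares)
print(by_cohort)
```

The point of the sketch is that the missing values are not dropped before analysis; the number of NAs per respondent becomes a variable in its own right, which can then be related to age, gender, education or country.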

It is obvious that not all people are going to answer items on emotions with the same willingness; and it is also plausible that the differences in willingness are related to age, gender, cultural socialization. It would be interesting to have a longitudinal study on this topic, to validate and cross-check van Troost’s findings. But this example illuminates a different handling of information in the social sciences from that of information science: Even NAs can be useful.

Dunya Van Troost, Missing values: surveying protest emotions, in: Helena Flam, Jochen Kleres (eds.), Methods of Exploring Emotions, London / New York: Routledge 2015, pp. 294-305.