On digital oblivion

Knowledge is made by oblivion.
Sir Thomas Browne, in Sir Thomas Browne’s Works: Including His Life and Correspondence, vol. 2, p. 177.

Remembrance, memory and oblivion have a peculiar relationship to each other. Remembrance is the internal realisation of the past, memory (in the sense of memorials, monuments and archives) its exteriorised form. Oblivion supplements these two to a trinity, in which memory and oblivion work as complementary modes of remembrance.

The formation of remembrance can be seen as an elementary function in the development of personal and cultural identity; oblivion, on the other hand, ‘befalls’ us – it happens, it is non-intentional – and can therefore be seen as a threat to identity. Cultural heritage institutions – such as galleries, libraries, archives, and museums (GLAM) – are thus not only the places where objects are collected, preserved, and organized; they also form bodies of memory, invaluable for our collective identity.

There is a direct line in these cultural heritage institutions from analogue documentation to digital practice: online catalogues and digitized finding aids present metadata in a publicly accessible way. But apart from huge collections of digitized books, the material in question is mostly not available in digital formats. This is why cultural heritage – and especially unique copies like the material stored in archives – can be seen as ‘hidden data’. What can be accessed are metadata: the most formal description of what is available in cultural heritage institutions. This structure works towards oblivion in a two-fold way: on the one hand, the content of archives and museums is present and existing, but not in digital formats and thus ‘invisible’. On the other hand, the century-long practice of documentation in the form of catalogues and finding aids has been carried over into digital information architectures; but even though these metadata are accessible, they hide more than they reveal if the content they refer to is not available online. We all have to rely on the information given and ordered by cultural heritage institutions – their classifications, taxonomies, and ontologies – to be able to access our heritage and explore what has formed our identities. Is Thomas Browne right in pointing to the structured knowledge gained from oblivion?

This depends on our attitude to the past and the formation of identity. It is possible to appreciate oblivion as a productive force. In antiquity, amnesty was the Siamese twin of amnesia; the word amnesty is derived from the Greek word ἀμνηστία (amnestia), or “forgetfulness, passing over”. Here it is oblivion which works for the generation of identity and unity: let bygones be bygones. In more recent times, it was Nietzsche who underlined the necessity to relieve oneself from the burdens of the past. It was Freud who identified the inability to forget as a mental disorder, earlier called melancholia, nowadays depression. And it was also Freud who introduced the differentiation between benign oblivion and malign repression.

But it is certainly not the task of GLAM-institutions to provide for oblivion. Their function is the provision of memory. Monuments, memorials, and the contents of archives serve a double function: they keep objects in memory; and at the same time this exteriorisation neutralizes and serves oblivion insofar as it relieves us of the affect of mourning. To erect monuments and memorials and to preserve the past in archives is in this sense a cultural technique for the elimination of meaning. To let go of what is no longer present by preserving what is available – in this relation the complementarity of memory and oblivion becomes visible; they don’t work against each other, but jointly. From this point of view remembrance – the internal realization of the past – is the task of the individual.


Detail of the front of the Jewish Museum Berlin
By Stephan Herz (assumed; no machine-readable author provided) [Public domain], via Wikimedia Commons

It is not the individual who decides what should sink into oblivion. Societies and cultures decide, in a yet unexplored way, which events, persons, or places (the lieux de mémoire) are kept in memory and which are not. If it is too big a task to change the documentation practices of GLAM-institutions, the information architecture, and the metadata they produce, the actual usage of archival content could provide an answer to the question of what is of interest to a society and what is not: the digital documentation of what is being searched for in online catalogues and digitized finding aids, as well as which material is being ordered in archives, clearly indicates the users’ preferences. Collected as data and analysed with algorithms, this documentation could teach us not only about what is being kept in memory; we could also learn about what falls into oblivion. And that is a kind of information historians rarely have at their disposal.
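What such an analysis could look like can be sketched in a few lines, assuming a hypothetical log of catalogue queries (all identifiers here are invented for illustration): counting how often each item is requested ranks what users keep in memory, and the complement – items never asked for at all – is the record of oblivion.

```python
# A minimal sketch, assuming a hypothetical query log: rank catalogue items
# by how often they are requested, and list those never requested at all.
from collections import Counter

def attention_profile(search_log, all_items):
    """Return (items ranked by requests, items never requested)."""
    counts = Counter(search_log)                           # requests per item
    remembered = counts.most_common()                      # ranked by user interest
    forgotten = [i for i in all_items if i not in counts]  # never searched for
    return remembered, forgotten

# Toy example with invented shelfmarks:
log = ["MS-101", "MS-101", "MS-205", "MS-101"]
items = ["MS-101", "MS-205", "MS-312", "MS-400"]
remembered, forgotten = attention_profile(log, items)
```

Run over years of real query logs, the `forgotten` list would be precisely the kind of documentation of oblivion that historians rarely have at their disposal.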

Harry Potter and the Philosopher’s DIKW.

DIKW= Data. Information. Knowledge. Wisdom. DIKW!

That’s the way things flow. Or more specifically, that’s the way “the knowledge pyramid” says they flow.

From data we gather information we develop into knowledge which leads to wisdom.

Apparently.

But is it really that straightforward? Are our thought processes that streamlined, hierarchical and, let’s face it, uncomplicated? Or is DIKW simply a nice-sounding but somewhat reductive acronym to be used when waxing lyrical about the philosophy of knowledge, information systems, information management, or pedagogy?

I for one am not all that convinced by DIKW. And I’m not the only one: the pyramid is widely criticised. But why? Where and how, exactly, does DIKW misrepresent how we think about and manage data and information? Today we’re going to explore the DIKW pyramid; specifically how exactly “data” gets transformed into “wisdom,” what exactly happens to it, and how a different approach to cleaning or processing that “data” can lead us to come to very different conclusions, and thus to very different states of “wisdom”. And to facilitate this philosophising on the nuances of DIKW and its vulnerability to corruption, I’m going to talk about Harry Potter and the “is Snape good or bad” plot that runs through all seven Harry Potter novels. Because, why not? Specifically I’m going to use Snape’s complexity as a character to highlight DIKW’s shortcomings and in particular how DIKW can be corrupted depending on how the data you collect is processed and interpreted.

As we all know, Snape looks kinda evil, acts kinda evil, hates Harry, and has a pretty dodgy past in which he was aligned with Voldemort, the wizard responsible for Harry’s parents’ deaths. He has a fondness for the “Dark Arts,” and, as head of Slytherin, an unhealthy interest in eugenics and so-called “blood purity” (never a good trait in a person). And he is played to absolute perfection by the unrivaled Alan Rickman, sadly now deceased.

Rowling maintains a near-constant back and forth throughout the series, with the characters forever pursuing the idea that Snape is bad, being thwarted in their pursuit of this idea, or thrown off their suspicions by Dumbledore who always reaffirms his strong faith in Snape. The dampening of any suspicion regarding Snape’s motives generally comes at the conclusion of any given book, only for these suspicions to be re-ignited at the start of the next book and the next adventure.

And just when this continual “is he or isn’t he a bad guy” threatens to get monotonous, with the well-trained reader now six books in and attuned to expect the usual — “Snape’s being shifty, ergo…he must be bad!” / “Nope he’s actually good, Dumbledore says so.” / “Oh okay let’s talk about this again in the next book.” — Rowling bucks our expectations spectacularly, and all of these hints and suspicions about Snape are seemingly verified in book six, Harry Potter and the Half-Blood Prince, when Snape goes and kills Dumbledore, the one man who trusted and protected him absolutely; a most heinous crime, and one done using “Avada Kedavra,” the unforgivable curse.

Let’s take a look at Snape’s first appearance, way back in book one, Harry Potter and the Philosopher’s Stone, or as it’s known in the US, Harry Potter and the Sorcerer’s Stone:

    (J. K. Rowling, Harry Potter and the Philosopher’s Stone, Bloomsbury 1999, 126.)

What do we get on Snape here?

He’s unhealthy looking, pale and yellowish. He could probably do with a good shampoo. Oh, and he has a hooked nose.

Now this description is controversial. Snape’s portrait could be considered to be “Jewish-coded” or even anti-Semitic; certainly it can be seen as having uneasy inter-textual chimes with overtly anti-Semitic portraits in classic (and classically anti-Semitic) canonical English-language texts such as Charles Dickens’s Oliver Twist or Shakespeare’s The Merchant of Venice, in which Fagin and Shylock respectively (both pictured below) are presented with the hooked noses that were considered characteristic of the so-called “Stage Jew.”

The “Stage Jew” is basically the Jewish equivalent of blackface, a crude form of racial stereotyping that was particularly popular during the Elizabethan period and thereafter. Much like blackface, these racist Jewish stereotypes were not confined to the realms of theatre and literature; the Nazis also made full use of racist caricatures in their propaganda.

Almost all of this will (thankfully) be lost on a younger audience, but the question remains: does Rowling engage with this sadly all-too-familiar visual trope deliberately? Knowing Snape’s story, his full story, as she claims she did all the way back in book one, does she proffer such material with a view to ultimately showing it up to be complete rubbish?

Snape’s appearance screams evil, irrespective of whether you want to connect that with the tradition of the Stage Jew, or with less racially charged narratives that seek to represent a person’s character in or through their appearance. It’s a frequent visual trope in Superhero(ine) movies, for example: the good guys look good, the bad guys look bad. And when the good guy turns bad (as they are wont to do on occasion), their nascent badness is represented visually through some change in their appearance. Again, in relation to the Spiderman 3 poster below, we can ask why the “evil” Spiderman is coded “black,” particularly when Spiderman hails from a country whose law enforcement has a well-established track record of subjecting African American males to racial profiling, but we had better stay on topic.

To return to Snape, to his appearance, and to how our reading of his appearance can change once we get to the crux of his character: certainly Snape’s character leads us to question many of the facets of other people’s appearances that we read and take for granted – their demeanour, their silences, their unreadability, their complexity. Snape looks bad, but he is not bad, he is good. We have misread him and are guilty of superficially associating his morality with his appearance. This is not to say that readers of Harry Potter are guilty of racially profiling Snape, not at all, but simply that this tendency to create a link between appearance and morality has a long history in English literature, and one that unfortunately happens to have rather unpleasant racist roots that contemporary readers may not be aware of. This is just one small part of why Rowling is such a good author, and why the Harry Potter books are so rich and rewarding for growing minds in particular. But it’s also an example of how the DIKW pyramid can be drastically altered or corrupted depending on how you read the data at the bottom; the data being the material that proffers the opportunity to reach a state of wisdom regarding that particular material or phenomenon.

In other words, if you had seen even a very small selection of Marvel superhero movies, you would have been trained to seek out information on a person’s moral character based on their appearance, and you may have taken facets of Snape’s malevolent appearance as “data” on Snape. This “data” would allow you to garner “information” on his character, because most figures that look like this in literature or superhero movies are bad or evil characters. Then, thanks to your experience of encountering Snape throughout the novels, this data would lead you to a position of “knowledge” or even to a position of “wisdom” regarding Snape (and people like Snape). This is where what we believe to be “wisdom” can start to cause problems.

—Pause here to note that the backstories of many of the villains in superhero movies are frequently tragic and often not all that different from the backstories of the heroes or heroines. Similar data, different narratives, different superhero/villain costume.—

This is also an example of how data is pre-epistemic. That means data has no truth; it just is, and other people come along and clean it up, interpret it, narrativise it, assign it truth-values. Throughout the series Snape just is: he does not explain or excuse himself, and he is as true to his love of Lily Potter at the beginning as he is at the end. We just happen not to be privy to his backstory. But Dumbledore is, and Rowling (naturally) is. Without the backstory readers are left to run amok regarding the character of Snape, often hating him, feeling anxious in his presence, or fearing for the safety of the characters around him. But how much of that comes from Snape himself, and how much comes from other people’s interactions with him and their reading of him? Remember, many of the characters in the Potter-verse have already decided for themselves that Snape is evil and not to be trusted; the narratives we read are fuelled by tainted interpretations of his data. Everything from his stare to his tone of voice is presented with adjectives that encourage us to read him as malevolent. The state of “Wisdom” (also, in my case, near-hysterical despair) we arrive at in book six when Snape kills Dumbledore is fuelled by cumulative knowledge and information that stems from misinterpreted data. The Snape-data has been read one way and one way alone, and there is no alternative narrative available, especially not once Dumbledore dies. There is no counter-argument to the data that seems to paint Snape as an absolute villain. None at least until, in his own death scene, he provides the memory (and narrative) that allows Harry to access the “truth value” of Snape’s history and his lifelong love for Lily.

Armed with this poignant insight into Snape’s childhood and history (his personal narrative, his personal account of the data-traces he has left on the Potter-verse), our DIKW pyramid is rewritten from the bottom up, but interestingly, much of it remains the same. The data stays data; but the narratives that we use to form information from that data, to create knowledge from that information, and to eventually arrive at a sage-like state of sad wisdom regarding Snape’s sad fate – these narratives change. Snape still says the things he says, and does the things he does; we are just newly wise to his motives. We now know he is a good man. Same data, completely different DIKW.

We already know the DIKW model is problematic and oversimplified. As Christine Borgman notes, the “tripartite division of data, information, and knowledge […] oversimplifies the relationships among these complex constructs.”[1] Data is reinterpretable, and this is key. For the majority of the series Snape is continually heard, seen, and spoken about by characters in the texts using adjectives that assign morally dubious traits to his character. The –IKW part of Snape’s DIKW is unbalanced, because we do not get Snape’s personal narrative until the very end of his story. And it is only through this narrative that we can reassess the data on him we have collected, discarding some of it (such as the tendency to dramatise his appearance into one akin to a stage villain) as absolute rubbish, and reassessing what remains.

While Snape may carry the visual appearance (visual data) that makes it easy for us to suspect or infer that he is a “bad character,” and while he may even carry the physiognomical hallmarks that hark back to racist character profiling in English literature, he is essentially a good person. What Rowling is saying here, in the most epic Rowling-esque fashion, is: do not judge a person based on their appearance. Judge them on what they do and why they do it. This is Rowling’s real trump card against the “evil is as evil looks” camp of Snape-hating Potter fans. And arguably it is also Rowling’s way of redressing this unpleasant facet of English literary history, which sees race presented through face, and race or racial stereotypes sadly presented as a measure of a character’s moral compass. Rowling writes back against this tradition by having Snape carry the same facial features as these similarly maligned “villainous” figures, features past readers or audiences would have taken as crude indications of his untrustworthiness. Yet instead of being “untrustworthy,” this same hook-nosed figure turns out to be one of the bravest, strongest, truest characters in the series. Take that, Shakespeare and Dickens. Whup-ah.

So, we come back then to the problem of DIKW. Models such as DIKW create misleading and misrepresentative impressions about the supposed distinctions between the various facets of DIKW. DIKW also belies the central role narrative plays in all of this; narrative is the conveyor of information, knowledge, and wisdom. It is how we articulate and spread our opinions on data. And data is the foundation of DIKW, so depending on how that data is narrativised, the other elements in this hierarchy can be drastically different. One reading sees Snape as evil, and reads this wickedness in and into his every scene, up until his death. The other asks us to think carefully about what we do with our data, and the narratives we create from it, because even when we are wholly convinced of the veracity and justifiability of our “wisdom,” we could be totally wrong, as we were with Snape.

[1] Christine L. Borgman, Big Data, Little Data, No Data: Scholarship in the Networked World (Cambridge, MA: MIT Press, 2015), 17, https://mitpress.mit.edu/big-data-little-data-no-data, accessed April 7, 2017.


Whose context?

When Christine Borgman (2015) mentions the term “native data” she is referring to data in its rawest form, with context information like communication artefacts included. In terms of NASA’s EOSDIS Data Processing Levels, “native data” even precede level 0, meaning that no cleaning has been performed at all. Scientists who begin their analysis at this stage do not face any uncertainty about what this context information is: it is simply the recordings and output of instruments, predetermined by the configuration of those instruments. NASA researchers may therefore count themselves lucky to obtain this kind of reliable context information.

Humanists’ and social scientists’ point of departure is quite different. Anthropologists, for example, would probably use the term “emic” for their field research data. “Emic” here stands in contrast to “etic”; the distinction derives from that in linguistics between phonemics and phonetics: “The etic viewpoint studies behavior as from outside of a particular system, and as an essential initial approach to an alien system. The emic viewpoint results from studying behavior as from inside the system” (Pike 1967: 37). An example of the emic viewpoint might be the correspondences between pulses and organs in Chinese medical theory (see picture below) or the relation of masculinity to maleness in a particular cultural setting (MacInnes 1998).

Chinese woodcut: Correspondences between pulses and organs (Wellcome Images, L0038821)

The emic context for anthropologists thus depends on the particular cultural background of their research participants. Disassociated from this cultural background and transferred into an etic context, data may become incomprehensible. Take for example Kosovo: a sovereign state from an emic point of view, but recognized by only 111 UN member states. In this transition from emic to etic context, the etic context obviously becomes an imposed context.

Applied to libraries, archives, museums and galleries, it might be equally important to know the provenance and original use – the emic context, so to speak – of the resources. What functions did the materials have for the author or creator? Knowing the “experience-near” and not only the “experience-distant” meanings of materials would increase their information content and transparency. One could also say that providing such additional “emic” metadata enables traceability back to the source context and guarantees the credibility of the data. From an operational viewpoint, however, that would recreate the problem of standards and of making data findable.

If we move up to the next level, the metadata from each GLAM-institution could be said to be emic, according to the understanding of the data structure by the curators in that institution. Currently over a hundred different metadata standards are in use. Again, the aggregation of several metadata standards into a unified metadata standard creates the same problem – a transfer from an emic standard (an institution’s inherent metadata standard) into an etic one.
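The operational cost of this emic-to-etic transfer can be made concrete with a minimal sketch. All the in-house field names below are invented for illustration, with Dublin Core standing in as the etic target standard: whatever has no counterpart in the crosswalk simply does not survive the transfer.

```python
# A minimal sketch of an emic-to-etic metadata crosswalk. The in-house
# (emic) field names are invented; Dublin Core serves as the etic standard.
CROSSWALK = {
    "titel": "dc:title",      # hypothetical German in-house field
    "urheber": "dc:creator",
    "datierung": "dc:date",
}

def to_etic(record, crosswalk=CROSSWALK):
    """Translate an emic record; fields without a mapping are lost in transfer."""
    mapped = {crosswalk[k]: v for k, v in record.items() if k in crosswalk}
    dropped = [k for k in record if k not in crosswalk]  # emic context that does not survive
    return mapped, dropped

record = {"titel": "Brief an Goethe", "urheber": "Schiller", "standortvermerk": "Magazin 4"}
mapped, dropped = to_etic(record)
```

The `dropped` list is the point: every aggregation into a unified standard silently discards the emic context that the crosswalk cannot express.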

So what is the solution? Unless GLAM-institutions are willing to accept an imposed standard there remains only the possibility of a mutual convergence and ultimately an inter-institutional consensus.

——
Borgman, Christine L. (2015) Big Data, Little Data, No Data. Scholarship in the Networked World. Cambridge: MIT Press.
MacInnes, John (1998) The end of masculinity. The confusion of sexual genesis and sexual difference in modern society. Buckingham: Open University Press.
Pike, Kenneth L. (1967) Language in Relation to a Unified Theory of the Structure of Human Behavior. The Hague: Mouton.

Featured image was taken from http://www.europeana.eu/portal/de/record/9200105/wellcome_historical_images_L0038821.html

Promise and Paradox: Accessing Open Data in Archaeology

This article was published as: Huggett, Jeremy. ‘Promise and Paradox: Accessing Open Data in Archaeology’. In: Clare Mills, Michael Pidd and Esther Ward (eds), Proceedings of the Digital Humanities Congress 2012. Studies in the Digital Humanities. Sheffield: HRI Online Publications, 2014.

Abstract:
Increasing access to open data raises challenges, amongst the most important of which is the need to understand the context of the data that are delivered to the screen. Data are situated, contingent, and incomplete: they have histories which relate to their origins, their purpose, and their modification. These histories have implications for the subsequent appropriate use of those data but are rarely available to the data consumer. This paper argues that just as data need metadata to make them discoverable, so they also need provenance metadata as a means of seeking to capture their underlying theory-laden, purpose-laden and process-laden character.

The article is available online at: https://www.hrionline.ac.uk/openbook/chapter/dhc2012-huggett

The data (and narrative) behind a good lie.

In 2003 James Frey, a twenty-something American and recovering alcoholic and drug addict, published his memoir A Million Little Pieces. The memoir recounts his experience of addiction and rehabilitation, often in brutal and unsparingly graphic detail. Two scenes in particular have stayed with me to this day: one involving a trip to the dentist for un-anaesthetised root-canal treatment (the horror), the other something altogether worse involving a woman and the things people in the grip of an addiction will do for drugs.

In 2005 Frey hit the big time and A Million Little Pieces was selected for the Holy Grail of book clubs, Oprah’s Book Club, an act that virtually guaranteed massive sales. Not long after, the book topped the New York Times best-seller list, as Oprah’s Book Club choices tend to do, and there it remained for some 15 weeks, selling close to four million copies. Not bad for a former addict, eh?

Except Frey wasn’t all he claimed to be, and his memoir wasn’t entirely factual. To give but one example: instead of being in jail for a respectable 87 days, Frey was in jail for just a few hours. The Smoking Gun published an article titled “A Million Little Lies” that opened with the amusingly melodramatic statement “Oprah Winfrey’s been had” (gasp!) and proceeded to outline how Frey’s book had circumvented the truth and taken Winfrey along for the ride:

Police reports, court records, interviews with law enforcement personnel, and other sources have put the lie to many key sections of Frey’s book. The 36-year-old author, these documents and interviews show, wholly fabricated or wildly embellished details of his purported criminal career, jail terms, and status as an outlaw “wanted in three states.”

In an odd reversal, Frey’s lack of jail time and status as a not-so-delinquent former delinquent led to outrage and culminated in a massively dramatic face-off with Winfrey that is referred to by Winfrey herself as among her most controversial interviews:

Oprah: James Frey is here and I have to say it is difficult for me to talk to you because I feel really duped. But more importantly, I feel that you betrayed millions of readers. I think it’s such a gift to have millions of people to read your work and that bothers me greatly. So now, as I sit here today I don’t know what is true and I don’t know what isn’t. So first of all, I wanted to start with The Smoking Gun report titled, “The Man Who Conned Oprah” and I want to know—were they right?

So Frey’s memoir was fictional or, if we are being very (very) generous, semi-autobiographical. He lied and blurred the lines between autobiography and real life, using narrative to distort not only his personal data but also the lives of others. It made for a pretty interesting (if graphic) novel, but the awareness that, in addition to struggling with their own addictions, the figures in this novel (many of whom were now dead) had, on top of everything else, been chewed up and spat out by Frey, was abhorrent. In particular I have always been repulsed by his crude appropriation and exploitation of the altogether dire narrative of a vulnerable female companion and love interest called Lilly, whose eventual suicide Frey “tweaks” in terms of both how and when it was enacted, presenting himself as a proto-saviour for Lilly, one who tragically arrives a few hours too late (this didn’t happen; Lilly’s suicide did not coincide with the day Frey arrived to “save her”).

What was marketed as a memoir narrative, that is to say a narrative wherein the data correlates with and can be backed up by real-world events, was in fact a fictional narrative, that is, a narrative wherein the data is used to intimate real-world events. As Jesse Rosenthal points out in “Narrative Against Data” (2017), in these situations data is exploited to give the impression of truth or believability, and “the narrative relies for its coherence on our unexamined belief that a pre-existing series of events underlies it.” Frey’s manipulation of the rules of genre, then, together with his exploitation of the co-dependence of data and narrative, was what was so unforgivable to Oprah and her readers. A Million Little Pieces would have been fine as fiction, but as a memoir, it was manipulative, economical with the truth, and deeply exploitative. It was not the narrative that undid Frey’s web of manipulations, then, but the data that should have worked as its foundation, that should have been there to back his story up.

Frey’s hardly the first to do this. Philip Roth has been skirting the lines that separate fiction and autobiography for decades. Dave Eggers’s 2000 work A Heartbreaking Work of Staggering Genius set the marker for contemporary US “good-guy” memoir-lit. More recently, Karl Ove Knausgård’s six-part “fictional” novel series, provocatively titled My Struggle (in Norwegian: Min kamp), establishes the same tenuous, flirtatious relationship with the fictions and facts surrounding the life of its author. But while Knausgård’s work has certainly caused controversy in Sweden and his native Norway for the way it exposes the lives of his friends and family, this geographically localised controversy has served rather to fuel ongoing international interest and fascination with the novel series and its apparent real-life subject, Karl Ove himself, and his relentless mining of his autobiography to fuel these supposedly fictional narratives.

But the reaction to Frey’s twisting of the truth was altogether different. Why?

It seems you can alter facets of your memoirs if you label them as fictional to begin with (as Knausgård has done, an act that is in itself no doubt infuriating to the so-called “fictional” people exposed by his writings), or if your modifications are relatively benign and done for stylistic reasons or dramatic effect (as Eggers has admitted to doing, and indeed as Barack Obama has done in his autobiographies). But certain narratives, if declared truthful, must be verifiable by facts, evidence, and data. This extends beyond the realm of memoir and into present-day assertions of self. Consider the intense push for President Trump to testify before Congress: the implicit understanding (or hope – this is Trump, after all) is that, much like in an autobiographical account of self, a narrative under oath must correlate to something tangible, to something real. It cannot simply turn out to be fictional scaffolding designed to dupe and distort the relationship between fact, fiction, narrative, and data.

On the humanities and “Big Data”

Most humanists get a bit frightened when the term “big data” is mentioned in a conversation; they are often simply not aware of the big datasets provided by huge digital libraries like Google Books and Europeana, the USC Shoah Foundation Visual History Archive, or historical census and economic data, to name a few examples. Furthermore, there is a certain oblivion of the historical precursors of ‘digital methods’ and of the continuities between manual and computer-aided methods. Who in the humanities is aware, for example, that as early as 1851 a certain Augustus de Morgan proposed identifying an author by the average length of his words?
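De Morgan’s idea is simple enough to fit in a few lines. Here is a minimal sketch – a deliberately crude measure, and modern stylometry uses far richer features, but it shows how little machinery the earliest ‘digital method’ actually required:

```python
# De Morgan's 1851 proposal in miniature: characterise an author by the
# average length of their words (a crude stylometric fingerprint).
import re

def mean_word_length(text):
    """Average word length, ignoring punctuation and digits."""
    words = re.findall(r"[A-Za-z]+", text)
    return sum(len(w) for w in words) / len(words)

sample = "Knowledge is made by oblivion"
avg = mean_word_length(sample)  # (9 + 2 + 4 + 2 + 8) / 5 = 5.0
```

Computed over whole corpora rather than a single sentence, two texts by the same hand should, on De Morgan’s hypothesis, yield similar averages – a claim that could be checked by hand in 1851 and is trivial to check by machine today.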

The disconcertment in the humanities with regard to ‘big data’ is certainly understandable, since it points to the premises of the related disciplines and their long-practiced methodologies. Hermeneutics, for example, can be understood as a holistic penetration through individualistic comprehension. It is an iterative, self-reflexive process, which aims at interpretation. Humanists work on the exemplary and are not necessarily interested in generalization and exact quantities.

The hermeneutic circle

Big data, by contrast, provides considerable quantities and points to larger structures, stratification and the characteristics of larger groups rather than the individual. But this seeming opposition does not mean that there are no bridges between humanities approaches and big data approaches. Humanists can equally expand their research questions with respect to the ‘longue durée’ or to larger groups of people. To give an example from text studies: a humanist can develop her comprehension of single authors; her comprehension of the relationship between these single authors and the intellectual contexts they live in; and her comprehension of these intellectual contexts themselves, i.e. the larger environment of individuals, the discursive structures prevailing at the time being researched, and the socio-economic conditions determining these structures.

That is not to say that the humanities and big data could easily blend into each other. A technique like ‘predictive analysis’, for example, is more or less alien to humanistic research. And there is the critical reservation of the humanities with respect to big data: no computing will ever replace the judgment and interpretation of the humanist.

It is therefore both a reminder and an appeal to humanists to bring the ‘human’ back to big data: the critical perspective of the humanities is able to compensate for the loss of reality which big data inevitably entails; to situate big data regimes in time and space; to critically appraise the categories of supposedly neutral and transparent datasets and tools, which obscure their relationship to human beings; and to reveal the issues of trust, privacy, dignity, and consent inextricably connected to big data.

The strengths of the humanities – the hermeneutic circle, precise observation of details, complex syntheses, long chains of argumentation, the ability to handle alterity, and the creation of meaning – are capacities which have yet to be exploited in the handling of big data.

The green’s out: finding narrative in data, mining data for narrative.

When does blue stop becoming blue and start becoming less blue, more green? Could you pinpoint the exact point where a pigment stops being one and starts becoming another? And would that decision be objective or subjective? Take the Pantone colour bridge below: what is blue and what is green here?

Is there a fully blue blue? Is there a fully green green? And if there is an uber blue and an uber green, are they mutually exclusive? Can they coagulate or mix? Of course they can; everyone who has played with paint knows this. But if and when they coagulate, how do we represent that complex phenomenon? Bluegreen? Greenblue? BLgrUEeen? GREblENue? GbREENlue? And once again, is it possible to do this objectively? Is my blue the same colour blue as your blue? Are these terms in and of themselves sufficient to narrativise the colours we see before us? The modernists didn’t think so. Writing in 1921, Virginia Woolf composed the following two pieces, titled “Blue & Green,” and it’s with these two short pieces that I want to initiate this week’s post about data and narrative.

Green

THE POINTED fingers of glass hang downwards. The light slides down the glass, and drops a pool of green. All day long the ten fingers of the lustre drop green upon the marble. The feathers of parakeets—their harsh cries—sharp blades of palm trees—green, too; green needles glittering in the sun. But the hard glass drips on to the marble; the pools hover above the desert sand; the camels lurch through them; the pools settle on the marble; rushes edge them; weeds clog them; here and there a white blossom; the frog flops over; at night the stars are set there unbroken. Evening comes, and the shadow sweeps the green over the mantelpiece; the ruffled surface of ocean. No ships come; the aimless waves sway beneath the empty sky. It’s night; the needles drip blots of blue. The green’s out.

Blue

The snub-nosed monster rises to the surface and spouts through his blunt nostrils two columns of water, which, fiery-white in the centre, spray off into a fringe of blue beads. Strokes of blue line the black tarpaulin of his hide. Slushing the water through mouth and nostrils he sings, heavy with water, and the blue closes over him dowsing the polished pebbles of his eyes. Thrown upon the beach he lies, blunt, obtuse, shedding dry blue scales. Their metallic blue stains the rusty iron on the beach. Blue are the ribs of the wrecked rowing boat. A wave rolls beneath the blue bells. But the cathedral’s different, cold, incense laden, faint blue with the veils of madonnas.[1]

So here in these pieces we see Woolf exploring the concept of blue and green, attempting to evoke the “feeling” of blue and green, the “experience” of blue and green. Because for Woolf, the adjectives alone were not enough. Blue may signify blue or blueness (just as green signifies green or greenness), but there is more to blue than the word “blue,” and there is more to green than “green.” After all, what, exactly, is so blue about blue or so green about green that it cannot be described by any other term or terms?

This provides us with the foundations for yet another narrative/data puzzle. When it comes to complex and subjective experiential phenomena such as colour or emotion, what is data and what is narrative? And how can our assertions about either be considered in any way objective? Even the most banal of assertions, such as an assignation of colour (“blue”) or an emotion (“sad”), are innately tied to our subjective experiences of that colour or emotion. If I am colour blind I may see a different shade of blue to you, but that facet of my experience of reality (my dataset of colours and of the colour blue specifically) will be no less “true” or representative of my reality than yours. Similarly, if I am autistic, my understanding or interpretation of the facial contortions that signify someone in my vicinity is “sad” or “happy” will be different to yours. They will be fed by different experiences and different stimuli, but are no less valid an interpretation of reality than the weird faces you find in your textbook neuro-typical example of a “sad” face.

In the above fictional pieces by Woolf, we encounter narrativised, impressionistic versions of blue and green; but what data or datum could we extract from the pieces? The title, “Blue & Green,” does not really do the respective narratives justice, because there is something more than blue and green here; or is there? Do the narratives elaborate on the data, create narrative from the data, or do the narratives reveal the latent richness inherent in the data? Woolf gave us facets of blue and greenness we did not know existed within the colours, activating the speculative value of the data: presenting facets of blue that were already present in blue, just waiting for the right person to unlock their richness or full value.

In this way her work is analogous to Picasso’s or Monet’s respective blue periods, periods wherein both artists explored the nature of colours such as blue. We can place Woolf’s “Blue & Green” alongside representative pieces such as Picasso’s Blue Room, or any one of Monet’s many water-lily studies (where he also explored yellow, but let’s not overload the KPLEX colour palette). To return to the issue of subjectivity, it is significant that in his later life Monet’s colour choices were greatly influenced and affected by sight problems; and so again, the blue I see, the blue you see, is not necessarily the blue he saw, which was itself not the same as the blue he saw following surgery to remove his cataracts in 1923.

The true depth of blue’s blueness, the full-Irish forty shades of green: both see the colours incorporate elements that are not simple blues or greens. Taken together, these narratives paint a picture of blue and green that is far too complex for the adjectives alone; perhaps the Pantone colour bridge introduced earlier, or facets of the colour-tone experiments by the Impressionists, and later by Picasso and the Cubists (among many, many others), provide us with a more complete dataset of the Woolfian blue and green.

So once again, is the colour blue when I write it the same shade of blue as the colour blue you imagine when you read this sentence? And if we cannot agree on a standardised version of something as commonplace as blue, how can we possibly agree on terms that carry more weight or exert control over us and our surroundings? Covfefe, anyone?

[1] Virginia Woolf, “Blue & Green,” Monday or Tuesday, 1921.
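The question of where blue ends and green begins has a blunt programmatic analogue: any classifier must hard-code the cut. A minimal Python sketch (the hue boundary of 0.42 is an entirely arbitrary assumption, which is exactly the point):

```python
import colorsys

def name_colour(r: int, g: int, b: int, boundary: float = 0.42) -> str:
    """Label an RGB value 'blue' or 'green' by hue.

    The boundary (0.42, roughly cyan on the 0-1 hue wheel) is an
    arbitrary choice -- the subjective cut the text asks about.
    """
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return "green" if h < boundary else "blue"

print(name_colour(0, 128, 255))    # one observer's blue
print(name_colour(34, 139, 34))    # one observer's green
print(name_colour(0, 255, 255))    # cyan: the label flips with the boundary
```

Move the boundary and cyan changes its name; my 0.42 is no more objective than your 0.55.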

Has anyone ever analyzed big data classifications for their political or cultural implications?

If we think of big data, we don’t think of unstructured text data. Mostly we don’t even think of (weakly) structured data like XML – we conceive of big data as being organized in tables, with columns (variables) and rows (observations); lists of numbers with labels attached. But where do the variables come from? Variables are classifiers; and a classification, as Geoffrey Bowker and Susan Leigh Star have put it in their inspiring book “Sorting Things Out”, is “a spatial, temporal, or spatio-temporal segmentation of the world. A ‘classification system’ is a set of boxes (metaphorical or literal) into which things can be put to then do some kind of work – bureaucratic or knowledge production. In an abstract, ideal sense, a classification system exhibits the following properties: 1. There are consistent unique classificatory principles in operation. […] 2. The categories are mutually exclusive. […] 3. The system is complete.”[1] Bowker and Star describe classification systems as invisible, erased by their naturalization into the routines of life; in them, conflict, contradiction, and multiplicity are often buried beneath layers of obscure representations.

Humanists are well accustomed to this kind of approach in general and to classifications and their consequences in particular; classification is, so to say, the administrative part of philosophy and also of disciplines like history or anthropology. Any humanist who has occupied himself with Saussurean structuralism or Derrida’s deconstruction knows that each term (and potential classifier) is inseparably linked to other terms: “the proper name was never possible except through its functioning within a classification and therefore within a system of differences,” Derrida writes in “Of Grammatology”.[2] Proper names, as indivisible units, thus form residual categories. In the language of data-as-tables, proper names would correspond to variables, and where they don’t form a variable, they would be sorted into a ‘garbage category’ – the infamous and ubiquitous “other” (“if it is not that and that and that, then it is – other”). Garbage categories are those columns where you put the things you do not know what to do with. But in principle, garbage categories are okay; they can signal uncertainty at the level of data collection. As Derrida’s work reminds us, we must be aware of exclusions, even when they are explicit.
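Bowker and Star’s garbage category can be made concrete in a few lines. A minimal Python sketch (the boxes and holdings below are invented for illustration): anything the scheme cannot place falls into “other”.

```python
from collections import defaultdict

# A hypothetical, deliberately incomplete classification scheme.
BOXES = {"manuscript", "map", "photograph"}

def classify(items):
    """Sort (kind, name) pairs into the scheme's boxes; misfits land in 'other'."""
    sorted_items = defaultdict(list)
    for kind, name in items:
        box = kind if kind in BOXES else "other"   # the garbage category
        sorted_items[box].append(name)
    return dict(sorted_items)

holdings = [("map", "Mercator 1569"), ("wax cylinder", "field recording"),
            ("manuscript", "Codex X"), ("born-digital", "email archive")]
print(classify(holdings))
```

The “other” bucket honestly signals what the scheme cannot see; the trouble begins when the bucket’s contents are forgotten.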

The most famous thinker to have examined implicit exclusions in historical perspective is certainly Michel Foucault. In examining the formative rules of powerful discourses as exclusion mechanisms, he analyzed how power constitutes the “other”, and how standards and classifications are suffused with traces of political and social work. In the chapter “The Birth of the Asylum” of his book “Madness and Civilization”,[3] for example, he describes how the categories of ‘normal’ and ‘deviant’, and classifications of forms of ‘deviance’, go hand in hand with forms of treatment and control. Something which is termed ‘normal’ is linked to modes of conduct, standards of action and behavior, and to judgments on what is acceptable and unacceptable. In this conception, the ‘other’ or ‘deviant’ stands outside the discourse of power and therefore takes no part in the communication between the powerful and the ‘others’. Categories and classifications created by the powerful justify diverse forms of treatment by individuals in professional capacities, such as physicians, and in political offices, such as gaolers or legislators. The equivalent to the historical setting analyzed by Foucault can nowadays be found in classification systems like the International Classification of Diseases (ICD) or the Diagnostic and Statistical Manual (DSM). Doctors, epidemiologists, statisticians, and medical insurance companies work with these classification systems, and certainly there are equivalent ‘big data’ containing these classifications as variables. Not only are the ill excluded from taking part in the establishment of classifications, but so are other medical cultures and their systems of classification of diseases. Traditional Chinese medicine is a prominent example here, and the historical (and later psychoanalytic) conception of hysteria had to undergo several major revisions before finding an entry as ‘dissociative disorders’ (F44) and ‘histrionic personality disorder’ (F60.4) in the ICD-10. Here Foucault’s work reminds us that power structures are implicit and thus invisible. We are admonished to render classifications retrievable, and to include the public in policy participation.

A third example that may come to the humanist’s mind when thinking of classifications is the anthropologist Mary Douglas. In her influential work “Purity and Danger”, she outlines the inseparability of seemingly distinct categories. One of her examples is the relation of sacredness and pollution: “It is their nature [of religious entities, J.L.] always to be in danger of losing their distinctive and necessary character. The sacred needs to be continually hedged in with prohibitions. The sacred must always be treated as contagious because relations with it are bound to be expressed by rituals of separation and demarcation and by beliefs in the danger of crossing forbidden boundaries.”[4] Sacredness and pollution are thus in permanent tension with each other and create their distinctiveness out of a permanent process of exchange. This process undermines classification – with Douglas, classifiers have to run over into each other (as is the case with the 5-point Likert scale), or classification systems have to be conceived of as heterogeneous or parallel lists. Douglas’ work thus reminds us of the need to incorporate ambiguity and to leave certain terms open to multiple definitions.

Seen from a humanist’s point of view, big data classifications feign objectivity insofar as they help us forget their political, cultural, moral, or social origins, as well as their constructedness. It lies beyond the tasks of the KPLEX project to analyze big data classifications for their implications; but if anybody is aware of relevant studies, references are most welcome. What we all can do, however, is constantly remind ourselves that there is no such thing as an unambiguous, uniform classification system implemented in big data.

[1] Geoffrey C. Bowker, Susan Leigh Star, Sorting Things Out. Classification and Its Consequences, Cambridge, MA / London, England: The MIT Press 1999, pp. 10-11.

[2] Jacques Derrida, Of Grammatology. Translated by Gayatri Chakravorty Spivak, Baltimore and London: The Johns Hopkins University Press 1998, chapter “The Battle of Proper Names”, p. 120.

[3] Michel Foucault, Madness and Civilization: History of Insanity in the Age of Reason, translated by Richard Howard, New York: Vintage Books 1988 (first print 1965), pp. 244ff.

[4] Mary Douglas, Purity and Danger. An Analysis of the concepts of pollution and taboo, London and New York: Routledge 2001 (first print 1966), p. 22.

Ode to Spot: We need to talk about Data (and narrative).

We need to talk about data. And narrative. In fact, data and narrative need to talk to each other, work some issues out, attend relationship counselling, try to recapture that “spark,” that “special something” that kept bringing them together, that has made them, at times, seem inseparable, but also led to some pretty fiery clashes.

So, what’s the deal? What is the relationship between data and narrative? What role does narrative play in our use of data? What role does data play in our fashioning of narrative? How much of what we have to say about each is determined by pre-established notions we have about either one of these entities? Why did I instinctively opt, for example, while writing the previous two sentences, to refer to data as something that is “used” and narrative as something that is “fashioned”? And, further still, is it correct to refer to them as wholly distinct? Can we have a narrative that is bereft of data? And are data or datasets wholly bereft of narrative?

Data and narrative are presented by some as being irreconcilable or antithetical. Lev Manovich presents them as “natural enemies”[1] whereas Jesse Rosenthal, speaking in the context of the relationship between fictional narratives and data, observes how “the novel form itself is consistently adept at expressing the discomfort that data can produce: the uncertainty in the face of a central part of modern existence that seems to resist being brought forward into understanding.”[2] Todd Presner argues that data and narrative exist in a proto-antagonistic relationship wherein narrative begat data, and data begat narrative. I use antagonistic here in the sense of musculature, with the relationship between narrative and data being analogous to an antagonistic muscle pair: you cannot fully flex your biceps and your triceps at the same time, because for one to flex, the other must relax or straighten.

Presner situates database and narrative as being at odds or somehow irreconcilable

because databases are not narratives […] they are formed from data (such as keywords) arranged in relational tables which can be queried, sorted, and viewed in relation to tables of other data. The relationships are foremost paradigmatic or associative relations […] since they involve rules that govern the selection or suitability of terms, rather than the syntagmatic, or combinatory elements, that give rise to narrative. Database queries are, by definition, algorithms to select data according to a set of parameters.[3]
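Presner’s “algorithms to select data according to a set of parameters” can be sketched concretely. A toy Python example using SQLite (the table and keywords are invented for illustration): the query applies rules of suitability, not narrative sequence.

```python
import sqlite3

# A toy relational table of the kind Presner describes: keywords in rows,
# queried by parameters rather than combined into a story.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE testimony (id INTEGER, keyword TEXT, year INTEGER)")
con.executemany("INSERT INTO testimony VALUES (?, ?, ?)",
                [(1, "deportation", 1942), (2, "liberation", 1945),
                 (3, "deportation", 1944)])

# A paradigmatic selection: rows that satisfy the parameters, in no
# narrative order -- the syntagmatic, combinatory work is left to the reader.
rows = con.execute("SELECT id FROM testimony WHERE keyword = ? AND year < ?",
                   ("deportation", 1945)).fetchall()
print(rows)
```

The result set is a selection, not a story; turning those rows into a narrative is precisely the interpretive step the database does not perform.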

So databases are not narratives, and while narratives can contain (or be built on) data, they are not data proper. This means there is a continual transference between data and narrative in either direction, a transference that is all the more explicit and controversial in the transition from an analogue to a digital environment. This transition, the extraction of data from narrative, or the injection of data into narrative, is a process that has significant ethical and epistemological implications:

The effect is to turn the narrative into data amenable to computational processing. Significantly, this process is exactly the opposite of what historians usually do, namely to create narratives from data by employing source material, evidence, and established facts into a narrative.[4]

Rosenthal also presents data and narrative as operating in this interrelated but tiered manner, with narrative being built on data, or data serving as the building blocks of narrative. And while Rosenthal focuses on fictional narratives, this is the case irrespective of whether the narrative in question is fictional or non-fictional because, after all, non-fictional narrative is still narrative.[5] Whereas Presner focuses on the complications surrounding the relationship between narrative and data in digital environments, Rosenthal’s engagement is more open to and acknowledging of the established and dynamic nature of the relationship between narrative and data in literature: “Narrative and data play off against each other, informing each other and broadening our understanding of each.”[6]

Data and narrative could be said to exist in a dynamic, dyadic relationship then. Indeed, N. Katherine Hayles argues that data and narrative are symbiotic and should be seen as “natural symbionts.”[7] So their relationship is symbiotic, rather than antagonistic; they intermingle and their relationship is mutually beneficial, with data perhaps adding credence to narrative (fictional or otherwise) and narrative helping us understand data by making clear to us what the data is saying, or has the capacity to say (in the eyes of the person working with it). That said, if they are symbionts, what is the ratio of their intermingling? Is it possible for a narrative to become data-heavy or data-saturated? Does this impede the narrative from being narrative? Would a data-driven narrative read something along the lines of Data’s poem to his pet cat Spot from Star Trek: The Next Generation (TNG), season six, episode five:

Ode To Spot

Felis catus is your taxonomic nomenclature,
An endothermic quadruped, carnivorous by nature;
Your visual, olfactory, and auditory senses
Contribute to your hunting skills and natural defenses.
I find myself intrigued by your subvocal oscillations,
A singular development of cat communications
That obviates your basic hedonistic predilection
For a rhythmic stroking of your fur to demonstrate affection.
A tail is quite essential for your acrobatic talents;
You would not be so agile if you lacked its counterbalance.
And when not being utilized to aid in locomotion,
It often serves to illustrate the state of your emotion.
O Spot, the complex levels of behaviour you display
Connote a fairly well-developed cognitive array.
And though you are not sentient, Spot, and do not comprehend,
I nonetheless consider you a true and valued friend.

This “ode” is an example of a piece of writing so data-laden[8] that the momentum of the narrative is hampered, or rather the lyricism necessary to bring Data’s sweet ode to his cat into Shakespeare territory is seriously lacking. And what do I mean by lyricism? Well, the answer to that is relatively simple, just take a look at Shakespeare’s “Sonnet 18”:

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st,
Nor shall death brag thou wand’rest in his shade,
When in eternal lines to Time thou grow’st.
     So long as men can breathe, or eyes can see,
     So long lives this, and this gives life to thee.

In contrast to the Data-ode, and to the lyrical Shakespearian ode, would a narrative that is almost entirely bereft of data (and arguably also bereft of narrative, but let’s not go there) read something like a Trump rally speech?

A few days ago I called the fake news the enemy of the people. And they are. They are the enemy of the people.

(APPLAUSE)

Because they have no sources, they just make ’em up when there are none. I saw one story recently where they said, “Nine people have confirmed.” There’re no nine people. I don’t believe there was one or two people. Nine people.

And I said, “Give me a break.” Because I know the people, I know who they talk to. There were no nine people.

But they say “nine people.” And somebody reads it and they think, “Oh, nine people. They have nine sources.” They make up sources.

They’re very dishonest people. In fact, in covering my comments, the dishonest media did not explain that I called the fake news the enemy of the people. The fake news. They dropped off the word “fake.” And all of a sudden the story became the media is the enemy.

They take the word “fake” out. And now I’m saying, “Oh no, this is no good.” But that’s the way they are.

So I’m not against the media, I’m not against the press. I don’t mind bad stories if I deserve them.

And I tell ya, I love good stories, but we don’t go…

(LAUGHTER)

I don’t get too many of them.

But I am only against the fake news, media or press. Fake, fake. They have to leave that word.

I’m against the people that make up stories and make up sources.

They shouldn’t be allowed to use sources unless they use somebody’s name. Let their name be put out there. Let their name be put out.

(APPLAUSE)

“A source says that Donald Trump is a horrible, horrible human being.” Let ’em say it to my face.

(APPLAUSE)

Let there be no more sources.[9]

And now, by way of apology and for some brief respite, I offer you a meme of Data.

But are these our only options? Are narrative and data really at odds in this way? Is there a way to reconcile narrative and the database? Perhaps it is time to stop thinking of data and narrative as being at odds with each other; perhaps it is necessary to break down this dyad and facilitate better integration?

Traditionally, narrative-driven criticism took the form of “retelling,” what Rosenthal calls an “artful,” or “opinionated reshaping” of the underlying evidence (aka the data), whereas more contemporaneous data-driven criticism largely takes the form of visualisations that attempt to, as Rosenthal puts it, “let the data speak for itself, without mediation.”[10] This turn to the visual is driven by a hermeneutic belief akin to Ellen Gruber Garvey’s assertion that “Data will out.”[11] But this is something of a contradiction in terms, considering that elsewhere we are told (by Daniel Rosenberg) that data has a “pre-analytical, pre-factual status,”[12] that data is an entity “that resists analysis,”[13] but can also be “rhetorical,”[14] that “False data is data nonetheless”[15] and that “Data has no truth. Even today, when we speak about data, we make no assumptions about veracity.”[16] Borgman elaborates, stating that “Data are neither truth nor reality. They may be facts, sources of evidence, or principles of an argument that are used to assert truth or reality.”[17] That’s a lot of different data on data.

Fictional narratives can be built on supposedly reputable data, which helps the reader to suspend their disbelief and “believe in” the fictions they encounter within the narrative. Supposedly non-fictional narratives, such as presidential speeches, can be based on tenuously obtained, fabricated data, or can make reference to data that is not proffered, and may not even exist, rather like dressing a corgi up in a suit and asking it for a political manifesto.

What we’ve looked at today concerns our evolving discomfort with, and difficulty in outlining, the relationship between narrative and data. In the interplay between analogue and digital, different factions emerge regarding the relationship between data and narrative, with narrative and data variously presented as being antithetical, antagonistic, or symbiotic; with data presented as something one can be either “for” or “against”; and with distinct preferences for one or the other (either narrative or data) being shown at a discipline-specific, researcher-specific, author-specific level. At the same time, irrespective of which of these positions you adopt, it is clear that data and narrative are intricately linked and deeply necessary to each other. The question, then, is: how can one best facilitate and elucidate the other in a digital environment?

[1] Lev Manovich, The Language of New Media, 2002, 228.

[2] Jesse Rosenthal, “Introduction: ‘Narrative against Data,’” Genre 50, no. 1 (April 1, 2017): 2., doi:10.1215/00166928-3761312.

[3] Presner, in Fogu, Claudio, Kansteiner, Wulf, and Presner, Todd, Probing the Ethics of Holocaust Culture, History Unlimited (Cambridge: Harvard University Press, 2015), http://www.hup.harvard.edu/catalog.php?isbn=9780674970519.

[4] Presner, in ibid.

[5] “Yet the narrative relies for its coherence on our unexamined belief that a preexisting series of events underlies it. While data depends on a sense of irreducibility, narrative relies on a fiction that it is a retelling of something more objective. […] The coherence of the novel form, then, depends on making us believe that there is something more fundamental than narrative.” Rosenthal, “Introduction,” 2–3.

[6] Ibid., 4.

[7] N. Katherine Hayles, “Narrative and Database: Natural Symbionts,” PMLA 122, no. 5 (2007): 1603.

[8] And of course it’s data-laden, it was composed by Data, a Soong-type android, so basically a walking computer, albeit a mega-advanced one.

[9] “Transcript of President Trump’s CPAC Speech,” http://time.com/4682023/cpac-donald-trump-speech-transcript/

[10] Rosenthal, “Introduction,” 4.

[11] Ellen Gruber Garvey, “‘facts and FACTS:’ Abolitionists’ Database Innovations,” Gitelman, “Raw Data” Is an Oxymoron, 90.

[12] Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” is an Oxymoron, 18.

[13] Rosenthal, “Introduction,” 1.

[14] Rosenberg, “Data before the Fact,” in Gitelman, “Raw Data” Is an Oxymoron, 18.

[15] Rosenberg, “Data before the Fact,” in ibid.

[16] Rosenberg, “Data before the Fact,” in ibid., 37.

[17] Christine L. Borgman, Big Data, Little Data, No Data: Scholarship in the Networked World, Cambridge, MA: The MIT Press 2015, p. 17, https://mitpress.mit.edu/big-data-little-data-no-data.

Will Big Data Render Theory Dispensable? (Part II)

There can be no doubt that the availability of big data and the emergence of big data analytics challenges established epistemologies across the academy.  Debates about the datafication of humanities’ research materials – including debates about whether digitisation projects should be countenanced at all – are disputes about the shape of our culture which challenge our understanding of human knowledge and of knowledge production systems.

In a recent article on the “big data” turn in literary studies, Jean-Gabriel Ganascia addresses the epistemological status of the Digital Humanities, asking whether they are “Sciences of Nature” or “Sciences of Culture”.[1] Regarding similar concerns, Matthew L. Jockers, a proponent of techniques of quantitative text analysis of the literary record, writes: “(e)nough is enough. The label does not matter. The work is what matters, the outcomes and what they tell us about… are what matters.”[2] But labels do matter; they matter a great deal, and Ganascia’s question is an important one, the answer to which reveals much about the underlying epistemology of Digital Humanities research.

As the basis of all information and knowledge, data are often defined as facts that exist irrespective of our state of knowing. Rob Kitchin has addressed the explosion of big data and an emergent fourth paradigm in science, which results from such definitions of data as the sum total of all information and knowledge. This fourth paradigm, labelled New Empiricism or Empiricism Reborn, claims that big data and new data analytics usher in a new era of knowledge production free from theory, human bias or framing, and accessible to anyone with sufficient visual literacy. This is the position advocated by Chris Anderson, former editor-in-chief at Wired magazine.[3]

However, as Kitchin observes, data do not arise in a vacuum: “all data provide oligoptic views of the world: views from certain vantage points, using particular tools, rather than an all-seeing, infallible God’s eye view”. Whilst the end of theory may appear attractive to some, the empiricist model of knowledge creation is based on fundamental misconceptions about the formation and analysis of computational data.

An alternative to these new forms of empiricism is labelled “data-driven science”, which Kitchin describes as follows: “more open to using a hybrid combination of abductive, inductive and deductive approaches to advance the understanding of a phenomenon… it forms a new mode of hypothesis generation before a deductive approach is employed. Nor does the process of induction arise from nowhere, but is situated and contextualized within a highly evolved theoretical domain.”

So, when Jockers writes that those who study texts must follow the example of science and scale their methods to the realities of big data, which method (or label) is he referring to? One might infer, from the reference to Anderson’s 2008 article in the opening pages of his influential Macroanalysis: Digital Methods & Literary History, that it is the former, empiricist model. However, in the article referred to in the opening lines of this post, Jockers calls his approach “data-driven”. Are we to assume that this is a reference to the data-driven scientific method? Rightly or wrongly, knowledge production is not impartial, and we must always be mindful of the epistemologies to which we align ourselves, as they will often reveal as much about our research as the information or data contained therein.

If Digital Humanists are said to be “nice” because we eschew theoretical debates in favour of the more easily resolved methodological ones, then I’d rather not be thought of as a “nice” Digital Humanist.  I’d rather be responsible, argumentative or even wrong.  But nice? Biscuits are nice.

[1] Ganascia, J., (2015) “The Logic of the Big Data Turn in Digital Literary Studies.” In Frontiers in Digital Humanities, 02 Dec. 2015 

[2] Jockers, M., (2015) “Computing Ireland’s Place in the Nineteenth-Century Novel: A Macroanalysis.” In Breac: A Digital Journal of Irish Studies, 07 Oct. 2015.

[3] Kitchin, R., (2014) “Big Data, New Epistemologies and Paradigm Shifts.” In Big Data & Society, April—June 2014: 1-12.