Big Data, Little Data, Fabulous Data.

Suzanne Briet, in What is documentation?, says that “Documentography is the enumeration and description of diverse documents.”[1] Slightly modified, and paired with a nice little neologism (who can resist neologisms?), I could describe the work I am doing at this stage in the KPLEX Project as “Datamentography.” Datamentography, of course, meaning the enumeration and description of diverse data. I’m working to establish what it is we talk about when we talk about data: the established conceptions of what data is among different communities, and the why, how and where that led to the development of the various understandings and conceptions of data active today. Once this has been established, we can use these findings to move towards a new conceptualisation of data.

One of the other passages in Briet that I really like (because it is quite poetic and this is perhaps a little unexpected in a text about documentation standards) is as follows:

Is a star a document? Is a pebble rolled by a torrent a document? Is a living animal a document? No. But the photographs and the catalogues of stars, the stones in a museum of mineralogy, and the animals that are catalogued and shown in a zoo, are documents.[2]

So, this nice descriptive passage outlines a key distinction: the thing itself is not a document, but the material traces of its interactions with humans are; the photos, the specimens, the catalogues, the records (visual, audio, and so on). But how do we capture the richness of these items in a computerised environment? A pebble is one thing, as is a stone in a museum of mineralogy. How to capture the pebble rolling in a torrent? And how to do so in a manner that does not subject the material to an interpretative act that alters how future scholars and researchers approach these records? If all the future scholar can “see” in the online repository is what the person responsible for compiling the repository considered to be important (or codeable) then their interpretative sphere is corrupted (if that is not too dramatic a word) from the outset.

Choice emerges as an implicit facet in this distinction; irrespective of how objective we think we are being, the act of collating information is implicitly subjective. What one person identifies as important (as worthy of documenting, as data), may appear wholly unimportant to someone else, and vice versa. Taste and preference are fine in day-to-day life (“You say tomato, I say tomahto… You say potato, I say vodka…”), but when these inherently human and therefore unavoidable subjective tendencies are let loose on humanities repositories, then a hierarchy is imposed on knowledge that reflects the subjective choices of the person who has classified or codified them.

Further still, encoding the thing-ness of things is difficult. In a society that increasingly values and prioritises codified data, if what is readily codeable is prioritised without corresponding measures taken to account for the facets of human records and experiences that do not lend themselves so readily to codification, we encounter a scenario wherein that which is not as readily codeable is left out, neglected or even forgotten.

Now, people have gone about defining data in a number of different ways, and almost all are at least a little problematic. Christine Borgman, in her book chapter “What Are Data?” from Big Data, Little Data, No Data, uses an example from the great Argentinian writer Jorge Luis Borges to explain why defining data by example is unsatisfactory. In his essay “The Analytical Language of John Wilkins,” Borges presents us with a taxonomy of animals in the form of a Chinese encyclopedia, Celestial Emporium of Benevolent Knowledge. In this taxonomy we encounter the following classifications:

a) belonging to the emperor, b) embalmed, c) tame, d) sucking pigs, e) sirens, f) fabulous, g) stray dogs, h) included in the present classification, i) frenzied, j) innumerable, k) drawn with a very fine camelhair brush, l) et cetera, m) having just broken the water pitcher, n) that from a long way off look like flies.

Of course, this list is somewhat absurd, and its absurdity is what makes it funny and what makes Borges so brilliant; but this absurdity should not belie the critique of taxonomic practices that lies at the heart of this so-called “emporium of benevolent knowledge.” Let’s take a closer look.

Embalmed animals are included because someone once identified them as worthy of embalming, and because the act of being embalmed somehow signified something that was worth documenting (in the form of putting the sad creature in a jar of formaldehyde; or rather, of making a record of the fact that this creature has been stored in formaldehyde). Similarly, some animals are included merely because they are already in the system (“included in the present classification”): they are already there, and it is easy to carry them over and keep them incorporated. In this way long-established practices are maintained simply because they are long established, and not necessarily because they are effective (hello metadata, you cheeky old fox).

In What is documentation? Briet charts the sad odyssey of an “antelope of a new kind […] encountered in Africa by an explorer who has succeeded in capturing an individual that is then brought back to Europe for our Botanical Garden”[3]:

A press release makes the event known by newspaper, by radio, and by newsreels. The discovery becomes the topic of an announcement at the Academy of Sciences. A professor at the Museum discusses it in his courses. The living animal is placed in a cage and catalogued (zoological garden). Once it is dead, it will be stuffed and preserved (in the Museum). It is loaned to an Exposition. It is played on a soundtrack at the cinema. Its voice is recorded on a disk. The first monograph serves to establish part of a treatise with plates, then a special encyclopaedia (zoological), then a general encyclopaedia. The works are catalogued in a library; after having been announced at publication [et cetera].[4]

So we have the genesis here of the thing itself (a “pure or natural object with an essence of [its] own”[5]) from its capture and discovery through to its death, when it is stuffed (akin to Borges’s embalming) and a process is initiated wherein it is catalogued and subjected to extensive “documentography” according to the established taxonomy of “Homo documentator.”[6] “Homo documentator” creates detailed portraits of the creature (perhaps using a very fine camelhair brush as in Borges’s encyclopaedia) for inclusion in the classificatory system, records its unique markings, the sound of its voice, whatever aspects of the creature’s essence can be readily captured. Once in the system, whether by means of artistic plates outlining specifics of the species, or in the form of photographs and sound recordings, and so on, it becomes a de facto document (de facto data), and its documentability is exhausted only when the taxonomical system employed by the documentalist has itself been exhausted.

But who has designed this taxonomical system? Who is responsible for deciding what facets of the antelope are important and what not? Are items perhaps ever considered important solely because they are easy to document? And who is to say that this same sad and now stuffed antelope could not also be classified as fabulous (or, once fabulous, had you encountered it in the wild)? Further still, surely this creature, like all creatures when viewed from a certain perspective, could be included in Borges’s category for animals that “from a long way off look like flies.” The point is that the system of taxonomy is not objective, and our conceptions of the facets that are important or unimportant can be and have been influenced by the hierarchies imposed upon them by the person or persons responsible for compiling them.

Borgman, in “What Are Data?”, refers us to the Open Archival Information System (OAIS) for a definition of data that, once again, uses examples:

Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.[7]

This, like most definitions of data, seems relatively reasonable at first: naturally the characters on a page are going to qualify as data, and so if they do, they are or can be encoded as such. But what about the page itself? What about the materials on the page that do not qualify as characters? What about doodles? Pen-tests? Scribbles, drawings, additions and other contextually specific paralipomena? How do we encode these? And, if we decide not to, why do we decide not to, and who, if anyone, holds us accountable for that decision? Because such a decision, inconsequential though it may seem at first, could affect and limit future scholars.

And this is what I am attempting to tease out as part of my contribution to the KPLEX Project: why and how did certain conceptions of data become acceptable or dominant in certain circles, and, going forward, as we move towards bigger data infrastructures for the humanities, is there a way for us to ensure that the thing itself, in all its complex idiosyncratic fabulousness, remains visible, and available to the researcher?

The Featured Image was borrowed from Natascha Schwarz’s illustrated edition of Borges’s Book of Imaginary Beings.

