Beyond a binary conception of data

An awful lot of data exists in tables, the columns (variables) and rows (observations) filled with numerical values. Numerical values are binary in a basic sense: either a cell has a certain value or it does not (in the latter case it is an NA or NULL). Computability is based on information stored in binary fashion, coded as either 0 or 1. Statistics is based on this binary logic, even where nominal data are in use. Nominal variables such as gender, race, color, and profession can be measured only in terms of whether individual items belong to distinctively different categories. More precisely: they belong to these categories or they do not.

A large amount of data, especially on the Internet, consists of unstructured text data (emails, WhatsApp messages, WordPress posts, tweets, etc.). Can texts – or other cultural data like images, works of art, or music – be adapted to a binary logic? In principle yes, as the example of nominal variables shows: either a word (or an image …) belongs to a certain category or it does not. Quite a good part of Western thinking follows a binary logic; classical structuralism, for example, is fond of structuring oppositions: good – bad, pure – dirty, raw – cooked, orthodox – heretic, avant-garde – arrière-garde, up-to-date – old-fashioned, etc. The point here is that one has to be careful about which domain these binary oppositions belong to: good – bad is not the same as good – evil.

But the habit of thinking in binary terms narrows the perspective; using data according to a binary logic means a reduction. This is particularly evident with respect to texts: the meaning of individual words changes with their context. Another example: a smile can be an expression of sympathy or of uncertainty in the Western world, while in other cultures it may express aggression, confusion, sadness, or a social distancing from the other. What can be seen as a ‘smile’ in monkeys is most often an expression of fear – a baring of the canines.

Indian logic provides an example of how to go beyond a binary logic: it includes a figure called the “tetralemma”. While binary systems are based on calculations with 0 and 1 and therefore formulate a dilemma, the tetralemma provides four possible answers to any logical proposition: beyond the logics of 0 and 1, there is both 0 and 1, and neither 0 nor 1. One can even conceive of a fifth answer: none of these at all. Put as a graph and expressed mathematically, the tetralemma would look like this:
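Written out for a proposition P, the four answers (plus the fifth) read as follows – a sketch reconstructed from the description above, not the original figure:

```latex
% The four positions of the tetralemma for a proposition P,
% with the fifth answer ("none of these") as position (5):
\begin{align*}
\text{(1)}\quad & P                      && \text{``1''}\\
\text{(2)}\quad & \neg P                 && \text{``0''}\\
\text{(3)}\quad & P \land \neg P         && \text{both 0 and 1}\\
\text{(4)}\quad & \neg(P \lor \neg P)    && \text{neither 0 nor 1}\\
\text{(5)}\quad & \text{none of (1)--(4)}
\end{align*}
```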


The word “dawn” is an example of what is at stake in the tetralemma: depending on how you define its meaning, it can be a category of its own, or not fit into any category (because of its ambivalent character); it is both day and night, and it is neither sunshine nor darkness.

One of the few philosophers to point to the narrowing of logics to binary oppositions (TRUE / FALSE) and to underline the many possibilities in language games was Jean-François Lyotard, in his main work “Le Différend” (Paris: Minuit 1983). In information science, it is only more recently that complex approaches have been developed beyond binary systems, which allow for an adequate coding of culture, emotions, or human communications. The best examples are ontologies; they can be understood as networks of entities, whose relations can be defined in multiple ways. A person can at the same time be a colleague in a team, a partner in a company, and the father of another person working in the same company (a visualization of the “friend of a friend” ontology can be found here). Datafication of human signs, be they linguistic, artistic, or part of the historical record, therefore exposes the challenges of data production in particularly evident ways.
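The multi-relational character of such an ontology can be sketched in a few lines – the entity names and relation labels here are hypothetical illustrations, not terms from the FOAF vocabulary itself:

```python
# A minimal sketch of an ontology as a network of entities whose
# relations can be defined in multiple ways: the same pair of
# entities may be linked by several typed edges at once.
relations = [
    ("anna", "colleague_of", "ben"),   # same team
    ("anna", "partner_in",   "acme"),  # partner in the company
    ("anna", "parent_of",    "ben"),   # family tie inside the company
]

def relations_between(subject, obj, triples):
    """Return every relation type linking subject to obj."""
    return [rel for (s, rel, o) in triples if s == subject and o == obj]

# One pair of entities, two simultaneous relations - something a single
# binary category cannot express:
print(relations_between("anna", "ben", relations))
```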

Big Data, Little Data, Fabulous Data

Suzanne Briet, in What is documentation?, says that “Documentography is the enumeration and description of diverse documents.”[1] Slightly modified, and paired with a nice little neologism (who can resist neologisms?), I could describe the work I am doing at this stage in the KPLEX Project as “Datamentography.” Datamentography, of course, meaning the enumeration and description of diverse data. I’m working to establish what it is we talk about when we talk about data; the established conceptions of what data is among different communities, and the why, how and where that led to the development of the various understandings and conceptions of data active today. Once this has been established, we can use these findings to move towards a new conceptualisation of data.

One of the other passages in Briet that I really like (because it is quite poetic and this is perhaps a little unexpected in a text about documentation standards) is as follows:

Is a star a document? Is a pebble rolled by a torrent a document? Is a living animal a document? No. But the photographs and the catalogues of stars, the stones in a museum of mineralogy, and the animals that are catalogued and shown in a zoo, are documents.[2]

So, this nice descriptive passage outlines a key distinction: the thing itself is not a document, but the material traces of its interactions with humans are; the photos, the specimens, the catalogues, the records (visual, audio, and so on). But how do we capture the richness of these items in a computerised environment? A pebble is one thing, as is a stone in a museum of mineralogy. How to capture the pebble rolling in a torrent? And how to do so in a manner that does not subject the material to an interpretative act that alters how future scholars and researchers approach these records? If all the future scholar can “see” in the online repository is what the person responsible for compiling the repository considered to be important (or codeable), then their interpretative sphere is corrupted (if that is not too dramatic a word) from the outset.

Choice emerges as an implicit facet in this distinction; irrespective of how objective we think we are being, the act of collating information is implicitly subjective. What one person identifies as important (as worthy of documenting, as data) may appear wholly unimportant to someone else, and vice versa. Taste and preference are fine in day-to-day life (“You say tomato, I say tomahto… You say potato, I say vodka…”), but when these inherently human and therefore unavoidable subjective tendencies are let loose on humanities repositories, a hierarchy is imposed on knowledge that reflects the subjective choices of the person who has classified or codified the items.

Further still, encoding the thing-ness of things is difficult. In a society that increasingly values and prioritises codified data, if what is readily codeable is prioritised without concordant measures taken to account for the facets of human records and experiences that do not lend themselves so readily to codification, we encounter a scenario wherein that which is not as readily codeable is left out, neglected, or even forgotten.

Now, people have gone about defining data in a number of different ways, and almost all are at least a little problematic. Christine Borgman, in her book chapter “What are Data?” from Big Data, Little Data, No Data, uses an example from the great Argentinian writer Jorge Luis Borges to explain why defining data by example is unsatisfactory. In his essay “The Analytical Language of John Wilkins”, Borges presents us with a taxonomy of animals in the form of a Chinese encyclopedia, Celestial Emporium of Benevolent Knowledge. In this taxonomy we encounter the following classifications:

a) belonging to the emperor, b) embalmed, c) tame, d) sucking pigs, e) sirens, f) fabulous, g) stray dogs, h) included in the present classification, i) frenzied, j) innumerable k) drawn with a very fine camelhair brush, l) et cetera m) having just broken the water pitcher, n) that from a long way off look like flies.

Of course, this list is somewhat absurd, and its absurdity is what makes it funny and what makes Borges so brilliant; but this absurdity should not belie the critique of taxonomic practices that lies at the heart of this so-called “emporium of benevolent knowledge.” Let’s take a closer look.

Embalmed animals are included because someone once identified them as worthy of embalming, and because the act of being embalmed somehow signified something worth documenting (in the form of putting the sad creature in a jar of formaldehyde; or rather, of making a record of the fact that this creature has been stored in formaldehyde). Similarly, some animals are included merely because they are already in the system (“included in the present classification”) – simply because they are already there and it is easy to carry them over and keep them incorporated; in this way long-established practices are maintained simply because they are long established, and not necessarily because they are effective (hello metadata, you cheeky old fox).

In What is documentation? Briet charts the sad odyssey of an “antelope of a new kind […] encountered in Africa by an explorer who has succeeded in capturing an individual that is then brought back to Europe for our Botanical Garden”[3]:

A press release makes the event known by newspaper, by radio, and by newsreels. The discovery becomes the topic of an announcement at the Academy of Sciences. A professor at the Museum discusses it in his courses. The living animal is placed in a cage and catalogued (zoological garden). Once it is dead, it will be stuffed and preserved (in the Museum). It is loaned to an Exposition. It is played on a soundtrack at the cinema. Its voice is recorded on a disk. The first monograph serves to establish part of a treatise with plates, then a special encyclopaedia (zoological), then a general encyclopaedia. The works are catalogued in a library; after having been announced at publication [et cetera].[4]

So we have the genesis here of the thing itself—a “pure or natural object with an essence of [its] own”[5]—from its capture and discovery through to its death, when it is stuffed (akin to Borges’s embalming) and a process is initiated wherein it is catalogued and subjected to extensive “documentography” according to the established taxonomy of “Homo documentator.”[6] “Homo documentator” creates detailed portraits of the creature (perhaps using a very fine camelhair brush, as in Borges’s encyclopaedia) for inclusion in the classificatory system, records its unique markings, the sound of its voice, whatever aspects of the creature’s essence can be readily captured. Once in the system, whether by means of artistic plates outlining specifics of the species, or in the form of photographs and sound recordings, and so on, it becomes a de facto document (de facto data), and its documentability is exhausted only when the taxonomical system employed by the documentalist has itself been exhausted.

But who has designed this taxonomical system? Who is responsible for deciding which facets of the antelope are important and which are not? Are items perhaps ever considered important solely because they are facile to document? And who is to say that this same sad and now stuffed antelope could not also be classified as fabulous (or, once, fabulous, had you encountered it in the wild)? Further still, surely this creature, like all creatures when viewed from a certain perspective, could be included in Borges’s category for animals that “from a long way off look like flies.” The point is that the system of taxonomy is not objective, and our conceptions of which facets are important or unimportant can be, and have been, influenced by the hierarchies imposed upon them by the person or persons responsible for compiling them.

Borgman, in “What Are Data?”, refers us to the Open Archival Information System (OAIS) for a definition of data that, once again, uses examples:

Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.[7]

This, like most definitions of data, seems relatively reasonable at first: naturally the characters on a page are going to qualify as data, and so, if they do, they are or can be encoded as such. But what about the page itself? What about the materials on the page that do not qualify as characters? What about doodles? Pen-tests? Scribbles, drawings, additions, and other contextually specific paralipomena? How do we encode these? And, if we decide not to, why do we decide not to, and who, if anyone, holds us accountable for that decision? Because such a decision, inconsequential though it may seem at first, could affect and limit future scholars.

And this is what I am attempting to tease out as part of my contribution to the KPLEX Project: why and how did certain conceptions of data become acceptable or dominant in certain circles, and, going forward, as we move towards bigger data infrastructures for the humanities, is there a way for us to ensure that the thing itself, in all its complex idiosyncratic fabulousness, remains visible, and available to the researcher?

[1] Briet et al., What Is Documentation?, 24.

[2] Ibid., 10.

[3] Ibid., 10.

[4] Ibid.

[5] Borgman, “Big Data, Little Data, No Data,” 18.

[6] Briet et al., What Is Documentation?, 29.

[7] Borgman, “Big Data, Little Data, No Data,” 20, emphasis in original.

The Featured Image was borrowed from Natascha Schwarz’s illustrated edition of Borges’s Book of Imaginary Beings.

Is there an identity crisis of statistics?

It is not without irony that statistics currently seems to be living through an identity crisis, since this discipline is often named the “science of uncertainty”. If there is an identity crisis at all, in what way can it be conceived of? And why did this crisis come now – is there a nexus between the rise of ‘big data’ and algorithms and the development of the discipline? Three developments can be identified that make the hypothesis of an identity crisis of statistics more tangible.

  • A crisis of legitimacy: Statistics’ findings result from value-free and transparent methods; but this is nothing (and maybe never has been anything) which can easily be communicated to the broad public. Hence, in the age of ‘big data’, a loss of legitimacy: the more complex the collection of data, the statistical methods, and the presented results are, the more disorienting an average person may find them, especially given the steep increase in the sheer quantity of statistical findings. Even for citizens who are willing to occupy themselves with data, statistics, and probability algorithms, Nobel Prize laureate Daniel Kahneman has underlined the counterintuitive and intellectually challenging character of statistics (see his book “Thinking, Fast and Slow” or the classic article “Judgment under Uncertainty”). These peculiarities lower the general public’s trust in the discipline: “Whom should I believe?”
  • Crisis of expertise: Statistics has become part of a broad range of scientific disciplines, far beyond mathematics. But the acquisition of competence in statistics quite obviously has its limits. As Gerd Gigerenzer pointed out already 13 years ago, “mindless statistics” has become custom and ritual in sciences such as psychology. In recent years, this crisis of expertise has been termed the crisis of reproducibility (for data from a previous publication) or replicability (for data from an experiment); the renowned journal “Nature” devoted a 2014 article series to this problem, with a focus on, for example, the use of p-values in scientific arguments. The report of the 2013 London Workshop on the Future of the Statistical Sciences is outspoken on this problem, and there is even a Wikipedia article on the crisis of replicability. Statisticians defend themselves by pointing to these scientists’ lack of training in statistics and computation [see Jeff Leek’s recent article here], but quite obviously this crisis of expertise undermines the credibility of scientists as experts.
  • Crisis of the societal function of the discipline: Statistics as a scientific discipline established itself alongside the rise of nation-states; hence its close connection to national economies and to data collected across large populations. As explained in a “Guardian” article posted earlier in this blog, statistics served as the basis of “evidence-based policy”, and statisticians were seen by politicians as a caste of scientific pundits and consultants. But this has changed completely: nowadays big data are the assets of globalised companies which act across the borders of nation-states. This points to a shift in the core societal function of statistics, no longer serving politics and hence the nation, but global companies and their interests: statistics leaves representative democracy behind, and it has become unclear how the benefits of digital analytics might ever be offered to the public. Even if the case is still obscure, the possible role of “Cambridge Analytica” in the U.S. presidential election campaign shows that the privatisation of expertise can be turned against the interests of a nation’s citizens.

Data scales in applied statistics – are nominal data poor in information?

An earlier blog post by Jennifer Edmond on “Voluptuousness: The Fourth ‘V’ of Big Data?” focused on cultural data steeped in meaning; proverbs, haikus, and poetic language are amongst the best examples of this kind of data.

But computers are not (yet) good at understanding human languages. Nor is statistics – it simply conceives of words as nominal variables. Quite in contrast to an understanding of cultural data as described by Jennifer Edmond, applied statistics regards nominal variables as the ones with the LEAST density of information. This becomes obvious when these data are classified amongst other variables in scales of measurement. The scale (or level) of measurement refers to assigning numbers to a characteristic according to a defined rule. The particular scale of measurement a researcher uses determines what types of statistical procedures (or algorithms) are appropriate. It is important to understand the nominal, ordinal, interval, and ratio scales; for a short introduction to the terminology, follow this link. Seen in this context, nominal variables belong to qualitative data and are classified into distinctively different categories; in contrast to ordinal variables, these categories cannot be quantified or even ranked. The functions of the different scales are shown in the following graph:



Here it becomes visible that words are classified as nominal variables; they belong to distinctively different categories, but those categories cannot be quantified or even ranked; there is no meaningful order of choice.
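The distinction can be sketched as a table of admissible operations per scale – a summary of the standard nominal/ordinal/interval/ratio terminology, with illustrative names:

```python
# Which comparisons each scale of measurement admits: equality for
# nominal, ordering for ordinal, differences for interval, and ratios
# only for ratio data with a true zero point.
ADMISSIBLE = {
    "nominal":  {"equality"},
    "ordinal":  {"equality", "order"},
    "interval": {"equality", "order", "difference"},
    "ratio":    {"equality", "order", "difference", "ratio"},
}

def allows(scale, operation):
    """True if the given operation is meaningful on the given scale."""
    return operation in ADMISSIBLE[scale]

print(allows("nominal", "order"))   # words as nominal data cannot be ranked
print(allows("ratio", "ratio"))     # e.g. "twice as long" is meaningful
```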

As a consequence, in order to be able to compute with words, numeric values are attributed to them. In Sentiment Analysis, for example, the word “good” can receive the value +1, while the word “bad” will receive a −1. Now they have become computable; words are thus transformed into values, and it is exactly this process which reduces their “voluptuousness” and robs them of their polysemy; only binary values remain.
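A minimal sketch of this kind of scoring, using a tiny hypothetical lexicon (the word values are illustrative), shows both the mechanism and the loss of polysemy it entails:

```python
# Lexicon-based sentiment scoring: each word gets a fixed numeric
# value; everything outside the lexicon is ignored.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def sentiment(text):
    """Sum the lexicon values of the words in text."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(sentiment("A good film with a great cast"))  # 2
# Context is invisible to the lexicon: this scores +1,
# although the phrase expresses exasperation, not approval.
print(sentiment("Good grief"))  # 1
```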

To my knowledge, the most significant endeavour to provide a more complex measurement of linguistic data has been undertaken in the field of psychology: in their book “The Measurement of Meaning”, Charles E. Osgood, George J. Suci, and Percy H. Tannenbaum developed the “semantic differential”, a scale used for measuring the meaning of things and concepts in a multitude of dimensions. But imagine each word of a single language measured – with all its connotations and denotations, each in its different functions, in the context of the other words around it … not to speak of figurative language such as metaphor, irony, and sarcasm.
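The idea behind the semantic differential can be sketched as follows – the scale names and ratings here are illustrative, not Osgood, Suci, and Tannenbaum’s own data:

```python
# A concept is rated on several bipolar adjective scales (here -3..+3),
# giving it a position in a multidimensional meaning space instead of
# a single binary value; concepts can then be compared by distance.
import math

SCALES = ("good-bad", "strong-weak", "active-passive")

def distance(profile_a, profile_b):
    """Euclidean distance between two concept profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

dawn  = (2, -1, -2)   # fairly good, slightly weak, rather passive
storm = (-1, 3, 3)    # rather bad, very strong, very active

print(round(distance(dawn, storm), 2))  # 7.07
```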

This is why words – and cultural, voluptuous data in a broader sense – are so difficult to compute; and why they are the next big computational challenge.


“Voluptuousness”: The Fourth “V” of Big Data?

Dr Jennifer Edmond, TCD

The KPLEX project is founded upon a recognition that definitions of the word ‘data’ tend to vary according to the perspective of the person using the word.  It is therefore useful, at least, to have a standard definition of ‘big’ data.

Big Data is commonly understood as meeting 3 criteria, each conveniently able to be described with a word beginning with the letter V: Volume, Velocity and Variety.

Volume is perhaps the easiest to understand, but a lot of data really means a lot.  Facebook, for example, stores more than 250 billion images.  ’Nuf said.

Velocity is how fast the data grows.  That >250 billion images figure is estimated to be growing by a further 300-900 million per day (depending on what source you look at).  Yeah.

Variety refers to the different formats, structures etc. you have in any data set.

Now, from a humanities data point of view, these vectors are interesting.  Very few humanities data sets would be recognised as meeting criteria 1 or 2, though some (like the Shoah Foundation Video History Archive) come close.  But the comparatively low number of high volume digital cultural datasets is related to the question of velocity: the fact that so many of these information sources have existed for hundreds of years or longer in analogue formats means that recasting them as digital is a highly expensive process, and born digital data is only just proving its value for the humanist researcher.

But Variety?  Now you are talking.  If nothing else, we do have huge heterogeneity in the data, even before we consider the analogue as well as the digital forms.

Cultural data makes us consider another vector as well, however: if it must start with V, I will call it “voluptuousness.”  Cultural data can be steeped in meaning, referring to a massive further body of cultural information stored outside of the dataset itself.  This interconnectedness means that some data can be exceptionally dense, or exceptionally rich, without being large.  Think “to be or not to be;” think the Mona Lisa; think of a Bashō haiku. These are the ultimate big data, which, while tiny in terms of their footprint of 1s and 0s, sit at the centre of huge nets of referents, referents we can easily trace through the resonance of the words and images across people, cultures, and time.

Will the voluptuousness of data be the next big computational challenge?

Tim Hitchcock on ‘Big Data, Small Data and Meaning’

“Where I end up is feeling that in the rush to new tools and ‘Big Data’, Humanist scholars are forgetting what they spent much of the second half of the twentieth century discovering – that language and art, cultural construction, human experience, and representation are hugely complex – but can be made to yield remarkable insight through close analysis. In other words, while the Humanities and ‘Big Data’ absolutely need to have a conversation, the subject of that conversation needs to change, and to encompass close reading and small data.”

KPLEX Kick-Off: 1 February 2017, Dublin Ireland

The KPLEX Project held its official kick-off meeting on 1 February 2017 in Dublin, Ireland.  The project team took this opportunity for some structured discussion and knowledge sharing on our 4 key themes and set out the plans for the work programme in the months ahead:

Toward a New Conceptualisation of Data

Hidden Data and the Historical Record

Data, Knowledge Organisation and Epistemics

Culture and Representations of System Limitations

We are particularly grateful to our EU project officer, Pierre-Paul Sondag, who joined us in Dublin for this meeting.

Data wealth and data poverty

Jörg Lehmann, FUB

Beyond the ‘long tail’ metaphor, the distribution of data within the scientific field has been described in terms of ‘data wealth’ and ‘data poverty’. Steve Sawyer has sketched a political economy of data in a short essay (a slightly modified version of this paper is freely accessible here). According to him, in data-poor fields data are a prized possession; access to data drives methods; and there are many theoretical camps. By contrast, data-rich fields can be identified by three characteristics: pooling and sharing of data is expected; method choices are limited, since forms drive methods; and only a small number of theoretical camps have been validated. This opposition leads to an unequal distribution of grants received, since data wealth lends legitimacy to claims of insight as well as access to additional resources.

While Sawyer describes a polarity within the scientific field with respect to funding and cyberinfrastructures, which he sees as a means to overcome obstacles in data-poor fields, the KPLEX Project will take a look into how contents and characteristics of data relate to methodologies and epistemologies, integration structures and aggregation systems.

Response to “How can data simultaneously mean what is fabricated and what is not?”

Dr Jennifer Edmond (TCD)

As discussed in the earlier post this week, “How can data simultaneously mean what is fabricated and what is not fabricated?”, the term ‘raw data’ points toward an illusion, to something that is naturally or even divinely given, untouched by human hand or interpretive mind.  The very organicism of the metaphor belies the work it has to do to hide the fact that the progression from data to knowledge is not a clean one: knowledge also produces data, or, perhaps better said, knowledge also produces other knowledge.  We need to get away from the idea of data as somehow ‘given’ – Johanna Drucker’s preference for the term ‘capta’ over ‘data’ is hugely useful in this respect.

Raw data is, however, an indicator for where we begin.  Human beings have limited lifespans and limited capacities for experimentation and research, sensitive as we are to what John Guillory referred to as the ‘clock time’ of scholarship.  Therefore to make scientific progress, we start on the basis of the work of others.  Sometimes this is recognised overtly, as in citation of secondary sources.  Sometimes it is perhaps only indirectly visible, as in the use of certain instruments, datasets, infrastructures or libraries, that themselves (to the most informed and observant) belie the conditions in which particular knowledge was born.  In this way, ‘raw data’ merely refers to the place where our organisation started, where we removed what we found from the black box and began to create order from the relative chaos we found there.

Do big data environments make these edges apparent?  Usually not.  In the attempt to gain an advantage over ‘clock time,’ systems hide the nature, perhaps not of the chaos they began with, but of how that chaos came to be, and whose process of creating ‘capta’ gave origin to it in the first place.  Like the children’s rhyme about the ‘House that Jack Built’, knowledge viewed backwards toward its starting point can perhaps be seen recursing into data; but going further, that data itself emerges from a process or instrument that someone else created through a process that began elsewhere, with something else, at yet another, far earlier point.  At our best, humanists cling to the provenance of their ideas and knowledge.  Post-modernism has been decried by many as the founding philosophy of the age of ‘alternative facts’, as an abandonment of any possibility for one position to be right or wrong, true or untrue.  In the post-truth world, any position can offend, and any attempt at defence results in being labelled a ‘snowflake.’  But Lyotard’s work on The Post-Modern Condition wasn’t about this: it was about recognising the bias that creeps into our assumptions once we start to generalise at the scale of a grand narrative, a truth.  In this sense, we perhaps need to move our thinking from ‘raw data’ to ‘minimally biased data,’ and from the grand narratives of ‘big data’ to the nuances of ‘big enough data.’  If we can begin speaking with clarity about that place where we begin, we will have a much better chance of not losing track of our knowledge along the way.

How can data simultaneously mean what is fabricated and what is not fabricated?

Jörg Lehmann, FUB

Common wisdom holds that data are recorded facts and figures; as an example, see Kroenke and Auer, Database Processing, slide 8. This is not astonishing if one considers the meaning of the term “raw data”. Most often it refers to unprocessed instrument data at full resolution; the output of a machine which has differentiated between signal and noise – as if the machine had not been conceived by human beings and designed according to needs defined by them. But the entanglement of data and facts rests on a mechanical understanding of objectivity, where instruments record signals which can easily be identified by humans as something that has been counted, measured, or sensed. “Raw data” is seen as the output of that operation, as facts gained by experiment, experience, or collection. Data products derived from these machine outputs seem to inherit their ‘factual’ status, even if processed to a high degree in order to enable modelling and analysis. Turning against this distinction between “raw” and “cooked” data and underlining data as a result of human intervention, Geoffrey Bowker coined the phrase “‘Raw data’ is an oxymoron”, which provided the title of a book edited by Lisa Gitelman (find the introduction here). But common sense is not this precise with its differentiations; it keeps the aura of data as something that exists prior to any processing.

Data and facts are not the same. In a contribution to the anthology edited by Gitelman, Daniel Rosenberg explains the etymology of both terms as well as their differences. Facts are ontological; they depend on the logical operation true/false. When they are proven false, they cease to be termed “facts”. Data can have something in common with facts, namely their “out-there-ness”, a reference apparently beyond the arbitrary relation between signifier and signified. “False data is data nonetheless”, as Rosenberg puts it, and he points out that “the term ‘data’ serves a different rhetorical and conceptual function than do sister terms as ‘facts’ and ‘evidence’.” But what exactly is the rhetorical function of the term “data”? Rosenberg’s answer is that “data” designates something which is given prior to argument. Again, this brings the term “data” close to the term “fact”: in argumentation, both terms assume the task of a proof, of something that substantiates. In which settings is this rhetoric particularly relevant?

As Steve Woolgar and Bruno Latour pointed out a generation ago, facts are social constructions, purified of the remainders of the process in which they were created: “the process of construction involves the use of certain devices whereby all traces of production are made extremely difficult to detect”, they wrote in “Laboratory Life”. There is a process at work here which can be compared to ‘datafication’: in a laboratory, the term “fact” can simultaneously assume two meanings. At the frontier of science, the scientists themselves know about the constructed nature of statements and are aware of their relation to subjectivity and artificiality; at the same time, these statements are referred to as things “out there”, i.e. as objectivity and facts. And it is in the very interest of science, aiming at “truth effects”, to make the artificiality of factual statements forgotten and, as a consequence, to have facts taken for granted. The analogy between the nature of these statements and “raw” and “cooked” data – “facts” and “data” – is obvious. With respect to the latter, the process of dividing and classifying phenomena into numbers obscures ambiguity, conflict, and contradiction; the left-over of this process, “data”, is completely cleansed of irritating factors, and no trace remains of the purifying process itself. Therefore “data” deny their fabricatedness; and in struggles for the legitimacy of insight or in applications for resources, they serve their purpose in the very same way as “facts” do.