A recipe for intimacy?

Mark Zuckerberg posted the following statement on his Facebook feed:
“Today we’re publishing research on how AI can deliver better language translations. With a new neural network, our AI research team was able to translate more accurately between languages, while also being nine times faster than current methods.

Getting better at translation is important to connect the world. We already perform over 2 billion translations in more than 45 languages on Facebook every day, but there’s still a lot more to do. You should be able to read posts or watch videos in any language, but so far the technology hasn’t been good enough.

Throughout human history, language has been a barrier to communication. It’s amazing we get to live in a time when technology can change that. Understanding someone’s language brings you closer to them, and I’m looking forward to making universal translation a reality. To help us get there faster, we’re sharing our work publicly so that all researchers can use it to build better translation tools.”

Key messages: taking time to understand people is for fools, and language is the problem.
When did language become a barrier to communication? Would we not be hard pressed to communicate much at all without it? Doesn’t machine translation have the potential to create as much distance as ‘understanding’? Building intimacy (for this is what I take the phrase “brings you closer” to mean) is not about having a rough idea of what someone is saying; it is about understanding the nuance of every gesture, every reference and resonance. Isn’t the joy of encountering a new culture tied up in the journey of discovery we make on the road to understanding?
I salute Facebook for making their research and software open, but a bit of humility in the face of the awesome and varied systems of signs and significations we humans have built could make this so much better news.

“Voluptuousness:” The Fourth “V” of Big Data?

Dr Jennifer Edmond, TCD

The KPLEX project is founded upon a recognition that definitions of the word ‘data’ tend to vary according to the perspective of the person using the word.  It is therefore useful, at least, to have a standard definition of ‘big’ data.

Big Data is commonly understood as meeting 3 criteria, each conveniently able to be described with a word beginning with the letter V: Volume, Velocity and Variety.

Volume is perhaps the easiest to understand, but a lot of data really means a lot. Facebook, for example, stores more than 250 billion images. ’Nuf said.

Velocity is how fast the data grows. That figure of more than 250 billion images is estimated to be growing by a further 300–900 million per day (depending on which source you consult). Yeah.

Variety refers to the different formats, structures etc. you have in any data set.

Now, from a humanities data point of view, these vectors are interesting. Very few humanities data sets would be recognised as meeting the first two criteria, though some (like the Shoah Foundation Visual History Archive) come close. But the comparatively low number of high-volume digital cultural datasets is also related to the question of velocity: because so many of these information sources have existed for hundreds of years or longer in analogue formats, recasting them as digital is a highly expensive process, and born-digital data is only just proving its value for the humanist researcher.

But Variety?  Now you are talking.  If nothing else, we do have huge heterogeneity in the data, even before we consider the analogue as well as the digital forms.

Cultural data makes us consider another vector as well, however: if it must start with V, I will call it “voluptuousness.”  Cultural data can be steeped in meaning, referring to a massive further body of cultural information stored outside of the dataset itself.  This interconnectedness means that some data can be exceptionally dense, or exceptionally rich, without being large.  Think “to be or not to be;” think the Mona Lisa; think of a Bashō haiku. These are the ultimate big data, which, while tiny in terms of their footprint of 1s and 0s, sit at the centre of huge nets of referents, referents we can easily trace through the resonance of the words and images across people, cultures, and time.

Will the voluptuousness of data be the next big computational challenge?

Tim Hitchcock on ‘Big Data, Small Data and Meaning’

“Where I end up is in feeling that in the rush to new tools and ‘Big Data’, Humanist scholars are forgetting what they spent much of the second half of the twentieth century discovering – that language and art, cultural construction, human experience, and representation are hugely complex – but can be made to yield remarkable insight through close analysis. In other words, while the Humanities and ‘Big Data’ absolutely need to have a conversation, the subject of that conversation needs to change, and to encompass close reading and small data.”


KPLEX Kick-Off: 1 February 2017, Dublin Ireland

The KPLEX Project held its official kick-off meeting on 1 February 2017 in Dublin, Ireland. The project team took the opportunity for structured discussion and knowledge sharing on our four key themes and set out plans for the work programme in the months ahead:

Toward a New Conceptualisation of Data

Hidden Data and the Historical Record

Data, Knowledge Organisation and Epistemics

Culture and Representations of System Limitations

We are particularly grateful to our EU project officer, Pierre-Paul Sondag, who joined us in Dublin for this meeting.

Data wealth and data poverty

Jörg Lehmann, FUB

Beyond the ‘long tail’ metaphor, the distribution of data within the scientific field has been described in terms of ‘data wealth’ and ‘data poverty’. Steve Sawyer has sketched a political economy of data in a short essay (a slightly modified version of this paper is freely accessible here). According to him, in data-poor fields data are a prized possession; access to data drives methods; and there are many theoretical camps. By contrast, data-rich fields can be identified by three characteristics: pooling and sharing of data is expected; method choices are limited, since forms drive methods; and only a small number of theoretical camps have been validated. This opposition leads to an unequal distribution of grants received, since data wealth lends legitimacy to claims of insight as well as access to additional resources.

While Sawyer describes a polarity within the scientific field with respect to funding and cyberinfrastructures, which he sees as a means to overcome obstacles in data-poor fields, the KPLEX Project will take a look into how contents and characteristics of data relate to methodologies and epistemologies, integration structures and aggregation systems.

Response to “How can data simultaneously mean what is fabricated and what is not?”

Dr Jennifer Edmond (TCD)

As discussed in the earlier post this week, “How can data simultaneously mean what is fabricated and what is not fabricated?”, the term ‘raw data’ points toward an illusion, to something that is naturally or even divinely given, untouched by human hand or interpretive mind. The very organicism of the metaphor belies the work it has to do to hide the fact that the progression from data to knowledge is not a clean one: knowledge also produces data, or, perhaps better said, knowledge also produces other knowledge. We need to get away from the idea of data as somehow ‘given’ – Johanna Drucker’s preference for the term ‘capta’ over ‘data’ is hugely useful in this respect.

Raw data is, however, an indicator for where we begin. Human beings have limited lifespans and limited capacities for experimentation and research, sensitive as we are to what John Guillory referred to as the ‘clock time’ of scholarship. Therefore, to make scientific progress, we start on the basis of the work of others. Sometimes this is recognised overtly, as in the citation of secondary sources. Sometimes it is perhaps only indirectly visible, as in the use of certain instruments, datasets, infrastructures or libraries that themselves reveal (to the most informed and observant) the conditions in which particular knowledge was born. In this way, ‘raw data’ merely refers to the place where our organisation started, where we removed what we found from the black box and began to create order from the relative chaos we found there.

Do big data environments make these edges apparent? Usually not. In the attempt to gain an advantage over ‘clock time,’ systems hide the nature, perhaps not of the chaos they began with, but of how that chaos came to be, whose process of creating ‘capta’ gave origin to it in the first place. Like the children’s rhyme about the ‘House that Jack Built’, knowledge viewed backwards toward its starting point can perhaps be seen recursing into data, but going further, that data itself emerges from a process or instrument that someone else created through a process that began elsewhere, with something else, at yet another, far earlier point. At our best, we humanists cling to the provenance of our ideas and knowledge. Post-modernism has been decried by many as the founding philosophy of the age of ‘alternative facts’, as an abandonment of any possibility for one position to be right or wrong, true or untrue. In the post-truth world, any position can offend, and any attempt at defence results in being labelled a ‘snowflake.’ But Lyotard’s work on The Post-Modern Condition wasn’t about this: it was about recognising the bias that creeps into our assumptions once we start to generalise at the scale of a grand narrative, a truth. In this sense, we perhaps need to move our thinking from ‘raw data’ to ‘minimally biased data,’ and from the grand narratives of ‘big data’ to the nuances of ‘big enough data.’ If we can begin speaking with clarity about that place where we begin, we will have a much better chance of not losing track of our knowledge along the way.

How can data simultaneously mean what is fabricated and what is not fabricated?

Jörg Lehmann, FUB

Common wisdom holds that data are recorded facts and figures; as an example, see Kroenke and Auer, Database Processing, slide 8. This is not astonishing if one considers the meaning of the term “raw data”. Most often it refers to unprocessed instrument data at full resolution; the output of a machine which has differentiated between signal and noise – as if the machine had not been conceived by human beings and designed according to needs defined by them. But the entanglement of data and facts rests on a mechanical understanding of objectivity, where instruments record signals which can easily be identified by humans as something that has been counted, measured, sensed. “Raw data” is seen as the output of that operation, as facts gained by experiment, experience or collection. Data products derived from these machine outputs seem to inherit their ‘factual’ status, even if processed to a high degree in order to enable modelling and analysis. Turning against this distinction between “raw” and “cooked” data and underlining data as a result of human intervention, Geoffrey Bowker coined the phrase “‘Raw data’ is an oxymoron”, which provided the title of a book edited by Lisa Gitelman (find the introduction here). But common sense is not so precise in its differentiations; it preserves the aura of data as something given, even though data are always already processed.

Data and facts are not the same. In a contribution to the anthology edited by Gitelman, Daniel Rosenberg explains the etymology of both terms as well as their differences. Facts are ontological; they depend on the logical operation true/false. When they are proven false, they cease to be termed “facts”. Data can have something in common with facts, namely their “out there-ness”, a reference apparently beyond the arbitrary relation between signifier and signified. “False data is data nonetheless”, as Rosenberg puts it, and he points out that “the term ‘data’ serves a different rhetorical and conceptual function than do sister terms such as ‘facts’ and ‘evidence’.” But what exactly is the rhetorical function of the term “data”? Rosenberg’s answer is that “data” designates something which is given prior to argument. Again, this brings the term “data” close to the term “fact”: in argumentation, both terms assume the task of a proof, of something that substantiates. In which settings is this rhetoric particularly relevant?

As Steve Woolgar and Bruno Latour pointed out a generation ago, facts are social constructions, purified of the remainders of the process in which they were created: “the process of construction involves the use of certain devices whereby all traces of production are made extremely difficult to detect”, they wrote in “Laboratory Life”. There is a process at work here which can be compared to ‘datafication’: in a laboratory, the term “fact” can simultaneously assume two meanings. At the frontier of science, the scientists themselves know about the constructed nature of statements and are aware of their relation to subjectivity and artificiality; at the same time, these statements are referred to as things “out there”, i.e. as objectivity and facts. And it is in the very interest of science, which aims at “truth effects”, to make the artificiality of factual statements forgotten and, as a consequence, to have facts taken for granted. The analogy between the dual status of these statements and that of “raw” and “cooked” data, of “facts” and “data”, is obvious. With respect to the latter, the process of dividing and classifying phenomena into numbers obscures ambiguity, conflict, and contradiction; the left-overs of this process, the “data”, are completely cleansed of irritating factors, and no trace remains of the purifying process itself. “Data” therefore deny their fabricated-ness; and in struggles for the legitimacy of insight or in applications for resources, they serve their purpose in the very same way as “facts” do.