Using humanities knowledge to explore bias in big data approaches to knowledge creation
Author: Jennifer Edmond
Dr Jennifer Edmond is the Director of Strategic Projects in the Faculty of Arts, Humanities and Social Sciences at Trinity College Dublin. Trained as a scholar of German literature, Jennifer is professionally engaged in the investigation of knowledge exchange and collaboration in humanities research, and in particular in the impact of technology on these processes.
In Europe, copyright laws continue to vary from country to country, in spite of many years’ pressure towards harmonisation.
One of the biggest questions this raises is whether having the right to read a work gives you the right to data mine it. In terms of literature, you might think the right to mine could be the less restrictive one, if anything, as the use and market for mining are so different from a linear reading mode. I don’t know of anyone who takes a nice big set of unstructured data (rather than a novel) with them on a beach holiday for fun, though I am sure such people exist (albeit with sand in their laptops). But no one is sure whether the rights to these modes of use will be harmonised, largely because the original rules and customs date to the era of print, and the tension between the unknown benefit of relaxing them and an unknowable future profit model that could emerge from data mining remains unresolved.
This is a problem for literary scholars, but also for historians and others working with cultural data. Which leads me to the observation that the very beauty and utility of humanities data creates a two-tier system of science, in which legal hurdles hinder research in some disciplines (those that co-own their data) or for some methodologies, but not others. There are surely some medical or other datasets that are viewed as having a market value equivalent to a best-seller, but even there, mining would be the only reading paradigm.
So how would data-driven research in other disciplines be different if the raw materials of their research were sold in airports, hung on the walls of galleries, or revered as the founding documents of a nation?
The ‘long tail’ is a metaphor for a statistical distribution, where about 20 percent of the distribution is in the head of the curve, and the remaining 80 percent is distributed along the tail. This well-known distribution – a power law – was popularized by Chris Anderson, former editor-in-chief of Wired magazine. In an article published in 2004 [http://archive.wired.com/wired/archive/12.10/tail.html], Anderson described the market for goods in physical stores versus online stores by means of this distribution: in the head of the curve are the hits – blockbusters, bestsellers, top-ten chart tracks – all sold by retailers with physical stores. But in the long tail are the much-loved film classics, the back catalogue, older music albums. Combined, the non-hits make for a market bigger than the hits, which opens the door for online stores. As Anderson explains, successful businesses in the digital economy are about aggregating the long tail, focusing on a million niches rather than on the mass market.
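Anderson’s head-versus-tail arithmetic can be sketched in a few lines of code. The numbers below are invented for illustration, not actual market data; the Zipf-style exponent is deliberately chosen shallow enough that the aggregated tail outweighs the head, which is the shape Anderson’s argument assumes.

```python
# Toy illustration of a 'long tail': item popularity following a power law.
# All figures are hypothetical, chosen only to illustrate the shape.

def popularity(rank, exponent=0.5):
    """Zipf-like popularity: the item at a given rank sells 1/rank^exponent units."""
    return 1.0 / rank ** exponent

n_items = 10_000
sales = [popularity(r) for r in range(1, n_items + 1)]
total = sum(sales)

head = sum(sales[: n_items // 5])  # top 20% of items (the 'head')
tail = total - head                # remaining 80% of items (the 'long tail')

print(f"head share of sales: {head / total:.0%}")
print(f"tail share of sales: {tail / total:.0%}")
```

With this flat exponent the combined niches in the tail account for a larger share of total sales than the hits in the head; with a steeper exponent the head would dominate instead, so the metaphor depends on how the curve actually falls off.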
But the metaphor of the ‘long tail’ can also be applied to scholarly research: only a fraction of research teams, say about 20%, work on large volumes of data; by comparison, a much bigger number of researchers work with ‘little data’. These small datasets are harder to handle and to exchange, as Christine Borgman et al. have recently described in their research paper “Data Management in the Long Tail: Science, Software and Service” (1) [http://www.ijdc.net/index.php/ijdc/article/viewFile/11.1.128/432].
The attention of the public is directed towards ‘big data’; in contrast, the smaller datasets used in the humanities, as well as cultural data, are ‘hidden’: not standardized, and lacking metadata. What is more, they are unstructured, polyvalent, messy – and this is what the KPLEX project focuses on: long-tail data.
(1) Christine L. Borgman, Milena S. Golshan, Ashley E. Sands, Jillian C. Wallis, Rebekah L. Cummings, Peter T. Darch: Data Management in the Long Tail: Science, Software and Service. International Journal of Digital Curation 11(1) (2016), 128–149. DOI: 10.2218/ijdc.v11i1.428
If not, don’t worry – you are in good company. The majority of people working with data have not been trained in this respect: surprisingly, a statistician or data scientist can only rarely be found amongst those who create and collect data. We know that datasets consist of variables and observations; but it is surprisingly difficult to define variables and observations precisely in general. Tidy datasets provide a standardized way to link the structure of a dataset with its meaning; they form the basis for effective computation, modeling and visualization. In his contribution “Tidy Data”, Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics, explains what accounts for the quality of datasets, and how they can effectively be cleaned and prepared for analysis. The article, from the Journal of Statistical Software, is freely accessible.
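Wickham’s core idea – each variable forms a column, each observation forms a row – can be sketched in plain Python. The table below is a minimal example in the spirit of his paper: in the ‘messy’ layout, treatment names are column headers, i.e. values are stored in the structure rather than in a column of their own.

```python
# 'Messy' wide layout: the treatment (a or b) is encoded in the column
# header instead of being a variable with its own column.
messy = [
    {"person": "John Smith",   "treatment_a": None, "treatment_b": 2},
    {"person": "Jane Doe",     "treatment_a": 16,   "treatment_b": 11},
    {"person": "Mary Johnson", "treatment_a": 3,    "treatment_b": 1},
]

def tidy(rows, id_column):
    """'Melt' a wide table: one output row per (person, treatment, result) triple."""
    out = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                out.append({
                    "person": row[id_column],
                    "treatment": column.removeprefix("treatment_"),
                    "result": value,
                })
    return out

for record in tidy(messy, "person"):
    print(record)
```

After melting, every row is one observation and every column is one variable, which is exactly the structure that makes downstream computation and visualization straightforward.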
Data are contained in millions of spreadsheets, and everybody working with Excel files has already experienced some of their peculiarities. This is not only a problem for the exchange of data, where the formulas used to calculate the values in fields are sometimes exported instead of the values themselves. Snafus and bugs are also an obstacle for sciences like economics or genomics, because complex values are at times misinterpreted: “SEPT2”, which corresponds to the gene Septin 2, is read by Excel as the date “September 2nd”. A recent study in the journal Genome Biology investigated papers published between 2005 and 2015 and identified an astonishing proportion of spreadsheet-related errors in them. This poses a real challenge for subsequent scientists who try to build on previously published research results.
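A check in the spirit of the one behind the Genome Biology study can be sketched in a few lines: scan a column of cells for values that look like Excel’s date rendering of gene symbols such as SEPT2 (which becomes “2-Sep”) or MARCH1 (“1-Mar”). This is a simplified, hypothetical scanner, not the study’s actual code, and the pattern covers only a few of the affected gene families.

```python
# A hedged sketch: flag spreadsheet cells that were probably gene symbols
# silently converted into dates by Excel (e.g. SEPT2 -> '2-Sep').
import re

# Day-month strings matching gene families known to collide with date parsing
# (Septins, MARCH genes, DEC genes); other families would need more patterns.
DATE_LIKE = re.compile(r"^\d{1,2}-(Sep|Mar|Dec)$", re.IGNORECASE)

def flag_converted_genes(cells):
    """Return the cells that look like date-mangled gene symbols."""
    return [cell for cell in cells if DATE_LIKE.match(cell.strip())]

column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "ACTB"]
print(flag_converted_genes(column))
```

The practical fix when importing such data is to force gene-name columns to be read as text rather than letting the spreadsheet guess the type.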
Results of statistical operations are far from intuitive. One needs a lot of experience to judge computed results reliably, and most people would defer to the advice of experts in the field. This makes a story presented by the BBC all the more remarkable: it is about a student who took on the task of checking the calculations of two Harvard professors. Not only was he unable to replicate the results that Carmen Reinhart and Ken Rogoff had published; he also spotted a basic error in their spreadsheet, and was thus able to correct their research paper, “Growth in a Time of Debt”. That may sound like a modern myth, but sometimes politicians base their decisions on papers like these.
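The kind of error at issue can be illustrated with a toy calculation. The numbers below are invented, not Reinhart and Rogoff’s data; the point is only how an average computed over a cell range that silently stops short of the last rows can diverge from the average over the full data.

```python
# Toy illustration of a spreadsheet range error: averaging a column
# whose selected range omits the final rows. Values are hypothetical.
growth_rates = [2.1, -0.3, 1.8, 2.5, 0.9, 3.2, 2.7]

def mean(values):
    return sum(values) / len(values)

full_average = mean(growth_rates)       # what should have been computed
truncated_average = mean(growth_rates[:5])  # range stops two rows short

print(f"full average:      {full_average:.2f}")
print(f"truncated average: {truncated_average:.2f}")
```

Because the two averages differ, a conclusion drawn from the truncated range can point in a different direction from the full data – and unlike a crashing program, a mis-sized spreadsheet range fails silently.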
The KPLEX team was pleased to present the project at the Big Data PPP Info Day in Luxembourg, 17-18 January 2017. Although categorised by our panel chair as ‘different,’ the project concept received a good response, along with a number of thoughtful questions and leads for future work. We look forward to contributing more in the future to the widening of perspectives and the development of RRI in big data research.
The KPLEX partners started off 2017 with a toast to the official launch of our project, which commenced on 1 January 2017. We look forward to spending the next 15 months applying humanities knowledge and contexts to the investigation of what we reveal (and what we hide) when we speak about ‘big data.’