There can be no doubt that the availability of big data and the emergence of big data analytics challenge established epistemologies across the academy. Debates about the datafication of humanities’ research materials – including debates about whether digitisation projects should be countenanced at all – are disputes about the shape of our culture which challenge our understanding of human knowledge and of knowledge production systems.
In a recent article on the “big data” turn in literary studies, Jean-Gabriel Ganascia addresses the epistemological status of Digital Humanities, asking if Digital Humanities are “Sciences of Nature” or “Sciences of Culture?” Regarding similar concerns, Matthew L. Jockers, a proponent of techniques of quantitative text analysis of the literary record, writes, “(e)nough is enough. The label does not matter. The work is what matters, the outcomes and what they tell us about… are what matters.” But labels do matter. They matter a great deal, and Ganascia’s question is an important one, the answer to which reveals much about the underlying epistemology of Digital Humanities research.
As the basis of all information and knowledge, data is often defined as fact that exists irrespective of our state of knowing. Rob Kitchin has addressed the explosion of big data and an emergent fourth paradigm in science, which results from this definition of data as the zero sum of all information and knowledge. This fourth paradigm, labelled New Empiricism or Empiricism Reborn, claims that Big Data and new data analytics usher in a new era of knowledge production free from theory, human bias or framing and accessible to anyone with sufficient visual literacy. This is the position advocated by Chris Anderson, former editor-in-chief at Wired magazine.
However, as Kitchin observes, data does not arise in a vacuum: “all data provide oligoptic views of the world: views from certain vantage points, using particular tools, rather than an all-seeing, infallible God’s eye view”. Whilst the end of theory may appear attractive to some, the empiricist model of knowledge creation is based on fundamental misconceptions about the formation and analysis of computational data.
There is an alternative to these new forms of empiricism. Labelled “data-driven science”, it is described by Kitchin as follows: “more open to using a hybrid combination of abductive, inductive and deductive approaches to advance the understanding of a phenomenon… it forms a new mode of hypothesis generation before a deductive approach is employed. Nor does the process of induction arise from nowhere, but is situated and contextualized within a highly evolved theoretical domain.”
So, when Jockers writes that those who study texts must follow the example of Science and scale their methods according to the realities of big data, which method (or label) is he referring to? One might infer from his reference to Anderson’s 2008 article, in the opening pages of his influential Macroanalysis: Digital Methods & Literary History, that it is the former, empiricist model. However, in the article referred to in the opening lines of this post, Jockers refers to his approach as “data-driven”. Are we to assume that this is a reference to data-driven scientific method? Rightly or wrongly, knowledge production is not impartial, and we must always be mindful of the epistemologies with which we align ourselves, as they will often reveal as much about our research as the information or data contained therein.
If Digital Humanists are said to be “nice” because we eschew theoretical debates in favour of the more easily resolved methodological ones, then I’d rather not be thought of as a “nice” Digital Humanist. I’d rather be responsible, argumentative or even wrong. But nice? Biscuits are nice.