As in any field where competing conceptions coexist, there are differing understandings of what ‘tidy’ or ‘clean’ data, on the one hand, and ‘messy’ data, on the other, might be.
On a very basic level, and starting from the notion of data arranged in tables, Hadley Wickham has defined ‘tidy data’ as data in which each variable is a column, each observation is a row, and each type of observational unit is a table. Data that do not follow these rules are considered messy. Moreover, data cleaning is a routine procedure that statisticians and private companies apply to the data they gather: assessing the number and importance of missing values, correcting obvious inconsistencies and errors, normalization, deduplication, etc. Common examples are non-existent or incorrect postal codes, an age that does not match the recorded birth date, and the conversion of names from “Mr. Smith” to “John Smith”.
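To make these rules concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not Wickham's own tooling: the sample records, the field names and the helper functions `to_tidy`, `age_on` and `age_is_consistent` are all hypothetical. The first helper reshapes a table whose column headers hide a variable (the year) into one row per observation; the other two check a recorded age against a birth date.

```python
from datetime import date

# Hypothetical "messy" table: one row per person, one column per year.
# The variable "year" is hidden in the column headers.
wide_rows = [
    {"name": "John Smith", "2015": 3, "2016": 5},
    {"name": "Jane Doe", "2015": 1, "2016": 2},
]


def to_tidy(rows, id_field, variable_name, value_name):
    """Reshape rows so that each output row is one observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_field:
                continue  # the identifier is repeated on every output row
            tidy.append(
                {id_field: row[id_field], variable_name: key, value_name: value}
            )
    return tidy


def age_on(birth_date, reference_date):
    """Return the age in completed years on reference_date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def age_is_consistent(record, reference_date):
    """Check whether the recorded age matches the recorded birth date."""
    return record["age"] == age_on(record["birth_date"], reference_date)


# Each of the two people contributes one row per year: four observations.
tidy_rows = to_tidy(wide_rows, "name", "year", "letters")
```

A check such as `age_is_consistent` only flags a contradiction; deciding whether the age or the birth date is the wrong value remains a judgment call for whoever cleans the data.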
In the sciences, the understanding of ‘messy’ data largely depends on what is understood as ‘signal’ and what is seen as ‘noise’. Instruments collect a lot of data, and only the initiated are able to distinguish between the two. Cleaning data, and thus extracting the relevant information in order to arrive at scientific facts, is also a standard procedure here. In settings where the data structure is human-made rather than technically determined, ‘messy’ data remind scientists to reassess whether the relationship between the (research) question and the object of research is appropriate; whether the classifications used to form variables are properly conceived or whether they inappropriately limit the phenomenon in question; and what kind of data can be found in that garbage category of the ‘other’.
In the humanities, matters are not as straightforward as in the sciences. Two recent publications provide examples. Julia Silge and David Robinson’s book “Text Mining with R” (2017) bears the subtitle “A Tidy Approach”. Very much like Wickham, they define ‘tidy text’ as “a table with one-token-per-row”. In their “Introduction to Text Analysis” (2016), Brandon Walsh and Sarah Horowitz present a more differentiated understanding of what ‘cleaning’ ‘messy’ data might look like. They introduce their readers to the usual problems of cleaning dirty OCR; the standardization and disambiguation of names (their well-chosen example is Arthur Conan Doyle, who was born as Arthur Doyle, but used one of his given names as an addition to his last name); and the challenges posed by metadata standards. All of that seems easy at first glance, but think of Gothic typefaces (there are more than 3,000 of them), pseudonyms or heteronyms, or camouflaged printings published during the Inquisition or under political repression. Now you can imagine how hard it can be to keep your data ‘clean’.
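The ‘one-token-per-row’ idea mentioned above can be sketched in a few lines of Python. This is not Silge and Robinson’s implementation, which is written in R; the function name `tidy_tokens`, the column names and the toy sentence are assumptions made for illustration, and the tokenizer is deliberately naive (lowercasing and splitting on non-word characters), which is exactly where dirty OCR, historical typefaces and ambiguous names would start to cause trouble.

```python
import re


def tidy_tokens(documents):
    """Return one row (a dict) per token, keeping document id and position."""
    rows = []
    for doc_id, text in documents.items():
        # Naive tokenization: lowercase, then take runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        for position, token in enumerate(tokens, start=1):
            rows.append({"doc": doc_id, "position": position, "token": token})
    return rows


# Toy input: a single hypothetical document with one short sentence.
sample = {"doc-1": "The game is afoot."}
token_rows = tidy_tokens(sample)
```

Each resulting row holds exactly one token together with the document it came from, so counting, filtering or joining against a name authority list becomes an ordinary table operation.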
Last but not least, there is another conception of ‘messiness’ in the humanities. It lies in the specific cultural richness, or polysemy, or voluptuousness of the data (or the sources, the research material) in question: a text, an image, a video, a play or any other object of research can be interpreted from a range of viewpoints. Humanists are well aware that the choice of a theory or a methodological approach – the ‘grid’ which imposes order on the chaos of what is being examined – never provides an exhaustive interpretation. It is the ‘messiness’ of the data under consideration which provides the foundation for alternative approaches and research results, which is responsible for the resistance to interpretation (and, with Paul de Man, to theory) – and which continuously demands an openness towards seeing things in another way.