The KPLEX project is founded upon a recognition that definitions of the word ‘data’ tend to vary according to the perspective of the person using the word. It is therefore useful, at least, to have a standard definition of ‘big’ data.
Big Data is commonly understood as meeting three criteria, each conveniently describable with a word beginning with the letter V: Volume, Velocity and Variety.
Volume is perhaps the easiest to understand, though a lot of data really does mean a lot: Facebook, for example, stores more than 250 billion images. ‘Nuf said.
Velocity is how fast the data grows. That >250 billion images figure is estimated to be growing by a further 300–900 million images per day (depending on which source you consult). Yeah.
Variety refers to the different formats, structures and so on found within any given data set.
Now, from a humanities data point of view, these vectors are interesting. Very few humanities data sets would be recognised as meeting the volume or velocity criteria, though some (like the Shoah Foundation Video History Archive) come close. The comparatively small number of high-volume digital cultural datasets is itself related to the question of velocity: because so many of these information sources have existed for hundreds of years or longer in analogue formats, recasting them as digital is a highly expensive process, and born-digital data is only just proving its value for the humanist researcher.
But Variety? Now you’re talking. If nothing else, we have huge heterogeneity in the data, even before we consider the analogue as well as the digital forms.
Cultural data makes us consider another vector as well, however: if it must start with V, I will call it “voluptuousness.” Cultural data can be steeped in meaning, referring to a massive further body of cultural information stored outside the dataset itself. This interconnectedness means that some data can be exceptionally dense, or exceptionally rich, without being large. Think “to be or not to be”; think of the Mona Lisa; think of a Bashō haiku. These are the ultimate big data: tiny in terms of their footprint of 1s and 0s, yet sitting at the centre of huge nets of referents, referents we can easily trace through the resonance of the words and images across people, cultures, and time.
Will the voluptuousness of data be the next big computational challenge?