Data’s Long Tail

Jörg Lehmann, FUB

The ‘long tail’ is a metaphor for a statistical distribution in which about 20 percent of the distribution lies in the head of the curve and the remaining 80 percent is spread along the tail. This well-known distribution, a power law, was popularized by Chris Anderson, former editor-in-chief of Wired magazine. In an article published in 2004 [http://archive.wired.com/wired/archive/12.10/tail.html], Anderson used this distribution to describe the market for goods in physical stores versus online stores. In the head of the curve are the hits: blockbusters, bestsellers, and top-ten chart tracks, all sold by retailers with physical stores. In the long tail are the much-loved film classics, the back catalog, and older music albums. Combined, the non-hits make up a market bigger than the hits, which opens the door for online stores. As Anderson explains, successful businesses in the digital economy aggregate the long tail, focusing on a million niches rather than on the mass market.

The Long Tail distribution. File Source: By User:Husky (Own work) [Public domain], via Wikimedia Commons
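
To make the head-versus-tail comparison concrete, here is a minimal sketch in Python. It assumes that demand for the item at popularity rank k is proportional to k raised to a negative exponent, which is one common way to write a power law. The function name, the exponent of 1.0, and the catalog sizes are hypothetical choices for illustration and are not taken from Anderson's article. Whether the tail outweighs the head depends entirely on these assumptions.

```python
def head_and_tail_share(n_items: int, n_head: int, exponent: float = 1.0) -> tuple[float, float]:
    """Return (head_share, tail_share) of total demand under a power law.

    Assumption: demand for the item at popularity rank k is
    proportional to k ** -exponent.
    """
    if not 0 < n_head < n_items:
        raise ValueError("n_head must be between 1 and n_items - 1")
    # One weight per item, ordered from most to least popular.
    weights = [rank ** -exponent for rank in range(1, n_items + 1)]
    total = sum(weights)
    head = sum(weights[:n_head])
    return head / total, (total - head) / total


# Hypothetical catalog: 100,000 titles available online,
# of which a physical store stocks only the 100 most popular.
head, tail = head_and_tail_share(n_items=100_000, n_head=100)
print(f"head: {head:.0%}, tail: {tail:.0%}")
```

Under these assumed numbers the many rarely requested titles together account for more demand than the few hits. A steeper exponent or a larger head would shift the balance back towards the hits.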

But the metaphor of the ‘long tail’ can also be applied to scholarly research: only a fraction of research teams, say about 20 percent, work on large volumes of data, while a much larger number of researchers work with ‘little data’. These small datasets are harder to handle and to exchange, as Christine Borgman et al. recently described in their research paper “Data Management in the Long Tail: Science, Software and Service” (1) [http://www.ijdc.net/index.php/ijdc/article/viewFile/11.1.128/432].

Public attention is directed towards ‘big data’; in contrast, the smaller datasets used by the humanities, as well as cultural data, are ‘hidden’, not standardized, and lacking metadata. What is more, they are unstructured, polyvalent, and messy. This long tail data is what the KPLEX project focuses on.

 

  1. Christine L. Borgman, Milena S. Golshan, Ashley E. Sands, Jillian C. Wallis, Rebekah L. Cummings, Peter T. Darch: Data Management in the Long Tail: Science, Software and Service. In: International Journal of Digital Curation 11(1) (2016), 128–149. DOI: 10.2218/ijdc.v11i1.428

 

 

 
