In elementary school, the little data scientist learns that NAs are nasty. Numerical data are nice, clean, and complete when collected by a sensor. NAs, on the other hand, are these kind of data that result from manually entered data, not properly filled-in surveys; or they are missing values which could not be gathered out of unknown reasons. If they would be NULLs, the case would be much clearer – with a NULL, you can calculate. But NA, that’s really terra incognita, and because of this the data have to checked for bias and skewness. NA thus becomes for the little data scientist an expression of desperation: not applicable, not available, or no answer.
For humanists, the case is easier. Missing values are part of their data life. Only rarely machines support the process of data collection, and experience shows that people, when asked, rarely respond in a complete way; leaving out part of the answer is an all-too-human strategy. Working with this kind of data, researches have become creative; one can even ask whether there is a pattern behind NAs, or whether a systematic explanation for them can be found. In a recently published reader on methodologies in emotion research, there is a contribution on how to deal with missing values. The researcher, Dunya van Troost, collected data from persons committed to political protest all over Europe; the survey contained, amongst others, four items on emotions. By counting the number of missing values across the four emotion items three groups could be identified: respondents with complete records (85 percent), respondents who answered selectively (13 percent), and (3) respondents who answered none of the four emotion items (2 percent). For inferential statistics, it is important to understand the reasons why certain data are missing. Van Troost somehow turned the tables: she conducted a regression analysis to see how the other data available, here both country and demographic characteristics, influenced the number of missing values provided by her 15,272 respondents. She found out that the unanswered items were not missing completely at random. The regression further showed that the generation of protesters born before 1945 had more frequently missing values on the emotions items than younger generations. The same applies for the level of education – the higher the level of education, the more complete the surveys had been filled. Female respondents have more missing values than male ones. Finally, respondents from the southern European countries Spain and Italy had a relatively high rate of missing values.
It is obvious that not all people are going to answer items on emotions with the same willingness; and it is also plausible that the differences in willingness are related to age, gender, cultural socialization. It would be interesting to have a longitudinal study on this topic, to validate and cross-check van Troost’s findings. But this example illuminates a different handling of information in the social sciences from that of information science: Even NAs can be useful.
Dunya Van Troost, Missing values: surveying protest emotions, in: Helena Flam, Jochen Kleres (eds.), Methods of Exploring Emotions, London / New York: Routledge 2015, pp. 294-305.