Levels of Trace Data for Social and Behavioural Science Research

Kevin Crowston
  • January 2017, Springer Science + Business Media
  • DOI: 10.1007/978-3-319-59186-5_4

Classifying data from online social computing systems by degree of processing

What is it about?

As people interact on online systems (e.g., email, social media, citizen science, user-generated content systems), the systems record huge volumes of data about their actions and interactions (often referred to as trace data). These data can be very useful for social scientific research, but they differ in nature from conventional social science data (e.g., from surveys or interviews). A main difference is that the data are not designed to measure social science constructs and require processing to be scientifically meaningful. The paper presents a framework for describing different levels of data processing, adapted from a scheme developed for describing satellite data in earth sciences. The levels run from level 0, raw data as collected (e.g., raw tweets), to level 1, data with metadata (e.g., identity of sender, time and date), to level 2, derived social or behavioural science variables (e.g., tweets coded for sentiment), to level 3, data aggregated at some unit of analysis (e.g., data about an individual), to level 4, data linked to other datasets. The paper discusses issues in handling data at each of these levels (e.g., the difficulty in ensuring system data are reliably collected at level 0, in coding data to reach level 2, or in linking across datasets to reach level 4).
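
To make the five levels concrete, the sketch below walks a tiny invented dataset from level 0 to level 4. It is only an illustration and not code from the paper: the posts, sender names, word lists, survey table, and helper names (Level, code_sentiment, aggregate_by_sender) are all hypothetical, and the word-count sentiment coding stands in for whatever coding procedure a real study would use.

```python
from collections import defaultdict
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical labels for the five processing levels described above."""
    RAW = 0         # raw data as collected
    METADATA = 1    # data with metadata
    DERIVED = 2     # derived social or behavioural science variables
    AGGREGATED = 3  # data aggregated at some unit of analysis
    LINKED = 4      # data linked to other datasets


# Level 0: raw data as collected (invented example posts).
raw_posts = ["great launch today", "bad outage again", "great support team"]

# Level 1: the same data with metadata (sender, time and date).
posts = [
    {"text": raw_posts[0], "sender": "ana", "time": "2017-01-05T09:00"},
    {"text": raw_posts[1], "sender": "ana", "time": "2017-01-06T14:30"},
    {"text": raw_posts[2], "sender": "ben", "time": "2017-01-06T15:10"},
]

# Level 2: a derived variable. This toy word list is an assumption,
# standing in for a validated sentiment coding scheme.
POSITIVE = {"great"}
NEGATIVE = {"bad"}


def code_sentiment(text):
    """Return (# positive words) - (# negative words) for one post."""
    words = set(text.split())
    return len(words & POSITIVE) - len(words & NEGATIVE)


coded = [{**post, "sentiment": code_sentiment(post["text"])} for post in posts]


# Level 3: aggregate to a unit of analysis, here the individual sender.
def aggregate_by_sender(rows):
    """Return the mean sentiment score per sender."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["sender"]].append(row["sentiment"])
    return {sender: sum(vals) / len(vals) for sender, vals in scores.items()}


per_sender = aggregate_by_sender(coded)

# Level 4: link to another dataset (an invented survey keyed by sender).
survey = {"ana": {"age_group": "25-34"}}
linked = {
    sender: {"mean_sentiment": mean, **survey.get(sender, {})}
    for sender, mean in per_sender.items()
}

print(linked)
```

In this toy run one sender has no matching survey record, so that row reaches level 4 without the linked field, a small instance of the linking difficulty mentioned above.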

Why is it important?

There is much excitement in the social sciences about the potential of the vast amount of data generated by online actions and interactions. However, most of these data are only at level 0 or 1, while to be useful for research they need to reach at least level 2 or 3. Such processing is challenging, and the paper points out a number of possible pitfalls. Further, the processing needs to be carefully documented, which is often not the case in published research. The framework in the paper can help in analyzing what documentation is needed.


Kevin Crowston (Author)
Syracuse University

The paper arose from our research group's attempts to use trace data and the realization that doing so successfully requires a lot of effort. We have personally run into a number of the pitfalls we identify. I hope that the paper helps others think more systematically about their use of data and encourages sharing of data at different levels of processing.

The following have contributed to this page: Kevin Crowston