Effective data curation for big data

As part of our suite of posts on big data, the UK Data Service’s Louise Corti, Associate Director for Collections Development and Producer Relations, and Sharon Bolton, Data Curation Manager, discuss the curation of big data and how to promote its re-use.

Good data management practices are essential for ensuring that research data are of high quality, well-organised, documented, preserved, findable and accessible, and that their validity is guaranteed. Data can then be shared, ensuring their sustainability and accessibility in the long term, to be used for new research and policy or to replicate and validate existing research and policy. It is important that researchers extend these practices to their work with all types of data, be it big (large or complex) data or smaller, more ‘curatable’ datasets.

The Research Data Alliance’s vision is for researchers and innovators to openly share data “across technologies, disciplines, and countries to address the grand challenges of society.” Louise Corti’s plenary talk at the Digital Curation Centre Research Data Management Forum 13 discussed the reasons behind the increased focus on sharing research data, such as increasing open access and transparency agendas, the huge progress in opening up government data through data.gov.uk for example, the demand for value for money from public spending and as a way of validating academic findings. How do we ensure that big data are shareable? And what are the implications when it comes to researchers who want to re-use these ‘big’ data?

Citing big data

Data sharing places responsibilities on researchers who plan to re-use existing data to provide “full and appropriate acknowledgement, via citation”, as outlined by the ESRC Research Data Policy (2010). The Research Councils UK Common Principles on Data Policy states: “all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed.”

Proper citation is also important because it helps maximise the impact of research. My colleague, Dr Victoria Moody, UK Data Service Communications and Impact Director, when announcing the launch of the UK Data Service’s #CiteTheData campaign, said “The citation of research data (and metadata) can support the understanding and promotion of research impact through the tracking of the use of data in research and on into policy and product development, influencing decisions about public and commercial spending and service provision.”

So, a researcher needs to cite the data – sounds simple. However, big data, being larger and/or more complex than traditional datasets, present challenges. For example, data sourced from Twitter may be attributed to a particular Twitter account that does not include the author’s real name. To cite a Tweet correctly, the citation must include both the Twitter account name and the author’s real name (the Modern Language Association provides guidelines on how to cite a Tweet, while the American Psychological Association blog also provides recommendations and examples for citing social media sources). One also needs to identify the URL of that Tweet – Twitter outlines how to find the URL of an individual Tweet so it can be linked to individually (many people are unaware that all Tweets have an individual URL that provides the exact time and date that the Tweet was posted and also the number of times that it has been favourited and re-tweeted).

When citing data, common assumptions, such as the assumption that all links are persistent, must be dispelled. As noted by Christine L. Borgman in Big Data, Little Data, No Data: Scholarship in the Networked World, The MIT Press, 2 January 2015, “[I]n a print world, citations are stable links between fixed objects. In a digital world, citations are links between mutable objects. Neither the citing nor the cited object may be fixed in form or location indefinitely.” One way to ensure that the data being cited can be located is to use a Digital Object Identifier (DOI), which ensures that even if the location of the data changes, the DOI will always link to the data that were used. Nevertheless, although a DOI can be assigned to any physical, digital or abstract entity, big data sources, such as social media, are unlikely to have formal DOIs.
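To illustrate how a DOI keeps a citation stable, here is a minimal Python sketch that turns a DOI into a link through the doi.org resolver and places it in a citation string. The DOI `10.1234/example-dataset`, the creator and title, and the citation layout in `cite_dataset` are all hypothetical placeholders, not a recommended citation style.

```python
from urllib.parse import quote

# The public DOI resolver; the DOI itself stays the same even if the data move.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI, with or without a common prefix, into a resolver link."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # Percent-encode unusual characters but keep the "/" separating prefix and suffix.
    return DOI_RESOLVER + quote(cleaned, safe="/")


def cite_dataset(creator: str, year: int, title: str, publisher: str, doi: str) -> str:
    """Build a citation string (hypothetical layout, for illustration only)."""
    return f"{creator} ({year}). {title} [data collection]. {publisher}. {doi_to_url(doi)}"


# Hypothetical example values.
citation = cite_dataset(
    "Example Research Team", 2015, "Example Survey", "Example Data Archive",
    "doi:10.1234/example-dataset",
)
print(citation)
```

Because the citation carries the resolver link rather than the current web address of the files, it does not need to change if the data are moved.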

Who owns what?

Researchers may find it harder to ascertain who the ‘owners’ of some of these ‘big data’ are – a necessity when it comes to properly acknowledging the owner and ascertaining the legal rights over re-use. Researchers need to make best efforts to find out the legal implications of re-publishing data, or parts of data, taking into account the licence under which they have been published, copyright laws and where ‘fair dealing’ may or may not apply. Owners of social media data like Twitter and Facebook do have re-use terms in their small print, but these can change over time, so should always be consulted for each instance of data publication.

Assessing data quality

Once researchers become accustomed to working with complex datasets, including those derived from social media, the general principles and challenges of curating these data are fundamentally the same as those for smaller, traditional datasets. The UK Data Service’s team of experts has decades of experience curating data and can provide support for researchers who are about to embark upon processing and analysing big data. The first step in this process is to look at how to manage small data effectively and then to apply those very same principles to big data.

A key part of curating data is assessing the quality of the data to be used, for example, deciding how to treat missing data. The process for dealing with missing data in traditional datasets can also be applied to big data. There are tools that can automatically show a researcher which parts of a dataset are missing, as well as algorithms that pull out the outliers once a researcher sets the parameters. Google Refine can help with this, as can the many free R tools that are available. The UK Data Service also assists researchers in identifying potential issues – we always flag up any data for which we do not know the provenance, so the researcher is better informed about what they are dealing with.
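As a rough illustration of this kind of automated check, the Python sketch below lists which positions in a series are missing and which values fall outside a range the researcher controls. It uses the common interquartile-range rule, with the multiplier `k` as the researcher-set parameter. The function name, the sample readings and the choice of rule are assumptions made for the example, not a description of how any particular tool works.

```python
import statistics


def summarise_quality(values, k=1.5):
    """Report positions of missing values and of outliers in a numeric series.

    Outliers are values more than k interquartile ranges below the first
    quartile or above the third quartile.
    """
    present = [v for v in values if v is not None]
    missing = [i for i, v in enumerate(values) if v is None]

    # Quartiles of the observed (non-missing) values.
    q1, _median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    outliers = [
        i for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]
    return {"missing": missing, "outliers": outliers}


# Hypothetical half-hourly meter readings; None marks a gap in the file.
readings = [0.42, 0.38, None, 0.40, 0.45, 9.70, 0.41, None, 0.39]
report = summarise_quality(readings)
print(report)
```

A check like this only points to where the problems are. Deciding whether a flagged value is an error or a genuine extreme still needs the researcher's judgement.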

Typically, survey data includes various types of ‘missing’ codes, such as ‘don’t know’/‘not applicable’/‘not asked’. However, when it comes to big data, the reason why particular data are missing may not be clear – there may not be an accompanying list of codes to define the reasons. Exploratory data analysis techniques can find the missing responses, but it may take some more investigation to find out what type of ‘missing’ those responses are. Perhaps the category is ‘not applicable’ or perhaps the data are just not included in the file. The overall analysis can be flawed if missing values are not properly defined, or the dataset is incomplete. In the social sciences, methodologists are currently exploring the best ways to deal with such deficiencies in data. As data publishers, it is our responsibility to indicate the quality of data and to point out any known deficiencies or obvious errors.
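The difference between documented and undocumented ‘missing’ values can be sketched in a few lines of Python. The numeric codes in `MISSING_CODES` are hypothetical. A real survey defines its own codes in its documentation, and a big data source may define none at all.

```python
from collections import Counter

# Hypothetical missing-value codes, as a survey codebook might define them.
MISSING_CODES = {-9: "don't know", -8: "not applicable", -7: "not asked"}


def classify(value):
    """Label a response as valid, missing for a documented reason, or simply absent."""
    if value is None or value == "":
        return "missing (reason undocumented)"
    if value in MISSING_CODES:
        return f"missing ({MISSING_CODES[value]})"
    return "valid"


# Hypothetical responses to one question; None and "" are gaps in the file.
responses = [3, -9, 5, None, -8, 2, "", -7, 4]
counts = Counter(classify(v) for v in responses)

for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

The ‘reason undocumented’ group is the one that needs further investigation before analysis, because nothing in the file says why those values are absent.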

Validity and reliability of findings

Sometimes there can be a big gap between the data and a researcher’s assertions. Enabling transparency of production ensures that replication of methodology is possible, which can back up a researcher’s findings, providing a form of quality control and validity. In order to replicate the methodology, it is important that the data are accompanied by a clear description of how the data were collected, stored, cleaned, and analysed. Our guide on depositing shareable survey data provides clear step-by-step instructions covering planning fieldwork (such as consent protocols for survey respondents), fieldwork itself (such as ensuring appropriate anonymisation and consent forms), negotiating (identifying access and licensing requirements) and depositing (with the owner signing licence agreements).

Yet with big data, one may not know how the data have been gathered, particularly with social media. Kevin Driscoll and Shawn Walker, in Big Data, Big Questions Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data, say: “[T]he transformation of mass-scale Twitter data into statistics, tables, charts and graphs is presented with scant explanation of how the data are collected, stored, cleaned, and analyzed, leaving readers unable to assess the appropriateness of the given methodology to the social phenomena they purport to represent.”

Selling the benefit of sharing data to researchers is important. AllTrials, an initiative of Bad Science, BMJ, Centre for Evidence-based Medicine, Cochrane Collaboration, James Lind Initiative, PLOS and Sense About Science, calls for all past and present clinical trials to publish their full methods and summary results. It highlights that a 2012 audit found that one-fifth of trials registered on clinicaltrials.gov had reported results within one year of completion, while other research has shown that trials with negative results are twice as likely to be left unpublished as those with positive results. Some science journals have policies that specifically address this issue. For example, PLOS ONE states that “the paper must include an analysis of public data that validates the conclusions so others can reproduce the analysis.”

For example, economics graduate Thomas Herndon found errors in the research Growth in a Time of Debt. Had the data and methodology not been available for this student to replicate the research, the errors would not have been spotted. Similarly, in 2011, publications from the British Journal of Social Psychology and Basic and Applied Social Psychology had to be withdrawn when a Dutch researcher was found to have fabricated data. Indeed, if the methodology is not available to allow people to replicate research, how can the validity of that research be challenged (or reinforced)?

Metadata issues

Big data also present metadata challenges. Variables may not be labelled, or the variables may not match the documentation (if there is any). Similarly, if categorical variables are included, they may not have value labels and there may be no explanation of derived data. The existence of good metadata is paramount; it is needed to explain the content of data, its provenance and context. For example, the term ‘employment’ in one dataset may not mean the same as that term in the context of another dataset. It is necessary to understand whether different employment coding frames have been used and if they can be matched between the two datasets. Metadata also allow a researcher to ‘match’ multiple disparate sources. An example illustrating the importance of using metadata for big data is the Energy Demand Research Project: Early Smart Meter Trials, 2007-2010 (study number 7591) which covers 14,621 households, 246m gas meter readings, 413m electricity meter readings, all collected every half hour.

So, what do we plan to do with these data?

Our team of experts at the UK Data Service plan to use an Apache Hadoop cluster to store the data and use it to create data visualisations, data products and potentially match it with weather data and fuel poverty data. The only way this can be achieved is to ensure that the data sources include sufficient metadata – where standardised and widely-used definitions of geographic and socio-economic categories have been used across the data sources so they can be matched by Acorn group (a classification of residential neighbourhoods), local authority and Nomenclature of Units for Territorial Statistics (NUTS) areas (a hierarchical system for the classification of spatial units). The team will also use recognised coding schemes that are easy to match across datasets.
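To show why shared, standardised codes matter for this kind of matching, here is a small Python sketch that joins household meter readings to area-level weather data on a common local authority code. All identifiers, codes and values are invented for illustration. The sketch runs on in-memory lists rather than a Hadoop cluster and is not the team's actual pipeline.

```python
def match_by_area(readings, area_data, key="local_authority"):
    """Attach area-level attributes to each reading that shares an area code.

    Returns the matched rows and, separately, the rows whose area code has
    no counterpart in the area-level source.
    """
    matched, unmatched = [], []
    for row in readings:
        area = area_data.get(row.get(key))
        if area is None:
            unmatched.append(row)
        else:
            matched.append({**row, **area})
    return matched, unmatched


# Hypothetical meter readings, each carrying a local authority code.
meter_readings = [
    {"household_id": "H1", "local_authority": "LA001", "kwh": 4.2},
    {"household_id": "H2", "local_authority": "LA002", "kwh": 3.1},
    {"household_id": "H3", "local_authority": "XX999", "kwh": 5.0},  # code not in the weather source
]

# Hypothetical weather data keyed by the same local authority codes.
weather_by_area = {
    "LA001": {"mean_temp_c": 6.5},
    "LA002": {"mean_temp_c": 7.1},
}

matched, unmatched = match_by_area(meter_readings, weather_by_area)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

The unmatched rows are the practical cost of inconsistent metadata. If one source had used a different geography or coding scheme, every row would end up there.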

“Metadata needs to be organized for data sharing to be productive, for the simple reason that users need to know what the data is. The need for this becomes increasingly pressing as the number of useful and potentially useful data sources increases…. potential will only be realized if the metadata is well-documented and well-managed.” Robin Bloor, “Does Big Data Mean Big Metadata?”

Find out more

For researchers who would like to know more about curating big data, the UK Data Service team of experts are on hand to help. Visit the Big Data Network Support pages to find out more. Given that the principles of curating all data, be they ‘big’ or ‘small’, are the same, a good starting point is the publication Managing and Sharing Research Data: A Guide to Good Practice by Louise Corti, Veerle Van den Eynden, Libby Bishop and Matthew Woollard.

Louise Corti and Dr Sharon Bolton ran a session on Managing, curating and publishing data at the Essex Big Data and Analytics Summer School 2015, where they illustrated how good data curation can address some of the problems surrounding the use of big data. The presentations from this event are available here:

Data Impact blog