Effective data curation for big data

As part of a suite of articles on big data, the UK Data Service’s Louise Corti, Associate Director for Collections Development and Producer Relations, and Sharon Bolton, Data Curation Manager, write about the effective curation of big data and how to promote its re-use.

Good data management practices are essential, ensuring that research data are of high quality, are well-organised, documented, preserved and accessible and that their validity is controlled. They enable data to be easily shared, ensuring the sustainability and accessibility of data in the long-term so it can be used for new research or to replicate and validate existing research. It is important that researchers extend these practices to their work with all types of data, be it big (large or complex) data or smaller, more traditional datasets.

The practice of sharing data is growing in popularity. The Research Data Alliance’s vision is for researchers and innovators to openly share data “across technologies, disciplines, and countries to address the grand challenges of society.” And making data available is trending. Louise Corti’s plenary talk at the Digital Curation Centre Research Data Management Forum 13 discussed the reasons behind this trend, such as increasing open access and transparency agendas, the huge progress in opening up government data (data.gov and data.gov.uk), the demand for value for money from public funds and a general lack of trust in published academic findings – all of which is resulting in more demands for evidence for claims and verification. Digital data are a growing commodity and, for research, big data are out there to be exploited. Assuming we have the skills to make sense of and analyse big data, how do we ensure that the data sources can be trusted and documented, and what are the challenges for replicating research?

Assessing data quality

Once researchers become accustomed to working with complex datasets, including those derived from, for example, social media, the general principles and challenges of curating these data are fundamentally the same as those for smaller, traditional data. The UK Data Service’s team of experts has over 20 years of experience curating data and can provide support for researchers who are about to embark upon processing and analysing big data. The first step in this process is to look at how one manages small data effectively and then to apply those very same principles to big data.

A key part of curating data is assessing the quality of the data to be used. For example, how do you treat missing data? The process for dealing with missing data in traditional datasets can also be applied to big data. There are tools that can automatically tell a researcher which parts of a dataset are missing, for example, algorithms that pull out the outliers once a researcher sets the parameters. Google Refine (now OpenRefine) can help with this, as can the many free R tools that are available. The UK Data Service also assists researchers in identifying potential issues – we always flag up data whose provenance we do not know, so the researcher is better informed about the limits of the data.
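As a rough illustration of the kind of automated check described above, the following Python sketch counts missing values and flags outliers using an interquartile-range rule. It is a minimal example, not the UK Data Service's own tooling: the field names, the sample readings and the multiplier `k` are all assumptions made for illustration, and the researcher would set `k` to suit the data.

```python
from statistics import quantiles


def missing_report(rows, fields):
    """Count missing (None or empty-string) values for each field."""
    return {f: sum(1 for r in rows if r.get(f) in (None, "")) for f in fields}


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].

    k is the parameter the researcher sets; 1.5 is a common default.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


# Hypothetical half-hourly meter readings (kWh); None marks a missing reading.
readings = [
    {"household": "A", "kwh": 0.42},
    {"household": "A", "kwh": 0.39},
    {"household": "A", "kwh": None},
    {"household": "A", "kwh": 0.45},
    {"household": "A", "kwh": 0.40},
    {"household": "A", "kwh": 0.41},
    {"household": "A", "kwh": 0.43},
    {"household": "A", "kwh": 0.38},
    {"household": "A", "kwh": 9.80},
]

print(missing_report(readings, ["kwh"]))
print(iqr_outliers([r["kwh"] for r in readings]))
```

A flagged value is only a prompt for investigation: whether 9.80 kWh is an error or a genuine spike is a judgement for the researcher, which is why the threshold is left as a parameter.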

Typically, a survey assigns various types of “missing”, such as missing/don’t know/weren’t asked. However, when it comes to big data, a researcher may not know the reason why particular data are missing – the data are not always the result of a survey that was gathered by a specific depositor, with a set list of questions and options to select why any variables may be missing. So, with big data, how does one ‘find’ these missing elements? A code can be assigned, but how do we deal with things that we don’t know? The overall analysis can be flawed if missing values are not properly defined. In the social sciences, methodologists are currently exploring the best ways to deal with such deficiencies in data. As data publishers, it is our responsibility to indicate the quality of data and to point out any known deficiencies or obvious errors.
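To make the distinction concrete, here is a short Python sketch that separates values with a documented reason for being missing from those where the reason is simply unknown. The numeric codes and their labels are hypothetical; a real survey would define its own in its codebook, and a big data source may define none at all.

```python
from collections import Counter

# Hypothetical codebook: negative codes record why a value is absent.
MISSING_CODES = {-9: "refused", -8: "don't know", -1: "not asked"}
UNKNOWN = "missing, reason unknown"


def classify(value):
    """Return ('valid', value) or ('missing', reason) for one raw value."""
    if value is None or value == "":
        # Nothing was recorded, so no reason can be given.
        return ("missing", UNKNOWN)
    if value in MISSING_CODES:
        return ("missing", MISSING_CODES[value])
    return ("valid", value)


def summarise(values):
    """Tally the reasons for missingness so they can be reported with the data."""
    return Counter(
        reason
        for status, reason in map(classify, values)
        if status == "missing"
    )


raw = [3, -9, 5, None, -1, 4, "", -8, -9]
summary = summarise(raw)
print(summary)
```

Keeping "reason unknown" as its own category, rather than folding it into one of the documented codes, lets the publisher state plainly how much of the missingness cannot be explained.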

Validity and reliability of findings

Sometimes there can be a big gap between the data and a researcher’s assertions. Enabling transparency of production ensures that replication of methodology is possible, which can back up a researcher’s findings, providing a form of quality control and validity. In order to replicate the methodology, it is important that the data are accompanied by a clear description of how the data were collected, stored, cleaned, and analysed. Our guide on depositing shareable survey data provides clear step-by-step instructions covering the planning of fieldwork (such as consent protocols by survey respondents), during fieldwork (such as ensuring appropriate anonymisation, consent forms), negotiating (identifying access and licensing requirements) and depositing (with the owner signing licence agreements).

Yet with big data, one may not know how the data have been gathered, particularly with social media. Taking the example of social media data, Kevin Driscoll and Shawn Walker, in Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data, say: “[T]he transformation of mass-scale Twitter data into statistics, tables, charts and graphs is presented with scant explanation of how the data are collected, stored, cleaned, and analyzed, leaving readers unable to assess the appropriateness of the given methodology to the social phenomena they purport to represent.”

Selling the benefit of sharing data to researchers is important. AllTrials, an initiative of Bad Science, BMJ, Centre for Evidence-based Medicine, Cochrane Collaboration, James Lind Initiative, PLOS and Sense About Science, calls for all past and present clinical trials to publish their full methods and summary results. It highlights that a 2012 audit found that only one-fifth of trials registered on clinicaltrials.gov had reported results within one year of completion, while other research has shown that trials with negative results are twice as likely to be left unpublished as those with positive results. Some science journals have policies that specifically address this issue. For example, PLOS ONE states that “the paper must include an analysis of public data that validates the conclusions so others can reproduce the analysis.”

One example of the value of replication is the case of the economics graduate Thomas Herndon, who found errors in the research Growth in a Time of Debt. Had the data and methodology not been available for this student to replicate the research, the error would not have been spotted. Indeed, if the methodology is not available to allow people to replicate research, how can the validity of that research be challenged (or reinforced)? Furthermore, fraud could be spotted more easily if data were routinely made available as part of publications.

Metadata issues

Metadata are needed to explain the content of data, its provenance and context. They also allow a researcher to match and link multiple sources. “Metadata needs to be organized for data sharing to be productive, for the simple reason that users need to know what the data is. The need for this becomes increasingly pressing as the number of useful and potentially useful data sources increases…. potential will only be realized if the metadata is well-documented and well-managed.” Robin Bloor, “Does Big Data Mean Big Metadata?”

However, big data can present challenges when attempting to assign metadata to a data collection. Are all variables labelled adequately, are they meaningful and interpretable, do they match the documentation (if there is any!), and is there enough information on derived data and weights (if any exist)?
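Some of these questions can be checked mechanically. The Python sketch below compares the variable names found in a dataset with those described in its documentation and reports any mismatch. The column names and codebook entries are invented for illustration, and the codebook is assumed to be a simple mapping of variable name to label.

```python
def audit_variables(columns, codebook):
    """Compare dataset columns against a codebook (dict of name -> label)."""
    cols = set(columns)
    documented = set(codebook)
    return {
        # Present in the data but not described anywhere.
        "undocumented": sorted(cols - documented),
        # Described in the documentation but absent from the data.
        "not_in_data": sorted(documented - cols),
        # Documented, but the label is empty or whitespace only.
        "unlabelled": sorted(
            name for name in cols & documented if not codebook[name].strip()
        ),
    }


# Hypothetical dataset columns and codebook.
columns = ["hh_id", "reading_kwh", "timestamp", "var17"]
codebook = {
    "hh_id": "Household identifier",
    "reading_kwh": "",
    "timestamp": "Start of half-hour interval",
    "tariff": "Tariff type",
}

report = audit_variables(columns, codebook)
print(report)
```

A check like this cannot judge whether a label is meaningful, only whether one exists, so it narrows down where a curator needs to look rather than replacing that review.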

An example illustrating the important role of metadata for big data is the UK’s Energy Demand Research Project: Early Smart Meter Trials, 2007-2010 (study number 7591) held at the Data Service. This is a ‘long and thin’ dataset comprising millions of rows of smart electricity and gas meter readings at half-hourly intervals from over 14,000 households, yet the documentation provided by the depositor was very limited, and details such as which supplier the readings refer to were not available. This example highlights the importance of staff curating big data sources undertaking further detective work to establish better provenance, where this is even possible. In many cases, with big data, there may be no direct link to a ‘depositor’.

Who owns what?

Researchers may find it harder to ascertain who the ‘owners’ of some of these ‘big data’ are – a necessity when it comes to properly acknowledging the owner and ascertaining the legal rights over re-use. Researchers need to make best efforts to find out the legal implications of re-publishing data, or parts of data, taking into account the licence under which it has been published, copyright laws and where ‘fair dealing’ may or may not apply. Owners of social media data like Twitter and Facebook do have re-use terms in their small print, but these can change over time, so they should always be consulted for each instance of data publication.

So, how do we plan to deal with these data at the UK Data Service?

Our team of experts at the UK Data Service are using an Apache Hadoop cluster to store and process big data sources, to create ‘slice and dice’ services, data analysis and visualisations, and linking services. For example, the household smart meter data mentioned earlier could be linked to temperature data at regional level to look at correlations between energy usage and weather. The only way this linkage can be achieved is through the availability of sufficient metadata that includes geographical identifiers based on recognised coding schemes that are easy to match across datasets.
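The dependence of linkage on shared identifiers can be shown with a small Python sketch. It joins meter readings to regional temperatures on a region code and a date, and keeps track of readings that cannot be matched. The region codes, field names and values are placeholders rather than a real coding scheme, and this plain-Python join only illustrates the principle; it says nothing about how the linkage is implemented on the Hadoop cluster.

```python
def link_by_region(readings, temperatures):
    """Attach a regional temperature to each reading.

    Returns (linked, unmatched); a reading is unmatched when no temperature
    record shares its region code and date.
    """
    lookup = {(t["region"], t["date"]): t["temp_c"] for t in temperatures}
    linked, unmatched = [], []
    for r in readings:
        key = (r["region"], r["date"])
        if key in lookup:
            linked.append({**r, "temp_c": lookup[key]})
        else:
            unmatched.append(r)
    return linked, unmatched


# Hypothetical daily totals per household, with placeholder region codes.
readings = [
    {"household": "A", "region": "REGION_1", "date": "2009-01-05", "kwh": 14.2},
    {"household": "B", "region": "REGION_2", "date": "2009-01-05", "kwh": 11.7},
    {"household": "C", "region": None, "date": "2009-01-05", "kwh": 9.9},
]
temperatures = [
    {"region": "REGION_1", "date": "2009-01-05", "temp_c": -1.5},
    {"region": "REGION_2", "date": "2009-01-05", "temp_c": 2.0},
]

linked, unmatched = link_by_region(readings, temperatures)
print(len(linked), "linked;", len(unmatched), "unmatched")
```

Household C has no region code, so it drops out of any weather analysis. Reporting the unmatched count alongside the results shows how much of the data the metadata was able to support.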

Last year the UK Data Service was awarded funding from the Economic and Social Research Council (ESRC) and National Research Foundation (NRF), South Africa, under the ESRC/NRF International Centre Partnerships in 2015, to work with South Africa’s DataFirst on drafting blueprints for big data infrastructure. The project ‘Smarter Household Energy Data: infrastructure for policy and planning’ aims to strengthen data expertise and research partnerships between the two countries through the formation of a research and data expertise network. You can find out more on the project blog. The outcome is knowledge exchange on scaling up activities for data archives to accommodate big data, showcasing some exemplary energy data projects that utilise the Hadoop platform and a week-long summer school in ‘Encounters with big data’ aimed at upskilling social scientists.

Citing big data

Researchers who plan to re-use existing data need to factor in “full and appropriate acknowledgement, via citation”, as outlined by the ESRC Research Data Policy (2010) and the Research Councils UK Common Principles on Data Policy. Proper citation helps maximise the impact of research. Our colleague, Victoria Moody, UK Data Service Director of Communications and Impact, announcing our #CiteTheData campaign, noted that “The citation of research data (and metadata) can support the understanding and promotion of research impact through the tracking of the use of data in research and on into policy and product development, influencing decisions about public and commercial spending and service provision.”

Citing known sources of data is now fairly routine in scholarly practice. However, big data presents some challenges. For example, data sourced from Twitter may be attributed to a particular Twitter handle that does not include the author’s real name. To cite a Tweet correctly, the citation must include both the Twitter handle and the author’s real name (the Modern Language Association provides guidelines on how to cite a Tweet). One also needs to identify the URL of that Tweet – Twitter outlines how to find the URL of an individual Tweet so it can be linked to individually (many people are unaware that all Tweets have an individual URL that provides the exact time and date that the Tweet was posted and also the number of times that it has been favourited and re-tweeted).

When citing data, common assumptions, such as the assumption that all links to the data are persistent, must be dispelled: as noted by Christine L. Borgman in Big Data, Little Data, No Data: Scholarship in the Networked World, The MIT Press, 2 January 2015, “[I]n a print world, citations are stable links between fixed objects. In a digital world, citations are links between mutable objects. Neither the citing nor the cited object may be fixed in form or location indefinitely.” One way to ensure that the data being cited can be located is to use a Digital Object Identifier (DOI), which ensures that even if the location of the data changes, the DOI will always link to the data that were used. Nevertheless, although a DOI can be assigned to any physical, digital or abstract entity, big data sources such as social media are unlikely to have formal DOIs.
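As a small illustration of how a DOI keeps a citation stable, the Python sketch below turns a DOI into a link through the doi.org resolver and assembles a simple data citation around it. The DOI, creator and title are placeholders, the pattern is only a loose sanity check rather than a full validation of DOI syntax, and the citation layout is one possible style, not a prescribed format.

```python
import re

# Loose sanity check: "10.", a numeric registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_url(doi):
    """Return the resolver URL for a DOI, which stays valid if the data move."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return "https://doi.org/" + doi


def cite(creator, year, title, publisher, doi):
    """Assemble a simple data citation ending in the persistent link."""
    return f"{creator} ({year}). {title}. {publisher}. {doi_url(doi)}"


# Placeholder values for illustration only.
citation = cite(
    creator="Example Research Group",
    year=2015,
    title="Example Dataset",
    publisher="Example Data Archive",
    doi="10.1234/example-dataset",
)
print(citation)
```

Because the citation stores the identifier rather than the current web address of the data, the link continues to resolve even if the archive later reorganises its site.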

Find out more

For researchers who would like to know more about curating big data, the UK Data Service team of experts are on hand to help. And given that the principles of curating all data, be it big or small, are the same, a good starting point is the publication Managing and Sharing Research Data: A Guide to Good Practice by Louise Corti, Veerle Van den Eynden, Libby Bishop and Matthew Woollard.

The authors of this article, Louise Corti and Dr Sharon Bolton, ran a session on Managing, curating and publishing data at the Essex Big Data and Analytics Summer School 2015, where they illustrated how good data curation can address some of the problems surrounding the use of big data. A similar course will be run in September 2016. Slides from the 2015 event are available at: https://ukdataservice.ac.uk/news-and-events/eventsitem/?id=4568

Data Impact blog
