Spotlight on data quality: the Ingest Services Data Curation team – impact for data creators and users

Dr. Russell Sceats, a data curation expert from the UK Data Service, describes the work and ethos of the UK Data Archive’s Ingest Services Team.

Our Ingest Services Team handle the majority of the datasets deposited with the UK Data Archive, part of the UK Data Service, including government data, economic data, history data and social science research data, resulting from quantitative, qualitative or mixed methods research. We handle data across the information security spectrum, from Open to Safeguarded and Controlled.

We are a team of highly specialised Data Scientists working at the core of the UK Data Archive, validating and enhancing the throughput of data deposits. We have a variety of academic backgrounds, experiences, and a range of highly transferable skills, with some 90 years of Data Curation experience between us.

Dr. Sharon Bolton, Criminologist

Dr. Russell Sceats, Physicist

Dr. Andy Redfearn, Environmental Scientist

Kay Eastaugh, Social Scientist

Liz Smy, Geographer

Ole Wiedenmann, Historian

Christina Magder, Criminologist

Ausra Suboniene, Computer Scientist

Our work as data curation professionals is certified to ISO 27001, and the team adheres to all data security protocols. We are recognised as a leading authority in managing and curating data, demonstrating our curation techniques to national and international audiences and, more recently, advising Big Data researchers on best-practice data handling.

Our objective is to produce and curate high-quality datasets, giving data users access to all of the data and documentation they may need for their research, using the very best possible data products.

Quality assured information packages

We process the materials deposited by data owners, known as Submission Information Packages (SIPs), to create Archival Information Packages (AIPs) for the long-term preservation system and Dissemination Information Packages (DIPs) for data users, accessible online or by download. All the datasets processed by the Ingest team are quality assured, with consistent value-added updates and data or documentation corrections (as agreed with the data owners), and all amendments are recorded for provenance.

Professional coding and syntax

Our responsibilities include the investigation, research and enhancement of data and documentation to the highest standard. We undertake disclosure review, data conditioning, metadata creation, data management, format conversion and version control, to provide suitable data and documentation packages for secondary use by social science researchers and research institutions.

The team has experience with SPSS, Stata, R (and RStudio), Python, HTML, UNIX, FORTRAN, QuickBasic, Java, SQL etc. We spend a lot of time compiling syntax and code for various stages of data processing, analysis, migration and transfer. Syntax is compiled for all quantitative datasets, acting as the upgrade agent. The syntax is retained, serving as the archival record of all recoding, variable frequencies, variable labelling, value label and missing value label updates etc., for each data file. Macro programming for qualitative interview transcripts helps locate disclosive text and transcription errors.
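As a rough illustration of the kind of scan that can help locate disclosive text in a transcript, here is a minimal Python sketch. It is not the team's actual macro: the patterns, the function name flag_disclosive_text and the sample transcript are all hypothetical, and pattern matching of this sort can only support, never replace, a human disclosure review.

```python
import re

# Hypothetical patterns for illustration only; a real disclosure review
# relies on human judgement as well as pattern matching.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
    "phone": re.compile(r"\b0\d{2,4}[\s-]?\d{3,4}[\s-]?\d{3,4}\b"),
}


def flag_disclosive_text(transcript):
    """Return (line_number, kind, matched_text) for each suspicious match."""
    hits = []
    for line_no, line in enumerate(transcript.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((line_no, kind, match.group()))
    return hits


# Made-up transcript extract used only to exercise the function.
sample = (
    "Interviewer: Where do you live now?\n"
    "Respondent: Near the park, CO4 3SQ, you can email me at jo@example.org.\n"
)

for hit in flag_disclosive_text(sample):
    print(hit)
```

Each hit carries its line number, so a curator can go straight to the passage and decide whether it needs to be anonymised.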

User preferred format options

We adapt and run Python code on the data files to produce a data dictionary for each quantitative data file, describing survey variables and response values. The UK Data Archive Data Dictionaries are very informative documents, giving a quick and searchable profile of each data file, so that potential data users can examine the data content of each dataset beforehand.

The Python code, once executed, creates alternative data formats for data users to choose from. Where data are deposited in SPSS format, Stata and tab-delimited versions of the data files are created for those researchers who may wish to analyse the data in other software packages (for example R). Data users can also request alternative formats such as SAS. Qualitative data users can choose to investigate interview transcripts in Rich Text Format and Portable Document Format, and various image, audio and video formats are also made available, where possible.
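To give a flavour of what a data dictionary contains, here is a minimal sketch using only the Python standard library. It is an illustration, not the Archive's own code: the function build_data_dictionary, the variable names and the labels are all invented, and it assumes the data are already in tab-delimited form with labels supplied separately.

```python
import csv
import io


def build_data_dictionary(tab_text, variable_labels, value_labels):
    """Describe each variable in a tab-delimited file: label, codes, case count."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    rows = list(reader)
    dictionary = []
    for name in reader.fieldnames:
        observed = sorted({row[name] for row in rows})
        labels = value_labels.get(name, {})
        dictionary.append({
            "variable": name,
            "label": variable_labels.get(name, ""),
            # Map every observed code to its label ("" if none is defined).
            "values": {code: labels.get(code, "") for code in observed},
            "n": len(rows),
        })
    return dictionary


# Invented example data and labels.
data = "serial\tsex\tagegrp\n1\t1\t2\n2\t2\t3\n3\t2\t2\n"
var_labels = {"serial": "Serial number", "sex": "Sex of respondent", "agegrp": "Age group"}
val_labels = {"sex": {"1": "Male", "2": "Female"}, "agegrp": {"2": "25-44", "3": "45-64"}}

for entry in build_data_dictionary(data, var_labels, val_labels):
    print(entry["variable"], "|", entry["label"], "|", entry["values"])
```

Identifier variables such as serial have no value labels, so their codes simply appear with empty labels in this sketch.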

Where data are available to download, the Ingest Team create, or enable the system to output, downloadable zip bundles with checksums. Each bundle offers one of the preferred data formats described above and contains the complete data, documentation and metadata, so that data users can choose the bundle that suits them.
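The idea of a checksummed zip bundle can be sketched in a few lines of Python. This is an assumption-laden illustration rather than a description of the Archive's systems: the choice of SHA-256, the manifest file name manifest-sha256.txt, the function names and the file paths are all hypothetical.

```python
import hashlib
import io
import zipfile

MANIFEST = "manifest-sha256.txt"  # hypothetical manifest name


def build_bundle(files):
    """Zip a mapping of archive path -> bytes, adding a SHA-256 manifest."""
    manifest_lines = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in sorted(files):
            digest = hashlib.sha256(files[path]).hexdigest()
            manifest_lines.append(f"{digest}  {path}")
            bundle.writestr(path, files[path])
        bundle.writestr(MANIFEST, "\n".join(manifest_lines) + "\n")
    return buffer.getvalue()


def verify_bundle(zip_bytes):
    """Recompute each file's checksum and compare it with the manifest."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        manifest = bundle.read(MANIFEST).decode("utf-8")
        for line in manifest.splitlines():
            digest, path = line.split("  ", 1)
            if hashlib.sha256(bundle.read(path)).hexdigest() != digest:
                return False
    return True


# Invented bundle contents.
bundle_bytes = build_bundle({
    "tab/survey.tab": b"serial\tsex\n1\t2\n",
    "docs/user_guide.txt": b"User guide text",
})
print(verify_bundle(bundle_bytes))
```

A data user who downloads such a bundle can rerun the verification step to confirm that nothing was corrupted in transit.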

Dataset consistency

Our team are very fortunate to open, investigate and research each of the datasets in full, which enables us to become familiar with the research topics, as well as the survey methodology and questionnaire design. We routinely inspect each quantitative data file by running frequency commands and reading through the output for every variable, looking for anomalies, missing labels, disclosive string variables, etc. For qualitative data, we examine each interview transcript, looking for disclosive text and transcription errors, and make appropriate, reasoned corrections. Information gained during the dataset research enables us to assign important keywords and other data identifiers when compiling the catalogue records and other metadata associated with each study.

All this dataset investigation and familiarity helps us greatly when researching and processing new deposits or new editions of familiar or similar data, and in creating consistency within a data series. Data users and researchers know what to expect from our Dissemination Information Packages and related products.
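The frequency checks described above are normally run in the statistical package itself, but the underlying idea can be sketched in Python. The function frequency_report, the variables and the labels below are invented for illustration: the sketch counts each value of a variable and flags any value that has no label defined.

```python
from collections import Counter


def frequency_report(rows, value_labels):
    """Count each value per variable and flag values with no label."""
    report = {}
    for variable, labels in value_labels.items():
        counts = Counter(row[variable] for row in rows)
        unlabelled = sorted(value for value in counts if value not in labels)
        report[variable] = {"counts": dict(counts), "unlabelled": unlabelled}
    return report


# Invented example records and value labels.
rows = [
    {"sex": 1, "tenure": 1},
    {"sex": 2, "tenure": 3},
    {"sex": 2, "tenure": -9},
    {"sex": 1, "tenure": 1},
]
labels = {
    "sex": {1: "Male", 2: "Female"},
    "tenure": {1: "Owner occupier", 2: "Private renter"},
}

report = frequency_report(rows, labels)
for variable, result in report.items():
    print(variable, result["counts"], "unlabelled:", result["unlabelled"])
```

Here the tenure codes 3 and -9 would be flagged, prompting a check against the questionnaire for a missing category label or an undeclared missing value code.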

Long-term preservation and version control

The Ingest team undertake long-term collection management. We are responsible for all the data uploads, modifications, format migration, version control, metadata, documentation, and de-cataloguing of data on the UK Data Archive’s main long-term preservation servers. No data or documentation, even earlier obsolete editions, are ever discarded or lost; they are systematically retained in version-controlled directories. If an earlier edition is ever required by a researcher, this can be requested and arranged for dissemination.

Safe depository for data owners

Data owners and depositors can rely on our long-term preservation of their data, with quality assured version control, recorded data enhancements, and multiple server backup solutions. We provide a safe depository. With agreed licence conditions for data access, data owners can specify the level of dataset dissemination.

Nesstar on-line dataset investigation

We are responsible for the transfer of data resources and value-added metadata to Nesstar, the UK Data Service’s on-line data browsing tool. This includes the full survey question text, interviewer instructions, etc., copied directly from the corresponding survey questionnaires, and thorough variable grouping to enable easy survey navigation. Nesstar is widely used as a teaching resource and as a research investigation tool, enabling researchers to apply regression analysis, cross-tabulations, correlations, etc., and to carry out early data analysis before deciding whether to download the SPSS or Stata versions of the data for more intensive analysis. Nesstar carries URL links to the dataset documentation, including user guides and questionnaires, so that Nesstar users have a complete on-line research package, viewable in the web browser of their choice.

News related studies

As the UK Data Archive’s in-house data investigators we are in a perfect position to highlight, for the UK Data Service’s news and social media platforms, recent or past datasets that have subjects related to current news events or newsworthy research topics. For example, the BBC News story announcing the release of the results from the National Railway Passenger Survey 2016 coincided with the deposit of the ONS Train Satisfaction Survey. In February 2016 the BBC News article on the rise in wellbeing among those in their late 60s showcased data from the Annual Population Survey: Personal Well-Being studies. Such news stories are influential and have a measurable impact on related data download and use.

User requested dataset updates

One of the most rewarding aspects of our data research and processing schedules is when we are asked by interested data users to investigate the potential format migration of old “legacy” data, in either binary or ASCII format, into current dissemination software formats. Some of this legacy data dates back to the 1950s and 1960s. This requires the composition of appropriate code and syntax to pick up the data and compile it into a format that is easier for the 21st-century data user to work with. Provided the legacy dataset has readable documentation (often from scanned paper questionnaires or codebooks) to accompany the legacy data files, there is a very good likelihood that this format migration can be achieved. We do our best to research and improve the accompanying documentation and, where feasible, attempt as much value-added variable and response value labelling as possible. We hope the interested data user will find the new dissemination format dataset useful for their research; this often leads to a succession of related legacy dataset requests.
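Many legacy ASCII files store each record as a fixed-width line, with the column positions of each variable given in the codebook. Assuming that kind of layout, a migration to tab-delimited format might be sketched as follows. The column positions, variable names and sample records are hypothetical, and real legacy files (especially binary ones) usually need considerably more care.

```python
# Hypothetical codebook layout: variable -> (start column, end column),
# 1-based and inclusive, as codebooks of the period typically give them.
LAYOUT = {"serial": (1, 4), "sex": (5, 5), "age": (6, 7)}


def parse_fixed_width(ascii_text, layout):
    """Slice each fixed-width record into named fields."""
    records = []
    for line in ascii_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        records.append({
            name: line[start - 1:end].strip()
            for name, (start, end) in layout.items()
        })
    return records


def to_tab_delimited(records, layout):
    """Write the records as tab-delimited text with a header row."""
    header = "\t".join(layout)
    body = ["\t".join(record[name] for name in layout) for record in records]
    return "\n".join([header] + body) + "\n"


# Two invented records; the second has a blank (missing) sex code.
legacy = "0001134\n0002 27\n"

records = parse_fixed_width(legacy, LAYOUT)
print(to_tab_delimited(records, LAYOUT))
```

Blank fields come through as empty strings, leaving the decision about how to code missing values to the curator and the documentation.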

Legacy data format migration

We have a programme of legacy data format investigation, with the aim of migrating many legacy datasets forward, if possible. Some legacy datasets are the precursors to more familiar highly used Government data series. These old binary and ASCII data files are at present not very user-friendly for non-technical data users, not used to writing syntax and code. We are in the process of migrating forward the National Readership Surveys which are currently very popular with data users. Also, with the recent Scottish Referendum and the forthcoming EU Referendum in the UK, we are investigating old Common Market Referendum studies for possible format migration, as these studies from the 1970s and 1980s may well be useful to researchers. As each dataset is successfully processed to current Dissemination Information Packages and Archival Information Packages, these are flagged as newly available for interested researchers.

So, as we can see from the above examples, some of the usual and less usual aspects and impacts of data curation at the UK Data Archive have a significant effect on data consistency, reliability and availability, supporting the aim of the UK Data Service to provide an excellent service for data owners and data users. Our team continue to develop and learn, with new skills and data handling techniques, as data deposits become larger. We now routinely handle bigger data, and now that the Service has a Big Data responsibility, we’re fully integrated into that. Quite whether some Big Data will be able to receive the same levels of preservation, version control and value-added enhancements remains to be seen, but undoubtedly it will require statistical disclosure control and anonymisation, and similar levels of diligent ingest processing, before dissemination. We look forward to the challenges of the predicted data deluge!

Data Impact blog