QAMyData: A health check for numeric data

Louise Corti reports on the QAMyData tool recently developed by the UK Data Service. The tool is a free easy-to-use open source tool that provides a health check for numeric data. It uses automated methods to detect and report on some of the most common problems in survey or numeric data, such as missing data, duplication, outliers and direct identifiers.

Why is a checking tool needed?

Social science research benefits from accountability and transparency, which can usefully be underpinned by high quality and trustworthy data.

It can be a challenge when curating data to locate appropriate tools for checking and cleaning data in order to make data FAIR. The tasks of checking, cleaning and documenting data by repository staff can be manual and time-consuming.

Disciplines that require high quality data to be created, prepared and shared benefit from more regularised ways to do this.

The objective of this grant was to develop a light weight, open-source tool based on open-source software for implementation of quality assessment of research data.

This can be viewed as a ‘data health check’ that identifies the most common problems in data submitted by users in disciplines that utilise quantitative methods. We believe this could be appealing to a range of disciplines beyond social science where work involves surveys, clinical trials or other numeric data types. Furthermore, a tool that could be easily slotted into general data repository workflow would be appealing.

How were quality checks selected?

Requirements were gathered around the kinds of checks that could be included through a series of engagement exercises with the UK Data Service’s own data curation team, other data publishers, managers and quantitative researchers.

A number of repositories were invited to online meetings to discuss their data assessment methods. Common examples included:

column and row (and case) issues
missing values
missing or incomplete labels
odd characters
obviously disclosive personal information.

A comprehensive list of ‘tests’ was produced to include the most commonly used when quality assessing numeric data files.

Next the team worked on appropriate methods of assessment, how to set up input controls and consider reporting thresholds. For example, what threshold might constitute a ‘fail’?

A critical feature of the tool was that a user should be able to specify and set thresholds to indicate what they are prepared to accept, be it no missing data or data must be fully labelled.

Issues would be identified in both a summary and detailed report, setting out where to find the error/issue (for example, the column/variable and case/row number).

This at-a-glance report aspect is appealing for data repositories, to help quickly assess data as it comes in, instead of relying on manual processes that are often a large part of the evaluation workflow. An early plan was also that the system must be extensible to add new tests.

The types of checks were broken down into four types:

File checks

File opens	Checks whether acceptable format
Bad filename check, regular expression via RegEx pattern	Regex requires quotes “[a-z]”. To use a special characters, e.g. a backslash (\) a backslash before is required e.g. \\

Metadata checks

Report on number of cases and variables	Always run
Count of grouping variables
Missing variable labels	Must be set to true. If set to false the test will not run
No label for user defined missing values e.g. – 9 not labelled	SPSS only
‘Odd’ characters in variable names and labels	User specifies the characters
‘Odd’ characters in value labels	User specifies the characters
Maximum length of variable labels, e.g. >79 characters	User specifies the length
Maximum length of value labels, e.g. >39 characters	User specifies the length
Spelling mistakes (non-dictionary words) in variable labels using a dictionary file	User specifies a dictionary file.
Spelling mistakes (non-dictionary words) in value labels using a dictionary file	User specifies a dictionary file

Data integrity checks

Report number of numeric and string variables	Always run
Check for duplicate IDs	User specifies the variables. Multiple variables can be added on new lines e.g. – Caseno – AnotherVariableHere
‘Odd’ characters in string data	User specifies the characters
Spelling mistakes (non-dictionary words) in string data using a dictionary file	User specifies a dictionary file
Percentage of values missing (‘Sys miss’ and undefined missing)	User sets the threshold, e.g. more than 25%

Disclosure control checks

Identifying disclosure risk from unique values or low thresholds (frequencies of categorical vars or minimum values)

User sets the threshold value, e.g. 5

Direct identifiers using a RegEx pattern search

User runs separately for postcodes, telephone numbers etc.

Advise tests are separately as may be resource intensive

The tool development: technology choices

We are very fortunate to have Jon Johnson, ex database manager for the British Cohort Studies as our lead on technical work.

At the time he was leading on the user side on the UK Data Service’s big data platform work (Smart Meter Research Portal) with UCL, thus bringing dual aspects of small-scale survey work with the challenge of ingesting and quality-assessing large-scale streaming data. The tool envisaged should be able to consider a range of QA solutions for all numeric data, regardless of scale.

We were also happy to have recruited a local part-time programmer, a dynamic final year computer science undergraduate, who had previously worked for the UK Data Archive. Myles Offord proved to be an ambitious and hugely productive software engineer who undertook some thorough R&D with Jon before the final software solutions were selected.

The choice of technology underpinning the tool went through at least four months of research, experimenting with different open source programming languages and libraries of statistical functions. R, Python and Clojure, initially on SPSS and STATA files.

During the course of the development phase of project, we found that the open source library, Readstat supported all the commonly used file types. And had been noticed in the statistical community. As the library was being actively maintained in the community, it is a strong backbone for QAMyData and ultimately, was a very good choice for the tool.

As different statistical software treats data differently, input checks needed to be software specific; for example, Stata insists on its own input conditions. Output reporting had to ensure that a standard frame as built and that the error/issue were easily locatable by the user.

A relatively new agile programming language, Rust, was discovered and selected as the best choice for the wrapper. The Rust application was developed, iterated and the code is published on the UK Data Service GitHub along with comprehensive instructions on how to download the programme

The software was designed to be easily downloaded to a laptop or server and can be quickly used and integrated into data cleaning and processing pipelines for data creators, users, reviewers and publishers.

The QAMyData software is available for Linux, Windows and Mac, and can be quickly used and integrated into data cleaning and processing pipelines. It is available to download from the UK Data Service Github pages under a MIT Licence.

Running the tests

The first release of QAMyData allowed a small number of critical quality tests to run, the intention being to add the remaining desirable tests following initial external testing.

SPSS, Stata, SAS and CSV formats can all be loaded in. The tool uses a configuration file (written in yaml) that has default settings for each test; such as a threshold for pass or fail on various tests (e.g. detect value labels that are truncated, email addresses identified as a string, or undefined missing values) can be easily adapted.

# Checks whether any user-defined missing values do not have labels (sysmis) - SPSS only

value_defined_missing_no_label:

setting: true

desc: "User-defined missing values should have a label (SPSS only)"

Example of a check in the config file

The regular expressions checks to detects e.g. emails or telephone number and so on, can be quite resource intensive to run, so these are best run separately; they can be commented out from the default configuration file and run again.

regex_patterns:

setting:

- "^[A-Za-z]{1,2}[0-9A-Za-z]{1,2}[ ]?[0-9]{0,1}[A-Za-z]{2}$"

desc: Values matching the regex pattern fail (full UK postcodes found in the data)

Example of a regex check in the config file

The software creates a report as a ‘data health check’ that details errors and issues, as both a summary and providing a location of the failed test.

Tests run that are highlighted in green in the summary report have passed, meaning that there were no issues encountered according to the thresholds set.

Failed tests are shown in red, indicating that QAMyData has identified issues in particular variables or values.

To locate the problems, a user can click on a red test, which takes them to more detailed table, which shows the first 1000 cases.

In the example below, to view the results of the failed ‘Variable odd characters’ test, a click on the failed test will scroll down to the result, in this case that variables V137 and OwnTV contain “odd” characters in their label.

Summary report from the QAMyData tool

Example of a summary report

Detailed report from the QAMyData tool

Example of a detailed report for particular failed checks

Data depositors and publishers can act on the results and resubmit the file until a clean bill of health is produced.

Testing, testing

The project undertook evaluation of the tool, algorithm and upload process with researchers, teachers, students and data repositories, including partner international data archives, university data repositories and journals.

Our first hands-on session with users was held as an NCRM course in February at the LSE focused around the principles of, and tools for Assessing data quality and disclosure risk in numeric data.

Half of the 20 attendees came from outside the academia sector, from government departments and the third sector.

For the hands-on part, materials included a worksheet on data quality, using a purposefully ‘messy’ dataset, containing common errors that we asked participants to locate and deal with.

Feedback from this early workshop recognised the importance of undertaking data assessment. Given the implications of the GDPR when creating and handling data, users also appreciated opportunities for a greater understanding of how to review data for direct identifiers and welcomed the idea of a simple, free and extensible tool to help with data cleaning activities.

We received feedback that the tool would be useful in teaching on quantitative data analysis courses, suggesting that it would be useful to set up a longer-term dedicated teaching instance.

Further presentation and hands-on training sessions held from March to June unearthed some constructive more feedback on accessing the software, pointing to improvements to the User Guide and suggestions for additional data checks.

By the end of the second workshop our resources were refined, ready for release. We are delighted that we experienced such interest from a variety of sectors, and expect more enquiries and opportunities to promote and showcase the tool and training aspects of the project.

The final few weeks of our project saw the team fully document the tool, annotate the config file and provide a step-by-step user guide.

Page from QAMyData User Guide

Capacity building aims and deliverables

One of the key aims was also to support interdisciplinary research and training by creating practical training materials that focus on understanding and assessing data quality for data production and analysis.

We sought to incorporate data quality assessment into training in quantitative methods. In this respect, both the UK Data Service training offerings and NCRM research methods training nodes are excellent vehicles for promoting such a topic.

A training module on what makes a clean and well-documented numeric dataset was created. This included a very messy and purposely-erroneous dataset and training exercises compiled by Cristina Magder, plus a detailed user guide. These were road tested and versions iterated during early training sessions.

The tool in use

Version 1.0 of the QAMyData tool is available for use. Since releasing earlier versions of the software in the spring, we have undertaken some work to embed the tool into core workflows in the UK Data Service.

The Data Curation team now use it to QA data as it comes in, to help with data assessment, and we are scoping the needs for integration into the UK Data Service self-archiving service, ReShare, so that depositors can check their numeric data files before they submit them for onward sharing.

We hope that the tool will be picked up and used widely, and that the simple configuration feature will enable data publishers to create and publish their own unique Data Quality Profile, setting out explicit standards they wish to promote.

We welcome new suggestions for new tests, which can be added by opening a ‘New Issue’ on the Issues space in our Github area.

Open a new issue in our Github space

End note

Louise gained a grant from the National Centre for Research Methods (NCRM) under its Phase 2 Commissioned Research Projects fund, which enabled us to employ a project technical lead and a software engineer. The project ran from January 2018 to July 2019 and version 1.0 of the QAMyData tool is available for use.

The QAMyData project, and its resulting software and training materials, was a very satisfying project to lead. My colleagues Jon Johnson, Myles Offord, Cristina Magder, Anca Vlad and Simon Parker were a real pleasure to work, making up a friendly dedicated team, who were open to ideas and responsive to the feedback from user testing.

Louise Corti is Service Director, Collections Development and Producer Relations for the UK Data Service. Louise leads two teams dedicated to enriching the breadth and quality of the UK Data Service data collection: Collections Development and Producer Support. She is an associate director of the UK Data Archive with special expertise in research integrity, research data management and the archiving and reuse of qualitative data.

Data Impact blog