Introducing QAMyData: A health check for numeric data

Louise Corti introduces QAMyData, a tool recently developed by the UK Data Service. QAMyData is a free, easy-to-use, open-source tool that provides a health check for numeric data. It uses automated methods to detect and report on some of the most common problems in survey or other numeric data, such as missing data, duplication, outliers and direct identifiers.

 

Why do we need a checking tool?

Accountability, transparency and reproducibility are key to high-quality social science research. Such research depends on high-quality, trustworthy data.

However, finding the right tools to check and clean data so that it is FAIR (findable, accessible, interoperable and reusable) can be challenging. For repository staff, this can mean time-consuming manual checking, cleaning and documentation of data.

We recognised that a lightweight, open-source tool for quality assessment of research data, itself built on open-source software, would benefit both researchers and repository staff. Consequently, we secured a grant from the National Centre for Research Methods (NCRM) under its Phase 2 Commissioned Research Projects fund to develop the QAMyData tool.

QAMyData undertakes four types of checks on datasets:

  • File checks
  • Metadata checks
  • Data integrity checks
  • Disclosure control checks

QAMyData then creates a ‘data health check’ report that details errors and issues, giving both a summary and the location of each failed test.
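To make the idea concrete, here is a minimal Python sketch of a few checks of this kind: missing values, duplicate rows and outliers (data integrity), plus a simple email pattern as a stand-in for direct identifiers (disclosure control). It is an illustration only and not QAMyData's own implementation; the sample rows, function names, the email pattern and the outlier rule (more than three median absolute deviations from the median) are all assumptions made for this example.

```python
import re
import statistics

# Hypothetical sample data: each row is one survey respondent.
ROWS = [
    {"id": "1", "age": "34", "income": "21000", "contact": ""},
    {"id": "2", "age": "", "income": "25000", "contact": ""},
    {"id": "3", "age": "41", "income": "23000", "contact": "jo@example.org"},
    {"id": "3", "age": "41", "income": "23000", "contact": "jo@example.org"},
    {"id": "4", "age": "29", "income": "990000", "contact": ""},
]

# Assumed, deliberately simple pattern for one kind of direct identifier.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def find_missing(rows, columns):
    """Return (row_number, column) for every empty cell in the given columns."""
    return [(i, c) for i, row in enumerate(rows, start=1)
            for c in columns if row[c].strip() == ""]


def find_duplicates(rows):
    """Return (row_number, first_seen_row_number) for every repeated row."""
    seen, duplicates = {}, []
    for i, row in enumerate(rows, start=1):
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append((i, seen[key]))
        else:
            seen[key] = i
    return duplicates


def find_outliers(rows, column, k=3.0):
    """Return row numbers whose value lies more than k MADs from the median."""
    values = [(i, float(row[column])) for i, row in enumerate(rows, start=1)
              if row[column].strip()]
    if len(values) < 3:
        return []
    median = statistics.median(v for _, v in values)
    mad = statistics.median(abs(v - median) for _, v in values)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [i for i, v in values if abs(v - median) > k * mad]


def find_identifiers(rows):
    """Return (row_number, column) for every cell that looks like an email."""
    return [(i, c) for i, row in enumerate(rows, start=1)
            for c, v in row.items() if EMAIL.search(v)]


def health_check(rows):
    """Run every check and collect the location of each failure."""
    return {
        "missing": find_missing(rows, ["age", "income"]),
        "duplicates": find_duplicates(rows),
        "outliers": find_outliers(rows, "income"),
        "identifiers": find_identifiers(rows),
    }


report = health_check(ROWS)
for name, failures in report.items():
    status = "passed" if not failures else f"failed ({len(failures)})"
    print(f"{name}: {status} {failures}")
```

Each check returns the row (and, where relevant, the column) of every failure, so the same result can be printed as a one-line summary or expanded into a detailed list of locations.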

[Figure: Example of a summary report from the QAMyData tool]

[Figure: Example of a detailed report from the QAMyData tool for particular failed checks]

 

Data depositors and publishers can act on the results and resubmit the file until a clean bill of health is produced.

In this sense, QAMyData acts as a ‘data health check’ that identifies the most common problems in data submitted by users in disciplines that use quantitative methods.

We believe that QAMyData could support researchers in the social sciences and beyond whose work involves surveys, clinical trials or other numeric data types. Furthermore, a tool that can be slotted easily into a general data repository workflow should appeal to repositories.

Capacity building aims and deliverables

Another key aim was to support interdisciplinary research and training by creating practical training materials that focus on understanding and assessing data quality for data production and analysis.

We sought to incorporate data quality assessment into training in quantitative methods. In this respect, both the UK Data Service training offerings and NCRM research methods training nodes are excellent vehicles for promoting such a topic.

We created a training module on what makes a clean and well-documented numeric dataset. It includes a very messy and purposely erroneous dataset and training exercises compiled by Cristina Magder, plus a detailed user guide. These were road-tested and revised during early training sessions.

 

The tool in use

Version 1.0 of the QAMyData tool is available for use.

Since releasing earlier versions of the software in the spring, we have undertaken some work to embed the tool into core workflows in the UK Data Service.

The Data Curation team now uses it to quality-assure data as it comes in, to help with data assessment. We are also scoping the requirements for integrating it into the UK Data Service self-archiving service, ReShare, so that depositors can check their numeric data files before they submit them for onward sharing.

We hope that the tool will be picked up and used widely, and that the simple configuration feature will enable data publishers to create and publish their own unique Data Quality Profile, setting out explicit standards they wish to promote.
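As a rough illustration of what such a profile could look like, the Python sketch below models a publisher's standards as a small set of thresholds and compares measured results against them. This is not QAMyData's actual configuration format; the class name, field names, threshold values and measured figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityProfile:
    """Hypothetical thresholds a publisher might set (not QAMyData's format)."""
    max_missing_pct: float = 5.0
    max_duplicate_rows: int = 0
    allow_direct_identifiers: bool = False


def evaluate(profile, measured):
    """Return the names of the tests that fail under the given profile."""
    failed = []
    if measured["missing_pct"] > profile.max_missing_pct:
        failed.append("missing_pct")
    if measured["duplicate_rows"] > profile.max_duplicate_rows:
        failed.append("duplicate_rows")
    if measured["direct_identifiers"] > 0 and not profile.allow_direct_identifiers:
        failed.append("direct_identifiers")
    return failed


# Two publishers with different standards, checked against the same file.
strict = QualityProfile()
lenient = QualityProfile(max_missing_pct=20.0, max_duplicate_rows=2)

# Made-up measurements for one deposited file.
measured = {"missing_pct": 8.5, "duplicate_rows": 1, "direct_identifiers": 0}

strict_failures = evaluate(strict, measured)
lenient_failures = evaluate(lenient, measured)
print("strict profile failures:", strict_failures)
print("lenient profile failures:", lenient_failures)
```

The point of the sketch is that the same file can pass under one published profile and fail under another, which is why stating the thresholds explicitly matters.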

We welcome suggestions for new tests, which can be made by opening a ‘New Issue’ in the Issues space of our GitHub area.

 

Louise has written more about the development and details of QAMyData.


About the author

Louise Corti is Service Director, Collections Development and Producer Relations for the UK Data Service. Louise leads two teams dedicated to enriching the breadth and quality of the UK Data Service data collection: Collections Development and Producer Support. She is an associate director of the UK Data Archive with special expertise in research integrity, research data management and the archiving and reuse of qualitative data.
