Innovation Fund project: Risk-Utility Data Management Tool (RUM)

Richard Welpton of the UK Data Service introduces the Risk-Utility Data Management Tool (RUM), which will enable data producers and service providers (such as the UK Data Service) to produce anonymised versions of datasets from confidential sources consistently, efficiently, and quickly.

Introduction

Microdata provide the foundation for many types of insightful research with impact. Researchers routinely access microdata through the UK Data Service portal. Their data requirements determine whether they access data through an End User Licence (the UK equivalent of a Public Use File), a Special Licence (Scientific Use File), or a Controlled Licence (Secure Use File). The sensitivity of these versions increases as the data become more detailed about individuals and commercial sector organisations. For some datasets which the UK Data Service holds, such as the Labour Force Survey or Understanding Society, there is a version available through each licence type.

A Controlled Licence version of a dataset has had little anonymisation applied: only survey respondents’ names and addresses are removed. An End User Licence version, by contrast, has had many anonymisation (or Statistical Disclosure Control, SDC) techniques applied: for example, top-coding to remove outliers; banding of variables such as age and income; creating averages and ratios; and removing sensitive values.

Similar techniques are applied to produce Special Licence data, but to a lesser extent, since the conditions of accessing such data are more strict.
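Two of the techniques named above, top-coding and banding, can be sketched in a few lines. This is an illustration only: it is written in Python rather than the R that RUM is built on, the function names are hypothetical, and the income cap and band width are made-up values, not taken from any licence specification.

```python
# Illustrative sketch only: the cap and band width below are assumed values.

def top_code(values, cap):
    """Top-coding: replace every value above `cap` with `cap` itself."""
    return [min(v, cap) for v in values]


def band(value, width):
    """Banding: map a whole number to a band label, e.g. 37 -> '30-39' for width 10."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


incomes = [18000, 25000, 32000, 410000]
ages = [23, 37, 41, 68]

top_coded_incomes = top_code(incomes, cap=100000)
banded_ages = [band(a, width=10) for a in ages]

print(top_coded_incomes)  # the outlier 410000 is reduced to the cap
print(banded_ages)        # exact ages are replaced by ten-year bands
```

A less detailed version of a dataset would typically use a lower cap or wider bands; a more detailed version would relax them.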

Specifications for the different version types are available; however, the anonymisation itself is currently carried out manually. Consequently, datasets of each version type vary from one month, quarter or year to the next. For example, a variable may be included in one year’s data but not in another’s. This confounds researchers who rely upon a stream of consistently produced data to undertake robust analysis of our society.

What will RUM deliver?

The product will enable data producers and service providers (such as the UK Data Service) to produce anonymised versions of datasets from confidential sources consistently, efficiently, and quickly.

Users will access a simple, easy-to-use graphical interface which will allow them to examine the original source data, and undertake tests to determine which variables need to be anonymised, and whether any observations need to be removed.
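To give a flavour of the kind of test this might involve, the sketch below counts how many records share each combination of identifying variables and flags those in groups smaller than a threshold `k`. This is one common style of disclosure check, assumed here for illustration; the post does not say which tests RUM runs, and the function name, variable names and threshold are hypothetical. The example is in Python, not RUM's R code.

```python
from collections import Counter

# Illustrative sketch only: a simple "how unique is this record?" check.


def rare_records(records, key_vars, k):
    """Return records whose combination of `key_vars` occurs fewer than `k` times."""
    combos = Counter(tuple(r[v] for v in key_vars) for r in records)
    return [r for r in records if combos[tuple(r[v] for v in key_vars)] < k]


records = [
    {"age_band": "30-39", "region": "North", "sex": "F"},
    {"age_band": "30-39", "region": "North", "sex": "F"},
    {"age_band": "30-39", "region": "North", "sex": "F"},
    {"age_band": "60-69", "region": "South", "sex": "M"},
]

# Records that stand out on these variables are candidates for further
# anonymisation or removal.
flagged = rare_records(records, key_vars=["age_band", "region", "sex"], k=3)
print(flagged)
```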

Users will then follow a journey consisting of a small number of steps in which they can adjust the ‘degree’ of anonymisation that will be applied to the data. The amount of ‘anonymisation’ applied to the data will depend on whether the intended output data are for dissemination as Open Data, Safeguarded Data or Controlled Data.

In addition, the program will include the ability to produce synthetic data versions of the confidential sources, which will allow prospective researchers to examine a ‘fake’ version of the data before making an application to access the real data.

The program can save scripts to enable a data producer to re-run the anonymisation specification on later sources of data, in order to produce a consistent series of datasets.
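The idea of a saved, re-runnable specification can be sketched as follows. The specification format, the `apply_spec` function and the sample records are all hypothetical, and the example is in Python; the post does not describe the scripts RUM actually saves.

```python
import json

# Hypothetical specification: which variables to keep and which rules to apply.
spec = {
    "keep": ["age", "income", "region"],
    "top_code": {"income": 100000},  # assumed cap
    "band": {"age": 10},             # assumed band width
}


def apply_spec(rows, spec):
    """Apply the same keep/top-code/band rules to every row."""
    out = []
    for row in rows:
        new_row = {}
        for var in spec["keep"]:
            value = row[var]
            if var in spec["top_code"]:
                value = min(value, spec["top_code"][var])
            if var in spec["band"]:
                width = spec["band"][var]
                lower = (value // width) * width
                value = f"{lower}-{lower + width - 1}"
            new_row[var] = value
        out.append(new_row)
    return out


# Saving and reloading the specification stands in for writing a script to disk.
reloaded = json.loads(json.dumps(spec))

data_2013 = [{"age": 34, "income": 250000, "region": "North", "name": "A"}]
data_2014 = [{"age": 58, "income": 21000, "region": "South", "name": "B", "extra": 1}]

out_2013 = apply_spec(data_2013, reloaded)
out_2014 = apply_spec(data_2014, reloaded)
print(out_2013)
print(out_2014)
```

Because both years pass through the same specification, the outputs contain the same variables, treated in the same way, even though the 2014 source has an extra column.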

Progress so far?

RUM is based on an existing R package called ‘sdcMicroGUI’. Much of the work involves adapting this package to improve its look and feel and the user experience. In addition, the underlying R code is being examined to ensure that the anonymisation techniques conform to current specifications as used by data providers who supply data to the UK Data Service, e.g. the Government Statistical Service SDC guidelines.

Progress is going well, and our achievements to date include:

the ability to preview changes to a dataset before the anonymisation techniques are applied
inclusion of variable metadata to enable speedy sorting of variables into categorical and continuous buckets
simpler functionality for importing and exporting data

Following a round of User Experience (UX) testing with staff internally at the Archive before Christmas, our partners at the Norwegian Social Science Data Services (NSD) are working on a number of improvements to the overall tool flow. These include:

better variable management
improved SDC functionality
more logical workflow

In addition, we were invited to present our work to date at a workshop on statistical disclosure at the Office for National Statistics (ONS), a key data provider in the UK. We’ve received positive feedback from the ONS on our work and look forward to working with them closely using the tool.

Work to do?

We hope to be able to launch the product by the end of May 2015. The remaining work falls into two camps: Technical and Engagement. Here’s what we’ll be doing in the coming month:

Technical:

complete enhancements to the User Experience
explore and adapt SDC requirements as appropriate, including ‘disclosure risk’ measures

Engagement:

undertake user testing (more UX and also SDC testing) with data providers
ensure rigorous SDC testing to build the anonymisation community’s confidence in the product

We’ll also publish a web page on the UK Data Service website, which will contain updates. This is the first of a series of blog entries that I’ll publish to keep all interested parties up-to-date on our progress. I hope you find our work interesting, and please do contact me if you’d like to support our work or have any queries. Please find out more about all our Economic and Social Research Council (ESRC) Innovation Fund projects here.

Data Impact blog