Louise Corti explores the challenges of ensuring longitudinal survey data is classified appropriately for safe, non-disclosive access.
How does the UK Data Service protect data in its collection?
The UK Data Service uses a three-tier access policy to determine the most appropriate level of access for the data collections it publishes. The three tiers are:
- open
- safeguarded
- controlled
The three tiers combine modes of access and conditions of use.
Safeguarded data are not personal data according to the relevant legislation (e.g. the Data Protection Act 2018 and the Digital Economy Act 2017), but nor are they deemed to fall into the category of having minimal to no disclosure risk; subjects are always de-identified and the data anonymised. Any residual risks are mitigated by robust safeguards placed around access to the data, using the AAA framework (authentication, authorisation and accounting), under which access policies are enabled and enforced and usage is audited.
Signing up to the terms and conditions of the standard End User Licence (EUL) is the first safeguard put in place at the UK Data Service, a basic requirement for registering with the UK Data Service. Safeguarded data may then be directly downloaded by groups of researchers who are eligible.
Controlled data are data that can be defined as personal, but have been de-identified. Access to Controlled, or ‘Secure Access’, data is governed by the 5 Safes framework: researchers must be experienced in statistical analysis, complete an application that must be approved by the data owner, attend a safe researcher training course, and access data via a remote safe haven.
The more we know about a research participant, the greater the risk of re-identification, for example through data being linked with other public data sources. Longitudinal studies, and especially cohort studies, need to be extra careful that the data they release minimise the risk of re-identifying their cohort members.
There is very useful expert guidance available in the community, such as the ICO Code of Conduct on Anonymisation and the ONS Policy for social survey microdata. In addition, survey data providers have their own approach to assessing risk. Data owners have a team of expert data managers and statistical experts in house who assess and classify risk in variables so that data delivered can meet the most appropriate safeguarding requirements.
Classifying a dataset for its risk
For good reason, many data producers and owners depositing data take an initial risk-averse stance, wishing to prevent any re-identification of their respondents.
On the other hand, the desire to make data widely accessible, through meeting their funder’s data sharing expectations or broader open science mandates, means that a balanced approach needs to be taken.
Further, the more one aggregates survey data, the lower its utility for powerful research and for deriving policy-relevant findings.
‘Disclosure review’ is an exercise used to help classify data and decide upon the appropriate ‘granularity’ of specific variables, ensuring that fine-grained, potentially identifying variables are placed under greater restriction. Best practice guidance, such as that above, is used as a first port of call, ensuring that the most risky variables for the UK are checked for their level of granularity (e.g. raw age or banded age groups).
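To illustrate what reducing granularity means in practice, here is a minimal Python sketch that converts raw age into banded age groups. The function name, the ten-year band width and the top-code of 80 are assumptions chosen for illustration; they are not taken from any data owner's policy.

```python
def band_age(age, width=10, top_code=80):
    """Return a banded age label such as '30-39'.

    Ages at or above top_code are grouped into a single open-ended band,
    because very old ages are rare and therefore more identifying.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // width) * width
    # Cap the upper bound so a band never overlaps the top-coded group.
    upper = min(lower + width - 1, top_code - 1)
    return f"{lower}-{upper}"


# Illustrative raw ages only; not real survey data.
raw_ages = [14, 25, 37, 83]
banded = [band_age(a) for a in raw_ages]
print(banded)
```

The banded variable carries less detail than raw age, which is the trade-off between disclosure risk and research utility described above.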
The UK Data Service Curation team provide an additional safety check for incoming data deposits, using in-house processes to check appropriate access levels. The Curation team uses the R package, sdcMicro, to check combinations of potentially identifying variables.
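The idea behind checking combinations of potentially identifying variables can be sketched in a few lines of Python. This is not sdcMicro and does not reproduce the Curation team's process; it is a simple frequency count, of the kind that underlies k-anonymity checks, which flags records whose combination of key variables is shared by fewer than k records. The function name, variable names, threshold and records are all invented for illustration.

```python
from collections import Counter


def risky_records(records, key_vars, k=3):
    """Return records whose combination of key variables occurs fewer than k times."""
    def combo(record):
        return tuple(record[v] for v in key_vars)

    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] < k]


# Made-up example records; not real survey data.
records = [
    {"age_band": "20-29", "region": "North East", "occupation": "Teacher"},
    {"age_band": "20-29", "region": "North East", "occupation": "Teacher"},
    {"age_band": "20-29", "region": "North East", "occupation": "Teacher"},
    {"age_band": "30-39", "region": "London", "occupation": "Surgeon"},
]

flagged = risky_records(records, ["age_band", "region", "occupation"], k=3)
print(len(flagged), "record(s) fall below the threshold")
```

A record that is unique, or nearly unique, on such a combination is a candidate for coarser banding or for a more restrictive access tier.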
Variables with known potential risk, such as raw age, fine-grained industry and occupational codes for employment, and low-level geography, can be made available in a more restrictive access tier; in some cases under controlled access, where a legal gateway is required for access (e.g. GDPR, Education Act, Digital Economy Act).
An example: Next Steps data
Next Steps is a longitudinal cohort study, following a nationally representative group of nearly 16,000 people born in 1989-90 who attended secondary school in England.
The Next Steps study was established and run by the Department for Education until 2013, and was designed to examine key factors affecting educational progress, attainment and transitions following the end of compulsory education. It recruited almost 16,000 children in adolescence, at age 13/14 in Year 9, as its cohort members through schools in England. Information was collected annually through face-to-face surveys for the first four sweeps, and a mixed-mode approach (web, telephone, face-to-face) was used for the subsequent sweeps 5, 6 and 7, providing detailed and repeat measures on education during this crucial educational period. The questionnaires covered attitudes to school, aspirations for future work and study, and transitions to college, university and work. Resident parents were also interviewed for the first four sweeps.
Next Steps is now based at the UCL Centre for Longitudinal Studies (CLS) and has become a multidisciplinary study gathering data on many different aspects of cohort members’ lives. The most recent sweep was when cohort members were age 25. CLS also manages three of the UK’s other major longitudinal cohort studies, and the data from all four CLS studies are available to researchers through the UK Data Service.
These cohorts are incredibly popular data resources and also form part of the Cohort and Longitudinal Studies Enhancement Resources (CLOSER) initiative. CLOSER provides a learning hub for these data and a metadata catalogue.
The questionnaire data has been, and is being, supplemented by a number of linked administrative data sources (where consent has been gained), including DfE data on participation and attainment in school and vocational education from the Individualised Learner Record and the National Pupil Database, as well as NHS records on hospital episode statistics. In addition, CLS is working to link further administrative data to the survey data. This includes information about higher education applications, offers and qualifications, information about earnings, tax and benefits, and information from the Police National Computer about arrests, cautions and sentences.
Reclassifying data from a longitudinal study
CLS use their own data access policy to classify variables from their cohort studies, so that when negotiating a deposit, they already have a very good idea about which variables they would expect to make available under more restrictive conditions.
When CLS took over management of the Next Steps cohort and data, they felt it was a very good opportunity to undertake a review of the Next Steps data content, in line with their data access policy and the data classification principles already used for all of their other cohorts. Some variables additional to those contained in the EUL version were already available to users via an ad hoc request to the DfE.
In the summer of 2020, the CLS Research Data Management team carried out a disclosure assessment of the sensitive Next Steps datasets (sweeps 1-7; ages 14-20), at the time available under the Controlled Access tier, to determine whether they could be accessed under the less restrictive safeguarded tier.
After assessing 1841 variables held under Secure Access to check whether the data might contain information that could re-identify an individual (disclosivity) and how damaging re-identification might be to an individual (sensitivity), CLS moved 1238 survey variables to EUL access. These include information related to ethnicity, family background, employment and qualifications.
The CLS Research Data Management team also carried out a review of individual variables and datasets not previously deposited with UK Data Service, and released new EUL data for research use, including 563 additional variables, Household Grid files from sweep 1 (age 14) to sweep 7 (age 20) and Activity History files from sweep 4 (age 17) to sweep 7 (age 20).
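The two dimensions used in the assessment, disclosivity and sensitivity, can be pictured as a simple decision rule. The Python sketch below is purely hypothetical: the three-level scale, the rule itself and the example variables are assumptions made for illustration and do not represent the CLS data access policy.

```python
# Hypothetical three-level scale for both dimensions.
LEVELS = {"low": 0, "medium": 1, "high": 2}


def suggest_tier(disclosivity, sensitivity):
    """Suggest an access tier from two ratings ('low', 'medium' or 'high').

    Assumed rule: a highly disclosive variable, or a moderately disclosive
    variable that is also highly sensitive, stays under controlled access;
    everything else is suggested for the safeguarded tier.
    """
    d = LEVELS[disclosivity]
    s = LEVELS[sensitivity]
    if d == 2 or (d == 1 and s == 2):
        return "controlled"
    return "safeguarded"


# Invented variables and ratings, for illustration only.
ratings = {
    "banded_age": ("low", "low"),
    "detailed_occupation_code": ("high", "low"),
    "health_condition": ("medium", "high"),
}
for name, (disclosivity, sensitivity) in ratings.items():
    print(name, "->", suggest_tier(disclosivity, sensitivity))
```

In a real review each rating would come from expert judgement rather than a lookup table; the sketch only shows how the two dimensions combine into a single access decision.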
Potential use of new data
The addition of hundreds more variables after all these years represents a major enhancement to the Next Steps study data.
Opening up these variables gives researchers an easier access route to data previously available only under restricted access. The newly released datasets provide opportunities for research into higher education, training, employment and income for the millennial generation and members of their households.
Dr Lisa Calderwood, Director of Next Steps, said:
“We are thrilled to make these variables available to a wider range of researchers, which will open up new opportunities for the study of the millennial generation. The newly released variables and datasets will contribute to a better understanding of their experiences, which will have the potential to benefit policy and practice.”
We acknowledge and appreciate the input that Sarab Rihal and Aida Sanchez from the CLS Research Data Management team gave to the creation of this post.
About the author
Louise Corti is Service Director, Data Publishing and Access Services for the UK Data Service. Louise leads five teams dedicated to bringing in and disseminating data, as well as supporting data creators and users of our secure access data. She has special expertise in research integrity and governance, research data management and the archiving and reuse of data for the social sciences.