Louise Corti, a Service Director at the UK Data Service, and Andrew Engeli, of the Secure Research Service at the Office for National Statistics, recount findings from the #LoveYourCode2020 event, held on 14 February 2020. The UK Data Service and the Office for National Statistics came together to run the event, which focused on research reproducibility for national statistics, as part of Love Data Week 2020.
The Love Your Code event
In the autumn of 2019 Louise and Andrew spent some time mulling over how to distil some of the positive things happening in their areas of work around reproducibility into an event. Planning led to a knowledge exchange day which they hoped would contribute to building common standards for reproducibility with survey and administrative data.
Experts in the field, from data owners, producers, statistical regulators, statistical data users, and peer and data reviewers to data curators and managers of Trusted Research Environments (TREs), came together to discuss and debate a range of issues around research based on statistical data, both published and unpublished.
And what better way to spend Valentine’s Day than with lovers… of code, and the challenge of reproducibility in research!
What is reproducibility?
Before we start, it’s useful to clarify what ‘Reproducibility’ is and how it differs from ‘Replication’.
- A replication study is generally seen as one that is conducted by collecting new data or undertaking new experiments, possibly using other methods.
- Reproducibility – of a published research article – is where the authors have provided all the data, code, and processing instructions necessary to rerun exactly the same analysis and obtain identical results.
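To make the ‘identical results’ idea concrete, here is a minimal sketch in Python. It assumes a made-up analysis (a seeded bootstrap of a mean) and made-up data values; the function names `run_analysis` and `fingerprint` are ours, for illustration only. The point is simply that, with data, code and random seed all fixed, a rerun should give exactly the same output.

```python
import hashlib
import json
import random
import statistics


def run_analysis(values, seed):
    """Hypothetical analysis: the mean plus a seeded bootstrap standard error."""
    rng = random.Random(seed)  # fixed seed so the resampling is repeatable
    boot_means = []
    for _ in range(200):
        sample = [rng.choice(values) for _ in values]
        boot_means.append(statistics.mean(sample))
    return {
        "mean": round(statistics.mean(values), 6),
        "boot_se": round(statistics.stdev(boot_means), 6),
    }


def fingerprint(results):
    """Hash the results in a canonical form so two runs can be compared."""
    canonical = json.dumps(results, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


data = [23.1, 19.4, 30.2, 25.7, 21.9, 28.3]  # made-up illustrative values
first = fingerprint(run_analysis(data, seed=2020))
second = fingerprint(run_analysis(data, seed=2020))
print("identical results:", first == second)
```

Publishing a fingerprint like this alongside the code gives a reader a quick way to check that their rerun matches, although it says nothing about whether the method itself is sound.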
Reproducibility requires a will to work openly and share.
It has many benefits: enabling others to rapidly follow and build on ongoing scientific progress, giving others an opportunity to reuse and adapt programmes and code already painstakingly constructed, promoting scientific integrity, and demonstrating trust, in the sense of trusting others to pick apart one’s own work.
In my view, it’s also a mindset, an act of good will and generosity.
We all know we need to share code…
Sharing code is good research practice and helps actors in the research domain to be reproducible.
Data producers benefit from including syntax in data documentation to demonstrate how derived variables were constructed; data users benefit from publishing code that underpins research findings; and data users can helpfully share back value-added work they have done in the course of their analysis so that new users can benefit.
A number of leading journals have adopted open science policies, following on from the earlier TOP Guidelines. Some journals now require hypotheses to be preregistered and code to be uploaded, and some actually rerun this code to ensure that the results in an article are indeed reproducible.
‘Showing the code’ can help demonstrate trust in published work, but how far should we go to validate code or enforce that it is reproducible?
Can we define a set of baseline best practice recommendations for publishing code in the social and behavioural sciences?
And what about the challenges of reproducing work undertaken in TREs, where the gates to access data can make reproducibility a major challenge?
Stakeholders and their challenges in the landscape
We highlight the actors in the landscape, who all have a part to play in enabling statistical and research reproducibility.
- Official and regulatory bodies can demonstrate trustworthiness in data sources underlying national statistics, and transparency in the statistical system and published national statistics.
- Data owners can make their data reproducible by improving transparency when generating a research-ready dataset, including documenting the full provenance chain (creation, cleaning and versioning).
- Policy makers can strive for transparency and trust, demanding solid evidence in the production of statistics, ensuring public confidence in good policy through the use of data, and encouraging robust policy evaluation tools and measurement of policy and impacts.
- Data publishers can make data FAIR (Findable, Accessible, Interoperable, Reusable), enabling the use of persistent identifiers and capturing versioning, checking data quality and integrity and collating rich description.
- Researchers, peer community and higher educational institutions can better appreciate research integrity and reproducibility and strive to follow best practice on being reproducible. This will necessarily include a willingness to learn new skills, e.g. good coding and the use of code tracking software, plus encouraging capacity building for the researchers of tomorrow.
- Journals can include explicit reproducibility requirements for articles they publish, where the editorial boards recognise and approve this mandate. This can range from a simple data availability statement to submitting code that is run and checked by the journal.
- Funders can raise the bar for research they fund, building in explicit reproducibility requirements to research calls and contracts, supporting high level key messaging on research integrity and supporting methodological and impact work that demonstrates research transparency.
Bringing representatives from these groups together, the meeting discussed the benefits and challenges of making researchers’ work ‘reproducible’. Carrots from good practice or sticks from the journals, and what might be the balance? How far could formal ‘Reproducibility Services’ go towards demonstrating robustness in research findings, and would the time and cost to thoroughly rerun findings allow a sustainable model? If we did move to more of a culture of compliance, how much upskilling would be needed across different socio-economic disciplines? The meeting further explored the responsibilities of data owners and data infrastructures, and what steps they could take to improve reproducibility for their own data publishing.
Making the strategic case for reproducibility
Ed Humpherson, Director General for Regulation at the Office for Statistics Regulation (OSR), gave a keynote in which he stressed why the OSR cared about reproducibility and how the Trustworthiness, Quality and Value (TQV) mantra, the core aims of the Code of Practice for Statistics, actively supported it. His view that statistics and data should have reproducible ‘social lives’ helps us reimagine these objects and products as entities that require history, relationship and context.
While data sharing is fundamental to creating statistical knowledge, we have a responsibility to demonstrate trust in the data lifecycle. Indeed, ‘statistics and data fail to serve the public good’ when they are not accurate, transparent or reproducible.
Ed emphasised that the TQV mantra can be applied to reproducibility through good data governance and reproducible processes to support both quality and accuracy in production.
Paul Jackson, Head of Strategic Research, Impact & Communications at Administrative Data Research UK (ADR-UK), highlighted ADR-UK’s work to support the UK Statistics Authority’s Joining Up Data for Better Statistics mission.
ADR-UK aims to ensure that data needs are recognised and invested in as a collective national asset, so that government priorities for analysis and evidence can be met. He highlighted government data linkage work and projects already underway using the Digital Economy Act (DEA) as the legal gateway for data governance.
Replication matters so that resulting insights are credible, can stand up to scrutiny, and importantly, are trusted.
Reproducibility in practice in Trusted Research Environments (TREs)
Anyone who has tried to link administrative data sources in the UK will know how hard it is; once the governance and secure data environment issues are handled, there follows the challenge of understanding which variables mean what and where they come from: provenance information may be missing, and key unique linkage variables may not be available.
Consequently, there is lots of trial and error in linkage methods to compile a research-ready dataset.
Those who have spent many hours on this activity have come up with some useful guidance. The illuminating poster at the 2019 ADR conference by Ruth Gilbert and her team at the UCL Great Ormond Street Institute of Child Health showed how the team had devoted over three years to linking four one-year birth cohorts of the National Pupil Database and Hospital Episode Statistics to facilitate research into outcomes for children with chronic conditions.
Katie Harron, from the same UCL team, spoke to the room about linkage methods that work well. Her team’s Guidance for Information about Linking Data sets (GUILD) is a useful set of practical protocols to help document linkage methods and reduce error and potential biases. She called for the ‘black box’ of linkage algorithms to be opened, for methods and code to be published, and for QA to be undertaken by others.
The Health Foundation (THF) is perhaps one of the TREs that are leading the way in supporting research reproducibility.
The charity gives grants to carry out, and itself carries out, research and policy analysis in the area of health. Fiona Grimm reported on how their open health analytics approach supported their aim to ensure people’s health and care benefit from analytics and data-driven technologies.
THF run a networked data lab that enables collaborative analysis across the UK, based on an open coding approach. The THF Analytics Lab share code (internally and externally) to enable scrutiny and reproducibility of results, but also for others to learn from their methods. About 10% of this code is shared via their GitHub instance.
Stefan Bender and Paulo Guimarães, who run TREs for their banks, the Deutsche Bundesbank and the Bank of Portugal, spoke about their experiences of running research services for access to sensitive granular data.
For Stefan, in addition to the challenges of demonstrating impact for the research undertaken in his busy service, he was aiming to ensure that data in his TRE can be used more efficiently. This could be by allocating more resource to improve the quality and description of important data, and creating greater opportunities for knowledge sharing. Standard rules for programming and code sharing are part of the solution.
Paulo added that making code available for derived variables in the TRE, for example through R Markdown, would lighten the coding burden for researchers and make it easier to do harmonisation. It may even prevent the sometimes odd things that researchers do to overcome problems in the data. Examples of good code could further be used for training junior researchers, who could then be ‘fast tracked’ for journal data editing work!
Being reproducible in data and statistics publishing
On the other hand, data owners do not have such an established history in publishing underlying code for their derived measures in published and unpublished data.
The ONS have long been regarded as pioneers of high quality survey data collection and preparation of datasets for onward research use.
Technical reports and codebooks that accompany the datasets are thorough, but code that shows how derived variables were created has not generally been part of this package of user documentation. The team behind the Labour Force Survey have produced a detailed report on workflows that shows the origins of standard derived variables. At the time of writing, the underlying code is not yet available.
The analytic community encourages data owners to provide metadata based on recognised standards, publish data with a (versioned) persistent identifier and publish code. In the meeting we heard from a small number of proactive data owners, who had agreed to attend the meeting to talk openly about the challenges of code sharing.
NatCen is Britain’s largest independent social research agency, and collects data for many of the long running UK national surveys, including the Health Survey for England and the British Social Attitudes Survey.
Jess Bailey explained how the data managers at NatCen, a highly professional and well-regarded survey agency, work to provide hundreds of derived variables for each survey release (anything from 250 to 800 per survey). NatCen data managers follow agreed in-house protocols: comprehensive code and second checking, consistent variable naming, and a short human-readable description of what each derived variable aims to do. There is a desire to publish supporting code.
The challenges of creating and documenting derived variables and harmonised datasets in longitudinal studies are exacerbated by historical data collection and management practices.
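As an illustration of what such a protocol might look like in code, here is a small Python sketch. It is not NatCen’s code: the variable names (`height_cm`, `weight_kg`, `dv_bmi`), the `DerivedVariable` structure and the BMI example are all assumptions of ours. It shows a consistently named derived variable that carries a short description and its source variables, with an independently written second implementation used as a check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class DerivedVariable:
    name: str                      # consistent naming, e.g. a "dv_" prefix
    description: str               # short human-readable description
    sources: Tuple[str, ...]       # variables the derivation depends on
    derive: Callable[[Dict], Optional[float]]


def bmi(record):
    """Body mass index in kg/m^2; None when an input is missing or zero."""
    height_cm = record.get("height_cm")
    weight_kg = record.get("weight_kg")
    if not height_cm or not weight_kg:
        return None
    return round(weight_kg / (height_cm / 100) ** 2, 1)


def bmi_second_check(record):
    """Independently written version of the same derivation, used for checking."""
    if not record.get("height_cm") or not record.get("weight_kg"):
        return None
    return round(record["weight_kg"] * 10000 / record["height_cm"] ** 2, 1)


DV_BMI = DerivedVariable(
    name="dv_bmi",
    description="Body mass index (kg/m^2), rounded to one decimal place",
    sources=("height_cm", "weight_kg"),
    derive=bmi,
)


def apply_derivations(record, variables):
    """Return a copy of the record with each derived variable added."""
    out = dict(record)
    for var in variables:
        out[var.name] = var.derive(record)
    return out


def codebook(variables):
    """One documentation line per derived variable, ready for a user guide."""
    return [f"{v.name}: {v.description} (from {', '.join(v.sources)})"
            for v in variables]


respondent = {"height_cm": 170, "weight_kg": 65}  # made-up example record
derived = apply_derivations(respondent, [DV_BMI])
assert derived["dv_bmi"] == bmi_second_check(respondent)  # second check agrees
print(derived)
print(codebook([DV_BMI]))
```

Keeping the description and source list next to the derivation means the documentation and the code are less likely to drift apart, which is one reason publishing the code itself can be valuable to users.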
CLS and CLOSER
Aida Sanchez, Senior Data Manager at the Centre for Longitudinal Studies (CLS) and Dara O’Neill, Data Harmonisation Lead at Cohort and Longitudinal Studies Enhancement Resources (CLOSER), spoke about the way that the UCL centres publish data and metadata from the UK’s well-regarded birth cohort studies, including the 1958 National Child Development Study, the 1970 British Cohort Study, the 1990 Next Steps study, and the 2001 Millennium Cohort Study.
The CLS data managers follow a protocol for creating detailed logic and algorithms for their syntax that includes exact variable names, values and so on. For measures over time, like histories, these can get quite complicated (loops, arrays, macros, etc) and consequently, it is often very technical work. User Guides that show how derived variables were created are published as user documentation.
While the underlying code is not yet available, CLS plans to release code via a CLS GitHub. While CLS see value in researchers making available their own derived code, though only where it is reproducible and well-commented, they would not have the resources to vet each piece of code, and thus would not put their name to it. This raises important questions, such as:
- Might having multiple iterations of code published in different places have a knock-on effect on the reputation of their data?
- Are new users happy to accept disclaimers for user-derived code?
Dara reported on post-hoc harmonisation work being done as part of CLOSER across the UK cohort studies in the following areas:
- socio-economic status
- earnings and income
- household overcrowding
- childhood environment and adult mental wellbeing
- cognitive measures
- mental health measures
- mental health and wellbeing measure usage
- strategies for analysing biological samples
- cross-study biomarker data availability
- visual functioning
- methods for determining pubertal status
- body size and composition
- physical activity measures
- dietary data
The effort to identify, evaluate, derive and validate measures is significant. Value-added outputs comprise resource comparability reports and releases of harmonised data that include detailed documentation and the derivation code. Dara highlighted the systematic approach to harmonising, using agreed code style, standardised documentation and metadata templates and linking code directly to the metadata available in CLOSER Discovery.
Office for National Statistics (ONS)
Richard Heys, Deputy Director & Deputy Chief Economist at the ONS, supported the notion that the statistics production chain would certainly benefit from kitemarking of underlying (survey and administrative) data for validity. For him as an analyst, the integrity of derived variables was paramount and something to aim for. There are emerging standards and frameworks across government that will help to support trust in data and statistics.
The Statistical Quality Improvement Strategy sets out an action plan for improving quality across the ONS in the production of statistics and analysis, including management of the processes surrounding their production. High quality data and its robust management are noted as being a fundamental pillar, underpinned by the ONS’ Data Strategy. Objectives also, importantly, include improvements to and use of standards for metadata, cross government data architecture and training and capacity building.
Good metadata should support data lineage practices, documenting the data journey from collection to publication of national statistics, or from ‘field to fork’. Hopefully, within the lifecycle of metadata capture, derived variables from the social surveys will be included.
When it comes to administrative data, where being explicit about the full provenance chain can be so difficult, the ONS has already established Quality Assurance Toolkits, including one aimed at Administrative Data (QAAD), and one under development for its surveys. Reports on various administrative data, including cancer registrations/survival, education, trade, agriculture, fishing and construction, are already available and are a good read.
Alex Newson, a data scientist from the Best Practice and Impact (BPI) team at ONS explained how expertise and the emerging Data Access Platform were being used to pilot Reproducible Analytical Pipeline (RAP) techniques in the production of statistics.
The RAP ambition would help address the question of validity across the Government Statistical Service.
The ONS was moving in the direction of open source programming, with consultations currently underway by the BPI on making it easier for analysts to access tools such as R, Python and GitLab/GitHub. Alex alerted the room to the concern that, unless these data science tools and approaches become business as usual, there may be a drain of skilled staff from the civil service into the commercial sector (also seen in the academic sector).
Office for Statistics Regulation (OSR)
Catherine Bromley of the Office for Statistics Regulation (OSR) introduced the ‘Quality Assurance of Administrative Data’ work.
The OSR is an independent agency that helps shape good research practice through the Code of Practice for Statistics. Through setting standards, assessing compliance and challenging (e.g. politicians’ claims), trustworthiness, quality and value (TQV) in our national statistics are demonstrated.
A number of excellent regulatory guidance documents have been created by the OSR, such as ‘Building Confidence in the handling and use of data’, and the OSR are busy partnering with other agenda setters to support the Code of Practice. Catherine provided a recent example of this, where the OSR and RSS had established the ‘Voluntary Application Awards’ for recognising commitment to TQV.
Champions and capacity building to support reproducibility
There are many encouraging strands of work happening that are contributing to the capacity building angle. Researchers themselves can be taught to be reproducibly diligent, and research institutions have a role to play in championing research integrity, and going on to require and implement underpinning standards.
The work of the UK Research Integrity Office (UKRIO) and the Royal Society on creating practical resources on research integrity should be lauded. Stemming from this, we have seen some excellent training modules created, for example by Epigeum.
Whether it is journals rising to the call, or institutions offering collective opportunities, there are many places to go to support oneself as a trustworthy scientist.
Just to highlight an appealing example, ReScience C, itself housed on GitHub, is an open-access peer-reviewed journal that targets computational research and encourages replication of already published research, promoting new open-source documented implementations in order to ensure that the original research can be reproduced. In true GitHub collaborative style, all published code can be forked and modified. In 2019-20, this journal ran the Ten Years Reproducibility Challenge that invited researchers, across all disciplines, to try to run the code they produced for a scientific publication published before 2010. It was auspiciously advertised as, ‘Would you dare to run the code from your past self?’
Cathie Marsh Institute, University of Manchester
Professor Debbie Price, of the Cathie Marsh Institute at the University of Manchester, spoke to us about her view from that very side. Prior to the workshop she had conducted her own small survey of colleagues. She revealed that many had taught themselves programming languages and felt that they should be working towards using modern open source tools like Python.
A few quotes from Debbie’s presentation rang true:
I keep meaning to do some training in Python, but I don’t get time and I’m not sure I would use it. I suppose we get stuck in our software paradigms (because we get stuck in specific data paradigms).
I know that I definitely do some things the long way but because I know it works, I continue to use the long way.
…sometimes quick and dirty code has to be enough…
She suggested that social researchers should think about making elegant and efficient code that can be easily understood by others, for example, using annotation. This may take additional time, and this needs to be better recognised by peers, managers and publishers. She cautioned that the ‘best code’ may look different for researchers compared to programmers; analysts should not be afraid to take the long-winded way, as this may make it clearer for others to follow the steps.
Alan Turing Institute and the UK Reproducibility Network
In helping to shed light on how to be reproducible, Kirstie Whitaker of the Alan Turing Institute entranced the room with her dynamic run through of the pioneering work of the Turing Way.
Branded as ‘A handbook for reproducible data science’, the resource includes a book, a community, and a global collaboration. In striving to ‘nudge data science to being more efficient, effective and understandable’, its main goal is to provide the information that researchers need at the start of their research projects so that their work can be reproduced at the end. It prompts researchers to plan their analysis better and to think carefully about what their code is doing. GitHub can help support code sharing and build collaboration.
Kirstie also introduced the UK Reproducibility Network (UKRN), a national peer-led consortium that seeks to understand the factors that contribute to poor research reproducibility and replicability. It works collaboratively to disseminate best practice and promote training activities.
A great example of an Early Career Researcher-led initiative is the ‘ReproducibiliTea’ Journal Club, which invites staff to get together to read papers and discuss issues on reproducibility.
The Turing Institute also supports a number of ‘Turing Reproducible Research Champions’, who propose an example of their work that they want to make reproducible and go on to share their experiences.
Aiming to get the message over early, Chris introduced his 1st year syllabus for his ‘Introduction to Social Data’ and 2nd/3rd year ‘Data Analysis in Social Science’ modules. Students learned how to interweave code and outputs using R Markdown, and the Q-Step Centre used GitHub Classroom innovatively to share and mark assignments.
Chris emphasised how critical it is to have really good teaching assistants to support this sometimes challenging learning activity.
The Health Foundation also support the learning and application of R in the NHS, running workshops, tutorials and a platform for discussion and sharing of best practice solutions to NHS problems. And at the ONS, the Government Data Quality Hub have rolled out training to ONS staff on the Quality Assurance toolkits.
Raising the bar: Reproducibility services
Our post-lunch session examined what some journals were doing as early adopters of ‘high bar’ reproducibility requirements for authors’ articles to be accepted.
Lars Vilhuber, of Cornell University and Data Editor of the American Economic Association.
The AEA have been early adopters of data and code availability to support research publications, appointing a dedicated Data Editor in 2017. The AEA journals require well-documented analysis and code, and details of the computations that are sufficient to allow replication; these should be openly available. The AEA uses pre-publication verification, where certain quality standards around citation, metadata, the Readme, code and its reproducibility must be met by authors. These are deposited in the AEA Data and Code Repository.
The process is necessarily labour-intensive, as each article undergoes two rounds of assessment. Lars encourages students to help with this process, and his team, currently comprising around 14 undergraduates and a graduate assistant, meets a turnaround goal of two weeks per article. From the training and experience, it is clear that these students learn to be better FAIR scientists.
Joan Llull, Associate Professor of Economics at the Universitat Autònoma de Barcelona
Joan is also a data editor, for the Economic Journal. He echoed Lars’s view that the code checking for his journal was demanding, incredibly resource-intensive work that was hard to make cost effective.
We heard about a novel concept that reaches beyond the work of the journals: a formal reproducibility service, cascad, that uses certification as an incentive for demonstrating reproducibility.
Christophe Perignon of the international business school, HEC Paris
Christophe introduced us to the cascad process, established recently in France. A cascad certification request requires three elements:
1. A pdf of the research article, making sure the tables/figures to certify are included.
2. The primary data and documented code used to produce the tables/figures as a single compressed archive file.
3. A readme file that lists the files with a brief description, the list of software packages and functions and routines required to run the codes, and the list of the variables used.
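Before sending off a request like this, a researcher might want to check that nothing is missing. The Python sketch below is our own illustration and not part of cascad’s tooling: the function name, the file names and the list of archive suffixes are all assumptions. It only looks at file names, so it cannot tell whether the readme actually lists the software and variables required.

```python
# Archive types we assume count as a "single compressed archive file";
# this list is illustrative, not taken from cascad's documentation.
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tgz", ".7z")


def missing_elements(filenames):
    """Return which of the three requested elements appear to be absent."""
    names = [name.lower() for name in filenames]
    basenames = [name.rsplit("/", 1)[-1] for name in names]
    checks = {
        "article PDF": any(b.endswith(".pdf") for b in basenames),
        "data and code archive": any(b.endswith(ARCHIVE_SUFFIXES) for b in basenames),
        "readme file": any(b.startswith("readme") for b in basenames),
    }
    return [label for label, present in checks.items() if not present]


# Hypothetical submission folders
complete = ["paper.pdf", "replication.zip", "README.txt"]
incomplete = ["paper.pdf"]
print(missing_elements(complete))
print(missing_elements(incomplete))
```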
Various ratings of certification can be gained with RRR being the highest rating. A team at cascad rerun the code using appropriate software. Currently SAS, Matlab, Mathematica, R, Python, Stata, Fortran, C++ are supported.
When it comes to undertaking research on confidential data within restricted data environments, checking results becomes much more complicated. Indeed, Lars had told us that some 40% of economics papers use restricted data, and that these are exempt from standard journal reproducibility checks.
Access to data within a TRE, governed by the Five Safes framework, sets a high bar for entry. Reproducing published results can only be done if the reproducer is authorised and authenticated to enter the TRE. Cascad have set up a process for certifying reproducibility with confidential data, together with the Centre d’Accès Sécurisé aux Données (CASD).
Code with your tea?
After lunch, the attendees grouped for facilitated discussions, exploring different dimensions of ‘Reproducibility for UK national statistics’:
- Quality review and certification for reproducibility: good code, bad method?
- Automating processes for reproducibility: benefits and issues around code tracking tools e.g. GitLab, R Markdown and Jupyter Notebooks
- Assessing disclosure risk in code: rules, tools and skills. What about automated output checking?
Image: Louise and Andrew at the #LoveYourData2020 event (from @ONS Twitter feed)
Recommendations and next steps
The workshop sessions helped to surface some baseline best practice recommendations, as well as some training and capacity building needs for documenting reproducibility. Sticks and carrots are needed, from policy and regulation, community action and celebrating and badging best practice, to creative pedagogical approaches for capacity building.
A number of concrete ideas came out of the day:
- Ask data producers, such as the ONS and NatCen, to further develop policies on publishing their code for derived variables in their published survey data, making findings based on their data easier to reproduce.
- Encourage data owners and publishers to work towards FAIR data publishing, using persistent identifiers for their data assets that also identify versions. [ONS are currently in the consultation stage around the use of DOIs for ONS assets.]
- Researchers need to be more open, or at least have trust in themselves to share their code for data preparation, for adding value to data, and the code to support their analysis. This could be done inside or outside TREs, depending on the sensitivity of the data. Code publishing platforms like GitHub, or GitLab (for safe environments), can be used, or code for derived variables could be deposited as a dataset with a DOI, linked to the ‘mother’ data.
- Encourage TREs to provide open source tools for tracking code, like Jupyter Notebook or R Markdown, and encourage researchers to use them routinely so that statistical functions can be rerun and reproduced. Docker containers can be used to help speed up software management.
- Support capacity building activities that introduce the art of writing great code. This can build on the great work by:
  - Q-Step Centres;
  - the UK Reproducibility Network;
  - the Alan Turing Institute’s ‘Turing Way’ (A handbook for reproducible data science);
  - the Safe Data Access Professional (SDAP) Network who gather together staff working in RDCs across the UK to share experiences and publish guidance, such as the Handbook on Statistical Disclosure Control for Outputs.
- Continue to pursue mechanisms and tools for assuring the quality of administrative data, building on the great work done so far by the UK Statistics Authority on Quality Assurance Toolkits (e.g. the one aimed at Administrative Data, QAAD), by the OSR, by the cross-government Government Statistical Service Best Practice and Impact work, and by the network of Reproducible Analytical Pipelines (RAP) Champions.
- Run a pilot to call for early adopters of reproducibility in TREs, following on from the initial work started by the UK Data Service.
- Explore paid-for Reproducibility Services, such as the cascad model, where researchers submit their own work and code for detailed checking by a third party, aiming to gain certification of quality.
The presentations from the day are available on the UK Data Service website.
Louise Corti was Service Director, Data Publishing and Access Services for the UK Data Service and will be joining the ONS’ Secure Research Service in 2021.
We wish her well in her new role.