Jools Kasmire (left) and Cristina Magder (right) introduce synthetic data, how it’s used and what fidelity means, before introducing us to a project exploring the benefits of low-fidelity synthetic data.
Understanding synthetic data
What is it?
Put simply, synthetic data is any data that is generated rather than observed.
The output of a random number generator, your own imagination, or a complex, custom-built machine learning model are all examples of synthetic data, whereas survey results, sensor output, and audio or video recordings are all examples of observed data.
Synthetic data can be generated to mimic or act in place of a real, observed data set, with the similarity between the two known as fidelity. High-fidelity synthetic data is made to look and act more like the real data while low-fidelity synthetic data is less similar or more random.
Importantly, synthetic data is never totally realistic – it preserves some aspects of the real data with higher fidelity and others with lower fidelity.
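This trade-off is easy to see in a toy example. The sketch below (invented variables, not from any real dataset) synthesises two correlated columns by sampling each one independently from its own marginal distribution: the averages and spreads are preserved with high fidelity, but the correlation between the columns is deliberately lost.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "real" data: two correlated variables (say, age and income).
n = 10_000
age = rng.normal(45, 12, n)
income = 800 * age + rng.normal(0, 5_000, n)

# Low-fidelity synthesis: draw each column independently from its own
# marginal distribution. Marginals are preserved (higher fidelity), but
# the relationship between the columns is destroyed (lower fidelity).
syn_age = rng.normal(age.mean(), age.std(), n)
syn_income = rng.normal(income.mean(), income.std(), n)

print("real correlation:", round(np.corrcoef(age, income)[0, 1], 2))
print("synthetic correlation:", round(np.corrcoef(syn_age, syn_income)[0, 1], 2))
```

The synthetic columns look plausible one at a time, yet any analysis relying on the age–income relationship would fail on them, which is exactly the point: you choose which aspects of the real data are worth preserving.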
Why use synthetic data?
Synthetic data is used for many reasons, including to save time or money, to act in place of data that is impossible or difficult to acquire, to test a vast range of theoretical values, to protect privacy and avoid risks, or to adhere to data storage regulations when collaborating with diverse research partners.
These different uses require different features of synthetic data, with some requiring large volumes of data while others need only very small synthetic datasets.
What is synthetic data used for?
The different uses for synthetic data entail different requirements. If only small amounts of data are needed, then manual generation is an option. If high-fidelity data is needed, then sophisticated machine learning models are probably best while low-fidelity data can be made with simple generators or off-the-shelf models.
Creating synthetic data
Some methods of synthetic data generation are MUCH faster, cheaper and easier than others and these are typically used for creating low-fidelity data.
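As a minimal sketch of what a fast, cheap generator can look like, the snippet below produces schema-valid records with no statistical link to any real data. The column names and value ranges are invented for illustration; output like this is useful for testing code or preparing teaching materials, nothing more.

```python
import random

# Hypothetical schema: column name -> a function that draws one plausible value.
# Nothing here is derived from real records, so fidelity is deliberately low.
SCHEMA = {
    "participant_id": lambda i: f"P{i:05d}",
    "age": lambda i: random.randint(18, 90),
    "region": lambda i: random.choice(["North", "South", "East", "West"]),
    "score": lambda i: round(random.uniform(0.0, 100.0), 1),
}

def generate(n_rows: int) -> list[dict]:
    """Generate n_rows of schema-valid but statistically meaningless records."""
    return [{col: draw(i) for col, draw in SCHEMA.items()} for i in range(n_rows)]

for row in generate(3):
    print(row)
```

A generator like this can be written and checked in minutes, which is why low-fidelity approaches are so much cheaper than training a model to reproduce a real dataset’s statistical structure.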
At the same time, most synthetic data has limited value outside its original intended use, because the decisions about which features must be high-fidelity (and exactly how those features were generated and checked) make it very specific to the original research question.
Thus, a given high-fidelity data set is typically only useful for its original high-fidelity application or for applications that require only low-fidelity data.
Essentially, this means that general purpose, high-fidelity synthetic data sets are NOT A THING. Once we accept this, we can start focussing on the myriad and generalised uses for low-fidelity synthetic data.
The project: a cost-benefit analysis of low-fidelity synthetic data for data owners’ Trusted Research Environments
Synthetic data offers considerable potential across the research and data science landscape. But how does that potential play out in the real world, particularly when we’re talking about low-fidelity synthetic data? That’s precisely what our project set out to investigate.
Funded by Administrative Data Research UK (ADR UK) and undertaken by a team from the UK Data Service, the project explored when, why, and how low-fidelity synthetic data might be used by organisations responsible for managing sensitive data.
This project focused on the practicalities: What does it take to produce synthetic data? What does it cost? And what are the actual benefits for those who hold and share data?
To answer these questions, the project employed a mixed methods approach and combined several strands of evidence:
- A literature review to map the current state of knowledge.
- A survey of data owners to capture current practice and perceptions.
- Case studies with data producers already creating synthetic data.
- A focus group with Trusted Research Environment (TRE) professionals to understand operational implications.
Together, these approaches offered a rounded picture of the current landscape and the barriers to wider adoption.
The people, skills and expertise behind the project
This project would not have been possible without the dedication, insights, and collaboration of an experienced and diverse team.
Led by Cristina Magder, the core team included Dr Maureen Haaker, Dr Jools Kasmire, Dr Hina Zahid as project co-leads, and Melissa Ogwayo as researcher. Jools, our synthetic data expert, provided crucial technical insight. Hina led the literature review, mapping the synthetic data landscape. Maureen expertly guided the qualitative research, ensuring the voices of data owners and TRE professionals were heard. And Melissa contributed across both qualitative and quantitative research, helping turn findings into actionable recommendations.
Equally vital were the data owners and TRE professionals who generously gave their time and expertise. Their openness shaped our understanding and the practical guidance we offer.
The team also collaborated closely with Robert Trubey and Dr Fiona Lugg-Widger, colleagues from a parallel ADR UK project, DELIMIT, which explored public perceptions of synthetic data and developed complementary recommendations for data owners.
The project also benefited from the expert leadership and ongoing support of Emily Oliver, who leads the ADR UK Synthetic Data Working Group, helping to shape the project’s strategic direction and maximise its impact.
What we learned
Synthetic data, especially when generated quickly and cheaply at low fidelity, can offer real benefits to data owners and researchers alike.
It can reduce waiting times, lower pressure on secure computing systems, and allow researchers to familiarise themselves with datasets before ever touching sensitive information. In some cases, it can even be used to pilot research methods or prepare teaching materials.
But despite the promise, several recurring challenges are slowing progress:
- Uncertainty about the legal status of synthetic data and how it should be governed.
- A lack of dedicated skills and resources in organisations to support synthetic data production.
- Inconsistent approaches to creating, documenting, and sharing synthetic datasets.
- Ongoing confusion or mistrust about what synthetic data can realistically be used for.
So, how do we move from promising concept to practical reality?
Four key recommendations
Based on our findings, the project put forward four core recommendations, each targeted to a key stakeholder group. These are designed to be phased, starting with the most urgent and foundational, and building up to broader awareness and capacity-building over time.
1. For policymakers: clarify governance and legal frameworks
First and foremost, there is a pressing need for clarity. Policymakers and regulators should work with data owners, legal experts, and statisticians to develop clear and practical guidance around the legal and governance aspects of synthetic data.
This includes questions around intellectual property, licensing, and whether different levels of fidelity require different legal considerations.
At present, the lack of clarity is a major source of hesitation. By addressing this first, we create a foundation of confidence that allows everything else to follow.
2. For funders: invest in skills and infrastructure
Producing useful synthetic data, especially in a sustainable and scalable way, requires people with the right expertise, and systems that can support them. That means funding for training, for specialist roles, and for infrastructure such as secure computing environments or software licences.
This is particularly important for smaller organisations, which may not have the capacity to experiment on their own. Targeted funding for pilot projects could unlock real value here.
3. For data owners and TREs: establish standards and sharing models
Synthetic data is often created in isolation, with little consistency in how it’s labelled, documented, or tested. This undermines both confidence and usability.
We recommend developing shared quality standards, such as checklists for fidelity and utility, and adopting clear documentation practices.
Datasets should be transparently labelled as synthetic, with clear explanations of how they were produced and what they’re suitable for. They should always be accompanied by useful documentation and, wherever possible, shared in consistent ways.
Greater consistency will make synthetic data more useful, more interoperable, and more trusted.
4. For the research community: improve awareness and public engagement
Finally, we must tackle the misconceptions. Many researchers still don’t fully understand what synthetic data is, or when and how to use it. Members of the public may also have concerns that synthetic data is being used as a substitute for real data, or that it introduces risks.
That’s why we’re calling on the research community to deliver more training and public engagement. Researchers need clear, practical education on using synthetic data responsibly. At the same time, outreach initiatives can help the public understand what synthetic data is, and what it isn’t.
Next steps: from insight to action
All project outputs, including the final report and practical guidance, are available via the project webpage which will continue to be updated as needed to reflect new developments.
Work continues with the ADR UK Synthetic Data Working Group, which is now focusing on developing a set of guiding principles for synthetic data policies.
Additionally, a new community group, DARE UK, has been established to bring together stakeholders across the synthetic data landscape, fostering collaboration and knowledge sharing. You can find out more about this community and get involved at www.syntheticdata.uk.
The project team is also actively engaging with international collaborators, such as those at FAIR AI Data, to ensure UK synthetic data practices are aligned with global standards and benefit from shared expertise.
These ongoing efforts aim to build on the project’s findings, supporting a responsible and scalable future for synthetic data across the UK research ecosystem.
Synthetic data won’t replace real data, but it can absolutely complement it. By investing in its responsible use, we can streamline data access, improve efficiency, and ultimately support better, faster, and safer research across the UK.
About the authors
Cristina leads two teams at the UK Data Service responsible for data collections development and research data management support.
She and her teams focus on making data FAIRer, from securing high-quality, population-representative data collections for the UK Data Archive to supporting researchers through the ReShare repository and data management training and resources.
A strong advocate for synthetic data and dedicated to collaborative engagement, Cristina recently led the ADR UK-funded project exploring the practical challenges and opportunities for using low-fidelity synthetic data, focusing on what’s needed to enable its wider adoption.
Jools Kasmire researches and teaches how social scientists can use new forms of data, working with the UK Data Service and the Cathie Marsh Institute at the University of Manchester.
They approach this task as an interesting combination of thinking like a computer (essential for data sciences) and thinking like a human (essential for social sciences) in the context of complex adaptive systems. They are deeply committed to equality, diversity and inclusivity and sometimes dabble with stand-up comedy as a form of science communication.
Comment or question about this blog post?
Please email us!