Top tips for working with sensitive text data

Diego Arenas

Dr Diego Arenas from DataKind UK shares his top tips for working with sensitive text data.


DataKind UK has been supporting charities, local government, and social enterprises to use data and data science since 2013. Working with skilled volunteers, we help charities to discover more about their service users, improve their processes, and get closer to reaching their mission with the power of their own data. 

As support services moved online during the pandemic, and digital has become the default mode of communication, more charities and public services than ever before find themselves collecting records of conversations from webchats or phone calls. As well as containing important information about beneficiaries or service users, this ‘text data’ holds vital insights about their needs and about your organisation’s impact. In recent years, many organisations have begun to analyse this text and gain in-depth insights about their work.

But this kind of analysis can be challenging for many reasons. Records of conversations are likely to contain sensitive, private, or upsetting material. Qualitative data like text can have a lot of Personally Identifiable Information (PII), which needs to be approached with extreme caution. This is particularly important if the people the data is about or from are vulnerable individuals, including children.

How can you tackle these challenges to better support your service users? This blog looks at the risks of using sensitive data for analysis, and gives our recommendations on how they can be overcome.

Getting started

Consider whether using data that might identify your users is necessary to achieve your purpose. Do not include any sensitive data unless you genuinely need it to improve your service and the lives of the community your organisation serves. When processing any information, you must always have a lawful basis for using it – the Information Commissioner’s Office (ICO) provides excellent guidance.

Ensuring anonymity

A crucial step in using sensitive data is thoroughly removing identifying information. In most cases, this doesn’t reduce the value or scope of the analysis – and it is often a legal, and always an ethical, imperative. Here are our top tips, followed by a short code sketch illustrating some of them:

  • Mask names by replacing them with aliases, or by simply indicating the participant’s role. For example, if you have professionals giving advice, you can replace their names with ‘Professional’ or ‘Agent’ so that it is clear the information comes from them, but they are not identifiable
  • Remove addresses, email addresses, phone numbers, and any PII that could be used to single out an individual
  • Replace addresses with a wider geographic area, so it is possible to summarise statistics without identifying individuals. Ensure this area is large enough that someone’s identity cannot be inferred from wider context
  • As above, consider whether there is other personal data that, when combined, could lead to identification of an individual – for example their age, ethnicity, and school combined
  • As well as thinking about the text data itself, the ‘metadata’ – information that describes the data, like user IDs, timestamps, etc – needs careful handling, as it may allow individuals to be identified
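To make the masking steps above concrete, here is a minimal Python sketch. It is illustrative only: the alias mapping, the regular expressions, and the postcode handling are simplified assumptions, and real records will need patterns tuned (and tested) against your own data.

```python
import re

# Known participant names mapped to role-based aliases.
# (Illustrative values only; in practice this would be built
# from your own case records.)
ALIASES = {
    "Jane Smith": "Professional",
    "John Doe": "Service User",
}

# Simplified PII patterns; real data needs broader, well-tested rules.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b(?:\+44\s?|0)\d(?:[\s-]?\d){8,9}\b")  # UK-style numbers
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*\d[A-Z]{2}\b")

def mask_text(text: str) -> str:
    """Replace names with role aliases and redact common PII patterns."""
    for name, alias in ALIASES.items():
        text = text.replace(name, alias)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Keep only the outward part of a postcode, widening the geography
    # so you can still summarise statistics by area.
    text = POSTCODE_RE.sub(lambda m: m.group(1), text)
    return text

print(mask_text("Jane Smith (jane@example.org, 07700 900123) lives at SW1A 1AA."))
# -> Professional ([EMAIL], [PHONE]) lives at SW1A.
```

Remember that the same treatment applies to the metadata fields held alongside the text (user IDs, timestamps), and that automated masking should always be followed by manual checks, as discussed below.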

Set expectations about how data will be used

Think about how beneficiaries or service users (“data subjects”) may feel about how their data is being used. Many people may be surprised to know their data has been analysed, even if they checked a box to give full consent during collection. Ensure you have a clear Privacy Policy that aligns with your organisation’s values. Have a process of informed consent, which helps them clearly understand what you intend to do with their data and why, and lets them choose to opt in or out without losing access to your services. Emphasise the steps you’ll take to protect their privacy, and the end purpose of this work. If you’re not sure what your clients would be happy with, ask them.

Run a Data Protection Impact Assessment

Running a Data Protection Impact Assessment will include creating a ‘Risk Register’ to identify the specific risks within your data (many of them the examples in this blog!). List the actions you’ve taken to mitigate these risks, and what the likelihood and severity of any harm might be. Then zoom out and look at the overall impact once your actions have been put in place.
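There is no single prescribed format for a Risk Register. As an illustrative sketch only (the fields and ratings below are assumptions, not an official DPIA template), each entry might pair a risk with its mitigation and a rough likelihood/severity rating:

```python
# Illustrative Risk Register entries; fields and ratings are assumptions,
# not an official DPIA format.
risk_register = [
    {
        "risk": "A phone number survives automated masking",
        "mitigation": "Regex redaction plus manual spot checks of a sample",
        "likelihood": "low",
        "severity": "high",
    },
    {
        "risk": "Individual identifiable from combined age, ethnicity, and school",
        "mitigation": "Generalise or drop these quasi-identifiers before analysis",
        "likelihood": "medium",
        "severity": "high",
    },
]

# Review the register as a whole once mitigations are in place.
for entry in risk_register:
    print(f"{entry['risk']} -> {entry['mitigation']} "
          f"(likelihood {entry['likelihood']}, severity {entry['severity']})")
```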

Make this a transparent, meaningful part of your process – not a tick box exercise. Data analysts working with sensitive data need to be aware that behind each data point, there is or was a life. This will help them consider the consequences of their analysis. How would you feel if the data you are analysing were about you? How much care would you put into the analysis and accuracy of the results?

Give yourself time!

This all takes time – build it in from the start so you know you can do it right. All the points above need tackling thoughtfully, not in a rush! In particular, build in plenty of time for cleaning and anonymising the data, as this process can take longer than you expect. You can automate your anonymisation process – but you should manually check the outputs to ensure the automation is doing what you want it to.
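One lightweight way to do that manual check, sketched below under the assumption that your masked records are plain-text strings, is to pull a reproducible random sample for human review and flag any records where obvious PII patterns survive. (The `RESIDUAL_PII` pattern is a simplified assumption, as in the earlier sketch.)

```python
import random
import re

# Patterns that should never appear in properly masked text (simplified).
RESIDUAL_PII = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.-]+"               # email addresses
    r"|\b(?:\+44\s?|0)\d(?:[\s-]?\d){8,9}\b"  # UK-style phone numbers
)

def spot_check(masked_records, sample_size=20, seed=0):
    """Return a random sample of masked records for manual review,
    flagging any that still match obvious PII patterns."""
    rng = random.Random(seed)  # fixed seed keeps the review sample reproducible
    sample = rng.sample(masked_records, min(sample_size, len(masked_records)))
    for record in sample:
        if RESIDUAL_PII.search(record):
            print("FLAGGED for manual review:", record)
    return sample
```

A fixed random seed means the same sample can be re-reviewed after the masking rules are improved.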

Consider your analysts’ wellbeing

Sensitive data can affect not only your users, but also those performing the analysis. Always provide a warning about the content of the documents they are about to see and give them the option of not receiving the most sensitive data. Consider sharing a sample of the data with them so that they can get an idea of what they will encounter.

Safeguarding in Action: Action for Children

Analysing webchat data from their Parent Talk service was a key way for family support organisation Action for Children to understand their users’ needs. DataKind UK helped them to ensure the data was fully anonymised and safe for volunteer data analysts to look at. Together, they drew up a full Risk Register, and Action for Children ran a Data Protection Impact Assessment. They also placed the utmost importance on ensuring no personal information was seen by anyone outside their service team.

The outcomes from their analysis made their preparation worthwhile. Assessing their users’ conversations for common keywords showed them which areas of their service needed more resources, and how they might make these resources more intuitive to find on their website.
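The blog doesn’t describe Action for Children’s exact pipeline, but a minimal version of this kind of keyword analysis might look like the sketch below, assuming anonymised transcripts as plain-text strings and an illustrative keyword list:

```python
from collections import Counter

# Illustrative topic keywords; a real analysis would use a richer,
# service-specific vocabulary (and ideally stemming or synonym lists).
KEYWORDS = {"sleep", "behaviour", "school", "send", "anxiety", "housing"}

def keyword_counts(transcripts):
    """Count how many transcripts mention each keyword at least once."""
    counts = Counter()
    for text in transcripts:
        tokens = {token.strip(".,!?()").lower() for token in text.split()}
        counts.update(kw for kw in KEYWORDS if kw in tokens)
    return counts

chats = [
    "We can't get him to sleep since school restarted.",
    "Worried about SEND support at her new school.",
]
print(keyword_counts(chats).most_common())
# e.g. [('school', 2), ('sleep', 1), ('send', 1)]
```

Tracking these counts over time, per week or per month, is what reveals shifts like the ones described below.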

They saw changes in the type of advice sought as the pandemic progressed: from a peak in conversations about behavioural management, sleep, and living arrangements early in lockdown, to a rise in conversations around education and Special Educational Needs and Disabilities (SEND), presumably as children returned to school. Overall, mental health and SEND issues stood out as having increased dramatically since the beginning of the pandemic.

Feedback from their webchat helped Action for Children to make the case for increasing their capacity, and to apply for further funding to do so. They are also working on improving their reporting and how they record their impact. To support this work, they have been able to use the skills identified during the project to build a job description and hire for a data role.

Lynn Roberts, Director of Growth and Service Design at Action for Children, said:

“Working with DataKind completely opened our mind to the possibilities of data, and gave us access to so many brilliant data scientists and their different perspectives. It’s something we could never have done by ourselves, and it’s given us a talking point to share the benefits of investing in data within our organisation.

It also created the basis of our first annual Parent Talk report, meaning we are sharing children and families’ needs with the UK public and decision makers, and raising awareness of the gaps in support.”

Read about their project in more detail.

Resources

A few recommended resources from DataKind UK and Action for Children, for groups who want to use their data responsibly:

If you’d like to learn more from DataKind UK about how your organisation can work with data, you can sign up to their mailing list for more news from other charities, resources, and articles.


About the Author

Dr Diego Arenas is Senior Researcher at DFKI, The German Research Center for Artificial Intelligence, and Chapter Leader of DataKind UK’s Scoping and Impact Committee.
