Dr Diego Arenas from DataKind UK shares his top tips for working with sensitive text data.
DataKind UK has been supporting charities, local government, and social enterprises to use data and data science since 2013. Working with skilled volunteers, we help charities to discover more about their service users, improve their processes, and get closer to reaching their mission with the power of their own data.
As support services moved online during the pandemic, and digital became the default mode of communication, more charities and public services than ever before found themselves collecting records of conversations from webchats or phone calls. As well as containing important information about beneficiaries or service users, this ‘text data’ holds vital insights about their needs and impact. In recent years, many organisations have begun to analyse this text and gain in-depth insights about their work.
But this kind of analysis can be challenging for many reasons. Records of conversations are likely to contain sensitive, private, or upsetting material. Qualitative data like text can have a lot of Personally Identifiable Information (PII), which needs to be approached with extreme caution. This is particularly important if the people the data is about or from are vulnerable individuals, including children.
How can you tackle these challenges to better support your service users? This blog looks at the risks of using sensitive data for analysis, and gives our recommendations on how they can be overcome.
Consider whether using data that might identify your users is necessary to achieve your purpose. Do not include any sensitive data unless you need to, to improve your service and the lives of the community your organisation serves. When processing any information, you must always have a lawful basis for using it – the Information Commissioner’s Office (ICO) provides excellent guidance.
A crucial step in using sensitive data is thoroughly removing identifying information. In most cases, this doesn’t reduce the value or scope of the analysis – and it’s an ethical, and often legal, imperative. Here are our top tips:
- Masking by replacing names with aliases, or simply indicating the participant’s role. For example, if you have professionals giving advice, you can replace their names with ‘Professional’ or ‘Agent’ so that it is clear the information comes from them, but they are not identifiable
- Remove addresses, email addresses, phone numbers, and any PII that could be used to single out an individual
- Replace addresses with a wider geographic area, so it is possible to summarise statistics without identifying individuals. Ensure this area is large enough that someone’s identity cannot be inferred from wider context
- As above, consider whether there is other personal data that, when combined, could lead to identification of an individual – for example their age, ethnicity, and school combined
- As well as thinking about the text data itself, the ‘metadata’ – information that describes the data, like user IDs, timestamps, etc – needs careful handling, as it may allow individuals to be identified
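Several of these steps can be automated. As a minimal sketch, assuming Python and its standard `re` module – the patterns and placeholder labels below are purely illustrative, and nowhere near exhaustive enough for production use:

```python
import re

# Illustrative patterns only – real anonymisation needs far wider
# coverage (names, postcodes, case numbers, etc.) plus manual review.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?44|0)\d[\d ]{8,12}\d\b"),
}

def mask_pii(text: str) -> str:
    """Replace each pattern match with a bracketed placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jo@example.org or call 020 7946 0123."))
# → Contact [EMAIL] or call [PHONE].
```

An automated pass like this should always be followed by a manual check of the output, since pattern matching will miss edge cases such as unusual phone formats, names, or addresses written in free text.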
Set expectations about how data will be used
Run a Data Protection Impact Assessment
Running a Data Protection Impact Assessment will include creating a ‘Risk Register’ to identify specific risks within your data (many of which are covered in this blog!). List what actions you’ve taken to mitigate these risks, and what the likelihood and severity of any harm might be. Then zoom out and look at the overall impact once your actions have been put in place.
Make this a transparent, meaningful part of your process – not a tick box exercise. Data analysts working with sensitive data need to be aware that behind each data point, there is or was a life. This will help them consider the consequences of their analysis. How would you feel if the data you are analysing were about you? How much care would you put into the analysis and accuracy of the results?
Give yourself time!
This all takes time – build it in from the start so you know you can do it right. All the points above need tackling thoughtfully – not in a rush! In particular, build in plenty of time for cleaning and anonymising the data, as this process can take a while. You can automate your anonymisation process – but you should manually check the outputs to ensure the automation is doing what you want it to.
Consider your analysts’ wellbeing
Sensitive data can affect not only your users, but also those performing the analysis. Always provide a warning about the content of the documents they are about to see and give them the option of not receiving the most sensitive data. Consider sharing a sample of the data with them so that they can get an idea of what they will encounter.
Safeguarding in Action: Action for Children
Analysing webchat data from their Parent Talk service was a key way for family support organisation Action for Children to understand their users’ needs. DataKind UK helped them to ensure the data was fully anonymised and safe for volunteer data analysts to look at. Together, they drew up a full Risk Register, and Action for Children ran a Data Protection Impact Assessment. They also placed the utmost importance on ensuring no personal information was seen by anyone outside of their service team.
The outcomes from their analysis made their preparation worthwhile. Assessing their users’ conversations for common keywords showed them which areas of their service needed more resources, and how they might make these resources more intuitive to find on their website.
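A simple keyword count of this kind can be sketched in Python – the conversation snippets and keyword stems below are invented for illustration, not Action for Children’s real data or method:

```python
from collections import Counter
import re

# Hypothetical, already-anonymised snippets – invented for illustration.
conversations = [
    "my child is not sleeping well and school is stressful",
    "worried about sleep and behaviour since lockdown started",
    "need advice on school and special educational needs",
]

# Simple stem-style matching: count any token beginning with a stem.
stems = ("sleep", "school", "behaviour")

counts = Counter()
for text in conversations:
    for token in re.findall(r"[a-z]+", text.lower()):
        for stem in stems:
            if token.startswith(stem):
                counts[stem] += 1

print(counts.most_common())
# → [('sleep', 2), ('school', 2), ('behaviour', 1)]
```

Even a rough tally like this, run over anonymised transcripts, can show which topics dominate a service and how they shift over time.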
They saw changes in the type of advice sought as the pandemic progressed: from a peak in conversations about behavioural management, sleep, and living arrangements early in lockdown, to a rise in conversations around education and Special Educational Needs and Disabilities (SEND), presumably as children returned to school. Overall, mental health and SEND issues stood out as increasing dramatically since the beginning of the pandemic.
Feedback from their webchat helped Action for Children to make the case for increasing their capacity, and apply for further funding to do so. They are also working on building their reporting and how they record their impact. To support this work, they have been able to use the skills identified during the project to build a job description and hire for a data role.
Lynn Roberts, Director of Growth and Service Design at Action for Children, said:
“Working with DataKind completely opened our mind to the possibilities of data, and gave us access to so many brilliant data scientists and their different perspectives. It’s something we could never have done by ourselves, and it’s given us a talking point to share the benefits of investing in data within our organisation.
It also created the basis of our first annual Parent Talk report, meaning we are sharing children and families’ needs with the UK public and decision makers, and raising awareness of the gaps in support.”
A few recommended resources from DataKind UK and Action for Children, for groups who want to use their data responsibly:
- The ICO’s guidance on running Data protection impact assessments
- The Data Ethics Project from LOTI, who support London boroughs to improve public services with digital, data and innovation
- Consent issues in data sharing is a free online workshop that explains the difference between types of consent
- The UK Data Service provides access and training to use the UK’s largest collection of economic, social and population data for research and teaching
- The Data for Children Collaborative’s ethics and safeguarding training pack includes everything you need to successfully complete the ethical assessment and safeguarding training for your project, reflect on any ethical challenges you may face as your project progresses, and to remind you of the importance of protecting children and their rights at all times
- DigiSafe is a step-by-step digital safeguarding guide for charities designing new services or taking existing ones online
- Identifying Potential Data Risks and Harms is a short, simple, and practical resource to help teams who are designing projects to identify potential risks and harms, with clear and practical actions and activities to reduce them
- Radical Safeguarding – A Social Justice Workbook for Safeguarding Practitioners is designed for practitioners working with children and young people, particularly in school contexts, who might be wondering how to start doing things differently when it comes to safeguarding
- Self-care in User Research is a blog by Janice Hannaway and Jane Reid as part of User Research Explained
- What should you include in your not-for-profit organisation’s safeguarding policies and procedures? Insurance provider Zurich has produced a detailed guide explaining many of the key considerations when creating safeguarding policies and procedures as a charity
If you’d like to learn more from DataKind UK about how your organisation can work with data, you can sign up to their mailing list for more news from other charities, resources, and articles.
About the Author
Dr Diego Arenas is Senior Researcher at DFKI, The German Research Center for Artificial Intelligence, and Chapter Leader of DataKind UK’s Scoping and Impact Committee.