Matthew Woollard presents his response to the Commons Science and Technology Select Committee’s inquiry into the opportunities and risks of ‘big data’.
Back in July, the Commons Science and Technology Select Committee announced an inquiry to examine the opportunities and risks of ‘big data’. The inquiry (which is ongoing as I write) is focussing on whether the government is doing enough to allow entrepreneurs to benefit from the ‘big data’ revolution, and to protect the rights of data subjects.
The UK Data Service has a long history of responding to government consultations, and this one was no exception. During July and August I coordinated a joint response from the UK Data Service and the Administrative Data Service (ADS), and this has been published alongside all the other responses to the Committee. What makes this response special, and worth reproducing here, is that we included it as part of our regular reporting to the ESRC’s Data Infrastructure Strategic Advisory Board (which has governance responsibilities for the UK Data Service), and they were hugely encouraging about it, since it gives a social science perspective on the opportunities and problems of some big data. It was this Board which suggested that we publish it on this blog.
The response is 24 short paragraphs which are directly aligned to the questions posed by the Committee. And, when it says at the top that it was compiled by me, that’s the truth. At least half a dozen people from within the Services contributed to the final text; we also had comments on drafts from both the ESRC and researchers working in the field at the University of Essex. It was a real collaboration, and a demonstration that the UKDS and ADS work closely together on strategic initiatives. It was also commended by our governing body for its concision, clarity and relevance.
Written response to ‘The big data dilemma’ consultation from the UK Data Service and the Administrative Data Service.
The UK Data Service is a data service infrastructure, funded by the Economic and Social Research Council (ESRC) to support researchers, teachers and policymakers who depend on high-quality social and economic data. It provides a free-at-the-point-of-use service to researchers who want access to social and economic data. The lead organisation of the UK Data Service is the UK Data Archive at the University of Essex, which has almost 50 years’ experience of making data available for research. The UK Data Service is currently investigating the integration of new and novel sources of data for research into its existing service, and supporting key ESRC investments in the management of these data.
The UK Data Service has staff at the Universities of Essex, Manchester, Southampton, Edinburgh and Leeds, at University College London and at Jisc. The Administrative Data Service coordinates ESRC’s Administrative Data Research Network and is the first point of contact for researchers who want access to administrative data, with partners at the Universities of Manchester, Oxford, the West of England and Edinburgh.
Both services wish to submit evidence to this committee in order to bring attention to the ways in which some of the risks relating to the use of these data within and beyond Higher Education can be resolved.
What are the opportunities for big data, and what are the risks?
1. “Big data” is so variously defined that starting with the term big data may not be helpful. In the world of socio-economic research, big data may be used to describe the contents of a government business information system, a transaction management system of a commercial organisation, the search activities of users of a social media site, tracking data created through the use of mobile devices, as well as surveillance images. These may meet some of the characteristics of being large, dynamic and complex, but they don’t necessarily need new techniques for their analysis. “Big data” should not be understood as being primarily about the data per se; it should be understood to mean the use of new methods of data analytics on data which have not traditionally been used to produce insight.
2. From the point of view of socio-economic research, most of these forms of data share an additional characteristic, which is that they contain information about people, and those people have not explicitly given their informed consent for the reuse of these data for any particular purpose.
3. Each of the generic types of data above has exceptional opportunities for reuse across the public/private/third sectors. For example, hospital admission records could help private care providers be more efficient; search activities could potentially assist law and order agencies in pre-empting civil unrest; tracking data from mobile devices may allow commercial organisations to target advertising more effectively; surveillance images have the potential to monitor power usage at a macro level. The possibilities are endless, and mostly positive, but each of these particular opportunities would be likely to cause considerable disquiet amongst a majority of people who are represented in the underlying data.
4. Significant potential for the use of these types of data in socio-economic research exists, but in many cases the cost/benefit ratio has not been measured or compared with other forms of data analysis. One should not assume that these types of data are always cheaper to draw conclusions from.
5. The risks are numerous, and will depend on the type of data, the context in which the data were created, the owner of the data, the method of ‘analysis’, and the purpose to which the analysis is put. If personal data from a government business information system are used by respected researchers, whose research is approved by data owners for public good, who understand the complexities of the data, work in a secure environment, have their outputs vetted, and are aware of the penalties under the law, the risk of anything “bad” happening is virtually zero.
6. Some very specific risks surround the quality of these data. There is the possibility that decisions may be based on insights from attractive new forms of big data, without the necessary work being done to understand and calibrate the extent to which they provide valid alternatives to more traditional forms of data. In the main, traditional census and survey type data sources involve considerable effort in the design of questions, sampling frames and definitions, allowing users to understand and quantify issues such as representativeness and bias. Most big data sources do not result from any explicit design considerations (for research) but are reflections of what has already been measured. Consequently they will often over- and under-represent different groups within the population in different and complex ways. For example, the very young and the very elderly do not and will not engage with the digital economy in the same way as young adults; they may therefore leave very incomplete traces in the big datasets. All ‘big data’ has the possibility of suffering from some sort of bias which has the potential to affect results.
7. Big data may offer exciting insights on topics which may previously have been hard to measure, but also enormous opportunities to misrepresent and misunderstand important social and economic questions. Nowhere near enough methodological work has yet been done to bridge this gap. The nature of some of these data and the allied techniques used to analyse them are such that validation may be extraordinarily hard to perform and research integrity extraordinarily hard to assure. Two statisticians may be able to take the same raw survey and end up with the same result; two data scientists doing the same with big data will almost certainly not. The golden thread of integrity from data to policy (evidence-based policy) has the potential to be broken. Again, there has not been enough work done to bridge the gap between the two.
8. A further risk is that ‘traditional’ sources of socio-economic data become less reliable and/or less accessible. The consultation for the 2021 census proposes the exploration of administrative data (really no more than another term for big data) to supplement the 2021 census. The exploration of this topic is warmly welcomed, but these sources should not be seen as a magic bullet to reduce costs. Obviously data collection costs might fall, but data analysis costs will rise, and the richness of available data will be diluted. Perhaps less important, but not a negligible problem, will be access by third parties. The UK has one of the longest running and most successful data archives, which makes data accessible to researchers. The replacement of traditional surveys with less robust big data may, for various rights-related reasons, reduce the opportunity for reuse. We note that government has been keen on the idea of increasing efficiency by sharing and linking administrative data, but we are still at a very early stage in effective research access to and linkage of these extraordinarily well-regulated forms of data; any policy developments towards the further use of big data need to be cognizant of the fact that effective access to and use of existing administrative data is still far from being achieved.
Has the Government set out an appropriate and up-to-date path for the continued evolution of big data and the technologies required to support it?
9. The government does not seem to have a single path for the evolution of big data and the technologies to support it. The ‘Seizing the data opportunity’ (October 2013) strategy is the closest to an over-arching path, and there is clear movement along aspects of this strategy. It would be helpful to see a progress report for this strategy.
Where do gaps persist in the skills needed to take advantage of the opportunities, and be protected from the risks, and how can these gaps be filled?
10. One key skills gap surrounds the intersection between statistics and data analytics. We can typify one as a branch of mathematics, and the other as a branch of computer science. This rather simplifies the distinction, but both need to understand the provenance and design of the data in order to apply it to ‘real-world’ situations. This has been part of statistics courses for decades, but is only starting to permeate into computer science.
11. The second major skills gap relates to data management and data handling. We are at the beginning of a major reconceptualisation in the curation and the ‘processing’ of these types of data source for reuse. There has been little work in the academic sector about maximising the reuse value of these types of data.
12. The third and fourth skills gaps surround the ethical and legal issues around these data. Ethicists are only just becoming aware of the complex issues involved, which are potentially unresolvable within existing privacy paradigms. (See paragraph 16 below.)
13. Many of the risks surrounding the use of these data can be mitigated by training researchers in best practices in handling and using these types of data. There are ethical and ‘secure’ training courses in the handling and analysis of personal data, and these are generally harmonised across multiple data owners, e.g., the Office for National Statistics, Her Majesty’s Revenue and Customs, the Administrative Data Research Network and ESRC’s UK Data Service. Further work is required on re-examining ethical and legal frameworks for the reuse of these types of data which are based on personal information.
14. Openness and transparency in the construction of “big data” need to be driven by government initiatives. There is no use in simply publishing vast quantities of open data unless there is sufficient background information to be able to use them properly.
How can public understanding of the opportunities, implications and the skills required be improved, and ‘informed consent’ secured?
15. The Ipsos Mori Dialogue on Data (2014) report, which explores public views on using administrative data for research purposes, was an excellent guide to the above topics.
16. At the intersection of ‘research’ and ‘ethical behaviour’ there is a growing conundrum: a key feature of data analytics in the business sector is to uncover hidden patterns, or things which are unknown; in the social science (and medical) domains the purpose of ‘informed consent’ is to provide sufficient information on all aspects of research which may be carried out (or uses to which the data may be put). In the big data world, real informed consent may become a contradiction. Baroness O’Neill has argued eloquently that “genuine consent for the reuse of highly complex data for highly complex purposes is unworkable.” Big Data allows us the opportunity to reimagine the complex interplay of the ethical reuse of data for all data, and may also allow for a real data revolution. (Some work is being done in this area by the Cabinet Office from a national perspective, and the compiler of these comments is a UK representative on an OECD group working on some international recommendations.)
What further support is needed from Government to facilitate R&D on big data, including to secure the required capital investment in big data research facilities and for their ongoing operation?
17. Funding for data service infrastructure needs to be provided on a more sustainable basis. In order to maximise the reuse potential of any type of data, it needs to be managed effectively through the whole of the data life-cycle, and if this management can be centralised, harmonised controls (covering consent, privacy rights, the maintenance of ownership rights, and research ethics/integrity) can be applied consistently, and independently of either the data controller or the data users.
18. Data service infrastructure also needs to have the authority to ensure that data and linked data can be curated and managed together rather than in silos, and that long-term access can be provided.
19. Data service infrastructure (like data.gov.uk) can also provide guidance on the harmonisation of data. For example, HESA’s student record allows people to report their sex as Male, Female or Other (where Other is explicitly used “for people who associate with the terms intersex, androgyne, intergender, ambigender, gender fluid, polygender and genderqueer”), but DVLA licence holders can report only Male or Female. Why do different government-supported organisations use different categories for the most straightforward of questions?
20. Research needs to be carried out into identifying the value of big data assets. Data which ‘store value’ and have the potential to realise future benefit need to be properly curated. Curation comes at a cost.
21. Scalable services and infrastructure need to exist to maximise the benefits provided by big data opportunities. Infrastructure and services for data, especially data which are personal, are not simply capital costs. Infrastructure and services for these data include hardware and software, but also the skills and human resources to make them function. Shared infrastructure is likely to be more efficient. In the social sciences the ratio of pure capital to operational costs may be in the region of 1 to 10; in the hard sciences this ratio may be in the region of 10 (or more) to 1. This imbalance needs to be taken into consideration in investment.
22. Resources are also required to exploit the data themselves. Capital investment alone does not secure robust methodologies to analyse those data and/or even to access those data; resource is required for research, especially research which does not have a commercial application, so that it can inform and influence commercial applications. Further research and public dialogues around ethical and legal areas need to happen.
23. Big Data facilities, like the Administrative Data Research Network, need to unequivocally set themselves apart from organisations which commodify personal data without proper ethical safeguards, and maintain levels of independence which the public or customers (as data subjects) have confidence in.
24. Decisions on support for further facilitation should be made on the basis of the costs across the whole lifecycle of the data, not a one-off research “hit”. The most effective data infrastructures are to all intents and purposes collaborations between data specialists and researchers.