Quality: The Trade-off of Working with Big Data

Esmeralda Bon, one of our #DataImpactFellows, was inspired to write this blog post by a reflective piece she wrote for this British Academy report, as well as discussions with peers that took place at a workshop held at the Sir Bernard Crick Centre, University of Sheffield last year.

Big data.

Machine learning.

Artificial intelligence.

Three recent buzzwords (buzzphrases?).

Some fear the threat of artificial intelligence and the rise of a surveillance state, run by algorithms and powered by big data. Whether fearful or not, as researchers we can now collect and analyse large quantities of data. But with this comes responsibility.

For my PhD project, I study social media data, consisting of thousands of observations. Big data, one could say, collected algorithmically.

These big data sets are exciting to work with: such statistical power! However, the enthusiastic young researcher will soon find that while great in size, this data is less than great in terms of quality.

What, then, is quality data?

I would define quality data as rich, complete and with a traceable history. When we collect digital data automatically, using the interface provided by web sites, we are in essence dealing with a black box. The interface determines what information we receive. It limits the studies we can conduct, influences our research design and may skew our findings.

We may then ask: to what extent are we studying what we set out to study, a true phenomenon, in its entirety?

Digital data is subject to change and, once collected, the data we have may represent little more than a snapshot. Furthermore, data and information may be taken down or deleted during the collection process (trust me, I have been there). Thus, data that was there one day could be gone the next. In the end, our “big data” data set is still small, minuscule compared to all the information out there. It is therefore important not to oversell our data.

Photo: “Data critical squirrel”, personal image / CC BY

But let us not undersell this data either. Our big data may be incomplete and we may make mistakes in our data collection.

For example, my approach to collecting Facebook data failed to capture the reactions to posts. A pity. However, the data we gather through other means, such as interviews and focus groups, is likewise incomplete. Cross-sectional data most certainly also reflects a snapshot of reality. A sample of interviewees may not be representative and transcripts may be lost or damaged during the interviewing process.

In the end, what matters is that we remain vigilant and that we are open about these quality concerns.

Big data is exciting, high in statistical power and potentially high-impact, for our world now listens to numbers and quantification, algorithms and AI. It has become our responsibility to flag data quality concerns. Our data may be big, but it is also fallible. It is not a holy grail.

If something seems too good to be true, then it probably is.

Note: For interesting insights regarding the quality of Twitter data more specifically, I recommend work by Fred Morstatter and colleagues.

About the author

Esmeralda Bon, @EsmeraldaVBon, is one of our UK Data Service Data Impact Fellows.

Esmeralda is a final-stage doctoral researcher at the School of Politics and International Relations, University of Nottingham. Her dissertation focuses on MP communication and representation on Facebook, using the EU referendum as a case study. Esmeralda also works as a research associate on the project “Digital Campaigning and Electoral Democracy (DiCED)” at the Cathie Marsh Institute for Social Research, University of Manchester. This is a major new comparative project that will study the drivers and effects of digital campaigning in 5 countries and 7 national elections during the period 2020-2023.
