Why and what do we need to know about data pre-processing?

Anran Zhao discusses the importance and challenges of good data preparation.

 

A deluge of data

In today’s world, we are creating ‘Big Data’ on an unprecedented scale. Can you imagine, for example, that four petabytes of data are generated each day on Facebook, including 350 million photos and 100 million hours of video watch time? With people all over the globe spending more time than ever before on the internet and electronic devices, our digital footprint is growing, and so too is the data being collected and stored by companies wanting to analyse our usage and preferences. It is fair to say that we are being confronted by a ‘data deluge’; the amount of data being generated is overwhelming the capacity of organisations to process it.

These companies, especially large, international ones, invest extensively in data analytics for all sorts of reasons, from developing new tools and applications to understanding target audiences and opportunities for increased profit. For example, Google Trends records people’s search history and presents the change in popularity of search queries over time, which has proven useful across multiple fields, from economics to sociology; whilst Amazon uses data on customer searches and purchases to make personalised recommendations on log-in, with the aim of encouraging further orders.

Anyone who works in data science knows that raw data requires a lot of wrangling in order to be useful. Data will only provide valuable knowledge and insights if it has been prepared well for rigorous analysis. Poorly prepared data leads to ‘garbage in, garbage out’ (GIGO): the computing principle that bad input will produce bad output. If we imagine a business as a house and data as its foundations, then low-quality, or ‘dirty’, data equates to unstable foundations that will result in a poorly built and potentially unusable house. Poorly prepared data can therefore lead to poor analysis and business outcomes, as well as wasted time and effort rectifying problems that could have been avoided had the data been well prepared in the first place.

There is a big gap between the expectation and the reality of this sort of data wrangling, especially where Big Data is concerned. Whilst preparing clean data sets may appear to be a straightforward process, it is actually full of challenges, many of which are time consuming and difficult – and in some cases impossible – to overcome.

Rather than simply developing fancy algorithms or plotting eye-catching figures, data scientists spend much of their time collecting and preparing data. According to a 2016 survey by CrowdFlower, as much as 80% of the work of data scientists consists of data preparation, including 19% of time spent on collecting data sets and 60% on cleaning and organising data. Other tasks like building training sets and refining algorithms were found to take up much less time in comparison. See Table 1 below.

Table 1: What data scientists spend the most time doing (source: CrowdFlower survey 2016)

| Activity | % of time spent on activity |
| --- | --- |
| Building training sets | 3% |
| Cleaning and organising data | 60% |
| Collecting data sets | 19% |
| Mining data for patterns | 9% |
| Refining algorithms | 4% |
| Other | 5% |

What is data pre-processing?

Data pre-processing (i.e. data preparation) is the process of transforming raw data from one or more sources into a structured, clean data set ready for analysis. It is an important part of data analytics, and is crucial for generating meaningful results.

There are many ways to check the quality of your data, and here I’d like to cover two common ways.

  1. Use metadata.
    Metadata summarises basic information about data; for example, means of creation of the data, purpose of the data, time and date of creation, etc. Metadata is thus “data about data”, and helps us to better understand the context of the data, its quality, and any potential underlying risks. Many data sets available from the UK Data Service are accompanied by comprehensive metadata, which enables researchers to get a good picture of the data and the possibilities of working with it: e.g. whether it is possible to enrich the data set by integrating external data, whether the granularity is sufficient, etc.
  2. Another option is Exploratory Data Analysis (EDA).
    EDA is widely used in data analytics, and checking data quality is one of its purposes. It is a useful approach for summarising a data set’s main characteristics and statistics, often with visual methods: for example, checking the distribution of null values, or exploring seasonal or departmental patterns by examining a subset of the data. EDA is usually the first step once you have created your programming notebook, and is primarily about seeing what the data can tell us before formal modelling or hypothesis testing (a short code sketch follows this list).
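To make EDA concrete, here is a minimal pandas sketch of the kind of quick quality checks described above. The file name survey.csv and the ‘department’ column are purely illustrative assumptions, not references to any particular UK Data Service data set.

```python
import pandas as pd

# Load an illustrative data set (the file name is an assumed placeholder)
df = pd.read_csv("survey.csv")

# Basic structure: column names, data types, non-null counts
df.info()

# Summary statistics for the numerical columns
print(df.describe())

# Distribution of null values across columns
print(df.isnull().sum())

# Examine a subset to look for departmental patterns
# (the 'department' column and 'Sales' value are assumed examples)
print(df[df["department"] == "Sales"].describe())
```

Even these few lines tend to surface the obvious quality problems, such as unexpected data types, columns dominated by nulls, or implausible minimum and maximum values, before any modelling begins.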

Once we have data at hand, pre-processing the data can involve various different tasks, including:

  • Data integration: integration of multiple databases, data cubes, or files;
  • Data description, summarisation and visualisation;
  • Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies;
  • Data reduction: obtain reduced representation in volume which produces the same or similar analytical results;
  • Data transformation: normalisation and aggregation;
  • Data discretisation (for numerical data) and generalisation (a brief sketch of the last two tasks follows this list).
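As an illustration of the last two tasks, here is a small pandas sketch of normalisation and discretisation. The ‘age’ and ‘income’ columns are assumed example fields, not drawn from a real data set.

```python
import pandas as pd

# Illustrative data frame; 'age' and 'income' are assumed example columns
df = pd.DataFrame({"age": [23, 35, 47, 59, 71],
                   "income": [18000, 26000, 41000, 39000, 22000]})

# Data transformation: min-max normalisation of 'income' to the range [0, 1]
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())

# Data discretisation: bin the numerical 'age' column into broader categories
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "older"])

print(df)
```

Standardisation (subtracting the mean and dividing by the standard deviation) is an equally common transformation; which one you choose depends on the downstream analysis.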

Please bear in mind that there is no particular order for carrying out these tasks! There is currently no standardised method for data pre-processing; the tasks listed above are simply the most common ones. In fact, it is relatively common for some of these tasks to be repeated multiple times: for example, after reading the metadata you may clean the data to remove all null values, but joining tables then introduces new missing values, which requires further cleaning (see the sketch below).
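That loop of integrating and then re-cleaning might look something like this in pandas; the table names and the ‘id’ key are assumptions made for illustration only.

```python
import pandas as pd

# Two illustrative tables sharing an assumed 'id' key
customers = pd.DataFrame({"id": [1, 2, 3],
                          "region": ["North", None, "South"]})
orders = pd.DataFrame({"id": [1, 3, 4],
                       "amount": [120.0, 85.5, 40.0]})

# First pass of data cleaning: drop rows with a missing region
customers_clean = customers.dropna(subset=["region"])

# Data integration: an outer join keeps all records from both tables...
merged = customers_clean.merge(orders, on="id", how="outer")

# ...but introduces new missing values wherever the keys do not match,
# so another round of cleaning or imputation is needed
print(merged.isnull().sum())
```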

It should also be noted that data pre-processing is usually a complex set of interrelated steps, rather than simply a matter of completing individual tasks in different pieces of software. I tend to use Python to do it, but you can use whichever language you prefer.

An example: handling missing values

During the UK Data Service Data Pre-processing webinar series (for recordings, see links below), one of the most frequently asked questions is ‘what should I do with the missing values in my data sets?’ Missing values are arguably one of the most annoying issues for data scientists and researchers. On the one hand, it is necessary to check missing values as some modelling algorithms require a complete data set to run properly; on the other hand, nearly all data sets have some missing values, and handling them can be a very tricky task!

There are various strategies for dealing with missing values, including, for example, deleting the row or the column, or imputing a specific statistic or a random number. The best solution will be the one that can approximate the original, unknown value, and thus optimise the eventual performance of the data model. However, there is no easy fix and it is usually hard to provide a direct solution without knowing the exact context of the research project in question.

Here is an extract from the UK Data Service webinars, which provides some of the most common strategies for handling missing values in your data sets. Note that all strategies have potential limitations, so these have been included in the table below to help you determine which strategy might be useful for your own data sets. A short code sketch of a few of these strategies follows the table.

Table 2: Possible strategies for handling missing values

| Strategy | Limitation |
| --- | --- |
| Throw out the records with missing values | Biases the sample |
| Delete the column with missing values | Only feasible if the column is not needed for analysis |
| Replace missing values with a “special” value (e.g. -99) | Analytical methods will treat the special value like any other value |
| Replace with a “typical” value (mean, median, or mode) | May change the distribution |
| Impute a value drawn from the distribution of observed values (imputed values should be flagged) | May still change the distribution |
| Use data mining techniques that can handle missing values (e.g. decision trees) | Can require considerable time and effort |
| Partition records and build multiple models | Only feasible when there is sufficient data |
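As a rough illustration of a few of the strategies in Table 2, here is a short pandas sketch. The data frame and its column names are invented for the example, and the imputation shown is a simple random draw from the observed values rather than a recommendation for any particular study.

```python
import numpy as np
import pandas as pd

# Illustrative data set with missing values; column names are assumed examples
df = pd.DataFrame({"age": [25, np.nan, 41, 33, np.nan],
                   "score": [0.7, 0.4, np.nan, 0.9, 0.6]})

# Throw out the records with missing values (risks biasing the sample)
dropped = df.dropna()

# Replace with a "typical" value, here the median (may change the distribution)
median_filled = df.copy()
median_filled["age"] = median_filled["age"].fillna(median_filled["age"].median())

# Impute a value drawn from the observed distribution, and flag the imputed rows
imputed = df.copy()
imputed["age_was_imputed"] = imputed["age"].isnull()
rng = np.random.default_rng(42)
observed = imputed["age"].dropna().to_numpy()
missing_mask = imputed["age"].isnull()
imputed.loc[missing_mask, "age"] = rng.choice(observed, size=missing_mask.sum())

print(dropped, median_filled, imputed, sep="\n\n")
```

Whichever strategy you adopt, flagging imputed values (as in the age_was_imputed column above) keeps the decision transparent for later analysis.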

In conclusion

Given that data preparation takes around 80% of a data scientist’s time, having a full toolkit of approaches to cleaning and organising data will give you a better chance of making effective use of the remaining 20% of your time for analysis.

Links

UK Data Service webinar: ‘Data Pre-processing: Introduction and Integration’

UK Data Service webinar: ‘Data Pre-processing: Clean, Reduce, Transform’

Resources

García, S., Luengo, J. and Herrera, F., 2015. Data Preprocessing in Data Mining. New York: Springer.

Acknowledgements

Some content in this blog post is inspired by Dr. Yu-wang Chen’s ‘Understanding Data and Their Environment – Data Preprocessing’ lectures at the University of Manchester.

 


About the author

Until recently, Anran Zhao was a Research Associate in the Computational Social Science team for the UK Data Service, based at the University of Manchester.

Whilst in post, she developed and delivered training events, including workshops and webinars, and online training materials to meet the needs of social science and cross-disciplinary researchers. She holds degrees in Art History, English and Business Communication, and Data Science.
