In the first of a suite of articles on big data, the UK Data Service’s Big Data Network Support team writes about how researchers can make big data more manageable.
Faced with a research project involving ‘big data’, the first question often seems to be: where do I start? Big data are larger or more complex than traditional datasets, so traditional processing applications may simply not be able to manage them. The sheer amount and diversity of information available make big datasets quite different from the data that researchers are accustomed to handling. For those analysing real-time data streams, such as social media feeds, the speed of incoming information may seem overwhelming. Any big dataset must be navigated at considerable speed, or velocity, to be effective and to avoid becoming mired in the volume and variety of information.
The ‘three Vs’ model of big data management – volume, variety, and velocity [i] – can provide a useful starting point for natural science or social science researchers interested in big data. Compared to more traditional datasets, different strategies and technologies are required in order to effectively manage and glean insights from big data.
Researchers need to be sure of the questions that they wish to ask and the knowledge that they are trying to gain, before even bringing in the data. Too often people jump into looking at the technology required to analyse big data in all its complexity, bypassing the critical first step of putting the data into context. It’s not even necessary to download the entire dataset in order to discover what you need to know.
So: where to start?
Well, it’s all about getting into the appropriate mindset for approaching big data. In essence, a big data mindset means examining your approach to the data by posing and answering questions such as: What counts as big data? What data should be used? How do I know what is relevant?
An appropriate strategy, rooted in a set of (scientific) principles, is required. Key principles include:
- Sharing – taking a proactive stance on access to, and the exploration of, shareable data through appropriate, flexible and task-specific technologies, approaches and policies that enable widespread data sharing, discoverability, management, curation and reuse;
- Accessibility – a commitment to making data accessible for re-use. ‘Open’ data is of little use if it is inaccessible or unintelligible because of technical or institutional constraints;
- Security – protecting privacy and intellectual property. Researchers working with big data must be mindful of security and privacy, presenting data and findings in ways that minimise the risks of disclosure. The UK Data Service has strong ethical and legal frameworks and procedures in place for managing sensitive data, and decades of experience in techniques such as de-identification. Managing and Sharing Research Data: A Guide to Best Practice, by Louise Corti, Libby Bishop and Matthew Woollard, covers these aspects. Resources in this area are also provided on the book’s companion website;
- Community, competencies, and inclusivity – embracing researchers from a wide diversity of disciplines, as well as interdisciplinary researchers, to ensure that the big data scientific community is sufficiently broad. Big datasets are diverse and ever-expanding, presenting potentially invaluable resources across all scientific fields.
Making big data more manageable
Plotting data and selecting the correct data format are important when working with big data, but pose trade-offs between accuracy and variance. However, even if plotting the dataset in its entirety is unfeasible (perhaps because of a lack of sufficient scalability within a working environment), this does not mean that the data is unusable. Big data research always involves breaking down the data first.
Researchers need to establish what the dataset represents, appreciate how the data can best be pre-processed and analysed, and determine how much of the data is required to answer the question. Appropriate sampling methods can quickly transform a big dataset into something more manageable.
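As one illustration of sampling, the sketch below uses reservoir sampling, a standard technique for drawing a uniform random sample from a stream without knowing its length in advance or holding it all in memory. It is only one of many possible sampling methods and is not drawn from the article's sources; the function name `reservoir_sample` and the sample size are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from an iterable.

    Only k records are held in memory at any time, so the input can be
    far larger than the machine's RAM (hypothetical helper, Algorithm R).
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a randomly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Illustrative usage: reduce a million records to a sample of 1,000.
sample = reservoir_sample(range(1_000_000), k=1000, seed=42)
```

Whether a simple uniform sample like this is appropriate depends on the research question; stratified or weighted designs may suit some datasets better.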
According to John Nicholson’s article “Big data wizards: LEARN from CERN, not the F500”, published in The Register on 25 March 2015, the Large Hadron Collider provides “One of the best examples of the management of big data.” The team monitors 600 million collisions per second but has only around 100 collisions of interest per second that it wants to review. CERN (the European Organisation for Nuclear Research) therefore filters the data and disregards around 99.99% of the sensor stream data produced. As Nicholson puts it, while the CERN team “may not know what it’s looking for, it knows what it has already seen”.
If, having broken the data down, you find it is still too large to run on a PC, you don’t necessarily need a Hadoop-type framework (Hadoop is an open-source, Java-based programming framework that supports the processing of large datasets in a distributed computing environment). Other options include building a big data dashboard using Google BigQuery, adding visualisations through Google Chart Tools, or using some of the excellent, if lesser-known, open source software.
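The general idea of filtering a stream, keeping only the records of interest and discarding the rest as they arrive, can be sketched in a few lines. This is a toy illustration only and does not represent CERN's actual trigger systems: the `Event` structure, the simulated `energy` values and the `THRESHOLD` cut-off are all assumptions invented for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """A hypothetical sensor reading; the fields are illustrative only."""
    event_id: int
    energy: float


def simulated_events(n: int, seed: int = 0) -> Iterator[Event]:
    """Yield n made-up events one at a time, standing in for a live stream."""
    rng = random.Random(seed)
    for event_id in range(n):
        yield Event(event_id, rng.expovariate(1.0))


def keep_interesting(events: Iterable[Event],
                     is_interesting: Callable[[Event], bool]) -> Iterator[Event]:
    """Pass through only events that satisfy the rule; the rest are never stored."""
    for event in events:
        if is_interesting(event):
            yield event


THRESHOLD = 9.0  # hypothetical cut-off for an "interesting" event
total = 100_000
kept = list(keep_interesting(simulated_events(total),
                             lambda e: e.energy > THRESHOLD))
discarded_fraction = 1 - len(kept) / total
```

Because both functions are generators, the full stream is never held in memory; only the events that pass the rule are retained.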
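Before reaching for any of these tools, it is sometimes enough to process a file one record at a time rather than loading it whole. The sketch below, which assumes a CSV file with a numeric column (the name `reading` and the sample values are invented for the example), computes a running mean and standard deviation using Welford's method, so memory use stays constant however large the file is.

```python
import csv
import io
import math
from typing import IO, Tuple


def streaming_mean_std(handle: IO[str], column: str) -> Tuple[int, float, float]:
    """Return (count, mean, sample standard deviation) for one CSV column.

    Rows are read one at a time, so the file never has to fit in memory.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for row in csv.DictReader(handle):
        value = float(row[column])
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


# Illustrative usage with a small in-memory file; in practice the handle
# would come from open("some_large_file.csv", newline="").
demo = io.StringIO("reading\n2\n4\n4\n4\n5\n5\n7\n9\n")
count, mean, std = streaming_mean_std(demo, "reading")
```

This approach only suits summaries that can be updated incrementally; analyses that need the whole dataset at once may still call for one of the options above.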
This article is authored by members of the UK Data Service Big Data Network Support, which is working to provide a coherent, connected and accessible data infrastructure to support big data for the social and natural sciences, and to secure the long-term future of the ESRC Big Data Network. The team’s goal is to promote accessibility and usability of big data in scientific research.
Team Leader and Functional Director for Big Data Network Support Dr Nathan Cunningham ran an introductory course on Big Data and Statistics at Essex University’s Big Data and Analytics Summer School in August. The course covered all the above topics and provided an introduction to a variety of key concepts including: MapReduce, Data transformation (Apache Pig), Cascading frameworks (Hadoop MapReduce), Data Classification (Mahout), Statistical Analysis for Massive Data Sets (bigmemory, biganalytics, big linear regression) and When to build and when to outsource. Find out more about what the Big Data Network Support team are working on at our Big Data Network Support area.
[i] The 3Vs model was introduced in 2001 by Doug Laney of Meta Group, a commercial technology research group later acquired by Gartner. Doug Laney, ‘3D Data Management: Controlling Data Volume, Velocity, and Variety’, Meta Group, 2001.