Innovation Fund project: Open, linked data innovation

Ralph Cochrane of AppChallenge.net introduces the Innovation Fund project focused on opening more UK Data Service data and the AppChallenge.

Back in October 2014, we were awarded innovation funding from the ESRC (Economic & Social Research Council) to work with the UK Data Service and open up a maximum of three datasets for public use. The second phase is to run an AppChallenge that encourages developers across the world to access the data and create innovative applications.

Three months into the project we’ve come a long way, but interestingly we’ve had to tackle many topics that aren’t really talked about when people refer to Open Data, in order to get to the point where we can build an API.

Our team comprises some of the most talented data analysts at the UK Data Service, as well as Pontydysgu (with whom we worked on our LMI for All App Challenge, called Careerhack) and LogtoMobile, the brilliant winners of our last contest.

So where are things now? We’ve successfully harmonised one dataset: the NatCen British Social Attitudes Survey. Covering over 25 years’ worth of data, this is the survey that shows how people feel about crime, immigration and lots of other topics, all the way through to how they vote and what newspaper they read. In all there are over 100 variables (think of them as survey questions) and 90,000 people have taken part.

Next we will be working on the European Quality of Life survey, and finally we will be opening up a survey from the UK Department for Culture, Media & Sport called “Taking Part”, administered by TNS-BMRB, which regularly surveys people about the social activities in which they take part outside of work or school.

What’s interesting about these datasets is that together they build up a patchwork quilt of information across Europe about how people feel. As a team we’re really keen on mashups of open datasets with other data and we’ve been seeking to discover who else is publishing data, mainly open data, so that we can build some marketing tools and tutorials for developers later this summer.

Lessons learned

One of the biggest lessons learned is that data quality is really important. The open data movement is still in its infancy. In fact the ODI (Open Data Institute) recently had its second annual conference in London. This whole industry is new.

The definition of open data from the Open Knowledge Foundation is:

“Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).”

When you look at open datasets they vary massively in quality. Accessibility is also an issue. Many are a simple spreadsheet or a file you can download. That isn’t much use for an army of developers who want to access this data programmatically, so we’re building our open data into an API using the UK Data Service infrastructure. We’ve learned a lot about this process from designing other Open Data APIs, most notably LMI for All. However, there were still a few things we learned this time around:

  1. response times need to be fast. This means your infrastructure needs to be up to the job of providing data quickly to multiple apps on potentially thousands of handsets. We’re using 3scale to improve our API loading.
  2. programmatic access to data is flatter than perhaps traditional data analysts are used to. Traditionally a researcher would look at a dataset, run a query and then run more queries on the results returned. Through an API, it’s less about multi-tiered queries and more about getting into the minds of developers. For example, I might want to plot the results of a survey question over time or I might want to iterate through an array of values to visualise them within my app. We quickly realised that some of the traditional tools, which are used by a limited number of researchers at a time, are not going to work for this larger-scale programmatic access to the data.
  3. data weighting can’t be left to developers. For some of us it was a whole new topic that we hadn’t heard of. Speaking as a developer, maybe I’m lazy. I expect a well-defined API, with good documentation and some coding examples. I assume that the data is correct, but without weighting you can get strange results. Weighting is the process of balancing out data, particularly from surveys, that has too many answers from one demographic and not enough from another. The classic example is young men, because young men apparently don’t like to answer surveys (you learn something new every day), so datasets have to be “weighted” or balanced to take this into account, otherwise the three young men that did answer the survey could massively skew the results. Luckily we have the expert help of the UK Data Service to guide us on this.
  4. workflow – many datasets have metadata associated with them. This describes the data, how it was collected and may include useful comments. If you’re not careful you can lose this in the creation of the new dataset. We also wanted to try to avoid manually creating “new de facto” datasets which were not easily updated. The workflow from tools like Nesstar through to our RESTful API and infrastructure is part of the next phase of work.
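
To make points 2 and 3 a little more concrete, here is a minimal sketch in Python of what weighting does to a survey result, and of the kind of flat "value per year" series a developer might want to plot. Everything in it is an assumption for illustration: the records, the weights and the function name share_by_year are made up, and it does not reflect the real British Social Attitudes data or the design of our API.

```python
from collections import defaultdict

# Hypothetical survey records: (year, answer, weight).
# Illustrative values only, not real survey data.
RESPONSES = [
    (2012, "agree", 1.0),
    (2012, "agree", 1.0),
    (2012, "agree", 1.0),
    (2012, "disagree", 3.0),  # under-represented group, weighted up
    (2013, "agree", 0.5),     # over-represented group, weighted down
    (2013, "agree", 0.5),
    (2013, "disagree", 1.0),
    (2013, "disagree", 2.0),
]


def share_by_year(records, answer, weighted=True):
    """Return {year: share of respondents who gave `answer`}.

    With weighted=False every respondent counts once; with
    weighted=True each respondent counts as much as their weight.
    """
    matching = defaultdict(float)
    total = defaultdict(float)
    for year, given, weight in records:
        w = weight if weighted else 1.0
        total[year] += w
        if given == answer:
            matching[year] += w
    return {year: matching[year] / total[year] for year in sorted(total)}


unweighted = share_by_year(RESPONSES, "agree", weighted=False)
weighted = share_by_year(RESPONSES, "agree", weighted=True)

# A flat series like this is easy to loop over or hand to a charting library.
for year in weighted:
    print(year, f"unweighted={unweighted[year]:.2f}", f"weighted={weighted[year]:.2f}")
```

In this made-up sample the unweighted share of "agree" in 2012 is 0.75, but it drops to 0.50 once the under-represented respondent is weighted up. That is the kind of skew we would rather handle inside the API than leave to each developer.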

We’re currently designing the API and infrastructure whilst the data team works on the final two datasets. We’ve also heard that there is going to be a quality assessment tool from the Open Data Institute which will serve as a reference to discover good quality datasets that other people are making available. Finally, one of our Innovation Fund sister projects from the University of Liverpool is also looking at a programmatic API for census information from 2011. We’ll be investigating this further in the coming months.

To keep up to date with our project and learn more about how to make your data open, please subscribe to our App Challenge blog. Please find out more about all our Economic and Social Research Council (ESRC) Innovation Fund projects here.
