The UK Data Service’s Victoria Moody and Rob Dymond-Green on freeing data from legacy data access platforms.
Legacy data access platforms are beginning to offer us a window into the recent past, giving a sense of the processing and memory limitations developers had to work with, limitations that are now difficult to imagine.
These platforms also show how data access models were shaped by technology that, at the time, looked like a limitless opportunity to construct new architectures of data access. They were designed to be ruthless in their pursuit of a universal standard of full query return: that is, identifying all possible query lines and being engineered to fetch them. The result is an unsustainable technical debt, which precludes discovery and analysis by newer workflows.
Although they were designed for form to follow function, it is elements of the human scale which surface, making them endearing as well as brilliant in the way they approached the problems they solved. They proliferated, but how many endure, especially from the late 1990s to early 2000s? The team here is interested in first-generation data access platforms still in use, and we’d love you to tell us about any you still use.
In the late 1990s, the team making UK census data available to UK academics and researchers designed Casweb, which gives access to census aggregate data from 1971 to 2001. Casweb is nearly 20 years old and survives: it is still used by many thousands of people doing research with census aggregate data, returning tables which users select by searching for their preferred geography and then topic.
Image: The Casweb data engine returning data in ‘steampunk’ style
InFuse, the next-stage innovation after Casweb, gives access to 2001 and 2011 census aggregate statistics. It is also used by thousands each year, and lets users select their data by geography or by topic.
Image: InFuse
Casweb is now a platform at risk. Its programming language predates modern standards, and it may eventually become incompatible with the browsers in use. Now that people can fit a ’90s supercomputer on their mobile phone, the capacity for data access offers a different, more prosaic but more communal route.
Image: Legacy Casweb code – “Please install Netscape 3.0”
The data are now all open, but that doesn’t mean they are discoverable or straightforward to get. We’re taking the opportunity to retrieve the data built into the table-based web pages of Casweb, and to free the data from the prescribed routes that both Casweb and InFuse offer. Instead, we are turning to routes that improve discoverability and usability cost-effectively, aligned with the latest standards for frictionless data.
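As an illustration of what retrieving data from table-based web pages can involve, here is a minimal sketch using only the Python standard library. It is not a description of the actual Casweb export: the class and function names, the sample markup and the figures in it are all made up for the example, and real legacy pages (nested tables, merged cells, malformed tags) would need more care.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (layout whitespace, captions) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the cells of the HTML table(s) in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page fragment; the values are invented for the example.
SAMPLE = """
<table>
  <tr><th>Zone</th><th>All persons</th></tr>
  <tr><td>Area A</td><td>1200</td></tr>
  <tr><td>Area B</td><td>950</td></tr>
</table>
"""

print(table_to_csv(SAMPLE))
```

Once the cells are out of the page markup and in plain CSV, the data no longer depend on the browser or the page layout that originally presented them.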
Our plan is to go back to basics, to deliver the minimum viable product using DKAN.
But we need to do some work first. We are liberating metadata from Casweb and making it more usable for our users. We are also going a step further and exporting the data which go with this new metadata, to create datasets to load into our shiny new DKAN instance.
Luckily, the metadata in InFuse is stored within the application in an easier-to-digest format than in Casweb, so we are also in the process of exporting this, together with the data, into a set of lightweight data packages. Our aim is to format the data and metadata from Casweb and InFuse to use the same structures, which means that our users won’t have to grapple with learning how the data are structured when moving between censuses.
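To make the idea of a lightweight data package with a shared structure more concrete, here is a rough sketch that builds the descriptor (the content of a `datapackage.json` file) for a single CSV resource, following the general shape of the Frictionless Data Package specification. The package names, column names and file path are hypothetical placeholders, not the structure the team has actually settled on.

```python
import json


def build_descriptor(name, title, csv_path, fields):
    """Return a Data Package descriptor describing one CSV resource.

    `fields` is a list of (column_name, type) pairs, in column order.
    """
    return {
        "name": name,
        "title": title,
        "resources": [
            {
                "name": name,
                "path": csv_path,
                "format": "csv",
                "schema": {
                    "fields": [
                        {"name": col, "type": typ} for col, typ in fields
                    ]
                },
            }
        ],
    }


# Hypothetical column layout, reused unchanged for every census year.
COMMON_FIELDS = [
    ("geography_code", "string"),
    ("geography_name", "string"),
    ("topic", "string"),
    ("value", "integer"),
]

descriptors = [
    build_descriptor(
        name=f"census-{year}-example-topic",   # placeholder package name
        title=f"Example topic, {year} census",
        csv_path="data.csv",                   # placeholder path
        fields=COMMON_FIELDS,
    )
    for year in (2001, 2011)
]

print(json.dumps(descriptors[0], indent=2))
```

Because both descriptors are built from the same field list, anything written to read one census year's package can read the other's without changes, which is the point of sharing a structure.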
We’ll start by exporting data from InFuse 2011, using information from our download logs to determine which are our most popular datasets. For our initial data release we’ll deliver 50 datasets through DKAN, and we will continue to release data in tranches of 50 until we’ve finished making all our data available as CSV.
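The prioritisation described here, ranking datasets by popularity and releasing them in batches, can be sketched in a few lines. This assumes the download logs have already been reduced to a list with one dataset identifier per download; the identifiers and the function name are invented for the example, and the real log format is not described in this post.

```python
from collections import Counter


def release_tranches(downloaded_ids, tranche_size=50):
    """Group dataset ids into release tranches, most downloaded first."""
    counts = Counter(downloaded_ids)
    # Sort by download count (descending), then by id so ties are stable.
    ranked = sorted(counts, key=lambda ds: (-counts[ds], ds))
    return [
        ranked[i:i + tranche_size]
        for i in range(0, len(ranked), tranche_size)
    ]


# Hypothetical log extract: one entry per download.
log = ["ds-b", "ds-a", "ds-b", "ds-c", "ds-a", "ds-b"]

# A tranche size of 2 keeps the example small; the plan above uses 50.
print(release_tranches(log, tranche_size=2))
# → [['ds-b', 'ds-a'], ['ds-c']]
```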
But we’re interested in hearing how you’d like us to approach these dataset releases. One option would be to release 50 datasets from 2001 which cover the same topics as those from 2011, then equivalent datasets from 1991 and back to 1971, thus building a time series, as it were, of data. That’s just one idea. You may of course have other ideas about the priorities we should use when releasing this data, so please get in touch and help shape how this process goes.
We’ll take care of and maintain Casweb and InFuse, but won’t add any data or develop them any further. We’ll also move away from impenetrable acronyms when naming our data access platform (although ‘Cenzilla’ was mooted at one point…).
Ultimately, we aim to free the data for better discovery and, in the spirit of frictionless data, to make the data easy to share and use, whether that’s in CSV, Excel, R, or Hadoop.
Please let Rob know which combinations of data you’re using from the censuses from 1971 onwards, to help us structure our retrieval schedule.