Building an international data access architecture; a technical perspective

Fiona Cameron, Application Developer at the UK Data Service, discusses exciting developments from the Architecture Task Force sessions at the Statistical Information System Collaboration Community (SIS-CC) Annual Workshop in Paris #SISCCWS2015 to enhance our fantastic international data resource UKDS.Stat.

March 2015 saw the 5th Annual workshop of the Statistical Information System Collaboration Community (SIS-CC) held at the offices of the Organisation for Economic Co-operation and Development (OECD) in Paris.

The UK Data Service is a member of the collaboration community, and as a technical member of the International Data team I represented the Service at the meetings of the SIS-CC Architecture Task Force (ATF) workshop.

Formed in April 2014, the ATF is made up of business and technical representatives from the community member organisations. The role of the ATF is to review and provide recommendations for the business and technical architecture of the .Stat statistical storage and dissemination platform, in accordance with the 5-year Strategic Directions for the product developed in 2014.

An important part of the strategic direction of .Stat is the support and promotion of Statistical Data and Metadata eXchange (SDMX), an ISO standard for exchanging and sharing statistical data and metadata among organisations, in order to improve data accessibility and quality, and reduce costs.

.Stat already has the capacity to export data in an XML format that is a subset of the SDMX-ML modelling language. Also, a statistical chart generator, which enables API extraction of charts, maps and graphs from .Stat in SDMX-JSON format, has been developed by the OECD as a Common Statistical Production Architecture (CSPA) compliant service. The SIS-CC members continue to work to further build the SDMX information model into .Stat, and some of this work was considered by the ATF at the workshop.

For example, the OECD is planning to incorporate SDMX attributes into the system. Attributes here can be things like a unit of measure, a currency, or a power code, and can be attached to observations, to dimensions, to combinations of dimensions, or to the dataset as a whole. The usual method of providing this information at the moment is through separate reference metadata, but the aim is for it to be included within the data itself.
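To make the idea of attachment levels concrete, here is a minimal sketch in Python. It is not how .Stat stores attributes; the class names, the `resolve` method and the attribute identifiers (`UNIT_MULT`, `CURRENCY`, `OBS_STATUS`) are illustrative assumptions. It simply shows attributes declared at dataset level, at the level of a dimension or combination of dimensions, and at observation level, with the most specific level taking precedence.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    key: dict                      # dimension id -> code, e.g. {"REF_AREA": "GB"}
    value: float
    attributes: dict = field(default_factory=dict)   # observation-level attributes


@dataclass
class Dataset:
    attributes: dict = field(default_factory=dict)   # dataset-level attributes
    # Attributes attached to a dimension value or a combination of them,
    # keyed by a frozenset of (dimension id, code) pairs. Hypothetical layout.
    dimension_attributes: dict = field(default_factory=dict)

    def resolve(self, obs: Observation) -> dict:
        """Merge attributes from the broadest level down to the narrowest."""
        merged = dict(self.attributes)
        obs_pairs = set(obs.key.items())
        # Apply less specific combinations first so more specific ones win.
        for pairs, attrs in sorted(self.dimension_attributes.items(),
                                   key=lambda item: len(item[0])):
            if pairs <= obs_pairs:
                merged.update(attrs)
        merged.update(obs.attributes)
        return merged


ds = Dataset(attributes={"UNIT_MULT": "6"})
ds.dimension_attributes[frozenset({("REF_AREA", "GB")})] = {"CURRENCY": "GBP"}
obs = Observation({"REF_AREA": "GB", "TIME_PERIOD": "2014"}, 1.5,
                  {"OBS_STATUS": "E"})
resolved = ds.resolve(obs)
```

Here `resolved` combines the dataset-level multiplier, the currency attached to the `REF_AREA` value and the status attached to the single observation, which is the kind of information that currently tends to live in separate reference metadata.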

There are several considerations. First off, data are currently modelled and stored within a relational database structure which does not lend itself easily to modelling attachment of attributes. Different modelling solutions were examined and the pros and cons of duplicate tables versus multiple table joins were presented, taking into account the effect on disk space, import time, and ease of data extraction.
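As a rough illustration of that trade-off, the sketch below uses Python's built-in `sqlite3` module to compare two layouts: a separate attribute table that has to be joined at extraction time, and a wider table that repeats attribute values on every row. The table and column names are invented for this example and are not the .Stat schema, and the two layouts are only one possible reading of the options that were presented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layout A: observations plus a separate attribute table (needs joins).
CREATE TABLE obs (obs_id INTEGER PRIMARY KEY, ref_area TEXT,
                  time_period TEXT, value REAL);
CREATE TABLE obs_attr (obs_id INTEGER REFERENCES obs(obs_id),
                       attr_id TEXT, attr_value TEXT);
-- Layout B: one wide table with attribute values repeated on each row.
CREATE TABLE obs_wide (ref_area TEXT, time_period TEXT, value REAL,
                       unit_measure TEXT, obs_status TEXT);
""")

conn.executemany("INSERT INTO obs VALUES (?, ?, ?, ?)",
                 [(1, "GB", "2013", 1.2), (2, "GB", "2014", 1.5)])
conn.executemany("INSERT INTO obs_attr VALUES (?, ?, ?)",
                 [(1, "UNIT_MEASURE", "GBP"), (2, "UNIT_MEASURE", "GBP"),
                  (2, "OBS_STATUS", "E")])
conn.executemany("INSERT INTO obs_wide VALUES (?, ?, ?, ?, ?)",
                 [("GB", "2013", 1.2, "GBP", None),
                  ("GB", "2014", 1.5, "GBP", "E")])

# Layout A: one join per attribute is needed to rebuild a flat row.
joined = conn.execute("""
    SELECT o.ref_area, o.time_period, o.value, u.attr_value, s.attr_value
    FROM obs AS o
    LEFT JOIN obs_attr AS u
           ON u.obs_id = o.obs_id AND u.attr_id = 'UNIT_MEASURE'
    LEFT JOIN obs_attr AS s
           ON s.obs_id = o.obs_id AND s.attr_id = 'OBS_STATUS'
    ORDER BY o.obs_id
""").fetchall()

# Layout B: extraction is a plain scan, at the cost of repeated values.
wide = conn.execute(
    "SELECT ref_area, time_period, value, unit_measure, obs_status "
    "FROM obs_wide ORDER BY time_period"
).fetchall()
```

Both queries return the same flat rows. The joined layout stores each attribute once but does more work at extraction time, while the wide layout is simpler to read from but uses more disk space and has to be rewritten when attributes change.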

Other questions also need answering: how data import files would be structured to include attributes; whether to allow free text of the kind used in abstracts, descriptions, and titles; whether current datasets would need to be re-created to include attributes; and what effect including attributes would have on the data extraction process.

Ultimately, the solution that provides the best data extraction experience for the user is deemed to outweigh considerations of complexity of data loading and cost of storage space, and work continues to find the optimal solution.

Another SDMX related topic under review was the integration of SDMX Reference Infrastructure (SDMX-RI) within .Stat. The rationale behind SDMX-RI is to provide the infrastructure and supporting tools to allow production of SDMX modelled data from existing data reference and dissemination databases.

A presentation by the Italian National Institute of Statistics (ISTAT) focused on how the Mapping Assistant SDMX-RI tool could be used in .Stat. The Mapping Assistant is intended to facilitate the mapping between the information held in a database of a dissemination environment such as .Stat, and the structural metadata for a dataset provided by an SDMX-ML Data Structure Definition (DSD). Broadly, a DSD describes a dataset in terms of concepts (dimensions) and code lists (dimension members). These mappings are maintained in a Mapping Store within the system.
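The kind of mapping described here can be pictured with a small Python sketch. This is not the Mapping Assistant's own data model or API; `ComponentMapping`, `to_sdmx_key`, and the column and code names are hypothetical, chosen only to show a local database row being translated into the dimensions and codes that a DSD defines.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentMapping:
    column: str        # column name in the dissemination database (assumed)
    component: str     # dimension id declared in the DSD (assumed)
    # Local code -> code from the DSD's code list; unmapped codes pass through.
    transcoding: dict = field(default_factory=dict)


def to_sdmx_key(row: dict, mappings: list) -> dict:
    """Translate one database row into a key expressed in DSD terms."""
    key = {}
    for m in mappings:
        local_code = row[m.column]
        key[m.component] = m.transcoding.get(local_code, local_code)
    return key


# A toy "mapping store" for one dataset.
mapping_store = [
    ComponentMapping("country", "REF_AREA", {"UK": "GB", "France": "FR"}),
    ComponentMapping("year", "TIME_PERIOD"),
]

row = {"country": "UK", "year": "2014", "amount": 1.5}
sdmx_key = to_sdmx_key(row, mapping_store)
```

In this sketch `sdmx_key` holds the row's position in the cube using the DSD's own dimension ids and codes, which is the information a mapping store has to keep for each dataset.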

It is hoped that setting up such a mapping store within .Stat would lead to users being able to browse a ‘catalogue’ of DSDs for datasets held by the database, via API or web interface, thereby assisting the user in the discovery of data. The mapping store could also be extended to include reference metadata, to further enhance the discovery of relevant datasets.

An additional potential use of SDMX-RI is to make Resource Description Framework (RDF) queries on the data possible via a web service. The model underpinning the RDF Data Cube vocabulary is compatible with the cube model that underlies SDMX, so SDMX can be “translated” into RDF. ISTAT have done proof of concept work on this with census data, thereby opening the possibility of taking .Stat into the realm of the Semantic Web and Linked Data dissemination for open data.
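To give a flavour of what such a translation involves, the following Python sketch turns a single observation into RDF-style triples using terms from the RDF Data Cube vocabulary and its companion SDMX dimension and measure namespaces. It is not ISTAT's implementation: the `example.org` base URI, the URI patterns and the function name are assumptions, and triples are represented as plain tuples rather than through an RDF library.

```python
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"   # hypothetical publisher namespace


def observation_to_triples(dataset_id, ref_area, ref_period, value):
    """Describe one observation as (subject, predicate, object) triples."""
    subject = f"{BASE}obs/{dataset_id}/{ref_area}/{ref_period}"
    return [
        (subject, RDF_TYPE, QB + "Observation"),
        (subject, QB + "dataSet", f"{BASE}dataset/{dataset_id}"),
        # Dimensions become properties pointing at codes or literal values.
        (subject, SDMX_DIM + "refArea", f"{BASE}code/area/{ref_area}"),
        (subject, SDMX_DIM + "refPeriod", ref_period),
        # The measured value becomes the observation's measure property.
        (subject, SDMX_MEASURE + "obsValue", value),
    ]


triples = observation_to_triples("census", "IT", "2011", 59433744)
```

Each observation in the cube becomes one subject with a property per dimension and one for the measure, which is why the two models line up so readily. The population figure in the example is a placeholder value.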

In addition to these SDMX initiatives for .Stat, other topics discussed included the introduction of an open search engine, support for very large datasets, and an analysis of the web data browser. All these developments will keep the Architecture Task Force busy over the coming months until the next workshop.

The OECD will be publishing a comprehensive ‘highlights report’ of the 2015 workshop at https://siscc.oecd.org/Home/Collaboration/ shortly.

Data Impact blog
