How web-scraping for Covid-19 data could inform policy

Diarmuid McDonnell discusses how social scientists can access real-time datasets on the Covid-19 outbreak, and how this data might be used to inform policy relating to the crisis.


At the UK Data Service, we are developing a training series on new forms of data for social scientists, such as how researchers might collect data from the web or social media platforms, or process and analyse large volumes of text.

As part of this series we are exploring ways in which social scientists can capture data relating to the Covid-19 emergency, which is having such a profound impact on our lives at this moment in time.

This will be covered in a webinar that I am hosting on April 23, which will be available online shortly afterwards.

The webinar will cover the step-by-step process of collecting, or web-scraping, data from a web page, including interactive sample code written in the popular Python programming language.
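To give a flavour of what that looks like, here is a minimal sketch using the requests and BeautifulSoup libraries to fetch a page and pull out its table rows. The URL is a placeholder rather than a real data source, so swap in the page you are interested in and check its terms of use and robots.txt first.

```python
# Minimal web-scraping sketch: download a page and extract its table rows.
# The URL is a hypothetical placeholder, not a real data source.
import requests
from bs4 import BeautifulSoup

url = "https://example.org/covid-19-statistics"  # placeholder page
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for row in soup.find_all("tr"):  # every table row on the page
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    print(cells)
```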

We have all been inspired by the work of medical staff on the front line of the Covid-19 crisis but there is also a push for social scientists to play a role in analysing some of the datasets being generated because of the outbreak.

Social scientists are well placed to use this data to provide theoretical and empirical insight into the socioeconomic consequences of this disease.

For example, time-series data on Covid-19 – which is publicly available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) – is well suited to causal analyses of the impact of policy interventions, such as identifying which interventions caused the biggest shift in the rate of infections.
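As an illustration, the short pandas sketch below loads the JHU CSSE confirmed-cases time series straight from GitHub and reshapes it into a long format suitable for analysis. The file path reflects the repository layout at the time of writing and may change, so check the CSSEGISandData/COVID-19 repository for the current file names.

```python
# Sketch: load the JHU CSSE confirmed-cases time series and tidy it.
# The raw GitHub path below may change over time; verify it against the
# CSSEGISandData/COVID-19 repository before relying on it.
import pandas as pd

url = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)
confirmed = pd.read_csv(url)

# Reshape from one column per date to a long, tidy format.
confirmed_long = confirmed.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="date",
    value_name="confirmed",
)
confirmed_long["date"] = pd.to_datetime(confirmed_long["date"])
print(confirmed_long.head())
```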

It is also possible to link Covid-19 data to other country-level datasets of interest to social scientists, such as quarterly statistics on the use of public transport in the UK (released by the Department for Transport) or fortnightly surveys of businesses, released as part of a range of faster experimental statistics by the Office for National Statistics (ONS).
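A rough sketch of what such a linkage might look like is below. Both file names are hypothetical stand-ins (one for a daily Covid-19 extract, one for a transport or business statistics download), and the join assumes the two tables share a country and date column; in practice you will need to harmonise country names and reporting periods first.

```python
# Hedged sketch of linking Covid-19 counts to another country-level series.
# Both CSV files are hypothetical placeholders for downloaded extracts.
import pandas as pd

covid = pd.read_csv("covid_daily_by_country.csv")  # e.g. derived from JHU CSSE
other = pd.read_csv("other_stats.csv")             # e.g. an ONS or DfT extract

covid["date"] = pd.to_datetime(covid["date"])
other["date"] = pd.to_datetime(other["date"])

# Keep only country-date combinations present in both tables.
linked = covid.merge(other, on=["country", "date"], how="inner")
print(linked.head())
```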

Image: Representation of data within the Covid-19 virus (adapted from original: Felipe Esquivel Reed, CC BY-SA 4.0)

It is clear that the Covid-19 crisis is a rapidly evolving public health emergency with considerable geographic and demographic variation in its impact.

By using web-scraping techniques, researchers can collect data in real time and on a routine basis. This is important because, with so much data being produced, some datasets might become overwritten, archived or lost over time; computational methods offer one means of ensuring you have the latest data on this disease.
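One simple way to do this is to save each download with a date stamp so that earlier snapshots are never overwritten, as in the sketch below (the URL is again a placeholder). A scheduler such as cron or Windows Task Scheduler can then run the script once a day.

```python
# Sketch of routine collection: save each download with a date stamp so
# earlier snapshots are kept. The URL is a hypothetical placeholder.
from datetime import datetime
import requests

url = "https://example.org/latest-covid-figures.csv"  # placeholder source
response = requests.get(url, timeout=30)
response.raise_for_status()

stamp = datetime.now().strftime("%Y-%m-%d")
with open(f"covid_snapshot_{stamp}.csv", "wb") as outfile:
    outfile.write(response.content)
```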

It is also important to choose carefully which datasets you track to avoid picking up misinformation.

A number of trusted, expert research centres and data producers are working tirelessly to capture and share the latest statistics relating to this disease. As mentioned above, I would start with the JHU CSSE and the ONS data.

While social media platforms are a useful means of rapidly disseminating material, perhaps take what you read with a pinch of salt until you can confirm its veracity via a trusted source.

My Webscraping for Social Science Research webinar will take place on April 23 from 3pm to 4pm.


About the author

Dr Diarmuid McDonnell (@DiarmuidMc) is a Research Associate at the Cathie Marsh Institute at the University of Manchester, focusing on new forms of data.

His research centres on charitable organisations, studying their accountability behaviour, motivations and regulatory environment. It employs large-scale administrative datasets that are mostly held by charity regulators in numerous jurisdictions.
