We wanted to mine data from HEFCE’s API of Impact Case Studies submitted to REF2014 to find out how many UK Data Service data collections were used in those case studies – and which case studies they were used in. Basic analysis of the database showed high usage of data, but very low citation using persistent identifiers (data DOIs).
John Matthews, software engineer in the census support team, talks us through the process:
“The first step was to figure out how the API worked. Luckily, the API is very well built and includes a good amount of documentation to go along with it. Sure, it might read like a wall of text, but some documentation is better than no documentation. The API itself works in a fairly regular way, with multiple endpoints and handy built-in functions that we can make use of later on.
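To give a feel for it, here’s a sketch of a single request to the one endpoint we actually rely on further down (SearchCaseStudies, filtered by Unit of Assessment); the API offers other endpoints too, but this is all we need here:
# Sketch only: fetch the case studies for a single Unit of Assessment (UoA 4) as json
wget -O UOA4.json "https://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies?UoA=4"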
The UK Data Service provides persistent identifiers so that users can cite data in the collection, but a lot of the case studies using data in the UK Data Service collection seemed to refer to data collections by name rather than by data DOI. So it made sense to start with a list of our collections from ukdataservice.ac.uk/get-data/key-data.
There’s a joke within programming circles that says “if you want to be the best programmer, you’ve got to be lazy.” Essentially: “script everything”. Why bother doing something manually when you can just get a computer to do it for you? Doing so will tune up your scripting and logic skills, plus you never know when something like this might come in handy later. So with this in mind, it shouldn’t come as a surprise that instead of copying down each and every survey title by hand, we just got a script to do it for us. Here’s how the script worked (a rough sketch follows the list):
– Get the source code from ukdataservice.ac.uk/get-data/key-data
– Find <ul class="list"> (there is only one of these tags in the whole source)
– For every <li> section find the first link inside it
– Grab the text in between the <a> and </a> tags
– Side note: We could have used the <h1> tag, but then we’d have to deal with removing the <a> tags inside it, so it was just easier to go for the link text directly.
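Pieced together, the whole thing boils down to something like the sketch below (assuming the page really does contain just the one <ul class="list"> block; our actual script may have differed in the details):
# Sketch: pull the key-data page and list the link text from the single <ul class="list"> block
curl -s "https://ukdataservice.ac.uk/get-data/key-data" |
  sed -n '/<ul class="list">/,/<\/ul>/p' |   # keep only the one <ul class="list"> ... </ul> block
  grep -o '<a [^>]*>[^<]*</a>' |             # grab each anchor element inside it
  sed 's/<[^>]*>//g'                         # strip the tags, leaving just the survey titles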
So now we’ve got a shiny new list of the key data collections held by the UK Data Service. The REF2014 case studies are organised by Unit of Assessment (UOA) number. A UOA is essentially the subject area that a submission falls under. For example, an article titled *Knee Injuries In Contact Sports* would likely fall under *UOA 26 – Sport and Exercise Sciences, Leisure and Tourism*.
The API allows us to grab all the case studies that fall under each UOA, so we downloaded everything from each Unit of Assessment into its own json file. We wrote a script to run through all 36 UOAs and grab the associated json:
# Download the case studies for each of the 36 Units of Assessment into its own json file
for i in {1..36}
do
  wget -O UOA$i.json "https://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies?UoA=$i"
done
Next on the to-do list was to search each of the 36 json files (all stored in the same directory) for every one of the UK Data Service collections on the list we created earlier. The best way to do this was to use a grep-like search tool for Unix systems called ack. Here’s the script first; we’ll go through it in a second:
# The UK Data Service collection names we scraped earlier (45 topics in total; list truncated here)
declare -a Topics=('1970 British Cohort Study' 'British Household Panel Survey' 'English Longitudinal Study of Ageing' 'Growing Up in Scotland' … );

for i in {0..44}
do
  echo "SEARCHING:" "${Topics[i]}"
  ack --ignore-case --files-with-matches --json "${Topics[i]}" ~/dev/public/projects/ref/
done
So first up we’ve got a huge array listing each of the UK Data Service collections. Pretty simple, not much going on here.
Next up we start the for loop. For the novice programmers out there, the reason we’re using a for loop and not a while loop is that we know the exact number of times the loop needs to run. If we didn’t, a while loop would work perfectly. This loop runs 45 times: bash counts from (and includes) 0, so {0..44} covers 45 items.
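For the curious, a while-loop version of the same thing would look something like the sketch below (it does exactly what the for loop above does; we’ll carry on walking through the for-loop version afterwards):
# Sketch: the same 45 searches driven by a while loop instead of a for loop
i=0
while [ $i -le 44 ]
do
  echo "SEARCHING:" "${Topics[i]}"
  ack --ignore-case --files-with-matches --json "${Topics[i]}" ~/dev/public/projects/ref/
  i=$((i+1))
done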
Once inside the loop we ran a simple echo command, just so that we (the user) know what’s being searched for at that moment in time.
The next line is where the magic happened. First up we called the ack command itself, and everything after that point is either a flag, a search string, or a directory:
--ignore-case: does exactly what it says on the tin; it ignores the case of the search term and so will match both upper- and lowercase characters.
--files-with-matches: only prints the names of the files containing matches, rather than the matching lines themselves.
--json: only searches through files of this type (i.e. files with the .json extension).
"${Topics[i]}": searches for the i-th item in the Topics array we just declared.
~/dev/public/projects/ref/: searches for the .json files in this directory.
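Putting those pieces together, the same search run by hand for just the first topic on the list would look something like this:
# Sketch: a one-off run of the search line for a single topic
ack --ignore-case --files-with-matches --json "1970 British Cohort Study" ~/dev/public/projects/ref/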
And that’s it! The bash terminal then outputs a list of all the UK Data Service data collections that are referenced or mentioned in any REF2014 Impact Case Study, along with the UOA files they turn up in:
SEARCHING: 1970 British Cohort Study
/home/john_matthews/dev/public/projects/ref/UOA4 – Psychology, Psychiatry and Neuroscience.json
SEARCHING: British Household Panel Survey
/home/john_matthews/dev/public/projects/ref/UOA22 – Social Work and Social Policy.json
SEARCHING: English Longitudinal Study of Ageing
SEARCHING: Growing Up in Scotland
SEARCHING: Longitudinal Study of Young People in England
SEARCHING: Millennium Cohort Study
/home/john_matthews/dev/public/projects/ref/UOA4 – Psychology, Psychiatry and Neuroscience.json
SEARCHING: National Child Development Study
/home/john_matthews/dev/public/projects/ref/UOA25 – Education.json
SEARCHING: Understanding Society
/home/john_matthews/dev/public/projects/ref/UOA4 – Psychology, Psychiatry and Neuroscience.json
SEARCHING: Annual Population Survey
SEARCHING: British Social Attitudes
The line SEARCHING: 1970 British Cohort Study tells us that ack is searching for the term “1970 British Cohort Study” in each of the json files. The line directly underneath tells us that ack found the term within the UOA4 json file. If you take a look at the line that reads SEARCHING: English Longitudinal Study of Ageing, you’ll notice that there is no file listed directly below it. This is because that particular search term was not found in any of the UOA json files.”
Our data mining is by no means complete – we haven’t yet run all of our data collections through the process – but we now hope to analyse ‘what impact where’ for UK Data Service data featuring in Impact Case Studies submitted to REF2014, and we’ll blog about it as we go.