David Kingman from HM Treasury discusses how reading books on R and Python has helped him achieve more impact from his work as a data scientist.
For as long as I’ve known how to read, I’ve been a voracious reader who enjoys books on a wide variety of subjects. In recent years the skills I’ve learnt from books have played a key role in my professional development as a data scientist, in ways which have helped me to generate more impact from my work.
Books have been particularly useful in my efforts to learn both R and Python. There are many well-written, accessible titles on both languages written by experts in their fields, including some by the developers who created the very R or Python packages I’ve been interested in learning about.
For me, books have a number of advantages as a tool for learning new skills. They are (generally) cheap and portable, and they make it easier to fit learning around other demands on my time. Virtually all books about R and Python also include worked code examples, which is great because I can immediately put the ideas I’ve been reading about into practice by repeating and adapting those examples.
How have books helped to generate impact?
Since I first started learning R, I’ve got two new jobs which have both involved using the language every day. In a sense, almost everything I’ve done at work over the past few years can be directly attributed to skills I’ve learned from reading books.
However, I have a few specific examples of where skills I’ve learnt from reading books have helped me to generate impact. Back when I was at the Intergenerational Foundation (IF), I worked on a project where I wanted to use the Energy Performance Certificates open database – a 20GB open dataset – to measure the number of new-build properties in every English local authority area which had been built over the previous ten years and which had less than 30 square metres of floorspace.
This presented me with a difficulty: I wanted to do the analysis using R, but R holds the data it is analysing in memory, so the size of dataset you can work with is limited by the amount of RAM your computer has. Because of this limitation I couldn’t import the whole dataset into a single R session in order to analyse it.
Fortunately, I had recently read a book about how to overcome exactly this type of problem: Big Data Analysis with R by Simon Walkowiak, which explains a number of different strategies for analysing ‘larger than memory’ datasets with R.
I learned from that book that the easiest way to do what I wanted was to use the RSQLite package to write a script that would create an SQLite database on my computer’s hard drive. I could then write all of the EPC data to this database programmatically from within my R session, and send SQL queries from R which summarised the data for each local authority one at a time. This meant I could answer my research question without ever having to load the entire dataset into memory at once. Another book which I’d read beforehand, Teach Yourself SQL in 10 Minutes by Ben Forta, also helped, as it had taught me how to write the SQL queries I needed to communicate with my new database.
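For anyone curious, the approach looks roughly like this in R. This is only a sketch: the file name, table name and column names here are illustrative rather than the real EPC field names.

```r
library(DBI)
library(RSQLite)

# Create (or open) an SQLite database file on the hard drive
con <- dbConnect(RSQLite::SQLite(), "epc.sqlite")

# Write the large CSV to the database in chunks, so the whole
# dataset never has to fit into RAM at once
readr::read_csv_chunked(
  "epc_certificates.csv",
  callback = function(chunk, pos) {
    dbWriteTable(con, "epc", chunk, append = TRUE)
  },
  chunk_size = 100000
)

# Let the database do the heavy lifting: summarise small new-build
# properties by local authority with a single SQL query
small_homes <- dbGetQuery(con, "
  SELECT local_authority, COUNT(*) AS n_small
  FROM epc
  WHERE total_floor_area < 30
  GROUP BY local_authority
")

dbDisconnect(con)
```

The key design point is that the filtering and counting happen inside SQLite, so only the small summary table ever comes back into the R session.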
This piece of research went on to be included in a report which IF subsequently published called Rabbit Hutch Homes: The growth of micro-homes (January 2020), which achieved a significant amount of press coverage and was submitted to several government inquiries on improving Britain’s housing supply.
Another example is my knowledge of how to create R Markdown documents to make pieces of analysis more impactful and easier to reproduce. R Markdown is a tool that I first came across while I was working at the Greater London Authority, which I thought was likely to be extremely useful in my career, so I read a copy of R Markdown: The definitive guide by Yihui Xie, J.J. Allaire and Garrett Grolemund to become more familiar with how to use the technology.
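For anyone who hasn’t come across it, an R Markdown document is a plain-text file which mixes narrative text with chunks of R code that are executed when the report is rendered, so re-running one command regenerates every figure and number. A minimal example might look like this (the title and chunk contents are purely illustrative):

````
---
title: "London's population"
output: html_document
---

The chart below is regenerated from the latest data every time
this document is rendered.

```{r population-plot, echo=FALSE}
plot(pressure)  # placeholder for the real analysis code
```
````

Rendering the file with rmarkdown::render() then produces the finished HTML report.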
By applying the things I read in that book I’ve now had quite a lot of experience of working with R Markdown, and when I was being interviewed for the job I have now as a Senior Policy Advisor in HM Treasury I actually gave a demonstration of an R Markdown document I’d created which automatically updates a report about London’s demography. The document is available to the public on the London Datastore.
Since I’ve started working at HM Treasury, I’ve provided a live training demonstration on writing R Markdown documents to a group of my colleagues, and I’ve used R Markdown to write a report for a project which involved comparing how much different types of people use public transport using data from the National Travel Survey and Understanding Society.
Following the success of this project, my team decided to use R Markdown documents to produce the documentation for some of our other projects, which demonstrates how my knowledge of R Markdown has helped me to generate impact.
In addition to the ones I’ve already mentioned, there are a few books I can definitely recommend which I’ve found especially useful for developing my skills as a data scientist with R.
I’ve now taught R to other people in all three of my most recent jobs, and the one book I always recommend to anyone who is trying to learn R from scratch is R for Data Science by Hadley Wickham and Garrett Grolemund (like a lot of books about R, it is available free online as a ‘book’ created with R’s bookdown package, or you can buy a physical copy). This was the book from which I first really learned about the power of the tidyverse and ggplot, and it explains everything a beginner needs to know to undertake data analysis projects from beginning to end within R.
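To give a flavour of what that book teaches, a typical tidyverse pipeline reads almost like a sentence. This sketch uses R’s built-in mtcars dataset purely as an illustration:

```r
library(dplyr)
library(ggplot2)

# Summarise fuel economy by number of cylinders, then plot the result
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")
```

Each verb in the pipeline does one small, readable step, which is a large part of what makes the tidyverse approachable for beginners.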
Hadley Wickham is both an incredibly influential R developer and an accessible and entertaining writer, so I would also highly recommend any of his other books. These include ggplot2: Elegant graphics for data analysis, R packages for people who are interested in package development, and Advanced R. This last one is the book which really showed me the inner workings of R as a programming language, which I’ve subsequently drawn on to write more complicated functions and programs. Wickham has also recently published Mastering Shiny, a guide to developing Shiny dashboards with R, which I’m very much looking forward to reading.
There are also many excellent books which cover statistics and machine learning using R, of which one of my favourites is An Introduction to Statistical Learning with Applications in R by James et al. I found it really helped me to understand both the theory behind and the practical uses of some very powerful machine learning algorithms, such as decision trees and support vector machines.
This really just scratches the surface of what you can learn about programming and data science using nothing more complicated than the medieval technology of the printed book. Hopefully it has demonstrated what a useful tool reading can be for anyone interested in learning more about these subjects.
David Kingman is one of the UK Data Service Data Impact Fellows 2019. He is a Senior Policy Advisor at HM Treasury, where he works on policy microsimulation and data science. Prior to that he was a Senior Research and Statistical Analyst at the Greater London Authority, and Senior Researcher at the Intergenerational Foundation think tank, where his research focused on investigating the implications of an ageing population on UK society.