Data download and preparation
In this tutorial we will look at generating some basic statistics in R using a subset of the Food Hygiene Rating Scheme (FHRS) dataset provided by the Food Standards Agency (FSA).
Visit http://ratings.food.gov.uk/open-data/en-GB now and download the data for an area you are interested in. I’ve downloaded City of London Corporation.
R is able to parse XML files but it’s easier to load the file into Excel (or a similar package) and save as a CSV file (visit this page if you’re unsure how to do this: https://support.office.com/en-us/article/import-xml-data-6eca3906-d6c9-4f0d-b911-c736da817fa4).
R and RStudio
R is a statistical programming language and data environment.
Unlike other statistics software packages (such as SPSS and Stata) which have point-and-click interfaces, R runs from the command line. The main advantage of using the command line is that scripts can be saved and quickly rerun, promoting reproducible outputs. If you’re completely new to R, you may want to follow an introductory tutorial beforehand to learn R’s basic syntax.
The most commonly used graphical interface for R is RStudio (https://www.rstudio.com/products/rstudio/). I highly recommend it, as it has nifty functionality such as syntax highlighting and auto-completion, which help ease the transition from point-and-click to command-line programming.
Once installed, launch RStudio. You should see something similar to this setup with the ‘Console’ on the left-hand side, the ‘Environment window’ on the top right and another window with several tabs (Files, Plots, Packages, Help, Viewer) on the bottom right:
Don’t worry if your screen looks slightly different; you can visit View > Panes from the top menu to change the layout of the windows.
The console area is where code is executed. Outputs and error messages are also printed here but content within this area cannot be saved. As one of the main advantages of using R is its ability to create easily reproducible outputs, let’s create a new script which we can save and rerun later. Hit CTRL+SHIFT+N to create a new script. Save this within your working directory using the save icon.
Let’s get on with loading our data. Type
data = read.csv(file.choose())
into the script file and hit CTRL + Enter whilst your cursor is on the same line to run the command. You can also highlight a block of code and use CTRL + Enter to run the whole thing.
You should see a file browser window; navigate to the CSV file you saved earlier containing the FHRS data. Note the syntax of this command: it creates a variable called data on the left-hand side of the equals sign and assigns to it the contents of the file loaded using the read.csv command. Once loaded, you should see the new variable, data, appear in the environment window on the right-hand side. To view the data, double-click the variable name in the environment window and it will appear as a new tab in the left-hand window. Note the variables that this data contains. The object includes useful information such as the business name, rating value, last inspection date and address.
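The loaded object can also be inspected from the script rather than by clicking. The following is a minimal sketch using base R's names, str and head functions. The small data frame is a made-up stand-in for the FHRS file (the real file has many more columns), and the business names are hypothetical.

```r
# Hypothetical stand-in for the data frame returned by read.csv();
# with the real file you would skip this step.
data = data.frame(
  BusinessName = c("Cafe A", "Bakery B", "Diner C"),
  RatingValue  = c("5", "3", "0"),
  stringsAsFactors = FALSE
)

names(data)  # the variable (column) names
str(data)    # the type of each column and its first few values
head(data)   # the first rows of the data
```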
Let’s do some basic analysis. To remove any records with missing values, first run:
data = data[complete.cases(data),]
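Here we pass our data variable into complete.cases, which returns TRUE for each row that has no missing values. Using that result as the row index keeps only the complete rows, and the assignment overwrites our original object.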
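To see what complete.cases returns before it is used as a row index, here is a small sketch on a made-up data frame (the names and values are hypothetical). Note that, by default, read.csv leaves empty text fields as empty strings rather than NA, so such rows would not be dropped by this step.

```r
# Hypothetical toy data with missing values in two of the three rows.
toy = data.frame(
  BusinessName = c("Cafe A", NA, "Diner C"),
  RatingValue  = c("5", "3", NA),
  stringsAsFactors = FALSE
)

keep = complete.cases(toy)  # one logical value per row: TRUE if no NA
print(keep)                 # TRUE FALSE FALSE

toy = toy[keep, ]           # keep only the complete rows
print(nrow(toy))            # 1
```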
To run some basic statistics we need to convert the RatingValue variable to an integer:
data$RatingValue = strtoi(data$RatingValue, base = 0L)
Note how we use the $ to access the variables of our data object.
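The conversion can be tried on a short vector first. This sketch assumes the column may contain text that is not a number; the value "Exempt" below is illustrative. Such entries become NA after conversion, so it is worth counting them.

```r
# Illustrative rating strings; "Exempt" stands in for any non-numeric entry.
ratings = c("5", "0", "3", "Exempt")

converted = strtoi(ratings, base = 0L)  # same call as used in the tutorial
print(converted)                        # text that is not a number becomes NA

print(sum(is.na(converted)))            # how many values could not be converted
```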
To see the minimum and maximum rating values of food outlets in London, we can use the min and max functions on the RatingValue variable, for example min(data$RatingValue) and max(data$RatingValue).
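A minimal sketch of these two calls, assuming base R's min and max applied to the RatingValue column. The toy data frame is a made-up stand-in for the FHRS file, and na.rm = TRUE is an addition here so that any NA left by the integer conversion does not turn the result into NA.

```r
# Hypothetical ratings, including one missing value.
data = data.frame(RatingValue = c(5L, 0L, 3L, NA))

lowest  = min(data$RatingValue, na.rm = TRUE)  # smallest rating, ignoring NA
highest = max(data$RatingValue, na.rm = TRUE)  # largest rating, ignoring NA

print(c(lowest, highest))
```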
These commands simply give us the minimum and maximum values without any additional information. To see the full records for these particular establishments, we can take a subset of our data that includes only those awarded a zero-star rating, for example:
star0 = data[which(data$RatingValue==0), ]
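The which call in this line matters when the column contains NA. The sketch below, on made-up data with hypothetical business names, compares the subset with and without which.

```r
# Hypothetical data: one zero rating and one missing rating.
data = data.frame(
  BusinessName = c("Cafe A", "Bakery B", "Diner C"),
  RatingValue  = c(5L, 0L, NA),
  stringsAsFactors = FALSE
)

# which() returns only the positions where the comparison is TRUE,
# so rows with a missing rating are left out.
star0 = data[which(data$RatingValue == 0), ]
print(star0$BusinessName)

# Without which(), the NA comparison produces an extra row filled with NA.
without_which = data[data$RatingValue == 0, ]
print(nrow(without_which))
```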
Creating a graph
Lastly, let’s create a bar chart to look at the distribution of star ratings for food outlets in London. We will use the ggplot2 library. To install and then load this library, call install.packages("ggplot2") followed by library(ggplot2).
To create a simple bar chart, use the following code:
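When a script is rerun, the package does not need to be installed again. One possible pattern, sketched here with base R's requireNamespace, installs ggplot2 only when it is missing and then loads it. This assumes an interactive session (such as RStudio) with a CRAN mirror already configured.

```r
# Install ggplot2 only if it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so that ggplot(), aes(), geom_bar() and labs() can be used.
library(ggplot2)
```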
ggplot(data = data, aes(x = RatingValue)) + geom_bar(stat = "count")
Here you can see we have passed RatingValue as the x-axis variable in the ‘aesthetics’ function and passed in ‘count’ as the statistic. The output is a bar chart showing the number of food outlets at each rating value.
To add x and y labels and a title to your graph use the labs command at the end of the previous line of code:
ggplot(data = data, aes(x = RatingValue)) + geom_bar(stat = "count") + labs(x = "Rating Value", y = 'Number of Food Outlets', title = 'Food Outlet Rating Values in London')
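To check the bar heights numerically, the tallies that stat = "count" draws can be reproduced with base R's table function, which counts how many times each value occurs. The ratings below are made up for illustration.

```r
# Hypothetical ratings for six outlets.
ratings = c(5L, 5L, 4L, 0L, 5L, 3L)

counts = table(ratings)  # number of outlets at each rating value
print(counts)
```

With the real data, table(data$RatingValue) should give one count per bar in the chart.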
About the author
Rachel Oldroyd is one of our UK Data Service Data Impact Fellows. Rachel is a quantitative human geographer based at the Consumer Data Research Centre (CDRC) at the University of Leeds, researching how different types of data (including TripAdvisor reviews and social media) are used to detect illness caused by contaminated food or drink.