Joseph Allen discusses how researchers can benefit from using Twitter data for research, and how to get started.
Generating a snapshot with survey data and machine learning
When we try to analyse an individual, we only have access to a snapshot of who this person is.
We could collect quantitative data on ages, sex, family details, salary and beyond. While these data points are valuable, they only capture a part of an individual’s complicated history and limitless potential.
Beyond this, we make a few assumptions:
- Our data is recent enough to affect upcoming policy.
- Our individuals told the truth, to us and themselves. Pessimists might downplay the impact of a diet. Many of us would overstate the amount of exercise we do.
- Events outside our dataset are insignificant. An individual may have hidden traumas or pressures steering their reported behaviour.
Imagine an individual reporting a salary of £25,000.
This might imply that they are early in their career, climbing in salary with each job change. Or, this could be the salary of somebody tired of a corporate career retraining as a teacher.
With traditional machine learning, these attributes may have implied connections. We could predict somebody’s salary from their age, relationship status, education and more. Or, we could ask them about their career.
As humans, we understand the qualitative side of data. If you were to ask a friend how their new vegan diet is going, you might recognise their exhaustion or glee whatever words they choose. When we communicate in person, we pick up on words, body language, intonation and more.
This is where tweets come into play
A tweet is a snapshot of a topic an individual feels is important enough to share. Limited to 280 characters, it tends to be high in sentiment and focused in topic.
Twitter provides web APIs which allow us to collect the tweet history of millions of individuals. Individuals offer up this information freely, which is fantastic from a social science perspective but perhaps dangerous from a societal one.
A social scientist interested in the perception of a topic such as “veganism” over time now has a tool that, with a small tech investment, can collect thousands of tweets from users classified as vegan. This data is public, freely available, and cheap to collect. The traditional social science alternative would be paying individuals for qualitative interviews, which still have their uses.
Beyond this, Twitter is also a cluttered, opinionated and messy space. We need to be aware of the double-edged nature of any data source, particularly social media data.
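As a rough illustration of what tracking a topic such as “veganism” over time could look like once tweets have been collected, here is a minimal Python sketch. The records are invented for the example, and the function name monthly_mentions and the two field names are assumptions rather than a fixed schema. Lower-casing the text is one small concession to how messy tweets are.

```python
from collections import Counter

# Hypothetical, hand-written records standing in for collected tweets.
# The field names "created_at" and "text" are assumptions for this sketch.
tweets = [
    {"created_at": "2021-01-03T09:15:00Z", "text": "Trying Veganuary this year!"},
    {"created_at": "2021-01-20T18:40:00Z", "text": "Vegan cheese is better than I expected"},
    {"created_at": "2021-02-02T12:00:00Z", "text": "Back to the gym today"},
    {"created_at": "2021-02-11T07:30:00Z", "text": "Three weeks vegan and feeling great"},
]


def monthly_mentions(tweets, keyword):
    """Count tweets per calendar month whose text contains the keyword."""
    needle = keyword.lower()
    counts = Counter()
    for tweet in tweets:
        # Case-insensitive substring match, so "Veganuary" also counts.
        if needle in tweet["text"].lower():
            # The first seven characters of an ISO timestamp are "YYYY-MM".
            counts[tweet["created_at"][:7]] += 1
    return dict(sorted(counts.items()))


print(monthly_mentions(tweets, "vegan"))
```

A count like this says nothing about sentiment, and substring matching will pick up irrelevant tweets, so it is best treated as a first look at the data rather than a finding.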
Accessing Twitter data
There are a few tools that make analysing Twitter data easier. Twitter has a built-in analytics service that showcases some basic features. There are fantastic third-party tools including Twitter Archive Analyzer.
These tools serve as sources of inspiration for academic research. They showcase what can be gleaned from Twitter data. Both are limited in functionality compared to the depth of the Twitter data itself. With 500 published works from the University of Manchester making use of Twitter data, it’s worth exploring.
With the standard track of Twitter’s API, you can only access the last 7 days of data. Until recently, the two ways around this limitation were:
- Paying for a commercial license – giving access to 30 days of Twitter data.
- Web scraping – giving access to all Twitter data, but at the cost of a more technical solution.
Fortunately, Twitter has noticed this growing need for Twitter data in research. Its solution is a new Academic Tier. The application process is simple and quick, with a turnaround of a few days. This new tier gives access to an archive search and allows you to collect up to 10,000,000 Tweets.
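To give a feel for what an archive search involves, here is a minimal Python sketch that builds a request without sending it. It assumes the version 2 full-archive search endpoint and the query, start_time, end_time, max_results and tweet.fields parameters from Twitter's API documentation, so check the current documentation before relying on them. The function name build_archive_search and the placeholder token are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: the v2 full-archive search URL from Twitter's API docs.
SEARCH_ALL_URL = "https://api.twitter.com/2/tweets/search/all"


def build_archive_search(bearer_token, query, start_time, end_time, max_results=100):
    """Build (but do not send) a full-archive search request."""
    params = {
        "query": query,
        "start_time": start_time,  # ISO 8601 timestamp, e.g. 2019-01-01T00:00:00Z
        "end_time": end_time,
        "max_results": max_results,
        "tweet.fields": "created_at,lang",
    }
    url = f"{SEARCH_ALL_URL}?{urlencode(params)}"
    # The bearer token comes with an approved developer account.
    return Request(url, headers={"Authorization": f"Bearer {bearer_token}"})


request = build_archive_search(
    "YOUR_BEARER_TOKEN",  # placeholder, not a real credential
    "vegan lang:en -is:retweet",
    "2019-01-01T00:00:00Z",
    "2019-02-01T00:00:00Z",
)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would send it and return JSON. Keeping the construction separate makes it easy to inspect the query before spending any of your Tweet allowance.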
If you are a researcher at the University of Manchester, you can join the Research IT Drop-in sessions for support.
Joseph Allen is a Research Associate at the UK Data Service, based at the Cathie Marsh Institute for Social Research at the University of Manchester.