Introduction Into R for Social Scientists
Dr. Elze G. Ufkes
This handout shows a step-by-step approach to harvesting, visualizing, and analyzing Twitter data. You can use it to run the code on your own computer. Copying the code from the gray blocks and running it should produce the same output as shown in the document.
2 Activating libraries used for this assignment
First we'll need to activate the libraries used for this assignment. Remember that if this is the first time you use a package, you need to install it first using install.packages().
library(twitteR)
library(tm)
library(wordcloud)
library(psych)
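If you are not sure which of these packages are already on your computer, a small check like the one below may help. This is only a sketch: the helper name missing_packages is made up for this example and is not one of the functions provided with the handout. It reports what is missing and prints the install command instead of installing anything itself.

```r
# Hypothetical helper (not part of the handout's functions):
# returns the packages from `pkgs` that cannot be loaded,
# which usually means they are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("twitteR", "tm", "wordcloud", "psych")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  # Print the command instead of running it, so nothing is installed by accident
  message(
    "Run: install.packages(c(",
    paste0('"', to_install, '"', collapse = ", "),
    "))"
  )
} else {
  message("All required packages are installed.")
}
```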
3 Load functions
To make the lecture easier for you we created some functions that will do some work for you. You can load these functions with the command below.
4 Log in to twitter
Before we can start, we need to log in to a Twitter account (make sure you're connected to the internet). For this assignment I will put the login details on Canvas. If you want to use your own Twitter account, you can find the keys and tokens on apps.twitter.com.
# Store your login details in the following objects
consumer_key <- ""
consumer_secret <- ""
access_token <- ""
access_secret <- ""
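If you prefer not to type your keys into a script that you might share, one option is to read them from environment variables. The sketch below is an assumption on my part, not something the assignment requires: the helper get_credential and the variable name TWITTER_CONSUMER_KEY are invented for illustration, and the placeholder value is set only so the example runs.

```r
# Hypothetical helper: read one login detail from an environment variable
# and stop with a clear message if it has not been set.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# For demonstration only: set a placeholder value for this R session.
# In practice you would set the real value outside your script.
Sys.setenv(TWITTER_CONSUMER_KEY = "placeholder-key")

consumer_key <- get_credential("TWITTER_CONSUMER_KEY")
```

The same pattern would apply to the other three login details.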
Establish the connection with the following code, then enter 1 in the console to confirm that you want to cache the connection.
# Log in to Twitter
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
5 Harvesting twitter data
Okay, we're set! Now we can "harvest" data currently shared on Twitter. For instance, use the following code to get the 3 most recent tweets from our department's Twitter account. Note that setting includeRts to TRUE or FALSE determines whether you harvest re-tweets as well or only original tweets.
# Search for tweets of specific user
userTimeline("PCRSuTwente", n = 3, includeRts = TRUE)
## [[1]]
## [1] "PCRSuTwente: Do you want to get a better idea of the researchers and lecturers working at our department? Watch our video! <U+2935><U+FE0F> https://t.co/8z4gsVzlUr"
##
## [[2]]
## [1] "PCRSuTwente: Yesterday prof. dr. Skip Rizzo came all the way to @utwenteEN to give a great talk on \"Is Clinical Virtual Reality https://t.co/Caxubl7Gy9"
##
## [[3]]
## [1] "PCRSuTwente: RT @DefensieOnline: Staatssecretaris Visser: Dit rapport is een belangrijke stap in het herstel van Defensie. We hebben de sociale veiligh "
Or use the following code to search for 10 tweets containing a specific keyword. Feel free to change the keyword to something you find interesting.
# Search for specific keyword
searchTwitter("trump", n = 10)
The next step is to not only look at the tweets (we can do that online on Twitter). We want to analyze the tweets! The first step is to save the tweets we're interested in to an R object. You can give any name to this object, so let's use tweets. See how we used the previous command to create this R object?
# Save n = 100 tweets to a list
tweets <- searchTwitter("trump", n = 100)
Okay, now we can inspect the information that is stored in the object tweets we just created. For instance, try the following code to show just the first tweet stored in this object. Since tweets is a list containing 100 Twitter messages, we can type the name of the object, tweets, followed by the number of the tweet you want to see in square brackets, [1], to show the first message.
## [[1]]
## [1] "MurfAD: RT @mchooyah: ADM McRaven was born to lead this mission. Ill follow him anywhere. If only people heard the real speech he gave the Team... "
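The difference between single and double square brackets on a list is easy to mix up, so here is a small illustration. It uses a made-up toy list (toy_tweets) as a stand-in for the harvested tweets, so it runs without a Twitter connection; the field names mirror the ones used in this handout, but the values are invented.

```r
# Toy stand-in for a list of tweets (invented data, not real tweets)
toy_tweets <- list(
  list(screenName = "user_a", text = "first message", retweetCount = 3),
  list(screenName = "user_b", text = "second message", retweetCount = 0)
)

# Single brackets return a (shorter) list that still wraps the element
first_as_list <- toy_tweets[1]

# Double brackets return the element itself
first_element <- toy_tweets[[1]]

length(first_as_list)       # the wrapper list holds one tweet
first_element$retweetCount  # fields are reachable once you have the element
```

In other words, use single brackets to select which tweets to show and double brackets when you want to reach inside one tweet.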
We can also have a look at the so-called meta-data connected to this tweet. For instance, how often did Twitter users re-tweet this message?
## [1] 750
Similarly you can use the following functions (try it!):
$screenName to get the name of the Twitter account tweeting this message
$favoriteCount to get the number of likes or favorites
$retweetCount to get the number of re-tweets
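To look at one of these fields for every tweet at once rather than one tweet at a time, you can loop over the list. The sketch below again uses an invented toy list, so the numbers are not real; with real harvested tweets the same idea should apply, although the exact types of the fields may differ.

```r
# Toy stand-in for a list of tweets (invented data)
toy_tweets <- list(
  list(screenName = "user_a", favoriteCount = 5, retweetCount = 3),
  list(screenName = "user_b", favoriteCount = 0, retweetCount = 0),
  list(screenName = "user_c", favoriteCount = 12, retweetCount = 7)
)

# Pull one field out of every element of the list
retweets <- vapply(toy_tweets, function(tw) tw$retweetCount, numeric(1))
authors <- vapply(toy_tweets, function(tw) tw$screenName, character(1))

# Which account wrote the most re-tweeted message?
most_retweeted <- authors[which.max(retweets)]
most_retweeted
```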
Finally, you can also download meta-data (information) on any public Twitter profile. For instance, try the following.
# Save information from a specific user, based on their twitter name
user <- getUser("PCRSuTwente")

# Now you can inspect for instance the number of followers for this user.
user$followersCount
## [1] 91
Similarly you can use the following functions (try it! What do they do?):
6 Consequences of a terrorist attack
On Tuesday, October 31st, 2017, a man drove his truck into people on a bike path in Manhattan, NYC. Terror attacks may have consequences for how people feel. But do they also affect how people tweet? For instance, how they tweet about the US president? Let's find out!
First we can download a random selection of Twitter messages mentioning Trump in the days before (pre) and after (post) the attack in NYC. In this case we download the maximum number of tweets Twitter allows you to download at once using their public access point: 1500.
Note that Twitter only allows you to download messages based on a keyword from the past 6-9 days. You can therefore skip the next step, but you can use the commands below as a template for your own Twitter search.
# Harvesting messages
nyc_pre <- searchTwitter(
  "trump", n = 1500, lang = "en",
  since = "2017-10-29", until = "2017-10-31"
)
nyc_post <- searchTwitter(
  "trump", n = 1500, lang = "en",
  since = "2017-10-31", until = "2017-11-02"
)
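If you adapt this search to another event, you can let R compute the date strings for you instead of typing them by hand. The sketch below assumes a window of two days before and two days after the event, matching the search above; the object names event_date, pre_window and post_window are made up for this example.

```r
# Date of the event, written as year-month-day
event_date <- as.Date("2017-10-31")

# Two days before up to the event, and the event up to two days after.
# format() turns the dates into zero-padded "YYYY-MM-DD" strings.
pre_window <- format(c(event_date - 2, event_date))
post_window <- format(c(event_date, event_date + 2))

pre_window
post_window
```

The first element of each window could then be passed as since and the second as until.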
To load the results from the search above, download "Session-03-Data-Trump-Pre-and-Post-NYC.zip" from the website, unpack it, and run the code below.
7 Creating a word cloud of most frequently tweeted words
The searchTwitter function stores the tweets in a large list, which is hard to inspect. Therefore, use the following transformation to convert the tweets to a data frame, which you can inspect more easily.
# Convert to dataframe format
nyc_pre_df <- twListToDF(nyc_pre)
nyc_post_df <- twListToDF(nyc_post)
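To get a feeling for what such a conversion does, here is a toy version in base R. It is not how twListToDF is implemented; it only illustrates the general idea of turning a list with one element per tweet into a table with one row per tweet. The list toy_tweets and its values are invented.

```r
# Toy stand-in for a list of tweets (invented data)
toy_tweets <- list(
  list(screenName = "user_a", text = "first message", retweetCount = 3),
  list(screenName = "user_b", text = "second message", retweetCount = 0)
)

# Turn every list element into a one-row data frame, then stack the rows
toy_df <- do.call(
  rbind,
  lapply(toy_tweets, function(tw) as.data.frame(tw, stringsAsFactors = FALSE))
)

# Each field is now a column, and each tweet is a row
str(toy_df)
```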
Then, as a final preparation step, the text needs to be "cleaned". That is, we'll remove all formatting, punctuation, and other information irrelevant for the next steps. We will use one of the custom-built functions, twitter_clean_up, to do this.
nyc_pre_df$text <- twitter_clean_up(nyc_pre_df$text)
nyc_post_df$text <- twitter_clean_up(nyc_post_df$text)
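If you are curious what this kind of cleaning can look like, the sketch below shows one possible approach in base R. It is not the twitter_clean_up function from this course, which may work differently; the name clean_text_sketch and the specific cleaning rules are assumptions made for illustration.

```r
# Hypothetical clean-up function (not the course's twitter_clean_up)
clean_text_sketch <- function(x) {
  x <- gsub("http\\S+", " ", x, perl = TRUE)                   # drop links
  x <- gsub("@\\w+", " ", x, perl = TRUE)                      # drop @mentions
  x <- gsub("[^[:alpha:][:space:]]", " ", x, perl = TRUE)      # drop punctuation and digits
  x <- tolower(x)                                              # lower case everything
  x <- gsub("\\s+", " ", x, perl = TRUE)                       # collapse repeated spaces
  trimws(x)                                                    # trim leading/trailing spaces
}

# Invented example message
clean_text_sketch("RT @user: Great talk! https://t.co/abc123 #VR")
```

Note that the order of the steps matters: links are removed first, because stripping punctuation first would break them into loose words.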
Now open the Environment pane (top right in RStudio) and click on nyc_pre_df to inspect the data set.
A powerful advantage of R is the range of possibilities it offers for visualizing data. For text data, a useful way to visualize the data is to create a word cloud of the most frequently used words. The word cloud below uses font size to show how often a word is used: bigger words are tweeted more often.
To get to the word cloud we need to transform the data set a couple of times. Don't worry about these steps: we also created a custom function (twitter_word_freq) for you that is already loaded.
You only need to define words that will be excluded (stop words) from the word cloud. For instance, we don't want to see words like "and", "or", or "trump".
# Define stopwords that will be excluded from word cloud
stopwords <- c(
  "trump", "trumps", "donald", "http", "https",
  stopwords("english")
)

# Create a data frame with words and their frequencies
nyc_pre_freq_df <- twitter_word_freq(nyc_pre_df$text, stopwords)
nyc_post_freq_df <- twitter_word_freq(nyc_post_df$text, stopwords)
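For those who want to see what counting words roughly involves, here is a simplified sketch in base R. It is not the twitter_word_freq function itself, whose internals are not shown in this handout; the name word_freq_sketch is made up, and the only thing taken from the handout is that the result has a word column and a freq column.

```r
# Hypothetical word counter (not the course's twitter_word_freq)
word_freq_sketch <- function(texts, stop_words) {
  # Split every message into lower-case words
  words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
  # Drop empty strings and the words we want to exclude
  words <- words[nchar(words) > 0 & !(words %in% stop_words)]
  # Count each word and sort from most to least frequent
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(
    word = as.character(names(counts)),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Invented example messages
toy_freq <- word_freq_sketch(
  c("tax plan tax", "the tax vote", "vote now"),
  stop_words = c("the")
)
toy_freq
```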
Okay now we're ready to create the word cloud for the tweets before the attack using the code below.
# word cloud
wordcloud(
  nyc_pre_freq_df$word, nyc_pre_freq_df$freq,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2"),
  min.freq = 20
)
And similarly you can create the word cloud for the tweets after the attack using the code below.
# word cloud
wordcloud(
  nyc_post_freq_df$word, nyc_post_freq_df$freq,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2"),
  min.freq = 20
)
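Comparing two word clouds by eye can be tricky, so you may also want to put the two frequency tables side by side. The sketch below shows one way to do that. It uses two small invented tables (toy_pre and toy_post) with the same word and freq columns as the frequency data frames above; the numbers are not real results.

```r
# Invented frequency tables standing in for the pre and post data
toy_pre <- data.frame(
  word = c("tax", "vote", "news"),
  freq = c(30, 25, 20),
  stringsAsFactors = FALSE
)
toy_post <- data.frame(
  word = c("attack", "tax", "nyc"),
  freq = c(40, 22, 35),
  stringsAsFactors = FALSE
)

# Keep every word that appears in either table
both <- merge(toy_pre, toy_post, by = "word", all = TRUE,
              suffixes = c("_pre", "_post"))

# A word missing from one table was used 0 times there
both$freq_pre[is.na(both$freq_pre)] <- 0
both$freq_post[is.na(both$freq_post)] <- 0

# Positive values: used more often after the event
both$change <- both$freq_post - both$freq_pre
both <- both[order(both$change, decreasing = TRUE), ]
both
```

Keep in mind that a difference in raw counts is only meaningful when both searches returned a similar number of tweets.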