1 Instructions

This handout shows a step-by-step approach to harvesting, visualizing and analyzing Twitter data. You can use it to run the code on your own computer. Copying and pasting the code in the gray blocks and running it should produce the same outcomes as shown in this document.

2 Activating libraries used for this assignment

First we'll need to load the libraries used for this assignment. Remember that if this is the first time you use a package, you need to install it with install.packages() first.

library(twitteR)   # harvesting tweets via the Twitter API
library(tm)        # text mining (cleaning, term-document matrices)
library(wordcloud) # word cloud visualization
library(psych)     # descriptive statistics (describe)

3 Load functions

To make the lecture easier we created some functions that do part of the work for you. You can load these functions with the command below.

source("S03D02-Twitter-Functions.R")

4 Log in to Twitter

Before we can start we need to log in to a Twitter account (make sure you're connected to the internet). For this assignment I will put the login details on Canvas. If you want to use your own Twitter account, you can generate these keys and tokens yourself at apps.twitter.com.

# Store your login details in the following objects

consumer_key <- ""
consumer_secret <- ""
access_token <- ""
access_secret <- ""

Establish the connection with the following code and enter 1 in the console to confirm that you want to cache the connection.

# Login to twitter

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"

5 Harvesting Twitter data

Okay, we're set! Now we can "harvest" data currently shared on Twitter. For instance, use the following code to get the 3 most recent tweets from our department's Twitter account. Note that setting includeRts to TRUE or FALSE determines whether you harvest re-tweets as well or only original tweets.

# Search for tweets of a specific user

userTimeline("PCRSuTwente", n = 3, includeRts = TRUE)
## [[1]]
## [1] "PCRSuTwente: Do you want to get a better idea of the researchers and lecturers working at our department? Watch our video! <U+2935><U+FE0F> https://t.co/8z4gsVzlUr"
## 
## [[2]]
## [1] "PCRSuTwente: Yesterday prof. dr. Skip Rizzo came all the way to @utwenteEN to give a great talk on \"Is Clinical Virtual RealityÂ… https://t.co/Caxubl7Gy9"
## 
## [[3]]
## [1] "PCRSuTwente: RT @DefensieOnline: Staatssecretaris Visser: “Dit rapport is een belangrijke stap in het herstel van Defensie. We hebben de sociale veiligh…"

Or use the following code to search for 10 tweets containing a specific keyword. Feel free to change the keyword to something you find interesting.

# Search for specific keyword

searchTwitter("trump", n = 10)

The next step is to do more than just look at the tweets (we can do that online on Twitter): we want to analyze them! The first step is to save the tweets we're interested in to an R object. You can give this object any name; let's use tweets. See how we used the previous command to create this R object?

# Save n = 100 tweets to a list
tweets <- searchTwitter("trump", n = 100)
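
Searches sometimes return fewer tweets than requested, so it is worth checking how many were actually saved:

# Check how many tweets were actually returned
length(tweets)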

Okay, now we can inspect the information stored in the object tweets we just created. For instance, try the following code to show the first tweet stored in this object. Since tweets is a list of 100 Twitter messages, we can type the name of the object tweets followed by the number of the tweet we want to see, [1], to show the first message.

tweets[1]
## [[1]]
## [1] "MurfAD: RT @mchooyah: ADM McRaven was born to lead this mission. IÂ’ll follow him anywhere. If only people heard the real speech he gave the Team...Â…"

We can also have a look at the so-called meta-data connected to this tweet. For instance, how often did Twitter users re-tweet this message?

tweets[[1]]$retweetCount
## [1] 750

Similarly you can use the following fields (try them; an example follows the list):

  • $screenName to get the name of the twitter account tweeting this message
  • $favoriteCount to get the number of likes or favorites
  • $retweetCount to get the number of re-tweets
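
For example, you can look up these fields for the first tweet in the list (the output will depend on the tweets you harvested):

# Inspect meta-data of the first tweet
tweets[[1]]$screenName
tweets[[1]]$favoriteCount
tweets[[1]]$retweetCount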

Finally, you can also download the meta-data (information) of any public Twitter profile. For instance, try the following.

# Save information from a specific user, based on their Twitter name

user <- getUser("PCRSuTwente")
# Now you can inspect, for instance, the number of followers for this user.

user$followersCount
## [1] 91

Similarly you can use the following fields (try them! What do they do? A small example follows the list):

  • user$location
  • user$profileImageUrl
  • user$name
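
As a small example, you can combine several of these fields into a one-row data frame (a sketch using only fields named above; the output depends on the profile you downloaded):

# Combine a few profile fields into a small data frame
data.frame(
  name = user$name,
  location = user$location,
  followers = user$followersCount
)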

6 Consequences of a terrorist attack

On Tuesday, October 31st, 2017, a man drove his truck into people on a bike path in Manhattan, NYC. Terror attacks may have consequences for how people feel. But do they also affect how people tweet? For instance, how people tweet about the US president? Let's find out!

First we can download a random selection of Twitter messages mentioning Trump in the days before (pre) and after (post) the attack in NYC. In this case we download the maximum number of tweets Twitter allows you to retrieve at one time via their public API: 1500.

Note that Twitter only lets you search messages by keyword going back about 6-9 days. The search below will therefore no longer return tweets from around the attack: you can skip this step and load the saved results further below, or adapt the commands for your own Twitter search.

# Harvesting messages

nyc_pre <- searchTwitter(
  "trump", n=1500, lang="en", since="2017-10-29", until="2017-10-31"
)

nyc_post <- searchTwitter(
  "trump", n=1500, lang="en", since="2017-10-31", until="2017-11-2"
)

To load the results from the search above, download "Session-03-Data-Trump-Pre-and-Post-NYC.zip" from the website, unpack it, and run the code below.

load("S03D03-Data-Trump-Pre-and-Post-NYC.RData")

7 Creating a word cloud of most frequently tweeted words

The searchTwitter function stores the tweets in a large list, which is hard to work with directly. Therefore, use the following transformation to convert the tweets to a data frame, which you can inspect.

# Convert to dataframe format

nyc_pre_df <- twListToDF(nyc_pre)
nyc_post_df <- twListToDF(nyc_post)

Then, as a final preparation step, the text needs to be "cleaned": we'll remove all formatting, punctuation and other information irrelevant for the next steps. We will use one of the custom-built functions, twitter_clean_up, to do this.

nyc_pre_df$text <- twitter_clean_up(nyc_pre_df$text)
nyc_post_df$text <- twitter_clean_up(nyc_post_df$text)
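
If you are curious what a clean-up function like this might do internally, here is a minimal sketch, an assumption for illustration only (the actual twitter_clean_up provided with the course functions may clean differently):

# Hypothetical sketch of a tweet clean-up function (not the course version)
clean_up_sketch <- function(text) {
  text <- gsub("http\\S+", "", text)               # remove URLs
  text <- gsub("RT @\\w+:", "", text)              # remove re-tweet markers
  text <- gsub("[^[:alnum:][:space:]]", "", text)  # strip punctuation and symbols
  tolower(trimws(text))                            # lower case, trim whitespace
}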

Now open the Environment pane (top-right in RStudio) and click on nyc_post_df and/or nyc_pre_df to inspect the data sets.
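
If you prefer the console, you can also peek at a few columns directly (text, created, and retweetCount are standard columns in the data frames produced by twListToDF):

# Peek at the first rows of a few columns
head(nyc_pre_df[, c("text", "created", "retweetCount")])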

A powerful advantage of R is its flexibility in visualizing data. For text data an ideal visualization is a word cloud of the most frequently used words, where font size shows how often a word is used: bigger words are tweeted more often.

To get to the word cloud we need to transform the data set a couple of times. Don't worry about these steps: we created a custom function (twitter_word_freq) for you that is already loaded. (A sketch of what such a function might look like follows the code below.)

You only need to define the words that will be excluded from the word cloud (so-called stop words). For instance, we don't want to see words like "and", "or", or "trump".

# Define stopwords that will be excluded from word cloud

stopwords <- c(
  "trump", "trumps", "donald", "http", "https", stopwords("english")
)

# Create a data frame with words and their frequencies

nyc_pre_freq_df <- twitter_word_freq(nyc_pre_df$text, stopwords)
nyc_post_freq_df <- twitter_word_freq(nyc_post_df$text, stopwords)
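
For the curious, here is a minimal sketch of what a word-frequency helper like twitter_word_freq might look like internally, built on the tm package loaded earlier. This is an illustration under assumptions; the actual course function may differ:

# Hypothetical sketch of a word-frequency helper (not the course version)
word_freq_sketch <- function(text, stopwords) {
  corpus <- Corpus(VectorSource(text))              # build a corpus from the tweets
  corpus <- tm_map(corpus, removeWords, stopwords)  # drop the stop words
  tdm <- TermDocumentMatrix(corpus)                 # word-by-document counts
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq, row.names = NULL)
}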

Okay now we're ready to create the word cloud for the tweets before the attack using the code below.

# word cloud
wordcloud(
  nyc_pre_freq_df$word, nyc_pre_freq_df$freq, random.order = FALSE,
  colors = brewer.pal(8, "Dark2"), min.freq = 20
)

And similarly you can create the word cloud for the tweets after the attack using the code below.

# word cloud
wordcloud(
  nyc_post_freq_df$word, nyc_post_freq_df$freq, random.order = FALSE,
  colors = brewer.pal(8, "Dark2"), min.freq = 20
)

Now that you've created these word clouds, try the following (an example follows the list):

  • Change "Dark2" to "Accent" or "Pastel1"
  • Change min.freq = 20 to a different value
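
For example, a different palette and a lower frequency threshold:

# word cloud with a different palette and frequency threshold
wordcloud(
  nyc_pre_freq_df$word, nyc_pre_freq_df$freq, random.order = FALSE,
  colors = brewer.pal(8, "Accent"), min.freq = 10
)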

8 Sentiment analysis

These word clouds may look nice, but they don't give us much information about how positive tweets about President Trump were before and after the attack in New York. To get an idea of this, we can do a sentiment analysis.

The idea of sentiment analysis is that we calculate a positivity score for each tweet based on the valence words used in it. The text files provided for this course contain two lists: one with positive words and one with negative words. These are based on the work of Hu and Liu (2004).

# Load sentiment words

words_pos <- scan(
  "S03D04-Data-Positive-Words.txt",
  what = "character",
  comment.char = ";"
)

words_neg <- scan(
  "S03D05-Data-Negative-Words.txt",
  what = "character",
  comment.char = ";"
)

We created another function, twitter_score_sentiment, that compares the text of each tweet to these word lists. For each positive word the message gets +1 and for each negative word it gets -1. Negative scores thus mean the message uses more negative than positive words. A score of zero means the message is neutral, either because it contains no valence words or because it contains equal numbers of negative and positive words. And a positive score means the message uses more positive than negative words.
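
To make the idea concrete, here is a minimal sketch of this kind of word-matching score for a single tweet. This is an illustration only: the actual twitter_score_sentiment works on whole vectors of tweets (and shows a progress bar via its .progress argument).

# Hypothetical sketch of the scoring idea for one tweet (not the course version)
score_sketch <- function(text, words_pos, words_neg) {
  words <- strsplit(tolower(text), "\\s+")[[1]]          # split the tweet into words
  sum(words %in% words_pos) - sum(words %in% words_neg)  # +1 per positive, -1 per negative word
}

# Example: two positive words and one negative word give a score of +1
score_sketch("great great awful day", c("great", "good"), c("awful"))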

The twitter_score_sentiment function is based on work by Jeffrey Breen and was already loaded at the start of this assignment, together with the other custom functions. Using it, we can calculate the sentiment scores for the tweets before and after the attack separately.

# Calculate sentiment score

nyc_pre_score <- twitter_score_sentiment(
  nyc_pre_df$text, words_pos, words_neg, .progress = "text"
)

nyc_post_score <- twitter_score_sentiment(
  nyc_post_df$text, words_pos, words_neg, .progress = "text"
)

Finally we can inspect the sentiment scores. So which tweets were more positive? Those before or after the attack?

# Inspect sentiment scores

describe(nyc_pre_score$score)
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 1500 0.28 1.38      0    0.34 1.48  -7   4    11 -1.58     7.26
##      se
## X1 0.04
describe(nyc_post_score$score)
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 1500 0.39 1.28      0    0.42 1.48  -5   5    10 -0.16     0.69
##      se
## X1 0.03
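
If you want to go beyond comparing the means by eye, a two-sample t-test is one simple option (a sketch; sentiment scores are only roughly interval-scaled, so interpret the result with some care):

# Optional: test whether mean sentiment differs before vs. after the attack
t.test(nyc_pre_score$score, nyc_post_score$score)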

9 Back to you!

This concludes the copy-paste part of this tutorial. Now think of a topic and a research question you'd like to answer using Twitter data. Try to adapt the code so that you can download and analyze the data you need to answer your research question.

Three students will be randomly selected to present their findings at the end of the lecture.