Movie selection using R

Tuesday night is usually my “movie night”, thanks to the nearby theater that prices its Tuesday shows at $6!

I am hesitating between Jurassic World, Ant-Man, Minions and Terminator Genisys. The first two have good reviews, but I have heard bad feedback about them, while it is the exact opposite for the last two.
I then thought it could be fun to turn this choice into a little exercise! As I already did a sentiment analysis using Python in a previous post, I decided to learn how to do it in R and see whether people’s tweets can help me choose.

Getting tweets

install.packages("Rstem", repos = "", type="source")
install.packages("/Users/wolk/Downloads/sentiment_0.2.tar.gz", repos = NULL, type="source")


After signing into Twitter with a regular account, one can create a sample app here. This generates the standard OAuth 1.0a authentication identifiers: consumer key, consumer secret, access token, and access token secret.

# twitteR provides setup_twitter_oauth() and searchTwitter()
library(twitteR)

consumer_key <- '...'
consumer_secret <- '...'
access_token <- '...'
access_secret <- '...'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

jurassic = searchTwitter("Jurassic World", n=1500, lang="en")
antman = searchTwitter("Ant Man", n=1500, lang="en")
minions = searchTwitter("Minions", n=1500, lang="en")
terminator = searchTwitter("Terminator Genisys", n=1500, lang="en")

To extract the text:

# get the text
jurassic_txt = sapply(jurassic, function(x) x$getText())
antman_txt = sapply(antman, function(x) x$getText())
minions_txt = sapply(minions, function(x) x$getText())
terminator_txt = sapply(terminator, function(x) x$getText())

I found a function to clean tweets on this website:

clean.tweets <- function(some_txt)
{
  # remove HTML-escaped ampersands
  some_txt = gsub("&amp", "", some_txt)
  # remove retweet entities, then any remaining @mentions
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
  some_txt = gsub("@\\w+", "", some_txt)
  # remove punctuation, digits and what is left of the links
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)
  some_txt = gsub("http\\w+", "", some_txt)
  # collapse runs of spaces/tabs into a single space and trim both ends
  some_txt = gsub("[ \t]{2,}", " ", some_txt)
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)
  # define "tolower error handling" function
  try.tolower = function(x)
  {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }
  some_txt = sapply(some_txt, try.tolower)
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}
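To see what the individual substitutions do, here is a minimal sketch that applies the same kind of patterns, one at a time, to two made-up tweets. The sample strings and the helper name `strip` are assumptions for illustration only; the sketch uses `perl = TRUE` and collapses runs of whitespace to a single space so that neighbouring words stay separate.

```r
# Hypothetical helper: gsub with Perl-compatible regular expressions
strip <- function(pattern, x, replacement = "") {
  gsub(pattern, replacement, x, perl = TRUE)
}

# Made-up tweets, for illustration only
sample_txt <- c("RT @someone: Loved it!!! http://t.co/abc123",
                "@friend 10/10 would   watch again")

step <- strip("(RT|via)((?:\\b\\W*@\\w+)+)", sample_txt)  # retweet headers
step <- strip("@\\w+", step)                              # remaining @mentions
step <- strip("[[:punct:]]", step)                        # punctuation
step <- strip("[[:digit:]]", step)                        # digits
step <- strip("http\\w+", step)                           # what is left of URLs
step <- strip("[ \t]{2,}", step, " ")                     # runs of whitespace
step <- strip("^\\s+|\\s+$", step)                        # leading/trailing blanks

cleaned <- tolower(step)
print(cleaned)
# [1] "loved it"          "would watch again"
```

Note that the URL pattern only works because punctuation is removed first: by then the link has been squashed into a single "http..." word.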

I then apply this function and, in addition, remove stop words:

# tm provides removeWords() and stopwords()
library(tm)

# clean, remove stop words, then drop the NAs left by the tolower error handling
jurassic_txt <- clean.tweets(jurassic_txt)
jurassic_txt = removeWords(jurassic_txt, stopwords("english"))
jurassic_txt = jurassic_txt[!is.na(jurassic_txt)]

antman_txt <- clean.tweets(antman_txt)
antman_txt = removeWords(antman_txt, stopwords("english"))
antman_txt = antman_txt[!is.na(antman_txt)]

minions_txt <- clean.tweets(minions_txt)
minions_txt = removeWords(minions_txt, stopwords("english"))
minions_txt = minions_txt[!is.na(minions_txt)]

terminator_txt <- clean.tweets(terminator_txt)
terminator_txt = removeWords(terminator_txt, stopwords("english"))
terminator_txt = terminator_txt[!is.na(terminator_txt)]

Sentiment Analysis using R

Emotion classification

The classifier is a Naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon. A description can be found here.
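To give a feel for what a naive Bayes classifier over a word lexicon does, here is a toy sketch in base R. It is not the sentiment package's implementation: the four-word lexicon, the uniform prior, the add-one smoothing and the function name `score_doc` are all assumptions made for illustration.

```r
# Toy lexicon (made up): each word is tagged with exactly one emotion
lexicon <- data.frame(word = c("love", "great", "scared", "terrifying"),
                      emotion = c("joy", "joy", "fear", "fear"),
                      stringsAsFactors = FALSE)
emotions <- unique(lexicon$emotion)
vocab_size <- nrow(lexicon)

# Hypothetical scorer: log prior + sum of smoothed log-likelihoods per emotion
score_doc <- function(doc) {
  tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
  tokens <- tokens[tokens %in% lexicon$word]   # ignore words outside the lexicon
  sapply(emotions, function(e) {
    n_e <- sum(lexicon$emotion == e)           # lexicon words tagged with e
    hits <- lexicon$emotion[match(tokens, lexicon$word)] == e
    # add-one smoothing: (count + 1) / (words in class + vocabulary size)
    log(1 / length(emotions)) + sum(log((hits + 1) / (n_e + vocab_size)))
  })
}

scores <- score_doc("great scared love")
best_fit <- names(which.max(scores))
print(best_fit)
# [1] "joy"
```

The emotion with the highest score plays the role of the "best fit"; a text with no lexicon words at all gets the same score for every emotion, which is the kind of case that ends up without a meaningful label.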

# sentiment provides classify_emotion() and classify_polarity()
library(sentiment)

# classify emotion
class_emo_jurassic = classify_emotion(jurassic_txt, algorithm="bayes", prior=1.0)
class_emo_antman = classify_emotion(antman_txt, algorithm="bayes", prior=1.0)
class_emo_minions = classify_emotion(minions_txt, algorithm="bayes", prior=1.0)
class_emo_terminator = classify_emotion(terminator_txt, algorithm="bayes", prior=1.0)

# get emotion best fit
emo_jurassic = class_emo_jurassic[,7]
emo_antman = class_emo_antman[,7]
emo_minions = class_emo_minions[,7]
emo_terminator = class_emo_terminator[,7]

# substitute NA's by "unknown"
emo_jurassic[is.na(emo_jurassic)] = "unknown"
emo_antman[is.na(emo_antman)] = "unknown"
emo_minions[is.na(emo_minions)] = "unknown"
emo_terminator[is.na(emo_terminator)] = "unknown"
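The NA handling can be checked on a toy vector before running it on real results. The labels below are made up; the point is only that indexing with `is.na()` touches the missing entries and leaves the rest alone, after which `table()` gives the per-emotion counts.

```r
# Made-up best-fit labels, NA where no emotion could be assigned
emo <- c("joy", NA, "anger", "joy", NA)

# Replace only the missing entries
emo[is.na(emo)] <- "unknown"

# Tally the labels; categories come out in alphabetical order
counts <- table(emo)
print(counts)
```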

The emotions are divided into anger, disgust, fear, joy, sadness and surprise. A little bit like Inside Out 🙂

colors = c("red", "green", "orange", "violet", "blue", "yellow", "grey")
# tweets per emotion (assumed: count is the table of best-fit emotions)
count = table(emo_jurassic)
jpeg('emotionJurassic.jpg', width=500)
barplot(count, main="Emotion classification of tweets about Jurassic World", xlab="Emotion", ylab="Counts", col=colors)
dev.off()
#Same thing for other movies


Polarity classification

We can now look at the overall polarity of the tweets:

# classify polarity
class_pol_jurassic = classify_polarity(jurassic_txt, algorithm="bayes")
class_pol_antman = classify_polarity(antman_txt,algorithm="bayes")
class_pol_minions = classify_polarity(minions_txt,algorithm="bayes")
class_pol_terminator = classify_polarity(terminator_txt, algorithm="bayes")

# get polarity best fit
pol_jurassic = class_pol_jurassic[,4]
pol_antman = class_pol_antman[,4]
pol_minions = class_pol_minions[,4]
pol_terminator = class_pol_terminator[,4]

# data frame with results
df_jurassic = data.frame(text=jurassic_txt, emotion=emo_jurassic,polarity=pol_jurassic, stringsAsFactors=FALSE)
df_antman = data.frame(text=antman_txt, emotion=emo_antman,polarity=pol_antman, stringsAsFactors=FALSE)
df_minions = data.frame(text=minions_txt, emotion=emo_minions,polarity=pol_minions, stringsAsFactors=FALSE)
df_terminator = data.frame(text=terminator_txt, emotion=emo_terminator, polarity=pol_terminator, stringsAsFactors=FALSE)

count_jurassic <- data.frame(table(df_jurassic$polarity))
count_antman <- data.frame(table(df_antman$polarity))
count_minions <- data.frame(table(df_minions$polarity))
count_terminator <- data.frame(table(df_terminator$polarity))

all <- data.frame(count_jurassic$Freq, count_antman$Freq,
count_minions$Freq, count_terminator$Freq)
colnames(all) <- c("Jurassic World", "Ant Man", "Minions", "Terminator Genisys")
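One caveat when combining the Freq columns this way: `table()` only lists the categories that actually occur, so if one movie had no tweets in a category, the columns would not line up. A minimal sketch that fixes the levels up front, using toy data; the names `pol_levels` and `count_levels` are hypothetical, and the lowercase labels are an assumption about what the classifier returns.

```r
# Assumed polarity labels, in the order they should appear
pol_levels <- c("negative", "neutral", "positive")

# Toy labels; movie_b deliberately has no "neutral" tweets
movie_a <- c("positive", "negative", "neutral", "positive")
movie_b <- c("positive", "positive", "negative")

# Count with fixed levels so that absent categories show up as 0
count_levels <- function(x) as.vector(table(factor(x, levels = pol_levels)))

all_counts <- data.frame(A = count_levels(movie_a),
                         B = count_levels(movie_b),
                         row.names = pol_levels)
print(all_counts)
```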

Plotting the results:

jpeg('polarity.jpg', width=500)
barplot(as.matrix(all), main="Polarity of tweets", ylab="Polarity",
  cex.lab=1.5, cex.main=1.4, beside=TRUE, col=c("blue", "grey", "violet"),
  legend.text=c("Negative", "Neutral", "Positive"), ylim=c(0,1500))
# close the device so the file is written
dev.off()


It seems that Ant-Man and Terminator Genisys attract the most “positive” feelings, whereas for Jurassic World people seem more unanimous. It is also worth noting that Jurassic World is the oldest of the four movies.
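Because cleaning drops a different number of tweets for each movie, raw counts are not strictly comparable. One option, sketched here with made-up numbers, is to compare the share of each polarity class per movie using `prop.table()`.

```r
# Made-up counts: rows are polarity classes, columns are movies
counts <- matrix(c(100, 300, 600,
                    50, 100, 350),
                 nrow = 3,
                 dimnames = list(c("negative", "neutral", "positive"),
                                 c("Movie A", "Movie B")))

# Normalise each column so that it sums to 1
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))
```

In this toy case Movie B has fewer positive tweets in absolute terms (350 against 600) but a larger positive share (0.7 against 0.6).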

Word cloud

To better understand the classification, let’s create a word cloud for each movie

# sort data frame
df_jurassic = within(df_jurassic, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
df_antman = within(df_antman, emotion <- factor(emotion,levels=names(sort(table(emotion), decreasing=TRUE))))
df_minions = within(df_minions, emotion <- factor(emotion,levels=names(sort(table(emotion), decreasing=TRUE))))
df_terminator = within(df_terminator, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

# separating text by emotion
emos = levels(factor(df_jurassic$emotion))
nemo = length(emos)
# one document per emotion (emo_docs is an assumed name)
emo_docs = rep("", nemo)
for (i in 1:nemo)
{
   tmp = df_jurassic$text[df_jurassic$emotion == emos[i]]
   emo_docs[i] = paste(tmp, collapse=" ")
}

# create corpus
corpus = Corpus(VectorSource(emo_docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
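To make it concrete what the term-document matrix holds, here is a base-R sketch that builds the same kind of word-by-document count table for two made-up "emotion documents". It does not use tm, and the names `docs`, `vocab` and `tdm_toy` are assumptions for illustration.

```r
# Made-up documents, one per emotion
docs <- c(joy = "great fun great dinosaurs", fear = "scary dinosaurs")

# Split each document into words and collect the overall vocabulary
words <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(words)))

# One column per document, one row per word, cells are word counts
tdm_toy <- sapply(words, function(w) as.vector(table(factor(w, levels = vocab))))
rownames(tdm_toy) <- vocab
colnames(tdm_toy) <- names(docs)
print(tdm_toy)
```

A comparison cloud then places each word under the column where it is relatively most frequent, which is why a word shared by all documents, like "dinosaurs" here, tells the categories apart less than one that appears in a single document.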

# comparison word cloud
library(wordcloud)
library(RColorBrewer)
jpeg('cloudJurassic.jpg', width=500)
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
   scale = c(2.0,.5), random.order = FALSE, title.size = 1.5)
dev.off()

#Same thing for the other movies

Word clouds for Jurassic World, Ant-Man, Minions and Terminator Genisys (in this order):



Among other things, I find it pretty funny that “Velociraptor” falls under “Surprise”, “Weight” under “Sadness”, “Tweets” under “Anger” and “Arnold Schwarzenegger” under “Disgust” 🙂

Well, despite that, I think I will opt for Terminator, hoping it will be as good as the first two!
I hope this was useful and, to paraphrase, “Hasta la vista, baby!”

