HARVESTING AND ANALYZING TWEETS
Presentation by Group 5
Natasha Christabelle (15753)
Reifita Ayu P (15762)
Rubila Dwi (15759)
INTRODUCTION
Twitter is a fabulous source of information. Whenever
something is happening, people around the world start
tweeting away. Many Twitter users also engage in
conversations, and looking at these conversations allows
us to identify leaders and frequent actors.
Harvesting tweets allows users to focus on a certain topic
or subject that requires further understanding. So, to
harvest tweets is basically to collect a cluster of tweets,
depending on how many tweets you want.
INTRODUCTION
Why?
Because collecting this information is essential for business or
personal purposes. Twitter draws 310 million users worldwide,
and 79% of accounts come from outside the U.S., which means
Twitter reaches audiences from many different backgrounds.
This comes in handy for those trying to get a grasp of where
public opinion falls.
How?
There are many methods and applications for harvesting and
analyzing tweets, including:
- R
- Python
- ScraperWiki
- Mozdeh
- BeautifulSoup
R STUDIO
What it is
R is a language and environment for
statistical computing and graphics. It is
a GNU project which is similar to the S
language and environment which was
developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John
Chambers and colleagues.
THE PURPOSE OF THIS PROJECT
-To harvest tweets into R
-To analyze a particular topic on Twitter by performing sentiment analysis
-To visualize the most frequent words and terms contained in the tweets for the topic
we search, using a word cloud and bar charts
SENTIMENT ANALYSIS
Sentiment analysis, also referred to as
Opinion Mining, implies extracting
opinions, emotions and sentiments in text.
One of the most common applications of
sentiment analysis is to track attitudes
and feelings on the web, especially for
tracking products, services, brands or
even people.
The main idea is to determine whether
they are viewed positively or negatively
by a given audience.
R SENTIMENT PACKAGES BY TIMOTHY JURKA
classify_emotion
This function helps us analyze some text and classify it into different types of
emotion: anger, disgust, fear, joy, sadness, and surprise.
classify_polarity
In contrast to the classification of emotions, the classify_polarity function allows us to
classify some text as positive or negative.
NAÏVE BAYES CLASSIFIER
A naïve Bayes classifier applies Bayes' theorem in an attempt to suggest possible
classes for any given text. To do this, it needs a number of previously classified
documents of the same type. The theorem is as follows:
P(class | text) = P(text | class) × P(class) / P(text)
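As an illustration, here is a toy sketch in base R of Bayes' theorem applied to text classification. The training data and the word "great" are invented for the example; a real classifier would combine likelihoods over all words in the text.

```r
# Toy labeled documents (invented for illustration)
train <- data.frame(
  text  = c("good great service", "bad awful service", "great food", "bad food"),
  class = c("positive", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# P(class | "great") = P("great" | class) * P(class) / P("great")
p_class  <- prop.table(table(train$class))                 # priors P(class)
has_word <- grepl("great", train$text)                     # docs containing the word
p_word_given_class <- tapply(has_word, train$class, mean)  # likelihoods P(word | class)
p_word   <- mean(has_word)                                 # evidence P(word)
posterior <- p_word_given_class * p_class / p_word         # Bayes' theorem
posterior
```

In this toy data, "great" appears only in positive documents, so the posterior puts all the probability mass on the positive class.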
NAÏVE BAYES CLASSIFIER MODIFICATIONS
In practice, a naïve Bayes classifier is rarely used as-is; it usually needs some
modifications to improve the performance of the algorithm itself. Some modifications that can be
done are:
1. Preprocessing, performed as the early stage of the sentiment analysis process.
Preprocessing consists of several steps, namely:
a. Converting the entire status text to lowercase.
b. Deleting URLs contained in the status text (http://www....com).
c. Removing mention tags (@) with the username.
d. Deleting hashtags (#).
e. Normalizing repeated letters, for example 'hunggrryyy'
or 'huuuungry' becomes 'hungry'.
f. Removing punctuation such as commas, single/double quotes, and
question marks contained in the status text; for example, 'beautiful!!!!!'
is replaced by 'beautiful'.
g. Keeping only words that start with a letter. For simplicity's sake, we can remove all
words that don't start with a letter, e.g. '15th', '5.34am'.
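The preprocessing steps above can be sketched as a small base-R helper. `clean_status` is our own illustrative name, not part of any package, and the repeated-letter rule is a simple heuristic (collapsing runs of three or more of the same letter):

```r
# Minimal sketch of preprocessing rules a-g (assumed helper, base R only)
clean_status <- function(txt) {
  txt <- tolower(txt)                        # a. lowercase
  txt <- gsub("http\\S+", "", txt)           # b. remove URLs
  txt <- gsub("@\\w+", "", txt)              # c. remove @username mentions
  txt <- gsub("#\\w+", "", txt)              # d. remove hashtags
  txt <- gsub("([a-z])\\1{2,}", "\\1", txt)  # e. collapse 3+ repeated letters
  txt <- gsub("[[:punct:]]", "", txt)        # f. remove punctuation
  # g. keep only tokens that start with a letter
  tokens <- unlist(strsplit(txt, "\\s+"))
  tokens <- tokens[grepl("^[a-z]", tokens)]
  paste(tokens, collapse = " ")
}

clean_status("Soooo hungry!!! see http://t.co/abc @user #lunch at 5.34am")
```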
NAÏVE BAYES CLASSIFIER MODIFICATIONS
2. Stopword removal is the removal of words that carry no sentiment and can be
safely discarded. Examples of English stopwords
are 'is', 'a', 'all'. For Indonesian, examples include the names of the months, pronouns,
and conjunctions.
3. N-grams is a technique used to handle negation words (like 'not', 'isn't').
In addition, N-grams are also used to handle phrases that appear in a
status text. In sentiment analysis, the N-gram most commonly used is the bigram (two-word
combination).
For example, the text "Service is bad" is tokenized with unigrams
and bigrams into: Service, is, bad, Service is, is bad.
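A minimal sketch of unigram-plus-bigram tokenization in base R (`tokenize_ngrams` is an illustrative helper, not a library function):

```r
# Tokenize a status text into unigrams followed by bigrams
tokenize_ngrams <- function(txt) {
  words <- unlist(strsplit(tolower(txt), "\\s+"))
  bigrams <- if (length(words) > 1) {
    paste(words[-length(words)], words[-1])  # adjacent word pairs
  } else {
    character(0)
  }
  c(words, bigrams)
}

tokenize_ngrams("Service is bad")
# unigrams "service", "is", "bad" plus bigrams "service is", "is bad"
```

Keeping negations attached to the following word ("is bad", "not good") is what lets the classifier learn that a negated positive word signals negative sentiment.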
STEP 1: LOAD PACKAGES
# required packages
library(twitteR)
library(sentiment)
library(plyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(tm)  # for removeWords, stopwords, Corpus in Step 8
STEP 2: AUTHORIZATION
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "http://api.twitter.com/oauth/authorize"
# replace with your own app credentials from apps.twitter.com
apiKey <- "YOUR_API_KEY"
apiSecret <- "YOUR_API_SECRET"
access_token <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"
setup_twitter_oauth(apiKey, apiSecret, access_token, access_token_secret)
STEP 3: COLLECT SOME TWEETS CONTAINING THE
TERM "DONALD TRUMP"
# harvest some tweets
some_tweets = searchTwitter("donald trump", n=1000, lang="en")
# get the text
some_txt = sapply(some_tweets, function(x) x$getText())
STEP 4: PREPARE THE TEXT FOR SENTIMENT
ANALYSIS
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove at people
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt = gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)

# define "tolower error handling" function
try.error = function(x)
{
  # create missing value
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error=function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt = sapply(some_txt, try.error)

# remove NAs in some_txt
some_txt = some_txt[!is.na(some_txt)]
names(some_txt) = NULL
STEP 5: PERFORM SENTIMENT ANALYSIS
# classify emotion
class_emo = classify_emotion(some_txt, algorithm="bayes", prior=1.0)
# get emotion best fit
emotion = class_emo[,7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"
# classify polarity
class_pol = classify_polarity(some_txt, algorithm="bayes")
# get polarity best fit
polarity = class_pol[,4]
STEP 6: CREATE DATA FRAME WITH THE RESULTS
AND OBTAIN SOME GENERAL STATISTICS
# data frame with results
sent_df = data.frame(text=some_txt, emotion=emotion,
polarity=polarity, stringsAsFactors=FALSE)
# sort data frame
sent_df = within(sent_df,
emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
STEP 7: CREATE PLOT DISTRIBUTION OF
EMOTION
# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
geom_bar(aes(y=..count.., fill=emotion)) +
scale_fill_brewer(palette="Dark2") +
labs(x="emotion categories", y="number of tweets") +
ggtitle("Sentiment Analysis of Tweets about Donald Trump\n(classification by emotion)")
STEP 7: CREATE PLOT DISTRIBUTION OF
POLARITY
# plot distribution of polarity
ggplot(sent_df, aes(x=polarity)) +
geom_bar(aes(y=..count.., fill=polarity)) +
scale_fill_brewer(palette="RdGy") +
labs(x="polarity categories", y="number of tweets") +
ggtitle("Sentiment Analysis of Tweets about Donald Trump\n(classification by polarity)")
STEP 8: CREATE COMPARISON CLOUD BASED ON
EMOTION
# separating text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = some_txt[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse=" ")
}

# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))

# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3,0.5), random.order = FALSE,
                 title.size = 1.5)
THANK YOU FOR YOUR ATTENTION