
Contribution Title?

Sergiu Limboi and Laura Dioşan

Babeş-Bolyai University, Faculty of Mathematics and Computer Science,


Cluj-Napoca, Romania

Abstract. The Twitter platform is one of the most popular social media environments, gathering concise messages in which users express their views on the topics of the moment. The valuable information extracted from tweets can be applied in many areas and activities, and the process of Twitter analysis can be regarded as a task within the Text Mining domain. Processing sentiments from tweets is challenging due to the complexity of natural language, misspellings and shortened word forms. Sentiment Analysis is a field that classifies emotional information into polarity classes (positive, negative and neutral). The resulting classification can support strategic and managerial decision making.
The goal of this article is to present a variety of Sentiment Analysis approaches focused on information from social media, especially Twitter. A baseline perspective is presented through different scenarios that take into consideration preprocessing techniques, data representations, methods and evaluation measures. In addition, two interesting approaches are described in detail: the hashtag-based one and the synset-based one. All these methods are highlighted in order to show the importance and impact that the analysis of tweets has on social studies and on society in general.

Keywords: Sentiment Analysis, Twitter, Hashtags

1 Introduction

Nowadays there is an increased interest from areas like politics, business, economy or marketing in finding answers to questions regarding people's opinions and feelings, such as "What do people hate about the iPhone 6?" or "Is this movie worth watching?". This interest leads to the analysis of social media content, which is very useful in activities like opinion mining from product reviews or sentiment polarity classification.
Twitter is considered "a valuable online source for opinions" [9] and offers a way to capture the public's ideas and interests for social studies.
Bearing in mind all these questions, a methodology or research area is needed that can address these issues and help people analyze a context in order to understand it. Therefore, an interesting domain is defined
? Supported by organization x.
for all these problems, and it is called Sentiment Analysis. Sentiment Analysis [1] is the field of Natural Language Processing that identifies and extracts opinions from written or spoken language. Related tasks include information extraction (filtering subjective information), question answering (identifying opinion-oriented inquiries) and summarization (producing a condensed representation of the original text).
The remainder of the paper is structured as follows: Section 2 describes the Twitter environment, Section 3 introduces the main Sentiment Analysis concepts and phases, Section 4 presents the baseline approach and its experimental results, and Sections 5-7 detail the hashtag-based experiments on hashtag-only text, on text concatenated with hashtags, and on the initial text.

2 Twitter Environment

Twitter is a popular social medium for communicating with other people, expressing feelings and opinions and broadcasting news. The advantages of such a powerful tool are the availability on different electronic devices, the opportunity to have a large friend pool and the possibility to send small, concise messages (called tweets) to other friends on a variety of subjects [8]. It is a challenge to gather all relevant data and to detect and summarize news on a specific topic. For a user, finding other users with interesting tweets is a problem, since it requires reading through status updates and following the links attached to tweets in order to obtain more information. The analysis of Twitter is a research area of high and growing interest, since some research problems are still poorly defined and new difficulties are described day by day. In recent years, researchers have focused on issues like event detection, topic mining or sentiment analysis.
The main concepts used in the Twitter environment are: URL, mention, user, friend, follower, tweet, hash-tag and re-tweet [7]. A URL is a link that points to information about a specific topic from the message posted on Twitter. A user is a person or a system that can post messages on Twitter [8]. This social medium defines a friend-follower relationship. For example, let us consider two users a and b. User a has the option to receive all the tweets written by user b; then b becomes a friend of a, and a is a follower of b. The reverse relation is not mandatory, because user b is not forced to receive messages from user a. A user is also defined by several properties: name, source location, list of friends and followers, number of tweets, photo and a short description. In the Twitter background there are two main abbreviations: RT and DM. RT means re-tweet, i.e. re-posting a message written by a specific user, and DM signifies Direct Message, used when you want to send a private message to a user.
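The friend-follower relationship described above can be sketched with a simple data structure (a hypothetical minimal model for illustration, not part of any Twitter API):

```python
# Hypothetical sketch: follows[x] holds the users whose tweets x receives,
# i.e. x's friends; followers are derived from the same relation.
follows = {"a": {"b"}, "b": set()}  # user a receives the tweets of user b


def is_friend_of(candidate, user):
    # 'candidate' is a friend of 'user' if 'user' receives candidate's tweets
    return candidate in follows.get(user, set())


def followers_of(user):
    # everyone who receives 'user''s tweets is a follower of 'user'
    return {u for u, friends in follows.items() if user in friends}
```

Here is_friend_of("b", "a") holds and followers_of("b") yields {"a"}, while the reverse relation does not hold, mirroring the non-mandatory reciprocity described above.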
A tweet is composed of two main parts: a short and simple text message (maximum 140 characters) posted on the social medium, and hash-tags together with hypertext-based elements: related media (maps, photos, videos) and websites. Hash-tags [8] are keywords prefixed with the "#" symbol that can appear in a tweet. Twitter users use this notation to categorize their messages and make them easier to find in searches.
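As an illustration, hash-tags can be pulled out of a tweet with a regular expression (a minimal sketch; the pattern is a simplification of Twitter's actual hashtag rules):

```python
import re


def extract_hashtags(tweet):
    # hash-tags are keywords prefixed with the '#' symbol
    return re.findall(r"#(\w+)", tweet)


tags = extract_hashtags("Loving the new #iPhone camera #Apple")  # ['iPhone', 'Apple']
```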
3 Sentiment Analysis
The features of Sentiment Analysis include the following:
– scalability: despite the big amount of textual information, Sentiment Analysis can handle it and process data at scale in an "efficient and cost-effective way" [1]
– real-time analysis: it helps to define new strategies and analyses for current problems (e.g. Is there an angry client? Is a famous presidential candidate going to lose the competition?)
– detection of changes in people's opinions: by applying a Sentiment Analysis process over different periods of time, relevant changes regarding a product, service or another topic can be detected
– tracking of clients' satisfaction
– support for improving business marketing
Around the Sentiment Analysis area, multiple concepts can be defined, such as opinion, polarity, subject or opinion holder. An opinion is an expression that reflects people's feelings about a specific topic. This expression can take various forms: written or text-based, spoken or voice-based, and behavioral or gesture-based. Sentiment Analysis can be modeled as a classification problem that involves subjectivity (classifying opinions into subjective and objective ones) and polarity (classifying expressions into negative, positive and neutral), based on input represented as textual information (documents, messages, etc.). In this paper we focus on the polarity classification task. The subject is the object that people talk about; it can be a product, a service, an event or a famous politician. An opinion holder is the person that expresses the sentiment about a topic.
The complex process of classifying opinion polarity can be applied at different levels [6]:
– document level: the whole document or text is treated as a single piece of information; this approach fits when the document refers to only one subject/topic
– sentence level: a sentence is the unit of information and the polarity is determined for each sentence
– word level: the most fine-grained analysis.

3.1 Phases
Sentiment Analysis, viewed as a polarity classification task (detecting opinions and classifying them into positive, negative or neutral), involves the phases suggested in Figure 1.
The initialization step prepares the data for the classification algorithm. Data collection means retrieving the data and analyzing its content (How many messages are positive, negative or neutral? Is the data balanced?). If the data is not already labeled, manual annotation is required in order to validate the approach. The preprocessing phase transforms the unstructured information into a clean form without misspellings, abbreviations or slang words. Then,
Fig. 1. Sentiment Analysis Phases

an attribute selection stage determines how the data is represented in terms of relevant features. The output of the initialization step becomes the input of the learning step. A learning phase is mandatory because the system needs a supervisor or trainer that can tell it the expected output for a given input. Preprocessing text and analyzing the sentiments in it is a very hard task and cannot be done without an automatic component (a Machine Learning component). Here, the training data is passed to a Machine Learning algorithm that will classify messages into different classes. The last phase is the evaluation, where the classifier is applied on a test dataset and performance measures are computed in order to reflect how good the Sentiment Analysis methodology is.

Preprocessing techniques The preprocessing step is very important for the polarity classification task, as it provides clean and relevant information for the next phases.
Cleaning operations are those that only normalize the words of the textual information and remove the drawbacks introduced by the free style in which opinions are written. The following methods can therefore be applied to obtain a uniform text [3]:

– removal of URLs and hashtags from messages
– removal of numbers and special characters
– replacement of repeated character sequences with a single one
– removal of extra blank spaces
– lowercasing: all letters are converted to lower case.
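The cleaning operations listed above could be sketched as follows (an illustrative implementation with assumed regular expressions, not the exact procedure of [3]):

```python
import re


def clean(tweet):
    t = tweet.lower()                       # lowercasing
    t = re.sub(r"https?://\S+", "", t)      # removal of URLs
    t = re.sub(r"#\w+", "", t)              # removal of hashtags
    t = re.sub(r"\d+", "", t)               # removal of numbers
    t = re.sub(r"(.)\1{2,}", r"\1", t)      # collapse repeated sequences (soooo -> so)
    t = re.sub(r"[^a-z\s]", "", t)          # removal of special characters
    return re.sub(r"\s+", " ", t).strip()   # removal of extra blank spaces


clean("I LOVE it!!! soooo much http://t.co/x #happy 123")  # 'i love it so much'
```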
Negation handling is an essential operation because negative words strongly influence the polarity of a message; ignoring negation is one of the causes of misclassification. Thus, all negative words (can't, don't, never) are replaced with not. The dictionary approach converts slang words and abbreviations to their formal forms: a word like "l8" is converted to "late" [3] and "approx." to "approximately". Its use reduces the noise in the dataset. Removing stop words is very important and can improve the performance of the classifier. Stop words can be pronouns or articles (e.g. our, me, myself, that, because, etc.).
Stemming reduces an inflected or derived form to a common radix (e.g. cars becomes car); basically, stemming cuts off the prefixes or suffixes of a word. A flavor of stemming is lemmatization, which works with the morphological context based on dictionaries (e.g. studies is converted into study).
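A minimal sketch of the negation, dictionary and stop-word steps might look as follows (the word lists are small illustrative samples, not complete resources):

```python
NEGATIONS = {"can't", "don't", "never", "won't", "isn't"}      # sample negative words
SLANG = {"l8": "late", "approx.": "approximately"}             # dictionary approach
STOP_WORDS = {"our", "me", "myself", "that", "because"}        # sample stop words


def normalize(tokens):
    out = []
    for tok in tokens:
        tok = SLANG.get(tok, tok)   # expand slang words and abbreviations
        if tok in NEGATIONS:
            tok = "not"             # replace negative words with 'not'
        if tok in STOP_WORDS:
            continue                # drop stop words
        out.append(tok)
    return out


normalize(["don't", "be", "l8", "because", "of", "me"])  # ['not', 'be', 'late', 'of']
```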

Attributes determination In this phase, the relevant features of the text are extracted and used as inputs for the classification algorithms. Text mining deals with several kinds of attributes, but only some of them are frequently used in the context of tweets (e.g. hash-tags). The document-term matrix representation reflects the frequency of the words extracted from a collection of documents (texts, sentences). In this paper, two granularities are designed for the document-term matrix representation: word and bi-gram granularity. In other words, each row of the matrix corresponds to a document and the columns represent the granularity level (word/bi-gram). Bi-gram granularity means a sequence of two consecutive words.
Two scenarios are also defined for determining the values of the matrix. The value of each cell can be integer or real. Integer values describe the frequency of the granularity level in the document. The real values are computed considering the TF (term frequency) and IDF (inverse document frequency) formulas:
frequency) formula:
m
T F (t) =
M
and
N
IDF (t) = log( ),
n
where m is the number of times term t appears in the document, M is the
number of terms in the document, N is the number of documents and n is the
number of documents where term t appears [2]
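Using the TF and IDF definitions above, a tf-idf score could be computed as follows (a direct transcription of the formulas; the sample documents are illustrative):

```python
import math


def tf(term, doc):
    # TF(t) = m / M: occurrences of the term over the total terms in the document
    return doc.count(term) / len(doc)


def idf(term, docs):
    # IDF(t) = log(N / n): total documents over documents containing the term
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n)


docs = [["good", "phone", "good"], ["bad", "battery"], ["good", "battery", "life"]]
score = tf("good", docs[0]) * idf("good", docs)  # (2/3) * log(3/2)
```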

Machine learning algorithms and performance measures For the classification task, various techniques can be applied: lexicon-based, machine learning or hybrid approaches. In this paper the focus is on machine learning algorithms, so we briefly describe several classification methods [5]:
– Naı̈ve Bayes is a probabilistic algorithm based on Bayes' theorem. For each word, the probability of belonging to a class is computed.
– Support Vector Machine (SVM) is a deterministic algorithm that finds a hyperplane separating the input data into two classes, one on each side of it.
– Logistic regression is a statistical classifier based on a logistic function.
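The per-word probability idea behind Naı̈ve Bayes can be sketched as follows (a minimal multinomial variant with Laplace smoothing on toy data; real experiments would use a library implementation):

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    counts = defaultdict(Counter)   # per-class word frequencies
    priors = Counter(labels)        # class frequencies
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    return counts, priors


def predict_nb(doc, counts, priors):
    vocab = {w for c in counts.values() for w in c}
    best_label, best_score = None, -math.inf
    for label in priors:
        # log prior + per-word log likelihoods (Laplace smoothing)
        score = math.log(priors[label] / sum(priors.values()))
        total = sum(counts[label].values())
        for w in doc:
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb([["good", "great"], ["bad", "awful"], ["good", "nice"]],
                 ["pos", "neg", "pos"])
predict_nb(["good"], *model)  # 'pos'
```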

In terms of performance measures, accuracy, precision, recall and F-score can be computed in order to indicate which algorithm fits best.
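The two measures reported in the experiments below could be computed as follows (a small self-contained sketch with illustrative labels):

```python
def accuracy(y_true, y_pred):
    # fraction of all predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive="pos"):
    # fraction of predicted positives that are truly positive
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    return tp / predicted_pos if predicted_pos else 0.0


y_true = ["pos", "neg", "pos", "pos"]
y_pred = ["pos", "pos", "neg", "pos"]
accuracy(y_true, y_pred)   # 0.5
precision(y_true, y_pred)  # 2/3
```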

4 Baseline Sentiment Analysis (BSA)

Several experiments were conducted, focusing on the preprocessing techniques, the data representation in terms of relevant extracted features and the Machine Learning algorithms involved in the classification process.

4.1 Dataset

For the following methodologies, the Sanders dataset [4] is used. It consists of 5113 annotated tweets: 519 marked as positive, 572 negative, 2333 neutral and 1689 irrelevant. These tweets are messages related to four main topics, important companies worldwide: Twitter, Apple, Google and Microsoft. The focus is on classifying messages into positive and negative ones; therefore, irrelevant and neutral messages are removed and 1091 tweets are considered for the process. 20% of them were used as the testing dataset and the rest for training.
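The filtering and split sizes reported above amount to the following (a simple arithmetic check; the actual split indices are not specified in the text):

```python
# counts kept from the Sanders dataset after removing neutral/irrelevant tweets
positive, negative = 519, 572
binary_total = positive + negative         # tweets considered for the process
test_size = round(0.2 * binary_total)      # 20% held out for testing
train_size = binary_total - test_size      # remainder used for training
```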

4.2 Data preprocessing

Data preprocessing is an essential step in the Sentiment Analysis domain because it can improve the whole process by removing the problems caused by the free writing style of microblogs (e.g. misspellings, use of slang words, abbreviations). Since a fast and simple procedure is desired, the following techniques were applied on the Sanders dataset:
– cleaning operations: removal of punctuation, lowercasing
– removal of stop words
– stemming

4.3 Data representation

After the preprocessing phase, the focus is on representing the tweets for the classification algorithms. Consequently, the document-term matrix is built considering the granularity levels (word and bi-gram) and the way of computing the values: the frequency of words/bi-grams or the tf-idf computation explained in the previous section.
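The integer-valued variant of this representation could be built as follows (a minimal sketch with illustrative documents; the real matrices are built from the preprocessed tweets):

```python
from collections import Counter


def ngrams(tokens, n):
    # granularity level: n=1 gives words, n=2 gives bi-grams
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, n=1):
    vocab = sorted({g for d in docs for g in ngrams(d, n)})
    # one row per document, one column per word/bi-gram, integer frequencies
    rows = [[Counter(ngrams(d, n))[g] for g in vocab] for d in docs]
    return vocab, rows


docs = [["good", "phone", "good"], ["bad", "phone"]]
vocab, rows = document_term_matrix(docs, n=1)
```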
4.4 Classification algorithms and evaluation measures

As classification algorithms, Naı̈ve Bayes (NB), Support Vector Machine (SVM) and Logistic Regression (LR) are used for the BSA. For the evaluation phase, accuracy and precision are computed. Regarding the parameters of the classification task, for logistic regression the inverse of the regularization strength is set to 1.5. The SVM classifier is applied with a linear kernel and the regularization parameter set to 1.0. Last but not least, the Multinomial version of Naı̈ve Bayes is used for the classification task.
All experiments use 786 messages (the messages without hashtags were removed): 415 positive and 371 negative.

4.5 Text without hashtags

Table 1. Accuracy for BSA with word granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 59.49% 57.59% 60.13%
Removal of punctuation 77.85% 73.42% 70.89%
Removal of stop words 58.23% 58.86% 55.7%
Lowercasing 65.19% 58.23% 57.59%
Stemming 59.49% 57.59% 60.13%
All 55.06% 56.96% 56.96%

Table 2. Precision for BSA with word granularity and frequency values

Preprocessing technique NB LR SVM
Without preprocessing 63.15% 60.75% 61.79%
Removal of punctuation 81.81% 76.92% 69.38%
Removal of stop words 62.85% 62.66% 57.95%
Lowercasing 68.83% 61.25% 59.55%
Stemming 63.15% 60.75% 61.79%
All 56.70% 58.16% 58.00%

Word granularity and bigram granularity (frequency and tf-idf values) Due to the short length of the messages (maximum 140 characters), applying all the preprocessing techniques turns out to be a bad choice. The best values are achieved when
Table 3. Accuracy for BSA with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 61.39% 60.13% 65.19%
Removal of punctuation 74.68% 73.42% 73.42%
Removal of stop words 61.39% 61.39% 65.19%
Lowercasing 59.49% 57.59% 63.29%
Stemming 61.39% 60.13% 65.19%
All 51.27% 53.80% 53.80%

Table 4. Precision for BSA with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 60.74% 62.65% 69.86%
Removal of punctuation 74.44% 75.60% 72.34%
Removal of stop words 61.11% 65.33% 68.83%
Lowercasing 57.93% 59.77% 65.85%
Stemming 60.74% 62.65% 69.86%
All 52.63% 55.91% 56.04%

Table 5. Accuracy for BSA with bi-gram granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 59.49% 58.86% 59.49%
Removal of punctuation 76.58% 75.32% 75.95%
Removal of stop words 60.13% 60.13% 63.29%
Lowercasing 61.39% 63.29% 62.66%
Stemming 58.86% 58.23% 61.39%
All 47.57% 56.33% 57.59%

Table 6. Precision for BSA with bi-gram granularity and frequency values

Preprocessing technique NB LR SVM
Without preprocessing 63.51% 63.01% 63.88%
Removal of punctuation 80.51% 78.48% 75.55%
Removal of stop words 64.78% 65.67% 69.11%
Lowercasing 65.33% 67.56% 65.43%
Stemming 62.66% 62.50% 65.33%
All 49.09% 58.82% 60.49%
Table 7. Accuracy for BSA with bi-gram granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 59.49% 62.66% 60.76%
Removal of punctuation 74.05% 73.42% 72.78%
Removal of stop words 61.39% 62.03% 60.13%
Lowercasing 56.96% 62.66% 61.39%
Stemming 59.49% 61.39% 60.13%
All 45.57% 55.06% 55.06%

Table 8. Precision for BSA with bi-gram granularity and tf-idf values

Preprocessing technique NB LR SVM
Without preprocessing 61.36% 66.66% 64.86%
Removal of punctuation 74.15% 75.60% 72.52%
Removal of stop words 62.36% 66.21% 66.15%
Lowercasing 58.33% 67.12% 64.93%
Stemming 61.36% 65.33% 64.78%
All 49.21% 57.30% 57.64%

only the removal of punctuation is applied. Also, for pure textual information the quality of the Machine Learning techniques decreases due to the removal of hashtags, which carry valuable information.

5 Hashtag-based Sentiment Analysis (HSA)

5.1 Pure hashtags: hashtag-only text

Word granularity - integer values (frequencies)

Table 9. Accuracy for HSA with word granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 66.46% 65.19% 65.82%
With preprocessing (removal of punctuation) 65.19% 69.62% 66.46%

Bigram granularity
Table 10. Precision for HSA with word granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 73.84% 68.83% 68.29%
With preprocessing (removal of punctuation) 71.01% 73.07% 65.80%

Table 11. Accuracy for HSA with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 64.65% 63.29% 66.46%
With preprocessing (removal of punctuation) 64.56% 68.35% 67.09%

Table 12. Precision for HSA with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 69.44% 65.11% 68.67%
With preprocessing (removal of punctuation) 67.94% 70.73% 71.05%

Table 13. Accuracy for HSA with bigram granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 64.56% 67.09% 65.82%
With preprocessing (removal of punctuation) 63.29% 63.29% 65.19%

Table 14. Precision for HSA with bigram granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 68.91% 70.51% 69.73%
With preprocessing (removal of punctuation) 67.10% 65.11% 69.86%

Table 15. Accuracy for HSA with bigram granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 66.46% 66.46% 65.19%
With preprocessing (removal of punctuation) 65.19% 63.92% 62.03%
Table 16. Precision for HSA with bigram granularity and tf-idf values

Preprocessing technique NB LR SVM
Without preprocessing 71.83% 70.12% 68.83%
With preprocessing (removal of punctuation) 67.46% 68.00% 66.21%

Table 17. Accuracy for impure text with word granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 63.29% 67.09% 65.19%
With preprocessing (removal of punctuation) 79.75% 75.32% 71.52%

Table 18. Precision for impure text with word granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 68.57% 68.18% 68.35%
With preprocessing (removal of punctuation) 84.21% 78.48% 75.32%

Table 19. Accuracy for impure text with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 63.29% 65.82% 68.35%
With preprocessing (removal of punctuation) 80.38% 76.58% 77.22%

Table 20. Precision for impure text with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 63.2% 68.18% 68.35%
With preprocessing (removal of punctuation) 79.121% 77.64% 77.50%

Table 21. Accuracy for impure text with bigram granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 65.19% 65.82% 65.19%
With preprocessing (removal of punctuation) 79.75% 77.22% 70.89%
Table 22. Precision for impure text with bigram granularity and frequency values

Preprocessing technique NB LR SVM


Without preprocessing 68.35% 70.83% 69.33%
With preprocessing (removal of punctuation) 83.33% 81.57% 76.38%

Table 23. Accuracy for impure text with bigram granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 63.29% 65.82% 68.35%
With preprocessing (removal of punctuation) 80.83% 76.58% 77.22%

6 Impure text: pure text concatenated with hashtags

7 Pure text + hashtags: the initial text

References
1. Sentiment analysis overview. [Link]
2. Tf idf feature. [Link]
3. Angiani, G., Ferrari, L., Fontanini, T., Fornacciari, P., Iotti, E., Magliani, F., Mani-
cardi, S.: A comparison between preprocessing techniques for sentiment analysis in
twitter. In: KDWeb (2016)
4. Deshmukh, R., Pawar, K.: Twitter sentiment classification on sanders data using
hybrid approach 17, 118–123 (07 2015)
5. Michalski, R.S., Carbonell, J.G., Mitchell, T.M.: Machine learning: An artificial intelligence approach. Springer (2013)
6. Patil, P., Yalagi, P.: Sentiment analysis levels and techniques: A survey. space 1, 6
(2013)
7. Pawar, K.K., Shrishrimal, P.P., Deshmukh, R.: Twitter sentiment analysis: A review.
International Journal of Scientific & Engineering Research 6(4), 9 (2015)
8. Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.: Twit-
terstand: News in tweets. In: Proceedings of the 17th ACM SIGSPATIAL Interna-
tional Conference on Advances in Geographic Information Systems. pp. 42–51. GIS
’09, ACM, New York, NY, USA (2009). [Link]
[Link]

Table 24. Precision for impure text with bigram granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 61.81% 68.75% 70.73%
With preprocessing (removal of punctuation) 79.12% 77.64% 81.57%
Table 25. Accuracy for pure text with hashtags with word granularity and frequency
values

Preprocessing technique NB LR SVM


Without preprocessing 65.19% 67.09% 63.92%
With preprocessing (removal of punctuation) 79.75% 79.75% 77.85%

Table 26. Precision for pure text with hashtags with word granularity and frequency
values

Preprocessing technique NB LR SVM


Without preprocessing 71.01% 67.09% 63.92%
With preprocessing (removal of punctuation) 84.21% 85.13% 84.50%

Table 27. Accuracy for pure text with hashtags with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 64.56% 65.82% 69.62%
With preprocessing (removal of punctuation) 79.75% 79.75% 77.85%

Table 28. Precision for pure text with hashtags with word granularity and tf-idf values

Preprocessing technique NB LR SVM


Without preprocessing 62.50% 68.75% 71.42%
With preprocessing (removal of punctuation) 79.54% 84.21% 85.50%

Table 29. Accuracy for pure text with hashtags with bigram granularity and frequency
values

Preprocessing technique NB LR SVM


Without preprocessing 67.09% 65.82% 63.92%
With preprocessing (removal of punctuation) 78.48% 74.68% 77.85%

Table 30. Precision for pure text with hashtags with bigram granularity and frequency
values

Preprocessing technique NB LR SVM


Without preprocessing 70.51% 70.83% 68.00%
With preprocessing (removal of punctuation) 82.89% 80.55% 83.56%

Table 31. Accuracy for pure text with hashtags with bigram granularity and tf-idf
values

Preprocessing technique NB LR SVM


Without preprocessing 64.56% 66.46% 63.29%
With preprocessing (removal of punctuation) 77.22% 77.22% 77.85%
Table 32. Precision for pure text with hashtags with bigram granularity and tf-idf
values

Preprocessing technique NB LR SVM


Without preprocessing 64.58% 70.12% 67.56%
With preprocessing (removal of punctuation) 78.26% 81.57% 82.43%

9. Zhang, L., Ghosh, R., Dekhil, M., Hsu, M., Liu, B.: Combining lexicon-based and
learning-based methods for twitter sentiment analysis (2011)
