ML Paper (Namrit & Ritika)
LEARNING
Twitter Sentiment Analysis
Submitted in partial fulfillment of the requirement for the degree
MBA - BUSINESS ANALYTICS
NAMRIT MEHTA (2K19/BMBA/11)
RITIKA (2K19/BMBA/13)
Namrit Mehta
RITIKA
ABSTRACT

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications ranging from marketing to customer service to clinical medicine.

Analysis of data from social media can also yield interesting insights into public opinion on almost any product, service or personality. Social data is one of the most effective and accurate indicators of public sentiment. The explosion of Web 2.0 has led to increased activity in podcasting, blogging, tagging, contributing to RSS, social bookmarking, and social networking. As a result, there has been an explosion of interest in mining these resources for opinions. Sentiment analysis, or opinion mining, is the computational treatment of the opinions, sentiments and subjectivity of text. In this paper we discuss a method that allows the use and interpretation of Twitter data to determine public opinion.

Developing a sentiment analysis system is an approach to measuring customer perceptions computationally. This paper reports on the design of a sentiment analysis that extracts and trains on a large number of tweets. The results classify customer feedback expressed in tweets as positive or negative, and are presented in a pie chart and on a web page, with the structure laid out using PHP, CSS and HTML.

INTRODUCTION

Millions of people use social networking sites to express their thoughts, feelings, and concerns about their daily life. People write about anything, from social activities to comments about products. Online communities provide a platform for consumers to inform and influence others. In addition, social media gives businesses a platform for communicating with their customers, whether to advertise or to speak directly to customers and engage with their feedback on products and services. At the same time, consumers hold the power when it comes to what they want to see and how they respond. As a result, the successes and failures of a company are shared publicly and spread by word of mouth. Social networking can also change consumer behavior and decisions; for example, it has been noted that 87% of internet users are influenced in their purchases and decisions by customer reviews. So, if an organization is able to quickly understand what its customers are thinking, it can plan its response in time and come up with a good strategy to compete with its competitors.

In this project, we use machine learning and natural language processing techniques to understand the patterns and characteristics of tweets and predict the sentiment (if any) that they carry. Specifically, we create a computational model that can classify a given tweet as positive, negative or neutral based on the sentiment it expresses. The positive and negative classes contain polar tweets expressing a sentiment. The neutral class, however, may contain either an objective tweet that contains no opinion at all or a subjective tweet in which the user shows neutrality. Examples of each class can be found in Table 1. The decision to use three classes was made to address the problem adequately and is in line with ongoing research in the field. The tests performed on our sentiment predictor show that our system is among the best performing programs in this field. We use our sentiment predictor to create an integrated tool that helps businesses interpret and visualize public perceptions of their products and brands. This tool enables the user not only to visualize the distribution of sentiment across the dataset, but also to perform sentiment analysis over time, location, and the influence of the user.
Table 1: Example tweets for each class

Class      Tweet
positive   @hon1paris: I <3 1D too! #muchlove
negative   The new Transformers suck!! Wasted my time and money!!!
neutral    Well, I guess the govt did what it could. More needed though!
           I plan to wake up early in the morning #early2bad

Machine Learning Background

Before we can understand the research on Twitter sentiment analysis, we need to explain the general process for dealing with this problem. Supervised text classification, a machine learning method in which a class predictor is learned from labeled training data, is the standard method for sentiment classification of tweets. An overview of this method, adapted to sentiment classification of tweets, is illustrated in the figure.

Figure: Overview of Supervised Sentiment Classification of Tweets

1. First, a dataset of labeled tweets is compiled. Each tweet in this set has been marked as positive, negative or neutral by a human annotator, based on the sentiment the annotator perceives after reading the tweet.
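As a rough illustration of how a class predictor can be learned from labeled tweets, the following Python sketch trains a small bag-of-words classifier. The choice of classifier (multinomial Naive Bayes with add-one smoothing), the function names `train` and `predict`, and the training tweets are all assumptions made for this example; the paper does not specify them.

```python
import math
from collections import Counter, defaultdict


def train(labeled_tweets):
    """Count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter()            # label -> number of tweets
    for text, label in labeled_tweets:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for `text`."""
    word_counts, class_counts, vocab = model
    total_tweets = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / total_tweets)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training tweets, for illustration only.
training = [
    ("i love this movie", "positive"),
    ("what a great day", "positive"),
    ("this movie is a waste of money", "negative"),
    ("i hate waiting", "negative"),
    ("the meeting is at noon", "neutral"),
]
model = train(training)
print(predict(model, "great movie i love it"))  # positive
print(predict(model, "waste of money"))         # negative
```

With only five training tweets the predictions are purely illustrative; a usable predictor would need a much larger annotated dataset.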
Data collection:
Data in the form of raw tweets is retrieved using the Java library “Twitter4j” (called here from Scala), which provides a package for the real-time Twitter streaming API. The API requires us to register a developer account with Twitter and fill in parameters such as consumerKey, consumerSecret, accessToken, and accessTokenSecret. The API makes it possible to get a random sample of all tweets or to filter the data using keywords.
Filters retrieve only the tweets that match a specific criterion defined by the developer. We used this to retrieve tweets related to specific keywords, which are taken as input from users. Initially, we must set at least an application name and a mode. We execute the program in local mode.

Data Processing:
Data processing includes tokenization, which is the process of splitting tweets into individual words called tokens. Tokens can be split on whitespace or punctuation marks. They can be unigrams or bigrams depending on the classification model used. The bag-of-words model is one of the most widely used models in classification. It treats the text as a bag, or set, of individual words that have no link to or dependence on each other. An easy way to incorporate this model into our project is to use n-grams as features. Because a unigram model only needs the collection of individual words in the text, we split each tweet on whitespace. For example, the tweet "Met met aziz today !!" is split at each whitespace as follows:
{Met, met, aziz, today, !!}
The next step in data processing is normalization, by converting the tweet to lower case. Tweets are typically converted to lower-case letters, which makes comparing them with the dictionary easier.
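The tokenization and lower-casing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's Scala code, and the function names `tokenize` and `ngrams` are hypothetical. Lower-casing before splitting gives the same tokens as splitting first.

```python
def tokenize(tweet):
    """Lower-case a tweet and split it on whitespace into unigram tokens."""
    return tweet.lower().split()


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Met met aziz today !!")
print(tokens)             # ['met', 'met', 'aziz', 'today', '!!']
print(ngrams(tokens, 2))  # [('met', 'met'), ('met', 'aziz'), ('aziz', 'today'), ('today', '!!')]
```

Calling `ngrams(tokens, 1)` gives the unigram features, and `ngrams(tokens, 2)` the bigram features.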
Data Filtering:
The tweet obtained after data processing still contains raw material that may or may not be useful for our application. Therefore, these tweets are filtered further by removing stop words, numbers and punctuation marks.
Stop words: Tweets contain stop words, which are very common words such as “is”, “am” and “are” that carry no additional information. These words are not useful for our purpose, so they are removed using a list stored in steffile.dat. We compare each word in the tweet with this list and delete the words that match the stop list.
Removing non-alphabetical characters: Symbols such as “#” and “@”, as well as numbers, are not important for sentiment analysis and are removed using pattern matching. Regular expressions are used to match only the letters of the alphabet, and everything else is ignored. This helps reduce clutter from the Twitter stream.
Stemming: This is the process of reducing words to their roots. For example, a word such as “fishing” has the same root as “fish”. The stemming library is Stanford NLP, which also offers various algorithms such as Porter stemming. In our case, we did not use any stemming algorithm due to time constraints.

Feature Extraction:
TF-IDF is a feature vectorization method used in text mining to determine the importance of a term to a document in a corpus. The recommended API is the DataFrame-based API. This feature is useful in cases where we need to find the most prominent topics or create word clouds. However, this project focuses on extracting sentiment from the Twitter stream, so TF-IDF is not computed.

Sentiment Analysis:
Sentiment analysis is performed using a custom algorithm that finds the polarity as described below.

Finding polarity:
To find the polarity, we used a simple algorithm that counts the positive and negative words in a tweet. Separate lists were made for positive and negative words. The next step is to compare every word in a tweet against both of these lists. If the current word matches a word in the positive list, the score is incremented by 1, and if a negative word is found, the score is decremented by 1. More positive words lead to a higher sentiment score. Stanford NLP, which provides more complex algorithms, could be used instead for more accurate sentiment prediction.

Sentiment Analysis Output:
The output contains a list of tweets in real time along with their sentiment score on the left-hand side. The first tweet has a score of -2, which is due to two negative keywords. The next two tweets are positive as they contain keywords like “good” and “great”. Both these
words are in the positive word list. Note that if a tweet has a score of 0, it is omitted from the final output. The problem with neutral tweets is that they serve no purpose, as they don't convey any sentiment towards the product.

The last tweet is the most negative tweet, with a sentiment score of -2; it contains an abusive word that is not shown. Negative tweets indicate hate and dislike towards a product or public figure. For this sample of tweets, the results suggest that people don't hate Donald Trump as portrayed in media and news, as the general sentiment regarding Trump in the collected tweets is positive.

DISCUSSION

Developing the project proved to be a lot more challenging than expected due to our relative inexperience with Apache Spark and Scala.

A) Project Limitation & challenges
The following challenges were faced during implementation.
Apache Spark memory error: Apache Spark has a setting for the memory allotted for processing the program, and the default value was less than what our application needed. The solution was to change settings in VM
country. This prevented us from retrieving tweets from a specific region to analyze, which could be future work.
Library dependencies: There were some initial challenges in building the application using SBT tools due to incompatible versions of Scala and the Scala SDK, as we had limited knowledge about the technologies we were using. Moreover, the given examples used outdated libraries, which we updated to the latest versions by comparing the given version against the Maven repository.

Twitter Sentiment Analysis

Sentiment can be found in comments or tweets and provides useful clues for many different purposes. Sentiment can also be divided into two groups, negative and positive words. Sentiment analysis is a natural language processing method for measuring an expressed opinion or sentiment within a selection of tweets.

Sentiment analysis refers to the general process of extracting polarity and subjectivity from semantic orientation, which refers to the strength of words and the polarity of text or phrases. There are two main approaches to sentiment analysis: dictionary-based (lexicon-based) methods and machine-learning-based methods.
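The dictionary-based approach, in the simple form used in this project (add 1 for each positive word, subtract 1 for each negative word, and leave zero-score tweets out of the output), can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's Scala code: the word lists, the example tweets, and the function names `polarity` and `score_tweets` are hypothetical.

```python
import re

# Hypothetical, tiny word lists for illustration only; the project's real
# lists are not reproduced in this paper.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "hate", "worst"}


def polarity(tweet):
    """Score a tweet: +1 per positive word, -1 per negative word."""
    # Keep alphabetic words only, in lower case (drops '#', '@', digits).
    words = re.findall(r"[a-z]+", tweet.lower())
    score = 0
    for word in words:
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score


def score_tweets(tweets):
    """Return (score, tweet) pairs, omitting tweets whose score is 0."""
    scored = [(polarity(t), t) for t in tweets]
    return [(s, t) for s, t in scored if s != 0]


tweets = [
    "This is the worst, I hate it",
    "Such a good day, great news!",
    "I plan to wake up early in the morning",
]
for score, tweet in score_tweets(tweets):
    print(score, tweet)
# -2 This is the worst, I hate it
# 2 Such a good day, great news!
```

The third tweet matches neither list, so its score is 0 and it does not appear in the output.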
Implementation & Result:
We would like to make a web application in which users can enter a keyword and get the analyzed results. In this project, we have worked only with unigram models, but we would like to extend it to bigrams and beyond, which should capture more of the relationships between words and give more accurate sentiment analysis results.
An overall tweet score can also be computed for a single keyword, which would give the overall public sentiment regarding a topic.
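One possible way to compute such an overall score is to sum, and average, the scores of all tweets that mention the keyword. The Python sketch below assumes the tweets have already been scored; the function name `overall_sentiment`, the substring matching rule, and the sample data are assumptions made for illustration.

```python
def overall_sentiment(scored_tweets, keyword):
    """Sum and average the scores of tweets that mention `keyword`.

    `scored_tweets` is a list of (score, text) pairs; matching is a simple
    case-insensitive substring test, which is an assumption of this sketch.
    """
    scores = [s for s, text in scored_tweets if keyword.lower() in text.lower()]
    if not scores:
        return 0, 0.0  # no tweet mentions the keyword
    total = sum(scores)
    return total, total / len(scores)


# Made-up scores and tweets, for illustration only.
scored = [
    (2, "Great phone, good camera"),
    (-1, "Bad battery on this phone"),
    (1, "Good weather today"),
]
total, mean = overall_sentiment(scored, "phone")
print(total, mean)  # 1 0.5
```

A positive total suggests that sentiment about the keyword is positive overall, while the mean makes keywords with different tweet volumes comparable.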
CONCLUSION