Shri Vaishnav Vidyapeeth Vishwavidyalaya
Shri Vaishnav Institute of Information Technology
Department of Computer Science Engineering
Lab File
Degree (CSE-Redhat)
Section – F
3rd Year / 6th Semester
Submitted to: Mrs. Archana Choubey
Submitted by: Ayushi Upadhyay
Enrollment Number: 19100BTCSES05363
Course Name: Data Science
INDEX
Sr. No.  EXPERIMENT                                                           DATE
1        Study and installation of Jupyter Notebook (Anaconda).               25/03/22
2        Study of Google Colab.                                               01/04/22
3        Write a program in Python to predict the class of the flower
         based on available attributes.                                       08/04/22
4        Write a program in Python to predict if a loan will get approved
         or not.                                                              22/04/22
5        Write a program in Python to identify the tweets which are hate
         tweets and which are not.                                            28/04/22
6        Case study on hate tweets.                                           29/04/22
EXPERIMENT – 1
OBJECTIVE – Study and installation of Jupyter Notebook (Anaconda).
Introduction of Jupyter Notebook:
The Jupyter Notebook is an open-source web application that you can use to create and share documents
that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people
at Project Jupyter.
Jupyter Notebooks are a spin-off from the IPython project, which used to have an IPython
Notebook project of its own. The name Jupyter comes from the core programming languages it
supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your
programs in Python, but there are currently over 100 other kernels that you can also use.
Installing Jupyter using Anaconda and conda on Windows:
Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for
scientific computing and data science.
Use the following installation steps:
1. Download Anaconda’s latest Python 3 version from
https://www.anaconda.com/download/#windows (currently Python 3.7).
2. Double-click the installer you just downloaded.
3. Follow the setup wizard's prompts to complete the installation.
4. Launch Anaconda Navigator and click the Install button under Jupyter Notebook.
5. Jupyter Notebook has been installed successfully.
6. To run the notebook, open a terminal and execute:
jupyter notebook
EXPERIMENT – 2
OBJECTIVE – Study of Google Colab.
Introduction of Google Colab:
Google Colaboratory, or "Colab" for short, is a product from Google Research. Colab allows anybody to
write and execute arbitrary Python code through the browser, and is especially well suited to machine
learning, data analysis, and education. More technically, Colab is a hosted Jupyter notebook service that
requires no setup to use, while providing free access to computing resources, including GPUs.
How to Use Google Colab:
To start working with Colab, first log in to your Google account, then go to
https://colab.research.google.com.
Opening Jupyter Notebook:
On opening the website, you will see a pop-up containing several tabs for opening existing notebooks.
Alternatively, you can create a new Jupyter notebook by clicking New Python 3 Notebook or New Python 2
Notebook at the bottom-right corner.
Notebook’s Description:
Creating a new notebook produces a Jupyter notebook named Untitled0.ipynb and saves it to your
Google Drive in a folder named Colab Notebooks. Since it is essentially a Jupyter notebook, all
Jupyter notebook commands work here.
Change Runtime Environment:
Click the "Runtime" dropdown menu, select "Change runtime type", and choose Python 2 or Python 3
from the "Runtime type" dropdown menu. Now we are good to go with Google Colab.
EXPERIMENT – 3
OBJECTIVE – Write a program in Python to predict the class of the flower based on
available attributes.
SOURCE CODE-
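A minimal sketch of one possible solution, assuming scikit-learn's bundled Iris dataset and a k-nearest-neighbours classifier (both are assumptions for illustration, not necessarily what the original program used):

# A minimal sketch: predict the class (species) of a flower from its
# four attributes. Dataset and model choice are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset: sepal/petal length and width, plus the species label
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit a k-nearest-neighbours classifier on the training split
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Evaluate on the held-out split
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Predict the species of a single new flower from its four attributes
sample = [[5.1, 3.5, 1.4, 0.2]]
print("Predicted class:", iris.target_names[model.predict(sample)[0]])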
OUTPUT-
EXPERIMENT – 4
OBJECTIVE – Write a program in Python to predict if a loan will get approved or not.
SOURCE CODE-
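A minimal sketch of one possible approach, assuming a CSV file and column names modelled on the common loan-prediction practice dataset (loan_data.csv, Loan_ID, and Loan_Status are illustrative assumptions) and a logistic regression model:

# A minimal sketch: predict loan approval with logistic regression.
# The file name and column names below are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("loan_data.csv")  # hypothetical dataset path

# Fill missing values with the most frequent value of each column
df = df.fillna(df.mode().iloc[0])

# Drop the identifier and one-hot encode the categorical columns
df = pd.get_dummies(df.drop(columns=["Loan_ID"]), drop_first=True)

# After encoding, Loan_Status_Y is the 0/1 approval target
X = df.drop(columns=["Loan_Status_Y"])
y = df["Loan_Status_Y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))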
OUTPUT-
EXPERIMENT – 5
OBJECTIVE – Write a program in Python to identify the tweets which are hate tweets
and which are not.
SOURCE CODE-
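A minimal sketch of one possible approach, assuming a labelled CSV with tweet and label columns (names modelled on the Kaggle dataset cited in Experiment 6) and a TFIDF-plus-logistic-regression pipeline:

# A minimal sketch: identify hate tweets with TF-IDF features and
# logistic regression. File and column names are assumptions.
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv")  # assumed columns: 'tweet', 'label' (1 = hate)

# Basic cleaning: lowercase, then strip handles, URLs, and punctuation
def clean(text):
    text = re.sub(r"@\w+|https?://\S+", " ", str(text).lower())
    return re.sub(r"[^a-z\s]", " ", text)

df["tweet"] = df["tweet"].apply(clean)

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["label"], test_size=0.2, random_state=42)

# Turn tweets into TF-IDF vectors, then fit the classifier
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
print(classification_report(y_test, model.predict(X_test_vec)))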
OUTPUT-
EXPERIMENT – 6
OBJECTIVE – Case study on hate tweets based on two different types of approaches.
Business Problem:
Toxic online content has become a major issue in today's world due to an exponential increase in the use
of the internet by people of different cultures and educational backgrounds. Differentiating hate speech
from offensive language is a key challenge in the automatic detection of toxic text content. Using the
Twitter dataset, we perform experiments by feeding bag-of-words and term frequency-inverse document
frequency (TFIDF) features to multiple machine learning models, and we compare the models under both
approaches. After tuning the best-performing model, Logistic Regression, we achieved an accuracy of 89%
and a recall of 84%. We also created a module using Flask which serves as a real-time application of our
model.
Problem Statement:
Differentiating hate speech from offensive language on Twitter. In this report, we propose an approach to
automatically classify tweets on Twitter into two classes: hate speech and non-hate speech. Using the
Twitter dataset, we perform experiments by feeding bag-of-words and term frequency-inverse document
frequency (TFIDF) features to multiple machine learning models.
Approaches:
Our data preprocessing step involved two approaches: Bag of Words and Term Frequency-Inverse
Document Frequency (TFIDF).
The bag-of-words approach is a simplified representation used in natural language processing and
information retrieval. In this approach, a text such as a sentence or a document is represented as the bag
(multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
TFIDF is a numerical statistic that is intended to reflect how important a word is to a document in a
collection. It is used as a weighting factor in searches of information retrieval, text mining, and user
modeling.
Before we input this data into the various algorithms, we have to clean it, as the tweets contain many
different tenses, grammatical errors, unknown symbols, hashtags, and Greek characters.
Data
(Source: https://www.kaggle.com/vkrahul/twitter-hate-speech)
Data Dictionary
Visualizations and Word Clouds
A word cloud was created to get an idea of the most common words used in tweets. This was done for
both categories, hate and non-hate tweets. Next, we created a bar graph to compare the usage
frequencies of the most common words in the positive and the negative sentiment.
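A hedged sketch of how such a word cloud can be generated with the wordcloud library (the file and column names are assumptions carried over from Experiment 5):

# A minimal sketch: word cloud of the most common words in non-hate tweets.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv("train.csv")  # assumed columns: 'tweet', 'label' (1 = hate)

# Join all non-hate tweets into one string and render the cloud
text = " ".join(df[df["label"] == 0]["tweet"].astype(str))
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()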
Positive tweets
Negative tweets
Data Architecture
We perform stratified sampling and separate the data into a temporary set and a test set. Note that since we
have performed stratified sampling, the ratio of good tweets to hate tweets is 93:7 for both the temporary
and test datasets. On the temporary data, we first tried to perform up-sampling of hate tweets using
SMOTE (Synthetic Minority Oversampling Technique). Since the SMOTE packages do not work directly
on textual data, we wrote our own code for it. The process is as follows:
We created a corpus of all the unique words present in hate tweets of the temporary dataset. Once we had
a matrix containing all possible words in hate tweets, we created a blank new dataset and started filling it
with new hate tweets. These new tweets were synthesized by selecting words at random from the corpus.
The lengths of these new tweets were determined on the basis of the lengths of the tweets from which the
corpus was formed.
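A hedged sketch of this synthesis step (all names are illustrative; this is a reconstruction of the described process, not the original code):

# A minimal sketch of the custom oversampling described above:
# synthesize new hate tweets by drawing random words from the corpus
# of words seen in real hate tweets.
import random

def synthesize_hate_tweets(hate_tweets, n_new):
    # Corpus of all unique words present in the hate tweets
    corpus = list({word for tweet in hate_tweets for word in tweet.split()})
    # Observed tweet lengths, used to choose lengths for the new tweets
    lengths = [len(tweet.split()) for tweet in hate_tweets]
    new_tweets = []
    for _ in range(n_new):
        k = random.choice(lengths)  # pick a realistic length
        new_tweets.append(" ".join(random.choices(corpus, k=k)))
    return new_tweets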
We then repeated this process until the number of hate tweets in the synthetic data equalled the number
of non-hate tweets in our temporary data. However, when we employed the Bag of Words approach for
feature generation, the number of features went up to 100,000. Due to this extremely high number of
features, we faced hardware and processing-power limitations and hence had to discard the SMOTE
oversampling method.
As it was not possible to up-sample hate tweets to balance the data, we decided to down-sample non-hate
tweets instead. We took a subset of only the non-hate tweets from the temporary dataset. From this subset,
we selected n random tweets, where n is the number of hate tweets in the temporary data. We then joined
this with the subset of hate tweets in the temporary data. This dataset is now the training data that we use
for our feature generation and modelling purposes, as sketched below.
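A hedged sketch of the down-sampling step with pandas (the file and column names are assumptions):

# A minimal sketch: down-sample non-hate tweets to match the hate-tweet count.
import pandas as pd

temp = pd.read_csv("temporary.csv")  # hypothetical temporary set, 'label' column (1 = hate)

hate = temp[temp["label"] == 1]
non_hate = temp[temp["label"] == 0].sample(n=len(hate), random_state=42)

# Balanced training data: equal numbers of hate and non-hate tweets, shuffled
train = pd.concat([hate, non_hate]).sample(frac=1, random_state=42)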
The test data is still in a 93:7 ratio of good tweets to hate tweets, as we did not perform any sampling on
it; real-world data arrives in roughly this ratio.
Approaches
We have looked at two major approaches for feature generation: Bag of Words (BOW) and Term
Frequency Inverse Document Frequency (TFIDF).
1. Bag of Words
2. Term Frequency Inverse Document Frequency
Bag of words
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling,
such as with machine learning algorithms. It is called a "bag" of words because any information about
the order or structure of words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document. The intuition is that documents are
similar if they have similar content, and that from the content alone we can learn something about the
meaning of the document. The objective is to turn each document of free text into a vector that we can
use as input or output for a machine learning model. Consider, for example, the four lines "It was the
best of times", "it was the worst of times", "it was the age of wisdom", and "it was the age of
foolishness". Because this vocabulary has 10 unique words, we can use a fixed-length document
representation of 10, with one position in the vector to score each word. The simplest scoring method is
to mark the presence of words as a Boolean value: 0 for absent, 1 for present.
Using an arbitrary ordering of the words in our vocabulary, we can step through the first
document ("It was the best of times") and convert it into a binary vector.
The scoring of the document would look as follows:
● “it” = 1
● “was” = 1
● “the” = 1
● “best” = 1
● “of” = 1
● “times” = 1
● “worst” = 0
● “age” = 0
● “wisdom” = 0
● “foolishness” = 0
As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
The other three documents would look as follows:
1. “it was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
2. “it was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
3. “it was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
If your dataset is small and its context is domain-specific, BoW may work better than word embeddings:
when the context is very domain-specific, you may not find corresponding vectors in pre-trained word
embedding models (GloVe, fastText, etc.).
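These binary vectors can be reproduced with scikit-learn's CountVectorizer, as a sketch (note that scikit-learn orders the vocabulary alphabetically, so the columns differ from the hand-ordered example above):

# A minimal sketch: binary bag-of-words vectors for the four lines above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# binary=True marks presence/absence instead of raw counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the 10-word vocabulary (alphabetical)
print(X.toarray())                         # one 10-dimensional vector per line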
TFIDF
TF*IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse
document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the
TF and IDF scores of a term is called the TF*IDF weight of that term.
Put simply, the higher the TF*IDF score (weight), the rarer the term and vice versa.
The TF*IDF algorithm is used to weigh a keyword in any content and assign importance to that
keyword based on the number of times it appears in the document. More importantly, it checks how
relevant the keyword is throughout the whole collection of documents, which is referred to as the corpus.
For a term t in a document d, the weight W(t, d) of term t in document d is given by:
W(t, d) = TF(t, d) × log(N / DF(t))
Where:
● TF(t, d) is the number of occurrences of t in document d.
● DF(t) is the number of documents containing the term t.
● N is the total number of documents in the corpus.
How is TF*IDF calculated? The TF (term frequency) of a word is the frequency of that word (i.e., the
number of times it appears) in a document. When you know it, you are able to see whether you are using
a term too much or too little.
For example, when a 100-word document contains the term "cat" 12 times, the TF for the word "cat" is:
TF(cat) = 12/100 = 0.12
The IDF (inverse document frequency) of a word is the measure of how significant that term is in the
whole corpus.
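Putting the two together with the formula above (the corpus size and document frequency below are illustrative numbers, not values from the dataset):

# A minimal sketch of the TF-IDF weight calculation for the 'cat' example.
import math

tf = 12 / 100      # 'cat' appears 12 times in a 100-word document
N = 10_000         # assumed total number of documents (illustrative)
df = 10            # assumed number of documents containing 'cat' (illustrative)

idf = math.log10(N / df)   # = 3.0
weight = tf * idf          # = 0.36
print(f"TF-IDF weight of 'cat': {weight:.2f}")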
Key Insights and Learnings: Tweets about politics, race, and sexuality form a major chunk of hate tweets.
Results obtained when we experimented with an imbalanced dataset were inaccurate: the model predicts
new data to belong to the majority class of the training set due to the skewed nature of the training data.
The weighted accuracy of a classifier is not the only metric to consider while evaluating the performance
of the model; business context plays a vital role as well.