Shri Vaishnav Vidyapeeth Vishwavidyalaya
Shri Vaishnav Institute of Information Technology
Department of Computer Science Engineering
Lab File
Degree (CSE-Redhat)
Section – F
3rd Year / 6th Semester
Submitted to: Mrs. Archana Choubey
Submitted by: Ayushi Upadhyay
Enrollment Number: 19100BTCSES05363
Course Name: Data Science
INDEX
Sr. No.  EXPERIMENT                                                           DATE
1        Study and installation of Jupyter Notebook (Anaconda).               25/03/22
2        Study of Google Colab.                                               01/04/22
3        Write a program in Python to predict the class of the flower
         based on available attributes.                                       08/04/22
4        Write a program in Python to predict if a loan will get approved
         or not.                                                              22/04/22
5        Write a program in Python to identify the tweets which are hate
         tweets and which are not.                                            28/04/22
6        Case study on hate tweets.                                           29/04/22
EXPERIMENT – 1
OBJECTIVE – Study and installation of Jupyter Notebook (Anaconda).
Introduction of Jupyter Notebook:
The Jupyter Notebook is an open-source web application that you can use to create and share documents
that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people
at Project Jupyter.
Jupyter Notebooks are a spin-off from the IPython project, which used to have an IPython
Notebook project of its own. The name Jupyter comes from the core programming languages it
supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your
programs in Python, but there are currently over 100 other kernels that you can also use.
Installing Jupyter using Anaconda and conda on Windows:
Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for
scientific computing and data science.
Use the following installation steps:
1. Download Anaconda’s latest Python 3 version from
https://www.anaconda.com/download/#windows (currently Python 3.7).
2. Double-click the installer you just downloaded.
3. Follow the setup wizard's prompts to complete the installation.
4. Launch Anaconda Navigator and click the Install button under Jupyter Notebook.
5. Jupyter Notebook has been installed successfully.
6. To run the notebook, open a terminal and execute:
jupyter notebook
EXPERIMENT – 2
OBJECTIVE – Study of Google Colab.
Introduction of Google Colab:
Google Colaboratory, or "Colab" for short, is a product from Google Research. Colab allows anybody to
write and execute arbitrary Python code through the browser, and is especially well suited to machine
learning, data analysis, and education. More technically, Colab is a hosted Jupyter notebook service that
requires no setup to use, while providing free access to computing resources, including GPUs.
How to Use Google Colab:
To start working with Colab, first log in to your Google account, then go to
https://colab.research.google.com.
Opening Jupyter Notebook:
On opening the website, you will see a pop-up containing several tabs for opening existing notebooks.
Alternatively, you can create a new Jupyter notebook by clicking New Python 3 Notebook or New Python 2
Notebook at the bottom-right corner.
Notebook’s Description:
Creating a new notebook produces a Jupyter notebook named Untitled0.ipynb and saves it to your
Google Drive in a folder named Colab Notebooks. Since it is essentially a Jupyter notebook, all
Jupyter notebook commands work here.
Change Runtime Environment:
Click the "Runtime" dropdown menu, select "Change runtime type", and choose Python 2 or Python 3
from the "Runtime type" dropdown menu. Now we are good to go with Google Colab.
EXPERIMENT – 3
OBJECTIVE – Write a program in Python to predict the class of the flower based on
available attributes.
SOURCE CODE-
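A minimal sketch of one possible solution, assuming scikit-learn's bundled Iris dataset and a k-nearest-neighbours classifier (both are assumptions for illustration, not necessarily what the original program used):

# A minimal sketch: predict the class (species) of a flower from its
# four attributes. Dataset and model choice are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset: sepal/petal length and width, plus the species label
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit a k-nearest-neighbours classifier on the training split
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Evaluate on the held-out split
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Predict the species of a single new flower from its four attributes
sample = [[5.1, 3.5, 1.4, 0.2]]
print("Predicted class:", iris.target_names[model.predict(sample)[0]])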
OUTPUT-
EXPERIMENT – 4
OBJECTIVE – Write a program in Python to predict if a loan will get approved or not.
SOURCE CODE-
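A minimal sketch of one possible approach, assuming a CSV file and column names modelled on the common loan-prediction practice dataset (loan_data.csv, Loan_ID, and Loan_Status are illustrative assumptions) and a logistic regression model:

# A minimal sketch: predict loan approval with logistic regression.
# The file name and column names below are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("loan_data.csv")  # hypothetical dataset path

# Fill missing values with the most frequent value of each column
df = df.fillna(df.mode().iloc[0])

# Drop the identifier and one-hot encode the categorical columns
df = pd.get_dummies(df.drop(columns=["Loan_ID"]), drop_first=True)

# After encoding, Loan_Status_Y is the 0/1 approval target
X = df.drop(columns=["Loan_Status_Y"])
y = df["Loan_Status_Y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))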
OUTPUT-
EXPERIMENT – 5
OBJECTIVE – Write a program in Python to identify the tweets which are hate tweets
and which are not.
SOURCE CODE-
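A minimal sketch of one possible approach, assuming a labelled CSV with tweet and label columns (names modelled on the Kaggle dataset cited in Experiment 6) and a TFIDF-plus-logistic-regression pipeline:

# A minimal sketch: identify hate tweets with TF-IDF features and
# logistic regression. File and column names are assumptions.
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv")  # assumed columns: 'tweet', 'label' (1 = hate)

# Basic cleaning: lowercase, then strip handles, URLs, and punctuation
def clean(text):
    text = re.sub(r"@\w+|https?://\S+", " ", str(text).lower())
    return re.sub(r"[^a-z\s]", " ", text)

df["tweet"] = df["tweet"].apply(clean)

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["label"], test_size=0.2, random_state=42)

# Turn tweets into TF-IDF vectors, then fit the classifier
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
print(classification_report(y_test, model.predict(X_test_vec)))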
OUTPUT-
EXPERIMENT – 6
OBJECTIVE – Case study on hate tweets based on two different types of approaches.
Business Problem:
Toxic online content has become a major issue in today's world due to an exponential increase in the use
of the internet by people of different cultures and educational backgrounds. Differentiating hate speech
from offensive language is a key challenge in the automatic detection of toxic text content. Using the
Twitter dataset, we perform experiments by feeding bag-of-words and term frequency-inverse document
frequency (TFIDF) features to multiple machine learning models, and we compare the models under both
approaches. After tuning the best-performing model, Logistic Regression, we achieved an accuracy of 89%
and a recall of 84%. We also created a module using Flask which serves as a real-time application of our
model.
Problem Statement:
Differentiating hate speech from offensive language on Twitter. In this report, we propose an approach to
automatically classify tweets on Twitter into two classes: hate speech and non-hate speech. Using the
Twitter dataset, we perform experiments by feeding bag-of-words and term frequency-inverse document
frequency (TFIDF) features to multiple machine learning models.
Approaches:
Our data preprocessing step involved two approaches: Bag of Words and Term Frequency-Inverse
Document Frequency (TFIDF).
The bag-of-words approach is a simplified representation used in natural language processing and
information retrieval. In this approach, a text such as a sentence or a document is represented as the bag
(multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
TFIDF is a numerical statistic that is intended to reflect how important a word is to a document in a
collection. It is used as a weighting factor in searches of information retrieval, text mining, and user
modeling.
Before we input this data into the various algorithms, we have to clean it, as the tweets contain many
different tenses, grammatical errors, unknown symbols, hashtags, and Greek characters.
Data
(Source: https://www.kaggle.com/vkrahul/twitter-hate-speech)
Data Dictionary
Visualizations and Word Clouds
A word cloud was created to get an idea of the most common words used in tweets. This was done for
both categories, hate and non-hate tweets. Next, we created a bar graph to compare the usage
frequencies of the most common words in the positive and the negative sentiment.
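A hedged sketch of how such a word cloud can be generated with the wordcloud library (the file and column names are assumptions carried over from Experiment 5):

# A minimal sketch: word cloud of the most common words in non-hate tweets.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv("train.csv")  # assumed columns: 'tweet', 'label' (1 = hate)

# Join all non-hate tweets into one string and render the cloud
text = " ".join(df[df["label"] == 0]["tweet"].astype(str))
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()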
Positive tweets
Negative tweets
Data Architecture
We perform stratified sampling and separate the data into a temporary set and a test set. Note that since we
have performed stratified sampling, the ratio of good tweets to hate tweets is 93:7 for both the temporary
and test datasets. On the temporary data, we first tried to perform up-sampling of hate tweets using
SMOTE (Synthetic Minority Oversampling Technique). Since the SMOTE packages do not work directly
on textual data, we wrote our own code for it. The process is as follows:
We created a corpus of all the unique words present in hate tweets of the temporary dataset. Once we had
a matrix containing all possible words in hate tweets, we created a blank new dataset and started filling it
with new hate tweets. These new tweets were synthesized by selecting words at random from the corpus.
The lengths of these new tweets were determined on the basis of the lengths of the tweets from which the
corpus was formed.
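A hedged sketch of this synthesis step (all names are illustrative; this is a reconstruction of the described process, not the original code):

# A minimal sketch of the custom oversampling described above:
# synthesize new hate tweets by drawing random words from the corpus
# of words seen in real hate tweets.
import random

def synthesize_hate_tweets(hate_tweets, n_new):
    # Corpus of all unique words present in the hate tweets
    corpus = list({word for tweet in hate_tweets for word in tweet.split()})
    # Observed tweet lengths, used to choose lengths for the new tweets
    lengths = [len(tweet.split()) for tweet in hate_tweets]
    new_tweets = []
    for _ in range(n_new):
        k = random.choice(lengths)  # pick a realistic length
        new_tweets.append(" ".join(random.choices(corpus, k=k)))
    return new_tweets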
We then repeated this process until the number of hate tweets in the synthetic data equalled the number
of non-hate tweets in our temporary data. However, when we employed the Bag of Words approach for
feature generation, the number of features went up to 100,000. Due to this extremely high number of
features, we faced hardware and processing-power limitations and hence had to discard the SMOTE
oversampling method.
As it was not possible to up-sample hate tweets to balance the data, we decided to down-sample non-hate
tweets instead. We took a subset of only the non-hate tweets from the temporary dataset. From this subset,
we selected n random tweets, where n is the number of hate tweets in the temporary data. We then joined
this with the subset of hate tweets in the temporary data. This dataset is now the training data that we use
for our feature generation and modelling purposes, as sketched below.
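A hedged sketch of the down-sampling step with pandas (the file and column names are assumptions):

# A minimal sketch: down-sample non-hate tweets to match the hate-tweet count.
import pandas as pd

temp = pd.read_csv("temporary.csv")  # hypothetical temporary set, 'label' column (1 = hate)

hate = temp[temp["label"] == 1]
non_hate = temp[temp["label"] == 0].sample(n=len(hate), random_state=42)

# Balanced training data: equal numbers of hate and non-hate tweets, shuffled
train = pd.concat([hate, non_hate]).sample(frac=1, random_state=42)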
The test data is still in a 93:7 ratio of good tweets to hate tweets, as we did not perform any sampling on
it; real-world data arrives in roughly this ratio.
Approaches
We have looked at two major approaches for feature generation: Bag of Words (BOW) and Term
Frequency Inverse Document Frequency (TFIDF).
1. Bag of Words
2. Term Frequency Inverse Document Frequency
Bag of words
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling,
such as with machine learning algorithms. It is called a "bag" of words because any information about
the order or structure of words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document. The intuition is that documents are
similar if they have similar content, and that from the content alone we can learn something about the
meaning of the document. The objective is to turn each document of free text into a vector that we can
use as input or output for a machine learning model. Consider, for example, the four lines "It was the
best of times", "it was the worst of times", "it was the age of wisdom", and "it was the age of
foolishness". Because this vocabulary has 10 unique words, we can use a fixed-length document
representation of 10, with one position in the vector to score each word. The simplest scoring method is
to mark the presence of words as a Boolean value: 0 for absent, 1 for present.
Using an arbitrary ordering of the words in our vocabulary, we can step through the first
document ("It was the best of times") and convert it into a binary vector.
The scoring of the document would look as follows:
● “it” = 1
● “was” = 1
● “the” = 1
● “best” = 1
● “of” = 1
● “times” = 1
● “worst” = 0
● “age” = 0
● “wisdom” = 0
● “foolishness” = 0
As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
The other three documents would look as follows:
1. “it was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
2. “it was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
3. “it was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
If your dataset is small and its context is domain-specific, BoW may work better than word embeddings:
when the context is very domain-specific, you may not find corresponding vectors in pre-trained word
embedding models (GloVe, fastText, etc.).
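These binary vectors can be reproduced with scikit-learn's CountVectorizer, as a sketch (note that scikit-learn orders the vocabulary alphabetically, so the columns differ from the hand-ordered example above):

# A minimal sketch: binary bag-of-words vectors for the four lines above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# binary=True marks presence/absence instead of raw counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the 10-word vocabulary (alphabetical)
print(X.toarray())                         # one 10-dimensional vector per line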
TFIDF
TF*IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse
document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the
TF and IDF scores of a term is called the TF*IDF weight of that term.
Put simply, the higher the TF*IDF score (weight), the rarer the term and vice versa.
The TF*IDF algorithm is used to weigh a keyword in any content and assign importance to that
keyword based on the number of times it appears in the document. More importantly, it checks how
relevant the keyword is throughout the whole collection of documents, which is referred to as the corpus.
For a term t in a document d, the weight W(t, d) of term t in document d is given by:
W(t, d) = TF(t, d) × log(N / DF(t))
Where:
● TF(t, d) is the number of occurrences of t in document d.
● DF(t) is the number of documents containing the term t.
● N is the total number of documents in the corpus.
How is TF*IDF calculated? The TF (term frequency) of a word is the frequency of that word (i.e., the
number of times it appears) in a document. When you know it, you are able to see whether you are using
a term too much or too little.
For example, when a 100-word document contains the term "cat" 12 times, the TF for the word "cat" is:
TF(cat) = 12/100 = 0.12
The IDF (inverse document frequency) of a word is the measure of how significant that term is in the
whole corpus.
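Putting the two together with the formula above (the corpus size and document frequency below are illustrative numbers, not values from the dataset):

# A minimal sketch of the TF-IDF weight calculation for the 'cat' example.
import math

tf = 12 / 100      # 'cat' appears 12 times in a 100-word document
N = 10_000         # assumed total number of documents (illustrative)
df = 10            # assumed number of documents containing 'cat' (illustrative)

idf = math.log10(N / df)   # = 3.0
weight = tf * idf          # = 0.36
print(f"TF-IDF weight of 'cat': {weight:.2f}")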
Key Insights and Learnings: Tweets about politics, race, and sexuality form a major chunk of hate tweets.
Results obtained when we experimented with an imbalanced dataset were inaccurate: the model predicts
new data to belong to the majority class of the training set due to the skewed nature of the training data.
The weighted accuracy of a classifier is not the only metric to consider while evaluating the performance
of the model; business context plays a vital role as well.