VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PRACTICAL & PROJECT FILE
NAME: Garv Gupta
CLASS & SECTION: 10 HT
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
INDEX
S.NO. NAME OF THE PROGRAM
1 Write a program to print the name given by the user as input.
2 Write a program to check whether a given number is even or odd.
3 Check if a certain year is a leap year or not.
4 Write a program to find the largest number among the three input numbers.
5 Check whether a number entered is prime or not.
6 Find the factorial of a number given by the user.
7 Display the multiplication table of a number given by the user.
8 Print the Fibonacci sequence.
9 Check whether a number is an Armstrong number.
10 Find the sum of natural numbers.
11 Find the sum of digits of a number.
12 Find the length of a string entered by the user.
13 Check whether a person is eligible to vote.
14 Reverse the order of a list of numbers.
15 Store a specific value and print it.
PROJECT
“Generate TFIDF values for all the words
and find the words having the highest
value”
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
CERTIFICATE
This is to certify that Garv Gupta, a student of class X,
has successfully completed the practical as well as the
project file on the topic “Generate TFIDF values for all
the words and find the words having the highest value”
under the guidance of Ms. Puja Shah Dahiya (Subject
Teacher) during the session 2023-24.
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PRACTICAL FILE
#1 WAP to find out the largest number among the three input numbers
(coding)
(output)
(15 programs to be done as practical)
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PROJECT
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
DATA
VISUALIZATION
What is data visualization?
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see and understand trends, outliers,
and patterns in data.
Additionally, it provides an excellent way for employees or business
owners to present data to non-technical audiences without confusion.
In the world of Big Data, data visualization tools and technologies are
essential to analyze massive amounts of information and make data-
driven decisions.
Why data visualization is important
The importance of data visualization is simple: it helps people see,
interact with, and better understand data. Whether simple or complex,
the right visualization can bring everyone on the same page, regardless
of their level of expertise.
It’s hard to think of a professional industry that doesn’t benefit from
making data more understandable. Every STEM field benefits from
understanding data—and so do fields in government, finance,
marketing, history, consumer goods, service industries, education,
sports, and so on.
While it is easy to wax poetic about data visualization, there are practical,
real-life applications that are undeniable. And, since visualization is so
prolific, it is also one of the most useful professional skills to develop.
The better you can convey your points visually, whether in a dashboard or a
slide deck, the better you can leverage that information. The concept of the
citizen data scientist is on the rise. Skill sets are changing to accommodate
a data-driven world. It is increasingly valuable for professionals to be able
to use data to make decisions and to use visuals to tell stories of when data
informs the who, what, when, where, and how.
While traditional education typically draws a distinct line between
creative storytelling and technical analysis, the modern professional
world also values those who can cross between the two: data
visualization sits right in the middle of analysis and visual storytelling.
Different types of visualizations
When you think of data visualization, your first thought probably
immediately goes to simple bar graphs or pie charts. While these may be
an integral part of visualizing data and a common baseline for many data
graphics, the right visualization must be paired with the right set of
information. Simple graphs are only the tip of the iceberg. There’s a
whole selection of visualization methods to present data in effective and
interesting ways.
General Types of Visualizations:
Chart: Information presented in a tabular, graphical form with data
displayed along two axes. Can be in the form of a graph, diagram, or map.
Table: A set of figures displayed in rows and columns.
Graph: A diagram of points, lines, segments, curves, or areas that
represents certain variables in comparison to each other, usually along
two axes at a right angle.
Geospatial: A visualization that shows data in map form using different
shapes and colors to show the relationship between pieces of data and
specific locations.
Infographic: A combination of visuals and words that represent data.
Usually uses charts or diagrams.
Dashboards: A collection of visualizations and data displayed in one
place to help with analyzing and presenting data.
More specific examples
Area Map: A form of geospatial visualization, area maps are used to
show specific values set over a map of a country, state, county, or any
other geographic location. Two common types of area maps are
choropleths and isopleths.
Bar Chart: Bar charts represent numerical values compared to each
other. The length of the bar represents the value of each variable.
Box-and-whisker Plots: These show a selection of ranges (the box)
across a set measure (the bar).
Bullet Graph: A bar marked against a background to show progress or
performance against a goal, denoted by a line on the graph.
Gantt Chart: Typically used in project management, Gantt charts
are a bar chart depiction of timelines and tasks.
Heat Map: A type of geospatial visualization in map form which displays
specific data values as different colors (this doesn’t need to be
temperatures, but that is a common use).
Highlight Table: A form of table that uses color to categorize similar
data, allowing the viewer to read it more easily and intuitively.
Histogram: A type of bar chart that splits a continuous measure into
different bins to help analyze the distribution.
Pie Chart: A circular chart with triangular segments that shows data as
a percentage of a whole.
Treemap: A type of chart that shows different, related values in
the form of rectangles nested together.
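The chart types listed above can be drawn in Python with matplotlib. The snippet below is only an illustrative sketch; the subject names and marks are made-up sample values, not data from this project.

# Illustrative sketch: a bar chart and a pie chart with matplotlib.
# The subjects and marks below are hypothetical sample values.
import matplotlib.pyplot as plt

subjects = ["AI", "Maths", "Science", "English"]
marks = [88, 92, 79, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: the length of each bar represents the value of the variable
ax1.bar(subjects, marks, color="steelblue")
ax1.set_title("Marks per subject (bar chart)")
ax1.set_ylabel("Marks")

# Pie chart: each slice shows the value as a percentage of the whole
ax2.pie(marks, labels=subjects, autopct="%1.1f%%")
ax2.set_title("Share of total marks (pie chart)")

plt.tight_layout()
plt.show()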
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PROJECT
TOPIC : “Generate TFIDF values for all the
words and find the words having the
highest value”
Introduction: TF-IDF
TF-IDF stands for “Term Frequency — Inverse
Document Frequency”. This is a technique to quantify
words in a set of documents. We generally compute a
score for each word to signify its importance in the
document and corpus. This method is a widely used
technique in Information Retrieval and Text Mining.
Take a sentence, for example, “This building is so tall”. It is easy for
us to understand the sentence because we know the semantics of the words
and the sentence. But how can a program (e.g. in Python) interpret this
sentence? It is easier for any programming language to understand textual
data in the form of numerical values. So, for this reason, we need to
vectorize all of the text so that it is better represented.
By vectorizing the documents we can further perform multiple tasks such as
finding the relevant documents, ranking, clustering, etc. This exact
technique is used when you perform a Google search (they have since moved
on to newer transformer techniques). The web pages are called documents and
the text you search with is called a query. The search engine maintains a
fixed representation of all the documents. When you search with a query,
the search engine will find the relevance of the query with all of the
documents, rank them in order of relevance and show you the top k documents.
All of this process is done using the vectorized form of query
and documents.
Now coming back to our TF-IDF,
TF-IDF = Term Frequency (TF) * Inverse Document
Frequency (IDF)
Terminology
t — term (word)
d — document (set of words)
N — number of documents in the corpus
corpus — the total document set
Term Frequency
This measures the frequency of a word in a document.
This highly depends on the length of the document and
the generality of the word, for example, a very common
word such as “was” can appear multiple times in a
document. But if we take two documents, with 100 words and 10,000 words
respectively, there is a high probability that the common word “was”
appears more often in the 10,000-word document. But we cannot say that the
longer document is more important than the shorter document. For this exact
reason, we normalize the frequency value: we divide the frequency by the
total number of words in the document.
Recall that we need to finally vectorize the document.
When we plan to vectorize documents, we cannot just
consider the words that are present in that particular
document. If we do that, then the vector length will be
different for both the documents, and it will not be
feasible to compute the similarity. So, what we do is vectorize the
documents over the vocab. Vocab is the list of all possible words in the
corpus.
We need the word counts of all the vocab words and the
length of the document to compute TF. In case the term
doesn’t exist in a particular document, that particular TF
value will be 0 for that particular document. In an
extreme case, if all the words in the document are the
same, then TF will be 1. The final normalised TF value will lie in the
range [0, 1], with 0 and 1 inclusive.
TF is individual to each document and word, hence we
can formulate TF as follows:
tf(t,d) = count of t in d / number of words in d
If we already computed the TF value and if this produces
a vectorized form of the document, why not use just TF
to find the relevance between documents? Why do we
need IDF?
Let me explain: the most common words, such as ‘is’ and ‘are’, will have
very high values, giving those words very high importance. But using these
words to compute the relevance produces bad results.
These kinds of common words are called stop-words. Although we will remove
the stop words later in the preprocessing step, finding the presence of a
word across the documents and somehow reducing its weightage is more ideal.
Document Frequency
This measures how common a term is across the whole corpus. It is very
similar to TF, but the only difference is that TF is the frequency counter
for a term t in document d, whereas DF is the count of occurrences of term
t in the document set N. In other words, DF is the number of documents in
which the word is present. We count one occurrence if the term is present
in the document at least once; we do not need to know the number of times
the term is present.
df(t) = occurrence of t in N documents
To keep this also in a range, we normalize by dividing by
the total number of documents. Our main goal is to know
the informativeness of a term, and DF is the exact inverse
of it. That is why we invert the DF.
Inverse Document Frequency
IDF is the inverse of the document frequency which
measures the informativeness of term t. When we
calculate IDF, it will be very low for the most occurring
words such as stop words (because they are
present in almost all of the documents, and N/df will give
a very low value to that word). This finally gives what we
want, a relative weightage.
idf(t) = N/df
Now there are a few other problems with the IDF: when we have a large
corpus, say N = 10000, the IDF value explodes. So to dampen the effect we
take the log of IDF.
At query time, when a word is not in the vocab, it will simply be ignored.
But in a few cases we use a fixed vocab, and a few words of the vocab might
be absent from the document; in such cases the df will be 0. As we cannot
divide by 0, we smooth the value by adding 1 to the denominator.
idf(t) = log(N/(df + 1))
Finally, by taking a multiplicative value of TF and IDF,
we get the TF-IDF score. There are many different
variations of TF-IDF but for now, let us concentrate on
this basic version.
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
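As a small worked sketch of the formulas above (the toy corpus and the helper names doc_freq and tf_idf are ours, written only for illustration):

# Minimal sketch of the TF-IDF formulas on a toy corpus.
import math

corpus = [
    "this building is so tall",
    "this paper is a survey of tall buildings",
    "the survey was long",
]
documents = [doc.split() for doc in corpus]
N = len(documents)

def doc_freq(term):
    # number of documents that contain the term at least once
    return sum(1 for doc in documents if term in doc)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)           # tf(t, d) = count of t in d / words in d
    idf = math.log(N / (doc_freq(term) + 1))  # idf(t) = log(N / (df + 1))
    return tf * idf

for i, doc in enumerate(documents):
    scores = {term: round(tf_idf(term, doc), 4) for term in set(doc)}
    print(i, scores)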
Implementing on a real-world dataset
Now that we have learnt what TF-IDF is, let us compute the similarity score
on a dataset.
The dataset we are going to use is an archive of a few stories; it has lots
of documents in different formats. Download the dataset and open your
notebooks, Jupyter Notebooks I mean 😜.
Dataset Link: http://archives.textfiles.com/stories.zip
Step 1: Analysing Dataset
The first step in any of the Machine Learning tasks is to
analyse the data. So if we look at the dataset, at first
glance, we see all the documents with words in English.
Each document has different names and there are two
folders in it.
Now one of the important tasks is to identify the title in the body. If we
analyse the documents, there are different patterns of alignment of the
title, but most of the titles are centre aligned. We need to figure out a
way to extract the title. But before we get all pumped up and start coding,
let us analyse the dataset a little deeper.
Take a few minutes to analyse the dataset yourself. Try to explore…
Upon more inspection, we can notice that there’s an
index.html in each folder (including the root), which
contains all the document names and their titles. So, let
us consider ourselves lucky as the titles are given to us,
without exhaustively extracting titles from each
document.
Step 2: Extracting Title & Body:
There is no specific way to do this; it totally depends on the problem
statement at hand and on the analysis we do on the dataset.
As we have already found that the titles and the
document names are in the index.html, we need to
extract those names and titles. We are lucky that
index.html has tags that we can use as patterns to
extract our required content.
Before we start extracting the titles and file names, as we have
different folders, first let’s crawl the folders to later read
all the index.html files at once.
import os
folders = [x[0] for x in os.walk(str(os.getcwd()) + '/stories/')]
os.walk gives us the directories and files under a path, os.getcwd gives us
the current working directory, and we are going to search in the current
directory + stories folder as our data files are in the stories folder.
Always assume that you are dealing with a
huge dataset, this helps in automating the
code.
Now we can see that os.walk gives an extra trailing / for the root folder,
so we are going to remove it.
folders[0] = folders[0][:len(folders[0])-1]
The above code removes the last character of the 0th element in folders,
which is the root folder.
Now, let’s crawl through all the index.html to extract
their titles. To do that we need to find a pattern to take
out the title. As this is in html, our job will be a little
simpler.
let’s see…
We can clearly observe that each file name is enclosed
between (><A HREF=”) and (”) and each title is
between (<BR><TD>) and (\n)
We will use simple regular expressions to retrieve the name and title. The
following code gives the list of all the values that match each pattern, so
the names and titles variables hold the list of all names and titles.
import re
names = re.findall('><A HREF="(.*)">', text)
titles = re.findall('<BR><TD> (.*)\n', text)
Now that we have code to retrieve the values from the index, we just need
to iterate over all the folders and get the title and file name from every
index.html file:
- read the index file
- extract the titles and names
- move to the next folder
dataset = []
for i in folders:
    file = open(i + "/index.html", 'r')
    text = file.read().strip()
    file.close()
    file_name = re.findall('><A HREF="(.*)">', text)
    file_title = re.findall('<BR><TD> (.*)\n', text)
    for j in range(len(file_name)):
        # join folder and file name (the "/" is assumed, since the trailing slash was stripped above)
        dataset.append((str(i) + "/" + str(file_name[j]), file_title[j]))
This prepares the indexes of the dataset: each entry is a tuple of the
location of a file and its title. There is a small issue: the root folder's
index.html also lists the folders and their links, and we need to remove
those. Simply use a conditional check with a flag (say c, initialised to
False) to remove them.
if c == False:
    file_name = file_name[2:]
    c = True
Step 3: Preprocessing
Preprocessing is one of the major steps when we are
dealing with any kind of text model. During this stage, we
have to look at the distribution of our data, what
techniques are needed and how deep we should clean.
This step never has a hard and fast rule; it totally depends on the problem
statement. A few mandatory preprocessing steps are: converting to lowercase,
removing punctuation, removing stop words and lemmatization/stemming. In our
problem statement, it seems like the basic preprocessing steps will be
sufficient.
Lowercase
During the text processing, each sentence is split into
words and each word is considered as a token after
preprocessing.
Programming languages treat textual data as case-sensitive, which means that
The is different from the. We humans know that both belong to the same
token, but due to the character encoding they are considered different
tokens. Converting to lowercase is therefore a mandatory preprocessing step.
As we have all our data in a list, numpy has a method that can convert the
list of lists to lowercase at once.
np.char.lower(data)
Stop words
Stop words are the most commonly occurring words that don’t give any
additional value to the document vector. In fact, removing them will improve
computation and space efficiency. The nltk library has a method to download
the stopwords, so instead of explicitly listing all the stopwords ourselves
we can just use nltk, iterate over all the words and remove the stop words.
There are many efficient ways to do this, but I'll just give a simple
method: we are going to iterate over all the words and not append a word to
the list if it is a stop word.
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stop_words = stopwords.words('english')
new_text = ""
for word in words:
    if word not in stop_words:
        new_text = new_text + " " + word
Punctuation
Punctuation is the set of unnecessary symbols in our corpus documents. We
should be a little careful with what we do here; there can be problems such
as U.S — us, i.e. “United States” being converted to “us” after the
preprocessing. Hyphens should usually be dealt with a little care. But for
this problem statement, we are just going to remove these symbols.

symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
for i in symbols:
    data = np.char.replace(data, i, ' ')

We are going to store all our symbols in a variable and iterate over that
variable, removing that particular symbol from the whole dataset. We are
using numpy here because our data is stored in a list of lists, and numpy
is our best bet.
Apostrophe
Note that there is no apostrophe (') in the punctuation symbols. If we
removed punctuation first, don’t would be converted to dont, which is not in
the stop word list and so would not be removed. What we will do instead is
remove the stop words first, then the symbols, and then repeat stop word
removal, as a few words might still have an apostrophe and are not stop
words.
return np.char.replace(data, "'", "")
Single Characters
Single characters are not much use in knowing the importance of the
document, and a few leftover single characters might be irrelevant symbols,
so it is always good to remove the single characters.

new_text = ""
for w in words:
    if len(w) > 1:
        new_text = new_text + " " + w

We just need to iterate over all the words and not append a word if its
length is not greater than 1.
Stemming
This is the final and most important part of the preprocessing. Stemming
converts words to their stem. For example, playing and played are the same
type of words that basically indicate an action, play. A stemmer does
exactly this: it reduces the word to its stem. We are going to use the
Porter stemmer, which is a rule-based stemmer. The Porter stemmer identifies
and removes the suffix or affix of a word. The words given by the stemmer
need not be meaningful at times, but they will be identified as a single
token for the model.
Lemmatisation
Lemmatisation is a way to reduce the word to the root
synonym of a word. Unlike Stemming, Lemmatisation
makes sure that the reduced word is again a dictionary
word (word present in the same language).
WordNetLemmatizer can be used to lemmatize any word.
Stemming vs Lemmatization
stemming — need not be a dictionary word; removes prefixes and affixes
based on a few rules
lemmatization — will be a dictionary word; reduces to a root synonym
A more efficient way to proceed is to first lemmatise and then stem, but
stemming alone is also fine for a few problem statements; here we will not
lemmatise.
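A minimal sketch of the difference, using nltk's PorterStemmer and WordNetLemmatizer (assuming nltk and its wordnet data are installed; the example words are arbitrary):

# Sketch: stemming vs lemmatisation with nltk.
# Requires: pip install nltk, plus nltk.download('wordnet') for the lemmatizer data.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["playing", "played", "studies", "going"]
for w in words:
    # the stem need not be a dictionary word; the lemma always is
    print(w, "-> stem:", stemmer.stem(w), "| lemma:", lemmatizer.lemmatize(w, pos="v"))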
Converting Numbers
When a user gives a query such as 100 dollars or hundred dollars, both
search terms are the same for the user, but our IR model treats them
separately, as we are storing 100, dollars and hundred as different tokens.
So to make our IR model a little better we need to convert 100 to hundred.
To achieve this we are going to use a library called num2words.
If we look a little closer at the output of this conversion, it gives us a
few symbols and phrases such as “one hundred and two”, but we just cleaned
our data, so how do we handle this? No worries, we will just run the
punctuation and stop word removal again after converting numbers to words.
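A minimal sketch of this conversion, assuming the PyPI package num2words; the convert_numbers helper below is only one possible implementation of that step:

# Sketch: converting numeric tokens to words with the num2words package.
# Install with: pip install num2words
from num2words import num2words

def convert_numbers(data):
    tokens = str(data).split()
    out = []
    for tok in tokens:
        if tok.isdigit():
            out.append(num2words(int(tok)))  # e.g. "100" -> "one hundred"
        else:
            out.append(tok)
    return " ".join(out)

print(convert_numbers("the ticket costs 100 dollars"))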
Preprocessing
Finally, we are going to put all those preprocessing methods above into
another method, and we will call that the preprocess method.

def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data)
    data = remove_apostrophe(data)
    data = remove_single_characters(data)
    data = convert_numbers(data)
    data = remove_stop_words(data)
    data = stemming(data)
    data = remove_punctuation(data)  # repeated, as convert_numbers can introduce new symbols
    return data
If you look closely, a few of the preprocessing methods are repeated. As
discussed, this just helps clean the data a little deeper. Now we need to
read the documents and store their title and body separately, as we are
going to use them later. In our problem statement we have very different
types of documents, and this can cause a few errors in reading the documents
due to encoding compatibility. To resolve this, just use encoding="utf8",
errors='ignore' in the open() method.
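One possible way this reading step could look, assuming nltk's word_tokenize for tokenisation (the exact loop is a sketch, not the original code):

# Sketch: read every document with a tolerant encoding and preprocess
# body and title separately (these lists are used in the DF/TF-IDF steps below).
from nltk.tokenize import word_tokenize

processed_text = []
processed_title = []

for doc in dataset:                      # dataset holds (file_path, title) tuples
    file = open(doc[0], 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()

    processed_text.append(word_tokenize(str(preprocess(text))))
    processed_title.append(word_tokenize(str(preprocess(doc[1]))))

N = len(processed_text)                  # number of documents in the corpus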
Step 4: Calculating TF-IDF
Recall that we need to give different weights to title and
body. Now how are we going to handle that issue? How
will the calculation of TF-IDF work in this case?
Giving different weights to title and body is a very
common approach. We just need to consider the
document as body + title, using this we can find the
vocab. And we need to give different
weights to words in the title and different weights to the
words in the body. To better explain this, let us consider
an example.
title = “This is a novel paper”
body = “This paper consists of survey of many papers”
Now, we need to calculate the TF-IDF for body and for the
title. For the time being let us consider only the word
paper, and forget about removing stop words.
What is the TF of the word paper in the title? 1/5?
No, it’s 3/13. How? word paper appears in title and body
3 times and the total number of words in title and body is
13. As I mentioned before, we just consider the word in
the title to have different weights, but still, we consider
the whole document when calculating TF-IDF.
Then the TF of paper in both title and body is the same?
Yes, it’s the same! It’s just the difference in weights that
we are going to give. If the word is present in both title
and body, then there wouldn't be any reduction in the TF-
IDF value. If the word is present only in the title, then the
weight of the body for that particular word will not add to
the TF of that word, and vice versa.
document = body + title
TF-IDF(document) = TF-IDF(body) * alpha + TF-IDF(title) * (1 - alpha)
Calculating DF
Let us be smart and calculate DF beforehand. We need to
iterate through all the words in all the documents and
store the document id’s for each word. For this, we will
use a dictionary as we can use the word as the key and a
set of documents as the value. I mentioned a set because, even if we try to
add the same document multiple times, a set will not take duplicate values.

DF = {}
for i in range(len(processed_text)):
    tokens = processed_text[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except:
            DF[w] = {i}
We are going to create a set if the word doesn’t have a set
yet else add it to the set. This condition is checked by the
try block. Here processed_text is the body of the
document, and we are going to repeat the same for the
title as well, as we need to consider the DF of the whole
document.
len(DF) will give the number of unique words.
DF will have the word as the key and the set of doc ids as the value. But
for DF we don't actually need the set of docs, we just need the count, so we
are going to replace the set with its count.
There we have it: the count we need for all the words. To find the total
unique words in our vocabulary, we need to take all the keys of DF.
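A small sketch of that replacement, together with a doc_freq helper of the kind used by the TF-IDF code below (the helper body is our assumption):

# Sketch: replace each set of document ids with its count,
# and keep the vocabulary as the list of DF keys.
for word in DF:
    DF[word] = len(DF[word])

total_vocab = [word for word in DF]      # unique words in the corpus
total_vocab_size = len(total_vocab)

def doc_freq(word):
    # document frequency of a word; 0 if the word is not in the vocab
    return DF.get(word, 0)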
Calculating TF-IDF
Recall that we need to maintain different weights for title and body. To
calculate the TF-IDF of the body or the title we need to consider both the
title and the body. To make our job a little easier, let's use a dictionary
with a (document, token) pair as key and the TF-IDF score as the value. We
just need to iterate over all the documents; we can use Counter (from the
collections module), which gives us the frequency of the tokens, calculate
tf and idf, and finally store the score against the (doc, token) pair in
tf_idf. The tf_idf dictionary is for the body; we will use the same logic to
build a dictionary tf_idf_title for the words in the title.
tf_idf = {}
for i in range(N):
    tokens = processed_text[i]
    counter = Counter(tokens + processed_title[i])
    words_count = len(tokens + processed_title[i])   # total words in body + title
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = doc_freq(token)
        idf = np.log(N / (df + 1))
        tf_idf[i, token] = tf * idf                  # store against the (doc, token) pair
Coming to the calculation of the different weights: firstly, we need to
maintain a value alpha, which is the weight for the body; then obviously
1-alpha will be the weight for the title. Now let us delve into a little
math. We discussed that the TF-IDF value of a word will be the same for both
body and title if the word is present in both places. We will maintain two
different tf-idf dictionaries, one for the body and one for the title.
What we are going to do is a little smart: we will calculate the TF-IDF for
the body; multiply the whole body TF-IDF values by alpha; iterate over the
tokens in the title; and replace the body TF-IDF value with the title TF-IDF
value wherever the (document, token) pair exists. Take some time to process
this :P
Flow:
- Calculate TF-IDF for the body for all docs
- Calculate TF-IDF for the title for all docs
- Multiply the body TF-IDF with alpha
- Iterate the title TF-IDF for every (doc, token)
  - if the token is in the body, replace the Body(doc, token) value with the
    value in Title(doc, token)
I know this is not easy to understand at first, but still, let me explain
why the above flow works. We know that the tf-idf for body and title will be
the same if the token is in both places, and the weights that we use for
body and title sum up to one:
TF-IDF = body_tf-idf * body_weight + title_tf-idf*title_weight
body_weight + title_weight = 1
When a token is in both places, then the final TF-IDF will
be the same as taking either body or title tf_idf. That is
exactly what we are doing in the above flow. So, finally,
we have a dictionary tf_idf which has the values as a (doc,
token) pair.
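A sketch of that flow in code; alpha here is the body weight as described above, and the value 0.3 is only an assumed choice:

# Sketch of the weighting flow described above.
# alpha is the weight given to the body; 0.3 is only an assumed value.
alpha = 0.3

# scale every body TF-IDF value by the body weight
for pair in tf_idf:
    tf_idf[pair] *= alpha

# wherever a (doc, token) pair also appears in the title,
# replace the scaled body value with the title value
for pair in tf_idf_title:
    if pair in tf_idf:
        tf_idf[pair] = tf_idf_title[pair]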
Step 5: Ranking using Matching Score
Matching score is the simplest way to calculate the similarity. In this
method, we add the tf_idf values of the tokens that are in the query for
every document. For example, for the query “hello world”, we check in every
document whether these words exist, and if a word exists, its tf_idf value
is added to the matching score of that particular doc_id. In the end, we
sort and take the top k documents.
Mentioned above is the theoretical concept, but as we are using a dictionary
to hold our dataset, what we are going to do is iterate over all of the keys
in the dictionary and check whether the token is present in the query. As
our dictionary has a (document, token) key, when we find a token that is in
the query we will add the document id to another dictionary along with the
tf-idf value. Finally, we will just take the top k documents again.

def matching_score(query):
    tokens = word_tokenize(str(preprocess(query)))  # tokenise the query like the documents (assumed)
    query_weights = {}
    for key in tf_idf:
        if key[1] in tokens:
            try:
                query_weights[key[0]] += tf_idf[key]
            except KeyError:
                query_weights[key[0]] = tf_idf[key]
    return query_weights

key[0] is the document id, key[1] is the token.
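To finish the ranking, the accumulated scores can be sorted and the top k document ids returned; the helper name top_k below is ours:

# Sketch: sort the accumulated scores and keep the top k document ids.
def top_k(query_weights, k=10):
    ranked = sorted(query_weights.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked[:k]]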
Step 6: Ranking using Cosine Similarity
When we have a perfectly working matching score, why do we need cosine
similarity again? Though matching score gives relevant documents, it fails
when we give long queries; it will not be able to rank them properly. What
cosine similarity does is represent all the documents as vectors of tf-idf
tokens and measure the similarity in cosine space (the angle between the
vectors). Sometimes the query is short but still closely related to a
document; in such cases cosine similarity is the best way to find relevance.
Observe the above plot: the blue vectors are the documents and the red
vector is the query. As we can clearly see, though the manhattan distance
(green line) is very high for document d1, the query is still close to
document d1. In such cases, cosine similarity is better as it considers the
angle between the two vectors. Matching score would return document d3, but
that is not very closely related.
Matching score effectively compares distances (straight lines from the
tips of the vectors); the cosine score considers the angle between the
vectors.
Vectorization
To compute any of the above, the simplest way is to
convert everything to a vector and then compute the
cosine similarity. So, let’s convert the query and
documents to vectors. We are going to use the total_vocab variable, which
holds the list of all unique tokens, to generate an index for each token,
and we will use a numpy array of shape (docs, total_vocab) to store the
document vectors.

# Document Vectorization
D = np.zeros((N, total_vocab_size))
for i in tf_idf:
    ind = total_vocab.index(i[1])
    D[i[0]][ind] = tf_idf[i]   # place the score at the token's index (assumed completion)
For the query vector, we need to calculate the TF-IDF values: TF we can
calculate from the query itself, and we can make use of the DF that we
created for the document frequency. Finally, we store the tf-idf values in a
(1, vocab_size) numpy array; the index of each token is decided from the
total_vocab list.

Q = np.zeros((len(total_vocab)))
counter = Counter(tokens)
words_count = len(tokens)
query_weights = {}
for token in np.unique(tokens):
    tf = counter[token] / words_count
    df = doc_freq(token)
    idf = math.log((N + 1) / (df + 1))
    if token in total_vocab:
        Q[total_vocab.index(token)] = tf * idf   # store at the token's index (assumed completion)
Now, all we have to do is calculate the cosine similarity
for all the documents and return the maximum k
documents. Cosine similarity is defined as follows.
np.dot(a, b)/(norm(a)*norm(b))
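A sketch of this final ranking step, using the document matrix D and query vector Q built above; the helper names cosine_sim and rank_by_cosine are ours:

# Sketch: rank documents by cosine similarity against the query vector Q.
import numpy as np
from numpy.linalg import norm

def cosine_sim(a, b):
    denom = norm(a) * norm(b)
    return np.dot(a, b) / denom if denom != 0 else 0.0

def rank_by_cosine(D, Q, k=10):
    scores = [cosine_sim(Q, d) for d in D]      # one score per document vector
    return np.argsort(scores)[::-1][:k]         # indices of the top k documents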
Analysis
I took the text from doc_id 200 (for me) and pasted
some content with long query and short query in both
matching score and cosine similarity.
NAME OF THE SOFTWARE .......................................................(SOFTWARE USED FOR THIS
PROJECT )
DOCUMENTS USED :
Document 1
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly
normal, thank you very much. They were the last people you’d expect to be involved in anything
strange or mysterious, because they just didn’t hold with such nonsense.
Document 2
Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy
man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and
blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so
much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small
son called Dudley and in their opinion there was no finer boy anywhere.
Document 3
When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing
about the cloudy sky outside to suggest that strange and mysterious things would soon be
happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for
work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high
chair.
OUTPUT GIVEN BY THE SOFTWARE
No. | Token | TF Doc 1 | TF Doc 2 | TF Doc 3 | Document count | IDF | TF×IDF Doc 1 | TF×IDF Doc 2 | TF×IDF Doc 3
1 | mr | 0.02222222222 | 0.01176470588 | 0.02985074627 | 3 | 0.125 | 0.003 | 0.001 | 0.004
2 | mrs | 0.02222222222 | 0.01176470588 | 0.02985074627 | 3 | 0.125 | 0.003 | 0.001 | 0.004
3 | dursley | 0.02222222222 | 0.02352941176 | 0.0447761194 | 3 | 0.125 | 0.003 | 0.003 | 0.006
4 | number | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
5 | four | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
6 | privet | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
7 | drive | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
8 | proud | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
9 | say | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
10 | perfectly | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
11 | normal | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
12 | thank | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
13 | much | 0.02222222222 | 0.01176470588 | 0 | 2 | 0.301 | 0.007 | 0.004 | 0
14 | last | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
15 | people | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
16 | you’d | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
17 | expect | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
18 | involved | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
19 | anything | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
20 | strange | 0.02222222222 | 0 | 0.01492537313 | 2 | 0.301 | 0.007 | 0 | 0.004
21 | mysterious | 0.02222222222 | 0 | 0.01492537313 | 2 | 0.301 | 0.007 | 0 | 0.004
22 | didn’t | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
23 | hold | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
24 | nonsense | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
25 | director | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
26 | firm | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
27 | called | 0 | 0.02352941176 | 0 | 1 | 0.602 | 0 | 0.014 | 0
28 | grunnings | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
29 | made | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
30 | drills | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
31 | big | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
32 | beefy | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
33 | man | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
34 | hardly | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
35 | neck | 0 | 0.02352941176 | 0 | 1 | 0.602 | 0 | 0.014 | 0
36 | although | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
37 | large | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
38 | mustache | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
39 | thin | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
40 | blonde | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
41 | nearly | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
42 | twice | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
43 | usual | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
44 | amount | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
45 | came | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
46 | useful | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
47 | spent | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
48 | time | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
49 | craning | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
50 | garden | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
51 | fences | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
52 | spying | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
53 | neighbors | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
54 | dursleys | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
55 | small | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
56 | son | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
57 | dudley | 0 | 0.01176470588 | 0.01492537313 | 2 | 0.301 | 0 | 0.004 | 0.004
58 | opinion | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
59 | finer | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
60 | boy | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
61 | anywhere | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
62 | woke | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
63 | dull | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
64 | gray | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
65 | tuesday | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
66 | story | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
67 | starts | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
68 | nothing | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
69 | cloudy | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
70 | sky | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
71 | outside | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
72 | suggest | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
73 | things | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
74 | would | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
75 | soon | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
76 | happening | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
77 | country | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
78 | hummed | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
79 | picked | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
80 | boring | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
81 | tie | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
82 | work | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
83 | gossiped | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
84 | away | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
85 | happily | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
86 | wrestled | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
87 | screaming | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
88 | high | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
89 | chair | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
90 | air | 0 | 0 | 0 | 1 | 0.602 | 0 | 0 | 0
91 | quality | 0 | 0 | 0 | 1 | 0.602 | 0 | 0 | 0