VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PRACTICAL & PROJECT FILE
NAME: Garv Gupta
CLASS & SECTION: 10 HT
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
INDEX
S.NO. NAME OF THE PROGRAM
1 Write a program to print the name given by the user as input.
2 Write a program to check whether a given number is even or odd.
3 Check if a certain year is a leap year or not.
4 Write a program to find the largest number among the three input numbers.
5 Check whether a number entered is prime or not.
6 Find the factorial of a number given by the user.
7 Display the multiplication table of a number given by the user.
8 Print the Fibonacci sequence.
9 Check whether a number is an Armstrong number.
10 Find the sum of natural numbers.
11 Find the sum of digits of a number.
12 Find the length of a string entered by the user.
13 Check whether a person is eligible to vote.
14 Reverse the order of a list of numbers.
15 Store a specific value and print it.
PROJECT
“Generate TFIDF values for all the words
and find the words having the highest
value”
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
CERTIFICATE
This is to certify that Garv Gupta, a student of class X,
has successfully completed the practical as well as the
project file on the topic “Generate TFIDF values for all
the words and find the words having the highest value”
under the guidance of Ms. Puja Shah Dahiya (Subject
Teacher) during the session 2023-24.
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PRACTICAL FILE
#1 WAP to find out the largest number among the three input numbers
(coding)
(output)
(15 programs to be done as practical)
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PROJECT
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
DATA
VISUALIZATION
What is data visualization?
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization
tools provide an accessible way to see and understand trends, outliers,
and patterns in data.
Additionally, it provides an excellent way for employees or business
owners to present data to non-technical audiences without confusion.
In the world of Big Data, data visualization tools and technologies are
essential to analyze massive amounts of information and make data-
driven decisions.
Why data visualization is important
The importance of data visualization is simple: it helps people see,
interact with, and better understand data. Whether simple or complex,
the right visualization can bring everyone on the same page, regardless
of their level of expertise.
It’s hard to think of a professional industry that doesn’t benefit from
making data more understandable. Every STEM field benefits from
understanding data—and so do fields in government, finance,
marketing, history, consumer goods, service industries, education,
sports, and so on.
While it is easy to wax poetic about data visualization, there are practical,
real-life applications that are undeniable. And, since visualization is so
prolific, it is also one of the most useful professional skills to develop.
The better you can convey your points visually, whether in a dashboard or a
slide deck, the better you can leverage that information. The concept of the
citizen data scientist is on the rise. Skill sets are changing to accommodate
a data-driven world. It is increasingly valuable for professionals to be able
to use data to make decisions and to use visuals to tell stories of when data
informs the who, what, when, where, and how.
While traditional education typically draws a distinct line between
creative storytelling and technical analysis, the modern professional
world also values those who can cross between the two: data
visualization sits right in the middle of analysis and visual storytelling.
Different types of visualizations
When you think of data visualization, your first thought probably
immediately goes to simple bar graphs or pie charts. While these may be
an integral part of visualizing data and a common baseline for many data
graphics, the right visualization must be paired with the right set of
information. Simple graphs are only the tip of the iceberg. There’s a
whole selection of visualization methods to present data in effective and
interesting ways.
General Types of Visualizations:
Chart: Information presented in a tabular, graphical form with data
displayed along two axes. Can be in the form of a graph, diagram, or map.
Table: A set of figures displayed in rows and columns.
Graph: A diagram of points, lines, segments, curves, or areas that
represents certain variables in comparison to each other, usually along
two axes at a right angle.
Geospatial: A visualization that shows data in map form using different
shapes and colors to show the relationship between pieces of data and
specific locations.
Infographic: A combination of visuals and words that represent data.
Usually uses charts or diagrams.
Dashboards: A collection of visualizations and data displayed in one
place to help with analyzing and presenting data.
More specific examples
Area Map: A form of geospatial visualization, area maps are used to
show specific values set over a map of a country, state, county, or any
other geographic location. Two common types of area maps are
choropleths and isopleths.
Bar Chart: Bar charts represent numerical values compared to each
other. The length of the bar represents the value of each variable.
Box-and-whisker Plots: These show a selection of ranges (the box)
across a set measure (the bar).
Bullet Graph: A bar marked against a background to show progress or
performance against a goal, denoted by a line on the graph.
Gantt Chart: Typically used in project management, Gantt charts
are a bar chart depiction of timelines and tasks.
Heat Map: A type of geospatial visualization in map form which displays
specific data values as different colors (this doesn’t need to be
temperatures, but that is a common use).
Highlight Table: A form of table that uses color to categorize similar
data, allowing the viewer to read it more easily and intuitively.
Histogram: A type of bar chart that splits a continuous measure into
different bins to help analyze the distribution.
Pie Chart: A circular chart with triangular segments that shows data as
a percentage of a whole.
Treemap: A type of chart that shows different, related values in
the form of rectangles nested together.
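The chart types listed above can be drawn in Python with matplotlib. The snippet below is only an illustrative sketch; the subject names and marks are made-up sample values, not data from this project.

# Illustrative sketch: a bar chart and a pie chart with matplotlib.
# The subjects and marks below are hypothetical sample values.
import matplotlib.pyplot as plt

subjects = ["AI", "Maths", "Science", "English"]
marks = [88, 92, 79, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: the length of each bar represents the value of the variable
ax1.bar(subjects, marks, color="steelblue")
ax1.set_title("Marks per subject (bar chart)")
ax1.set_ylabel("Marks")

# Pie chart: each slice shows the value as a percentage of the whole
ax2.pie(marks, labels=subjects, autopct="%1.1f%%")
ax2.set_title("Share of total marks (pie chart)")

plt.tight_layout()
plt.show()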
VENKATESHWAR INTERNATIONAL SCHOOL
ARTIFICIAL INTELLIGENCE (417)
CLASS X (2023-24)
PROJECT
TOPIC : “Generate TFIDF values for all the
words and find the words having the
highest value”
Introduction: TF-IDF
TF-IDF stands for “Term Frequency — Inverse
Document Frequency”. This is a technique to quantify
words in a set of documents. We generally compute a
score for each word to signify its importance in the
document and corpus. This method is a widely used
technique in Information Retrieval and Text Mining.
Take a sentence, for example, “This building is so tall”. It is easy for
us to understand the sentence because we know the semantics of the words
and the sentence. But how can a program (e.g. in Python) interpret this
sentence? It is easier for any programming language to understand textual
data in the form of numerical values. So, for this reason, we need to
vectorize all of the text so that it is better represented.
By vectorizing the documents we can further perform multiple tasks such as
finding the relevant documents, ranking, clustering, etc. This exact
technique is used when you perform a Google search (they have since moved
on to newer transformer techniques). The web pages are called documents and
the text you search with is called a query. The search engine maintains a
fixed representation of all the documents. When you search with a query,
the search engine will find the relevance of the query with all of the
documents, rank them in order of relevance and show you the top k documents.
All of this process is done using the vectorized form of query
and documents.
Now coming back to our TF-IDF,
TF-IDF = Term Frequency (TF) * Inverse Document
Frequency (IDF)
Terminology
t — term (word)
d — document (set of words)
N — number of documents in the corpus
corpus — the total document set
Term Frequency
This measures the frequency of a word in a document.
This highly depends on the length of the document and
the generality of the word, for example, a very common
word such as “was” can appear multiple times in a
document. But if we take two documents, with 100 words and 10,000 words
respectively, there is a high probability that the common word “was”
appears more often in the 10,000-word document. But we cannot say that the
longer document is more important than the shorter document. For this exact
reason, we normalize the frequency value: we divide the frequency by the
total number of words in the document.
Recall that we need to finally vectorize the document.
When we plan to vectorize documents, we cannot just
consider the words that are present in that particular
document. If we do that, then the vector length will be
different for both the documents, and it will not be
feasible to compute the similarity. So, what we do is vectorize the
documents over the vocab. Vocab is the list of all possible words in the
corpus.
We need the word counts of all the vocab words and the
length of the document to compute TF. In case the term
doesn’t exist in a particular document, that particular TF
value will be 0 for that particular document. In an
extreme case, if all the words in the document are the
same, then TF will be 1. The final normalised TF value will lie in the
range [0, 1], with 0 and 1 inclusive.
TF is individual to each document and word, hence we
can formulate TF as follows:
tf(t,d) = count of t in d / number of words in d
If we already computed the TF value and if this produces
a vectorized form of the document, why not use just TF
to find the relevance between documents? Why do we
need IDF?
Let me explain: the most common words, such as ‘is’ and ‘are’, will have
very high values, giving those words very high importance. But using these
words to compute the relevance produces bad results.
These kinds of common words are called stop-words. Although we will remove
the stop words later in the preprocessing step, finding the presence of a
word across the documents and somehow reducing its weightage is more ideal.
Document Frequency
This measures how common a term is across the whole corpus. It is very
similar to TF, but the only difference is that TF is the frequency counter
for a term t in document d, whereas DF is the count of occurrences of term
t in the document set N. In other words, DF is the number of documents in
which the word is present. We count one occurrence if the term is present
in the document at least once; we do not need to know the number of times
the term is present.
df(t) = occurrence of t in N documents
To keep this also in a range, we normalize by dividing by
the total number of documents. Our main goal is to know
the informativeness of a term, and DF is the exact inverse
of it. That is why we invert the DF.
Inverse Document Frequency
IDF is the inverse of the document frequency which
measures the informativeness of term t. When we
calculate IDF, it will be very low for the most occurring
words such as stop words (because they are
present in almost all of the documents, and N/df will give
a very low value to that word). This finally gives what we
want, a relative weightage.
idf(t) = N/df
Now there are a few other problems with the IDF: when we have a large
corpus, say N = 10000, the IDF value explodes. So to dampen the effect we
take the log of IDF.
At query time, when a word is not in the vocab, it will simply be ignored.
But in a few cases we use a fixed vocab, and a few words of the vocab might
be absent from the document; in such cases the df will be 0. As we cannot
divide by 0, we smooth the value by adding 1 to the denominator.
idf(t) = log(N/(df + 1))
Finally, by taking a multiplicative value of TF and IDF,
we get the TF-IDF score. There are many different
variations of TF-IDF but for now, let us concentrate on
this basic version.
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
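As a small worked sketch of the formulas above (the toy corpus and the helper names doc_freq and tf_idf are ours, written only for illustration):

# Minimal sketch of the TF-IDF formulas on a toy corpus.
import math

corpus = [
    "this building is so tall",
    "this paper is a survey of tall buildings",
    "the survey was long",
]
documents = [doc.split() for doc in corpus]
N = len(documents)

def doc_freq(term):
    # number of documents that contain the term at least once
    return sum(1 for doc in documents if term in doc)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)           # tf(t, d) = count of t in d / words in d
    idf = math.log(N / (doc_freq(term) + 1))  # idf(t) = log(N / (df + 1))
    return tf * idf

for i, doc in enumerate(documents):
    scores = {term: round(tf_idf(term, doc), 4) for term in set(doc)}
    print(i, scores)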
Implementing on a real-world dataset
Now that we have learnt what TF-IDF is, let us compute the similarity score
on a dataset.
The dataset we are going to use is an archive of a few stories; it has lots
of documents in different formats. Download the dataset and open your
notebooks, Jupyter Notebooks I mean 😜.
Dataset Link: http://archives.textfiles.com/stories.zip
Step 1: Analysing Dataset
The first step in any of the Machine Learning tasks is to
analyse the data. So if we look at the dataset, at first
glance, we see all the documents with words in English.
Each document has different names and there are two
folders in it.
Now one of the important tasks is to identify the title in the body. If we
analyse the documents, there are different patterns of alignment of the
title, but most of the titles are centre aligned. We need to figure out a
way to extract the title. But before we get all pumped up and start coding,
let us analyse the dataset a little deeper.
Take a few minutes to analyse the dataset yourself. Try to explore…
Upon more inspection, we can notice that there’s an
index.html in each folder (including the root), which
contains all the document names and their titles. So, let
us consider ourselves lucky as the titles are given to us,
without exhaustively extracting titles from each
document.
Step 2: Extracting Title & Body:
There is no specific way to do this; it totally depends on the problem
statement at hand and on the analysis we do on the dataset.
As we have already found that the titles and the
document names are in the index.html, we need to
extract those names and titles. We are lucky that
index.html has tags that we can use as patterns to
extract our required content.
Before we start extracting the titles and file names, as we have
different folders, first let’s crawl the folders to later read
all the index.html files at once.
import os
folders = [x[0] for x in os.walk(str(os.getcwd()) + '/stories/')]
os.walk gives us the directories and files under a path, os.getcwd gives us
the current working directory, and we are going to search in the current
directory + stories folder as our data files are in the stories folder.
Always assume that you are dealing with a
huge dataset, this helps in automating the
code.
Now we can see that os.walk gives an extra trailing / for the root folder,
so we are going to remove it.
folders[0] = folders[0][:len(folders[0])-1]
The above code removes the last character of the 0th element in folders,
which is the root folder.
Now, let’s crawl through all the index.html to extract
their titles. To do that we need to find a pattern to take
out the title. As this is in html, our job will be a little
simpler.
let’s see…
We can clearly observe that each file name is enclosed
between (><A HREF=”) and (”) and each title is
between (<BR><TD>) and (\n)
We will use simple regular expressions to retrieve the name and title. The
following code gives the list of all the values that match each pattern, so
the names and titles variables hold the list of all names and titles.
import re
names = re.findall('><A HREF="(.*)">', text)
titles = re.findall('<BR><TD> (.*)\n', text)
Now that we have code to retrieve the values from the index, we just need
to iterate over all the folders and get the title and file name from every
index.html file:
- read the index file
- extract the titles and names
- move to the next folder
dataset = []
for i in folders:
    file = open(i + "/index.html", 'r')
    text = file.read().strip()
    file.close()
    file_name = re.findall('><A HREF="(.*)">', text)
    file_title = re.findall('<BR><TD> (.*)\n', text)
    for j in range(len(file_name)):
        # join folder and file name (the "/" is assumed, since the trailing slash was stripped above)
        dataset.append((str(i) + "/" + str(file_name[j]), file_title[j]))
This prepares the indexes of the dataset: each entry is a tuple of the
location of a file and its title. There is a small issue: the root folder's
index.html also lists the folders and their links, and we need to remove
those. Simply use a conditional check with a flag (say c, initialised to
False) to remove them.
if c == False:
    file_name = file_name[2:]
    c = True
Step 3: Preprocessing
Preprocessing is one of the major steps when we are
dealing with any kind of text model. During this stage, we
have to look at the distribution of our data, what
techniques are needed and how deep we should clean.
This step never has a hard and fast rule; it totally depends on the problem
statement. A few mandatory preprocessing steps are: converting to lowercase,
removing punctuation, removing stop words and lemmatization/stemming. In our
problem statement, it seems like the basic preprocessing steps will be
sufficient.
Lowercase
During the text processing, each sentence is split into
words and each word is considered as a token after
preprocessing.
Programming languages treat textual data as case-sensitive, which means that
The is different from the. We humans know that both belong to the same
token, but due to the character encoding they are considered different
tokens. Converting to lowercase is therefore a mandatory preprocessing step.
As we have all our data in a list, numpy has a method that can convert the
list of lists to lowercase at once.
np.char.lower(data)
Stop words
Stop words are the most commonly occurring words that don’t give any
additional value to the document vector. In fact, removing them will improve
computation and space efficiency. The nltk library has a method to download
the stopwords, so instead of explicitly listing all the stopwords ourselves
we can just use nltk, iterate over all the words and remove the stop words.
There are many efficient ways to do this, but I'll just give a simple
method: we are going to iterate over all the words and not append a word to
the list if it is a stop word.
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stop_words = stopwords.words('english')
new_text = ""
for word in words:
    if word not in stop_words:
        new_text = new_text + " " + word
Punctuation
Punctuation is the set of unnecessary symbols in our corpus documents. We
should be a little careful with what we do here; there can be problems such
as U.S — us, i.e. “United States” being converted to “us” after the
preprocessing. Hyphens should usually be dealt with a little care. But for
this problem statement, we are just going to remove these symbols.

symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
for i in symbols:
    data = np.char.replace(data, i, ' ')

We are going to store all our symbols in a variable and iterate over that
variable, removing that particular symbol from the whole dataset. We are
using numpy here because our data is stored in a list of lists, and numpy
is our best bet.
Apostrophe
Note that there is no apostrophe (') in the punctuation symbols. If we
removed punctuation first, don’t would be converted to dont, which is not in
the stop word list and so would not be removed. What we will do instead is
remove the stop words first, then the symbols, and then repeat stop word
removal, as a few words might still have an apostrophe and are not stop
words.
return np.char.replace(data, "'", "")
Single Characters
Single characters are not much use in knowing the importance of the
document, and a few leftover single characters might be irrelevant symbols,
so it is always good to remove the single characters.

new_text = ""
for w in words:
    if len(w) > 1:
        new_text = new_text + " " + w

We just need to iterate over all the words and not append a word if its
length is not greater than 1.
Stemming
This is the final and most important part of the preprocessing. Stemming
converts words to their stem. For example, playing and played are the same
type of words that basically indicate an action, play. A stemmer does
exactly this: it reduces the word to its stem. We are going to use the
Porter stemmer, which is a rule-based stemmer. The Porter stemmer identifies
and removes the suffix or affix of a word. The words given by the stemmer
need not be meaningful at times, but they will be identified as a single
token for the model.
Lemmatisation
Lemmatisation is a way to reduce the word to the root
synonym of a word. Unlike Stemming, Lemmatisation
makes sure that the reduced word is again a dictionary
word (word present in the same language).
WordNetLemmatizer can be used to lemmatize any word.
Stemming vs Lemmatization
stemming — need not be a dictionary word; removes prefixes and affixes
based on a few rules
lemmatization — will be a dictionary word; reduces to a root synonym
A more efficient way to proceed is to first lemmatise and then stem, but
stemming alone is also fine for a few problem statements; here we will not
lemmatise.
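A minimal sketch of the difference, using nltk's PorterStemmer and WordNetLemmatizer (assuming nltk and its wordnet data are installed; the example words are arbitrary):

# Sketch: stemming vs lemmatisation with nltk.
# Requires: pip install nltk, plus nltk.download('wordnet') for the lemmatizer data.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["playing", "played", "studies", "going"]
for w in words:
    # the stem need not be a dictionary word; the lemma always is
    print(w, "-> stem:", stemmer.stem(w), "| lemma:", lemmatizer.lemmatize(w, pos="v"))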
Converting Numbers
When a user gives a query such as 100 dollars or hundred dollars, both
search terms are the same for the user, but our IR model treats them
separately, as we are storing 100, dollars and hundred as different tokens.
So to make our IR model a little better we need to convert 100 to hundred.
To achieve this we are going to use a library called num2words.
If we look a little closer at the output of this conversion, it gives us a
few symbols and phrases such as “one hundred and two”, but we just cleaned
our data, so how do we handle this? No worries, we will just run the
punctuation and stop word removal again after converting numbers to words.
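A minimal sketch of this conversion, assuming the PyPI package num2words; the convert_numbers helper below is only one possible implementation of that step:

# Sketch: converting numeric tokens to words with the num2words package.
# Install with: pip install num2words
from num2words import num2words

def convert_numbers(data):
    tokens = str(data).split()
    out = []
    for tok in tokens:
        if tok.isdigit():
            out.append(num2words(int(tok)))  # e.g. "100" -> "one hundred"
        else:
            out.append(tok)
    return " ".join(out)

print(convert_numbers("the ticket costs 100 dollars"))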
Preprocessing
Finally, we are going to put all those preprocessing methods above into
another method, and we will call that the preprocess method.

def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data)
    data = remove_apostrophe(data)
    data = remove_single_characters(data)
    data = convert_numbers(data)
    data = remove_stop_words(data)
    data = stemming(data)
    data = remove_punctuation(data)  # repeated, as convert_numbers can introduce new symbols
    return data
If you look closely, a few of the preprocessing methods are repeated. As
discussed, this just helps clean the data a little deeper. Now we need to
read the documents and store their title and body separately, as we are
going to use them later. In our problem statement we have very different
types of documents, and this can cause a few errors in reading the documents
due to encoding compatibility. To resolve this, just use encoding="utf8",
errors='ignore' in the open() method.
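One possible way this reading step could look, assuming nltk's word_tokenize for tokenisation (the exact loop is a sketch, not the original code):

# Sketch: read every document with a tolerant encoding and preprocess
# body and title separately (these lists are used in the DF/TF-IDF steps below).
from nltk.tokenize import word_tokenize

processed_text = []
processed_title = []

for doc in dataset:                      # dataset holds (file_path, title) tuples
    file = open(doc[0], 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()

    processed_text.append(word_tokenize(str(preprocess(text))))
    processed_title.append(word_tokenize(str(preprocess(doc[1]))))

N = len(processed_text)                  # number of documents in the corpus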
Step 4: Calculating TF-IDF
Recall that we need to give different weights to title and
body. Now how are we going to handle that issue? How
will the calculation of TF-IDF work in this case?
Giving different weights to title and body is a very
common approach. We just need to consider the
document as body + title, using this we can find the
vocab. And we need to give different
weights to words in the title and different weights to the
words in the body. To better explain this, let us consider
an example.
title = “This is a novel paper”
body = “This paper consists of survey of many papers”
Now, we need to calculate the TF-IDF for body and for the
title. For the time being let us consider only the word
paper, and forget about removing stop words.
What is the TF of the word paper in the title? 1/5?
No, it’s 3/13. How? word paper appears in title and body
3 times and the total number of words in title and body is
13. As I mentioned before, we just consider the word in
the title to have different weights, but still, we consider
the whole document when calculating TF-IDF.
Then the TF of paper in both title and body is the same?
Yes, it’s the same! It’s just the difference in weights that
we are going to give. If the word is present in both title
and body, then there wouldn't be any reduction in the TF-
IDF value. If the word is present only in the title, then the
weight of the body for that particular word will not add to
the TF of that word, and vice versa.
document = body + title
TF-IDF(document) = TF-IDF(body) * alpha + TF-IDF(title) * (1 - alpha)
Calculating DF
Let us be smart and calculate DF beforehand. We need to
iterate through all the words in all the documents and
store the document id’s for each word. For this, we will
use a dictionary as we can use the word as the key and a
set of documents as the value. I mentioned a set because, even if we try to
add the same document multiple times, a set will not take duplicate values.

DF = {}
for i in range(len(processed_text)):
    tokens = processed_text[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except:
            DF[w] = {i}
We are going to create a set if the word doesn’t have a set
yet else add it to the set. This condition is checked by the
try block. Here processed_text is the body of the
document, and we are going to repeat the same for the
title as well, as we need to consider the DF of the whole
document.
len(DF) will give the number of unique words.
DF will have the word as the key and the set of doc ids as the value. But
for DF we don't actually need the set of docs, we just need the count, so we
are going to replace the set with its count.
There we have it: the count we need for all the words. To find the total
unique words in our vocabulary, we need to take all the keys of DF.
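A small sketch of that replacement, together with a doc_freq helper of the kind used by the TF-IDF code below (the helper body is our assumption):

# Sketch: replace each set of document ids with its count,
# and keep the vocabulary as the list of DF keys.
for word in DF:
    DF[word] = len(DF[word])

total_vocab = [word for word in DF]      # unique words in the corpus
total_vocab_size = len(total_vocab)

def doc_freq(word):
    # document frequency of a word; 0 if the word is not in the vocab
    return DF.get(word, 0)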
Calculating TF-IDF
Recall that we need to maintain different weights for title and body. To
calculate the TF-IDF of the body or the title we need to consider both the
title and the body. To make our job a little easier, let's use a dictionary
with a (document, token) pair as key and the TF-IDF score as the value. We
just need to iterate over all the documents; we can use Counter (from the
collections module), which gives us the frequency of the tokens, calculate
tf and idf, and finally store the score against the (doc, token) pair in
tf_idf. The tf_idf dictionary is for the body; we will use the same logic to
build a dictionary tf_idf_title for the words in the title.
tf_idf = {}
for i in range(N):
    tokens = processed_text[i]
    counter = Counter(tokens + processed_title[i])
    words_count = len(tokens + processed_title[i])   # total words in body + title
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = doc_freq(token)
        idf = np.log(N / (df + 1))
        tf_idf[i, token] = tf * idf                  # store against the (doc, token) pair
Coming to the calculation of the different weights: firstly, we need to
maintain a value alpha, which is the weight for the body; then obviously
1-alpha will be the weight for the title. Now let us delve into a little
math. We discussed that the TF-IDF value of a word will be the same for both
body and title if the word is present in both places. We will maintain two
different tf-idf dictionaries, one for the body and one for the title.
What we are going to do is a little smart: we will calculate the TF-IDF for
the body; multiply the whole body TF-IDF values by alpha; iterate over the
tokens in the title; and replace the body TF-IDF value with the title TF-IDF
value wherever the (document, token) pair exists. Take some time to process
this :P
Flow:
- Calculate TF-IDF for the body for all docs
- Calculate TF-IDF for the title for all docs
- Multiply the body TF-IDF with alpha
- Iterate the title TF-IDF for every (doc, token)
  - if the token is in the body, replace the Body(doc, token) value with the
    value in Title(doc, token)
I know this is not easy to understand at first, but still, let me explain
why the above flow works. We know that the tf-idf for body and title will be
the same if the token is in both places, and the weights that we use for
body and title sum up to one:
TF-IDF = body_tf-idf * body_weight + title_tf-idf*title_weight
body_weight + title_weight = 1
When a token is in both places, then the final TF-IDF will
be the same as taking either body or title tf_idf. That is
exactly what we are doing in the above flow. So, finally,
we have a dictionary tf_idf which has the values as a (doc,
token) pair.
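A sketch of that flow in code; alpha here is the body weight as described above, and the value 0.3 is only an assumed choice:

# Sketch of the weighting flow described above.
# alpha is the weight given to the body; 0.3 is only an assumed value.
alpha = 0.3

# scale every body TF-IDF value by the body weight
for pair in tf_idf:
    tf_idf[pair] *= alpha

# wherever a (doc, token) pair also appears in the title,
# replace the scaled body value with the title value
for pair in tf_idf_title:
    if pair in tf_idf:
        tf_idf[pair] = tf_idf_title[pair]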
Step 5: Ranking using Matching Score
Matching score is the simplest way to calculate the similarity. In this
method, we add the tf_idf values of the tokens that are in the query for
every document. For example, for the query “hello world”, we check in every
document whether these words exist, and if a word exists, its tf_idf value
is added to the matching score of that particular doc_id. In the end, we
sort and take the top k documents.
Mentioned above is the theoretical concept, but as we are using a dictionary
to hold our dataset, what we are going to do is iterate over all of the keys
in the dictionary and check whether the token is present in the query. As
our dictionary has a (document, token) key, when we find a token that is in
the query we will add the document id to another dictionary along with the
tf-idf value. Finally, we will just take the top k documents again.

def matching_score(query):
    tokens = word_tokenize(str(preprocess(query)))  # tokenise the query like the documents (assumed)
    query_weights = {}
    for key in tf_idf:
        if key[1] in tokens:
            try:
                query_weights[key[0]] += tf_idf[key]
            except KeyError:
                query_weights[key[0]] = tf_idf[key]
    return query_weights

key[0] is the document id, key[1] is the token.
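To finish the ranking, the accumulated scores can be sorted and the top k document ids returned; the helper name top_k below is ours:

# Sketch: sort the accumulated scores and keep the top k document ids.
def top_k(query_weights, k=10):
    ranked = sorted(query_weights.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked[:k]]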
Step 6: Ranking using Cosine Similarity
When we have a perfectly working matching score, why do we need cosine
similarity again? Though matching score gives relevant documents, it fails
when we give long queries; it will not be able to rank them properly. What
cosine similarity does is represent all the documents as vectors of tf-idf
tokens and measure the similarity in cosine space (the angle between the
vectors). Sometimes the query is short but still closely related to a
document; in such cases cosine similarity is the best way to find relevance.
Observe the above plot: the blue vectors are the documents and the red
vector is the query. As we can clearly see, though the manhattan distance
(green line) is very high for document d1, the query is still close to
document d1. In such cases, cosine similarity is better as it considers the
angle between the two vectors. Matching score would return document d3, but
that is not very closely related.
Matching score effectively compares distances (straight lines from the
tips of the vectors); the cosine score considers the angle between the
vectors.
Vectorization
To compute any of the above, the simplest way is to
convert everything to a vector and then compute the
cosine similarity. So, let’s convert the query and
documents to vectors. We are going to use the total_vocab variable, which
holds the list of all unique tokens, to generate an index for each token,
and we will use a numpy array of shape (docs, total_vocab) to store the
document vectors.

# Document Vectorization
D = np.zeros((N, total_vocab_size))
for i in tf_idf:
    ind = total_vocab.index(i[1])
    D[i[0]][ind] = tf_idf[i]   # place the score at the token's index (assumed completion)
For the query vector, we need to calculate the TF-IDF values: TF we can
calculate from the query itself, and we can make use of the DF that we
created for the document frequency. Finally, we store the tf-idf values in a
(1, vocab_size) numpy array; the index of each token is decided from the
total_vocab list.

Q = np.zeros((len(total_vocab)))
counter = Counter(tokens)
words_count = len(tokens)
query_weights = {}
for token in np.unique(tokens):
    tf = counter[token] / words_count
    df = doc_freq(token)
    idf = math.log((N + 1) / (df + 1))
    if token in total_vocab:
        Q[total_vocab.index(token)] = tf * idf   # store at the token's index (assumed completion)
Now, all we have to do is calculate the cosine similarity
for all the documents and return the maximum k
documents. Cosine similarity is defined as follows.
np.dot(a, b)/(norm(a)*norm(b))
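A sketch of this final ranking step, using the document matrix D and query vector Q built above; the helper names cosine_sim and rank_by_cosine are ours:

# Sketch: rank documents by cosine similarity against the query vector Q.
import numpy as np
from numpy.linalg import norm

def cosine_sim(a, b):
    denom = norm(a) * norm(b)
    return np.dot(a, b) / denom if denom != 0 else 0.0

def rank_by_cosine(D, Q, k=10):
    scores = [cosine_sim(Q, d) for d in D]      # one score per document vector
    return np.argsort(scores)[::-1][:k]         # indices of the top k documents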
Analysis
I took the text from doc_id 200 (for me) and pasted
some content with long query and short query in both
matching score and cosine similarity.
NAME OF THE SOFTWARE .......................................................(SOFTWARE USED FOR THIS
PROJECT )
DOCUMENTS USED :
Document 1
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly
normal, thank you very much. They were the last people you’d expect to be involved in anything
strange or mysterious, because they just didn’t hold with such nonsense.
Document 2
Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy
man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and
blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so
much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small
son called Dudley and in their opinion there was no finer boy anywhere.
Document 3
When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing
about the cloudy sky outside to suggest that strange and mysterious things would soon be
happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for
work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high
chair.
OUTPUT GIVEN BY THE SOFTWARE
No. | Token | TF Doc 1 | TF Doc 2 | TF Doc 3 | Document count | IDF | TF×IDF Doc 1 | TF×IDF Doc 2 | TF×IDF Doc 3
1 | mr | 0.02222222222 | 0.01176470588 | 0.02985074627 | 3 | 0.125 | 0.003 | 0.001 | 0.004
2 | mrs | 0.02222222222 | 0.01176470588 | 0.02985074627 | 3 | 0.125 | 0.003 | 0.001 | 0.004
3 | dursley | 0.02222222222 | 0.02352941176 | 0.0447761194 | 3 | 0.125 | 0.003 | 0.003 | 0.006
4 | number | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
5 | four | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
6 | privet | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
7 | drive | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
8 | proud | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
9 | say | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
10 | perfectly | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
11 | normal | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
12 | thank | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
13 | much | 0.02222222222 | 0.01176470588 | 0 | 2 | 0.301 | 0.007 | 0.004 | 0
14 | last | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
15 | people | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
16 | you’d | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
17 | expect | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
18 | involved | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
19 | anything | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
20 | strange | 0.02222222222 | 0 | 0.01492537313 | 2 | 0.301 | 0.007 | 0 | 0.004
21 | mysterious | 0.02222222222 | 0 | 0.01492537313 | 2 | 0.301 | 0.007 | 0 | 0.004
22 | didn’t | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
23 | hold | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
24 | nonsense | 0.02222222222 | 0 | 0 | 1 | 0.602 | 0.013 | 0 | 0
25 | director | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
26 | firm | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
27 | called | 0 | 0.02352941176 | 0 | 1 | 0.602 | 0 | 0.014 | 0
28 | grunnings | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
29 | made | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
30 | drills | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
31 | big | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
32 | beefy | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
33 | man | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
34 | hardly | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
35 | neck | 0 | 0.02352941176 | 0 | 1 | 0.602 | 0 | 0.014 | 0
36 | although | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
37 | large | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
38 | mustache | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
39 | thin | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
40 | blonde | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
41 | nearly | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
42 | twice | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
43 | usual | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
44 | amount | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
45 | came | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
46 | useful | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
47 | spent | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
48 | time | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
49 | craning | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
50 | garden | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
51 | fences | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
52 | spying | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
53 | neighbors | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
54 | dursleys | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
55 | small | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
56 | son | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
57 | dudley | 0 | 0.01176470588 | 0.01492537313 | 2 | 0.301 | 0 | 0.004 | 0.004
58 | opinion | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
59 | finer | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
60 | boy | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
61 | anywhere | 0 | 0.01176470588 | 0 | 1 | 0.602 | 0 | 0.007 | 0
62 | woke | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
63 | dull | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
64 | gray | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
65 | tuesday | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
66 | story | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
67 | starts | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
68 | nothing | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
69 | cloudy | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
70 | sky | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
71 | outside | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
72 | suggest | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
73 | things | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
74 | would | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
75 | soon | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
76 | happening | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
77 | country | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
78 | hummed | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
79 | picked | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
80 | boring | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
81 | tie | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
82 | work | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
83 | gossiped | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
84 | away | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
85 | happily | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
86 | wrestled | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
87 | screaming | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
88 | high | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
89 | chair | 0 | 0 | 0.01492537313 | 1 | 0.602 | 0 | 0 | 0.009
90 | air | 0 | 0 | 0 | 1 | 0.602 | 0 | 0 | 0
91 | quality | 0 | 0 | 0 | 1 | 0.602 | 0 | 0 | 0