CSE3024 – Web Mining
WINTER SEMESTER – 2019-2020
G2 SLOT
E-RECORD
Assessment No.: 03
Submitted By
Deep Agrawal
Reg. No.: 18BCE0518
B.Tech. (Branch) – II Year
SCOPE
VELLORE INSTITUTE OF TECHNOLOGY
VELLORE – 632 014
TAMIL NADU
INDIA
TF-IDF, SNA, PageRank and HITS
1. Write a program to extract the contents (excluding any tags) from the following
five websites
https://en.wikipedia.org/wiki/Web_mining
https://en.wikipedia.org/wiki/Data_mining
https://en.wikipedia.org/wiki/Artificial_intelligence
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Mining
Save the content in five separate .doc files. Considering a vector space model,
perform the following operations for the query “Mining of large data”.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup as soup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
sw = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
urls = ['https://en.wikipedia.org/wiki/Web_mining',
        'https://en.wikipedia.org/wiki/Data_mining',
        'https://en.wikipedia.org/wiki/Artificial_intelligence',
        'https://en.wikipedia.org/wiki/Machine_learning',
        'https://en.wikipedia.org/wiki/Mining']
d = []
for _ in range(5):
    # Fetch the page and extract the text of all <p> tags (no markup)
    urlOpen = urllib.request.urlopen(urls[_])
    urlHTML = urlOpen.read()
    urlSoup = soup(urlHTML, 'html.parser')
    pageText = ''
    for i in urlSoup.findAll('p'):
        pageText = pageText + i.text
    pageText = pageText.lower()
    words = word_tokenize(pageText)
    # Stop-word removal
    terms = [w for w in words if not w in sw]
    # Lemmatization
    lemTerms = ''
    for w in terms:
        lemTerms += lemmatizer.lemmatize(w) + ' '
    # 'w+' creates the file on a fresh run ('r+' fails if it does not exist)
    with open('doc{}.doc'.format(_), 'w+', encoding='utf-8') as doc:
        doc.write(lemTerms)
        doc.seek(0)
        d.append(doc.read())
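Note: Wikipedia sometimes rejects requests that carry urllib's default User-Agent. If the urlopen call above fails with an HTTP 403, a minimal workaround is to send an explicit header (a sketch; fetch is a hypothetical helper, not part of the program above):

import urllib.request

# Hypothetical helper: fetch a URL with an explicit User-Agent header,
# since some servers reject urllib's default one.
def fetch(url):
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()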
q = 'mining of large data'
d.append(q)
all_t = []  # vocabulary: every distinct term across the 5 docs and the query
t = []      # list of term lists, one per text
t1 = []     # de-duplicated term list for each text
cd = []     # raw count of each vocabulary term in each text
dtc = []    # number of distinct terms in each text
for _ in range(len(d)):
    words = word_tokenize(d[_])
    # Stop-word removal
    t.append([w for w in words if not w in sw])
    for i in t[_]:
        all_t.append(i)
all_t = list(set(all_t))
for _ in range(len(d)):
    cd.append([0]*len(all_t))
    for i in range(len(all_t)):
        for j in range(len(t[_])):
            if all_t[i] == t[_][j]:
                cd[_][i] = cd[_][i] + 1
for _ in range(len(d)):
    t1.append(list(set(t[_])))
    dtc.append(len(t1[_]))
table = [['aaa', 0, 0, 0, 0, 0]]*len(all_t)
for _ in range(len(all_t)):
    table[_] = [all_t[_], cd[0][_], cd[1][_], cd[2][_], cd[3][_], cd[4][_], cd[5][_]]
Bag-of-Words (Document set and Query)
from tabulate import tabulate
print("Bag-of-words representation:\n")
print(tabulate(table, headers=['Term', 'Doc1', 'Doc2', 'Doc3', 'Doc4', 'Doc5', 'Query']))
TF (Document set and Query)
tf = []
for _ in range(len(d)):
    tf.append([0]*len(all_t))
    for i in range(len(all_t)):
        for j in range(len(t[_])):
            if all_t[i] == t[_][j]:
                tf[_][i] = tf[_][i] + 1
for i in range(len(tf)):
    for j in range(len(tf[i])):
        # Normalize by the number of distinct terms in each text
        tf[i][j] = round(tf[i][j] / dtc[i], 5)
for _ in range(len(all_t)):
    table[_] = [all_t[_], tf[0][_], tf[1][_], tf[2][_], tf[3][_], tf[4][_], tf[5][_]]
print("Term Frequency :\n")
print(tabulate(table, headers=['Term', 'Doc1', 'Doc2', 'Doc3', 'Doc4',
'Doc5', 'Query']))
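For reference, the term frequency above normalizes each raw count by dtc[i], the number of distinct terms in text i (a variant of the more common normalization by total token count):

\[ \mathrm{tf}(t, d_i) = \frac{\mathrm{count}(t, d_i)}{\text{number of distinct terms in } d_i} \]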
IDF (Document set)
idf = [0]*len(all_t)
# Document frequency: in how many of the 6 texts each term appears
for i in range(len(all_t)):
    for j in range(len(tf)):
        if tf[j][i] != 0:
            idf[i] += 1
import math
for i in range(len(idf)):
    # len(d) = 6 here: the query is counted as a sixth text
    idf[i] = round(math.log(len(d) / idf[i], 10), 3)
for _ in range(len(all_t)):
    table[_] = [all_t[_], idf[_]]
print("Inverse Doc Freq :\n")
print(tabulate(table, headers=['Terms', 'IDF']))
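The value computed above is the base-10 logarithm of the number of texts over the document frequency of each term, with N = len(d) = 6 because the query was appended to the document list:

\[ \mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad N = 6 \]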
TF-IDF (Document set)
# TF-IDF: reuse the term frequencies computed above and weight them by IDF
tfidf = []
for i in range(len(tf)):
    tfidf.append([round(tf[i][j] * idf[j], 5) for j in range(len(all_t))])
for _ in range(len(all_t)):
    table[_] = [all_t[_], tfidf[0][_], tfidf[1][_], tfidf[2][_], tfidf[3][_], tfidf[4][_]]
print("TF - IDF (Document set and Query):\n")
print(tabulate(table, headers=['Term', 'Doc1', 'Doc2', 'Doc3', 'Doc4',
'Doc5']))
TF-IDF (Document set and Query)
for _ in range(len(all_t)):
    table[_] = [all_t[_], tfidf[0][_], tfidf[1][_], tfidf[2][_], tfidf[3][_], tfidf[4][_], tfidf[5][_]]
print("TF - IDF (Document set and Query):\n")
print(tabulate(table, headers=['Term', 'Doc1', 'Doc2', 'Doc3', 'Doc4', 'Doc5', 'Query']))
Normalized TF-IDF (Document set and Query)
# L2-normalize each TF-IDF vector so dot products give cosine similarity
for i in range(len(tfidf)):
    temp = 0
    for j in range(len(tfidf[i])):
        temp += tfidf[i][j] ** 2
    temp = temp ** (1/2)
    for k in range(len(tfidf[i])):
        tfidf[i][k] = tfidf[i][k] / temp
for _ in range(len(all_t)):
    table[_] = [all_t[_], tfidf[0][_], tfidf[1][_], tfidf[2][_], tfidf[3][_], tfidf[4][_], tfidf[5][_]]
print("TF - IDF (Normalized Document set and Query):\n")
print(tabulate(table, headers=['Term', 'Doc1', 'Doc2', 'Doc3', 'Doc4',
'Doc5', 'Query']))
Cosine Similarity
cosine = [0]*6
for i in range(len(tfidf)):
    for j in range(len(tfidf[i])):
        cosine[i] += tfidf[i][j] * tfidf[5][j]
    cosine[i] = round(cosine[i], 5)
for i in range(len(tfidf)):
    # Index 5 is the query itself, so cosine[5] comes out as 1
    print("Cosine similarity of Doc{}: ".format(i) + str(cosine[i]))
Euclidean Distance
eu = [0]*6
for i in range(len(tfidf)):
    for j in range(len(tfidf[i])):
        eu[i] += (tfidf[i][j] - tfidf[5][j]) ** 2
    eu[i] = round(eu[i] ** (1/2), 5)
for i in range(len(tfidf)):
    print("Euclidean Distance of Doc{}: ".format(i) + str(eu[i]))
Document Ranking (Display Order)
cosineSort = cosine[:]
cosineSort.sort()
cosineSort = cosineSort[::-1]
cosineSort = cosineSort[1:]  # drop the query's similarity with itself (1.0)
for i in range(len(cosineSort)):
    index = cosine.index(cosineSort[i])
    print("{}. Doc{} - Link: {}".format(i + 1, index, urls[index]))
Document Similarity (Among Documents)
# Pairwise cosine similarity: every document against document j
for j in range(len(tfidf) - 1):
    docsim = [0]*5
    for i in range(len(tfidf) - 1):
        for k in range(len(tfidf[i])):
            docsim[i] += tfidf[i][k] * tfidf[j][k]
        docsim[i] = round(docsim[i], 5)
    print(docsim)
2. Find the different types of centrality (degree, betweenness, closeness) and
prestige (degree, proximity) using the graph dataset available at the following link.
http://snap.stanford.edu/data/wiki-Vote.txt.gz
nodes = 7115    # Number of nodes with outgoing links (no. of voters)
edges = 103689  # Number of edges (no. of votes)
with open('Wiki-Vote.txt', 'r') as dataset:
    # Skip the comment lines at the top of the file (they start with '#')
    d = ' '.join(line for line in dataset if not line.startswith('#')).split()
f = d[0::2]   # source node of each edge
fset = []
for _ in f:
    if _ not in fset:
        fset.append(_)
ffreq = []
for i in range(len(fset)):
    ffreq.append(f.count(fset[i]))
t = d[1::2]   # target node of each edge
deg = []
for _ in range(len(fset)):
    # Degree centrality: out-degree normalized by (n - 1)
    deg.append(round(ffreq[_]/(nodes - 1), 5))
for _ in range(len(fset)):
    print('Degree Centrality of node {} is: {}'.format(fset[_], deg[_]))
print("Degree Centrality of all other nodes is 0")
tset = []
for _ in t:
    if _ not in tset:
        tset.append(_)
tfreq = []
for i in range(len(tset)):
    tfreq.append(t.count(tset[i]))
pdeg = []
for _ in range(len(tset)):
    # Degree prestige: in-degree normalized by (n - 1)
    pdeg.append(round(tfreq[_]/(nodes - 1), 5))
for _ in range(len(tset)):
    print('Degree Prestige of node {} is: {}'.format(tset[_], pdeg[_]))
print("Degree Prestige of all other nodes is 0")
3. Write a program to display the PageRank of the given directed graph
representing a web of seven pages with damping factor 0.9. Input to the program
must be the adjacency matrix or adjacency list of the given web graph along with
the damping factor and threshold value (stopping criterion: ε = 0.05). The
program must print the result after each of the following scenarios:
import numpy as np
l = [[0,1,1,1,0,0,0], [0,0,0,1,1,0,0], [0,0,0,0,0,1,0], [0,0,1,0,0,1,1],
[0,0,0,1,0,0,1], [0,0,0,0,0,0,0], [0,0,0,0,0,1,0]]
a. Handling the nodes with no outgoing links
# Replace each all-zero (dangling) row with a uniform 1/7 distribution
for i in range(len(l)):
    temp = 0
    for j in range(len(l[i])):
        temp += l[i][j]
    if temp == 0:
        l[i] = [round(1/len(l[i]), 5)]*len(l[i])
print(l)
b. Stochastic matrix formation
# Normalize each row so it sums to 1 (row-stochastic matrix)
for i in range(len(l)):
    temp = 0
    for j in range(len(l[i])):
        temp += l[i][j]
    for k in range(len(l[i])):
        l[i][k] = round(l[i][k] / temp, 5)
print(l)
c. Page rank of all the seven nodes after each iteration
d = 0.9    # damping factor
ep = 0.05  # threshold (epsilon)
e = np.matrix([[1/7], [1/7], [1/7], [1/7], [1/7], [1/7], [1/7]])  # teleport vector
p = [e]    # p[k] is the rank vector after k iterations
k = 1
lt = np.matrix(l).transpose()
for i in range(100):
    # Power iteration: p[k] = (1 - d) * e + d * L^T * p[k-1]
    res1 = np.matrix(e) * (1 - d)
    res2 = lt * p[i]
    res2 = res2 * d
    p.append(np.add(res1, res2))
    res3 = np.subtract(p[k], p[k-1])
    temp = np.linalg.norm(res3)
    if temp < ep:
        break
    else:
        k += 1
print(p[k])
d. Total iteration count until the stopping criterion is met.
print("Number of iterations: " + str(k))
4. Write a program to implement the HITS algorithm for the graph shown in Question No.
3 and display the final authority score and hub score of all the nodes after the
stopping criterion is attained. (Note: consider the same criterion as mentioned for
Question No. 3.)
l = np.matrix([[0,1,1,1,0,0,0], [0,0,0,1,1,0,0], [0,0,0,0,0,1,0],
               [0,0,1,0,0,1,1], [0,0,0,1,0,0,1],
               [0,0,0,0,0,0,0], [0,0,0,0,0,1,0]])
lt = l.transpose()
n = 7
ep = 0.05
a = np.matrix([1] * n).transpose()  # initial authority scores
h = np.matrix([1] * n).transpose()  # initial hub scores
k = 1
ltl = lt * l  # authority update matrix (L^T L)
llt = l * lt  # hub update matrix (L L^T)
ak = [a]
hk = [h]
for _ in range(100):  # cap iterations; the break below enforces the epsilon criterion
    # One HITS step, then L2-normalize both score vectors
    ak.append(ltl * ak[k-1])
    hk.append(llt * hk[k-1])
    anorm = np.linalg.norm(ak[k])
    ak[k] = np.matrix(ak[k]) / anorm
    hnorm = np.linalg.norm(hk[k])
    hk[k] = np.matrix(hk[k]) / hnorm
    if (np.linalg.norm(np.subtract(ak[k], ak[k-1])) < ep and
            np.linalg.norm(np.subtract(hk[k], hk[k-1])) < ep):
        break
    else:
        k += 1
print(ak[k])
print(hk[k])
print("Number of iterations: " + str(k))