
DWM IA2 Theory

Data Warehousing and Mining Sem 5 Mod 4,5,6 imp solutions

Uploaded by

ketkikdighe01
Copyright
© All Rights Reserved
Data Warehousing and Mining


Explain accuracy and error measures, each in detail.

Accuracy
- Accuracy is a fundamental metric used to evaluate classification models. It measures the proportion of correct predictions (true positives and true negatives) out of all total predictions.
- Accuracy of a classifier, or recognition rate: percentage of test set tuples that are correctly classified.
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
  Accuracy = Sensitivity × P / (P + N) + Specificity × N / (P + N)
- Confusion Matrix

  Actual class \ Predicted class | C                   | ¬C
  C                              | True Positive (TP)  | False Negative (FN)
  ¬C                             | False Positive (FP) | True Negative (TN)

- Sensitivity (True Positive Recognition Rate)
  Sensitivity = TP / P    # P is the no. of positive tuples
- Specificity (True Negative Recognition Rate)
  Specificity = TN / N    # N is the no. of negative tuples
Error rate
- Percentage of errors made over the whole set of instances used for testing.
  Error rate = 1 - accuracy
  Error rate = (FP + FN) / (P + N)

Precision
- Percentage of tuples which the classifier labeled positive that are actually positive (exactness).
  Precision = TP / (TP + FP)

Recall
- Percentage of positive tuples which the classifier labeled positive (completeness).
  Recall = TP / (TP + FN)

F-measure (harmonic mean of precision and recall)
  F = (2 × Precision × Recall) / (Precision + Recall)

F_β (weighted measure of precision and recall)
- Assigns β times as much weight to recall as to precision.
  F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
  where β is a non-negative real number.
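The measures above can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `classification_metrics` and the confusion-matrix counts are made up for illustration, and it assumes every denominator is non-zero.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Return the measures defined above from confusion-matrix counts."""
    p = tp + fn  # P: number of positive tuples
    n = fp + tn  # N: number of negative tuples
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall = tp / p  # identical to sensitivity
    specificity = tn / n
    f_beta = ((1 + beta ** 2) * precision * recall) / (beta ** 2 * precision + recall)
    return {
        "accuracy": accuracy,
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# Hypothetical counts, chosen only for illustration.
m = classification_metrics(tp=90, fn=10, fp=30, tn=70)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With beta=1 the last value is the plain F-measure; passing beta=2 weights recall twice as much as precision.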
Q. Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?

- The decision tree built may overfit the training data: it can have too many branches, some of them built because of noise or outliers. To avoid overfitting, prune the tree to prevent excessive specificity.
- Two types are:
  - Pre-pruning: It involves halting the growth of the tree during construction, before it reaches its maximum size, to prevent overfitting.
  - Post-pruning: It involves removal of branches and nodes that do not significantly improve the model's accuracy from a fully grown tree.
- Drawbacks
  - A separate set of tuples used for pruning evaluation may not represent the training data accurately.
  - Using a separate set of tuples for pruning evaluation reduces the number of tuples available for creating and testing the tree, which can affect overall performance.
    (This issue is more relevant to machine learning, and less to data mining, due to larger datasets.)

Explain Association Rule Mining: Key Concepts, Steps and Applications.

- Association Rule Mining is a fundamental technique in data mining that discovers interesting relationships, patterns or associations among large sets of data items.
- It is commonly used in market basket analysis, where the goal is to identify which products are frequently bought together in transactions.
Key Concepts
- Itemset: A set of items.
- Support: Proportion of transactions in the dataset that contain a particular itemset. It measures how frequently an itemset appears in the dataset.
  Support(X) = (No. of transactions containing X) / (Total number of transactions)
- Confidence: Given an association rule X → Y, confidence measures the probability that a transaction containing X also contains Y.
  Confidence(X → Y) = Support(X ∪ Y) / Support(X)
- Large / Frequent Itemsets: Itemsets that satisfy a minimum support threshold.
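As a worked illustration of the two formulas, the sketch below computes support and confidence over a tiny, made-up transaction list. The data and the helper names `support` and `confidence` are assumptions for illustration only.

```python
# Hypothetical transactions, one set of items per basket.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Confidence(X -> Y) = Support(X U Y) / Support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


print(support({"milk"}, transactions))                # 4 of 5 baskets
print(support({"milk", "bread"}, transactions))       # 3 of 5 baskets
print(confidence({"milk"}, {"bread"}, transactions))  # 3 of the 4 milk baskets
```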
Steps
- Association rule mining is divided into two major steps:
  1. Finding Frequent Itemsets
  2. Generating Association Rules
- Finding Frequent Itemsets
  - The goal of this step is to find all itemsets in the dataset that appear at least as frequently as a specified minimum support threshold.
  - The most commonly used algorithm for this is the Apriori Algorithm.
  - The Apriori property states that if an itemset is frequent, then all of its subsets must also be frequent.

- Steps in Apriori Algorithm
  - Step 1, Candidate Generation: Start with all single items and calculate their support. Items whose support is below the minimum threshold are eliminated.
  - Step 2, Frequent Itemset Generation: Generate larger itemsets by combining smaller frequent itemsets and calculate their support.
  - Step 3, Pruning: Eliminate candidate itemsets that do not meet the minimum support threshold. Use the Apriori property to eliminate larger itemsets early if any of their subsets are not frequent.
  - Step 4, Repeat: The process continues, generating itemsets of increasing size, until no more itemsets can be generated.
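The four steps above can be sketched in plain Python as follows. This is a small teaching sketch under stated assumptions, not an optimized implementation: the function name `apriori`, the transaction data and the 0.4 threshold are all made up, and support is recounted by scanning every transaction.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: single items that meet the minimum support.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:  # Step 4: repeat until nothing new is generated.
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Step 2: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Step 3: drop candidates with an infrequent (k-1)-subset,
        # then drop those below the minimum support.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical transactions, for illustration only.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]
frequent = apriori(transactions, min_support=0.4)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On this data no 3-itemset survives: {milk, bread, butter} is pruned because its subset {bread, butter} is not frequent.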
- Generating Association Rules
  - Once frequent itemsets are identified, the next step is to generate association rules. Each rule must meet the minimum confidence threshold to be considered.
  - For each frequent itemset, rules are formed by splitting the itemset into an antecedent (X) and a consequent (Y), giving the form X → Y.
- Steps in Rule Generation
  - Step 1, Generate Rules: For each frequent itemset, generate all possible association rules of the form X → Y.
  - Step 2, Calculate Confidence: For each rule, calculate confidence and retain only those that meet the minimum confidence threshold.
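The two rule-generation steps can be sketched as below. The function name `generate_rules` and the support values are hypothetical, written by hand for illustration. The sketch assumes the input dictionary already holds the support of every frequent itemset and of all its subsets, which the Apriori property guarantees when the first phase is complete.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence):
    """frequent maps frozenset itemsets to support; returns (X, Y, confidence) triples."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        # Step 1: every non-empty proper subset X gives a rule X -> (itemset - X).
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                y = itemset - x
                # Step 2: Confidence(X -> Y) = Support(X U Y) / Support(X).
                conf = sup / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, y, conf))
    return rules


# Hypothetical supports, for illustration only.
frequent = {
    frozenset({"milk"}): 0.8,
    frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.4,
    frozenset({"milk", "bread"}): 0.6,
    frozenset({"milk", "butter"}): 0.4,
}
rules = generate_rules(frequent, min_confidence=0.7)
for x, y, conf in rules:
    print(sorted(x), "->", sorted(y), round(conf, 2))
```

Here {butter} → {milk} is kept with confidence 1.0, while {milk} → {butter} is dropped because its confidence is only 0.5.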
Applications
- Market Basket Analysis
- Recommendation Systems
- Fraud Detection
- Customer Segmentation

Q. Explain Market Basket Analysis with example.

- Market Basket Analysis is a modelling technique, also called affinity analysis, that helps in identifying which items are likely to be purchased together.
- The market basket problem assumes we have some large number of items, e.g. bread, milk, etc. Customers buy the subset of items they need, and the marketer learns which items customers have taken together.
- Marketers use this information to put items in different positions.
- E.g. If someone buys a packet of milk, they also tend to buy bread at the same time:
  Milk → Bread
- Market Basket Analysis algorithms are straightforward. Difficulties arise mainly in dealing with large amounts of transactional data, where applying the algorithm may give rise to a large number of rules, many of which are trivial in nature.

Q. State Apriori Principle, Join rule and Prune rule.

- Apriori Principle: All non-empty subsets of a frequent itemset must also be frequent.
- Join Rule: C_k, the set of candidate k-itemsets, is generated by joining L_(k-1) with itself.
  (L_1: frequent 1-itemsets, L_k: frequent k-itemsets)
- Prune Rule: L_k, the set of frequent k-itemsets, is extracted from C_k by pruning it, getting rid of all non-frequent k-itemsets in C_k.

Differentiate between Agglomerative clustering and Divisive clustering.

No. | Parameter            | Agglomerative Clustering                                               | Divisive Clustering
1   | Category             | Bottom-up approach                                                     | Top-down approach
2   | Approach             | Starts with individual points, merges clusters step by step           | Starts with all points in one cluster, splits it recursively
3   | Complexity level     | More computationally expensive due to pairwise distance calculation   | Less computationally expensive, deals only with the clusters left after each split
4   | Outliers             | Better at handling outliers by absorbing them into larger clusters    | May create separate clusters around outliers, leading to suboptimal results
5   | Interpretability     | Easier to interpret, with clear dendrograms showing the merging process | Harder to interpret due to the splitting process and required stopping criteria
6   | Implementation       | Implemented in Scikit-learn with multiple linkage methods             | Not readily implemented in Scikit-learn
7   | Example applications | Image segmentation, customer segmentation, document clustering        | Market segmentation, anomaly detection, biological classification

Write a short note on taxonomy of web mining.

- Web Mining refers to the application of data mining techniques to web data.
- It is a process of discovering useful patterns and insights from the vast amount of data available on the web.
- It can be broadly classified into three categories:

  Web Mining
  - Web Content Mining
  - Web Structure Mining
  - Web Usage Mining

Web Content Mining
- Focuses on extracting and analyzing data from the content of web pages, which can be text, image, audio or video.
- It includes:
  - Web Crawlers
  - Harvest System
  - Virtual Web View
  - Personalization

Web Structure Mining
- Focuses on analysis of the link structure of the web, such as hyperlinks between web pages.
- It includes:
  - PageRank algorithm
  - CLEVER (HITS algorithm)

Web Usage Mining
- Focuses on user interaction data with a website, such as browsing patterns, and analyses user logs and web server logs.

What is web mining? List approaches used to structure the web pages to improve the effectiveness of search engines and crawlers.

- Web Mining refers to the application of data mining techniques to web data.
- It is a process of discovering useful patterns and insights from the vast amount of data available on the web.
- Categories:
  - Web Content Mining
  - Web Structure Mining
  - Web Usage Mining
- The approaches used to structure web pages to improve the effectiveness of search engines are:
  - Web Crawlers
    - Periodic Crawlers
    - Incremental Crawlers
    - Focused Crawlers
  - Harvest System
  - Virtual Web View
  - Personalization

Write short note on Crawlers and Personalization.

Crawlers
- A web crawler is an automated program that scans or crawls through the internet to create an index of data.
- Search engines make use of web crawlers to collect information about data on public web pages. Their primary purpose is to collect data so that when users enter a search term they can quickly provide the relevant websites.
- Some popular web crawlers are:
  - Googlebot
  - Scrapy
  - StormCrawler
  - Elasticsearch River Web
- Types of crawler
  - Periodic Crawler: In order to refresh its collection, it periodically replaces the old documents with newly downloaded documents. As it is activated periodically, every time it is activated it replaces the existing index.
  - Incremental Crawler: Incrementally refreshes the existing collection of pages by visiting them frequently, and updates the index incrementally instead of replacing it.
  - Focused Crawler: Downloads the web pages that are related to each other, i.e. it visits pages related to a topic of interest.

Personalization
- Web personalization is the process of customizing a web site to the needs of each specific user or set of users.
- The key information that is required for suggesting similar web pages comes from:
  - Knowledge of other users who have also visited the current page
  - The structure of the web page or the user's personal profile information
- The web personalization process can be divided into four phases:
  - Data Collection
  - Pre-processing of web data
  - Analysis of web data
  - Decision making / recommendation
- Types of personalization
  - Content Based Filtering
  - Collaborative Filtering
  - Model Based Technique
  - Memory Based Technique

Q. Explain Web Usage Mining.

- Web Usage Mining is the process of extracting patterns and information from server logs to gain insights on user activity.
- Web server logs are treated as the raw data, from which meaningful data are extracted and patterns are identified.

Phases of Web Usage Mining

[Diagram: Raw server log file → Preprocessing (session reconstruction heuristics) → User session file → Pattern Discovery (Apriori, GSP, SPADE) → Rules and patterns → Pattern Analysis → Interesting knowledge (applications)]

1. Preprocessing
2. Pattern Discovery
3. Pattern Analysis

Preprocessing
- Preprocessing consists of converting the usage, content and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.
  - Usage Preprocessing
  - Content Preprocessing
  - Structure Preprocessing

Pattern Discovery
- Draws upon methods and algorithms developed in several fields like statistics, data mining, pattern recognition and machine learning.

- Steps (pattern discovery techniques)
  - Statistical Analysis
  - Association Rules
  - Clustering
  - Classification
  - Sequential Patterns
  - Dependency Modeling

Pattern Analysis
- Filter out uninteresting rules and patterns from the set found in pattern discovery.
- Load usage data into a data cube to perform OLAP operations.
- Visualization techniques, like graphs or assigning colors to different values, can highlight overall patterns.
Explain PageRank algorithm in detail.

- PageRank produces a ranking independent of a user's query.
- The importance of a web page is determined by the number of other important web pages that are pointing to that page and the number of out-links from those other web pages.
- PageRank is an algorithm used by Google Search to rank websites in their search engine results.
- A page will have a high PageRank if:
  - There are many pages pointing to it.
  - There are some pages with high ranks pointing to it.
- Damping factor d: The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability that the person will continue is the damping factor d.

- The PageRank of page u is computed as follows:

  PageRank(u) = (1 - d) + d × Σ over (v, u) ∈ E of [ PageRank(v) / OutDegree(v) ]

  where PageRank(v) = PageRank of page v
        OutDegree(v) = No. of links going out of page v
        d = Damping factor, which can be any real number between 0 and 1 (generally set to 0.85)

Algorithm
- Assume that there are n linked pages.
- Let S = (V, E), where V = set of pages and E = set of hyperlinks between pages.
- Initialize PageRank(p) = 1/n for all pages p ∈ V.
- Repeat until the PageRank vector converges (i.e. stabilizes or does not change):
    For all pages u ∈ V:
      PageRank(u) = (1 - d) + d × Σ over (v, u) ∈ E of [ PageRank(v) / OutDegree(v) ]
- Return the PageRank vector.
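The iteration above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function name `pagerank`, the three-page graph, the tolerance and the 1/n starting value are illustrative choices, every page is assumed to have at least one out-link, and every link target is assumed to be a key of the dictionary.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # assumed starting value

    # incoming[u] lists every page v with a hyperlink (v, u).
    incoming = {p: [] for p in pages}
    for v, outs in links.items():
        for u in outs:
            incoming[u].append(v)

    for _ in range(max_iter):
        new_rank = {
            u: (1 - d) + d * sum(rank[v] / len(links[v]) for v in incoming[u])
            for u in pages
        }
        # Stop once no rank moves by more than the tolerance.
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page C ends up ranked highest because both A and B point to it, and A follows because it receives all of C's rank. With this form of the formula the ranks sum to n rather than 1.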
Basic Diagram
[Figure: example graph of linked pages]

Q. Explain HITS algorithm.

- Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web pages, developed by Jon Kleinberg.
- The HITS algorithm attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
- It is based on mutually recursive facts:
  - Hubs point to lots of authorities.
  - Authorities are pointed to by lots of hubs.
- Authorities are pages that are recognized as providing significant, trustworthy and useful information on a topic.
  [Figure: Authority]
- Hubs are index pages that provide lots of useful links to relevant content (topic authorities).
  [Figure: Hub]

Algorithm
- Assume that there are n linked pages.
- Let S = (V, E), where V = set of pages and E = set of hyperlinks between pages.
- Initialize HUB = AUTH = (1, 1, ..., 1) ∈ R^n.
- Repeat until HUB and AUTH converge (i.e. stabilize or do not change):
    Normalize HUB and AUTH
    For all pages p ∈ V:
      HUB(p) = Σ over (p, q) ∈ E of AUTH(q)
      AUTH(p) = Σ over (q, p) ∈ E of HUB(q)
- Return HUB and AUTH.
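The hub and authority updates can be sketched as follows. The function name `hits` and the four-page graph are made up for illustration. Two details are assumptions of this sketch rather than something the notes specify: normalization is to unit Euclidean length and happens at the end of each pass, and authorities are updated from the freshly computed hub scores.

```python
import math


def hits(links, tol=1e-9, max_iter=1000):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # HUB(p) = sum of AUTH(q) over hyperlinks (p, q).
        new_hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # AUTH(p) = sum of HUB(q) over hyperlinks (q, p).
        new_auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in links[q]:
                new_auth[p] += new_hub[q]
        # Normalize both vectors to unit Euclidean length.
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        new_hub = {p: v / h_norm for p, v in new_hub.items()}
        new_auth = {p: v / a_norm for p, v in new_auth.items()}
        change = max(
            max(abs(new_hub[p] - hub[p]) for p in pages),
            max(abs(new_auth[p] - auth[p]) for p in pages),
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical graph: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

In this graph C is the strongest authority because three pages point to it, and A is the strongest hub because it points to both B and C.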

Explain various similarity measures used in clustering.

Partitioning Methods: These methods divide datasets into a predefined number of clusters.
- Two types of Partitioning Methods
  - k-Means: The algorithm initializes k centroids, assigns each data point to the nearest centroid, and then updates the centroids based on the assigned points.
  - k-Medoids or PAM (Partitioning Around Medoids): A clustering algorithm that also partitions datasets into k clusters, but uses actual data points as centers instead of calculating centroids.
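The k-Means loop described above (initialize, assign, update) can be sketched in plain Python. This is a minimal sketch: the function name `kmeans` and the sample points are made up, squared Euclidean distance is assumed as the similarity measure, and the first k points are used as the initial centroids, which is a simplification chosen here to keep the run deterministic.

```python
def kmeans(points, k, max_iter=100):
    """Cluster tuples of floats into k groups; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its points.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

k-Medoids would differ only in the update step, where the new center must be one of the cluster's own points rather than the mean.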
Hierarchical Methods: These methods create a hierarchical decomposition of the given dataset to create a tree-like structure of clusters. The results can be visualized in a dendrogram, allowing flexible selection of the number of clusters.
- Two types of Hierarchical Methods
  - Agglomerative Approach: It is a bottom-up approach. In this, we start with individual points and merge clusters step by step.
  - Divisive Approach: It is a top-down approach. In this, we start with all points in one cluster and split it recursively.

Density-Based Methods: These methods identify clusters as dense regions in data space.

Grid-Based Methods: These methods divide data space into a finite number of cells forming a grid structure. Clusters are then identified based on the density of points within cells, making it computationally efficient.

Model-Based Methods: These methods assume that data is generated from a mixture of underlying probability distributions.

Constraint-Based Methods: These methods incorporate user-defined constraints into the clustering process, allowing more tailored results based on specific application requirements.
