0% found this document useful (0 votes)
165 views7 pages

Efficient Sequential Pattern Mining

1) The document outlines PrefixSpan, an algorithm for efficiently mining sequential patterns. 2) PrefixSpan uses a pattern-growth approach called prefix-projection, which avoids candidate generation and allows mining patterns in a single database scan. 3) The algorithm recursively projects a sequence database into smaller projected databases based on prefixes, and grows subsequences within each projected database.

Uploaded by

Ghiffari Agsarya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
165 views7 pages

Efficient Sequential Pattern Mining

1) The document outlines PrefixSpan, an algorithm for efficiently mining sequential patterns. 2) PrefixSpan uses a pattern-growth approach called prefix-projection, which avoids candidate generation and allows mining patterns in a single database scan. 3) The algorithm recursively projects a sequence database into smaller projected databases based on prefixes, and grows subsequences within each projected database.

Uploaded by

Ghiffari Agsarya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

Outline

`
Mining Sequential Patterns
PrefixSpan: Mining Sequential Patterns

Problem statement
Efficiently by Prefix-Projected Pattern Definitions & examples
Growth Strategies
PrefixSpan algorithm
Authors:
Jian Pei, Jiawei Han, Behzad Mortazavi-Asi, Helen Pinto Qiming Chen, Motivation
Umeshwar Dayal, Mei-Chun Hsu
Definitions & examples
Algorithm
Example
Performance study
Presenter: Conclusions
Wojciech Stach
2

Sequential Pattern Mining Sequential Pattern Mining


` `
Given Find all the frequent subsequences, i.e. the
a set of sequences, where each sequence consists of a list subsequences whose occurrence frequency in the
of elements and each element consists of set of items set of sequences is no less than min_support
user-specified min_support threshold

Solution 53 frequent subsequences


<a><aa> <ab> <a(bc)> <a(bc)a> <aba> <abc>
<a(abc)(ac)d(cf)> - 5 elements, 9 items id Sequence
id Sequence <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <ac>
10 <a(abc)(ac)d(cf)> 10 <a(abc)(ac)d(cf)> <aca> <acb> <acc> <ad> <adc> <af>
20 <(ad)c(bc)(ae)> <a(abc)(ac)d(cf)> - 9-sequence 20 <(ad)c(bc)(ae)> <b> <ba> <bc> <(bc)> <(bc)a> <bd> <bdc> <bf>
30 <(ef)(ab)(df)cb> 30 <(ef)(ab)(df)cb> <c> <ca> <cb> <cc>
40 <eg(af)cbc> <a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)> 40 <eg(af)cbc> <d> <db> <dc> <dcb>
<a(abc)(ac)d(cf)> <a(ac)(abc)d(cf)>
<e> <ea> <eab> <eac> <eacb> <eb> <ebc> <ec>
<ecb> <ef> <efb> <efc> <efcb>
min_support = 2 <f> <fb> <fbc> <fc> <fcb>
3 4
Subsequence vs. super sequence Sequence Support Count
` `
Given two sequences =<a1a2an> and A sequence database is a set of tuples <sid, s>
=<b1b2bm> A tuple <sid, s> is said to contain a sequence , if
is called a subsequence of , denoted as , is a subsequence of s, i.e., s
if there exist integers 1j1<j2<<jn m such that The support of a sequence is the number of
a1bj1, a2 bj2,, anbjn tuples containing
is a super sequence of

id Sequence 1=<a> support(1) = 4


10 <a(abc)(ac)d(cf)>
=<a(abc)(ac)d(cf)> =<a(abc)(ac)d(cf)> 2=<ac> support(2) = 4
20 <(ad)c(bc)(ae)>
1=<aa(ac)d(c)> 4=<df(cf)> 30 <(ef)(ab)(df)cb> 3=<(ab)c> support(3) = 2
40 <eg(af)cbc>
2=<(ac)(ac)d(cf)> 5=<(cf)d>

3=<ac> 6=<(abc)dcf>
5 6

Strategies Outline
` `
Apriori-property based Mining Sequential Patterns
AprioriSome (1995) Problem statement
AprioriAll (1995) Definitions & examples
DynamicSome (1995) Strategies
GSP (1996) PrefixSpan algorithm
Motivation
Regular expression constraints Definitions & examples
SPIRIT (1999) Algorithm
Example
Data projection based Performance study
FreeSpan (2000) Conclusions

7 8
Motivation and Background Prefix
` `
Shortcomings of Apriori-like approaches Given two sequences =<a1a2an> and
Potentially huge set of candidate sequences =<b1b2bm>, mn
Multiple scans of databases
Sequence is called a prefix of if and only if:


Difficulties at mining long sequential patterns
bi = ai for i m-1;
FreeSpan (Frequent pattern-projected Sequential pattern bm am;
mining) pattern growth method All the items in (am bm) are alphabetically after those in
General idea is to use frequent items to recursively project bm
sequence databases into a smaller projected databases and
grow subsequence fragments in each projected database

=<a(abc)(ac)d(cf)> =<a(abc)(ac)d(cf)>
PrefixSpan (Prefix-projected Sequential pattern mining)
Less projections and quickly shrinking sequences =<a(abc)a> =<a(abc)c>

9 10

Projection Postfix
` `
Given sequences and , such that is a Let =<a1a2an> be the projection of w.r.t.
subsequence of . prefix =<a1a2am-1am> (m n)
A subsequence of sequence is called a Sequence =<amam+1an> is called the postfix of
projection of w.r.t. prefix if and only if w.r.t. prefix , denoted as = / , where
has prefix ; am=(am-am)
There exist no proper super-sequence of such that We also denote =
is a subsequence of and also has prefix

=<a(abc)(ac)d(cf)> =<a(abc)(ac)d(cf)>
=<(bc)a>
=<a(abc)a>
=<(bc)(ac)d(cf)>
=<(_c)d(cf)>

11 12
PrefixSpan Algorithm PrefixSpan Algorithm (2)
` `
Input: A sequence database S, and the minimum support Method
threshold min_sup
1. Scan S| once, find the set of frequent items b
Output: The complete set of sequential patterns such that:
a) b can be assembled to the last element of to form a
Method: Call PrefixSpan(<>,0,S) sequential pattern; or
b) <b> can be appended to to form a sequential pattern.
Subroutine PrefixSpan(, l, S|) 2. For each frequent item b, append it to to form a
sequential pattern , and output ;
Parameters:
: sequential pattern, 3. For each , construct -projected database S|,
l: the length of ; and call PrefixSpan(, l+1, S|).
S|: the -projected database, if <>; otherwise; the
sequence database S.

13 14

id Sequence
10 <a(abc)(ac)d(cf)>

PrefixSpan - Example 20
30
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
PrefixSpan Example (2)
` 40 <eg(af)cbc> `
3. Find subsets of sequential patterns
1. Find length-1 sequential patterns min_support = 2
<a> <b> <c> <d> <e> <f> <g> <d> <a> <b> <c> <d> <e> <(_e)> <f> <(_f)>
4 4 4 3 3 3 1 <(cf)> 1 2 3 0 1 0 1 1
<c(bc)(ae)>
<(_f)cb>
2. Divide search space
Prefix
<db> <dc>

<a> <b> <c> <d> <e> <f> <db> <dc> <b> <c>
<(abc)(ac)d(cf)> <(_c)(ac)d(cf)> <(ac)d(cf)> <(cf)> <(_f)(ab)(df)cb> <(ab)(df)cb> <(_c)> <(bc)> 2 1
<(_d)c(bc)(ae)> <(_c)(ae)> <(bc)(ae)> <c(bc)(ae)> <(af)cbc> <cbc> <b>
<(_b)(df)cb> <(df)cb> <b> <(_f)cb>
<(_f)cbc> <c> <bc>
<dcb>
<dcb>
<>
15 16
id Sequence
10 <a(abc)(ac)d(cf)>

PrefixSpan - characteristics Bi-level Projection 20


30
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
40 <eg(af)cbc>
` No candidate sequence needs to be generated by `

PrefixSpan min_support = 2
Scan to get 1-length sequences
Projected databases keep shrinking
Construct a triangular matrix instead of projected
The major cost of PrefixSpan is the construction of databases for each length-1 patterns
projected databases
a 2
How to reduce this cost? b (4,2,2) 1 ALL length-2 sequential
c (4,2,1) (3,3,2) 3 pattern
Different projection methods d (2,1,1) (2,2,0) (1,3,0) 0
e (1,2,1) (1,2,0) (1,2,0) (1,1,0) 0
Bi-level projection
f (2,1,1) (2,2,0) (1,2,1) (1,1,1) (2,0,1) 1
reduces the number and the size of projected databases a b c d e f

Pseudo-Projection
Support(<ac>) = 4
Support(<ca>) = 2 Support(<cc>) = 3
reduces the cost of projection when projected database can be
Support(<(ac)>) = 1
held in main memory
17 18

Bi-level projection (2) Bi-level projection (3) - optimization


` `
For each length-2 sequential pattern , construct Do we need to include every item in a postfix in
the -projected database and find the frequent the projected databases?
items
NO! Item pruning in projected database by 3-way
Construct corresponding S-matrix Apriori checking
<ab> a b c (_c) d (_d) e (_e) f (_f)
<(_c)(ac)(cf)> 2 0 2 2 0 1 0 0 1 0 Any super-sequence of c can be excluded from construction of
<ac> is not frequent
<(_c)a> it can never be a sequential <ab> - projected database
<c> pattern
<aba> <abc> <a(bc)>

a 0 <a(bd)> is not frequent To construct <a(bc)>-projected database,


sequence <a(bcde)df> should be projected to <(_e)df>
c (1,0,1) 1 instead of <(_de)df>
(_c) (,2, ) (,1, )
a c (_c)
<a(bc)a>
19 20
Pseudo-Projection Experimental Results
` `
Observation: postfixes of a sequence often Environment: 233MHz Pentium PC, 128 MB RAM,
appear repeatedly in recursive projected databases Windows NT, Visual C++ 6.0
Method: instead of constructing physical Reported test on synthetic data set: C10T8S8I8:
projection by collecting all the postfixes, we can 1000 items
use pointers referring to the sequences in the 10000 sequences
database as a pseudo-projection Average number of items within elements: 8
Every projection consists of two pieces of Average number of elements in a sequence: 8
information: pointer to the sequence in database Competitors:
and offset to the postfix in the sequence GSP
FreeSpan
s1=<a(abc)(ac)d(cf)> Pointer Offset Postfix PrefixSpan-1 (level-by-level projection)
s1 2 <(abc)(ac)d(cf)> PrefixSpan-2 (bi-level projection)
s1 5 <(ac)d(cf)>
s1 6 <(_c)d(cf)>
21 22

Runtime vs. support threshold I/O costs vs. threshold and scalability
` `

23 24
Outline Conclusions
` `
Mining Sequential Patterns
Problem statement PrefixSpan
Definitions & examples Efficient pattern growth method
Strategies Outperforms both GSP and FreeSpan
PrefixSpan algorithm Explores prefix-projection in sequential pattern mining
Motivation Mines the complete set of patterns but reduces the effort
Definitions & examples of candidate subsequence generation
Algorithm Prefix-projection reduces the size of projected database
Example and leads to efficient processing
Performance study Bi-level projection and pseudo-projection may improve
mining efficiency
Conclusions

25 26

References
` `
Pei J., Han J., Mortazavi-Asl J., Pinto H., Chen Q., Dayal U., Hsu M.,
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected
Pattern Growth, 17th International Conference on Data Engineering
(ICDE), April 2001
Agrawal R., Srikant R., Mining sequential patterns, Proceedings 1995
Int. Conf. Very Large Data Bases (VLDB94), pp. 487-499, 1995 THANK YOU !!!
Han J., Dong G., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M.-C.,
Freespan: Frequent pattern-projected sequential pattern mining,
Proceedings 2000 Int. Conf. Knowledge Discovery and Data Mining
(KDD00), pp. 355-359, 2000
Srikant R., Agrawal R., Mining sequential pattern: Generalizations

and performance improvements, Proceedings 5th Int. /conf. Any Questions?


Extending Database Technology (EDBT96), pp. 3-17, 1996
Zhao Q., Bhowmick S. S., Sequential Pattern Mining: A Survey.
Technical Report Center for Advanced Information Systems, School
of Computer Engineering, Nanyang Technological University,
Singapore, 2003

27 28

You might also like