Mining Data Streams
(Part 2)
Today’s Lecture
More algorithms for streams:
(1) Filtering a data stream: Bloom filters
Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
Number of distinct elements in the last k elements
of the stream
(3) Estimating moments: AMS method
Estimate std. dev. of last k elements
(4) Counting frequent items
2
(1) Filtering Data Streams
Filtering Data Streams
Each element of data stream is a tuple
Given a list of keys S
Determine which tuples of stream are in S
Obvious solution: Hash table
But suppose we do not have enough memory to
store all of S in a hash table
E.g., we might be processing millions of filters
on the same stream
4
Applications
Example: Email spam filtering
We know 1 billion “good” email addresses
If an email comes from one of these, it is NOT
spam
Publish-subscribe systems
You are collecting lots of messages (news articles)
People express interest in certain sets of keywords
Determine whether each message matches user’s
interest
5
First Cut Solution (1)
Given a set of keys S that we want to filter
Create a bit array B of n bits, initially all 0s
Choose a hash function h with range [0,n)
Hash each member s ∈ S to one of
n buckets, and set that bit to 1, i.e., B[h(s)] = 1
Hash each element a of the stream and
output only those that hash to a bit that was set
to 1
Output a if B[h(a)] == 1
6
First Cut Solution (2)
[Figure: an item enters the filter and is hashed by h into the bit array B (e.g., 0010001011000).
If the item hashes to a bucket that at least one of the items in S hashed to, output it, since it may be in S.
If it hashes to a bucket set to 0, drop it; it is surely not in S.]
Creates false positives but no false negatives
If the item is in S we surely output it; if not, we may
still output it
7
First Cut Solution (3)
|S| = 1 billion email addresses
|B|= 1GB = 8 billion bits
If the email address is in
S, then it surely hashes to a
bucket that has the bit set to 1,
so it always gets through (no false negatives)
Approximately 1/8 of the bits are set to 1, so about
1/8th of the addresses not in S get through to the
output (false positives)
Actually, less than 1/8th, because more than one address
might hash to the same bit
8
Analysis: Throwing Darts (1)
More accurate analysis for the number of
false positives
Consider: If we throw m darts into n equally
likely targets, what is the probability that
a target gets at least one dart?
In our case:
Targets = bits/buckets
Darts = hash values of items
9
Analysis: Throwing Darts (2)
We have m darts, n targets
What is the probability that a target gets at
least one dart?
Probability that a given target X is not hit by any of the m darts: (1 − 1/n)^m = ((1 − 1/n)^n)^(m/n)
As n → ∞, (1 − 1/n)^n → 1/e
So the probability that at least one dart hits target X is
1 − (1 − 1/n)^(n·(m/n)) ≈ 1 − e^(−m/n)
10
Analysis: Throwing Darts (3)
Fraction of 1s in the array B = probability of false positive = 1 − e^(−m/n)
Example: 10^9 darts, 8·10^9 targets
Fraction of 1s in B = 1 − e^(−1/8) = 0.1175
Compare with our earlier estimate: 1/8 = 0.125
11
[Slides 12–21 (figures not reproduced): Filtering Stream Content; Rule of the Bloom Filter; How a Bloom Filter Works; Example: Bloom Filter; Bloom Filter Lookup; Example: Lookup; Performance of Bloom Filter; Throwing Darts; Example: Throwing Darts.]
Bloom Filter
Consider: |S| = m, |B| = n
Use k independent hash functions h1, ..., hk
Initialization:
Set B to all 0s
Hash each element s ∈ S using each hash function hi,
set B[hi(s)] = 1 (for each i = 1, ..., k) (note: we have a
single array B!)
Run-time:
When a stream element with key x arrives
If B[hi(x)] = 1 for all i = 1,..., k then declare that x is in S
That is, x hashes to a bucket set to 1 for every hash function hi(x)
Otherwise discard the element x
22
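A minimal Python sketch of this scheme (illustrative only; deriving the k hash functions by salting SHA-1 is an assumption, not part of the lecture):

```python
import hashlib

class BloomFilter:
    """Single bit array B of n bits with k hash functions."""
    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = bytearray((n_bits + 7) // 8)   # B, initially all 0s

    def _positions(self, key):
        # Derive k hash values by salting SHA-1 with i (illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):                  # initialization: B[h_i(s)] = 1 for all i
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):         # run-time: declare "in S" iff all k bits are 1
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(key))

bf = BloomFilter(n_bits=8_000_000, k=6)
bf.add("good@example.com")
print("good@example.com" in bf)   # True: no false negatives
print("spam@example.com" in bf)   # usually False; occasionally a false positive
```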
Bloom Filter -- Analysis
What fraction of the bit vector B are 1s?
Throwing k·m darts at n targets
So the fraction of 1s is (1 − e^(−km/n))
But we have k independent hash functions
and we only let the element x through if all k
hash element x to a bucket of value 1
So, false positive probability = (1 − e^(−km/n))^k
23
Bloom Filter – Analysis (2)
m = 1 billion, n = 8 billion
k = 1: (1 − e^(−1/8)) = 0.1175
k = 2: (1 − e^(−1/4))^2 ≈ 0.0489
What happens as we keep increasing k?
[Figure: false positive probability vs. number of hash functions k, for k = 2, 4, ..., 20.]
“Optimal” value of k: (n/m)·ln(2)
In our case: Optimal k = 8·ln(2) = 5.54 ≈ 6
Error at k = 6: (1 − e^(−6/8))^6 ≈ 0.0216
24
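A quick numeric check of the false-positive formula (a small illustrative snippet, using the m and n from this slide):

```python
import math

m, n = 10**9, 8 * 10**9            # 1 billion keys, 8 billion bits
for k in range(1, 13):
    fp = (1 - math.exp(-k * m / n)) ** k
    print(k, round(fp, 4))
# The false positive rate is minimized near k = (n/m)*ln(2) ≈ 5.54;
# at k = 6 it is about 0.0216.
```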
Bloom Filter: Wrap-up
Bloom filters guarantee no false negatives,
and use limited memory
Great for pre-processing before more
expensive checks
Suitable for hardware implementation
Hash function computations can be parallelized
Is it better to have 1 big B or k small Bs?
It is the same: (1 − e^(−km/n))^k vs. (1 − e^(−m/(n/k)))^k
But keeping 1 big B is simpler
25
(2) Counting Distinct Elements
Counting Distinct Elements
Problem:
Data stream consists of a universe of elements
chosen from a set of size N
Maintain a count of the number of distinct
elements seen so far
Obvious approach:
Maintain the set of elements seen so far
That is, keep a hash table of all the distinct
elements seen so far
27
Applications
How many different words are found among
the Web pages being crawled at a site?
Unusually low or high numbers could indicate
artificial pages (spam?)
How many different Web pages does each
customer request in a week?
How many distinct products have we sold in
the last week?
28
Using Small Storage
Real problem: What if we do not have space
to maintain the set of elements seen so far?
Estimate the count in an unbiased way
Accept that the count may have a little error,
but limit the probability that the error is large
29
Flajolet-Martin Approach
Pick a hash function h that maps each of the N
elements to at least log2 N bits
For each stream element a, let r(a) be the
number of trailing 0s in h(a)
r(a) = position of first 1 counting from the right
E.g., say h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
Record R = the maximum r(a) seen
R = maxa r(a), over all the items a seen so far
Estimated number of distinct elements = 2^R
30
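A minimal sketch of Flajolet-Martin in Python (the MD5-based hash is an illustrative assumption; any hash to at least log2 N bits works):

```python
import hashlib

def r(x):
    """Number of trailing 0s in the binary representation of x."""
    count = 0
    while x > 0 and x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Estimate the number of distinct elements as 2^R, R = max r(h(a))."""
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16)
        R = max(R, r(h))
    return 2 ** R

print(fm_estimate(range(1000)))   # one estimate; in practice combine many (slide 35)
```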
Why It Works: Intuition
Very very rough and heuristic intuition why
Flajolet-Martin works:
h(a) hashes a with equal prob. to any of N values
Then h(a) is a sequence of log2 N bits,
where a 2^(−r) fraction of all a's have a tail of r zeros
About 50% of a's hash to ***0
About 25% of a's hash to **00
So, if we saw the longest tail of r=2 (i.e., item hash
ending *100) then we have probably seen
about 4 distinct items so far
So, it takes hashing about 2^r items before we
see one with a zero-suffix of length r
31
Why It Works: More formally
Now we show why Flajolet-Martin works
Formally, we will show that probability of
finding a tail of r zeros:
Goes to 1 if m >> 2^r
Goes to 0 if m << 2^r
where m is the number of distinct elements
seen so far in the stream
Thus, 2^R will almost always be around m!
32
Why It Works: More formally
The probability that a given h(a) ends
in at least r zeros is 2^(−r)
h(a) hashes elements uniformly at random
Probability that a random number ends in
at least r zeros is 2^(−r)
Then, the probability of NOT seeing a tail
of length r among m elements is (1 − 2^(−r))^m
(each element ends in fewer than r zeros with prob. 1 − 2^(−r), so all m do with prob. (1 − 2^(−r))^m)
33
Why It Works: More formally
Note: (1 − 2^(−r))^m = ((1 − 2^(−r))^(2^r))^(m·2^(−r)) ≈ e^(−m/2^r)
So the prob. of NOT finding a tail of length r is ≈ e^(−m/2^r)
If m << 2^r, then this prob. tends to 1
(e^(−m/2^r) → 1 as m/2^r → 0)
So, the probability of finding a tail of length r tends to 0
If m >> 2^r, then this prob. tends to 0
(e^(−m/2^r) → 0 as m/2^r → ∞)
So, the probability of finding a tail of length r tends to 1
Thus, 2^R will almost always be around m!
34
Why It Doesn’t Work
E[2^R] is actually infinite
Probability halves when R → R+1, but the value doubles
Workaround involves using many hash
functions hi and getting many samples of Ri
How are samples Ri combined?
Average? What if we get one very large value?
Median? All estimates are a power of 2
Solution:
Partition your samples into small groups
Take the median of groups
Then take the average of the medians
35
(3) Computing Moments
Generalization: Moments
Suppose a stream has elements chosen
from a set A of N values
Let m_i be the number of times value i occurs
in the stream
The kth moment is Σ_{i∈A} (m_i)^k
37
Special Cases
0th moment = number of distinct elements
The problem just considered
1st moment = count of the numbers of
elements = length of the stream
Easy to compute
2nd moment = surprise number S = Σ_i (m_i)^2,
a measure of how uneven the distribution is
38
Example: Surprise Number
Stream of length 100
11 distinct values
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
Surprise S = 910
Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Surprise S = 8,110
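Worked out with S = Σ_i (m_i)^2, the two surprise numbers above are:
S = 10^2 + 10·9^2 = 100 + 810 = 910
S = 90^2 + 10·1^2 = 8100 + 10 = 8110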
39
[Alon, Matias, and Szegedy]
AMS Method
AMS method works for all moments
Gives an unbiased estimate
We will just concentrate on the 2nd moment S
We pick and keep track of many variables X:
For each variable X we store X.el and X.val
X.el corresponds to the item i
X.val corresponds to the count of item i
Note this requires a count in main memory,
so number of Xs is limited
Our goal is to compute S = Σ_i (m_i)^2
40
One Random Variable (X)
How to set X.val and X.el?
Assume stream has length n (we relax this later)
Pick some random time t (t<n) to start,
so that any time is equally likely
Suppose at time t the stream has item i; we set X.el = i
Then we maintain a count c (X.val = c) of the number
of i's in the stream starting from the chosen time t
Then the estimate of the 2nd moment (S = Σ_i (m_i)^2) is: S = f(X) = n·(2·c − 1)
Note, we will keep track of multiple Xs, (X1, X2, ..., Xk)
and our final estimate will be S = (1/k) Σ_j f(Xj) = (1/k) Σ_j n·(2·cj − 1)
41
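A minimal Python sketch of this one-variable estimator (illustrative; it assumes the whole stream is available as a list so that its length n is known):

```python
import random

def ams_second_moment(stream, k):
    """AMS estimate of S = sum_i m_i^2, averaged over k random variables X."""
    stream = list(stream)
    n = len(stream)
    estimates = []
    for _ in range(k):
        t = random.randrange(n)                          # random start time t
        x_el = stream[t]                                 # X.el = item at time t
        x_val = sum(1 for a in stream[t:] if a == x_el)  # X.val = c
        estimates.append(n * (2 * x_val - 1))            # f(X) = n(2c - 1)
    return sum(estimates) / k                            # average over X_1..X_k

stream = list("aabbbaba")
print(ams_second_moment(stream, k=2000))   # true S = 4^2 + 4^2 = 32
```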
Expectation Analysis
Count of a's so far: 1 2 3 ... m_a
Stream: a a b b b a b a
2nd moment is S = Σ_i (m_i)^2
c_t ... number of times the item at time t appears from time t onwards (c_1 = m_a, c_2 = m_a − 1, c_3 = m_b)
m_i ... total count of item i in the stream (we are assuming the stream has length n)
E[f(X)] = (1/n) Σ_{t=1}^{n} n·(2·c_t − 1)
Group times t by the value seen: the time when the last i is seen has c_t = 1, the time when the penultimate i is seen has c_t = 2, ..., the time when the first i is seen has c_t = m_i
So, E[f(X)] = (1/n) Σ_i n·(1 + 3 + 5 + ... + (2·m_i − 1))
42
Expectation Analysis
Count of a's so far: 1 2 3 ... m_a
Stream: a a b b b a b a
Little side calculation: 1 + 3 + 5 + ... + (2·m_i − 1) = Σ_{j=1}^{m_i} (2·j − 1) = (m_i)^2
Then E[f(X)] = (1/n) Σ_i n·(1 + 3 + 5 + ... + (2·m_i − 1)) = (1/n) Σ_i n·(m_i)^2
So, E[f(X)] = Σ_i (m_i)^2 = S
We have the second moment (in expectation)!
43
Higher-Order Moments
For estimating the kth moment we essentially use the
same algorithm but change the estimate:
For k=2 we used n·(2·c − 1)
For k=3 we use: n·(3·c^2 − 3·c + 1) (where c = X.val)
Why?
For k=2: Remember we had Σ_t (2·c_t − 1) and we showed the terms 2c − 1
(for c = 1, ..., m) sum to m^2
So: the terms telescope, since 2c − 1 = c^2 − (c−1)^2
For k=3: c^3 − (c−1)^3 = 3c^2 − 3c + 1
Generally: Estimate = n·(c^k − (c−1)^k)
44
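A quick sanity check of these telescoping identities (a small illustrative snippet):

```python
# Sum of c^k - (c-1)^k for c = 1..m telescopes to m^k.
m = 7
assert sum(2 * c - 1 for c in range(1, m + 1)) == m ** 2              # k = 2
assert sum(3 * c**2 - 3 * c + 1 for c in range(1, m + 1)) == m ** 3   # k = 3
```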
Combining Samples
In practice:
Compute f(X) = n·(2·c − 1) for
as many variables X as you can fit in memory
Average them in groups
Take median of averages
Problem: Streams never end
We assumed there was a number n,
the number of positions in the stream
But real streams go on forever, so n is
a variable – the number of inputs seen so far
45
Streams Never End: Fixups
(1) The variables X have n as a factor –
keep n separately; just hold the count in X
(2) Suppose we can only store k counts.
We must throw some Xs out as time goes on:
Objective: Each starting time t is selected with
probability k/n
Solution: (fixed-size sampling!)
Choose the first k times for k variables
When the nth element arrives (n > k), choose it with
probability k/n
If you choose it, throw one of the previously stored
variables X out, with equal probability
46
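A minimal sketch of this fixed-size (reservoir-style) sampling of starting times (illustrative; it keeps only the chosen positions, not the full X variables):

```python
import random

def sample_start_times(n_total, k):
    """Keep k starting times so that each time t is kept with probability k/n."""
    kept = []
    for n in range(1, n_total + 1):          # the n-th element arrives
        t = n - 1
        if n <= k:
            kept.append(t)                   # the first k times are always chosen
        elif random.random() < k / n:        # choose the n-th with probability k/n
            kept[random.randrange(k)] = t    # evict one stored variable uniformly
    return kept

print(sample_start_times(10_000, k=5))
```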
Counting Itemsets
Counting Itemsets
New Problem: Given a stream, which items
appear more than s times in the window?
Possible solution: Think of the stream of
baskets as one binary stream per item
1 = item present; 0 = not present
Use DGIM to estimate counts of 1s for all items
[Figure: the binary stream of length N for one item, with DGIM buckets of exponentially growing sizes maintained over the window.]
48
Extensions
In principle, you could count frequent pairs
or even larger sets the same way
One stream per itemset
Drawbacks:
Only approximate
Number of itemsets is way too big
49
Exponentially Decaying Windows
Exponentially decaying windows: A heuristic
for selecting likely frequent item(sets)
What are “currently” most popular movies?
Instead of computing the raw count in last N elements
Compute a smooth aggregation over the whole stream
If the stream is a_1, a_2, ... and we are taking the sum
of the stream, take the answer at time t to be: Σ_{i=1}^{t} a_i (1 − c)^(t−i)
c is a constant, presumably tiny, like 10^(−6) or 10^(−9)
When a new a_{t+1} arrives:
Multiply the current sum by (1 − c) and add a_{t+1}
50
Example: Counting Items
If each a_i is an “item” we can compute the
characteristic function of each possible
item x as an Exponentially Decaying Window
That is: Σ_{i=1}^{t} δ_i (1 − c)^(t−i)
where δ_i = 1 if a_i = x, and 0 otherwise
Imagine that for each item x we have a binary
stream (1 if x appears, 0 if x does not appear)
New item x arrives:
Multiply all counts by (1-c)
Add +1 to count for element x
Call this sum the “weight” of item x
51
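A minimal Python sketch of these per-item decaying weights (illustrative; it also applies the drop-below-1/2 rule from the later slides, so at most about 2/c counts survive):

```python
def decayed_item_weights(stream, c=1e-6, threshold=0.5):
    """Maintain w[x] = sum_i delta_i * (1-c)^(t-i) for every item x still kept."""
    w = {}
    for x in stream:
        for key in list(w):
            w[key] *= (1 - c)          # multiply all counts by (1 - c)
            if w[key] < threshold:
                del w[key]             # drop tiny weights
        w[x] = w.get(x, 0.0) + 1.0     # add 1 to the arriving item's weight
    return w

print(decayed_item_weights(["a", "b", "a", "a", "c"], c=0.01))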
Sliding Versus Decaying Windows
[Figure: exponentially decaying weights compared with a sliding window of length 1/c.]
Important property: Sum over all weights is
1/[1 – (1 – c)] = 1/c
52
Example: Counting Items
What are “currently” most popular movies?
Suppose we want to find movies of weight > ½
Important property: Sum over all weights is 1/[1 –
(1 – c)] = 1/c
Thus:
There cannot be more than 2/c movies with
weight of ½ or more
So, 2/c is a limit on the number of
movies being counted at any time
53
Extension to Itemsets
Count (some) itemsets in an E.D.W.
What are currently “hot” itemsets?
Problem: Too many itemsets to keep counts of
all of them in memory
When a basket B comes in:
Multiply all counts by (1-c)
For uncounted items in B, create new count
Add 1 to count of any item in B and to any itemset
contained in B that is already being counted
Drop counts < ½
Initiate new counts (next slide)
54
Initiation of New Counts
Start a count for an itemset S ⊆ B if every
proper subset of S had a count prior to arrival
of basket B
Intuitively: If all subsets of S are being counted this
means they are “frequent/hot” and thus S has a
potential to be “hot”
Example:
Start counting S={i, j} iff both i and j were counted
prior to seeing B
Start counting S={i, j, k} iff {i, j}, {i, k}, and {j, k}
were all counted prior to seeing B
55
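A rough sketch of one basket update under these rules (illustrative; itemsets are stored as sorted tuples, and the candidate loop enumerates all subsets of the basket, which is only reasonable for small baskets):

```python
from itertools import combinations

def edw_basket_update(counts, basket, c=1e-6):
    """Decay, bump, drop, then initiate counts for sets whose proper subsets were counted."""
    for s in list(counts):                    # multiply all counts by (1 - c)
        counts[s] *= (1 - c)
        if counts[s] < 0.5:
            del counts[s]                     # drop counts < 1/2
    prior = set(counts)                       # sets counted before this basket arrived
    items = sorted(set(basket))
    for s in counts:                          # add 1 to counted sets contained in B
        if set(s) <= set(items):
            counts[s] += 1
    for size in range(1, len(items) + 1):     # initiate new counts
        for s in combinations(items, size):
            if s in counts:
                continue
            if size == 1 or all(sub in prior
                                for r in range(1, size)
                                for sub in combinations(s, r)):
                counts[s] = 1.0
    return counts

counts = {}
for basket in [["i", "j"], ["i", "j", "k"], ["i", "j", "k"]]:
    edw_basket_update(counts, basket, c=0.01)
print(counts)   # {i}, {j}, {k}, then {i,j}, and only later {i,k}, {j,k}
```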
How many counts do we need?
Counts for single items < (2/c)∙(avg. number
of items in a basket)
Counts for larger itemsets = ??
But we are conservative about starting
counts of large sets
If we counted every set we saw, one basket
of 20 items would initiate 1M counts
56