FM Algorithm

The Flajolet-Martin algorithm approximates the number of distinct elements in a data stream using a single pass and logarithmic space. It uses a bit vector to record the trailing-zero counts of hashed values, and estimates the number of unique elements from the index of the first zero in that bit vector. The algorithm runs in O(n) time, and the standard deviation of the first-zero index (about 1.12) indicates how far the estimate is likely to be off.

Uploaded by

deepa

The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one pass. If the stream contains n elements with m of them unique, this algorithm runs in O(n) time and needs O(log(m)) memory.

The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space consumption that is logarithmic in the maximum number of possible distinct elements in the stream.
Algorithm:
1. Create a bit vector (bit array) of sufficient length L, such that 2^L > n, where n is the number of elements in the stream. Usually a 64-bit vector is sufficient, since 2^64 is quite large for most purposes.
2. The i-th bit in this vector represents whether we have seen a hash value whose binary representation ends in 0^i (exactly i trailing zeros). Initialize each bit to 0.
3. Generate a good, random hash function that maps input (usually strings) to natural numbers.
4. Read the input. For each word, hash it and count the trailing zeros of the hash value. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
5. Once the input is exhausted, get the index of the first 0 in the bit array (call this R). With bits indexed from 0, R is just the number of consecutive 1s at the start of the array (i.e. we have seen hash values with 0, 1, ..., R-1 trailing zeros).
6. Calculate the number of unique words as 2^R / ϕ, where ϕ = 0.77351. A proof for this can be found in the original paper listed in the reference section.
7. The standard deviation of R is a constant: σ(R) = 1.12. (In other words, by the empirical rule of statistics, R is off by more than about 1 for 1 - 0.68 = 32% of the observations, by more than about 2 for 1 - 0.95 = 5% of the observations, and by more than about 3 for 1 - 0.997 = 0.3% of the observations.) This implies that our count can be off by a factor of 2 for 32% of the observations, by a factor of 4 for 5% of the observations, by a factor of 8 for 0.3% of the observations, and so on.
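The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the names `trailing_zeros` and `fm_estimate` are hypothetical, the choice of SHA-256 as the "good, random hash function" is an assumption, and bits are indexed from 0.

```python
import hashlib

PHI = 0.77351  # correction constant quoted in step 6


def trailing_zeros(value: int, width: int) -> int:
    """Number of trailing zero bits of value; width if value is 0."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def fm_estimate(stream, width: int = 64) -> float:
    """Estimate the number of distinct items in stream (steps 1 to 6)."""
    bitmap = 0  # steps 1 and 2: a width-bit vector, all bits 0
    mask = (1 << width) - 1
    for item in stream:
        # Step 3 (assumption): SHA-256 truncated to width bits.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        # Step 4: set the k-th bit, k = number of trailing zeros.
        k = trailing_zeros(h, width)
        if k < width:
            bitmap |= 1 << k
    # Step 5: R is the index of the first 0 bit.
    r = 0
    while r < width and (bitmap >> r) & 1:
        r += 1
    # Step 6: scale by the correction constant.
    return (2 ** r) / PHI


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog the end".split()
    print(fm_estimate(words))
```

Because setting a bit twice has no effect, repeated items never change the result, which is why a single pass with a fixed-size bit vector is enough.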
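Example:
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume |b| = 5 (hash values are written as 5-bit strings)

x    6x+1    h(x)    Binary    r(a)
1    7       2       00010     1
3    19      4       00100     2
2    13      3       00011     0
1    7       2       00010     1
2    13      3       00011     0
3    19      4       00100     2
4    25      0       00000     5
3    19      4       00100     2
1    7       2       00010     1
2    13      3       00011     0
3    19      4       00100     2
1    7       2       00010     1

Here r(a) is the number of trailing zeros of the binary value; the all-zero value 00000 is counted as 5.
R = max(r(a)) = 5
So the estimated number of distinct elements is N = 2^R = 2^5 = 32.
The stream actually contains 4 distinct elements, so this single-hash estimate is too high.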
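The worked example above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the table: the helper name `tail_length` is hypothetical, and it follows the table's convention that a hash value of 0 has a tail length equal to the bit-string length.

```python
def tail_length(value: int, width: int) -> int:
    """Trailing zeros of value in binary; width if value is 0."""
    if value == 0:
        return width
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def h(x: int) -> int:
    """Hash function from the example: (6x + 1) mod 5."""
    return (6 * x + 1) % 5


WIDTH = 5  # |b| = 5 in the example
stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

tails = [tail_length(h(x), WIDTH) for x in stream]
R = max(tails)
estimate = 2 ** R

print(tails)     # [1, 2, 0, 1, 0, 2, 5, 2, 1, 0, 2, 1]
print(R)         # 5
print(estimate)  # 32
```

The single value h(4) = 0 drives the whole estimate here, which illustrates how sensitive a one-hash estimate is to the choice of hash function.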

• We may want to know how many different elements have appeared in the stream.

• For example, we may wish to know how many distinct users have visited a website so far, or in the last 2 hours.

• If distinct-element counts are needed for many streams at once, keeping all the data in main memory becomes a challenge.

• The FM algorithm gives an efficient way to count the distinct elements in a stream.

• It is possible to estimate the number of distinct elements by hashing the elements of the universal set to a bit string that is sufficiently long.

• The bit string must be long enough that there are more possible results of the hash function than there are elements in the universal set.

• Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0s, possibly none.

• Call this number the tail length of h(a).

• Let R be the maximum tail length of any a seen so far in the stream.

• Then we use 2^R as the estimate of the number of distinct elements seen in the stream.
• Consider the stream:

S = {1, 2, 1, 3}

Let the hash function be h(x) = (2x + 2) mod 4

• When we apply the hash function, we get the following remainders in binary:

000, 010, 000, 000 considering bit string length as 3.

• The maximum tail length R will be 3.

• The estimated number of distinct elements will be 2^R = 2^3 = 8.

• Here the estimate may be too high or too low, depending on the hash function.

• We may apply multiple hash functions and combine the estimates to get a more accurate value.
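One way to combine estimates from several hash functions is sketched below. The text does not say how the estimates should be combined, so the rule used here (average within small groups, then take the median of the group averages) is an assumption, as are the names `salted_hash`, `estimate_one`, and `combined_estimate` and the use of salted SHA-256 to obtain different hash functions.

```python
import hashlib
from statistics import mean, median

WIDTH = 32  # bit-string length used for every hash value


def tail_length(value: int, width: int) -> int:
    """Trailing zeros of value in binary; width if value is 0."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def salted_hash(item, salt: int) -> int:
    """A family of hash functions: one per salt (an assumption)."""
    data = f"{salt}:{item}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:WIDTH // 8], "big")


def estimate_one(items, salt: int) -> int:
    """2^R, where R is the maximum tail length under one hash."""
    r = max((tail_length(salted_hash(x, salt), WIDTH) for x in items),
            default=0)
    return 2 ** r


def combined_estimate(stream, groups: int = 5, per_group: int = 4):
    """Median of group means over groups * per_group hash functions."""
    items = list(stream)
    group_means = []
    for g in range(groups):
        salts = range(g * per_group, (g + 1) * per_group)
        group_means.append(mean(estimate_one(items, s) for s in salts))
    return median(group_means)


if __name__ == "__main__":
    print(combined_estimate([1, 2, 1, 3]))
```

Averaging alone is pulled upward by an occasional very large 2^R, while a median alone can only return a power of two; grouping is one common compromise between the two.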
