Flajolet-Martin algorithm approximates the number of unique objects in a stream or a
database in one pass. If the stream contains n elements with m of them unique, this
algorithm runs in O(n) time and needs O(log(m))) memory.
The Flajolet–Martin
algorithm is an
algorithm for
approximating the
number of distinct
elements in a stream
with a single pass and
space-consumption
which is logarithmic in
the
maximum number of
possible distinct
elements in the stream.
1. Create a bit vector
(bit array) of sufficient
length L, such that
2L>n, the number of
elements
in the stream. Usually a
64-bit vector is sufficient
since 264 is quite large
for most purposes.
2. The i-th bit in this
vector/array represents
whether we have seen
a hash function value
whose binary
representation ends in
0i. So initialize each bit
to 0.
3. Generate a good,
random hash function
that maps input (usually
strings) to natural
numbers.
4. Read input. For each
word, hash it and
determine the number
of trailing zeros. If the
number
of trailing zeros is k, set
the k-th bit in the bit
vector to 1.
5. Once input is
exhausted, get the
index of the first 0 in the
bit array (call this R). By
the way,
this is just the number
of consecutive 1s (i.e.
we have seen 0, 00, ...,
as the output
of the hash function)
plus one.
6. Calculate the
number of unique words
as 2R/ϕ, where ϕ is
0.77351. A proof for this
can be
found in the original
paper listed in the
reference section.
7. The standard
deviation of R is a
constant: σ(R)=1.12. (In
other words, R can be
off by about
1 for 1-0.68=32% of the
observations, off by 2
for about 1-0.95=5% of
the observations,
off by 3 for 1-
0.997=0.3% of the
observations using the
Empirical rule of
statistics). This
implies that our count
can be off by a factor of
2 for 32% of the
observations, off by a
factory of 4 for 5% of
the observations, off by
a factor of 8 for 0.3% of
the observations
and so on.
The Flajolet–Martin
algorithm is an
algorithm for
approximating the
number of distinct
elements in a stream
with a single pass and
space-consumption
which is logarithmic in
the
maximum number of
possible distinct
elements in the stream.
1. Create a bit vector
(bit array) of sufficient
length L, such that
2L>n, the number of
elements
in the stream. Usually a
64-bit vector is sufficient
since 264 is quite large
for most purposes.
2. The i-th bit in this
vector/array represents
whether we have seen
a hash function value
whose binary
representation ends in
0i. So initialize each bit
to 0.
3. Generate a good,
random hash function
that maps input (usually
strings) to natural
numbers.
4. Read input. For each
word, hash it and
determine the number
of trailing zeros. If the
number
of trailing zeros is k, set
the k-th bit in the bit
vector to 1.
5. Once input is
exhausted, get the
index of the first 0 in the
bit array (call this R). By
the way,
this is just the number
of consecutive 1s (i.e.
we have seen 0, 00, ...,
as the output
of the hash function)
plus one.
6. Calculate the
number of unique words
as 2R/ϕ, where ϕ is
0.77351. A proof for this
can be
found in the original
paper listed in the
reference section.
7. The standard
deviation of R is a
constant: σ(R)=1.12. (In
other words, R can be
off by about
1 for 1-0.68=32% of the
observations, off by 2
for about 1-0.95=5% of
the observations,
off by 3 for 1-
0.997=0.3% of the
observations using the
Empirical rule of
statistics). This
implies that our count
can be off by a factor of
2 for 32% of the
observations, off by a
factory of 4 for 5% of
the observations, off by
a factor of 8 for 0.3% of
the observations
and so on.
The Flajolet–Martin
algorithm is an
algorithm for
approximating the
number of distinct
elements in a stream
with a single pass and
space-consumption
which is logarithmic in
the
maximum number of
possible distinct
elements in the stream.
1. Create a bit vector
(bit array) of sufficient
length L, such that
2L>n, the number of
elements
in the stream. Usually a
64-bit vector is sufficient
since 264 is quite large
for most purposes.
2. The i-th bit in this
vector/array represents
whether we have seen
a hash function value
whose binary
representation ends in
0i. So initialize each bit
to 0.
3. Generate a good,
random hash function
that maps input (usually
strings) to natural
numbers.
4. Read input. For each
word, hash it and
determine the number
of trailing zeros. If the
number
of trailing zeros is k, set
the k-th bit in the bit
vector to 1.
5. Once input is
exhausted, get the
index of the first 0 in the
bit array (call this R). By
the way,
this is just the number
of consecutive 1s (i.e.
we have seen 0, 00, ...,
as the output
of the hash function)
plus one.
6. Calculate the
number of unique words
as 2R/ϕ, where ϕ is
0.77351. A proof for this
can be
found in the original
paper listed in the
reference section.
7. The standard
deviation of R is a
constant: σ(R)=1.12. (In
other words, R can be
off by about
1 for 1-0.68=32% of the
observations, off by 2
for about 1-0.95=5% of
the observations,
off by 3 for 1-
0.997=0.3% of the
observations using the
Empirical rule of
statistics). This
implies that our count
can be off by a factor of
2 for 32% of the
observations, off by a
factory of 4 for 5% of
the observations, off by
a factor of 8 for 0.3% of
the observations
and so on.
Algorithm:
1. Create a bit vector (bit array) of sufficient length L, such that 2L>n, the number of
elements in the stream. Usually a 64-bit vector is sufficient since 264 is quite large
for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash function
value whose binary representation ends in 0i. So initialize each bit to 0.
3. Generate a good, random hash function that maps input(usually string) to natural
numbers.
4. Read input. For each word, hash it and find out number of trailing zeros. If the
number of trailing zeros is k, set the kth bit in the bit vector to 1.
5. Once input is exhausted, get the index of the first 0 in the bit array (call this R). By
the way, this is just the number of consecutive 1s (i.e. we have
seen 0,00,...,0R−10,00,...,01 as the output of the hash function) plus one.
6. Calculate the number of unique words as 2R/ϕ2, where ϕ is 0.77351. A proof for
this can be found in the original paper listed in the reference section.
7. The standard deviation of R is a constant: σ(R)=1.12 (In other words, R can be
off by about 1 for 1 - 0.68 = 32% of the observations, off by 2 for about 1 - 0.95 =
5% of the observations, off by 3 for 1 - 0.997 = 0.3% of the observations using the
Empirical rule of statistics). This implies that our count can be off by a factor of 2
for 32% of the observations, off by a factory of 4 for 5% of the observations, off by a
factor of 8 for 0.3% of the observations and so on.
Example:
S=1,3,2,1,2,3,4,3,1,2,3,1
h(x)=(6x+1) mod 5
Assume |b| = 5
x h(x) Rem Binary r(a)
1 7 2 00010 1
3 19 4 00100 2
2 13 3 00011 0
1 7 2 00010 1
2 13 3 00011 0
3 19 4 00100 2
4 25 0 00000 5
3 19 4 00100 2
1 7 2 00010 1
2 13 3 00011 0
3 19 4 00100 2
x h(x) Rem Binary r(a)
1 7 2 00010 1
R = max( r(a) ) = 5
So no. of distinct elements = N=2R=25=32
We may want to know how many different elements have appeared in the stream.
For example, we wish to know how many distinct users visited the website till now
or in last 2 hours.
If no of distinct elements required to process many streams then keeping data in
main memory is challenge.
FM algorithm gives an efficient way to count the distinct elements in a stream.
It is possible to estimate the no. of distinct elements by hashing the elements of
the universal set to a bit string that is sufficiently long.
The length of the bit string must be sufficient that there are more possible results
of the hash function than there are elements in the universal set.
Whenever we apply a hash function h to a stream element a, the bit string h(a) will
end in some number of oS, possibly none.
Call this as tail length for a hash.
Let R be the maximum tail length of any a seen so far in the stream.
Then we shall use estimate 2R for the number of distinct elements seen in the
stream.
Consider a stream as:
S = {1, 2, 1, 3}
Let hash function be 2x + 2 mod 4
When we apply the hash function we get reminder represented in binary as follows:
000, 101, 000 considering bit string length as 3.
Maximum tail length R will be 3.
No of distinct elements will be 2R=23=8
Here the estimates may be too large or too low depending on hash function.
We may apply multiple hash functions and combine the estimate to get near
accurate values.