Streaming Algorithms Complete

The document explains various streaming algorithms, including their purposes, real-life use cases, and pseudocode implementations. Key algorithms discussed include Misra-Gries for frequent item detection, Reservoir Sampling for random sampling, and HyperLogLog for estimating distinct elements. Each algorithm is accompanied by practical applications in fields such as web analytics, network monitoring, and real-time data processing.

Uploaded by

Yoshi Hao

Streaming Algorithms Explained with Use Cases

1. Misra-Gries Algorithm

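Purpose: Identify elements that occur frequently in a data stream (more than n/k times in a stream of n items) using at most k - 1 counters. The result is a candidate set: every such item is in it, but a second pass is needed to discard false candidates.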

Real-Life Use: Detecting most common queries in a search engine, finding trending topics on Twitter.

Pseudocode:

Initialize an empty dictionary C
Set k (at most k - 1 counters are kept; any item occurring more than n/k times is guaranteed to keep a counter)
For each element x in the stream:
    If x in C:
        C[x] += 1
    Else if len(C) < k - 1:
        C[x] = 1
    Else:
        Decrease all counts in C by 1
        Remove entries with count 0
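
The pseudocode above can be sketched in Go as follows. This is a minimal illustration, not a tuned implementation; the function name MisraGries and the use of string items are assumptions made for the example.

```go
package main

// MisraGries returns the surviving counters after one pass over stream.
// It keeps at most k-1 counters, as in the pseudocode above.
func MisraGries(stream []string, k int) map[string]int {
	counters := make(map[string]int)
	for _, x := range stream {
		if _, ok := counters[x]; ok {
			counters[x]++
		} else if len(counters) < k-1 {
			counters[x] = 1
		} else {
			// No free counter: decrement everything and drop zeros.
			// Deleting from a map while ranging over it is safe in Go.
			for key := range counters {
				counters[key]--
				if counters[key] == 0 {
					delete(counters, key)
				}
			}
		}
	}
	return counters
}
```

Calling MisraGries(stream, 10) keeps at most 9 counters. The stored counts are lower bounds on the true frequencies, so a second pass over the stream is needed if exact counts matter.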

2. Lossy Counting

Purpose: Track frequent items approximately with error tolerance.

Real-Life Use: Online ad clickstream analysis, monitoring frequently accessed web pages.

Pseudocode:

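Set epsilon (error parameter), bucket_width = ceil(1 / epsilon), N = 0, empty dict C
For each x:
    N += 1
    current_bucket = ceil(N / bucket_width)
    If x in C: C[x][0] += 1
    Else: C[x] = [1, current_bucket - 1]    (count, maximum possible undercount)
    Every bucket_width items: remove entries if C[key][0] + C[key][1] <= current_bucket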
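
A minimal Go sketch of these steps, assuming string items. The names LossyCounting and entry are hypothetical and chosen only for this example.

```go
package main

import "math"

// entry mirrors the pair [count, delta] from the pseudocode.
type entry struct {
	count int // occurrences seen since the entry was created
	delta int // maximum possible undercount (bucket id at insertion minus 1)
}

// LossyCounting makes one pass over stream and returns the surviving entries.
func LossyCounting(stream []string, epsilon float64) map[string]*entry {
	width := int(math.Ceil(1 / epsilon)) // bucket_width
	counts := make(map[string]*entry)
	n := 0
	for _, x := range stream {
		n++
		bucket := (n + width - 1) / width // current_bucket = ceil(n / width)
		if e, ok := counts[x]; ok {
			e.count++
		} else {
			counts[x] = &entry{count: 1, delta: bucket - 1}
		}
		if n%width == 0 { // bucket boundary: prune infrequent entries
			for key, e := range counts {
				if e.count+e.delta <= bucket {
					delete(counts, key)
				}
			}
		}
	}
	return counts
}
```

Each surviving count undercounts the true frequency by at most epsilon * N, which is why the delta field is kept alongside the count.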

3. Reservoir Sampling

Purpose: Random sampling from a stream of unknown size.

Real-Life Use: Select random user sessions, logs, or tweets from a firehose for analysis.

Pseudocode:

Fill reservoir of size k with first k items

For the i-th item (i > k): with probability k/i, replace a uniformly random reservoir slot with it; otherwise discard it

4. KMV (K Minimum Values)

Purpose: Estimate number of distinct elements (cardinality).

Real-Life Use: Estimate unique website visitors, count distinct IP addresses in logs.

Pseudocode:

Hash elements uniformly to (0,1); maintain the k smallest distinct hash values

Estimate = (k - 1) / (k-th smallest hash value, i.e. the largest of the k values kept)
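
A minimal Go sketch of KMV under stated assumptions: items are strings, and 64-bit FNV-1a from the standard library stands in for a uniform hash (a better-mixing hash would likely give more accurate estimates). The name KMVEstimate is hypothetical.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// KMVEstimate estimates the number of distinct items in stream from the
// k smallest distinct hash values.
func KMVEstimate(stream []string, k int) float64 {
	mins := make([]float64, 0, k+1) // k smallest hash values, ascending
	for _, item := range stream {
		h := fnv.New64a()
		h.Write([]byte(item))
		// Use the top 53 bits to map the hash into (0, 1].
		v := (float64(h.Sum64()>>11) + 1) / (1 << 53)
		pos := sort.SearchFloat64s(mins, v)
		if pos < len(mins) && mins[pos] == v {
			continue // value already tracked (duplicate item)
		}
		if pos >= k {
			continue // larger than all k values kept
		}
		mins = append(mins, 0)
		copy(mins[pos+1:], mins[pos:])
		mins[pos] = v
		if len(mins) > k {
			mins = mins[:k]
		}
	}
	if len(mins) < k {
		return float64(len(mins)) // fewer than k distinct values: exact
	}
	return float64(k-1) / mins[k-1]
}
```

Because only hash values are stored, repeated items never change the result, and memory stays at k values regardless of stream length.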

5. Boyer-Moore Majority Vote

Purpose: Find majority element (>50% frequency).

Real-Life Use: Detect most dominant behavior in logs, top-voted answer in feedback.

Pseudocode:

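candidate = None, count = 0
For each x:
    If count == 0: candidate = x
    If x == candidate: count += 1 else: count -= 1
Make a second pass to confirm that candidate occurs more than n/2 times (the first pass returns a candidate even when no majority exists)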
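
The voting pass, followed by a verification pass, might look like this in Go. The function name MajorityElement and the string item type are assumptions for the example. The second loop matters because the voting pass always ends with some candidate, even when no element exceeds 50%.

```go
package main

// MajorityElement returns the element that makes up more than half of
// stream, or ("", false) if there is no such element.
func MajorityElement(stream []string) (string, bool) {
	candidate, count := "", 0
	for _, x := range stream {
		if count == 0 {
			candidate = x
		}
		if x == candidate {
			count++
		} else {
			count--
		}
	}
	// Verification pass: count the candidate's real occurrences.
	occurrences := 0
	for _, x := range stream {
		if x == candidate {
			occurrences++
		}
	}
	if 2*occurrences > len(stream) {
		return candidate, true
	}
	return "", false
}
```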

6. Space-Saving Algorithm

Purpose: Find most frequent elements using fixed memory.

Real-Life Use: Top-k search terms, most active users on a platform.

Pseudocode:

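Keep at most k (item, count) counters. For each x: if x already has a counter, increment it; else if fewer than k counters are in use, add x with count 1; else replace the entry with the minimum count by x and set x's count to that minimum + 1.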
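
A minimal Go sketch of Space-Saving, assuming string items and k >= 1. The name SpaceSaving is hypothetical, and the linear scan for the minimum is a simplification; practical implementations usually keep the counters in a structure that finds the minimum faster.

```go
package main

// SpaceSaving tracks at most k items and returns their (over)estimated counts.
func SpaceSaving(stream []string, k int) map[string]int {
	counts := make(map[string]int, k)
	for _, x := range stream {
		if _, ok := counts[x]; ok {
			counts[x]++
			continue
		}
		if len(counts) < k {
			counts[x] = 1
			continue
		}
		// Table is full: evict the minimum entry; x inherits its count + 1.
		minKey, minCount := "", -1
		for key, c := range counts {
			if minCount < 0 || c < minCount {
				minKey, minCount = key, c
			}
		}
		delete(counts, minKey)
		counts[x] = minCount + 1
	}
	return counts
}
```

Inheriting the evicted count means stored counts never underestimate the true frequency of a tracked item; they can overestimate it by at most the minimum count at eviction time.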

7. Count Sketch

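Purpose: Approximate frequency counts. Each item carries a random +1/-1 sign per row, so colliding items tend to cancel out and the estimate is unbiased (it can be too high or too low).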

Real-Life Use: Network traffic monitoring, approximate counting in DBMS.

Pseudocode:

Use d rows of counters, each with its own bucket hash and +1/-1 sign hash. To update x, add sign_i(x) to counter[i][hash_i(x)] in every row i.

Estimate = median over the d rows of counter[i][hash_i(x)] * sign_i(x).
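
A small Go sketch of a Count Sketch, for illustration only. The type and method names are hypothetical, and deriving both the bucket and the sign from one seeded FNV-1a hash is an assumption made to keep the example on the standard library; the usual analysis assumes independent hash families.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// CountSketch holds d rows of w signed counters.
type CountSketch struct {
	rows  [][]int64
	width uint32
}

func NewCountSketch(d, w int) *CountSketch {
	rows := make([][]int64, d)
	for i := range rows {
		rows[i] = make([]int64, w)
	}
	return &CountSketch{rows: rows, width: uint32(w)}
}

// locate returns the bucket and the +1/-1 sign of item in the given row.
func (cs *CountSketch) locate(item string, row int) (uint32, int64) {
	h := fnv.New64a()
	h.Write([]byte{byte(row)}) // the row index acts as a seed
	h.Write([]byte(item))
	v := h.Sum64()
	sign := int64(1)
	if v>>63 == 1 {
		sign = -1
	}
	return uint32(v) % cs.width, sign
}

// Add records count occurrences of item.
func (cs *CountSketch) Add(item string, count int64) {
	for r := range cs.rows {
		b, s := cs.locate(item, r)
		cs.rows[r][b] += s * count
	}
}

// Estimate returns the median of the per-row signed counters.
func (cs *CountSketch) Estimate(item string) int64 {
	ests := make([]int64, len(cs.rows))
	for r := range cs.rows {
		b, s := cs.locate(item, r)
		ests[r] = s * cs.rows[r][b]
	}
	sort.Slice(ests, func(i, j int) bool { return ests[i] < ests[j] })
	return ests[len(ests)/2] // upper median when d is even
}
```

Taking the median rather than the mean keeps a few badly colliding rows from dominating the estimate.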

8. Count-Min Sketch

Purpose: Estimate frequencies (overestimate only).

Real-Life Use: Spam detection, streaming logs, cache eviction policies.

Pseudocode:

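To update x, increment one counter in each of the d rows, chosen by that row's hash of x; estimate = min of those d counters (collisions can only inflate a counter, so the minimum is the tightest upper bound)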
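
A minimal Go sketch of a Count-Min Sketch. The type and method names are hypothetical, and seeding FNV-1a with the row index is an assumption made to obtain d hash functions from the standard library.

```go
package main

import "hash/fnv"

// CountMin holds d rows of w counters.
type CountMin struct {
	rows  [][]uint64
	width uint32
}

func NewCountMin(d, w int) *CountMin {
	rows := make([][]uint64, d)
	for i := range rows {
		rows[i] = make([]uint64, w)
	}
	return &CountMin{rows: rows, width: uint32(w)}
}

// bucket returns the counter index of item in the given row.
func (cm *CountMin) bucket(item string, row int) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(row)}) // the row index acts as a seed
	h.Write([]byte(item))
	return h.Sum32() % cm.width
}

// Add records one occurrence of item.
func (cm *CountMin) Add(item string) {
	for r := range cm.rows {
		cm.rows[r][cm.bucket(item, r)]++
	}
}

// Estimate returns an upper bound on the number of times item was added.
func (cm *CountMin) Estimate(item string) uint64 {
	best := cm.rows[0][cm.bucket(item, 0)]
	for r := 1; r < len(cm.rows); r++ {
		if c := cm.rows[r][cm.bucket(item, r)]; c < best {
			best = c
		}
	}
	return best
}
```

A wider sketch (larger w) reduces the overestimate, and more rows (larger d) reduce the chance that every row collides.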

9. Bloom Filter

Purpose: Test membership in a set using little memory. Queries may return false positives but never false negatives.

Real-Life Use: Caches, databases (avoid unnecessary disk lookups), network security.

Pseudocode:

To insert x: hash x to k bit positions and set them to 1

To query x: if all k bits are 1, x is possibly in the set; if any bit is 0, x is definitely not in the set
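
A minimal Go sketch of a Bloom filter with m bits and k hash functions. The names are hypothetical, a []bool is used instead of a packed bit array for readability, and the k hashes are derived by seeding FNV-1a with the hash index, which is an assumption of this example.

```go
package main

import "hash/fnv"

// BloomFilter is a set membership sketch with m bits and k hashes.
type BloomFilter struct {
	bits []bool
	k    int
}

func NewBloomFilter(m, k int) *BloomFilter {
	return &BloomFilter{bits: make([]bool, m), k: k}
}

// position returns the bit index chosen by the i-th hash of item.
func (bf *BloomFilter) position(item string, i int) int {
	h := fnv.New32a()
	h.Write([]byte{byte(i)}) // the hash index acts as a seed
	h.Write([]byte(item))
	return int(h.Sum32() % uint32(len(bf.bits)))
}

// Add inserts item into the filter.
func (bf *BloomFilter) Add(item string) {
	for i := 0; i < bf.k; i++ {
		bf.bits[bf.position(item, i)] = true
	}
}

// Contains reports whether item is possibly in the set.
// A false result is always correct; a true result may be a false positive.
func (bf *BloomFilter) Contains(item string) bool {
	for i := 0; i < bf.k; i++ {
		if !bf.bits[bf.position(item, i)] {
			return false
		}
	}
	return true
}
```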

10. Sliding Window Model

Purpose: Analyze most recent data (e.g., last 1 min/hour).

Real-Life Use: Real-time alerts, CPU/memory usage, fraud detection.

Pseudocode:

Maintain deque of last N elements, slide as new ones arrive
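
As an illustration, the sketch below keeps the last N numeric values and a running sum so the window average is available in constant time. A fixed-size ring buffer plays the role of the deque here; the type and method names are hypothetical.

```go
package main

// SlidingWindow keeps the most recent N values and their sum.
type SlidingWindow struct {
	values []float64 // ring buffer of capacity N
	next   int       // slot that the next value will overwrite
	filled int       // number of valid values (at most N)
	sum    float64
}

func NewSlidingWindow(n int) *SlidingWindow {
	return &SlidingWindow{values: make([]float64, n)}
}

// Add appends v and, once the window is full, evicts the oldest value.
func (w *SlidingWindow) Add(v float64) {
	if w.filled == len(w.values) {
		w.sum -= w.values[w.next] // slide: drop the oldest value
	} else {
		w.filled++
	}
	w.values[w.next] = v
	w.sum += v
	w.next = (w.next + 1) % len(w.values)
}

// Mean returns the average of the values currently in the window.
func (w *SlidingWindow) Mean() float64 {
	if w.filled == 0 {
		return 0
	}
	return w.sum / float64(w.filled)
}
```

This stores all N values, which is fine for small windows; the DGIM and Exponential Histogram sections cover approximations that use far less memory for very large windows.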

11. HyperLogLog (HLL)

**Purpose:** Estimate the number of distinct elements in a stream using limited memory.

**Simple Usage:** Count the number of unique visitors to a website without storing all IPs.

Real-World Use Cases:

- Analytics platforms like Google Analytics, Mixpanel

- Network monitoring tools for unique flows

- Databases (e.g., Redis, Postgres) for cardinality estimation

**Pseudocode:**

1. Use a good hash function to hash each element to a binary string.

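2. Divide hash space into `m` buckets (with `m` a power of two, `m = 2^p`) using the first `p` bits.

3. For each bucket `i`, track `R[i]`: the maximum, over all hashes landing in the bucket, of the number of leading zeros in the remaining bits plus one.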

4. Estimate the cardinality using the formula:

`Estimate = alpha * m^2 / sum(2^-R[i])`

Go Implementation (simplified):

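```go
import (
	"hash/fnv"
	"math"
	"math/bits"
)

// HyperLogLog estimates the number of distinct strings in stream using m
// buckets. m must be a power of two; the alpha formula below is the usual
// approximation for m >= 128. Small- and large-range corrections are omitted.
func HyperLogLog(stream []string, m int) float64 {
	p := uint(bits.TrailingZeros(uint(m))) // m = 2^p
	buckets := make([]int, m)
	for _, item := range stream {
		h := fnv.New32a()
		h.Write([]byte(item))
		hash := h.Sum32()
		idx := int(hash >> (32 - p)) // first p bits select the bucket
		// Rank = leading zeros in the remaining 32-p bits, plus one.
		rank := bits.LeadingZeros32(hash<<p) + 1
		if limit := int(32-p) + 1; rank > limit {
			rank = limit // the remaining bits were all zero
		}
		if rank > buckets[idx] {
			buckets[idx] = rank
		}
	}
	sum := 0.0
	for _, r := range buckets {
		sum += math.Pow(2, -float64(r))
	}
	alpha := 0.7213 / (1 + 1.079/float64(m))
	return alpha * float64(m*m) / sum
}
```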

12. Flajolet-Martin

Purpose: Early method to estimate number of unique elements in a stream.

Simple Usage: Estimate how many distinct IP addresses hit a server.

Real-World Use Cases:

- Web analytics
- Spam detection (unique message signatures)

- DNS query monitoring

**Pseudocode:**

1. Hash each incoming element.

2. Count the number of trailing zeros in the hash, i.e. the 0-based position of its least significant 1 bit.

3. Track the maximum such position R.

4. Estimated count = 2^R

**C++ Implementation:**

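```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Number of trailing zero bits in x (32 when x == 0).
int trailingZeros(uint32_t x) {
    int count = 0;
    while ((x & 1) == 0 && count < 32) {
        x >>= 1;
        count++;
    }
    return count;
}

// Flajolet-Martin estimate of the number of distinct strings in stream.
uint64_t flajoletMartin(const std::vector<std::string>& stream) {
    int R = 0;
    std::hash<std::string> hasher;
    for (const auto& s : stream) {
        // std::hash returns size_t; keep only the low 32 bits.
        uint32_t hash = static_cast<uint32_t>(hasher(s));
        R = std::max(R, trailingZeros(hash));
    }
    // Shift a 64-bit value so that R == 32 does not overflow.
    return uint64_t{1} << R;
}
```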

13. Counting Bloom Filter

Purpose: Like Bloom Filter but supports deletions using counters.

**Simple Usage:** Membership check for keys in cache with support for removal.

Real-World Use Cases:

- Cache invalidation in CDNs

- Malware filtering

- Router forwarding table

**Pseudocode:**

1. Use k hash functions.

2. For each item, increment counters at k positions.

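3. For delete, decrement those counters (only for items that were actually inserted; otherwise counters for other items are corrupted).

4. Query: the item is possibly present if all k counters are > 0, and definitely absent otherwise.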

Go Implementation (simplified):

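```go
import "hash/fnv"

type CountingBloomFilter struct {
	counters []int
	k        int
}

// hash returns the i-th hash of item, derived by seeding FNV-1a with i.
func hash(item string, i int) int {
	h := fnv.New32a()
	h.Write([]byte{byte(i)})
	h.Write([]byte(item))
	return int(h.Sum32() >> 1) // drop one bit so the value is never negative
}

// Add increments the k counters selected by item.
func (cbf *CountingBloomFilter) Add(item string) {
	for i := 0; i < cbf.k; i++ {
		idx := hash(item, i) % len(cbf.counters)
		cbf.counters[idx]++
	}
}

// Remove decrements the k counters selected by item. Items that the filter
// reports as absent are ignored so that counters never go negative.
func (cbf *CountingBloomFilter) Remove(item string) {
	if !cbf.Query(item) {
		return
	}
	for i := 0; i < cbf.k; i++ {
		idx := hash(item, i) % len(cbf.counters)
		cbf.counters[idx]--
	}
}

// Query reports whether item is possibly in the set.
func (cbf *CountingBloomFilter) Query(item string) bool {
	for i := 0; i < cbf.k; i++ {
		idx := hash(item, i) % len(cbf.counters)
		if cbf.counters[idx] <= 0 {
			return false
		}
	}
	return true
}
```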

14. DGIM Algorithm

**Purpose:** Count 1s in a binary stream over a sliding window using little space.

**Simple Usage:** Count how many times a user clicked 'Yes' in last 100 responses.

Real-World Use Cases:

- Monitoring recent events in sliding windows (alerts, success/failures)

- Packet monitoring in network

**Pseudocode:**

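1. Each bucket = (timestamp of its most recent 1, size), where size is the number of 1s it covers and is always a power of 2.

2. When a 1 arrives, add a bucket of size 1. Whenever three buckets share a size, merge the two oldest into one bucket of twice the size.

3. This keeps at most 2 buckets of each size. Drop a bucket once its timestamp falls outside the window.

4. Estimate = sum of all bucket sizes, counting only half of the oldest bucket.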

Go Implementation (simplified):

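```go
type Bucket struct {
	timestamp int // time of the most recent 1 in the bucket
	size      int // number of 1s in the bucket (a power of 2)
}

// estimate returns the approximate number of 1s in the window. It assumes
// buckets are ordered newest first and that buckets older than windowSize
// have already been dropped, so the last bucket is the oldest one and may
// only partly overlap the window.
func estimate(buckets []Bucket, windowSize int) int {
	total := 0
	for i, b := range buckets {
		if i == len(buckets)-1 {
			total += b.size / 2 // half of the oldest bucket (integer division)
		} else {
			total += b.size
		}
	}
	return total
}
```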

15. Exponential Histogram

Purpose: Approximate count over a sliding window with error bound.

**Simple Usage:** Count how many messages arrived in the last 10 minutes.

Real-World Use Cases:

- Real-time dashboards

- IoT time-window stats

**Pseudocode:**

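1. Buckets have (timestamp, count), with counts that are powers of 2.

2. When too many buckets of the same size exist, merge the two oldest of that size into one bucket of twice the size.

3. Estimate count = sum of bucket sizes, counting only half of the oldest bucket (it may only partly overlap the window).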

Go Implementation (simplified):

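```go
type EHBucket struct {
	timestamp int // arrival time of the most recent element in the bucket
	size      int // number of elements covered (a power of 2)
}

// addBucket records one arrival at time ts. Buckets are ordered oldest first.
// This simplified version allows at most two buckets per size: whenever three
// share a size, the two oldest are merged, which may cascade to larger sizes.
// Expiring buckets that fall outside the window is omitted.
func addBucket(buckets *[]EHBucket, ts int) {
	b := append(*buckets, EHBucket{ts, 1})
	for i := len(b) - 1; i >= 2 && b[i-2].size == b[i].size; i -= 2 {
		// b[i-2], b[i-1], b[i] share a size: merge the two oldest,
		// keeping the newer of their two timestamps.
		b[i-1] = EHBucket{b[i-1].timestamp, b[i-2].size + b[i-1].size}
		b = append(b[:i-2], b[i-1:]...)
	}
	*buckets = b
}
```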
