0% found this document useful (0 votes)
47 views10 pages

Streaming Algorithms Complete

The document explains various streaming algorithms, including their purposes, real-life use cases, and pseudocode implementations. Key algorithms discussed include Misra-Gries for frequent item detection, Reservoir Sampling for random sampling, and HyperLogLog for estimating distinct elements. Each algorithm is accompanied by practical applications in fields such as web analytics, network monitoring, and real-time data processing.

Uploaded by

Yoshi Hao
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views10 pages

Streaming Algorithms Complete

The document explains various streaming algorithms, including their purposes, real-life use cases, and pseudocode implementations. Key algorithms discussed include Misra-Gries for frequent item detection, Reservoir Sampling for random sampling, and HyperLogLog for estimating distinct elements. Each algorithm is accompanied by practical applications in fields such as web analytics, network monitoring, and real-time data processing.

Uploaded by

Yoshi Hao
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Streaming Algorithms Explained with Use Cases

1. Misra-Gries Algorithm

Purpose: Identify elements in a data stream that occur frequently (above a given threshold).

Real-Life Use: Detecting most common queries in a search engine, finding trending topics on Twitter.

Pseudocode:

Initialize an empty dictionary C

Set k (maximum number of counters)

For each element x in the stream:

If x in C:

C[x] += 1

Else if len(C) < k - 1:

C[x] = 1

Else:

Decrease all counts in C by 1

Remove entries with count 0

2. Lossy Counting

Purpose: Track frequent items approximately with error tolerance.

Real-Life Use: Online ad clickstream analysis, monitoring frequently accessed web pages.

Pseudocode:

Set epsilon (error parameter), N = 0, empty dict C

For each x:

N += 1

If x in C: C[x][0] += 1

Else: C[x] = [1, current_bucket - 1]

Every bucket_width items: remove entries if C[key][0] + C[key][1] <= current_bucket

3. Reservoir Sampling
Streaming Algorithms Explained with Use Cases

Purpose: Random sampling from a stream of unknown size.

Real-Life Use: Select random user sessions, logs, or tweets from a firehose for analysis.

Pseudocode:

Fill reservoir of size k with first k items

For each i > k, replace item at random with probability k/i

4. KMV (K Minimum Values)

Purpose: Estimate number of distinct elements (cardinality).

Real-Life Use: Estimate unique website visitors, count distinct IP addresses in logs.

Pseudocode:

Hash elements to [0,1]; maintain k smallest values

Estimate = (k - 1) / max(k smallest hash values)

5. Boyer-Moore Majority Vote

Purpose: Find majority element (>50% frequency).

Real-Life Use: Detect most dominant behavior in logs, top-voted answer in feedback.

Pseudocode:

candidate = None, count = 0

For each x:

If count == 0: candidate = x

If x == candidate: count += 1 else: count -= 1

6. Space-Saving Algorithm

Purpose: Find most frequent elements using fixed memory.

Real-Life Use: Top-k search terms, most active users on a platform.


Streaming Algorithms Explained with Use Cases

Pseudocode:

Keep k counters. If x is new and full, replace min entry and increase count.

7. Count Sketch

Purpose: Approximate frequency counts with negative noise correction.

Real-Life Use: Network traffic monitoring, approximate counting in DBMS.

Pseudocode:

Use d hash + sign functions. Update multiple counters using signs.

Estimate = median of (counter * sign) for each row.

8. Count-Min Sketch

Purpose: Estimate frequencies (overestimate only).

Real-Life Use: Spam detection, streaming logs, cache eviction policies.

Pseudocode:

Update count in d hash buckets; estimate = min(counts across all hashes)

9. Bloom Filter

Purpose: Test membership in a set with false positives.

Real-Life Use: Caches, databases (avoid unnecessary disk lookups), network security.

Pseudocode:

Hash x to k bit positions and set them to 1

To query: check all k bits are 1 (else not in set)

10. Sliding Window Model

Purpose: Analyze most recent data (e.g., last 1 min/hour).


Streaming Algorithms Explained with Use Cases

Real-Life Use: Real-time alerts, CPU/memory usage, fraud detection.

Pseudocode:

Maintain deque of last N elements, slide as new ones arrive

11. HyperLogLog (HLL)

**Purpose:** Estimate the number of distinct elements in a stream using limited memory.

**Simple Usage:** Count the number of unique visitors to a website without storing all IPs.

**Real-World Use Cases:**

- Analytics platforms like Google Analytics, Mixpanel

- Network monitoring tools for unique flows

- Databases (e.g., Redis, Postgres) for cardinality estimation

**Pseudocode:**

1. Use a good hash function to hash each element to a binary string.

2. Divide hash space into `m` buckets using the first few bits.

3. For each bucket, track the maximum number of leading zeros in the remaining bits.

4. Estimate the cardinality using the formula:

`Estimate = alpha * m^2 / sum(2^-R[i])`

**Go Implementation (simplified):**

```go

import "hash/fnv"

func leadingZeros(x uint32) int {

n := 1

for x >>= 1; x > 0; x >>= 1 {

n++
Streaming Algorithms Explained with Use Cases

return 32 - n

func HyperLogLog(stream []string, m int) float64 {

buckets := make([]int, m)

for _, item := range stream {

h := fnv.New32a()

h.Write([]byte(item))

hash := h.Sum32()

idx := int(hash % uint32(m))

val := hash >> uint32(32 - 5)

buckets[idx] = max(buckets[idx], leadingZeros(val))

sum := 0.0

for _, r := range buckets {

sum += 1.0 / math.Pow(2, float64(r))

alpha := 0.7213 / (1 + 1.079/float64(m))

return alpha * float64(m*m) / sum

```

12. Flajolet-Martin

**Purpose:** Early method to estimate number of unique elements in a stream.

**Simple Usage:** Estimate how many distinct IP addresses hit a server.

**Real-World Use Cases:**

- Web analytics
Streaming Algorithms Explained with Use Cases

- Spam detection (unique message signatures)

- DNS query monitoring

**Pseudocode:**

1. Hash each incoming element.

2. Count the position of the least significant 1 in the binary form.

3. Track the maximum such position R.

4. Estimated count = 2^R

**C++ Implementation:**

```cpp

#include <iostream>

#include <bitset>

#include <string>

#include <functional>

int trailingZeros(uint32_t x) {

int count = 0;

while ((x & 1) == 0 && count < 32) {

x >>= 1;

count++;

return count;

int flajoletMartin(std::vector<std::string>& stream) {

int R = 0;

for (auto& s : stream) {

std::hash<std::string> hasher;

uint32_t hash = hasher(s);


Streaming Algorithms Explained with Use Cases

R = std::max(R, trailingZeros(hash));

return 1 << R;

```

13. Counting Bloom Filter

**Purpose:** Like Bloom Filter but supports deletions using counters.

**Simple Usage:** Membership check for keys in cache with support for removal.

**Real-World Use Cases:**

- Cache invalidation in CDNs

- Malware filtering

- Router forwarding table

**Pseudocode:**

1. Use k hash functions.

2. For each item, increment counters at k positions.

3. For delete, decrement those counters.

4. Query if all k counters > 0.

**Go Implementation (simplified):**

```go

type CountingBloomFilter struct {

counters []int

k int

func (cbf *CountingBloomFilter) Add(item string) {


Streaming Algorithms Explained with Use Cases

for i := 0; i < cbf.k; i++ {

idx := hash(item, i) % len(cbf.counters)

cbf.counters[idx]++

func (cbf *CountingBloomFilter) Remove(item string) {

for i := 0; i < cbf.k; i++ {

idx := hash(item, i) % len(cbf.counters)

cbf.counters[idx]--

func (cbf *CountingBloomFilter) Query(item string) bool {

for i := 0; i < cbf.k; i++ {

idx := hash(item, i) % len(cbf.counters)

if cbf.counters[idx] <= 0 {

return false

return true

```

14. DGIM Algorithm

**Purpose:** Count 1s in a binary stream over a sliding window using little space.

**Simple Usage:** Count how many times a user clicked 'Yes' in last 100 responses.

**Real-World Use Cases:**


Streaming Algorithms Explained with Use Cases

- Monitoring recent events in sliding windows (alerts, success/failures)

- Packet monitoring in network

**Pseudocode:**

1. Each bucket = (timestamp, size)

2. Merge old buckets of same size.

3. Keep at most 2 buckets of each size.

4. Estimate count from all bucket sizes; halve last one.

**Go Implementation (simplified):**

```go

type Bucket struct {

timestamp int

size int

func estimate(buckets []Bucket, windowSize int) int {

total := 0

for i, b := range buckets {

if i == len(buckets)-1 {

total += b.size / 2

} else {

total += b.size

return total

```

15. Exponential Histogram


Streaming Algorithms Explained with Use Cases

**Purpose:** Approximate count over a sliding window with error bound.

**Simple Usage:** Count how many messages arrived in the last 10 minutes.

**Real-World Use Cases:**

- Real-time dashboards

- IoT time-window stats

**Pseudocode:**

1. Buckets have (timestamp, count), in powers of 2.

2. When too many buckets of same size, merge.

3. Estimate count = sum of bucket sizes.

**Go Implementation (simplified):**

```go

type EHBucket struct {

timestamp int

size int

func addBucket(buckets *[]EHBucket, ts int) {

*buckets = append(*buckets, EHBucket{ts, 1})

for len(*buckets) > 2 && (*buckets)[len(*buckets)-1].size == (*buckets)[len(*buckets)-2].size {

b2 := (*buckets)[len(*buckets)-1]

b1 := (*buckets)[len(*buckets)-2]

*buckets = (*buckets)[:len(*buckets)-2]

*buckets = append(*buckets, EHBucket{b1.timestamp, b1.size + b2.size})

```

You might also like