0% found this document useful (0 votes)
2 views6 pages

Problem 4 Report - D&C Compression

The document outlines a divide and conquer algorithm for string compression that detects repeated substrings and encodes them using reference encoding. It details the algorithm's steps, pseudocode, time and space complexity, correctness, and parallelization capabilities. Additionally, it compares the algorithm with suffix arrays and provides examples and test cases demonstrating its efficiency and effectiveness.

Uploaded by

Fahad Faisal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views6 pages

Problem 4 Report - D&C Compression

The document outlines a divide and conquer algorithm for string compression that detects repeated substrings and encodes them using reference encoding. It details the algorithm's steps, pseudocode, time and space complexity, correctness, and parallelization capabilities. Additionally, it compares the algorithm with suffix arrays and provides examples and test cases demonstrating its efficiency and effectiveness.

Uploaded by

Fahad Faisal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Problem 4: Divide and Conquer in String Compression

Course: CS-2009 Design and Analysis of Algorithms


Assignment: 3
Student: [Your Name]
Roll Number: [Your Roll Number]

1. Algorithm Explanation
The algorithm uses divide and conquer to detect repeated substrings for compression. It splits text into blocks,
finds repeats within and across blocks, then merges results to create compressed output using reference
encoding (similar to LZ77).

Key Steps:

1. Divide text recursively into left/right halves

2. Find longest repeated substring in each half

3. Check for repeats spanning the boundary

4. Merge by selecting repeat with maximum compression savings

5. Encode output with <offset, length> references

2. Pseudocode

FUNCTION compress(text, blockSize):


repeat = divideAndConquer(0, length(text))
RETURN createCompressedOutput(repeat)

FUNCTION divideAndConquer(start, end):


IF end - start ≤ blockSize:
RETURN findLongestRepeatInBlock(text[start...end])

mid = (start + end) / 2


leftRepeat = divideAndConquer(start, mid)
rightRepeat = divideAndConquer(mid, end)
crossRepeat = findCrossBoundaryRepeats(mid)

RETURN mergeRepeats([leftRepeat, rightRepeat, crossRepeat])


FUNCTION findLongestRepeatInBlock(block):
FOR each substring in block:
IF substring appears > 1 time AND longer than current best:
Update longest repeat
RETURN longest repeat

FUNCTION findCrossBoundaryRepeats(mid):
FOR overlap = 1 TO blockSize:
IF text[mid-overlap...mid] = text[mid...mid+overlap]:
IF this pattern repeats elsewhere:
Update cross repeat
RETURN cross repeat

FUNCTION mergeRepeats(repeats):
SELECT repeat with maximum savings
WHERE savings = length × (occurrences - 1)

FUNCTION createCompressedOutput(repeat):
Keep first occurrence
Replace other occurrences with <offset, length>
RETURN compressed string

3. Time and Space Complexity


Time Complexity

Component Complexity

Divide O(1)

Find repeats in block O(b³) where b = block size

Cross-boundary check O(b² × n)

Merge O(k) where k = candidates

Total per level O(n × b²)

Recursion depth O(log n)

Overall O(n × b² × log n)


 

Optimized: O(n log² n) with hash-based substring matching

Space Complexity
Recursion stack: O(log n)
Substring storage: O(n)

Overall: O(n)

4. Merge Step Correctness


Ensures Correctness by:

1. Completeness: Left + Right + Cross-boundary covers ALL possible repeats

2. No missed patterns: Cross-boundary check handles patterns spanning blocks

3. Optimal selection: Chooses repeat with max savings = length × (occurrences - 1)

Proof: Any repeated substring R is either:

Entirely in left half → Found by left recursion

Entirely in right half → Found by right recursion

Spans boundary → Found by cross-boundary check ∴ R is always detected

5. Parallelization
The algorithm naturally supports parallel execution:

PARALLEL divideAndConquer(start, end):


IF base case: RETURN sequential result

mid = (start + end) / 2

SPAWN leftTask = divideAndConquer(start, mid)


SPAWN rightTask = divideAndConquer(mid, end)

SYNC leftRepeat = [Link]


SYNC rightRepeat = [Link]

crossRepeat = findCrossBoundaryRepeats(mid)
RETURN mergeRepeats([leftRepeat, rightRepeat, crossRepeat])

Performance: With p processors, time reduces to O(n log² n / p + log n)


6. Comparison with Suffix Array
Metric Divide & Conquer Suffix Array

Time O(n log² n) O(n log n)

Space O(n) O(n)

Parallelizable ✓ Excellent ✗ Limited

Cache locality ✓ Sequential ✗ Random

Best for Parallel systems Single-threaded


 

Trade-off: D&C is slightly slower but better for multi-core/distributed systems

7. Diagrams
Divide and Conquer Tree for "abcabcabcdefdef"

[SPACE FOR D&C TREE DIAGRAM]

"abcabcabcdefdef"
|
___________|___________
| |
"abcabcabc" "defdef"
| |
____|____ _____|_____
| | | |
"abca" "bcabc" "def" "def"

Merge at each level to find best repeat


Result: "abc" appears 3 times (positions 0,3,6)

Algorithm Flowchart

[SPACE FOR FLOWCHART]

Start → Input Text



Divide at midpoint

Recursively process left & right

Check cross-boundary repeats

Merge: Select best repeat

Create compressed output

End

8. Example Scenario
Input: "abcabcabcdefdefghi" (block size = 6)

Execution:

Level 0: Split "abcabcabcdefdefghi" at position 9


Left: "abcabcabc" | Right: "defdefghi"

Level 1:
Left finds: "abc" at [0,3,6] → 3 occurrences
Right finds: "def" at [9,12] → 2 occurrences
Cross-boundary: No match

Merge: Select "abc" (savings = 3×(3-1) = 6 > 3×(2-1) = 3)

Output: "abc<3,3><6,3>defdefghi"
(First "abc" kept, others replaced with references)

Compression: Original 18 chars → Compressed ~16 chars

9. Output Screenshots

[SCREENSHOT 1: Basic compression with "abcabcabcdef"]

[SCREENSHOT 2: High repetition text showing good compression ratio]

[SCREENSHOT 3: Text with no repetition showing 0% compression]

[SCREENSHOT 4: Example execution trace for "abcabcabcdefdefghi"]


10. Sample Test Cases

Test 1: "ababababab"
Output: "ab<2,2><4,2><6,2><8,2>"
Ratio: ~40%

Test 2: "abcdefghijk"
Output: "abcdefghijk" (no compression)
Ratio: 0%

Test 3: "abcabcdefdef"
Output: Detects "abc" (3 chars, 2 times)
Best: "abc" or "def" based on savings

Conclusion
✓ Efficient: O(n log² n) time, O(n) space
✓ Correct: Merge ensures all repeats detected
✓ Parallel: Natural decomposition for multi-core systems
✓ Practical: Cache-friendly, good for large-scale compression

End of Report

You might also like