Problem 4: Divide and Conquer in String Compression
Course: CS-2009 Design and Analysis of Algorithms
Assignment: 3
Student: [Your Name]
Roll Number: [Your Roll Number]
1. Algorithm Explanation
The algorithm uses divide and conquer to detect repeated substrings for compression. It splits text into blocks,
finds repeats within and across blocks, then merges results to create compressed output using reference
encoding (similar to LZ77).
Key Steps:
1. Divide text recursively into left/right halves
2. Find longest repeated substring in each half
3. Check for repeats spanning the boundary
4. Merge by selecting repeat with maximum compression savings
5. Encode output with <offset, length> references
2. Pseudocode
FUNCTION compress(text, blockSize):
repeat = divideAndConquer(0, length(text))
RETURN createCompressedOutput(repeat)
FUNCTION divideAndConquer(start, end):
IF end - start ≤ blockSize:
RETURN findLongestRepeatInBlock(text[start...end])
mid = (start + end) / 2
leftRepeat = divideAndConquer(start, mid)
rightRepeat = divideAndConquer(mid, end)
crossRepeat = findCrossBoundaryRepeats(mid)
RETURN mergeRepeats([leftRepeat, rightRepeat, crossRepeat])
FUNCTION findLongestRepeatInBlock(block):
FOR each substring in block:
IF substring appears > 1 time AND longer than current best:
Update longest repeat
RETURN longest repeat
FUNCTION findCrossBoundaryRepeats(mid):
FOR overlap = 1 TO blockSize:
IF text[mid-overlap...mid] = text[mid...mid+overlap]:
IF this pattern repeats elsewhere:
Update cross repeat
RETURN cross repeat
FUNCTION mergeRepeats(repeats):
SELECT repeat with maximum savings
WHERE savings = length × (occurrences - 1)
FUNCTION createCompressedOutput(repeat):
Keep first occurrence
Replace other occurrences with <offset, length>
RETURN compressed string
3. Time and Space Complexity
Time Complexity
Component Complexity
Divide O(1)
Find repeats in block O(b³) where b = block size
Cross-boundary check O(b² × n)
Merge O(k) where k = candidates
Total per level O(n × b²)
Recursion depth O(log n)
Overall O(n × b² × log n)
Optimized: O(n log² n) with hash-based substring matching
Space Complexity
Recursion stack: O(log n)
Substring storage: O(n)
Overall: O(n)
4. Merge Step Correctness
Ensures Correctness by:
1. Completeness: Left + Right + Cross-boundary covers ALL possible repeats
2. No missed patterns: Cross-boundary check handles patterns spanning blocks
3. Optimal selection: Chooses repeat with max savings = length × (occurrences - 1)
Proof: Any repeated substring R is either:
Entirely in left half → Found by left recursion
Entirely in right half → Found by right recursion
Spans boundary → Found by cross-boundary check ∴ R is always detected
5. Parallelization
The algorithm naturally supports parallel execution:
PARALLEL divideAndConquer(start, end):
IF base case: RETURN sequential result
mid = (start + end) / 2
SPAWN leftTask = divideAndConquer(start, mid)
SPAWN rightTask = divideAndConquer(mid, end)
SYNC leftRepeat = [Link]
SYNC rightRepeat = [Link]
crossRepeat = findCrossBoundaryRepeats(mid)
RETURN mergeRepeats([leftRepeat, rightRepeat, crossRepeat])
Performance: With p processors, time reduces to O(n log² n / p + log n)
6. Comparison with Suffix Array
Metric Divide & Conquer Suffix Array
Time O(n log² n) O(n log n)
Space O(n) O(n)
Parallelizable ✓ Excellent ✗ Limited
Cache locality ✓ Sequential ✗ Random
Best for Parallel systems Single-threaded
Trade-off: D&C is slightly slower but better for multi-core/distributed systems
7. Diagrams
Divide and Conquer Tree for "abcabcabcdefdef"
[SPACE FOR D&C TREE DIAGRAM]
"abcabcabcdefdef"
|
___________|___________
| |
"abcabcabc" "defdef"
| |
____|____ _____|_____
| | | |
"abca" "bcabc" "def" "def"
Merge at each level to find best repeat
Result: "abc" appears 3 times (positions 0,3,6)
Algorithm Flowchart
[SPACE FOR FLOWCHART]
Start → Input Text
↓
Divide at midpoint
↓
Recursively process left & right
↓
Check cross-boundary repeats
↓
Merge: Select best repeat
↓
Create compressed output
↓
End
8. Example Scenario
Input: "abcabcabcdefdefghi" (block size = 6)
Execution:
Level 0: Split "abcabcabcdefdefghi" at position 9
Left: "abcabcabc" | Right: "defdefghi"
Level 1:
Left finds: "abc" at [0,3,6] → 3 occurrences
Right finds: "def" at [9,12] → 2 occurrences
Cross-boundary: No match
Merge: Select "abc" (savings = 3×(3-1) = 6 > 3×(2-1) = 3)
Output: "abc<3,3><6,3>defdefghi"
(First "abc" kept, others replaced with references)
Compression: Original 18 chars → Compressed ~16 chars
9. Output Screenshots
[SCREENSHOT 1: Basic compression with "abcabcabcdef"]
[SCREENSHOT 2: High repetition text showing good compression ratio]
[SCREENSHOT 3: Text with no repetition showing 0% compression]
[SCREENSHOT 4: Example execution trace for "abcabcabcdefdefghi"]
10. Sample Test Cases
Test 1: "ababababab"
Output: "ab<2,2><4,2><6,2><8,2>"
Ratio: ~40%
Test 2: "abcdefghijk"
Output: "abcdefghijk" (no compression)
Ratio: 0%
Test 3: "abcabcdefdef"
Output: Detects "abc" (3 chars, 2 times)
Best: "abc" or "def" based on savings
Conclusion
✓ Efficient: O(n log² n) time, O(n) space
✓ Correct: Merge ensures all repeats detected
✓ Parallel: Natural decomposition for multi-core systems
✓ Practical: Cache-friendly, good for large-scale compression
End of Report