COS 212
External Sorting
The Need for External Sorting
Many sorting algorithms have been proposed, some
more efficient than others
What assumption do the sorting algorithms make about
the data?
Easy random access
Go to any data[i] in one unit of time
What if your data is stored in a file on the hard drive?
Just load it into memory!
Reading and writing to a file
(and other external
…What if your file is 50GB long?
memory) is notoriously
Doesn’t fit into memory slow!
No problem, just access the file every time you need an
element?..
What about sorting a linked list?
External Sorting
External sorting algorithms are designed to work well
even when your data does not fit into memory / offers
no quick random access
How?
Merge Sort for the rescue!
Basis for most external sorting routines
Sort any number of records using a tiny amount of main
memory
Extreme (base) case: memory can fit two data items only
Main idea:
Load data into memory one chunk at a time
Sort each chunk (sorted chunk = run)
Use the merge algorithm to combine runs!
Let’s remind ourselves of how mergesort works…
Main-Memory Merge Sort
Merge-Sort(A)
Merge-Sort(A)
01
01 if
if length(A)
length(A) >> 11 then
then
02
02 Copy
Copy the first half
the first half of
of AA into
into array
array A1
A1 Divide
03
03 Copy
Copy the second half of A into array A2
the second half of A into array A2
04 Merge-Sort(A1)
04
05
Merge-Sort(A1)
Merge-Sort(A2) Conquer
05 Merge-Sort(A2)
06
06 Merge(A,
Merge(A, A1,
A1, A2)
A2) Combine
Running time for Merge sort: O(n log n)
Merge-Sort Recursion Tree
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
1 2 5 7 9 10 13 19 3 4 6 8 11 12 15 17
log2N
1 2 5 10 7 9 13 19 3 4 8 15 6 11 12 17
2 10 1 5 13 19 7 9 4 15 3 8 12 17 6 11
10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
N
Backtracking: merge runs (sorted sequences) of
size x into runs of size 2x, decrease the number of
runs twofold.
We can’t fit all the data into memory, but we can
fit some of it – what should be stored in RAM?
External-Memory Merge-Sort
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
Two-way merge
1 2 5 7 9 10 13 19 3 4 6 8 11 12 15 17
Two-way merges
1 2 5 10 7 9 13 19 3 4 8 15 6 11 12 17
Main-memory Main-memory Main-memory Main-memory
sort sort sort sort
10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
Bottom-up rather than top-down
Initial runs do not have to be of size 1, but can rather be of the
size that fits into memory (M data elements)
We can sort M elements with heapsort or another sorting
algorithm
All we need to do is merge – merge is sequential, thus
efficient even without random access.
Yes, but… We can’t store two runs in memory! How do we
merge?
Read as much
Simple External Merge-Sort as you can fit
Original unsorted input into memory,
file: then sort!
File 1 81 94 11 96 12 35 17 99 28 58 41 75 15
File 2
File 3
File 4
Read chunks of size M (= 3 in this example) at a time,
sort in RAM, write alternatively to File 3 and File 4
File 1
Write sorted
File 2 runs into
separate
File 3 11 81 94 17 28 99 15 external files
Merge can work directly on
File 4 12 35 96 41 58 75 the files that store sorted
runs
Simple External Merge-Sort
File 1
File 2
File 3 11 81 94 17 28 99 15
File 4 12 35 96 41 58 75
Merge the 1st run of File 3 & File 4, write to File 1;
Merge the 2nd run of File 3 & File 4, write to File 2;
etc.
File 1 11 12 35 81 94 96 15
File 2 17 28 41 58 75 99
File 3
File 4
Simple External Merge-Sort
File 1 11 12 35 81 94 96 15
File 2 17 28 41 58 75 99
File 3
File 4
Merge the 1st run of File 1 & File 2, write to File 3;
Merge the 2nd run of File 1 & File 2, write to File 4.
File 1
File 2
File 3 11 12 17 28 35 41 58 75 81 94 96 99
File 4 15
passes for the
Simple External Merge-Sort merging, + 1
initial pass to
File 1 construct the runs
File 2
File 3 11 12 17 28 35 41 58 75 81 94 96 99
File 4 15
Merge the last two runs, write to File 1
File 1 11 12 15 17 28 35 41 58 75 81 94 96 99
File 2
File 3
File 4
Multiway (k-way) Merge-Sort
2-way (simple) merge sort requires passes through the
files to sort the data
Example on the previous slides: passes
Fewer passes – faster algorithm (less I/O)
Can we reduce the number of passes?
Yes we can – by using more files!
Multiway merge: merge k sorted runs
How?
Two runs: pick the smallest one of the two elements
k runs: pick the smallest one out of k
Heaps can help!
Multiway (3-way) Merge-Sort k=3
Original unsorted input
file:
File 1 81 94 11 96 12 35 17 99 28 58 41 75 15
File 2
File 3
File 4
File 5
File 6
Phase 1: each run is sorted and recorded into one of 2*k
files
File 1
File 2
File 3 2k =
File 4 11 81 94 41 58 75 6
File 5 12 35 96 15
File 6 17 28 99
Multiway (3-way) Merge-Sort
File 1
Put in a
Filemin-heap
2
File 3
File 4 11 81 94 41 58 75
File 5 12 35 96 15
File 6 17 28 99
Phase 2: Perform 3-way merge on the sorted runs!
File 1 11 12 17 28 35 81 94 96 99
File 2 15 41 58 75
File 3
File 4
File 5
File 6
Multiway (3-way) Merge-Sort
File 1 11 12 17 28 35 81 94 96 99
File 2 15 41 58 75
File 3
File 4
File 5
File 6
Phase 2: Perform 3-way merge on the sorted runs
File 1
File 2
File 3
File 4 11 12 15 17 28 35 41 58 75 81 94 96 99
File 5
passes ⌈ log3 13/3⌉=2
File 6
Polyphase Merge-Sort
What is the problem with the multiway mergesort?
The number of files (or other output devices) = 2 * k
This might be infeasible and/or impractical
Polyphase merge: use k + 1 instead of 2k files
How?
Split the data unevenly between the files, i.e. some files get more
runs than others
Suppose we have 3 files only, and a total of 8 runs of length M
Standard split (even): 4 runs per file
Polyphase split (uneven): 5 runs on one file, 3 runs on another file
Input: file F1, output: 5 runs on file F2, 3 runs on file F3
Merge F2 + F3 until you have no more runs on one,
put the result on F1
F3 is now empty!
Thus, merge remainder of F2 + F1, put on F3
Repeat until sorted!
Polyphase Merge-Sort
File 1 81 94 11 96 12 35 17 99 28 58 41 75 15
File 2
File 3
Total number of runs: 5, put the first 3 on F2, last 2
on F3:
File 1
File 2 11 81 94 12 35 96 17 28 99
File 3 41 58 75 15
Polyphase Merge-Sort
File 1
File 2 11 81 94 12 35 96 17 28 99
File 3 41 58 75 15
Merge F2 and F3 into F1: F2 is left with some
baggage!
File 1 11 41 58 75 81 94 12 15 35 96
File 2 17 28 99
File 3
We can now merge F1 and F2 into F3
Polyphase Merge-Sort
File 1 11 41 58 75 81 94 12 15 35 96
File 2 17 28 99
File 3
Merge F1 and F2 into F3: F1 is left with some
baggage!
File 1 12 15 35 96
File 2
File 3 11 17 28 41 58 75 81 94 99
We can now merge F1 and F3 into F2
Polyphase Merge-Sort
File 1 12 15 35 96
File 2
File 3 11 17 28 41 58 75 81 94 99
Merge F1 and F3 into F2: done!
File 1
File 2 11 12 15 17 28 35 41 58 75 81 94 96 99
File 3
Polyphase Merge-Sort
In the example, we have split 5 runs into 2 + 3 runs
The split matters:
if the runs are distributed between two files incorrectly, there will
be a lot of unnecessary and slow back and forth
How do we know how to split the runs?
Ask Fibonacci!
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ....
For a number of runs = ,
split the runs such that the 1st and the
2nd output have
and runs.
If the number of runs
is not a Fibonacci
number, pad with
“dummy” runs!
Replacement Selection Merge-Sort
Let’s reconsider Phase 1 of external merge sort:
Splitting input data into sorted runs
What we have done so far:
Fill memory with M data points, sort, store
Maximum size of initial run: M
Can we do better?
Suppose heap sort is utilised to sort M elements
Min elements are removed one by one from the heap and written
to a file
As you remove min elements, you free up space in memory!
Replacement selection:
After min element was written to a file, read the next input
element;
If the element > min element just removed, then add it to the
heap!
If new element is smaller, can’t add to this run - save for the next
Result: longer runs, fewer merges!
Replacement Selection
File 1 81 94 11 96 12 35 17 99 28 58 41 75 15
11 < 81 >
96 12 12 is not
Heap: 11 94 81 81 94 96 94 96 12 a part of
the heap
We constructed a run > M 94 >
35 35 is not
96 35 12 a part of
Can become very efficient if
the heap
you get runs >> M 96 >
17 End of the
12 35 17 17 35 12 run!
Rebuild
heap
File 2 11 81 94 96
Summary
External sorting
Necessary when sorting data that does not fit into RAM
(memory)
External sorting algorithms do not assume that you have
cheap random access to every data element
External Merge Sort
External algorithms are based on a bottom-up merge sort
First, data is split into sorted runs
Runs are then iteratively merged
External merge sort variations:
Simple (2-way)
Multiway (k-way)
Polyphase
Replacement selection
Next lecture: Graphs