0% found this document useful (0 votes)
11 views55 pages

Externalsorting Fall2023

Uploaded by

erterek23
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views55 pages

Externalsorting Fall2023

Uploaded by

erterek23
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 55

External Sorting

CENG 351 Data Management and File Structures 1


External Sorting
• Problem: Sort 1GB of data with 1MB of
RAM.
• When a file doesn’t fit in memory, there
are two stages in sorting:
1. File is divided into several segments, each of
which sorted separately
2. Sorted segments are merged.
(Each stage involves reading and writing the file
at least once)
CENG 351 Data Management and File Structures 2
How to sort segments
• Any efficient sorting algorithm (such as
Quicksort) can be used to sort segments
• Each sorted segment is called a run.
• Each run will be the size of the available
memory.

CENG 351 Data Management and File Structures 3


External sort for Large Files
Basic idea
1. Form runs (i.e. sorted segments):
• bring as many records as possible to main
memory, sort them, save it into a small file on
disk.
• Repeat this until we have read all records
from the original file and written them as
sorted segments (i.e. runs) to disk.
2. Do a multiway merge of the runs.
CENG 351 Data Management and File Structures 4
External sort for Large Files (cont.)
• How big is each run?
– As big as the available memory.
• What is the time it takes to create all sorted
segments?
– Ignoring the seek time and assuming b blocks in the file

The time for creating the initial sorted segments is 2b*btt


(read in the file as segments and write out the runs)
• Note that the entire file has not been sorted yet.
• These are just sorted segments, and
• the size of each segment is limited to the size of the available
memory used for this purpose.

CENG 351 Data Management and File Structures 5


Multiway Merging
• K-way merge: we want to merge K sorted input
lists to create a single sorted output list. (K is the
order of a K-way merge)

• We will adapt the 2-way merge algorithm:


– Instead of two lists, keep an array of lists: list[0], list[1],
… list[k-1]

CENG 351 Data Management and File Structures 6


Unsorted Example -- Step 1: Create Runs
file
21
110
27
5 Memory (buffer)
108 of size 6
9
109
11
25
107
2
4
10
22
3
1
4
21
CENG 351 Data Management and File Structures 7
Unsorted Example -- Step 1: Create Runs Run1
file 5
9
21 21
110 27
27 108
5 Memory (buffer)
110
108 of size 6
9 21
109 110
11 27
25 5
107 108
2 9
4
10
22
3
1
4
21
CENG 351 Data Management and File Structures 8
Unsorted Example -- Step 1: Create Runs Run1
file 5
9
21 21
110 27
27 108
5 Memory (buffer)
110
108 of size 6
9 Run2
109 109 2
11 11 4
25 25 11
107 107 25
2 2 107
4 4 109
10
22
3
1
4
21
CENG 351 Data Management and File Structures 9
Unsorted Example -- Step 1: Create Runs Run1
file 5
9
21 21
110 27
27 108
5 Memory (buffer)
110
108 of size 6
9 Run2
109 2
10
11 4
22
25 11
3
107 25
1
2 107
4
4 109
21
10 Run3
22 1
3 3
1 4
4 10
21 21
CENG 351 Data Management and File Structures 10
22
Run1 Example -- Step 2: K-way Merge
5
9
21
27
108 Memory (buffer)
110 of size 6
Run2
2
4
11
25
107
109
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 11
22
Run1 Example -- Step 2: K-way Merge
5
9
21
27
108 Memory (buffer)
110 of size 6
Run2
2 5
4 9
11 2
25 4
107 1
109 3
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 12
22
Run1 Example -- Step 2: K-way Merge
5
9
21
27
108 Memory (buffer)
110 of size 6
Run2
2 5
4 9
11 2
25 4
107 1
109 3
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 13
22
Run1 Example -- Step 2: K-way Merge
5 1
9
21
27
108 Memory (buffer)
110 of size 6
Run2
2 5
4 9
11 2
25 4
107 1
109 3
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 14
22
Run1 Example -- Step 2: K-way Merge
5 1
9 2
21
27
108 Memory (buffer)
110 of size 6
Run2
2 5
4 9
11 2
25 4
107 1
109 3
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 15
22
Run1 Example -- Step 2: K-way Merge
5 1
9 2
21 3
27
108 Memory (buffer)
110 of size 6
Run2
2 5
4 9
11 2
25 4
107 4
109 10
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 16
22
Run1 Example -- Step 2: K-way Merge
5 1
9 2
21 3
27 4
108 Memory (buffer)
110 of size 6
Run2
2 5
4 9
11 11
25 25
107 4
109 10
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 17
22
Run1 Example -- Step 2: K-way Merge
5 1
9 2
21 3
27 4
108 Memory (buffer)
4
110 of size 6 5
Run2
2 5
4 9
11 11
25 25
107 4
109 10
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 18
22
Run1 Example -- Step 2: K-way Merge
5 1
9 2
21 3
27 4
108 Memory (buffer)
4
110 of size 6 5
Run2 9
2 21
4 27
11 11
25 25
107 4
109 10
Run3
1
3
4
10
21 CENG 351 Data Management and File Structures 19
22
Run1 Example -- Step 2: K-way Merge
5 1
9 2
21 3
27 4
108 Memory (buffer)
4
110 of size 6 5
Run2 9
2 21
4 27
11 11
25 25
107 4
109 10
Run3
1
3 Continues…
4
10
21 CENG 351 Data Management and File Structures 20
22
Run1 Example -- Step 2: K-way Merge
5
9
21 1
27 2
108 Memory (buffer) 3
110 of size 6 4
4
Run2 5
2 . 9
4 . 10
11 . 11
25 . 21
107 . 21
109 . 22
Run3 At the end 25
1 27
3 107
We obtain the 108
4
10 sorted file 109
21 110
CENG 351 Data Management and File Structures 21
22
Ext. Sort (2 steps) Run1 Final Sorted
Unsorted file
5 File
21 9
110 1
21
27 2
27
5 Memory (buffer) Memory (buffer) 3
108
108 of size 6 of size 6 4
110
9 4
109 …
Run2 . 5
11 2 . 9
25 4 . 10
107 11 . 11
2 25 . 21
4 107 . 21
10 109 22
22 Run3 25
3 27
1
1 107
3
4 108
4
21 109
10
110
Read Cost: Write Cost: 21 22
22
Ext. Sort (2 steps) Run1 Final Sorted
Unsorted file
5 File
21 9
110 1
21
27 2
27
5 Memory (buffer) Memory (buffer) 3
108
108 of size 6 of size 6 4
110
9 4
109 …
Run2 . 5
11 2 . 9
25 4 . 10
107 11 . 11
2 25 . 21
4 107 . 21
10 109 22
22 Run3 25
3 27
1
1 107
3
4 108
4
21 109
10
110
Read Cost: Write Cost: 21 23
3 (s+r) + Tx 3 (s+r) + Tx 22
Ext. Sort (2 steps) Run1 Final Sorted
Unsorted file
5 File
21 9
110 1
21
27 2
27
5 Memory (buffer) Memory (buffer) 3
108
108 of size 6 of size 6 4
110
9 4
109 …
Run2 . 5
11 2 . 9
25 4 . 10
107 11 . 11
2 25 . 21
4 107 . 21
10 109 22
22 Run3 25
3 27
1
1 Assume: Output 107
3
4 buffer of size 2 108
4
21 109
10
Read Cost: Write Cost: 21 Read Cost: Write Cost: 110
24
3 (s+r) + Tx 3 (s+r) + Tx 22 9 (s+r) + Tx 9 (s+r) + Tx
Multiway Merging to sort Large Files
Let us consider the following example:
• File to be sorted:
– 8,000,000 records
– R = 100 bytes
Total file size = 800MB

• Memory available as a work area: 10MB (not


counting memory used to hold program, other
variables, O.S., I/O buffers etc.)

CENG 351 Data Management and File Structures 25


Step 1: Create Runs
Disk I/Os are performed in the following procedures:
1. Reading records into main memory for sorting
2. Writing runs to disk.

These two steps are done as follows:


• Read 10MB, write 10MB (repeat this 80 times, because file is
800MB). So there will be 80 runs
– In terms of basic disk operations, this operation costs:
• For reading: (80 seeks +) a total transfer time of 800 MB (TX)
• Same for writing: (80 seeks +) a total transfer time of 800 MB (TX)

CENG 351 Data Management and File Structures 26


Unsorted Ex. -- Step 1: Create Runs Run1
File (800 MB) …

Run2
Memory (buffer)

of 10 MB

Run80

Read: Write:
80 (s+r) + Tx 80 (s+r) + Tx
CENG 351 Data Management and File Structures 27
Step 2: Multiway Merging
• Read runs into memory for merging.
– Available memory is 10 M and each run size is 10MB.
– There are 80 runs.
– We need to do a 80-way merge.
– 10M / 80 = 125,000 bytes are available for each list to be merged.
• Read one piece of each run.
• How many pieces to be read for each run?
Size of run/size of chunk = 10,000,000/125,000= 80 pieces
• Sorting completed in one round:
– For reading: a total transfer time of 800 MB (TX)
– Same for writing: a total transfer time of 800 MB (TX)
– In addition, time for a high number of seeks
CENG 351 Data Management and File Structures 28
Ext. Sort (800 MB) Run1 Final Sorted
Unsorted file
(800 MB) … File
80 pieces per run

Memory (buffer) Memory (buffer)


of 10 MB of 10 MB


Run2



Run80

80-way merge

Read: Write: Read: Write:


29
80 (s+r) + Tx 80 (s+r) + Tx 6400 (s+r) + Tx 6400 (s+r) + Tx
Step 2 (continues)
• Total # of seeks = Total # of pieces (counting all runs):
There are 80 runs and 80 pieces in each run:
✓ 802 pieces = 6400 seeks for reading
• Writing the merged records to the big sorted file:
• The number of separate writes closely approximates
reads:
✓ 802 pieces = 6400 seeks for writing
• So we estimate two seeks (one for reading and one for
writing) for each piece: 80* 80 pieces therefore
a total of 2*6400 = 12800 seeks are necessary.

CENG 351 Data Management and File Structures 30


Sorting a File that is 10 times larger
• How is the time for merge phase affected if
the file is 80 million records?
– More runs: 800 runs
– 800-way merge in 10MB memory
• i.e. divide the memory into 800 pieces (lists).
– Each piece (list) holds 1/800th of a run
– So, 800 runs * 800 seeks/run = 640,000 seeks
– Total = 2 * 640,000 = 1,280,000 seeks

CENG 351 Data Management and File Structures 31


Ext. Sort (8000MB) Run1 Final Sorted
Unsorted file
(8000 MB) … File
800 pieces per run

Memory (buffer) Memory (buffer)


of 10 MB of 10 MB


Run2



Run800
… 800-way merge

Read:
Read: Write: 640,000 (s+r) + Tx Write: 32
800 (s+r) + Tx 800 (s+r) + Tx 640,000 (s+r) + Tx
The cost of increasing the file size
• In general, for a K-way merge of K runs, the
buffer size reserved for each run is
– (1/K) * size of memory space = (1/K) * size of each run
• So K seeks are required to read all of the records
in each run and K seeks to write records.
• Since there are K runs, merge requires a total of
2K2 seeks.
• Because K is directly proportional to N it also
follows that the total cost is an O(N2) operation.

CENG 351 Data Management and File Structures 33


Multiple-pass multiway merges
• Instead of merging all runs at once, we
break the original set of runs into small
groups and merge the runs in these groups
separately.
– more memory space is available for each run;
hence fewer seeks are required per run.
• When all of the smaller merges are
completed, a second pass merges the new
set of merged runs.
CENG 351 Data Management and File Structures 34
10 MB
R1 Memory (buffer): 10 MB
R2

R32 Two-pass merge of
R33
800 runs
R34

R64


R769

R770

R800
10 MB
R1 Memory (buffer): 10 MB
R2

R32
Two-pass merge of
R33
800 runs:
R34
… 32 way- merge in the first
R64 pass


R769

R770

R800
10 MB 10 x 32 =320 MB
R1 Memory (buffer): 10 MB
R2 …

32-way merge
R32

R33 Memory (buffer): 10 MB

R34

25 sets of 32 runs each
(25 32-way merges) … 32-way merge
R64


R769 Memory (buffer): 10 MB

R770 …
… 32-way merge
R800
10 MB 10 x 32 =320 MB
R1 Final Sorted
Memory (buffer): 10 MB
File
R2 … …
… 32-way merge
R32 Memory (buffer):
10 MB
R33 Memory (buffer): 10 MB
… …
25 sets of 32 R34
runs each … 32-way merge
(25 32-way R64
merges) 25-way merge
(only one!)
… …
R769 Memory (buffer): 10 MB

R770 …
… 32-way merge
R800
Two-pass merge of 800 runs
Each run is
25 sets of 32 runs each (25 32-way merges)
10M
… … … … …

32-way 32-way
32-way 32-way
merge merge
merge merge
Each run is …
320M

One 25-way
merge

Two-pass merge of 800 runs


CENG 351 Data Management and File Structures 39
10 MB 10 x 32 =320 MB
R1 Final Sorted
Memory (buffer): 10 MB
File
R2 … …
… 32-way merge
R32 Memory (buffer):
10 MB
R33 Memory (buffer): 10 MB
… …
25 sets of 32 R34
runs each … 32-way merge
(25 32-way R64
merges) 25-way merge
(only one!)

R769

Memory (buffer): 10 MB

R770 …
… 32-way merge
R800

Merge Pass-1 Merge Pass-2


10 MB 10 x 32 =320 MB
R1 Final Sorted
Memory (buffer): 10 MB
File
R2 … …
… 32-way merge
R32 Memory (buffer):
10 MB
R33 Memory (buffer): 10 MB
… …
25 sets of 32 R34
runs each … 32-way merge
(25 32-way R64
merges) 25-way merge
How many times do we make an (only
exhaustive one!)
access

(read or write)?
R769

Memory (buffer): 10 MB

R770 …

How about seek cost (s+r) per merge pass?
32-way merge
R800

Merge Pass-1 Merge Pass-2


10 x 32 =320 MB Final Sorted
10 MB R1 32 pieces
File
R2 … …
32x 32 =1024 … 32-way merge
R32 Memory (buffer):
10 MB
R33 Memory (buffer): 10 MB
… …
25 sets of 32 R34
runs each … 32-way merge
(25 32-way R64
merges) 25-way merge
(only one!)

R769

Memory (buffer): 10 MB

R770 …
… 32-way merge
R800
Write Cost:
Read Cost (s + r):
1024 x 25 = 25,600
1024 x 25 = 25,600 Merge Pass-2
Merge Pass-1
10 x 32 =320 MB Final Sorted
10 MB R1 32 pieces
File
R2 … 320 / (10 / 25) =

32x 32 =1024 … 32-way merge
800 pieces

R32 Memory (buffer):


10 MB
R33 Memory (buffer): 10 MB
… …
25 sets of 32 R34
runs each … 32-way merge
(25 32-way R64
merges) 25-way merge
(only one!)

R769

Memory (buffer): 10 MB

R770 … Read:

… 32-way merge
800 x 25 = 20,000

R800 Write:
Write Cost: 800 x 25 = 20,000
Read Cost (s + r):
1024 x 25 = 25,600 1024 x 25 = 25,600 Merge Pass-2
Merge Pass-1
Cost of multi-pass merging
• 25 sets of 32 runs, followed by 25-way merge:
– Disadvantage: we read every record twice.
– Advantage: we can use larger lists and avoid a large
number of disk seeks.
• Calculations:
First Merge Pass:
– List size = 1/32 run => 32*32 = 1024 seeks
– For 25 32-way merges=> 25 * 1024 = 25,600 seeks

CENG 351 Data Management and File Structures 44


Second Merge Pass:
– For each 25 final runs, 1/25 memory space is allocated.
– So each input list is 0.4M and it can hold 4000 records
(or 1/800 run)
– Hence, 800 seeks per run, so we end up making 25 *
800 = 20,000 seeks.
Total number of seeks for reading in two passes:
25600 + 20000 = 45,600
• What about the total time for merge?
– We now have to transmit all of the records 4 times
instead of two.
– We also write the records twice, requiring an extra
45,600 seeks.
– Total number of seeks : 91200 seeks
• Still the trade is profitable.
CENG 351 Data Management and File Structures 45
External sorting of 800 run (8000 MB) file

With one-pass merge:

Create runs (Pass-0) 2 x 800 (s+r) + 2 Tx


Merge Pass-1 2 x 640,000 (s+r) + 2 Tx

With two-pass merge:

Create runs (Pass-0) 2 x 800 (s+r) + 2 Tx


Merge Pass-1 2 x 25,600 (s+r) + 2 Tx
Merge Pass-2 2 x 20,000 (s+r) + 2 Tx
CENG 351 Data Management and File Structures 46
Total Cost

Total # of block transfers= 2b*btt + (⌈logkm⌉) *2b*btt


= 2b*btt * (⌈logkm⌉ +1)
(when k-way multi-pass merge is used for m runs.)

Total cost including # of seeks:


= 2b*btt + (⌈logkm⌉)* [2k*m*(s+r) + 2b*btt]

creating sorted runs k-way multi-pass merge


(handled in ⌈logkm⌉ passes (steps))

CENG 351 Data Management and File Structures 47


DBMS view (from Raghu’s book)
• Sort a file with N blocks (pages)
• Available memory B blocks (buffer pages):
– Pass 0: use B buffer pages. Produce  N / B sorted runs
of B pages each.
– Pass 1, 2, …, etc.: merge B-1 runs.
INPUT 1

... INPUT 2
... OUTPUT ...
INPUT B-1
Disk Disk
B Main memory buffers
Cost of External Merge Sort
• Number of passes: 1 +  log B −1  N / B 
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort 108 page file:
– Pass 0: 108 / 5  = 22 sorted runs of 5 pages each
(last run is only 3 pages)
– Pass 1:  22 / 4  = 6 sorted runs of 20 pages each
(last run is only 8 pages)
– Pass 2: 6 / 4 = 2 sorted runs, 80 pages and 28
pages
– Pass 3: Sorted file of 108 pages
Number of Passes of External Sort

N B=3 B=5 B=9 B=17 B=129 B=257


100 7 4 3 2 1 1
1,000 10 5 4 3 2 2
10,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
Problem
Assume you are given a file of 800,000 records, 400 bytes each,
a block size of 4000 bytes with btt of 1 msec, and the available
memory size is 8MB. You are asked to sort this file using 4-way
merge with one disk drive. Assume s = 5 ms, r = 4 ms, btt= 1ms.
1. How many runs (sorted segments) are created ?
2. What is the size of each run?
3. Estimate the cost of creating these runs.

CENG 351 Data Management and File Structures 51


Solution
1. How many runs (sorted segments) are created?
Number of runs = file size/ 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑎𝑣𝑎𝑖𝑙.𝑚𝑒𝑚𝑜𝑟𝑦 =
800000∗400/ 8000000 = 40 sorted segments

2. What is the size of each run?


Size of each run = size of available memory = 8MB

3. Estimate the cost of creating these runs


• Number of blocks = b = 800000 ∗400/4000=80000 𝑏𝑙𝑜𝑐𝑘𝑠
• Cost of creating the runs = 2∗𝑏∗𝑏𝑡𝑡=2∗80000∗(1𝑚𝑠)=160000
𝑚𝑠

CENG 351 Data Management and File Structures 52


Problem (cont.)
4. What is the number of passes necessary to merge the sorted
segments using 4-way merge?
5. Estimate the cost of one pass of merging.
6. Estimate the total cost of sorting the file, including the
creation of sorted runs and the multi- step merging.

CENG 351 Data Management and File Structures 53


Solution (cont.)
4. What is the number of passes necessary to merge the sorted
segments using 4-way merge?
Number of passes = ⌈ log440 ⌉ = 3

5. Estimate the cost of one pass of merging.


Cost of one pass of merging =
= 2 * k * (# of runs) * (s+r) + 2 * b * btt
where k is the number in k-way merging (= no of pieces in
each run)
= 2 * 4 * 40 * (9 ms) + 2 * 80000 * (1ms)
= 2880 ms + 160000 ms
= 162880 ms
CENG 351 Data Management and File Structures 54
Solution (cont.)
6. Estimate the total cost of sorting the file, including the creation of
sorted runs and the multi- step merging.
Total cost of sorting the file =
# of passes * cost for one pass of merging + cost of creating runs
(Pass-0)
= 3 * (162880 ms) + 160000 ms = 648640 ms

CENG 351 Data Management and File Structures 55

You might also like