Advanced Database Systems
Spring 2025
Lecture #04:
HW & Disk Space Management
R&G: Chapters 1, 9.1, 9.3
DBMS: BIG PICTURE
SQL clients interact with a DBMS
You know how to write a SQL query
How is a SQL query executed?
[Diagram: SQL Client → Database Management System → Database]
DBMS: QUERY PLANNING
Parse, check, and verify the SQL query
SELECT [Link]
FROM Student S JOIN Enrolled E
ON [Link] = [Link]
WHERE [Link] = ‘INF-11199’
Translate into an efficient relational
query plan that can be executed
DBMS: OPERATOR EXECUTION
Execute a dataflow by operating on
records and files
[Query plan:]
π [Link]                   (sorting)
⋈ [Link] = [Link]          (sort-merge join)
σ [Link] = ‘INF-11199’     (B+ tree)
Student    Enrolled        (scan)
DBMS: FILES & INDEX MANAGEMENT
Organise tables and records as
groups of pages in a logical file

sid    name   dept     age
12344  Jones  CS       18
12355  Smith  Physics  23
12366  Gold   CS       21

[Diagram: database file on disk with directory page, header pages, Page 1 and Page 2]
DBMS: BUFFER MANAGEMENT
Transfer data between disk and memory
[Diagram: buffer pool in memory caching pages (directory, header, Page 2) from the database file on disk]
DBMS: DISK SPACE MANAGEMENT
Translate page requests into reading &
writing physical bytes on devices
[Diagram: pages mapped to raw bytes on the disk]
ARCHITECTURE OF A DBMS
Organised in layers
Each layer abstracts the layer below
Manage complexity
Performance assumptions
Example of good systems design
[Layer stack: SQL Client, Query Planning, Operator Execution, Files & Index Management, Buffer Management, Disk Space Management, Database]
DBMS: CONCURRENCY & RECOVERY
Two cross-cutting modules related to
storage and memory management
[Diagram: Concurrency Control and Recovery cut across the storage and memory layers]
OUTLINE
Storage Media
Disk Space Management
Buffer Management
File Layout
Page Layout
Record Layout
DISK-ORIENTED ARCHITECTURE
Most database systems are designed for non-volatile disk storage*
The primary location of the database is on disks (HDD and/or SSD)
Data processing happens in volatile main memory
The DBMS is responsible for moving data between disk and main memory
Major implications
Data stored on disk is not byte addressable. Instead, an API:
READ: transfer “page” of data from disk to RAM
WRITE: transfer “page” of data from RAM to disk
Disk reads & writes are very, very slow! ⇒ Must plan carefully!
* Volatile storage only maintains its data while the device is powered
WHY NOT STORE ALL IN MAIN MEMORY?
Costs too much
Cost of 1 TB storage (2020): $50 for HDD, $200 for SSD, $6,000 for RAM
High-end databases today in the petabyte range!
Roughly 60% of the cost of a production system is in the disks
Main memory is volatile
Obviously important if DB stops/crashes. We want data to be saved!
Some specialised systems do store entire databases in main memory
Faster than disk-oriented but with much higher cost/GB
Suitable for small databases
STORAGE HIERARCHY
Faster, smaller, volatile, byte-addressable (top):
  CPU Registers
  CPU Caches
  DRAM
Slower, larger, non-volatile, block-addressable (bottom):
  SSD
  HDD
  Network Storage
STORAGE HIERARCHY

                     Latency    Capacity
  CPU Registers      < 1 ns     B
  CPU Caches         < 10 ns    KB/MB
  DRAM               100 ns     GB      ← memory for active data (primary storage)
  SSD                0.1 ms     GB/TB
  HDD                10 ms      TB      ← disk for main database (secondary storage)
  Network Storage    30 ms      PB
ANATOMY OF A DISK
Platters rotate (say 15000 rpm)
Disk arm moves in or out to position
disk heads on a desired track
Tracks under heads make a “cylinder”
Only one head reads/writes at any one time
Block size is a multiple of (fixed) sector size
Sector = minimum storage unit (512 B or 4 KB)
ACCESSING A DISK PAGE
Data is stored and retrieved in units called disk blocks
Block size is determined by the filesystem (usually 4 KB, sometimes up to 64 KB)
Unlike RAM, time to retrieve a block depends on its location
Time to access (read/write) a disk block:
Seek time: moving disk arm to position disk heads on track
Rotational delay: waiting for target block to rotate under a head
Transfer time: actually moving data to/from disk surface
Seagate Cheetah 15K.7
4 disks, 8 heads, avg. 512 KB/track, 600GB capacity
rotational speed: 15 000 rpm
average seek time: 3.4 ms
transfer rate ≈ 163 MB/s
Access time to read one block of size 8 KB:
  Average seek time                             3.40 ms
  Average rotational delay (½ · 1/15000 min)    2.00 ms
  Transfer time (8 KB / 163 MB/s)               0.05 ms
  Total access time                             5.45 ms
Seek time and rotational delay dominate!
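The arithmetic above can be reproduced directly; a short sketch using the drive parameters from the Seagate slide (8 KB block assumed):

```python
# Average time to read one 8 KB block on a 15,000 rpm disk
# (seek time and transfer rate from the Seagate Cheetah 15K.7 slide).
RPM = 15_000
SEEK_MS = 3.4            # average seek time
RATE_MB_S = 163          # sustained transfer rate

rotational_ms = 0.5 * (60_000 / RPM)          # wait half a revolution on average
transfer_ms = (8 / 1024) / RATE_MB_S * 1000   # move 8 KB at 163 MB/s

total_ms = SEEK_MS + rotational_ms + transfer_ms
print(f"seek {SEEK_MS} ms + rotation {rotational_ms:.2f} ms "
      f"+ transfer {transfer_ms:.2f} ms ≈ {total_ms:.2f} ms")
```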
SEQUENTIAL VS. RANDOM ACCESS
What about accessing 1000 blocks of size 8 KB?
Random: 1000 · 5.45 ms = 5.45 s
Sequential: 3.4 ms + 2 ms + 1000 · 0.05 ms ≈ 55 ms
tracks store only 512KB ⟹ some additional (< 5 ms) track-to-track seek time
Sequential I/O is orders of magnitude faster than random I/O
avoid random I/O at all costs
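Plugging the per-block numbers from the previous slide into both access patterns makes the gap concrete (a sketch; the small track-to-track seeks are ignored):

```python
# Random vs. sequential read of 1000 blocks of 8 KB each,
# using the per-block costs from the Cheetah example above.
SEEK_MS, ROT_MS, XFER_MS = 3.4, 2.0, 0.05
N = 1000

random_ms = N * (SEEK_MS + ROT_MS + XFER_MS)     # reposition for every block
sequential_ms = SEEK_MS + ROT_MS + N * XFER_MS   # reposition once, then stream

print(f"random:     {random_ms / 1000:.2f} s")   # ≈ 5.45 s
print(f"sequential: {sequential_ms:.1f} ms")     # ≈ 55 ms
```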
ARRANGING BLOCKS ON DISK
‘Next’ block concept:
sequential blocks on same track, followed by
blocks on same cylinder, followed by
blocks on adjacent cylinder
Arrange file pages sequentially by ‘next’ on disk
Minimize seek and rotational delay
For a sequential scan, pre-fetch several blocks at a time!
Reading large consecutive blocks
“Amortises” seek time and rotational delay
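The amortisation is easy to quantify: a consecutive run pays one seek plus one rotational delay, then streams at the transfer rate. A sketch, reusing the Cheetah numbers from earlier:

```python
# Effective bandwidth when reading a consecutive run of blocks:
# one seek + one rotational delay, then pure transfer.
SEEK_S, ROT_S, RATE_MB_S = 0.0034, 0.002, 163

def effective_mb_s(run_mb):
    return run_mb / (SEEK_S + ROT_S + run_mb / RATE_MB_S)

for run_mb in (0.008, 0.064, 1.0, 8.0, 64.0):
    print(f"{run_mb:>6} MB run -> {effective_mb_s(run_mb):6.1f} MB/s")
```

Small runs are dominated by positioning cost; large runs approach the drive's raw transfer rate.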
SOLID STATE DRIVES
Alternative to conventional hard disks
Data accessed in pages, internally pages are organised into blocks
Fine-grain reads (4-8 KB pages), coarse-grain writes (1-2 MB blocks)
Issues in current generation (NAND)
Write amplification: Writing data in small pages causes erasing big blocks
Limited endurance: Only 2K-3K erasures before cell failure
Wear levelling: SSD controller needs to keep moving hot write units around
Price: SSD is 2-5x more expensive than HDD
SOLID STATE DRIVES
Read is fast and predictable
Single read access time: 30 µs
4KB random reads: ~500 MB/sec
Sequential reads: ~525 MB/sec
But writes are not! Random writes are much slower
Single write access time: 30 µs
4KB random writes: ~120 MB/sec
Sequential writes: ~480 MB/sec
Random access still slower than sequential access
SSD VS. HDD
SSD can achieve 1-10x the bandwidth (bytes/sec) of ideal HDD
Note: Ideal HDD spec numbers are hard to achieve
Expect 10-100x bandwidth for non-sequential reads
Locality matters for both
Reading/writing to “far away” blocks on HDD requires slow seek/rotation delay
Writing 2 “far away” blocks on SSD can require writing multiple much larger units
High-end flash drives are getting much better at this
And don’t forget
SSD is 2-5x more expensive than HDD
BOTTOM LINE
Very large DBs: relatively traditional
Disk still offers the best cost/GB by a lot
SSDs improve performance and performance variance
Smaller DB story is changing quickly
SSDs win at the low end (modest DB sizes)
Many interesting databases fit in RAM
Lots of change brewing on the HW storage tech side
Non-volatile memory likely to affect the design of future systems
We will focus on traditional RAM and disk
DATABASE STORAGE
Most DBMSs store data as one or more files on disk
Files consist of pages (loaded in memory), pages contain records
Data on disk is read & written in large chunks of sequential bytes
Block = Unit of transfer for disk read/write
Page = A common synonym for “block”
In some textbooks, “page” = a block-sized chunk of RAM
We will treat “block” and “page” as synonyms
I/O operation = read/write disk operation
Sequential pages: reading “next” page is fastest
SYSTEM DESIGN GOALS
Goal: allow the DBMS to manage databases > available main memory
Disk reads/writes are expensive ⟹ must be managed carefully
Minimise disk I/O, maximise usage of data per I/O
Spatial control
Where to write pages on disk
Goal: keep pages often used together as physically close as possible on disk
Temporal control
When to read pages into memory and when to write them to disk
Goal: minimise the number of CPU stalls from having to read data from disk
DISK SPACE MANAGEMENT
Lowest layer of DBMS, manages space on disk
Map pages to locations on disk
Load pages from disk to memory
Save pages back to disk
Introduces the concept of a page
Typical page size: 4–64 KB (a multiple of 4 KB)
Each page has a unique identifier: page ID
Higher levels call upon this layer to:
Allocate/de-allocate a page
Read/write a page
DISK SPACE MANAGEMENT: PAGE REQUESTS
Disk space manager can get requests for a sequence of pages
E.g., when higher levels execute a scan operator on a relation
Such requests are best satisfied by pages stored sequentially on disk
Physical details hidden from higher levels of system
Higher levels may “safely” assume Next Page is fast, so they will
simply expect sequential runs of pages to be quick to scan
Disk space manager aims to intelligently lay out data on disk
to meet the performance expectation of higher levels as best as possible
DISK SPACE MANAGEMENT: IMPLEMENTATION
Using local filesystem (FS)
Allocate one large “contiguous” file on an empty disk
Rely on OS and FS that sequential pages in this file are physically contiguous on disk
A logical database “file” may span multiple FS files on multiple disks/machines
Disk space manager maintains a mapping from page IDs to physical locations
physical location = filename + offset within that file
The OS and other apps know nothing about the contents of these files
Only the DBMS knows how to decipher their contents
Early DBMSs in the 1980s used custom ‘filesystems’ on raw storage
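The page-ID-to-offset mapping described above can be sketched as a toy single-file manager. This is a minimal illustration, not any real system's API: the class name, method names, and the 8 KB page size are all assumptions.

```python
import os, tempfile

PAGE_SIZE = 8192  # assumed 8 KB pages; real systems use 4-64 KB

class DiskSpaceManager:
    """Toy disk space manager: one big file, page ID -> byte offset."""

    def __init__(self, path):
        self.f = open(path, "w+b")
        self.next_pid = 0
        self.free = []                 # de-allocated page IDs, reused first

    def allocate(self):
        if self.free:
            return self.free.pop()
        pid = self.next_pid
        self.next_pid += 1
        self.f.truncate((pid + 1) * PAGE_SIZE)   # grow file to cover the page
        return pid

    def deallocate(self, pid):
        self.free.append(pid)          # real systems track this persistently

    def write_page(self, pid, data):
        assert len(data) == PAGE_SIZE
        self.f.seek(pid * PAGE_SIZE)   # physical location = offset in the file
        self.f.write(data)
        self.f.flush()

    def read_page(self, pid):
        self.f.seek(pid * PAGE_SIZE)
        return self.f.read(PAGE_SIZE)

mgr = DiskSpaceManager(os.path.join(tempfile.mkdtemp(), "demo.db"))
pid = mgr.allocate()
mgr.write_page(pid, b"x" * PAGE_SIZE)
assert mgr.read_page(pid) == b"x" * PAGE_SIZE
```

Because consecutive page IDs map to consecutive offsets in one file, a scan over a run of page IDs becomes sequential I/O whenever the filesystem keeps the file physically contiguous.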
SUMMARY
Magnetic disk and flash storage
Random access vs. sequential access (10x)
Physical data placement is important
Disk space management
Exposes data as a collection of pages
Pages: block-level organisation of bytes on disk
API to read/write pages to disk
Provides “next” locality
Abstracts device and file system details