PostgreSQL lingo

Navigating confusing Postgres terminology (HOT, TOAST! 🍞)

17 September 2020

Marsel Mavletkulov

Relation

PostgreSQL is an object-relational database management system

a relation (or predicate) is a boolean function of several arguments (attributes), e.g., a person relation with attributes name and age
the predicate returns true if its attribute values are factual

In fact, Bob is 18.

func person(name string, age int) bool { ... }

fmt.Println("Is Bob 18?", person("Bob", 18)) // true
fmt.Println("Is Bob 21?", person("Bob", 21)) // false

Relational calculus (declarative query language)

Since predicates return boolean values, you can build new predicates from existing ones using logical operations (OR, AND, NOT)

A database query is a predicate expressed as a logical formula

The database needs to find the attribute values that make the formula true
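
The idea of composing predicates and searching for values that satisfy a formula can be sketched in Go. This is only an illustration: the facts (Bob is 18, Alice is 21), the olderThan18 predicate, and the Solve helper are assumptions made for the example, and a real database does not enumerate candidates by brute force.

```go
package main

import "fmt"

// Predicate is a boolean function of the attributes name and age.
type Predicate func(name string, age int) bool

// person encodes two sample facts: Bob is 18, Alice is 21.
func person(name string, age int) bool {
	return (name == "Bob" && age == 18) || (name == "Alice" && age == 21)
}

// olderThan18 is a hypothetical second predicate, used only for illustration.
func olderThan18(name string, age int) bool {
	return age > 18
}

// And, Or, and Not build new predicates from existing ones.
func And(p, q Predicate) Predicate {
	return func(n string, a int) bool { return p(n, a) && q(n, a) }
}

func Or(p, q Predicate) Predicate {
	return func(n string, a int) bool { return p(n, a) || q(n, a) }
}

func Not(p Predicate) Predicate {
	return func(n string, a int) bool { return !p(n, a) }
}

// Solve tries every candidate combination and keeps the ones that make
// the query formula true.
func Solve(query Predicate, names []string, ages []int) []string {
	var found []string
	for _, n := range names {
		for _, a := range ages {
			if query(n, a) {
				found = append(found, fmt.Sprintf("%s,%d", n, a))
			}
		}
	}
	return found
}

func main() {
	query := And(person, olderThan18)
	fmt.Println(Solve(query, []string{"Bob", "Alice"}, []int{18, 21})) // [Alice,21]
}
```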

Relation as set

A predicate can be written down as a mathematical set of factual attribute values (the values for which the formula returns true), which in turn can be represented as a table

An element of that set is called a tuple (or a row in the table representation)

In fact, Bob is 18, Alice is 21.

{
    (name="Bob", age=18),   // person("Bob", 18) == true
    (name="Alice", age=21), // person("Alice", 21) == true
}

| name  | age
| ---   | ---
| Bob   | 18
| Alice | 21
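
A minimal Go sketch of the same idea: the relation is stored as a set of tuples, and the predicate becomes a membership test. The Tuple and Relation type names and the samplePerson helper are made up for this illustration.

```go
package main

import "fmt"

// Tuple is one element of the person relation.
type Tuple struct {
	Name string
	Age  int
}

// Relation is a set of tuples (the empty struct value takes no space).
type Relation map[Tuple]struct{}

// Contains plays the role of the predicate: it is true only for factual tuples.
func (r Relation) Contains(name string, age int) bool {
	_, ok := r[Tuple{Name: name, Age: age}]
	return ok
}

// samplePerson returns the two facts from the slide.
func samplePerson() Relation {
	return Relation{
		Tuple{Name: "Bob", Age: 18}:   {},
		Tuple{Name: "Alice", Age: 21}: {},
	}
}

func main() {
	person := samplePerson()
	fmt.Println("Is Bob 18?", person.Contains("Bob", 18)) // true
	fmt.Println("Is Bob 21?", person.Contains("Bob", 21)) // false
}
```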

Tuple

An immutable record with typed attributes is called a tuple (or cortege)

(
    name string = "Bob"
    age int = 18
)

The term originated as an abstraction of the sequence: single, couple, triple, quadruple, quintuple, sextuple, septuple, octuple, ..., n‑tuple

* etymology

Postgres relation

A Postgres query result is a relation

# SELECT * FROM person;
name  | age
------+-----
Bob   |  18
Alice |  21

pg_class is where you can find descriptions of relations such as tables, views, indexes, and sequences

# SELECT relname FROM pg_class;
   relname
-------------
person
pg_statistic
pg_type
...

Relation files

Where is a relation stored on disk?

# SELECT pg_relation_filepath('person') AS table, pg_relation_filepath('person_name') AS index;
     table       |      index
-----------------+------------------
base/16488/16493 | base/16488/16500

# \! ls /usr/local/pgsql/data/base/16488 | grep 16493
16493
16493_fsm
16493_vm

16493 is the main fork that contains table tuples (16493_fsm and 16493_vm are its free space map and visibility map), 16500 contains index tuples

Each file (up to 1 GB) is logically divided into 8 KB blocks called pages

A page starts with a header that describes the page content (e.g., table page, index page)
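
The arithmetic behind "files of about 1 GB divided into 8 KB pages" can be sketched as follows. The constants assume the default build settings (8 KB pages, 1 GB segment files), and locate is a hypothetical helper, not a Postgres function.

```go
package main

import "fmt"

const (
	pageSize        = 8 * 1024               // assumed default page size, 8 KB
	segmentSize     = 1024 * 1024 * 1024     // assumed default segment file size, 1 GB
	pagesPerSegment = segmentSize / pageSize // 131072 pages per segment file
)

// locate returns which segment file holds the given page of a relation
// and the byte offset of that page within the file.
func locate(page int64) (segment, offset int64) {
	return page / pagesPerSegment, (page % pagesPerSegment) * pageSize
}

func main() {
	fmt.Println(pagesPerSegment) // 131072

	segment, offset := locate(131073)
	fmt.Println(segment, offset) // 1 8192
}
```

With these defaults, page 131073 is the second page of the second segment file.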

Postgres tuple

Multiple processes (readers and writers) work with the person table concurrently

# INSERT INTO person(name, age) VALUES ('Eve', 18); -- Current transaction id is 29410.
# UPDATE person SET age=80 WHERE name='Eve'; -- Current transaction id is 29411.
# UPDATE person SET age=20 WHERE name='Eve'; -- Current transaction id is 29412.

The Eve row has three versions, each marked with two transaction IDs

xmin is the ID of the transaction that created the row version
xmax is the ID of the transaction that deleted the row version (0 if it has not been deleted)

xmin  |  xmax |  data
------+-------+--------
29410 | 29411 | Eve,18
29411 | 29412 | Eve,80
29412 |     0 | Eve,20
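
How xmin and xmax let readers pick a row version can be sketched with a toy rule: a reader whose snapshot is transaction ID T sees a version if it was created at or before T and not deleted at or before T. This is a deliberate simplification made for illustration. Real visibility checks in Postgres also depend on whether transactions committed or are still in progress, which the sketch ignores.

```go
package main

import "fmt"

// Version is one row version with the two transaction IDs from the table above.
type Version struct {
	Xmin uint32 // transaction that created the version
	Xmax uint32 // transaction that deleted the version, 0 if none
	Data string
}

// eveVersions reproduces the three versions of the Eve row.
func eveVersions() []Version {
	return []Version{
		{Xmin: 29410, Xmax: 29411, Data: "Eve,18"},
		{Xmin: 29411, Xmax: 29412, Data: "Eve,80"},
		{Xmin: 29412, Xmax: 0, Data: "Eve,20"},
	}
}

// visible applies the simplified rule: created at or before the snapshot
// and not deleted at or before the snapshot.
func visible(v Version, snapshot uint32) bool {
	created := v.Xmin <= snapshot
	deleted := v.Xmax != 0 && v.Xmax <= snapshot
	return created && !deleted
}

// visibleData returns the data of every version the snapshot can see.
func visibleData(versions []Version, snapshot uint32) []string {
	var out []string
	for _, v := range versions {
		if visible(v, snapshot) {
			out = append(out, v.Data)
		}
	}
	return out
}

func main() {
	fmt.Println(visibleData(eveVersions(), 29411)) // [Eve,80]
	fmt.Println(visibleData(eveVersions(), 29412)) // [Eve,20]
}
```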

Heap-only tuple (HOT)

"Heap" in this case means "table", i.e., its data (a bunch of table tuples)

ctid (tuple ID) is a pair (page number, tuple number within the page) that denotes the physical location of the tuple
t_ctid points to the next row version
hot ("heap-only tuple") means the tuple has no pointers from index tuples
hhu ("heap hot updated") means the tuple was HOT-updated: the chain of tuple IDs must be followed to reach the newer version

ctid  | t_ctid | hhu | hot |  xmin |  xmax |  data
------+--------+-----+-----+-------+-------+--------
(0,1) |  (0,2) |  t  |     | 29410 | 29411 | Eve,18
(0,2) |  (0,3) |  t  |  t  | 29411 | 29412 | Eve,80
(0,3) |  (0,3) |     |  t  | 29412 |     0 | Eve,20
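
Following the chain from the table above can be sketched like this. An index entry would point at (0,1), and the reader walks t_ctid links until it reaches a version that was not HOT-updated. The types and the latest helper are hypothetical, and the sketch skips the per-version visibility checks a real reader performs along the way.

```go
package main

import "fmt"

// TID is a tuple ID: page number and tuple number within the page.
type TID struct{ Page, Item int }

// HeapTuple holds only the fields shown in the table above.
type HeapTuple struct {
	TCtid      TID  // t_ctid: location of the next row version
	HotUpdated bool // hhu
	HeapOnly   bool // hot
	Data       string
}

// samplePage reproduces the three versions of the Eve row.
func samplePage() map[TID]HeapTuple {
	return map[TID]HeapTuple{
		{0, 1}: {TCtid: TID{0, 2}, HotUpdated: true, Data: "Eve,18"},
		{0, 2}: {TCtid: TID{0, 3}, HotUpdated: true, HeapOnly: true, Data: "Eve,80"},
		{0, 3}: {TCtid: TID{0, 3}, HeapOnly: true, Data: "Eve,20"},
	}
}

// latest follows t_ctid links from start until it finds a version
// that was not HOT-updated (or a link that points to itself).
func latest(page map[TID]HeapTuple, start TID) (TID, HeapTuple) {
	cur := start
	for {
		t, ok := page[cur]
		if !ok || !t.HotUpdated || t.TCtid == cur {
			return cur, t
		}
		cur = t.TCtid
	}
}

func main() {
	tid, tuple := latest(samplePage(), TID{0, 1})
	fmt.Println(tid, tuple.Data) // {0 3} Eve,20
}
```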

Heap-only tuple update

Table updates must be reflected in the table's indices

the more indices, the slower the update
extra work for vacuum to remove accumulated pointers to irrelevant table tuples
index size might not shrink even after removing many tuples (index page split)

The HOT update mechanism prevents creation of wasteful index tuples and facilitates in-page cleaning (less work for vacuum)

It works when none of the updated fields is part of any index and the table page has free space

you can reserve 50% of page space for updates with fillfactor=50
a chain of updates works only within a single page
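
The two conditions can be written down as a small check. This is a simplified sketch, not the actual Postgres logic: canHOTUpdate and its parameters are hypothetical, and the example assumes the person_name index covers the name column.

```go
package main

import "fmt"

// canHOTUpdate reports whether an update could be a HOT update under the
// two conditions above: no changed column is indexed, and the new row
// version fits into the free space of the same page.
func canHOTUpdate(changed []string, indexed map[string]bool, freeBytes, newTupleBytes int) bool {
	for _, column := range changed {
		if indexed[column] {
			return false // an index would need a new entry
		}
	}
	return newTupleBytes <= freeBytes
}

func main() {
	// Assumption: the person_name index is built on the name column.
	indexed := map[string]bool{"name": true}

	fmt.Println(canHOTUpdate([]string{"age"}, indexed, 4096, 40))  // true
	fmt.Println(canHOTUpdate([]string{"name"}, indexed, 4096, 40)) // false: indexed column changed
	fmt.Println(canHOTUpdate([]string{"age"}, indexed, 16, 40))    // false: page is too full
}
```

A lower fillfactor leaves more free bytes per page, so the second condition holds more often.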

The oversized attribute storage technique (TOAST)

Postgres requires a tuple to fit into a single page, but 8 KB is not enough in practice

The oversized attribute storage technique compresses a large attribute value (e.g., text, numeric, json) and/or stores it in a separate table called a TOAST table

split the value into chunks (about 2 KB each, so that four chunk rows fit on a page)
fetch and concatenate the chunks when reading a tuple

This keeps the original table smaller because it holds only a reference to the TOAST table, but reading a large field adds overhead when the field is used frequently

# SELECT relname, relfilenode FROM pg_class WHERE oid = (
    -- Find OID (16497) of TOAST table by table name "person" for which TOAST table was created.
    SELECT reltoastrelid FROM pg_class WHERE relname = 'person'
);
   relname     | relfilenode
---------------+-------------
pg_toast_16493 |       16497
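
The split and concatenate steps can be sketched in Go. The chunk size passed in main is an arbitrary illustrative value, split and concat are hypothetical helpers, and compression is left out entirely.

```go
package main

import "fmt"

// split cuts a large value into chunks of at most size bytes.
// A non-positive size returns the value as a single chunk.
func split(value []byte, size int) [][]byte {
	if size <= 0 {
		return [][]byte{value}
	}
	var chunks [][]byte
	for len(value) > 0 {
		n := size
		if len(value) < n {
			n = len(value)
		}
		chunks = append(chunks, value[:n])
		value = value[n:]
	}
	return chunks
}

// concat reassembles the original value from its chunks, in order.
func concat(chunks [][]byte) []byte {
	var value []byte
	for _, chunk := range chunks {
		value = append(value, chunk...)
	}
	return value
}

func main() {
	// Build a 5000-byte sample value.
	value := make([]byte, 5000)
	for i := range value {
		value[i] = byte('a' + i%26)
	}

	chunks := split(value, 2000)
	fmt.Println(len(chunks), len(chunks[2]))             // 3 1000
	fmt.Println(string(concat(chunks)) == string(value)) // true
}
```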

TOAST strategies

TOAST is triggered when a tuple is wider than TOAST_TUPLE_THRESHOLD (1/4 of a page, about 2 KB)

It compresses and/or moves field values until the tuple is shorter than TOAST_TUPLE_TARGET (2 KB)

extended (default): tries compression first, then resorts to separate storage
main: prioritizes compression; the value is moved to separate storage only if compression didn't help
external: separate storage without compression (makes substring operations on wide text columns faster at the cost of increased storage space)
plain: no compression, no separate storage

Buffer cache

Buffer cache is an array of buffers located in shared memory of Postgres processes

A buffer stores a page along with where it was read from and its state

When buffer cache becomes full

Postgres evicts a page that has not been used recently (a clock-sweep approximation of least recently used)
if the page was modified, it's considered dirty and must be written to disk first
changes are written sequentially to the WAL and flushed with fsync on each transaction commit

Write-ahead log (WAL) is a stream of executed actions (stored in segment files of 16 MB)

checkpointer process periodically writes all dirty pages to disk (this helps to keep WAL size fairly small)
bgwriter writes dirty pages that will likely be evicted soon
backend also writes dirty pages to disk if there are not enough buffers for a query
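
Eviction with write-back of dirty pages can be sketched with a toy cache. For brevity it uses plain least-recently-used ordering, which is an assumption of this example rather than the actual Postgres replacement algorithm, and "writing to disk" is just recording the page number. All names are hypothetical.

```go
package main

import "fmt"

// buffer holds one page number and whether the page was modified.
type buffer struct {
	page  int
	dirty bool
}

// Cache keeps at most capacity buffers; buffers[0] is the least recently used.
type Cache struct {
	capacity int
	buffers  []buffer
	Written  []int // pages written to "disk" when evicted while dirty
}

func NewCache(capacity int) *Cache {
	return &Cache{capacity: capacity}
}

// Access touches a page and reports whether it was already cached.
// On a miss with a full cache, the least recently used buffer is evicted,
// and its page is written out first if it is dirty.
func (c *Cache) Access(page int, write bool) bool {
	for i, b := range c.buffers {
		if b.page == page {
			// Hit: move the buffer to the most recently used end.
			c.buffers = append(c.buffers[:i], c.buffers[i+1:]...)
			b.dirty = b.dirty || write
			c.buffers = append(c.buffers, b)
			return true
		}
	}
	if len(c.buffers) == c.capacity {
		victim := c.buffers[0]
		if victim.dirty {
			c.Written = append(c.Written, victim.page)
		}
		c.buffers = c.buffers[1:]
	}
	c.buffers = append(c.buffers, buffer{page: page, dirty: write})
	return false
}

func main() {
	cache := NewCache(2)
	cache.Access(1, true)              // miss, page 1 becomes dirty
	cache.Access(2, false)             // miss
	fmt.Println(cache.Access(1, false)) // true (hit)
	cache.Access(3, false)             // miss, evicts clean page 2
	cache.Access(4, false)             // miss, evicts dirty page 1
	fmt.Println(cache.Written)         // [1]
}
```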

Buffers in explain

You can see how many buffers a query used in its plan

shared hit=2 means two pages were found in the buffer cache
read=1 means one page was read from disk, which took 50 ms

# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM person;
...
Buffers: shared hit=2 read=1
I/O Timings: read=50
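
A common way to read these numbers is as a cache hit ratio. The sketch below computes it from the hit and read counters shown above. hitRatio is a hypothetical helper, not something Postgres reports directly in the plan.

```go
package main

import "fmt"

// hitRatio returns the share of pages served from the buffer cache.
// It returns 0 when no pages were accessed.
func hitRatio(hit, read int) float64 {
	total := hit + read
	if total == 0 {
		return 0
	}
	return float64(hit) / float64(total)
}

func main() {
	// Buffers: shared hit=2 read=1
	fmt.Printf("%.2f\n", hitRatio(2, 1)) // 0.67
}
```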

Conclusion

Tables, views, indexes, and sequences are relations

Table rows and index entries are tuples (table tuples, index tuples), i.e., elements of a relation

Heap-only tuple (HOT) is a table tuple that doesn't have pointers from index tuples

The oversized attribute storage technique (TOAST) makes sure a tuple fits into a page

Page is a logical block (8 KB) of a segment file where tuples are physically stored

Buffer cache is an in-memory array of buffers, each holding one page

References

PostgreSQL 10 Administration. Basic Course (in Russian), Egor Rogov, Pavel Luzanov
PostgreSQL 10 Administration. Tuning and Monitoring (in Russian), Egor Rogov, Pavel Luzanov
Fundamentals of Database Technologies (in Russian), Boris Novikov
Database Physical Storage chapter of the PostgreSQL docs

Thank you

Marsel Mavletkulov

@marselester