PostgreSQL lingo
Navigating confusing Postgres terminology (HOT, TOAST! 🍞)
17 September 2020
Marsel Mavletkulov
Marsel Mavletkulov
PostgreSQL is object-relational database management system
In fact, Bob is 18.
func person(name string, age int) bool { ... }
fmt.Println("Is Bob 18?", person("Bob", 18)) // true
fmt.Println("Is Bob 21?", person("Bob", 21)) // falseSince predicates' values are bool, you can make new predicates based on existing ones using logical operations (OR, AND, NOT)
A database query is a predicate expressed as a logical formula
Database needs to find attribute values that make the formula true
3Predicate can be written down as math set of factual attributes (values where formula returns true) which in turn can be represented as a table
An element in that set is called a tuple (or row in table representation)
In fact, Bob is 18, Alice is 21.
{
(name="Bob", age=18), // person("Bob", 18) == true
(name="Alice", age=21), // person("Alice", 21) == true
}
| name | age
| --- | ---
| Bob | 18
| Alice | 21An immutable record with typed attributes is called a tuple (or cortege)
(
name string = "Bob"
age int = 18
)The term originated as an abstraction of the sequence: single, couple, triple, quadruple, quintuple, sextuple, septuple, octuple, ..., n‑tuple
5Postgres query result is a relation
# SELECT * FROM person; name | age ------+----- Bob | 18 Alice | 21
pg_class is where you can find description of relations such as tables, views, indexes, sequences
# SELECT relname FROM pg_class; relname ------------- person pg_statistic pg_type ...
Where is a relation stored on disk?
# SELECT pg_relation_filepath('person') AS table, pg_relation_filepath('person_name') as index;
table | index
-----------------+------------------
base/16488/16493 | base/16488/16500
# \! ls /usr/local/pgsql/data/base/16488 | grep 16493
16493
16493_fsm
16493_vm16493 is a main fork that contains table tuples, 16500 contains index tuples
Each file (~1 GB) is logically divided by 8 KB blocks called pages
Page starts with a header which describes page content (e.g., table page, index page)
7Multiple processes (readers and writers) concurrently work with person table
# INSERT INTO person(name, age) VALUES ('Eve', 19); -- Current transaction id is 29410.
# UPDATE person SET age=80 WHERE name='Eve'; -- Current transaction id is 29411.
# UPDATE person SET age=20 WHERE name='Eve'; -- Current transaction id is 29412.Eve row has three versions each represented with two numbers (transaction IDs)
xmin | xmax | data ------+-------+-------- 29410 | 29411 | Eve,18 29411 | 29412 | Eve,80 29412 | 0 | Eve,20
"Heap" in this case means "table", its data (a bunch of table tuples)
ctid | t_ctid | hhu | hot | xmin | xmax | data ------+--------+-----+-----+-------+-------+-------- (0,1) | (0,2) | t | | 29410 | 29411 | Eve,18 (0,2) | (0,3) | t | t | 29411 | 29412 | Eve,80 (0,3) | (0,3) | | t | 29412 | 0 | Eve,20
Table updates must be reflected in its indices
HOT update mechanism prevents creation of wasteful index tuples and facilitates in-page cleaning (less work for vacuum)
It works when an updated field is not part of any index and table page has free space
fillfactor=50Postgres requires a tuple to fit into a single page, but 8 KB is not enough in practise
The oversized attribute storage technique allows to compress/store a large attribute value (e.g., text, numeric, json) in a special table called TOAST
This makes the original table smaller because it contains a reference to a TOAST table, but there is an overhead when a large field is frequently used
# SELECT relname, relfilenode FROM pg_class WHERE oid = (
-- Find OID (16497) of TOAST table by table name "person" for which TOAST table was created.
SELECT reltoastrelid FROM pg_class WHERE relname = 'person'
);
relname | relfilenode
---------------+-------------
pg_toast_16493 | 16497
TOAST is triggered when a tuple is wider than 1/4 of a page, TOAST_TUPLE_THRESHOLD (2 KB)
It compresses and/or moves field values until the tuple is shorter than TOAST_TUPLE_TARGET (2 KB)
Buffer cache is an array of buffers located in shared memory of Postgres processes
A buffer stores a page and where it was read from, its state
When buffer cache becomes full
Write ahead log (WAL) is a stream of executed actions (segment files of 16 MB)
checkpointer process periodically writes all dirty pages on disk (this helps to keep WAL size fairly small), bgwriter writes dirty pages that will likely be evicted soon, backend also writes dirty pages on disk if there are not enough buffers for a query
13You can see how many buffers were used in a query plan
# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM person; ... Buffers: shared hit=2 read=1 I/O Timings: read=50
Tables, views, indexes, sequences are relations
Table rows, indexed rows are tuples (table tuples, index tuples); elements of a relation
Heap-only tuple (HOT) is a table tuple that doesn't have pointers from index tuples
The oversized attribute storage technique (TOAST) makes sure a tuple fits into a page
Page is a logical block (8 KB) of a segment file where tuples are physically stored
Buffer cache is in-memory representation of array of pages (one buffer per page)
15Marsel Mavletkulov