Skip to content

*: lower default GC TTL to 4h#93836

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:221215.lower-gc-ttl
Jan 6, 2023
Merged

*: lower default GC TTL to 4h#93836
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:221215.lower-gc-ttl

Conversation

@irfansharif
Copy link
Copy Markdown
Contributor

@irfansharif irfansharif commented Dec 16, 2022

Fixes #89233.

Release note (general change): The GC TTL previously defaulted to 25h. This value was configurable using ALTER RANGE DEFAULT CONFIGURE ZONE USING gc.ttlseconds = <whatever>, but also possible to scope to specific schema objects using ALTER {DATABASE,TABLE,INDEX} CONFIGURE ZONE USING .... This value determine how long overwritten values were retained. The RANGE DEFAULT value is now lowered to 4h but only for freshly created clusters. When existing clusters upgrade onto this release, we will respect whatever value they were using before the upgrade for all their schema objects. This will be 25h if the GC TTL was never altered, or some specific value if set explicitly. Full cluster backups taken on earlier version clusters, when restored to clusters that started off at v23.1, will use the GC TTL recorded in the backup image.

We've found the 25h value to translate to higher-than-necessary storage costs, especially for workloads where rows are deleted frequently. It can also make for costlier reads with respect to CPU since we currently have to scan over overwritten values to get to the one of interest. Finally, we've also observed cluster instability due to large unsplittable ranges that have accumulated an excessive amount of MVCC garbage. We chose a default of 25h originally to accommodate daily incremental backups with revision history. But with the introduction of scheduled backups introduced in 22.2, we no longer need a large GC TTL. Scheduled backups "chain together" and prevent garbage collection of relevant data to ensure coverage of revision history across backups, decoupling it from whatever value is used for GC TTL. So we no longer need a 25h default, hence this change.

The GC TTL determines how far back AS OF SYSTEM TIME queries can go, which now if going past now()-4h, will start failing informatively. To support larger windows for AS OF SYSTEM TIME queries, users are encouraged to pick a more appropriate GC TTL and set it using ALTER ... CONFIGURE ZONE using gc.ttlseconds = <whatever>. The earlier considerations around storage use, read costs, and stability still apply.

Release note (backward-incompatible change): See release note above. Technically this is not a backwards-incompatible change since we're only changing the default value that new clusters are initialized with -- existing clusters will remain unaffected. But it might be worth highlighting this change more prominently to our users for added scrutiny.

@irfansharif irfansharif requested review from ajwerner and dt December 16, 2022 23:39
@irfansharif irfansharif requested a review from a team as a code owner December 16, 2022 23:39
@irfansharif irfansharif requested a review from a team December 16, 2022 23:39
@irfansharif irfansharif requested a review from a team as a code owner December 16, 2022 23:39
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 17, 2022

This change in defaults determines how far back AS OF SYSTEM TIME queries can go, which if going past now()-4h, will start failing informatively.

Maybe add another release note "backward-incompatible change", so it gets more scrutiny by users.

@irfansharif irfansharif requested a review from a team as a code owner December 19, 2022 15:29
Copy link
Copy Markdown
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all the testlogic files but I'm not sure why there's so much indentation diff, I've used just vanilla dev test{logic} ... --rewrite to power through. It's making for a larger diff than is real -- all that's changing is s/90000/14400 wherever we've rendered the zone config.

Maybe add another release note "backward-incompatible change", so it gets more scrutiny by users.

Done (left a note for the docs team to highlight this more prominently, even though it's not technically backwards compatible). PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @dt)

@irfansharif
Copy link
Copy Markdown
Contributor Author

It's making for a larger diff than is real -- all that's changing is s/90000/14400 wherever we've rendered the zone config.

Using git show HEAD --ignore-all-space helps highlight exactly what's going on.

@irfansharif
Copy link
Copy Markdown
Contributor Author

(Bump.)

@postamar
Copy link
Copy Markdown

postamar commented Jan 3, 2023

What's the story for restoring backups here? I only skimmed through the changes but I don't see any injection of zone configs upon restoring pre-23.1 backups, so it's safe to assume that cluster- and partial restores will be subject to this new 4h default as well. Are we cool with that? To me it feels strange.

@irfansharif irfansharif requested a review from adityamaru January 4, 2023 16:00
@irfansharif
Copy link
Copy Markdown
Contributor Author

Good point, I didn't do any such thing. @dt or @adityamaru, can you maybe show me where I could inject such a hook? We backup+restore system.zones:

systemschema.ZonesTable.GetName(): {
shouldIncludeInClusterBackup: optInToClusterBackup, // ID in "id".

So some point after the system table is restored, and maybe before MVCC GC has a chance to kick in (or before user tables are restored), I go explicitly set the existing GC TTL to whatever it was before? Looking at the code, we do have descriptors that we "pre-"restore before restoring user tables, so somewhere here?

dataToPreRestore = &restorationDataBase{
spans: preRestoreSpans,

@adityamaru
Copy link
Copy Markdown
Contributor

adityamaru commented Jan 5, 2023

cluster- and partial restores will be subject to this new 4h default as well.

Cluster restore deletes the zone configs of the restoring cluster and applies the zone configs of the backed-up cluster. It does this prior to ingesting user data to allow the allocator to distribute the replicas. Non-cluster restores don't deal in zone configs at all, and the ingested data is subject to whatever configuration applies to the table being restored.

The status quo seems okay to me? A cluster restore is the only form of restore that is in the business of altering system tables, so it is what we encourage users to use if they want an exact replica of their backed up cluster.

@postamar
Copy link
Copy Markdown

postamar commented Jan 5, 2023

Thanks! All good then.

@irfansharif irfansharif requested a review from a team as a code owner January 5, 2023 17:36
@irfansharif irfansharif requested review from herkolategan and renatolabs and removed request for a team January 5, 2023 17:36
Copy link
Copy Markdown
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 Somehow forgot that v22.2 and earlier clusters do actually have an explicit entry in system.zones with the default GC TTL. So didn't need any explicit upgrade/migration. Same reason that we didn't need anything for restores. Updated PR + rebased, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, @dt, @herkolategan, and @renatolabs)

@irfansharif irfansharif force-pushed the 221215.lower-gc-ttl branch 2 times, most recently from d4a1a44 to 1dba482 Compare January 5, 2023 18:38
Copy link
Copy Markdown
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, it makes me very happy to see us dropping this value.

Fixes cockroachdb#89233.

Release note (general change): The GC TTL previously defaulted to 25h.
This value was configurable using `ALTER RANGE DEFAULT CONFIGURE ZONE
USING gc.ttlseconds = <whatever>`, but also possible to scope to
specific schema objects using `ALTER {DATABASE,TABLE,INDEX} CONFIGURE
ZONE USING ...`. This value determine how long overwritten values were
retained. The `RANGE DEFAULT` value is now lowered to 4h but only for
freshly created clusters. When existing clusters upgrade onto this
release, we will respect whatever value they were using before the
upgrade for all their schema objects. This will be 25h if the GC TTL was
never altered, or some specific value if set explicitly. Full cluster
backups taken on earlier version clusters, when restored to clusters
that started off at v23.1, will use the GC TTL recorded in the backup
image.

We've found the 25h value to translate to higher-than-necessary storage
costs, especially for workloads where rows are deleted frequently. It
can also make for costlier reads with respect to CPU since we currently
have to scan over overwritten values to get to the one of interest.
Finally, we've also observed cluster instability due to large
unsplittable ranges that have accumulated an excessive amount of MVCC
garbage. We chose a default of 25h originally to accommodate daily
incremental backups with revision history. But with the introduction of
scheduled backups introduced in 22.2, we no longer need a large GC TTL.
Scheduled backups "chain together" and prevent garbage collection of
relevant data to ensure coverage of revision history across backups,
decoupling it from whatever value is used for GC TTL. So we no longer
need a 25h default, hence this change.

The GC TTL determines how far back AS OF SYSTEM TIME queries can go,
which now if going past `now()-4h`, will start failing informatively. To
support larger windows for AS OF SYSTEM TIME queries, users are
encouraged to pick a more appropriate GC TTL and set it using `ALTER ...
CONFIGURE ZONE using gc.ttlseconds = <whatever>`. The earlier
considerations around storage use, read costs, and stability still
apply.

Release note (backward-incompatible change): See release note above.
Technically this is not a backwards-incompatible change since we're only
changing the default value that new clusters are initialized with --
existing clusters will remain unaffected. But it might be worth
highlighting this change more prominently to our users for added
scrutiny.
Copy link
Copy Markdown
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 6, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

*: lower default GC TTL

5 participants