proposal: consider adding a chain storage interface to ctfe

This is an extended version of google/trillian#1968 with more concrete numbers (and probably in the right place this time).

Currently ctfe sends both the leaf certificate and the rest of the chain (intermediates and root) to trillian for storage. While the leaf is necessary for merkle tree operations (as part of the tree leaf), the rest of the chain is not, and is stored as an opaque blob in ExtraData. Unlike the leaf, which is unique, the chain is mostly made up of duplicated data, since the number of possible chains is typically (relatively) small, especially when a CA is using a specific log for a hierarchy. Assuming most chains have three entries (a leaf, an intermediate, and a root) and that certificates average around 1.5kb, a LeafData row ends up being about 5kb, with the chain in ExtraData taking up 3kb of that (ignoring indexes and storage engine overhead).

Pulling this data out of the trillian storage backend would make it easier to deduplicate, which would provide a significant storage win and likely a bit of a performance win, because the database will be able to keep more rows in memory/hot cache if they are smaller. I think the main benefit here would be in sequencing operations and get-entries type operations.

Some concrete numbers for the oak 2020 Let's Encrypt log: the database for the log is around 3TB for ~420 million entries. Looking through all the entries in the log, the chains, excluding leaves, make up around 1TB of data (again, this ignores the database overhead, which would add some % to this number), of which only 1.5MB are unique certificates. Deduplicating this would be a ~33% storage win (which is not an insignificant cost).
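As a rough check on the ~33% figure, here is a small Go sketch of the arithmetic. The function name and the `refBytes` parameter are hypothetical. The second calculation assumes each row keeps two 32-byte identifiers in place of the chain, which is an illustrative assumption rather than a measured number.

```go
package main

import "fmt"

// savingsFraction estimates the fraction of total storage freed when
// chainBytes of duplicated chain data is replaced by uniqueBytes of
// deduplicated certificates plus refBytes of per-row references.
// All inputs are in bytes; database overhead is ignored.
func savingsFraction(totalBytes, chainBytes, uniqueBytes, refBytes float64) float64 {
	return (chainBytes - uniqueBytes - refBytes) / totalBytes
}

func main() {
	const (
		tb      = 1e12
		mb      = 1e6
		entries = 420e6
	)

	// Figures quoted for the log: ~3TB total, ~1TB of chains, 1.5MB unique.
	noRefs := savingsFraction(3*tb, 1*tb, 1.5*mb, 0)
	fmt.Printf("no per-row references: %.1f%%\n", noRefs*100)

	// Assumption: 64 bytes of identifiers (two SHA-256 hashes) per entry.
	withRefs := savingsFraction(3*tb, 1*tb, 1.5*mb, entries*64)
	fmt.Printf("64-byte references:    %.1f%%\n", withRefs*100)
}
```

Under these assumptions, storing identifiers in each row only slightly reduces the saving, because the unique certificates themselves are negligible next to the duplicated chain data.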

I think there are two obvious approaches to this: use ExtraData to store some kind of identifier, or set of identifiers, that links to entries in a chain storage backend, or don't store anything in ExtraData and use the leaf identifier hash as a key in the chain storage backend to link it to the chain. Either way, when the ctfe stores a chain it'd decompose the chain into leaf + rest, send the rest to the chain storage backend, and then send the leaf to trillian. In the other direction, once a leaf or set of leaves is returned, the ctfe would reconstruct the chain by querying the storage backend. The one downside here is that it would increase the number of database queries, i.e. for a get-entries call we currently only need N calls to the trillian database, but with two storage backends you would need N+1 in the best case, or N^2 in the worst, depending on query/schema design.
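To make the first approach more concrete, here is a minimal Go sketch. It is not a proposed API. `ChainStorage`, `CertID`, `memStore`, `encodeIDs` and `decodeIDs` are all hypothetical names, and keying certificates by their SHA-256 hash is an assumption made for illustration. The in-memory map stands in for whatever backend would actually be used.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// CertID identifies a certificate by the SHA-256 hash of its DER bytes
// (an assumed scheme, chosen for illustration only).
type CertID [sha256.Size]byte

// ChainStorage is a hypothetical interface; it does not exist in ctfe.
type ChainStorage interface {
	// Put stores the non-leaf certificates and returns one ID per certificate.
	Put(chain [][]byte) ([]CertID, error)
	// Get returns the certificates for the given IDs, in the same order.
	Get(ids []CertID) ([][]byte, error)
}

// memStore is an in-memory ChainStorage that keeps each certificate once.
type memStore struct {
	certs map[CertID][]byte
}

var _ ChainStorage = (*memStore)(nil)

func newMemStore() *memStore {
	return &memStore{certs: make(map[CertID][]byte)}
}

func (s *memStore) Put(chain [][]byte) ([]CertID, error) {
	ids := make([]CertID, 0, len(chain))
	for _, der := range chain {
		id := CertID(sha256.Sum256(der))
		if _, ok := s.certs[id]; !ok {
			// First time this certificate is seen: store a private copy.
			s.certs[id] = append([]byte(nil), der...)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func (s *memStore) Get(ids []CertID) ([][]byte, error) {
	chain := make([][]byte, 0, len(ids))
	for _, id := range ids {
		der, ok := s.certs[id]
		if !ok {
			return nil, errors.New("chain storage: unknown certificate id")
		}
		chain = append(chain, der)
	}
	return chain, nil
}

// encodeIDs concatenates the IDs into the blob that would go in ExtraData.
func encodeIDs(ids []CertID) []byte {
	out := make([]byte, 0, len(ids)*sha256.Size)
	for _, id := range ids {
		out = append(out, id[:]...)
	}
	return out
}

// decodeIDs splits an ExtraData blob back into fixed-size IDs.
func decodeIDs(b []byte) ([]CertID, error) {
	if len(b)%sha256.Size != 0 {
		return nil, errors.New("chain storage: malformed id list")
	}
	ids := make([]CertID, 0, len(b)/sha256.Size)
	for i := 0; i < len(b); i += sha256.Size {
		var id CertID
		copy(id[:], b[i:i+sha256.Size])
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	store := newMemStore()
	// Placeholder bytes standing in for DER-encoded certificates.
	intermediate, root := []byte("intermediate DER"), []byte("root DER")

	// Two submissions that share the same issuer chain.
	for i := 0; i < 2; i++ {
		ids, err := store.Put([][]byte{intermediate, root})
		if err != nil {
			panic(err)
		}
		extra := encodeIDs(ids)
		fmt.Printf("ExtraData bytes: %d, certificates stored: %d\n", len(extra), len(store.certs))
	}
}
```

Because `Get` takes a batch of IDs, a get-entries handler could in principle collect the IDs from all returned leaves and resolve them in a single extra lookup, which is the N+1 case rather than the N^2 one. Whether that holds in practice depends on the backend and schema.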

I haven't specified a formal API in this proposal because I want to gauge interest in the overall concept. If there is interest I will try to spend some time coming up with a more concrete proposal.

