to pack or not to pack ... #8572

@ThomasWaldmann

Description

borg 1.x segment files

borg 1.x used:

  • "segment files", elsewhere also known as "pack files", to store multiple repository objects in one file.
  • a "repository index" to be able to find these objects, using a mapping object_id --> (segment_name, offset_in_segment).
  • transactions and rollback via log-like appending of operations (PUT, DEL, COMMIT) to these segment files.
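The log-structured design above can be sketched roughly as follows. This is a minimal illustration, not borg 1.x's actual on-disk format (the real segments use headers, CRCs, and size limits); it only shows how replaying PUT/DEL/COMMIT records rebuilds the repository index and how uncommitted writes get rolled back:

```python
import os
import struct
import tempfile

# Hypothetical record layout: op (1 byte), key length, value length, key, value.
PUT, DEL, COMMIT = 0, 1, 2

def append(segment, op, key=b"", value=b""):
    segment.write(struct.pack("<BII", op, len(key), len(value)) + key + value)

def build_index(path):
    """Replay the log; only state up to the last COMMIT is considered valid."""
    index, committed = {}, {}
    with open(path, "rb") as f:
        while header := f.read(9):
            op, klen, vlen = struct.unpack("<BII", header)
            key = f.read(klen)
            offset = f.tell()          # where the value starts in the segment
            f.read(vlen)
            if op == PUT:
                index[key] = offset    # object_id -> offset_in_segment
            elif op == DEL:
                index.pop(key, None)
            elif op == COMMIT:
                committed = dict(index)  # state becomes durable at COMMIT
    return committed

path = os.path.join(tempfile.mkdtemp(), "segment")
with open(path, "wb") as seg:
    append(seg, PUT, b"id1", b"chunk-data-1")
    append(seg, PUT, b"id2", b"chunk-data-2")
    append(seg, DEL, b"id1")
    append(seg, COMMIT)
    append(seg, PUT, b"id3", b"lost")  # after last COMMIT -> rolled back

idx = build_index(path)
print(sorted(idx))  # only id2 survives: id1 was deleted, id3 uncommitted
```

Note how DEL entries leave dead space behind in the segment, which is why borg 1.x needed segment compaction.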

borg2 status quo: objects stored separately

borg2 is much simpler:

  • implemented using borgstore (a k/v store with misc. backends)
  • objects are stored separately: 1 file chunk --> 1 repo object
  • objects can be found directly by their id (e.g. the id is mapped to the fs path / file name)
  • no transactions and no log-like appending, but a correct write order
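The direct id-to-path mapping can be sketched like this (the `data/` prefix and the two nesting levels are assumptions for illustration, not borgstore's exact scheme). Nesting by hex prefix keeps individual directories from growing too large:

```python
# Sketch of mapping an object id directly to a filesystem path,
# as in the borg2/borgstore approach (nesting scheme is hypothetical).
def object_path(object_id: bytes, nesting: int = 2) -> str:
    hexid = object_id.hex()
    # nest into subdirectories by hex prefix to keep directories small
    parts = [hexid[i * 2:i * 2 + 2] for i in range(nesting)]
    return "/".join(["data", *parts, hexid])

print(object_path(bytes.fromhex("deadbeef00112233")))
# -> data/de/ad/deadbeef00112233
```

No separate index is needed: given an id, the path is computed, and the backend filesystem or object store does the lookup.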

Pros:

  • simplicity
  • no need for some sort of "index" (which could be corrupted or out of date)
  • no segment file compaction needed, the server-side filesystem manages space allocation

Cons:

  • leads to large numbers of relatively small objects being transferred and stored individually in the repository
  • latency and other per-request overheads have quite a speed impact for remote repositories
  • depending on the storage type / filesystem, there will be more or less storage space overhead due to block size, esp. for many very small objects
  • dealing with lots of objects / making lots of API calls can be expensive with some cloud storage providers

borg2 alternative idea

  • the client assembles packs locally and transfers a pack to the store when it has reached the desired size or when there is no more data to write.
  • pack files have a per-pack index appended (pointing to the objects contained in the pack), so the per-pack index can be read without reading the full pack.
  • the per-pack index would also contain the RepoObj metadata (e.g. compression type/level).
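A minimal sketch of such a pack format, assuming a hypothetical layout of `[obj1][obj2]...[index][index length]`: a fixed-size trailer at the end gives the index length, so a reader can locate and parse the per-pack index (including the RepoObj-style metadata) by reading only the tail of the pack:

```python
import io
import json
import struct

def write_pack(objects):
    """objects: {id_hex: (data, meta)}. Returns the serialized pack bytes."""
    buf, index = io.BytesIO(), {}
    for obj_id, (data, meta) in objects.items():
        index[obj_id] = {"offset": buf.tell(), "size": len(data), "meta": meta}
        buf.write(data)
    index_bytes = json.dumps(index).encode()
    buf.write(index_bytes)
    buf.write(struct.pack("<Q", len(index_bytes)))  # fixed-size trailer
    return buf.getvalue()

def read_index(pack: bytes):
    # read only the tail: 8-byte trailer, then the index it points to
    (index_len,) = struct.unpack("<Q", pack[-8:])
    return json.loads(pack[-8 - index_len:-8])

def read_object(pack: bytes, index, obj_id):
    entry = index[obj_id]
    return pack[entry["offset"]:entry["offset"] + entry["size"]]

pack = write_pack({
    "aa01": (b"chunk-1", {"ctype": "zstd", "clevel": 3}),
    "bb02": (b"chunk-two", {"ctype": "lz4", "clevel": 1}),
})
idx = read_index(pack)
print(read_object(pack, idx, "bb02"), idx["bb02"]["meta"])
```

Against a remote store, `read_index` would map to one ranged read of the pack's tail, and `read_object` to one ranged read per object.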

Pros:

  • far fewer objects in the store, fewer API calls, less latency impact

Cons:

  • more complex in general
  • will need an additional global index mapping object_id --> (pack_id, offset_in_pack)
  • will need more memory for that global index
  • space is managed client-side, causing more (network) I/O: compact will need to read a pack, drop unused entries, write it back to the store, and update the indexes
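The client-side compaction cost described above can be sketched as follows (the function name and in-memory pack representation are assumptions for illustration): the whole pack is read, unused entries are dropped, survivors are rewritten at new offsets, and the global index entries must be updated accordingly:

```python
# Sketch of client-side pack compaction under the proposed scheme.
def compact_pack(pack, used_ids, new_pack_id):
    """pack: list of (object_id, data); used_ids: set of still-referenced ids.

    Returns the rewritten pack and the updated global index entries,
    mapping object_id -> (pack_id, offset_in_pack).
    """
    new_pack, global_index, offset = [], {}, 0
    for obj_id, data in pack:
        if obj_id not in used_ids:
            continue  # drop unused entry, reclaiming its space
        new_pack.append((obj_id, data))
        global_index[obj_id] = (new_pack_id, offset)  # survivor gets new offset
        offset += len(data)
    return new_pack, global_index

old = [("id1", b"aaaa"), ("id2", b"bbbbbb"), ("id3", b"cc")]
new_pack, gidx = compact_pack(old, used_ids={"id1", "id3"}, new_pack_id=7)
print(gidx)  # id2 dropped; id3 now sits right after id1
```

Note that every surviving object moves, so compaction is not just a local rewrite: the whole pack travels over the network twice (read + write back), and the global index must be updated for each survivor.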

Side note: the desired pack "size" could be given by the number of objects in the pack (N) or by the overall size of all objects in the pack (S). The special case N == 1 would be a slightly different implementation (using a different file format) of what we currently have in borg2; it would not necessarily need the global index, and compact would still be very easy.

Related: #191
