Parallel Container Layer Unpacking #8881

Closed
ike-ma opened this issue Jul 26, 2023 · 5 comments


ike-ma commented Jul 26, 2023

What is the problem you're trying to solve

containerd currently fetches image layers in parallel, but unpacks them sequentially, one layer at a time, in a single thread.

Describe the solution you'd like

Proposal:

We propose the configuration option below in containerd (/etc/containerd/config.toml) to support unpacking images in parallel.

[plugins."io.containerd.grpc.v1.cri".containerd.overlayfs]
  unpacking_mode = "parallel"

Key Changes: ContainerD Handlers + Content

This option reuses the existing FetchHandler and content store to pre-decompress each layer during the fetch phase. The actual decompression runs immediately after each layer's fetch completes. Unpack and snapshot handling then become lightweight operations: simply renaming the pre-decompressed folder to the desired overlay path.

  • Fetch

    • When the containerd config (/etc/containerd/config.toml) sets unpacking_mode = "parallel", replace FetchHandler in pull.go with the FetchUnzipHandler described below.
    • Create a new FetchUnzipHandler, similar to FetchHandler, that enables an unzip option in Fetch.
    • Add an unzip option to the Fetch function. Update the content store to support pre-decompressed buckets as part of its content.Store APIs for content lifecycle management: a fetched gzip blob and its corresponding pre-decompressed folder share an identical lifecycle (for example, creation and deletion).
  • Unpack and Snapshot processing

    • Update the apply function to support an option to apply from the pre-decompressed folder.
    • Update the s.store.ReaderAt implementation: the content store knows whether a pre-decompressed bucket exists for a layer; if so, it returns a list of file paths as bytes. Alternatively, a new method s.store.FolderAt could be added to Provider.

Additional context

Who should enable this feature?

  • Those who use disks with high parallel-IO support. For example, PD or LocalSSD is designed with a deeper IO queue to achieve better throughput through parallel IO operations.
  • Those who use large containers and are sensitive to slow pod cold starts. For example, containers with GPU libraries and frameworks (> 4 GB); effectively all GPU workloads fall into this category. In contrast, top containers that do not use GPUs have historically been significantly smaller (< 500 MB).

Who should NOT enable this feature?

Those who use HDDs, which have high seek times for random reads and writes.

Potential Benefit

If the user's disk performance allows, this proposal can reduce image pull latency significantly. For example, Tao was able to achieve a 3X faster image pull (120 seconds -> 40 seconds) for a popular container, gcr.io/deeplearning-platform-release/base-cu113:m106 (5.4 GB), on a common deep learning node setup (2500 GB PD-SSD, 32 vCPUs).


ike-ma commented Jul 26, 2023

/cc @elfinhe
/cc @bobbypage
/cc @qiutongs
/cc @samuelkarp


ike-ma commented Jul 26, 2023

/assign @ike-ma

@samuelkarp (Member) commented:

I'm not quite back from leave, but some context here:

The sequential layer unpack is a consequence of snapshot creation requiring a committed parent, and an individual snapshot not being committed until all of its content has been written. For the overlay snapshotter, the backing filesystem has no concept of a "committed" lowerdir, and containerd unpacks without an active overlay mount anyway (opting to write whiteout markers explicitly). Because the overlay filesystem does not depend on a committed parent-child relationship, we can implement a further optimization: writing out the actual snapshots in parallel. For some storage devices (such as the PD-SSD device tested above), concurrent IO outperforms sequential IO, so this approach leads to faster overall image pull times.
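The explicit whiteout handling mentioned above follows the OCI image layer convention: tar entries prefixed with ".wh." mark deletions, which overlayfs represents as 0:0 character devices, and the special ".wh..wh..opq" entry marks a directory as opaque. The classification sketch below is illustrative and is not containerd's implementation (creating the actual char devices requires privileges and is omitted):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

const (
	whiteoutPrefix = ".wh."         // OCI tar whiteout marker prefix
	opaqueMarker   = ".wh..wh..opq" // marks the containing directory as opaque
)

// classifyEntry reports how an OCI tar entry should be materialized for
// overlayfs: a regular path, a whiteout (char 0:0 device) for the named
// file, or an opaque marker on its parent directory.
func classifyEntry(name string) (kind, target string) {
	dir, base := filepath.Split(name)
	switch {
	case base == opaqueMarker:
		return "opaque", filepath.Clean(dir)
	case strings.HasPrefix(base, whiteoutPrefix):
		return "whiteout", filepath.Join(dir, strings.TrimPrefix(base, whiteoutPrefix))
	default:
		return "file", name
	}
}

func main() {
	for _, n := range []string{"etc/passwd", "etc/.wh.shadow", "var/cache/.wh..wh..opq"} {
		k, t := classifyEntry(n)
		fmt.Printf("%s -> %s %s\n", n, k, t)
	}
}
```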

@dmcgowan (Member) commented:

We have also discussed in the past adding a "rebase" function to the snapshotter. Such a function would be very lightweight in the overlay snapshotter since it would just update the parent field. The rebase could possibly be performed on commit, so that unpacks could occur without parents and the snapshots could then be committed in order with the appropriate parent.

Ideally the content store would not gain new functionality for dealing with processed content. Snapshotters may have more room for optimization functionality specific to a single snapshotter, as we already have that today with unpack.

@cookieisaac commented:

Thanks @dmcgowan for the comment. I have prepared a draft PR: #9138

I wonder if you have any high-level comments on this first version, where the uncompressed layers are consumed directly during Apply, with some basic config wiring.

dosubot added the Stale label on Aug 4, 2024
dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 22, 2024
dosubot removed the Stale label on Dec 22, 2024