Skip to content

Conversation

@mauriciovasquezbernal
Copy link
Contributor

Linux 5.10 adds support for a new volatile option for overlay mounts that increases performance by avoiding sync calls to the filesystem where the upper directory resides. The goal of this PR is to discuss possible implementations to support this in containerd and also present some challenges regarding the remount when that option is used.

I did some experiments using this PR and Kubernetes and the time required to install emacs on an image with the cached packages drop from ~55s to ~30s on my machine.

Test details

The image I used was created with the following Docker file

FROM ubuntu
RUN apt-get update && apt-get -y install -d emacs
ENTRYPOINT bash -c "time DEBIAN_FRONTEND=noninteractive apt-get -y install emacs

A run with kubectl run --restart=Never --image=mauriciovasquezbernal/testimage:latest m$RANDOM in Kubernetes.

The volatile option has the particularity that it's not possible to perform a mount reusing the upper/work directories as they could be out of sync. Sargun Dhillon has sent some patches to relax this requirement in some cases where it's safe to perform such mount, however they probably won't be part of 5.10.

Containerd performs some mounts of the overlayfs with WithTempMount before starting the container to inspect the image, then a mount for the rootfs of the container is done. This makes the support for the volatile option a bit difficult to implement as this option can only be used in the last mount (the root fs of the container).

This PR proposes a way to implement it. The volatile option is not set by the snapshotter but it's set in NewTask, in this way mounts performed with WithTempMount don't have this option and only the final mount has it. This solution is far from ideal because it breaks the modularity polluting other components with overlay specific options, but it's the less invasive option we've found so far.

We've considered the following options to far:

  • Making the overlayfs snapshotter return this option and introducing a check in WithTempMount to remove it.
    • Pros
      • Easy to implement, few lines of code in WithTempMount
    • Cons
      • Pollutes WithTempMount with overlayfs specific logic
      • Any other place doing an overlay remount should implement a similar logic to avoid the remounting problem
  • Extend the snapshotter interface with a TempMounts method that return the mounts to be used in the temp mounts case. It's be the same as Mount in all implementations but in overlayfs it'll return the mounts without the volatile option
    • Pros
      • Clean solution as the volatile logic keeps contained in the snapshotter
    • Cons
      • Changes in too many places
      • Could be an overkill if Sargun's patches land in the kernel

I'd like to get more opinions and ideas on this.

cc @rata @alban

@k8s-ci-robot
Copy link

Hi @mauriciovasquezbernal. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Dec 1, 2020

Build succeeded.

@fuweid
Copy link
Member

fuweid commented Dec 2, 2020

Nice! Thanks! The umount overlayfs brings so many syncfs, like write amplification. It can reduce disk syncfs when read the user/group ID from image. 👍

@AkihiroSuda
Copy link
Member

Is this still draft?

@rata
Copy link
Contributor

rata commented Dec 16, 2020

@AkihiroSuda the main reason for this to be draft still is because we wanted to know your (or other maintainers) opinion regarding the questions @mauriciovasquezbernal asked in the PR description.

Other than that, in the limited testing we did this PR works fine and we see good performance improvements (~20% as mentioned in the commit adding it torvalds/linux@c86243b)

If you think this direction is fine (only for the rootfs of containers created via the CRI plugin), we can make it ready quite soon. If you think other direction is best, please let us know and we will adapt :-)

Thanks!

@rata
Copy link
Contributor

rata commented Dec 17, 2020

@AkihiroSuda does the current approach in the PR sound fine for you? Or do you suggest some other way?

@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/volatile_cri_global_option branch from d37c3e9 to 2b8cefe Compare January 6, 2021 21:26
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 6, 2021

Build succeeded.

@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/volatile_cri_global_option branch from 2b8cefe to 141e7df Compare January 6, 2021 22:48
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 6, 2021

Build succeeded.

@mauriciovasquezbernal mauriciovasquezbernal changed the title RFC: Add support for overlay volatile option Add support for overlay volatile option Jan 6, 2021
@mauriciovasquezbernal mauriciovasquezbernal marked this pull request as ready for review January 6, 2021 23:15
@mauriciovasquezbernal
Copy link
Contributor Author

I discussed internally with @rata and came up with an new implementation that should be cleaner. Now the volatile option is configured at the snapshotter level and it affects all the containers created by containerd, not only the ones created by the CRI plugin.

We know a configuration parameter adds complexity, but we don't think we can enable that by default right now (until the remount problems are solved in the kernel). We think this is an useful option and some users would like to enable it.

We'd appreciate any feedback. Thanks!

@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/volatile_cri_global_option branch from 141e7df to 7a52544 Compare January 7, 2021 20:08
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 7, 2021

Build succeeded.

@kzys
Copy link
Member

kzys commented Mar 17, 2021

@mauriciovasquezbernal Can you rebase the branch?

@fuweid PTAL

/ok-to-test

Linux 5.10 adds support for a new volatile option for overlay mounts
that increases performance by avoiding sync calls to the filesystem
where the upper directory resides. This commit adds an option to
the overlayfs snapshotter to enable this feature.

Mounts using the volatile option can't be remounted, for this reason
there is some logic in the WithTempMount function to avoid using that
option there.

Signed-off-by: Mauricio Vásquez <[email protected]>
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/volatile_cri_global_option branch from 7a52544 to 63a24ba Compare March 18, 2021 12:21
@mauriciovasquezbernal
Copy link
Contributor Author

@kzys I rebased it!

@kzys
Copy link
Member

kzys commented Mar 20, 2021

@mauriciovasquezbernal Thanks!

@estesp or @fuweid can you take a look?

Copy link
Member

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AkihiroSuda AkihiroSuda requested a review from dmcgowan March 24, 2021 05:52
// option here.
// TODO: Make this logic conditional once the kernel supports reusing
// overlayfs volatile mounts.
for _, m := range mounts {
Copy link
Member

@fuweid fuweid Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we can remove this part. The caller should validate the volatile option at the beginning.

In overlayFS snapshotter, we can check volatile in NewSnapshotter like https://github.com/containerd/containerd/blob/master/snapshots/overlay/overlayutils/check.go#L110 if set Volatile=true in toml.

If kernel doesn't support that option, just return error.

Copy link
Contributor

@rata rata Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid Hi, I'm a co-worker of Mauricio.

Not sure I follow what you mean. If the kernel doesn't support this option, it will error out when trying to mount with volatile already. Do you suggest to move this code to a function and we just call it from here? Or what is the suggestion? Not sure I follow.

The code here makes sure to not use volatile for temp mounts, because the overlayfs dirs (workdir at least) is reused for the following mounts. But re-using them is not supported by the kernel when you use volatile, so we need to remove the option if not the following mounts that re-use the work-dir will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rata sorry for late reply.

=> If the kernel doesn't support this option, it will error out when trying to mount with volatile already.

I was thinking we can add pre-check option during initializing overlayFS snapshotter, like

//  From https://github.com/containerd/containerd/blob/master/snapshots/overlay/overlay.go
//
// NewSnapshotter returns a Snapshotter which uses overlayfs. The overlayfs
// diffs are stored under the provided root. A metadata file is stored under
// the root.
func NewSnapshotter(root string, opts ...Opt) (snapshots.Snapshotter, error) {

     ...

     // figure out whether "index=off" option is recognized by the kernel
     var indexOff bool
     if _, err = os.Stat("/sys/module/overlay/parameters/index"); err == nil {
	indexOff = true
     }

     // check volatile option
     if config.volatile {
         if err := overlayutils.SupportVolatile(root); err != nil {
               return nil, err
         }
     }
    
     ...
}

The pre-check can help us to know the option is supported before running container.

=> But re-using them is not supported by the kernel when you use volatile

Yeah, you are correct. This code in WithTempMount should be kept. In addition, the other case is that container engine uses WithTempMount to get the user id mapping. I think use volatile is good for this case. But I don't have an idea how to use it correctly.


The change looks good to me. I am fine with handling overlayutils.SupportVolatile in the follow-up pr.

LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid heh, no late response, it was quite fast :)

Ohh, I see. Yes, if we can merge this (one LGTM is missing IIUC) and I'll do a follow-up PR adding that it would be great.

Thanks again!

Copy link
Contributor

@rata rata Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I think we might need to probe a dummy mount with the volatile option, not possible to do as we do with indexOff because that is a module parameter. Those are exposed on sysfs, not sure about mount options (think they are not). I smell this logic might end up in snapshots/overlay/overlayutils with similar functions.

Will check and open a new PR in the next days with that :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rata I would like to request @dmcgowan review on this~ Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this fail without this code? How is this different from the index option being set?

@fuweid
Copy link
Member

fuweid commented Mar 29, 2021

ping @dmcgowan ~

return nil
}

// Volatile enables the volatile option of overlayfs to ommit all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omit

@dmcgowan
Copy link
Member

When overlay is mounted with “volatile” option, the directory “$workdir/work/incompat/volatile” is created. During next mount, overlay checks for this directory and refuses to mount if present. This is a strong indicator that user should throw away upper and work directories and create fresh one. In very limited cases where the user knows that the system has not crashed and contents of upperdir are intact, The “volatile” directory can be removed.

What is the expected snapshotter behavior in this case?

@dmcgowan dmcgowan added the status/needs-discussion Needs discussion and decision from maintainers label May 27, 2021
@natefive
Copy link

@dmcgowan has this been abandoned for an alternative implementation or could this be revived? thanks

@rata
Copy link
Contributor

rata commented Apr 14, 2023

I've completely lost track of this. As I didn't open the PR it doesn't show me when looking for those, and thoes notification arrived while I was moving my emails so I probably lost them. Sorry!

@natefive if you want to work on this, I'd be happy to help if I can. Otherwise, I can take this with lower prio (I'm working on user namespaces support in containerd first).

Do you want to take this work?

@natefive
Copy link

natefive commented Apr 17, 2023

@rata happy to give it a try, I may struggle with the testing though

from the above I can see:

  1. a typo 'ommit' to 'omit'

however I am unclear, what the decided action for this thread #4785 (comment) was?

thanks :D I will begin prepping my own PR based on this one - #8402

@rata
Copy link
Contributor

rata commented Apr 18, 2023

@natefive Thanks! We need to solve it, I've not proposed any concrete solution. I mentioned what I thought might be needed (a dummy mount)

@natefive
Copy link

I've interpreted "a dummy mount" and previous context to mean add a SupportsVolatileMount func in overlayutils to test if volatile mounts are available https://github.com/containerd/containerd/pull/8402/files#diff-0d4696b29eacc6f2ed899b12982ab6fa275670ee474536e067f8778732211e56 then call this func from the NewSnapshotter func

@rata
Copy link
Contributor

rata commented Apr 18, 2023

@natefive Yeap, that is what I had in mind. I don't know if there is anyway to avoid doing that, I don't remember now. Thanks for continuing this! :)

@natefive
Copy link

@rata no problem - I guess I'll leave it in your more experienced hands to test it properly (it compiles, the units pass & CI passes with the exception of Vagrant BOX:rockylinux/[email protected])

@rata
Copy link
Contributor

rata commented Apr 27, 2023

Sorry, I'll be afk for several days and was busy with kubecon. I can have a look later, when I'm back, but I don't think there is anything special to look at, if it is creating pods with the overlayfs filesystem mounted as volatile, it should be fine :)

@natefive
Copy link

natefive commented May 2, 2023

no worries have a good time at kubecon 🎉

changweige pushed a commit to changweige/containerd that referenced this pull request May 19, 2023
We already had several discussions about enabling overlayfs mount option `volatile` which skips `fssync` against upperdir's file system. The benefit of taking the overlayfs feature in was clearly stated in those discussions.
containerd#4785
containerd#6406

Compared with the aforementioned proposals, this solution is more granular and tries to encapsulate the feature only in CRI.
It makes CRI capable of setting up rootfs by itself thus it has a chance to tweak snapshots mounts options And it does not need to change snapshotter's behavior.

Fortunately, containerd's container API allows the client to specify rootfs(mounts slice) when `StartContainer`.

As the rootfs is passed within TaskCreate request to shim who actually mounts the filesystem as rootfs, I suppose it's pretty safe for other critical parts in containerd's core service.

We already convert all temp mounts to "no-upperdir" mounts. Those temp mounts won't result in `fssync`.

The UI is to add a new configuration entry for CRI plugin
```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
overlayfs_volatile = true
```

In the future, if any KEP comes up with enabling certain pod to be marked as `volatile`, we can extend this feature to make it only work for certain pods on the node.

Signed-off-by: Changwei Ge <[email protected]>
@k8s-ci-robot
Copy link

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CormickKneey
Copy link

@rata
Copy link
Contributor

rata commented May 28, 2024

@CormickKneey It indeed seems containerd now has this functionality now with the mount options. Let me know if I'm missing something, of course.

I'll ask @mauriciovasquezbernal to close this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase ok-to-test status/needs-discussion Needs discussion and decision from maintainers

Projects

Status: Done
Archived in project

Development

Successfully merging this pull request may close these issues.