test: run integration tests with rootless Podman #1348
Reference
forgejo/runner!1348
No description provided.
@viceice I'm struggling with setting up Podman. The problem seems to be systemd in conjunction with a non-privileged user. It looks like parts of systemd aren't active and other parts like systemd-logind are inactive or missing. Do you have any advice? Or can I see somewhere how lxc-trixie is created so that I can reproduce it locally?
@aahlenst wrote in #1348 (comment):
https://code.forgejo.org/forgejo/runner/src/branch/main/act/runner/lxc-helpers-lib.sh
this script is used. I don't know much more about it.
Thanks. That's helpful.
It looks like the lxc-helpers are creating some problem. Without them, everything's fine:
The default configuration of the lxc-helpers is causing the degradation of systemd. That itself is concerning. A degraded systemd isn't the best foundation for running tests. One reason for the degradation is that /proc isn't mounted. That causes some services to freak out. I haven't yet figured out which part of the default configuration is responsible.
With an empty configuration, the results are slightly better. But systemd-logind is still inactive.
The problem is that neither dbus nor libpam-systemd are present, which are both required. I haven't yet figured out why they are missing. Their names do not appear anywhere in Forgejo Runner's sources.
I've seen forgejo/lxc-helpers#34. But that's not an approach I want to pursue because I expect users to use systemd.
Okay, this is getting ridiculous. Even if I ultimately get it green, a runtime of over 2 hours is not sustainable. With Docker, the very same tests run in roughly 15 minutes. Locally, without LXC underneath, it takes roughly 1 hour 45 min. So not much better.
Podman seems to have problems stopping containers and --init isn't helping much if at all. Output from my workstation with Fedora 43 on AMD64:
@mfenniak Do you have any ideas? If not, should I abandon this effort or try to reach out to the Podman project for assistance?
I do agree that podman has some weird shutdown behaviour, as I find that my podman-based runner tends to keep sending UpdateTask service calls long after a workflow is complete. Whatever that weird behaviour is, it's a reasonable guess that it's the cause of the slow test.
But I think that your pared down reproduction isn't a good example of it -- docker behaves the same way in this test. (note here I've timed the stop because it doesn't give a useful warning)
I like the idea of stripping this problem down to something that can be reproduced with the CLI, and then asking for advice from the podman project if there's not an obvious answer remaining. It makes sense to me to shelve this effort until this shutdown timing issue can be improved.
I created a Bash script that resembles what go test -count 1 -run 'TestRunner_RunEvent/shells/bash$' -v ./... does:
Podman took way longer than Docker to complete it, warning that SIGTERM did not work after every rm. After reading this Podman issue where someone complained about the warning, I concluded that it is expected behaviour. Which means that it needs to be fixed in Forgejo Runner.
Inserting podman stop -t 0 "$container" before every rm makes Podman as fast as Docker.
I measured how long it takes Forgejo Runner to remove a container with Podman:
    err := cr.cli.ContainerRemove(ctx, cr.id, container.RemoveOptions{
        RemoveVolumes: true,
        Force:         true,
    })
A little more than 10 seconds.
Adding cr.cli.ContainerStop(ctx, cr.id, container.StopOptions{Timeout: &zero}); before ContainerRemove() brings the total time down to less than 1 second.
For some reason, the tests are still slow. 🤔
@aahlenst wrote in #1348 (comment):
🤔 OK, that makes sense to me. Always SIGTERM our containers (job, service, & step) when we're done with them, since a graceful cleanup of the entrypoint process (or tail 🤣) isn't important.
Theoretically, if someone was using a service container and mapping a volume that they reuse, there could be some risk in going immediately to a SIGKILL and preventing a clean shutdown of a service... but that isn't a reasonable usage of the runner.
@mfenniak wrote in #1348 (comment):
The way the Docker API is being used in Forgejo Runner suggests it was the author's intent to outright kill all containers. And Docker complies. Podman is more cautious.
I tried my luck with Go's profiling tools. Found another problem:
    func GetHostInfo(ctx context.Context) (info system.Info, err error) {
        var cli client.APIClient
        cli, err = GetDockerClient(ctx)
        if err != nil {
            return info, err
        }
        defer cli.Close()
        info, err = cli.Info(ctx)
        if err != nil {
            return info, err
        }
        return info, nil
    }
Removing the Info call cuts the runtime of a single test in half. I have no idea yet what could be done about it.
There's another one:
    info, err := cli.Info(ctx)
Title changed from "WIP: chore: run integration tests with rootless Podman" to "test: run integration tests with rootless Podman"
With the recent test fixes and other improvements, I got it green and a full test run with Podman completes in less than 20 minutes.
Ready for review. 😅
Wonderful! 🙂 Great work.