Skip to content

Conversation

@sipsma
Copy link
Contributor

@sipsma sipsma commented Oct 17, 2023

PRs and CI on main are now failing due to trivy finding vulns in our engine image, due to a new GHSA around otel: https://github.com/dagger/dagger/security/dependabot/381

Unfortunately, simply bumping otel libs is never "simple". After bumping our go.mod, the engine and CLI started exiting on startup w/ error cannot merge resource due to conflicting Schema URL.

I'd be lying if I said I understood all the inner machinations of otel, but from my understanding this is due to the fact that buildkit (and other transitive dependencies) have not yet upgraded their versions of otel libs. Due to the fact that these tracers are global to the process, we end up with mismatches in some versions and these failures.

There is a PR in buildkit open already to update their versions, but it stalled because they don't want to upgrade until their dep, containerd, has: moby/buildkit#4318

So, the solution here is to just log these errors rather than fail+exit because of them. If there's a better approach, I'm all ears 🙂

Comment on lines +10 to 11
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @aluzzardi the jaeger exporter is deprecated, jaeger now says to just use plain otel. I had to add this to get lint passing.

Is this code still worth maintaining? I don't remember if it's from older iterations on metrics/logging or if it's still in use.

@sipsma
Copy link
Contributor Author

sipsma commented Oct 17, 2023

@helderco any clue why python tests would be failing consistently with this error all of a sudden? https://github.com/dagger/dagger/actions/runs/6541678095/job/17763530250?pr=5906#step:4:938

273: exec python /pipx.pyz install hatch==1.7.0 ERROR: process "python /pipx.pyz install hatch==1.7.0" did not complete successfully: exit code: 1
273: > in sdk > python > test > 3.10 > build
273: > in sdk > python > test > 3.10
273: [6.13s] upgrading shared libraries...
273: [9.60s] installing hatch from spec 'hatch==1.7.0'...
273: [11.7s] Fatal error from pip prevented installation. Full pip output in file:
273: [11.7s]     /root/.local/pipx/logs/cmd_2023-10-17_02.18.23_pip_errors.log
273: [11.7s] 
273: [11.7s] pip seemed to fail to build package:
273: [11.7s]     idna
273: [11.7s] 
273: [11.7s] Some possibly relevant errors from pip install:
273: [11.7s]     ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
273: [11.7s] Error installing hatch from spec 'hatch==1.7.0'.
273: [11.7s] 

Doesn't seem related to any changes here (? at least I can't see how it could be). Hoping it's just some ephemeral thing, but let me know if you think it seems legit.

@sipsma
Copy link
Contributor Author

sipsma commented Oct 17, 2023

the dagger-in-dagger engine tests are failing only on github-paid-runner consistently, all other tests are okay though 😕

can repro locally, ./hack/make engine:test is fine but ./hack/dev ./hack/make engine:test hangs. Will continue looking tomorrow. Entirely possible this is not something we need to care about (those tests are advisory only anyways), but would feel better if I had a clue as to how tracing dependencies could cause this.

Copy link
Contributor

@vito vito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on principle so you can merge once CI is sorted out 🫡

If there's mismatches between dependency versions on otel libs between
us and buildkit (and its various deps), then we can get schema merge
errors.

While sad and worth notifying the user about via a log statement, these
are not actually fatal, so it's not worth exiting over.

Signed-off-by: Erik Sipsma <[email protected]>
This is a bit silly. The previous update to log otel failures caused the
word "merge" to appear in the output (it's an error about merging otel
schemas), which in turn caused our test assertion to fail because it's
checking that no "merge" (-op) vertexes show up in the output by
default.

There's not really any great way to make the test more robust than I can
see other than just asserting on a more specific string, "merge (". It's
important to have this test coverage though, so this seems like the best
solution.

Signed-off-by: Erik Sipsma <[email protected]>
@sipsma
Copy link
Contributor Author

sipsma commented Oct 17, 2023

Looking into the dagger-in-dagger engine tests failing on github-paid-runner, it seems it may be a legit problem (glad we have these tests if so!)

I saw from a combination of SIGQUIT stacks and ps output that the CLI was stuck on this line trying to connect to buildkit: https://github.com/sipsma/dagger/blob/661a886a937d3cb620ab24e41dd031a72de1ab71/engine/client/buildkit.go#L51-L51

and that it kept spawning a short lived docker exec -it dagger-engine.dev buildctl dial-stdio ... subprocess. This indicates the docker-container connhelper we pick up from buildkit is trying to connect, but can't because the subprocess keeps exiting.

Unfortunately, that actual output gets entirely hidden (I'm not sure it's even possible for us to fix that without upstream changes), but with strace I could see that it seems to be printing the same cannot merge resource due to conflicting Schema URL error that I fixed elsewhere.

So my best guess is that while I updated our code to ignore that error in the engine + cli, the buildctl binary itself is still getting it (it does do various tracing setup, so makes sense)

I haven't pieced together every strand of spaghetti to explain exactly why this would only happen with the dagger-in-dagger tests, but this feels like a likely explanation.


If the above is correct, then our options are:

  1. Wait for upstream to fix it here.
    • Given they are pending containerd PRs, this isn't viable in terms of time
  2. Ignore the otel vulns, close this PR
    • At first glance they don't seem especially relevant to us, but I haven't looked in enough depth to be confident in that, so this is more of a last resort
  3. Build the buildctl binary using buildkit's upstream go.mod rather than our own
    • However, pretty sure this would just result in the otel vulns being found in that binary rather than dagger engine 😭
  4. Fork buildctl and remove the tracing stuff, which I don't think we care about
    • Investigating how easy/hard this is now...

cc @jedevc in case you have any ideas too

@sipsma
Copy link
Contributor Author

sipsma commented Oct 17, 2023

Happy update: I gave a quick shot at just pulling buildctl from upstream's image rather than building it ourselves. It did indeed fix the dagger-in-dagger engine test problem, but thankfully it also seems to pass the vulnerability scanning.

I'm guessing that the deps w/ vulnerabilities only appear in the "server-side" rather than client side, so this makes sense.

I think we should just go with this for now. Cleaning up the commit w/ comments on what's going on, then will merge if everything is green still 🤞

See comment in `buildctlBin` func in `internal/mage/util/engine.go`
for reasoning.

Signed-off-by: Erik Sipsma <[email protected]>
@sipsma sipsma merged commit c847458 into dagger:main Oct 17, 2023
@helderco
Copy link
Contributor

@helderco any clue why python tests would be failing consistently with this error all of a sudden? https://github.com/dagger/dagger/actions/runs/6541678095/job/17763530250?pr=5906#step:4:938

273: exec python /pipx.pyz install hatch==1.7.0 ERROR: process "python /pipx.pyz install hatch==1.7.0" did not complete successfully: exit code: 1
273: > in sdk > python > test > 3.10 > build
273: > in sdk > python > test > 3.10
273: [6.13s] upgrading shared libraries...
273: [9.60s] installing hatch from spec 'hatch==1.7.0'...
273: [11.7s] Fatal error from pip prevented installation. Full pip output in file:
273: [11.7s]     /root/.local/pipx/logs/cmd_2023-10-17_02.18.23_pip_errors.log
273: [11.7s] 
273: [11.7s] pip seemed to fail to build package:
273: [11.7s]     idna
273: [11.7s] 
273: [11.7s] Some possibly relevant errors from pip install:
273: [11.7s]     ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
273: [11.7s] Error installing hatch from spec 'hatch==1.7.0'.
273: [11.7s] 

Doesn't seem related to any changes here (? at least I can't see how it could be). Hoping it's just some ephemeral thing, but let me know if you think it seems legit.

I suspect this is a conflict inside the cache volumes. We need to at least have different caches per python version.

schlapzz pushed a commit to schlapzz/dagger that referenced this pull request Nov 24, 2023
* engine: upgrade otel deps to address trivy scan result

https://github.com/dagger/dagger/security/dependabot/381

Signed-off-by: Erik Sipsma <[email protected]>

* engine: log+ignore otel tracer errors

If there's mismatches between dependency versions on otel libs between
us and buildkit (and its various deps), then we can get schema merge
errors.

While sad and worth notifying the user about via a log statement, these
are not actually fatal, so it's not worth exiting over.

Signed-off-by: Erik Sipsma <[email protected]>

* core: ignore lint failure about deprecated jaeger exporter import

Signed-off-by: Erik Sipsma <[email protected]>

* assert on more specific message in internal pipeline test

This is a bit silly. The previous update to log otel failures caused the
word "merge" to appear in the output (it's an error about merging otel
schemas), which in turn caused our test assertion to fail because it's
checking that no "merge" (-op) vertexes show up in the output by
default.

There's not really any great way to make the test more robust than I can
see other than just asserting on a more specific string, "merge (". It's
important to have this test coverage though, so this seems like the best
solution.

Signed-off-by: Erik Sipsma <[email protected]>

* ci: use buildctl binary from upstream image

See comment in `buildctlBin` func in `internal/mage/util/engine.go`
for reasoning.

Signed-off-by: Erik Sipsma <[email protected]>

---------

Signed-off-by: Erik Sipsma <[email protected]>
Signed-off-by: Christian Schlatter <[email protected]>
schlapzz pushed a commit to schlapzz/dagger that referenced this pull request Nov 24, 2023
* engine: upgrade otel deps to address trivy scan result

https://github.com/dagger/dagger/security/dependabot/381

Signed-off-by: Erik Sipsma <[email protected]>

* engine: log+ignore otel tracer errors

If there's mismatches between dependency versions on otel libs between
us and buildkit (and its various deps), then we can get schema merge
errors.

While sad and worth notifying the user about via a log statement, these
are not actually fatal, so it's not worth exiting over.

Signed-off-by: Erik Sipsma <[email protected]>

* core: ignore lint failure about deprecated jaeger exporter import

Signed-off-by: Erik Sipsma <[email protected]>

* assert on more specific message in internal pipeline test

This is a bit silly. The previous update to log otel failures caused the
word "merge" to appear in the output (it's an error about merging otel
schemas), which in turn caused our test assertion to fail because it's
checking that no "merge" (-op) vertexes show up in the output by
default.

There's not really any great way to make the test more robust than I can
see other than just asserting on a more specific string, "merge (". It's
important to have this test coverage though, so this seems like the best
solution.

Signed-off-by: Erik Sipsma <[email protected]>

* ci: use buildctl binary from upstream image

See comment in `buildctlBin` func in `internal/mage/util/engine.go`
for reasoning.

Signed-off-by: Erik Sipsma <[email protected]>

---------

Signed-off-by: Erik Sipsma <[email protected]>
schlapzz pushed a commit to schlapzz/dagger that referenced this pull request Nov 24, 2023
* engine: upgrade otel deps to address trivy scan result

https://github.com/dagger/dagger/security/dependabot/381

Signed-off-by: Erik Sipsma <[email protected]>

* engine: log+ignore otel tracer errors

If there's mismatches between dependency versions on otel libs between
us and buildkit (and its various deps), then we can get schema merge
errors.

While sad and worth notifying the user about via a log statement, these
are not actually fatal, so it's not worth exiting over.

Signed-off-by: Erik Sipsma <[email protected]>

* core: ignore lint failure about deprecated jaeger exporter import

Signed-off-by: Erik Sipsma <[email protected]>

* assert on more specific message in internal pipeline test

This is a bit silly. The previous update to log otel failures caused the
word "merge" to appear in the output (it's an error about merging otel
schemas), which in turn caused our test assertion to fail because it's
checking that no "merge" (-op) vertexes show up in the output by
default.

There's not really any great way to make the test more robust than I can
see other than just asserting on a more specific string, "merge (". It's
important to have this test coverage though, so this seems like the best
solution.

Signed-off-by: Erik Sipsma <[email protected]>

* ci: use buildctl binary from upstream image

See comment in `buildctlBin` func in `internal/mage/util/engine.go`
for reasoning.

Signed-off-by: Erik Sipsma <[email protected]>

---------

Signed-off-by: Erik Sipsma <[email protected]>
schlapzz pushed a commit to schlapzz/dagger that referenced this pull request Nov 24, 2023
* engine: upgrade otel deps to address trivy scan result

https://github.com/dagger/dagger/security/dependabot/381

Signed-off-by: Erik Sipsma <[email protected]>

* engine: log+ignore otel tracer errors

If there's mismatches between dependency versions on otel libs between
us and buildkit (and its various deps), then we can get schema merge
errors.

While sad and worth notifying the user about via a log statement, these
are not actually fatal, so it's not worth exiting over.

Signed-off-by: Erik Sipsma <[email protected]>

* core: ignore lint failure about deprecated jaeger exporter import

Signed-off-by: Erik Sipsma <[email protected]>

* assert on more specific message in internal pipeline test

This is a bit silly. The previous update to log otel failures caused the
word "merge" to appear in the output (it's an error about merging otel
schemas), which in turn caused our test assertion to fail because it's
checking that no "merge" (-op) vertexes show up in the output by
default.

There's not really any great way to make the test more robust than I can
see other than just asserting on a more specific string, "merge (". It's
important to have this test coverage though, so this seems like the best
solution.

Signed-off-by: Erik Sipsma <[email protected]>

* ci: use buildctl binary from upstream image

See comment in `buildctlBin` func in `internal/mage/util/engine.go`
for reasoning.

Signed-off-by: Erik Sipsma <[email protected]>

---------

Signed-off-by: Erik Sipsma <[email protected]>
schlapzz pushed a commit to schlapzz/dagger that referenced this pull request Nov 24, 2023
* engine: upgrade otel deps to address trivy scan result

https://github.com/dagger/dagger/security/dependabot/381

Signed-off-by: Erik Sipsma <[email protected]>

* engine: log+ignore otel tracer errors

If there's mismatches between dependency versions on otel libs between
us and buildkit (and its various deps), then we can get schema merge
errors.

While sad and worth notifying the user about via a log statement, these
are not actually fatal, so it's not worth exiting over.

Signed-off-by: Erik Sipsma <[email protected]>

* core: ignore lint failure about deprecated jaeger exporter import

Signed-off-by: Erik Sipsma <[email protected]>

* assert on more specific message in internal pipeline test

This is a bit silly. The previous update to log otel failures caused the
word "merge" to appear in the output (it's an error about merging otel
schemas), which in turn caused our test assertion to fail because it's
checking that no "merge" (-op) vertexes show up in the output by
default.

There's not really any great way to make the test more robust than I can
see other than just asserting on a more specific string, "merge (". It's
important to have this test coverage though, so this seems like the best
solution.

Signed-off-by: Erik Sipsma <[email protected]>

* ci: use buildctl binary from upstream image

See comment in `buildctlBin` func in `internal/mage/util/engine.go`
for reasoning.

Signed-off-by: Erik Sipsma <[email protected]>

---------

Signed-off-by: Erik Sipsma <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants