Skip to content

Add mayhem for API as a github workflow#43249

Closed
dlowe wants to merge 1 commit intomoby:masterfrom
mayhemheroes:mapi
Closed

Add mayhem for API as a github workflow#43249
dlowe wants to merge 1 commit intomoby:masterfrom
mayhemheroes:mapi

Conversation

@dlowe
Copy link

@dlowe dlowe commented Feb 16, 2022

- What I did

Added a github action to run Mayhem for API (which is a free automated testing tool for http APIs) against dockerd: given the openapi spec and api server, it generates and runs random test payloads, looking for exceptional behavior.

Disclosure: working on this tool is my day job!

- How I did it

The action builds and runs dockerd listening on a tcp port. There's a little bit of jiggery-pokery waiting for the server to start (without which the action is flaky, which nobody wants.) Then it runs mapi for 60 seconds; the results are reported as SARIF ("Code scanning results").

This is probably the easiest/most straightforward way to integrate. If you like mapi but aren't thrilled about the specifics of this workflow, there are lots of options, and I'd be more than happy to help get something working!

- How to verify it

Merging new workflows across forks is a bit weird. You can see some results in my fork, without doing any other work:

the action: https://github.com/makesoftwaresafe/moby/actions/runs/1850050316
the scan results: https://github.com/makesoftwaresafe/moby/pull/1/checks?check_run_id=5209023377
the details in the mayhem for API ui: https://mayhem4api.forallsecure.com/job/10840

To make it work in the upstream repo, there's a bit of out-of-PR effort to integrate:

  • sign up for a free mapi account, creating a moby organization
  • create a mapi service account within the moby organization
  • modify the mapi.yaml action in this PR to point at your mapi organization
  • add the service account's API token to github as a repository secret named MAPI_TOKEN

- Description for the changelog

"Add mayhem for API as a github workflow"

- A picture of a cute animal (not mandatory but encouraged)

Screen Shot 2022-02-15 at 4 08 53 PM

Given the openapi spec and api server, this generates and runs random test
payloads, looking for exceptional behavior.

Mayhem for API docs: https://mayhem4api.forallsecure.com/docs/
mapi GitHub action docs: https://github.com/marketplace/actions/mayhem-for-api

Signed-off-by: J. David Lowe <[email protected]>
@tianon
Copy link
Member

tianon commented Feb 16, 2022

I feel like I'm missing something important while reading this report -- "Internal Server Error" appears to be the response to most of this API fuzzing, which is kind of what I'd expect the daemon to return when fed otherwise garbage inputs? 🙈 😅

Can you give some idea of how we're intended to interpret this result? Is the implication that we should update our swagger to specifically say that "Internal Server Error" means bad arguments (which IMO is a bit implied?) or that we should be updating our parameters/types to be more specific on what they accept so that the fuzzer doesn't generate quite as many invalid values? (Aren't "invalid" inputs a major part of the point of a fuzzer? 😇)

@thaJeztah
Copy link
Member

I guess there's some cases that return a 500, but should've returned a 4xx (most likely because we default to 500 if no specific error was assigned); for example, I see this one with a 500 status;

{
  "message": "Your kernel does not support CPU real-time scheduler"
}

So, the error is handled correctly, but not assigned an errdefs.InvalidParam() (hence a 500 status)

@dlowe
Copy link
Author

dlowe commented Feb 16, 2022

I feel like I'm missing something important while reading this report -- "Internal Server Error" appears to be the response to most of this API fuzzing, which is kind of what I'd expect the daemon to return when fed otherwise garbage inputs? 🙈 😅

That's not an unreasonable expectation, and it appears to be true of dockerd, whether or not it's correct ;)

Can you give some idea of how we're intended to interpret this result? Is the implication that we should update our swagger to specifically say that "Internal Server Error" means bad arguments (which IMO is a bit implied?) or that we should be updating our parameters/types to be more specific on what they accept so that the fuzzer doesn't generate quite as many invalid values? (Aren't "invalid" inputs a major part of the point of a fuzzer? 😇)

The hard-line position, taken by mapi by default, is that 500 means the server broke in some way, and that invalid inputs, no matter how bonkers, should result in 4xx. (And anecdotally, as someone with an operational role, I value that distinction so that I can have meaningful alerts triggered by 500s!)

If that position isn't palatable to this project (understandable!), my suggestion would be to disable the generic InternalServerError rule. Without any other work, mapi will still find some types of bug (if any exist!) And either way, if dockerd can be configured to return stack traces in the 500 payload during testing, we can generate significantly more detailed results (including a certain amount of ability to triage 500s you don't care about.)

@thaJeztah
Copy link
Member

that 500 means the server broke in some way, and that invalid inputs, no matter how bonkers, should result in 4xx.

Agreed (with some nuances); I like using the chart from https://www.codetinkerer.com/2015/12/04/choosing-an-http-status-code.html, which is quite helpful in case of doubt

As to "500 means the server broke"; perhaps this is an area where HTTP status codes are not always a good fit for APIs. Generally, I'd describe these as "it's not your fault, it's ours" (the request is likely OK), but (unfortunately) we're not always able to tell if a failure is temporary, a configuration issue (could be an issue with the host), or a (temporary?) failure in one of the components that the daemon depends on (e.g., containerd, runc). Such errors remain to be a "best effort", in which cases, we will "handle" the error (so that the daemon as a whole remains functioning), and return the error information we have as error-message (to allow the user to investigate if needed).

my suggestion would be to disable the generic InternalServerError rule

I'm on the fence on this one. I think that many of these 500 errors should be considered a bug. There are some cases where we explicitly return a 500 (using errdefs.System(), or an error that implements the errdefs.ErrSystem interface), others may be errors that are handled, but that were not assigned an error-type. I know a debug log was added for this reason (to help reveal the errors that still needed to be fixed);

logrus.WithFields(logrus.Fields{
"module": "api",
"error_type": fmt.Sprintf("%T", err),
}).Debugf("FIXME: Got an API for which error does not match any expected type!!!: %+v", err)

While those logs may help somewhat, the work is definitely not completed (and admitted, not on top of the priority list, unless it's causing breakage); that said, your tool may help to make more of these omissions visible.

if dockerd can be configured to return stack traces in the 500 payload during testing, we can generate significantly more detailed results (including a certain amount of ability to triage 500s you don't care about.)

This should be possible, to some extend: we could make the API return the stack if an error occurred. I don't think this should be the default behaviour, but we could add an option (perhaps an environment variable) to enable this.

It won't be possible to return a stack trace for every error though; Golang errors do not contain a trace by default; we do use the https://github.com/pkg/errors package to generate some of our errors (which provides a stack trace), but that project has been "frozen", and many projects started to remove its use in favor of Golang's "native" error wrapping (which, as far as I'm aware, does not contain stack traces).

For fun, I gave this a quick go during my lunch, and pushed it as a PR: #43252

@dlowe
Copy link
Author

dlowe commented Feb 17, 2022

that 500 means the server broke in some way, and that invalid inputs, no matter how bonkers, should result in 4xx.

Agreed (with some nuances); I like using the chart from https://www.codetinkerer.com/2015/12/04/choosing-an-http-status-code.html, which is quite helpful in case of doubt

As to "500 means the server broke"; perhaps this is an area where HTTP status codes are not always a good fit for APIs. Generally, I'd describe these as "it's not your fault, it's ours"

Yep! "it's not your fault, it's ours" is a nicer way of phrasing what I was trying to convey with "the server broke in some way" (and those flowcharts are great, thanks!)

While those logs may help somewhat, the work is definitely not completed (and admitted, not on top of the priority list, unless it's causing breakage); that said, your tool may help to make more of these omissions visible.

Yep, I imagine it would (especially given stack traces; that's exciting!)

If the results are interesting, but fixing them isn't a high priority, it might make more sense to integrate on a schedule, rather that on pull requests. I don't love a PR-based integration until the baseline is clean, i.e. you can have some confidence that issues found were introduced in the current PR.

@dlowe dlowe closed this Nov 29, 2022
@cpuguy83
Copy link
Member

This is really cool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants