add service profile integration tests for service profile metrics #2685

dadjeibaah merged 15 commits into master from
Conversation
Signed-off-by: Dennis Adjei-Baah <[email protected]>
Integration test results for 5688cef: fail 😕
Integration test results for 5e9096e: fail 😕
Integration test results for 140c733: fail 😕
dadjeibaah force-pushed from 140c733 to c20eca0
Integration test results for c20eca0: fail 😕
Integration test results for 88ba27b: fail 😕
Integration test results for 9e834d9: fail 😕
Integration test results for 192864e: fail 😕
Integration test results for 0eddb5b: fail 😕
Integration test results for 977d572: success 🎉
ihcsim left a comment:
Some questions and comments below.
```go
	cliLines = append(cliLines, routes)
}
if isWideOutput {
	cmd = append(cmd, "-owide")
```
Probably not needed, unless you want to add an else to the --output json below. The JSON output already contains the extra effective_* and actual_* fields.
Ah nice! Thanks for pointing that out, I had no idea.
```go
case "timeouts", "budgets":
	// If the P99 latency is greater than 500ms retries are probably happening before applying
	// the service profile and we can't reliably test the service profile.
	assertion.assertFunc = func(rt *rowStat) bool { return rt.LatencyP99 < 500 }
```
I think checking latency is a bit unreliable. If the intent here is to check that retries aren't happening, can we just check that effective_success and actual_success are zero (since --failure-rate=1.0)?
Good point. Testing for latency is pretty finicky; however, the behaviour we are testing is timeouts. The intent of this test case was to ensure that we aren't already hitting the retry timeout that we will eventually test for. My thinking here was: if we are already seeing timeouts surpass the threshold even before applying the service profile with the intended timeout, we can't reliably prove that the service profile is doing its job after it's applied.

Another interesting thing here is that with --failure-rate=1.0, we will still see effective_success and actual_success be zero even after we've applied the service profile and the service is retrying requests. Similar to your suggestion, I think comparing actual_rps and effective_rps would suffice. 🤔
```go
// set in the service profile. hello-timeouts-service and hello-budgets always fail,
// so we expect all request latencies to be greater than or equal to the timeout set.
case "timeouts", "budgets":
	assertion.assertFunc = func(rt *rowStat) bool { return rt.LatencyP99 >= 500 }
```
Similar to above, checking latency can be unreliable. If the intent here is to check that retries happened, but requests continue to fail because of --failure-rate=1.0, we can try to see if this works:

- Configure slow cooker to output a fixed number of requests via the -totalRequests flag
- Make sure effective_success and actual_success are zero
- Use linkerd top to check the count column for retried requests
From my understanding of the initial ticket, I thought we were trying to specifically test the timeouts behavior. Maybe, since it's unreliable, we should just test that retries are being performed?
> Maybe, since it's unreliable, we should just test that retries are being performed?

SGTM, unless others have a way to do this. It isn't easy to reliably test timeouts without context. So far, I can't find anything in the service profile and route code that will be useful here, based on the little that I know.
Integration test results for 11a8396: success 🎉
Integration test results for 6d3ba84: success 🎉
siggy left a comment:
Looks good! Mostly TIOLI comments; good to go following Ivan's comment around wide output and a lint fix.
```go
var TestHelper *testutil.TestHelper

type rowStat struct {
```
TIOLI, you could make jsonRouteStats (from the cmd package) public and re-use that.
```diff
  sourceName: "tap",
  namespace:  testNamespace,
- deployName: "deploy/t1",
+ deployName: "deployment/t1",
```
curious: was this change necessary to match against returned data from a routes command?
Yeah, I had to use the full deployment name in order to get data back. I can make a change to the routes command to accept both deploy and deployment.
```yaml
ports:
- containerPort: 9999
restartPolicy: OnFailure
```

```go
profile := &sp.ServiceProfile{}

// Grab the output and convert it to a service profile object for modification
```
```go
	}
}

func assertRouteStat(assertion *routeStatAssertion, t *testing.T, assertFn func(stat *rowStat) error) {
```
tioli: since routeStatAssertion is only used in one place, i'd probably just pass upstream, downstream, and namespace directly into assertRouteStat
Integration test results for 6496635: success 🎉
This PR is a continuation of #2638. It adds tests that verify retries are performed after a service profile is added for a service, as well as tests that check whether timeout limits for a particular route are respected.
Fixes #2518