Allow marking a golden check as flaky #111595

Conversation
mdebbar left a comment
LGTM but I'll let @Piinks make the final call.
I think you wanted to do the opposite?
D'oh!
Done.
I'm leaning towards giving it a more self-explanatory name like `alwaysPass`, but I don't feel strongly about it.
I wanted to give it a name that implies some technical debt. I'm not sure if `alwaysPass` conveys that. But I also don't feel strongly about it.
What about `falsePositive`? Or `disable`? I am not a great namer. 🙃
`alwaysPassDisabledFalsePositiveFlakyTest` 😛

Here's my evaluation:

`isFlaky`:
- Pros: clear, implies technical debt
- Cons: could be too specific; perhaps you want to use it for something other than flakiness

`disable`, `skip`:
- Pros: clear, implies technical debt; `skip` in particular is already a common name used in `package:test`
- Cons: makes it sound as if the golden is not generated at all; it may be surprising that a disabled/skipped check throws an error because the act of screenshotting the widget fails for some reason ("why is it failing? I skipped it!")

`falsePositive`:
- Pros: implies technical debt
- Cons: makes me think 🤔 (which thing is positive? what's false about it?), especially when it can take on both `true` and `false` values, and especially if used in conjunction with `isNot` (i.e. used for false negatives), but `isNot(...falsePositive = false)` makes me think even harder

`alwaysPass`:
- Pros: the most accurate description of the functionality (the check always passes no matter what's in the screenshot)
- Cons: does not imply technical debt

I think `isFlaky` might be fine, but maybe `skip` or `alwaysPass` are better. `skip` is already a standard name used in the test framework, and `alwaysPass` is so accurate it's hard to resist.
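For context on the `skip` point, here is a minimal sketch of how `skip` already behaves in `package:test` (the test name and skip reason are placeholders): a skipped body never runs, so no golden would be produced at all.

```dart
import 'package:test/test.dart';

void main() {
  test('golden check for the submit button', () {
    // This body never executes while the test is skipped, so no screenshot
    // is taken and no golden is generated.
  }, skip: 'Marked flaky; tracking issue pending.');
}
```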
I think `skip`, especially since it is so widely used in the framework already, indicates the test won't run at all, so this would defy a pretty baked-in connotation.
I don't have an opinion otherwise, I am not particularly skilled in crafting names. 🙃
Piinks left a comment
Considering again the public API that regular Flutter developers will have to work with for their own applications, I don't think that it is appropriate to expose `isFlaky` to them. I still think this is really specific to our use case.
Can this print statement be within the block of code for when the test fails/flakes? If the test passes, there's no need to add console output.
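A minimal sketch of what this suggestion would look like, assuming the `isFlaky` flag from this PR (the helper name and the message text are hypothetical; the diff under review is not shown here):

```dart
import 'dart:typed_data';

import 'package:flutter/foundation.dart';
import 'package:flutter_test/flutter_test.dart';

// Console output lives only in the failure/flake branch; a passing
// comparison stays silent.
Future<void> checkGolden(Uint8List bytes, Uri key, {bool isFlaky = false}) async {
  final bool matched = await goldenFileComparator.compare(bytes, key);
  if (matched) {
    return; // No console output on success.
  }
  if (isFlaky) {
    print('Golden "$key" mismatched, but this check is marked as flaky.');
    return;
  }
  throw FlutterError('Golden "$key" mismatched.');
}
```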
Same here
Nit:
- /// should produce a screenshot and make it available for human review but not
+ /// will produce a screenshot and make it available for human review but not
This class does not provide an implementation of `goldenFileComparator`, so it cannot guarantee this behavior. I meant this paragraph as a recommendation for implementors. It's up to the implementation to decide whether to respect this flag and what "human review" means (e.g. it may be something other than Skia Gold or local files).
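To illustrate the implementor's freedom being described here, a minimal sketch (the class, the backend call, and the shape of the `isFlaky` parameter are all hypothetical) of a comparator that chooses to respect the flag:

```dart
import 'dart:typed_data';

import 'package:flutter_test/flutter_test.dart';

// Hypothetical comparator that respects the flag: the screenshot is still
// produced and submitted for human review, but a mismatch on a flaky check
// is reported as a pass instead of failing the test.
class ReviewOnlyComparator extends GoldenFileComparator {
  @override
  Future<bool> compare(Uint8List imageBytes, Uri golden, {bool isFlaky = false}) async {
    final bool matched = await _submitForReview(imageBytes, golden);
    return matched || isFlaky;
  }

  @override
  Future<void> update(Uri golden, Uint8List imageBytes) async {
    // Accept the new bytes as the golden (stubbed out in this sketch).
  }

  // Stand-in for whatever "human review" means to this implementation,
  // e.g. uploading to a review service rather than comparing local files.
  Future<bool> _submitForReview(Uint8List imageBytes, Uri golden) async => true;
}
```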
I am still not sure about exposing this to all Flutter developers... If a test fails, ideally it would be fixed, right? For our case, the cause of the failure is way more nuanced; a Flutter developer would not be digging into the depths of the engine code like this, and would not be able to individually address their specific failure.
If a test fails and the diff is acceptable to the user, they can just run `flutter test --update-goldens` and move on.
I don't know that the public API should have an opinion about flakiness.
If the cause of flakiness is in the Flutter SDK, then sure. But flakiness can be caused by app code (see the sketch after this list) because:
- it depends on the current date-time.
- it depends on `Math.random`.
- it contains a race condition due to asynchrony.
- it runs in a non-uniform environment (different OS versions, different hardware).
- it has non-hermetic dependencies (e.g. pub dependencies shifting versions behind the scenes).
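A hypothetical widget illustrating the first cause: its pixels change on every run because it renders the current time, so any golden of it is flaky even though the framework behaves correctly.

```dart
import 'package:flutter/material.dart';

// Every build renders a different timestamp, so two screenshots of this
// widget taken at different times will never match pixel-for-pixel.
class LastUpdatedLabel extends StatelessWidget {
  const LastUpdatedLabel({super.key});

  @override
  Widget build(BuildContext context) {
    return Text('Last updated: ${DateTime.now()}');
  }
}
```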
Having said that, "flakiness" may indeed be too specific. There may be other reasons a developer might want to label a test to pass but still generate a golden, such as:
- You see that a generated golden is completely nonsensical, but at the same time you don't know what it should look like. So you don't want to update the golden to a nonsensical one, but you still want to generate it for human evaluation.
- You spot a test that's too brittle. It's trying to test the look of one button, but it's taking a golden of the entire screen (i.e. overtesting). It breaks too often, so you want to disable it, but alert the owner that this test's scope should be reduced.
- You want to divide your tests into two categories. One category is like a fully automated unit test; it is expected to generate stable goldens and prevent code submissions that change them (this is the system Flutter uses). You also want a second category of tests that generate goldens but never fail; the system you submit the goldens to alerts humans (e.g. a QA team) to review any deltas and either approve them or file issues.
So I think this option would be useful to developers. OTOH, all this is speculative. Do you have a recommendation for how to expose this flag to tests, if not through `matchesGoldenFile`?
I think soliciting feedback from someone other than just me would be beneficial here. I think in all of these cases, the developer would want to just fix their test, not accept it as broken and have it fail silently.
No, if these arguments do not immediately come across as sound, I'm totally fine with keeping it private.
We're already agnostic and un-opinionated about this as this doc comment points out, which makes me think we should preserve that.
Force-pushed d55d634 to f343f06
Force-pushed f343f06 to e360777
Customer errors are related to
Piinks left a comment
@yjbanov I saw your ping for a second review, is this still up for review, or will you be making changes because of the failing customer test? In its current state, this can't land AFAICT.

I cannot make changes in customer tests before we agree on the API design. If we think that the API changes in the comparator are the right thing to do, then I'll first fix downstream tests before landing this PR. Alternatively, if we think a different design is better, I will first change this PR, see what's broken by the new approach, and go from there. One thing that's not resolved is #111595 (comment). @Piinks I shared some thoughts in that discussion. I feel like an API change in
Piinks left a comment
> I cannot make changes in customer tests before we agree on the API design.

If this is the path you would like to take with the API, what is the plan for migration? Since this is a breaking change, that should be part of the discussion. :)

> One thing that's not resolved is #111595 (comment)

Sorry, this being a breaking change takes precedence over this for me. 🤷 Can you please share your plan for folks that are broken by this change? It's a moot point if this is going to be re-worked to be non-breaking or otherwise.
@Piinks Please treat this PR only as a design proposal. I'm not attached to code written here. It's just a starting point for a discussion. I can change it to whatever works for you.
It definitely should. The plan right now is to roughly follow https://github.com/flutter/flutter/wiki/Tree-hygiene#handling-breaking-changes (more on why "roughly" below). We are currently doing this part: https://github.com/flutter/flutter/wiki/Tree-hygiene#2-evaluate-the-breaking-change-proposal. I'm not sure if a design doc is necessary. This PR already contains every single detail of the change. In particular, I'm not sure if a breaking change is necessary, so I'm soliciting alternative proposals that are some combination of the following:
Here I am asking for your guidance, and also why I'd like to finish this discussion: #111595 (comment). In particular, you say:
I am making some arguments in favor of exposing the API. What are your thoughts on that? I'm curious if you have something in mind that doesn't change the public API. Perhaps a private API? Is there prior art I could look at?

Yes, I think it's a bit moot to go in depth with the breaking change process until we agree on an approach (which may end up being non-breaking). This is why I'd like to conclude the discussion of the proposal and alternatives first. However, in a nutshell: the breaking part is the
I have shared them. :) Perhaps soliciting feedback from more than one person will help indicate the appropriate direction? That is typical when making breaking changes.
We already discussed this, which is why I did not comment further repeating myself. If we were to not expose this to all Flutter developers, it would instead require writing a custom implementation of `GoldenFileComparator`.
Can you please say how you plan to do this? This change cannot land before cocoon is fixed, and cocoon cannot be fixed without this change. It is not clear how this will be done.
Thanks, @Piinks! This helps. Here's my understanding of your proposal (precise names and API shape TBD):
This change would be non-breaking, so further discussion of breaking changes is not necessary here, but I'll answer your question in case it's useful in the future:
Methods of an implementation of an interface are allowed to add optional parameters not specified in the interface. So I can first update the implementations, then submit this PR.
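A minimal sketch of this migration trick (the interface is trimmed down here, and `isFlaky` is the proposed addition):

```dart
import 'dart:typed_data';

// Trimmed-down stand-in for the real interface; its signature is unchanged.
abstract class GoldenFileComparator {
  Future<bool> compare(Uint8List imageBytes, Uri golden);
}

// In Dart, an override may declare optional parameters that the interface
// does not. So each implementation can gain the new parameter first, and
// the interface (and its callers) can be migrated afterwards.
class LocalFileComparator implements GoldenFileComparator {
  @override
  Future<bool> compare(Uint8List imageBytes, Uri golden, {bool isFlaky = false}) async {
    // ... compare imageBytes against the golden file on disk ...
    return true; // Stubbed out in this sketch.
  }
}
```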
Sounds good! Thanks for entertaining different ideas on this. :)
This pull request executed golden file tests, but it has not been updated in a while (20+ days). Test results from Gold expire after as many days, so this pull request will need to be updated with a fresh commit in order to get results from Gold. For more guidance, visit Writing a golden file test for flutter/flutter.

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.
Closing in favor of #114450
Support marking a golden check as flaky. A flaky check will still generate a golden and report it to the current `GoldenFileComparator`, but it will not fail the test.

The local file comparator was updated to simply not fail on mismatch.

The Skia Gold comparator was updated to submit the golden to Skia Gold with a virtually unlimited failure threshold. The effect is that the test doesn't fail and does not generate a triage request, but it still allows Skia Gold to track generated screenshots over time. When a fix for the flakiness lands, it's easy to go to the Skia Gold UI to confirm it and remove the `isFlaky` argument from the test.

Fixes #111325
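A sketch of how a test would opt in under this proposal (`MyWidget` and the golden file name are placeholders; `isFlaky` is the parameter this PR proposes adding to `matchesGoldenFile`):

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// Placeholder widget standing in for whatever the test screenshots.
class MyWidget extends StatelessWidget {
  const MyWidget({super.key});

  @override
  Widget build(BuildContext context) => const Placeholder();
}

void main() {
  testWidgets('MyWidget golden', (WidgetTester tester) async {
    await tester.pumpWidget(const MyWidget());
    // The golden is still generated and submitted to the comparator, but a
    // mismatch will not fail the test while isFlaky is true.
    await expectLater(
      find.byType(MyWidget),
      matchesGoldenFile('my_widget.png', isFlaky: true),
    );
  });
}
```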