Currently Fennec has thousands of failures when running the full set of unittests. As it stands, when tinderbox runs these we just set the “pass” criteria as total failures <= acceptable # of failures. As you can imagine, this leaves a lot of room for improvement.
Enter the LogCompare tool, which happyhans and I have been working on, with help from mikeal on the CouchDB backend. What we do is take the tinderbox log file, parse it, and upload the results to a database. This way we get a list of all the tests that were run and whether each one passed or failed. Now we can compare test by test and see what is fixed, what is a known failure, and what is a new failure. Even better, we are running Mochitests in parallel on 4 different machines, and LogCompare can tell us whether the tests on machine1 passed or failed without waiting for the other machines to finish. Another bonus is that we can track a specific test over time to look for random oranges.
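The test-by-test comparison can be sketched roughly like this. It is a minimal illustration and not the actual LogCompare code: the `classify` function and the dict layout (test filename mapped to a failure count) are assumptions made for the example.

```python
def classify(baseline, current):
    """Compare two runs, each a dict of test filename -> failure count.

    Hypothetical helper: the real tool stores its results in a database.
    """
    report = {"fixed": [], "known_failure": [], "new_failure": []}
    for name in sorted(current):
        now_failing = current[name] > 0
        was_failing = baseline.get(name, 0) > 0
        if now_failing and was_failing:
            report["known_failure"].append(name)
        elif now_failing:
            report["new_failure"].append(name)
        elif was_failing:
            report["fixed"].append(name)
    return report


# Made-up sample data, only to show the shape of the comparison.
baseline = {"dom/test_a.html": 2, "dom/test_b.html": 0, "layout/test_c.html": 1}
current = {"dom/test_a.html": 0, "dom/test_b.html": 3, "layout/test_c.html": 1}
report = classify(baseline, current)
```

With a report like this, a run only needs to turn orange when `new_failure` is non-empty, instead of relying on a total failure count.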
The concept is simple. Here are some of the details and caveats:
- We track tests by test filename, not by directory or test suite.
- A single file can contain many tests (mochitest), so there is no clean way to track each specific test.
- If a test fails, later tests (sometimes in the same file, folder, or suite) are skipped.
- Parsing the log file is a nasty task with many corner cases.
- To match test names up correctly, we need to strip out full paths and use just the relative path/filename.
- We need to handle new tests being added and existing ones being removed.
- We need a baseline from Firefox for the full list of tests and their counts.
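A few of these points (tracking by filename, stripping full paths, and spotting added or removed tests) can be sketched as follows. This is only an illustration under stated assumptions: it assumes log lines shaped like `STATUS | path | message`, that paths contain a `/tests/` directory, and that paths have no spaces. Real tinderbox logs have many more corner cases, and `relative_name`, `tally`, and `added_and_removed` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed line shape: "<STATUS> | <path> | <message>". Real logs vary.
LINE_RE = re.compile(r"(TEST-PASS|TEST-UNEXPECTED-FAIL)\s*\|\s*(\S+)\s*\|")


def relative_name(path, marker="/tests/"):
    """Strip the machine-specific prefix up to the first `marker` (an assumption)."""
    path = path.replace("\\", "/")
    idx = path.find(marker)
    return path[idx + len(marker):] if idx != -1 else path.lstrip("/")


def tally(log_text):
    """Count passes and failures per test file; one file may hold many tests."""
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip anything that is not a recognised result line
        status, path = match.groups()
        key = "pass" if status == "TEST-PASS" else "fail"
        counts[relative_name(path)][key] += 1
    return dict(counts)


def added_and_removed(baseline, current):
    """Return (tests only in current, tests only in baseline)."""
    return sorted(set(current) - set(baseline)), sorted(set(baseline) - set(current))


# Made-up log excerpt for illustration.
log = """\
TEST-PASS | /home/build/mochitest/tests/dom/test_a.html | first check
TEST-UNEXPECTED-FAIL | /home/build/mochitest/tests/dom/test_a.html | second check
TEST-PASS | C:\\slave\\mochitest\\tests\\dom\\test_b.html | only check
some unrelated output
"""
counts = tally(log)
```

Keying on the relative path means the same test lines up across machines with different build directories, and comparing per-file counts against a baseline shows when some tests in a file were skipped.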
The goal here is to keep it simple while bringing the total failure count of the unittests on Fennec to zero!
Can you do this for Firefox, too? It would be great to have data on how frequent each random orange is, and be able to search to find the first occurrence of a random orange.
Simple answer: YES!
These are the same tests (with a few exceptions), which means the same log parsing scripts and data mining would work.
A big benefit of this is for the parallel runs of the unittests. There could be a dozen different logs which all need to feed into a single pass/fail (orange) result. It is much easier to determine the status of a build from a database than from multiple files.
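As a rough sketch of how several logs could roll up into one result, assume each parallel chunk reports its list of new failures as it finishes. The `build_status` function and the "green" and "pending" labels are assumptions for illustration; only "orange" comes from the discussion above.

```python
def build_status(chunk_results, expected_chunks):
    """Combine per-chunk results into one build status.

    chunk_results maps chunk name -> list of new failures, and holds only
    the chunks that have reported so far (hypothetical layout).
    """
    if any(chunk_results.values()):
        return "orange"  # one failing chunk is enough; no need to wait
    if set(chunk_results) >= set(expected_chunks):
        return "green"  # every expected chunk reported, none with failures
    return "pending"  # nothing failed yet, but some chunks are still running


expected = ["machine1", "machine2"]
early = build_status({"machine1": []}, expected)
failed = build_status({"machine1": ["dom/test_b.html"]}, expected)
done = build_status({"machine1": [], "machine2": []}, expected)
```

The build can be marked orange as soon as any one chunk reports a new failure, which is the advantage of not waiting on the other machines.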