Tag Archives: talos

Watching the watcher – Some data on the Talos alerts we generate

What are the performance regressions at Mozilla- who monitors them and what kind of regressions do we see?  I want to answer this question with a few peeks at the data.  There are plenty of previous blog posts I have done outlining stats, trends, and the process.  Lets recap what we do briefly, then look at the breakdown of alerts (not necessarily bugs).

When Talos uploads numbers to graph server they get stored and eventually run through a calculation loop to find regressions and improvements.  As of Jan 1, 2015, we upload these to mozilla.dev.tree-alerts as well as email to the offending patch author (if they can easily be identified).  There are a couple folks (performance sheriffs) who look at the alerts and triage them.  If necessary a bug is filed for further investigation.  Reading this brief recap of what happens to our performance numbers probably doesn’t inspire folks, what is interesting is looking at the actual data we have.

Lets start with some basic facts about alerts in the last 12 months:

  • We have collected 8232 alerts!
  • 4213 of those alerts are regressions (the rest are improvements)
  • 3780 of those above alerts have a manually marked status
    • the rest have been programatically marked as merged and associated with the original
  • 278 bugs have been filed (or 17 alerts/bug)
    • 89 fixed!
    • 61 open!
    • 128 (5 invalid, 8 duplicate, 115 wontfix/worksforme)

As you can see this is not a casual hobby, it is a real system helping out in fixing and understanding hundreds of performance issues.

We generate alerts on a variety of branches, here is the breakdown of branches and alerts/branch;

number of regression alerts we have received per branch

number of regression alerts we have received per branch

There are a few things to keep in mind here, mobile/mozilla-central/Firefox are the same branch, and for non-pgo branches that is only linux/windows/android, not osx. 

Looking at that graph is sort of non inspiring, most of the alerts will land on fx-team and mozilla-inbound, then show up on the other branches as we merge code.  We run more tests/platforms and land/backout stuff more frequently on mozilla-inbound and fx-team, this is why we have a larger number of alerts.

Given the fact we have so many alerts and have manually triaged them, what state the the alerts end up in?

Current state of alerts

Current state of alerts

The interesting data point here is that 43% of our alerts are duplicates.  A few reasons for this:

  • we see an alert on non-pgo, then on pgo (we usually mark the pgo ones as duplicates)
  • we see an alert on mozilla-inbound, then the same alert shows up on fx-team,b2g-inbound,firefox (due to merging)
    • and then later we see the pgo versions on the merged branches
  • sometimes we retrigger or backfill to find the root cause, this generates a new alert many times
  • in a few cases we have landed/backed out/landed a patch and we end up with duplicate sets of alerts

The last piece of information that I would like to share is the break down of alerts per test:

Alerts per test

number of alerts per test (some excluded)

There are a few outliers, but we need to keep in mind that active work was being done in certain areas which would explain a lot of alerts for a given test.  There are 35 different test types which wouldn’t look good in an image, so I have excluded retired tests, counters, startup tests, and android tests.

Personally, I am looking forward to the next year as we transition some tools and do some hacking on the reporting, alert generation and overall process.  Thanks for reading!

1 Comment

Filed under testdev

A case of the weekends?

Case of the Mondays

What was famous 15 years ago as a case of the Mondays has manifested itself in Talos.  In fact, I wonder why I get so many regression alerts on Monday as compared to other days.  It is more to a point of we have less noise in our Talos data on weekends.

Take for example the test case tresize:

linux32,

* in fact we see this on other platforms as well linux32/linux64/osx10.8/windowsXP

30 days of linux tresize

Many other tests exhibit this.  What is different about weekends?  Is there just less data points?

I do know our volume of tests go down on weekends mostly as a side effect of less patches being landed on our trees.

Here are some ideas I have to debug this more:

  • Run massive retrigger scripts for talos on weekends to validate # of samples is/isnot the problem
  • Reduce the volume of talos on weekdays to validate the overall system load in the datacenter is/isnot the problem
  • compare the load of the machines with all branches and wait times to that of the noise we have in certain tests/platforms
  • Look at platforms like windows 7, windows 8, and osx 10.6 as to why they have more noise on weekends or are more stable.  Finding the delta in platforms would help provide answers

If you have ideas on how to uncover this mystery, please speak up.  I would be happy to have this gone and make any automated alerts more useful!

3 Comments

Filed under testdev

Are there any trends in our Talos regression bugs?

Now that we have a better process for taking action on Talos alerts and pushing them to resolution, it is time to take a step back and see if any trends show up in our bugs.

First I want to look at bugs filed/week:

Image

This is fun to see, now what if we stack this up side by side with the alerts we receive:

Image

We started tracking alerts halfway through this process.  We show that for about 1 out of every 25 alerts we file a bug.  I had previously stated it was closer to 1/33 alerts (it appears that is averaging out the first few weeks).

Lets see where these bugs are filed, here is a view of the different bugzilla products:

Image

The Testing product is used to file bugs that we cannot figure out the exact changeset, so they get filed in testing::talos.  As there are almost 30 unique components bugs are filed in, I took a few minutes to look at the Core product, here is where the bugs live in Core:

Image

Pardon my bad graphing attempt here with the components cut off.  Graphics is the clear winner for regressions (with “graphics: layers” being a large part of it).  Of course the Javascript Engine and DOM would be there (a lot of our tests are sensitive to changes here).  This really shows where our test coverage is more than where bad code lives. 

Now that I know where the bugs are, here is a view of how long the bugs stay open:

Image

The fantastic news is most of our bugs are resolved in <=15 days!  I think this is a metric we can track and get better at- ideally closing all Talos regression bugs in <30 days.

Looking over all the bugs we have, what is the status of them?

Image

Yay for the blue pacman!  We have a lot of new bugs instead of assigned bugs, that might be something we could adjust and assign owners once it is confirmed and briefly discussed- that is still up in the air.

The burning question is what are all the bugs resolved as?

Image

To me this seems healthy, it is a starting point.  Tracking this over time will probably be a useful metric!

 

In summary, many developers have done great work to make improvements and fix patches over the last 6 months that we have been tracking this information.  There are things we can do better, I want to know-

What information provided today is useful to track regularly?

Is there something you would rather see?

 

3 Comments

Filed under Uncategorized

The lifecycle of a Talos performance regression

The lifecycle of a Talos performance regression

The cycle of landing a change to Firefox that affects performance

Leave a comment

May 8, 2014 · 9:38 am

Hey, SessionRestore- you have a Talos test

As of last Friday bug 936630 landed so we now have sessionrestore (and sessionrestore_no_auto_restore) as 2 new tests in the Talos suite (posting results under they magic ‘o’ key on tbpl).  In fact we have already seen this test show improvements.

Thanks to Yoric for creating these new tests, give him some karma on irc!  Please refer to the talos docs if you want more information on these tests.

 

Leave a comment

Filed under Uncategorized

Performance Alerts – by the numbers

If you have ever received an automated mail about a performance regression, and then 10 more, you probably are frustrated by the volume of alerts.  6 months ago, I started looking at the alerts and filing bugs, and 10 weeks ago a little tool was written to help out.

What have I seen in 10 weeks:

1926 alerts on mozilla.dev.tree-management for Talos resulting in 58 bugs filed (or 1 bug/33 alerts):

Image

*keep in mind that many alerts are improvements, as well as duplicated between trees and pgo/nonpgo

 

Now for some numbers as we uplift.  How are we doing from one release to another?  Are we regressing, Improving?  These are all questions I would like to answer in the coming weeks.

Firefox 30 uplift, m-c -> Aurora:

  • 26 – regressions (4 TART, 4 SVG, 3 TS, Paint, and many more)
    • 2 remaining bugs not resolved as we are now on Beta (bug 990183, bug 990194)

 

Firefox 31 uplift, m-c -> Aurora (tracking bug 990085):

 

Is this useful information?

Are there questions you would rather I answer with this data?

 

3 Comments

Filed under Uncategorized

Performance Bugs – How to stay on top of Talos regressions

Talos is the framework used for desktop Firefox to measure performance for every patch that gets checked in.  Running tests for every checkin on every platform is great, but who looks at the results?

As I mentioned in a previous blog post, I have been looking at the alerts which are posted to dev.tree-management, and taking action on them if necessary.  I will save discussing my alert manager tool for another day.  One great thing about our alert system is that we send an email to the original patch author if we can determine who it is.  What is great is many developers already take note of this and take actions on their own.  I see many patches backed out or discussed with no one but the developer initiating the action.

So why do we need a Talos alert sheriff?  For the main reason that not even half of the regressions are acted upon.  There are valid reasons for this (wrong patch identified, noisy data, doesn’t seem related to the patch) and of course many regressions are ignored due to lack of time.  When I started filing bugs 6 months ago, I incorrectly assumed all of them would be fixed or resolved as wontfix for a valid reason.  This happens for most of the bugs, but many regressions get forgotten about.

When we did the uplift of Firefox 30 from mozilla-central to mozilla-aurora, we saw 26 regression alerts come in and 4 improvement alerts.  This prompted us to revisit the process of what we were doing and what could be done better.  Here are some of the new things we will be doing:

  • For all regressions found, attempt to find the original bug and reopen/comment in the bug
  • For some regressions that it is not easy to find the original bug, we will open a new bug
  • All bugs that have regression information will be marked as blocking a new tracking bug
  • For each release we will create a new tracking bug for all regressions
  • After an uplift from central->aurora, we will ensure we have all alerts mapped to existing regressions

As this process goes through a cycle or two, we will refine it to ensure we have less noise for developers and more accuracy in tracking regressions faster

 

Leave a comment

Filed under Uncategorized

tracking talos alerts across branches

A year without blogging and I am back.  I figured there was some cool stuff to share, here is one tidbit.

In the last year I have picked up looking at talos results and filing regression bugs for results.  This has been useful.  What currently happens is when results are submitted to g.m.o (graph server) we detect a regression and send out an email to the original patch author (if we can determine it) and post to mozilla.dev.tree-management.  I have been using dev.tree-management as a starting point for my hunting regressions.  When things are busy it can eat up a couple hours in a day.  Luckily many developers are responsible in taking action when they receive the emails.

Given that at least half of the regressions are not acted upon by the original developer, it is important to read the newsgroup. One of the things which makes it frustrating is that for a single regression we can get multiple alerts (regular builds vs pgo builds and as the patch merges between branches/projects).

To make my life easier, I have taken all the alerts on dev.tree-management and put them in a database (local right now).  The final goal is a webUI that lets me easily annotate these alerts similar to tbpl for random test failures.  One thing I wanted to do was help identify duplicate alerts.  Today in my attempt I had a clear picture of what the lifecycle of a regression looks like:

mysql> select date,branch,percent,keyrevision from alerts where test=’Paint’ and platform=’WINNT 6.2 x64′ order by date ASC;
+———————+————————-+———+————–+
| date                | branch                  | percent | keyrevision  |
+———————+————————-+———+————–+
| 2014-02-14 19:41:38 | Mozilla-Inbound-Non-PGO | 10.1%   | c7802c9d6eec |
| 2014-02-15 01:03:54 | Fx-Team-Non-PGO         | 9.53%   | 7a3adc5aac28 |
| 2014-02-15 21:43:48 | Mozilla-Inbound         | 10.6%   | c7802c9d6eec |
| 2014-02-16 03:46:12 | Firefox-Non-PGO         | 8.88%   | 5d7caa093f4f |
| 2014-02-16 03:46:13 | B2g-Inbound-Non-PGO     | 9.44%   | 071885f79841 |
| 2014-02-16 14:22:38 | Fx-Team                 | 10.4%   | 7a3adc5aac28 |
| 2014-02-17 04:42:57 | B2g-Inbound             | 10.7%   | 071885f79841 |
| 2014-02-18 11:43:54 | Firefox                 | 9.76%   | eac89fb04bb9 |
+———————+————————-+———+————–+
8 rows in set (0.00 sec)

This is really cool to see how 1 change can generate alerts for 4 days.

Stay tuned for more information on this and other topics!

Leave a comment

Filed under Uncategorized

Android automation is becoming more stable ~7% failure rate

At Mozilla we have made our unit testing on android devices to be as important as desktop testing. Earlier today I was asked how do we measure this and what is our definition of success. The obvious answer is no failures except for code that breaks a test, but reality is something where we allow for random failures and infrastructure failures. Our current goal is 5%

So what are these acceptable failures and what does 5% really mean. Failures can happen when we have tests which fail randomly, usually poorly written tests or tests which have been written a long time ago and hacked to work in todays environment. This doesn’t mean any test that fails is a problem, it could be a previous test that changes a Firefox preference on accident. For Android testing, this currently means the browser failed to launch and load the test webpage properly or it crashed in the middle of the test. Other failures are the device losing connectivity, our host machine having hiccups, the network going down, sdcard failures, and many other problems. With our current state of testing this mostly falls into the category of losing connectivity to the device. For infrastructure problems they are indicated as Red or Purple and for test related problems they are Orange.

I took at a look at the last 10 runs on mozilla-central (where we build Firefox nightlies from) and built this little graph:

Firefox Android Failures

Firefox Android Failures

Here you can see that our tests are causing 6.67% of the failures and 12.33% of the time we can expect a failure on Android.

We have another branch called mozilla-inbound (we merge this into mozilla-central regularly) where most of the latest changes get checked in.  I did the same thing here:

mozilla-inbound Android Failures

mozilla-inbound Android Failures

Here you can see that our tests are causing 7.77% of the failures and 9.89% of the time we can expect a failure on Android.

This is only a small sample of the tests, but it should give you a good idea of where we are.

3 Comments

Filed under testdev

Some notes about adding new tests to talos

Over the last year and a half I have been editing the talos harness for various bug fixes, but just recently I have needed to dive in and add new tests and pagesets to talos for Firefox and Fennec.  Here are some of the things I didn’t realize or have inconveniently forget about what goes on behind the scenes.

  • tp4 is really 4 tests: tp4, tp4_nochrome, tp4_shutdown, tp4_shutdown_nochrome.  This is because in the .config file, we have “shutdown: true” which adds _shutdown to the test name and running with –noChrome adds the _nochrome to the test name.  Same with any test that us run with the shutdown=true and nochrome options.
  • when adding new tests, we need to add the test information to the graph server (staging and production).  This is done in the hg.mozilla.org/graphs repository by adding to data.sql.
  • when adding new pagesets (as I did for tp4 mobile), we need to provide a .zip of the pages and the pageloader manifest to release engineering as well as modifying the .config file in talos to point to the new manifest file.  see bug 648307
  • Also when adding new pages, we need to add sql for each page we load.  This is also in the graphs repository bug in pages_table.sql.
  • When editing the graph server, you need to file a bug with IT to update the live servers and attach a sql file (not a diff).   Some examples: bug 649774 and bug 650879
  • after you have the graph servers updated, staging run green, review done, then you can check in the patch for talos
  • For new tests, you also have to create a buildbot config patch to add the testname to the list of tests that are run for talos
  • the last step is to file a release engineering bug to update talos on the production servers.  This is done by creating a .zip of talos, posting it on a ftp site somewhere and providing a link to it in the bug.
  • one last thing is to make sure the bug to update talos has an owner and is looked at, otherwise it can sit for weeks with no action!!!

This is my experience from getting ts_paint, tpaint, and tp4m (mobile only) tests added to Talos over the last couple months.

Leave a comment

Filed under testdev