[AIRFLOW-4360] Add performance metrics for Webserver UI. #5139

kunalyogenshah · 2019-04-19T19:24:08Z

Jira

My PR addresses the following AIRFLOW-4360 issues and references them in the PR title.

Description

This PR extends Statsd usage for tracking webserver metrics as well. For starters, we track:

Log page hits, and log fetching latency + failures.
Tree view and graph view latency + load failures.

Tests

Once enabled, you can see the following new counters in Statsd :

{ counters:
   { 'statsd.bad_lines_seen': 0,
     'statsd.packets_received': 68,
     'statsd.metrics_received': 68,
     'airflow.scheduler_heartbeat': 2,
     'airflow.webserver.tree_view.views': 0,
     'airflow.webserver.tree_view.load_failures': 0,
     'airflow.webserver.graph_view.views': 0,
     'airflow.webserver.graph_view.load_failures': 0,
     'airflow.webserver.log.views': 1,
     'airflow.webserver.log.load_failures': 0,
     'airflow.webserver.ajax_logs_fetch.load_failures': 0 },
  timers:
   { 'airflow.webserver.tree_view.load_time': [],
     'airflow.webserver.graph_view.load_time': [],
     'airflow.webserver.ajax_logs_fetch.load_time': [ 17.852 ] },
  gauges:
   { 'airflow.collect_dags': 0.046903,
     'airflow.dagbag_size': 1,
     'airflow.dagbag_import_errors': 0,
     'statsd.timestamp_lag': 0,
     'airflow.pool.starving_tasks.not_pooled': 0 },
  timer_data:
   { 'airflow.webserver.tree_view.load_time': { count_ps: 0, count: 0 },
     'airflow.webserver.graph_view.load_time': { count_ps: 0, count: 0 },
     'airflow.webserver.ajax_logs_fetch.load_time':
      { count_90: 1,
        mean_90: 17.852,
        upper_90: 17.852,
        sum_90: 17.852,
        sum_squares_90: 318.69390400000003,
        std: 0,
        upper: 17.852,
        lower: 17.852,
        count: 1,
        count_ps: 0.1,
        sum: 17.852,
        sum_squares: 318.69390400000003,
        mean: 17.852,
        median: 17.852 } },
  counter_rates:
   { 'statsd.bad_lines_seen': 0,
     'statsd.packets_received': 6.8,
     'statsd.metrics_received': 6.8,
     'airflow.scheduler_heartbeat': 0.2,
     'airflow.webserver.tree_view.views': 0,
     'airflow.webserver.tree_view.load_failures': 0,
     'airflow.webserver.graph_view.views': 0,
     'airflow.webserver.graph_view.load_failures': 0,
     'airflow.webserver.log.views': 0.1,
     'airflow.webserver.log.load_failures': 0,
     'airflow.webserver.ajax_logs_fetch.load_failures': 0 },
  sets: {},
  pctThreshold: [ 90 ] }

Commits

My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
- If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

Passes flake8

mik-laj · 2019-04-19T19:31:56Z

Can you complete the documentation? You should complete the docs/metrics.rst file.

kunalyogenshah · 2019-04-19T21:13:08Z

Will do. Just wanted to propose this verifying correctness and getting feedback on the idea itself. Will update documentation next.

feng-tao · 2019-04-19T21:43:29Z

just skim through the pr, is it just backend server metric or e2e metric? If latter, I don't think this captures the e2e UI pageload time for tree view/graph view. To do e2e instrumentation, we need to instrument in the JS code as well IMO.

mik-laj · 2019-04-20T16:25:18Z

Is it possible to implement this function with middleware/decorator? Metrics for all views would be useful.

@feng-tao These are just the server metrics. Javascript code is not processed.

mik-laj · 2019-04-20T16:34:10Z

I looked at this code quickly. In my opinion, errors should be monitored and information about errors should be collected. Counting errors is not a solution. I personally use Sentry to monitor application errors.

I think that there is a step in the right direction. We must consider what and how to monitor. In the near future, I will improve monitoring in this project.

feng-tao · 2019-04-20T17:23:06Z

@mik-laj , yeah, but not sure what is the usage for the UI loadtime metric. But it will be confusing to tell theairflow.webserver.graph.load_time stat from client side if they encounter page load time performance issue. Most of the page load time is spent on JS side, not on Flask side. I think most of the server side metrics are good to have. But it will be inaccurate to show tree/graph.page_load by measuring the time on Flask side only.

my 2 $

milton0825 · 2019-04-21T05:11:24Z

my 2 $

nit: should be my 2¢. 2 $ is too much.

feng-tao · 2019-04-21T05:19:35Z

@milton0825 , lol, good catch. I guess I am putting 2$ on my word :)

milton0825 · 2019-04-21T05:25:34Z

Jokes aside.

So to measure e2e latency we have to measure the latency on client side right? If so, how do we gather the stats from client?

feng-tao · 2019-04-21T05:35:11Z

@milton0825 , I don't think we did that for Airflow at Lyft. But the typical approach is to instrument the js code using RUM library(https://github.com/topics/real-user-monitoring). Google analytics provides this functionality as well which is a typical approach to measure client side performance at Lyft(Amundsen integrates with google analytics to measure the page load time).

kunalyogenshah · 2019-04-21T14:06:06Z

A little delayed response, but some thoughts below. Thanks for the feedback!

Yeah this PR mainly focuses on the server side metrics. More specifically, it adds metrics for server data fetching / processing for 2 of the most used client side views (more to follow later).
JS does play a major role in rendering, but unless there are AJAX server side calls in JS, most of them are usually rendering, which for now feels like the smaller of the bottlenecks, unless the code goes wrong somewhere? Ideally we should track both metrics eventually as mentioned above. One way I considered was using a statsd-client module available for java script as well using NPM. But that might merit a longer proposal.
I agree with the idea that the errors should be collected and monitored, and that will effectively be done the same way the scheduler Statsd metrics are monitored today right? The aim for the metrics here is to provide regression test and alerts for the UI before clients have to report it.

kunalyogenshah · 2019-04-26T21:25:45Z

Updated PR to perform stats logging through a decorator. Should be extensible to most if not all endpoints in flask.

codecov-io · 2019-04-30T21:18:28Z

Codecov Report

Merging #5139 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #5139      +/-   ##
==========================================
+ Coverage   78.54%   78.55%   +0.01%     
==========================================
  Files         469      469              
  Lines       29972    29995      +23     
==========================================
+ Hits        23541    23564      +23     
  Misses       6431     6431

Impacted Files	Coverage Δ
airflow/www/decorators.py	`81.42% <100%> (+6.91%)`	⬆️
airflow/www/views.py	`76.34% <100%> (+0.06%)`	⬆️
airflow/models/taskinstance.py	`92.42% <0%> (-0.18%)`	⬇️
airflow/stats.py	`71.21% <0%> (+1.51%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8562730...65e6f2b. Read the comment docs.

KevinYang21

Generally I think this is a step towards the right direction. E2e metrics are important but we also care about the server side loading time, to better understand our performance bottleneck and identify regression point etc. Decorator implementation looks nice, ty @kunalyogenshah

docs/metrics.rst

airflow/www/decorators.py

mik-laj · 2019-05-05T23:36:21Z

airflow/www/decorators.py

I have doubts whether increasing and decreasing the counter is the best solution. Can not the counter be increased only when an error has occurred?

The idea around this is to catch silent errors in the implementation. We sometimes have errors that are more transient/unexpected, which causes endpoints to fail. In that case we might not hit the try catch to increment the failures count. This seemed like an assured way to catch any and all failures.

This is going to do funny things to the stats over time.

stale · 2019-09-03T16:42:47Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mauliksoneji · 2019-12-03T09:04:04Z

Hello all, can we please reopen and take a look at merging this?

potiuk · 2019-12-03T10:46:38Z

@mauliksoneji -> sure. I see @mik-laj @KevinYang21 and @ashb might pick it up.

stale · 2020-01-19T00:52:38Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kunalyogenshah changed the title ~~[Airflow][WIP] Add performance metrics for Webserver UI.~~ [AIRFLOW-4360][WIP] Add performance metrics for Webserver UI. Apr 19, 2019

kunalyogenshah force-pushed the kunal_add_webserver_stats branch from 609207e to 84a4bdb Compare April 19, 2019 21:41

kunalyogenshah mentioned this pull request Apr 25, 2019

[AIRFLOW-4396] Provide a link to external Elasticsearch logs in UI. #5164

Merged

6 tasks

kunalyogenshah force-pushed the kunal_add_webserver_stats branch 2 times, most recently from 8825fb6 to d6486f8 Compare April 26, 2019 21:24

kunalyogenshah changed the title ~~[AIRFLOW-4360][WIP] Add performance metrics for Webserver UI.~~ [AIRFLOW-4360] Add performance metrics for Webserver UI. Apr 29, 2019

kunalyogenshah force-pushed the kunal_add_webserver_stats branch 2 times, most recently from be9a01e to 65e6f2b Compare April 30, 2019 20:23

KevinYang21 reviewed May 2, 2019

View reviewed changes

docs/metrics.rst Outdated Show resolved Hide resolved

mik-laj reviewed May 5, 2019

View reviewed changes

airflow/www/decorators.py Outdated Show resolved Hide resolved

mik-laj reviewed May 5, 2019

View reviewed changes

kunalyogenshah force-pushed the kunal_add_webserver_stats branch from 65e6f2b to 6cce9ca Compare May 6, 2019 17:28

[Airflow] Add performance metrics for Webserver UI.

c2456ad

kunalyogenshah force-pushed the kunal_add_webserver_stats branch from 6cce9ca to c2456ad Compare May 6, 2019 18:58

stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 3, 2019

stale bot closed this Sep 11, 2019

potiuk reopened this Dec 3, 2019

stale bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 3, 2019

github-actions bot added area:docs area:webserver Webserver related Issues labels Jan 4, 2020

stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 19, 2020

stale bot closed this Jan 26, 2020

[AIRFLOW-4360] Add performance metrics for Webserver UI. #5139

[AIRFLOW-4360] Add performance metrics for Webserver UI. #5139

Uh oh!

Conversation

kunalyogenshah commented Apr 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Jira

Description

Tests

Commits

Documentation

Code Quality

Uh oh!

mik-laj commented Apr 19, 2019

Uh oh!

kunalyogenshah commented Apr 19, 2019

Uh oh!

feng-tao commented Apr 19, 2019

Uh oh!

mik-laj commented Apr 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mik-laj commented Apr 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

feng-tao commented Apr 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milton0825 commented Apr 21, 2019

Uh oh!

feng-tao commented Apr 21, 2019

Uh oh!

milton0825 commented Apr 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

feng-tao commented Apr 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kunalyogenshah commented Apr 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kunalyogenshah commented Apr 26, 2019

Uh oh!

codecov-io commented Apr 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

KevinYang21 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mik-laj May 5, 2019

Choose a reason for hiding this comment

Uh oh!

kunalyogenshah May 6, 2019

Choose a reason for hiding this comment

Uh oh!

ashb Jun 21, 2019

Choose a reason for hiding this comment

Uh oh!

stale bot commented Sep 3, 2019

Uh oh!

mauliksoneji commented Dec 3, 2019

Uh oh!

potiuk commented Dec 3, 2019

Uh oh!

stale bot commented Jan 19, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

kunalyogenshah commented Apr 19, 2019 •

edited

Loading

mik-laj commented Apr 20, 2019 •

edited

Loading

mik-laj commented Apr 20, 2019 •

edited

Loading

feng-tao commented Apr 20, 2019 •

edited

Loading

milton0825 commented Apr 21, 2019 •

edited

Loading

feng-tao commented Apr 21, 2019 •

edited

Loading

kunalyogenshah commented Apr 21, 2019 •

edited

Loading

codecov-io commented Apr 30, 2019 •

edited

Loading