Skip to content

Conversation

@kunalyogenshah
Copy link
Contributor

@kunalyogenshah kunalyogenshah commented Apr 19, 2019

Jira

  • My PR addresses the following AIRFLOW-4360 issues and references them in the PR title.

Description

This PR extends Statsd usage for tracking webserver metrics as well. For starters, we track:

  1. Log page hits, and log fetching latency + failures.
  2. Tree view and graph view latency + load failures.

Tests

Once enabled, you can see the following new counters in Statsd :

{ counters:
   { 'statsd.bad_lines_seen': 0,
     'statsd.packets_received': 68,
     'statsd.metrics_received': 68,
     'airflow.scheduler_heartbeat': 2,
     'airflow.webserver.tree_view.views': 0,
     'airflow.webserver.tree_view.load_failures': 0,
     'airflow.webserver.graph_view.views': 0,
     'airflow.webserver.graph_view.load_failures': 0,
     'airflow.webserver.log.views': 1,
     'airflow.webserver.log.load_failures': 0,
     'airflow.webserver.ajax_logs_fetch.load_failures': 0 },
  timers:
   { 'airflow.webserver.tree_view.load_time': [],
     'airflow.webserver.graph_view.load_time': [],
     'airflow.webserver.ajax_logs_fetch.load_time': [ 17.852 ] },
  gauges:
   { 'airflow.collect_dags': 0.046903,
     'airflow.dagbag_size': 1,
     'airflow.dagbag_import_errors': 0,
     'statsd.timestamp_lag': 0,
     'airflow.pool.starving_tasks.not_pooled': 0 },
  timer_data:
   { 'airflow.webserver.tree_view.load_time': { count_ps: 0, count: 0 },
     'airflow.webserver.graph_view.load_time': { count_ps: 0, count: 0 },
     'airflow.webserver.ajax_logs_fetch.load_time':
      { count_90: 1,
        mean_90: 17.852,
        upper_90: 17.852,
        sum_90: 17.852,
        sum_squares_90: 318.69390400000003,
        std: 0,
        upper: 17.852,
        lower: 17.852,
        count: 1,
        count_ps: 0.1,
        sum: 17.852,
        sum_squares: 318.69390400000003,
        mean: 17.852,
        median: 17.852 } },
  counter_rates:
   { 'statsd.bad_lines_seen': 0,
     'statsd.packets_received': 6.8,
     'statsd.metrics_received': 6.8,
     'airflow.scheduler_heartbeat': 0.2,
     'airflow.webserver.tree_view.views': 0,
     'airflow.webserver.tree_view.load_failures': 0,
     'airflow.webserver.graph_view.views': 0,
     'airflow.webserver.graph_view.load_failures': 0,
     'airflow.webserver.log.views': 0.1,
     'airflow.webserver.log.load_failures': 0,
     'airflow.webserver.ajax_logs_fetch.load_failures': 0 },
  sets: {},
  pctThreshold: [ 90 ] }

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@mik-laj
Copy link
Member

mik-laj commented Apr 19, 2019

Can you complete the documentation? You should complete the docs/metrics.rst file.

@kunalyogenshah
Copy link
Contributor Author

Will do. Just wanted to propose this verifying correctness and getting feedback on the idea itself. Will update documentation next.

@kunalyogenshah kunalyogenshah changed the title [Airflow][WIP] Add performance metrics for Webserver UI. [AIRFLOW-4360][WIP] Add performance metrics for Webserver UI. Apr 19, 2019
@kunalyogenshah kunalyogenshah force-pushed the kunal_add_webserver_stats branch from 609207e to 84a4bdb Compare April 19, 2019 21:41
@feng-tao
Copy link
Member

just skim through the pr, is it just backend server metric or e2e metric? If latter, I don't think this captures the e2e UI pageload time for tree view/graph view. To do e2e instrumentation, we need to instrument in the JS code as well IMO.

@mik-laj
Copy link
Member

mik-laj commented Apr 20, 2019

Is it possible to implement this function with middleware/decorator? Metrics for all views would be useful.

@feng-tao These are just the server metrics. Javascript code is not processed.

@mik-laj
Copy link
Member

mik-laj commented Apr 20, 2019

I looked at this code quickly. In my opinion, errors should be monitored and information about errors should be collected. Counting errors is not a solution. I personally use Sentry to monitor application errors.

I think that there is a step in the right direction. We must consider what and how to monitor. In the near future, I will improve monitoring in this project.

@feng-tao
Copy link
Member

feng-tao commented Apr 20, 2019

@mik-laj , yeah, but not sure what is the usage for the UI loadtime metric. But it will be confusing to tell theairflow.webserver.graph.load_time stat from client side if they encounter page load time performance issue. Most of the page load time is spent on JS side, not on Flask side. I think most of the server side metrics are good to have. But it will be inaccurate to show tree/graph.page_load by measuring the time on Flask side only.

my 2 $

@milton0825
Copy link
Contributor

my 2 $

nit: should be my 2¢. 2 $ is too much.

@feng-tao
Copy link
Member

@milton0825 , lol, good catch. I guess I am putting 2$ on my word :)

@milton0825
Copy link
Contributor

milton0825 commented Apr 21, 2019

Jokes aside.

So to measure e2e latency we have to measure the latency on client side right? If so, how do we gather the stats from client?

@feng-tao
Copy link
Member

feng-tao commented Apr 21, 2019

@milton0825 , I don't think we did that for Airflow at Lyft. But the typical approach is to instrument the js code using RUM library(https://github.com/topics/real-user-monitoring). Google analytics provides this functionality as well which is a typical approach to measure client side performance at Lyft(Amundsen integrates with google analytics to measure the page load time).

@kunalyogenshah
Copy link
Contributor Author

kunalyogenshah commented Apr 21, 2019

A little delayed response, but some thoughts below. Thanks for the feedback!

  • Yeah this PR mainly focuses on the server side metrics. More specifically, it adds metrics for server data fetching / processing for 2 of the most used client side views (more to follow later).
  • JS does play a major role in rendering, but unless there are AJAX server side calls in JS, most of them are usually rendering, which for now feels like the smaller of the bottlenecks, unless the code goes wrong somewhere? Ideally we should track both metrics eventually as mentioned above. One way I considered was using a statsd-client module available for java script as well using NPM. But that might merit a longer proposal.
  • I agree with the idea that the errors should be collected and monitored, and that will effectively be done the same way the scheduler Statsd metrics are monitored today right? The aim for the metrics here is to provide regression test and alerts for the UI before clients have to report it.

@kunalyogenshah kunalyogenshah force-pushed the kunal_add_webserver_stats branch 2 times, most recently from 8825fb6 to d6486f8 Compare April 26, 2019 21:24
@kunalyogenshah
Copy link
Contributor Author

Updated PR to perform stats logging through a decorator. Should be extensible to most if not all endpoints in flask.

@kunalyogenshah kunalyogenshah changed the title [AIRFLOW-4360][WIP] Add performance metrics for Webserver UI. [AIRFLOW-4360] Add performance metrics for Webserver UI. Apr 29, 2019
@kunalyogenshah kunalyogenshah force-pushed the kunal_add_webserver_stats branch 2 times, most recently from be9a01e to 65e6f2b Compare April 30, 2019 20:23
@codecov-io
Copy link

codecov-io commented Apr 30, 2019

Codecov Report

Merging #5139 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5139      +/-   ##
==========================================
+ Coverage   78.54%   78.55%   +0.01%     
==========================================
  Files         469      469              
  Lines       29972    29995      +23     
==========================================
+ Hits        23541    23564      +23     
  Misses       6431     6431
Impacted Files Coverage Δ
airflow/www/decorators.py 81.42% <100%> (+6.91%) ⬆️
airflow/www/views.py 76.34% <100%> (+0.06%) ⬆️
airflow/models/taskinstance.py 92.42% <0%> (-0.18%) ⬇️
airflow/stats.py 71.21% <0%> (+1.51%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8562730...65e6f2b. Read the comment docs.

Copy link
Member

@KevinYang21 KevinYang21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I think this is a step towards the right direction. E2e metrics are important but we also care about the server side loading time, to better understand our performance bottleneck and identify regression point etc. Decorator implementation looks nice, ty @kunalyogenshah

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have doubts whether increasing and decreasing the counter is the best solution. Can not the counter be increased only when an error has occurred?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea around this is to catch silent errors in the implementation. We sometimes have errors that are more transient/unexpected, which causes endpoints to fail. In that case we might not hit the try catch to increment the failures count. This seemed like an assured way to catch any and all failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to do funny things to the stats over time.

@kunalyogenshah kunalyogenshah force-pushed the kunal_add_webserver_stats branch from 65e6f2b to 6cce9ca Compare May 6, 2019 17:28
@kunalyogenshah kunalyogenshah force-pushed the kunal_add_webserver_stats branch from 6cce9ca to c2456ad Compare May 6, 2019 18:58
@stale
Copy link

stale bot commented Sep 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 3, 2019
@stale stale bot closed this Sep 11, 2019
@mauliksoneji
Copy link
Contributor

Hello all, can we please reopen and take a look at merging this?

@potiuk potiuk reopened this Dec 3, 2019
@stale stale bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 3, 2019
@potiuk
Copy link
Member

potiuk commented Dec 3, 2019

@mauliksoneji -> sure. I see @mik-laj @KevinYang21 and @ashb might pick it up.

@github-actions github-actions bot added area:docs area:webserver Webserver related Issues labels Jan 4, 2020
@stale
Copy link

stale bot commented Jan 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 19, 2020
@stale stale bot closed this Jan 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:webserver Webserver related Issues stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants