-
Notifications
You must be signed in to change notification settings - Fork 16.3k
[AIRFLOW-4360] Add performance metrics for Webserver UI. #5139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIRFLOW-4360] Add performance metrics for Webserver UI. #5139
Conversation
|
Can you complete the documentation? You should complete the |
|
Will do. Just wanted to propose this verifying correctness and getting feedback on the idea itself. Will update documentation next. |
609207e to
84a4bdb
Compare
|
just skim through the pr, is it just backend server metric or e2e metric? If latter, I don't think this captures the e2e UI pageload time for tree view/graph view. To do e2e instrumentation, we need to instrument in the JS code as well IMO. |
|
Is it possible to implement this function with middleware/decorator? Metrics for all views would be useful. @feng-tao These are just the server metrics. Javascript code is not processed. |
|
I looked at this code quickly. In my opinion, errors should be monitored and information about errors should be collected. Counting errors is not a solution. I personally use Sentry to monitor application errors. I think that there is a step in the right direction. We must consider what and how to monitor. In the near future, I will improve monitoring in this project. |
|
@mik-laj , yeah, but not sure what is the usage for the UI loadtime metric. But it will be confusing to tell the my 2 $ |
nit: should be my 2¢. 2 $ is too much. |
|
@milton0825 , lol, good catch. I guess I am putting 2$ on my word :) |
|
Jokes aside. So to measure e2e latency we have to measure the latency on client side right? If so, how do we gather the stats from client? |
|
@milton0825 , I don't think we did that for Airflow at Lyft. But the typical approach is to instrument the js code using RUM library(https://github.com/topics/real-user-monitoring). Google analytics provides this functionality as well which is a typical approach to measure client side performance at Lyft(Amundsen integrates with google analytics to measure the page load time). |
|
A little delayed response, but some thoughts below. Thanks for the feedback!
|
8825fb6 to
d6486f8
Compare
|
Updated PR to perform stats logging through a decorator. Should be extensible to most if not all endpoints in flask. |
be9a01e to
65e6f2b
Compare
Codecov Report
@@ Coverage Diff @@
## master #5139 +/- ##
==========================================
+ Coverage 78.54% 78.55% +0.01%
==========================================
Files 469 469
Lines 29972 29995 +23
==========================================
+ Hits 23541 23564 +23
Misses 6431 6431
Continue to review full report at Codecov.
|
KevinYang21
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I think this is a step towards the right direction. E2e metrics are important but we also care about the server side loading time, to better understand our performance bottleneck and identify regression point etc. Decorator implementation looks nice, ty @kunalyogenshah
airflow/www/decorators.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have doubts whether increasing and decreasing the counter is the best solution. Can not the counter be increased only when an error has occurred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea around this is to catch silent errors in the implementation. We sometimes have errors that are more transient/unexpected, which causes endpoints to fail. In that case we might not hit the try catch to increment the failures count. This seemed like an assured way to catch any and all failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to do funny things to the stats over time.
65e6f2b to
6cce9ca
Compare
6cce9ca to
c2456ad
Compare
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
Hello all, can we please reopen and take a look at merging this? |
|
@mauliksoneji -> sure. I see @mik-laj @KevinYang21 and @ashb might pick it up. |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Jira
Description
This PR extends Statsd usage for tracking webserver metrics as well. For starters, we track:
Tests
Once enabled, you can see the following new counters in Statsd :
Commits
Documentation
Code Quality
flake8