-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[pulsar-function-go] Add statistics and Prometheus to Go Function instances for production readiness #6105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@sijie This is ready now for code review. |
|
@sijie @wolfstudy Would you please help review this issue. |
|
@jerrypeng @merlimat Please note that in this PR, I'm using the new approach to get data from Prometheus (based on their recommendations) because the way we were doing it in Python and Java is not supported in their newer client libraries. (They were very adamant about it too.) |
|
The test failures in this PR are just flaky tests. Functionally, everything works, along with #6104 |
…omission of methods for gRPC server registration in generated gRPC files for Go. (apache#4175) Generated updated gRPC files that contain service registration methods for creating gRPC service in Go. Also, upgraded proto version to 3. (apache#4175) Fixed build errors by prefixing pulsar-function-go/pb with pb alias. (apache#4175). Added instanceControlServicer.go as the servicer responsible for serving the gRPC service for the Go Function instances (apache#4175). Rough draft right now. Added changes to show intent behind passing port value to Start in function.go. Also, added some code to support healthcheck and added methods to support instanceConrolServicer. Just needed to commit changes to allow reproducible test errors. (apache#4175). Updated function.go Start method to make it more clear where we need to provide a port value (apache#4175). Added port and expectedHealthCheckInterval to use of function context. Updated all references. (apache#4175) Added Apache license to gRPC-generated files in attempt to get license check test to pass (apache#4175). Created instanceControlServicer_test.go to test gRPC server and validate that HealthCheck method returns true as expected (apache#4175). Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location (apache#4175). Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location. (Should have been part of previous commit.) Also, added expectedHealthCheckInterval to conf.yaml for testing. (apache#4175). Fixed license formatting by running mvn license:format (apache#4175). Added logic and tests to allow healthCheck to kill instances that aren't receiving their regular health checks. Still needs an end-to-end test involving FunctionManager to check for possible issues that could kill instances incorrectly (apache#4175). Removed inputTopics field from FunctionContext (apache#4175). Adding the progress I've made so far on migrating the Prometheus code to Go... currently blocked due to missing methods from the Go client. Waiting for information from the Prometheus maintainers to find a workaround. (apache#4175). Fixed license check. (apache#4175) Reverting the last two commits since they should go into a separate PR. (apache#4174). Re-added test file that was accidentially deleted (apache#4175). Added a few comments to make review easier (apache#4175). Made minor (non-functional) changes as per PR review (apache#4175). Fixed print statements (apache#4175). Re-added comment after getting maven license formatting correct (apache#4175). Removed comment that I forgot to remove (apache#4175). Fixed formatting issues for style check (apache#4175). Updated gRPC test to no longer use deprecated method (apache#4175). Fixed more formatting issues by using goimports (apache#4175). Fixed even more formatting issues (apache#4175). Fixed yet even more formatting issues (apache#4175). Added statistics functionality for supporting Prometheus and stats and status commands on Go functions. Needs testing. Also, needs review of specific locations of stats method calls to ensure we're collecting data in the right places. Also, still needs the 1m interval stats to be created. Upstream Prometheus changes prevented us from using the existing approaches for collecting these stats.
8f82e39 to
2acad27
Compare
|
/pulsarbot run-failure-checks |
1 similar comment
|
/pulsarbot run-failure-checks |
|
@sijie @merlimat @jerrypeng This PR will need a more careful review to determine if I'm incrementing the counters and gauges (and other similar Prometheus objects) at the right locations. I was a little uncertain about some of them. |
|
/pulsarbot run-failure-checks |
4 similar comments
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
|
/pulsarbot run-failure-checks |
| "github.com/stretchr/testify/assert" | ||
| ) | ||
|
|
||
| /*func test(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devinbost please remove these comments if we don't need the test case
| stat.reportSystemExceptionPrometheus(exception, ts) | ||
| } | ||
|
|
||
| //@limits(calls=5, period=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be removed?
| stat.reportUserExceptionPrometheus(err, ts) | ||
| } | ||
|
|
||
| //@limits(calls=5, period=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be removed?
| } | ||
|
|
||
| /* | ||
| func (gi *goInstance) get_avg_process_latency_1min() float32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiazhai This is a method that isn't available due to the Prometheus architecture change they rolled out to their Go library. So, we need to find another way to capture the 1 min metrics. The comment was to note that this feature is currently missing.
|
@devinbost thanks for the great work. Left some minor comments. overall. lgtm. |
Thanks for the feedback. What kind of documents would be helpful? |
sijie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this change looks good to me. Left one minor comment.
I labeled the issue as doc-required. In general, if you can provide some basic documentation how do people use this feature. That would be great.
|
|
||
| gi.processResult(msgInput, output) | ||
|
|
||
| gi.stats.processTimeEnd() // Should this be called here or before processResult(..)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be called before processResult.
wolfstudy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @devinbost , The changes look good to me, just a little comment, please fix
| "github.com/stretchr/testify/assert" | ||
| ) | ||
|
|
||
| /*func test(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devinbost please remove these comments if we don't need the test case
|
|
||
| metricFamilies, err := reg.Gather() | ||
| if err != nil || len(metricFamilies) != 1 { | ||
| panic("unexpected behavior of custom test registry") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace panic with t.Fatal()
| fiteredMetricFamilies := filter(metricFamilies, match) | ||
|
|
||
| if len(fiteredMetricFamilies) > 1 { | ||
| panic("Too many metric families") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace panic with t.Fatal()
| } | ||
| // Then, we need to filter the metrics in the family to one that matches our label. | ||
|
|
||
| fmt.Println(proto.MarshalTextString(metricFamilies[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use assert to compare whether the expected value and the actual value are equal? If my understanding is correct, this test case will succeed at any time, right?
E.g:
assert.Equal(t, x, y)
| // value: 41.9 | ||
| // > | ||
| // > | ||
| // > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove these comments if we don't need
wolfstudy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1, about comments, will fix them in next pull request
|
I apologize for not getting the comments addressed yet. I've had a lot of other responsibilities lately. |
…tances for production readiness (apache#6105) * Enabled grpc plugin to gRPC generate.sh script to fix issues causing omission of methods for gRPC server registration in generated gRPC files for Go. (apache#4175) Generated updated gRPC files that contain service registration methods for creating gRPC service in Go. Also, upgraded proto version to 3. (apache#4175) Fixed build errors by prefixing pulsar-function-go/pb with pb alias. (apache#4175). Added instanceControlServicer.go as the servicer responsible for serving the gRPC service for the Go Function instances (apache#4175). Rough draft right now. Added changes to show intent behind passing port value to Start in function.go. Also, added some code to support healthcheck and added methods to support instanceConrolServicer. Just needed to commit changes to allow reproducible test errors. (apache#4175). Updated function.go Start method to make it more clear where we need to provide a port value (apache#4175). Added port and expectedHealthCheckInterval to use of function context. Updated all references. (apache#4175) Added Apache license to gRPC-generated files in attempt to get license check test to pass (apache#4175). Created instanceControlServicer_test.go to test gRPC server and validate that HealthCheck method returns true as expected (apache#4175). Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location (apache#4175). Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location. (Should have been part of previous commit.) Also, added expectedHealthCheckInterval to conf.yaml for testing. (apache#4175). Fixed license formatting by running mvn license:format (apache#4175). Added logic and tests to allow healthCheck to kill instances that aren't receiving their regular health checks. Still needs an end-to-end test involving FunctionManager to check for possible issues that could kill instances incorrectly (apache#4175). Removed inputTopics field from FunctionContext (apache#4175). Adding the progress I've made so far on migrating the Prometheus code to Go... currently blocked due to missing methods from the Go client. Waiting for information from the Prometheus maintainers to find a workaround. (apache#4175). Fixed license check. (apache#4175) Reverting the last two commits since they should go into a separate PR. (apache#4174). Re-added test file that was accidentially deleted (apache#4175). Added a few comments to make review easier (apache#4175). Made minor (non-functional) changes as per PR review (apache#4175). Fixed print statements (apache#4175). Re-added comment after getting maven license formatting correct (apache#4175). Removed comment that I forgot to remove (apache#4175). Fixed formatting issues for style check (apache#4175). Updated gRPC test to no longer use deprecated method (apache#4175). Fixed more formatting issues by using goimports (apache#4175). Fixed even more formatting issues (apache#4175). Fixed yet even more formatting issues (apache#4175). Added statistics functionality for supporting Prometheus and stats and status commands on Go functions. Needs testing. Also, needs review of specific locations of stats method calls to ensure we're collecting data in the right places. Also, still needs the 1m interval stats to be created. Upstream Prometheus changes prevented us from using the existing approaches for collecting these stats. * Improved formatting of Go code. Also, added some required comments to get golint to pass. apache#6105 * Fixed more Go formatting issues. apache#6105 * Fixed more formatting issues. apache#6105 * Ran 'gofmt -s -w .' apache#6105 Co-authored-by: Devin Bost <[email protected]>
|
Have discussed w/ @wolfstudy, docs have been added here. |
…tances for production readiness (apache#6105) * Enabled grpc plugin to gRPC generate.sh script to fix issues causing omission of methods for gRPC server registration in generated gRPC files for Go. (apache#4175) Generated updated gRPC files that contain service registration methods for creating gRPC service in Go. Also, upgraded proto version to 3. (apache#4175) Fixed build errors by prefixing pulsar-function-go/pb with pb alias. (apache#4175). Added instanceControlServicer.go as the servicer responsible for serving the gRPC service for the Go Function instances (apache#4175). Rough draft right now. Added changes to show intent behind passing port value to Start in function.go. Also, added some code to support healthcheck and added methods to support instanceConrolServicer. Just needed to commit changes to allow reproducible test errors. (apache#4175). Updated function.go Start method to make it more clear where we need to provide a port value (apache#4175). Added port and expectedHealthCheckInterval to use of function context. Updated all references. (apache#4175) Added Apache license to gRPC-generated files in attempt to get license check test to pass (apache#4175). Created instanceControlServicer_test.go to test gRPC server and validate that HealthCheck method returns true as expected (apache#4175). Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location (apache#4175). Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location. (Should have been part of previous commit.) Also, added expectedHealthCheckInterval to conf.yaml for testing. (apache#4175). Fixed license formatting by running mvn license:format (apache#4175). Added logic and tests to allow healthCheck to kill instances that aren't receiving their regular health checks. Still needs an end-to-end test involving FunctionManager to check for possible issues that could kill instances incorrectly (apache#4175). Removed inputTopics field from FunctionContext (apache#4175). Adding the progress I've made so far on migrating the Prometheus code to Go... currently blocked due to missing methods from the Go client. Waiting for information from the Prometheus maintainers to find a workaround. (apache#4175). Fixed license check. (apache#4175) Reverting the last two commits since they should go into a separate PR. (apache#4174). Re-added test file that was accidentially deleted (apache#4175). Added a few comments to make review easier (apache#4175). Made minor (non-functional) changes as per PR review (apache#4175). Fixed print statements (apache#4175). Re-added comment after getting maven license formatting correct (apache#4175). Removed comment that I forgot to remove (apache#4175). Fixed formatting issues for style check (apache#4175). Updated gRPC test to no longer use deprecated method (apache#4175). Fixed more formatting issues by using goimports (apache#4175). Fixed even more formatting issues (apache#4175). Fixed yet even more formatting issues (apache#4175). Added statistics functionality for supporting Prometheus and stats and status commands on Go functions. Needs testing. Also, needs review of specific locations of stats method calls to ensure we're collecting data in the right places. Also, still needs the 1m interval stats to be created. Upstream Prometheus changes prevented us from using the existing approaches for collecting these stats. * Improved formatting of Go code. Also, added some required comments to get golint to pass. apache#6105 * Fixed more Go formatting issues. apache#6105 * Fixed more formatting issues. apache#6105 * Ran 'gofmt -s -w .' apache#6105 Co-authored-by: Devin Bost <[email protected]>
Fixes #9177 ### Motivation go function added metrics collector by #6105, but havnt pass `metricsPort` to go function, also not init & start prometheus http server. As the result, function worker will keep trying to access to the metrics port to collect data, which will cause massive log errors in log history. ### Modifications - expose `metricsPort` to go function - add prometheus http server to go function ### Verifying this change - [x] Make sure that the change passes the CI checks.
Fixes #9177 go function added metrics collector by #6105, but havnt pass `metricsPort` to go function, also not init & start prometheus http server. As the result, function worker will keep trying to access to the metrics port to collect data, which will cause massive log errors in log history. - expose `metricsPort` to go function - add prometheus http server to go function - [x] Make sure that the change passes the CI checks. (cherry picked from commit 211a125)
Master issue: #4175
This PR also depends on the code in #6031 .
This PR is to add statistics (to enable calls to get function status, get function stats, and get other metrics) for Pulsar Admin usage and for sending to Prometheus.
Because Prometheus Go did not support some of the ways that we depended on to interact with Prometheus in the Python and Java Pulsar code, this change was more significant and diverges slightly from the approaches used for the other parts of the Pulsar codebase. However, based on feedback from the core maintainers of Prometheus, there is a risk that they may deprecate the other ways that we're using their library. So, the approaches used in this PR should serve as the new standard. For more information about the issues with Prometheus, please see the discussion here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/prometheus-users/NpkERPC17H4
Fixes issue #6205
Motivation
The current implementation of Go functions in Pulsar is missing key statistical information and other information that is critical for determining the health and reliability of Go functions in production. This PR solves that problem.
Does this pull request potentially affect one of the following parts:
It adds functionality that didn't previously exist for Go functions.
It adds functionality that didn't previously exist for Go functions.
It adds functionality that didn't previously exist for Go functions.
Documentation
Prometheus metrics were changed slightly. We needed to replace many of the Prometheus Counters with Gauges due to limitations in the Prometheus Go client.
More documentation will need to be created as Go functions become more production-ready.