Skip to content

Conversation

@devinbost
Copy link
Contributor

@devinbost devinbost commented Jan 21, 2020

Master issue: #4175
This PR also depends on the code in #6031 .

This PR is to add statistics (to enable calls to get function status, get function stats, and get other metrics) for Pulsar Admin usage and for sending to Prometheus.
Because Prometheus Go did not support some of the ways that we depended on to interact with Prometheus in the Python and Java Pulsar code, this change was more significant and diverges slightly from the approaches used for the other parts of the Pulsar codebase. However, based on feedback from the core maintainers of Prometheus, there is a risk that they may deprecate the other ways that we're using their library. So, the approaches used in this PR should serve as the new standard. For more information about the issues with Prometheus, please see the discussion here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/prometheus-users/NpkERPC17H4

Fixes issue #6205

Motivation

The current implementation of Go functions in Pulsar is missing key statistical information and other information that is critical for determining the health and reliability of Go functions in production. This PR solves that problem.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API: (sort of)
    It adds functionality that didn't previously exist for Go functions.
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (sort of)
    It adds functionality that didn't previously exist for Go functions.
  • The admin cli options: (sort of)
    It adds functionality that didn't previously exist for Go functions.
  • Anything that affects deployment: (possibly, though unlikely, due to new dependencies)

Documentation

  • Does this pull request introduce a new feature? (sort of)
    Prometheus metrics were changed slightly. We needed to replace many of the Prometheus Counters with Gauges due to limitations in the Prometheus Go client.
  • If yes, how is the feature documented? (not documented)
  • If a feature is not applicable for documentation, explain why?
    More documentation will need to be created as Go functions become more production-ready.

@sijie sijie requested a review from wolfstudy January 22, 2020 10:30
@devinbost devinbost changed the title [pulsar-function-go] (WIP) Add statistics and Prometheus to Go Function instances for production readiness [pulsar-function-go] Add statistics and Prometheus to Go Function instances for production readiness Jan 24, 2020
@devinbost
Copy link
Contributor Author

@sijie This is ready now for code review.

@jiazhai
Copy link
Member

jiazhai commented Feb 7, 2020

@sijie @wolfstudy Would you please help review this issue.
@devinbost Thanks for the fix. seems master branch updating brings in some conflicts, would you please help rebase this work.

@devinbost
Copy link
Contributor Author

@jiazhai It will be easier for me to rebase this if we first merge this one: #6104

@devinbost
Copy link
Contributor Author

@jerrypeng @merlimat Please note that in this PR, I'm using the new approach to get data from Prometheus (based on their recommendations) because the way we were doing it in Python and Java is not supported in their newer client libraries. (They were very adamant about it too.)

@devinbost
Copy link
Contributor Author

devinbost commented Feb 7, 2020

The test failures in this PR are just flaky tests. Functionally, everything works, along with #6104
(It's just not very effective for me to need to push a new commit every time I want to re-run the tests because every time I do that, new failures occur. Hence, why I've been working on addressing the flaky tests in #6202 ). It would be easier if I could just re-run the failed tests, but I won't be able to do that until @sijie finishes creating the bot for doing so.

…omission of methods for gRPC server registration in generated gRPC files for Go. (apache#4175)

Generated updated gRPC files that contain service registration methods for creating gRPC service in Go. Also, upgraded proto version to 3. (apache#4175)

Fixed build errors by prefixing pulsar-function-go/pb with pb alias. (apache#4175).

Added instanceControlServicer.go as the servicer responsible for serving the gRPC service for the Go Function instances (apache#4175). Rough draft right now.

Added changes to show intent behind passing port value to Start in function.go. Also, added some code to support healthcheck and added methods to support instanceConrolServicer. Just needed to commit changes to allow reproducible test errors. (apache#4175).

Updated function.go Start method to make it more clear where we need to provide a port value (apache#4175).

Added port and expectedHealthCheckInterval to use of function context. Updated all references. (apache#4175)

Added Apache license to gRPC-generated files in attempt to get license check test to pass (apache#4175).

Created instanceControlServicer_test.go to test gRPC server and validate that HealthCheck method returns true as expected (apache#4175).

Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location (apache#4175).

Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location. (Should have been part of previous commit.) Also, added expectedHealthCheckInterval to conf.yaml for testing. (apache#4175).

Fixed license formatting by running mvn license:format (apache#4175).

Added logic and tests to allow healthCheck to kill instances that aren't receiving their regular health checks. Still needs an end-to-end test involving FunctionManager to check for possible issues that could kill instances incorrectly (apache#4175).

Removed inputTopics field from FunctionContext (apache#4175).

Adding the progress I've made so far on migrating the Prometheus code to Go... currently blocked due to missing methods from the Go client. Waiting for information from the Prometheus maintainers to find a workaround. (apache#4175).

Fixed license check. (apache#4175)

Reverting the last two commits since they should go into a separate PR. (apache#4174).

Re-added test file that was accidentially deleted (apache#4175).

Added a few comments to make review easier (apache#4175).

Made minor (non-functional) changes as per PR review (apache#4175).

Fixed print statements (apache#4175).

Re-added comment after getting maven license formatting correct (apache#4175).

Removed comment that I forgot to remove (apache#4175).

Fixed formatting issues for style check (apache#4175).

Updated gRPC test to no longer use deprecated method (apache#4175).

Fixed more formatting issues by using goimports (apache#4175).

Fixed even more formatting issues (apache#4175).

Fixed yet even more formatting issues (apache#4175).

Added statistics functionality for supporting Prometheus and stats and status commands on Go functions. Needs testing. Also, needs review of specific locations of stats method calls to ensure we're collecting data in the right places. Also, still needs the 1m interval stats to be created. Upstream Prometheus changes prevented us from using the existing approaches for collecting these stats.
@devinbost devinbost force-pushed the go-features-prometheus branch from 8f82e39 to 2acad27 Compare February 17, 2020 20:40
@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

1 similar comment
@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

@sijie @merlimat @jerrypeng This PR will need a more careful review to determine if I'm incrementing the counters and gauges (and other similar Prometheus objects) at the right locations. I was a little uncertain about some of them.

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

4 similar comments
@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

"github.com/stretchr/testify/assert"
)

/*func test(){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devinbost please remove these comments if we don't need the test case

stat.reportSystemExceptionPrometheus(exception, ts)
}

//@limits(calls=5, period=60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be removed?

stat.reportUserExceptionPrometheus(err, ts)
}

//@limits(calls=5, period=60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be removed?

}

/*
func (gi *goInstance) get_avg_process_latency_1min() float32 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiazhai This is a method that isn't available due to the Prometheus architecture change they rolled out to their Go library. So, we need to find another way to capture the 1 min metrics. The comment was to note that this feature is currently missing.

@jiazhai
Copy link
Member

jiazhai commented Apr 2, 2020

@devinbost thanks for the great work. Left some minor comments. overall. lgtm.
It would be more great if you could provide some documents on this?
@sijie @wolfstudy Would you please take a look of this PR?

@devinbost
Copy link
Contributor Author

@jiazhai

@devinbost thanks for the great work. Left some minor comments. overall. lgtm.
It would be more great if you could provide some documents on this?
@sijie @wolfstudy Would you please take a look of this PR?

Thanks for the feedback. What kind of documents would be helpful?

@sijie sijie added doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. doc-required Your PR changes impact docs and you will update later. labels Apr 9, 2020
@sijie sijie added this to the 2.6.0 milestone Apr 9, 2020
Copy link
Member

@sijie sijie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this change looks good to me. Left one minor comment.

I labeled the issue as doc-required. In general, if you can provide some basic documentation how do people use this feature. That would be great.


gi.processResult(msgInput, output)

gi.stats.processTimeEnd() // Should this be called here or before processResult(..)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be called before processResult.

Copy link
Member

@wolfstudy wolfstudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devinbost , The changes look good to me, just a little comment, please fix

"github.com/stretchr/testify/assert"
)

/*func test(){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devinbost please remove these comments if we don't need the test case


metricFamilies, err := reg.Gather()
if err != nil || len(metricFamilies) != 1 {
panic("unexpected behavior of custom test registry")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace panic with t.Fatal()

fiteredMetricFamilies := filter(metricFamilies, match)

if len(fiteredMetricFamilies) > 1 {
panic("Too many metric families")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace panic with t.Fatal()

}
// Then, we need to filter the metrics in the family to one that matches our label.

fmt.Println(proto.MarshalTextString(metricFamilies[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we use assert to compare whether the expected value and the actual value are equal? If my understanding is correct, this test case will succeed at any time, right?

E.g:

assert.Equal(t, x, y)

// value: 41.9
// >
// >
// >
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these comments if we don't need

Copy link
Member

@wolfstudy wolfstudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1, about comments, will fix them in next pull request

@wolfstudy wolfstudy merged commit 25344c7 into apache:master May 12, 2020
@devinbost
Copy link
Contributor Author

I apologize for not getting the comments addressed yet. I've had a lot of other responsibilities lately.

Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this pull request May 27, 2020
…tances for production readiness (apache#6105)

* Enabled grpc plugin to gRPC generate.sh script to fix issues causing omission of methods for gRPC server registration in generated gRPC files for Go. (apache#4175)

Generated updated gRPC files that contain service registration methods for creating gRPC service in Go. Also, upgraded proto version to 3. (apache#4175)

Fixed build errors by prefixing pulsar-function-go/pb with pb alias. (apache#4175).

Added instanceControlServicer.go as the servicer responsible for serving the gRPC service for the Go Function instances (apache#4175). Rough draft right now.

Added changes to show intent behind passing port value to Start in function.go. Also, added some code to support healthcheck and added methods to support instanceConrolServicer. Just needed to commit changes to allow reproducible test errors. (apache#4175).

Updated function.go Start method to make it more clear where we need to provide a port value (apache#4175).

Added port and expectedHealthCheckInterval to use of function context. Updated all references. (apache#4175)

Added Apache license to gRPC-generated files in attempt to get license check test to pass (apache#4175).

Created instanceControlServicer_test.go to test gRPC server and validate that HealthCheck method returns true as expected (apache#4175).

Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location (apache#4175).

Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location. (Should have been part of previous commit.) Also, added expectedHealthCheckInterval to conf.yaml for testing. (apache#4175).

Fixed license formatting by running mvn license:format (apache#4175).

Added logic and tests to allow healthCheck to kill instances that aren't receiving their regular health checks. Still needs an end-to-end test involving FunctionManager to check for possible issues that could kill instances incorrectly (apache#4175).

Removed inputTopics field from FunctionContext (apache#4175).

Adding the progress I've made so far on migrating the Prometheus code to Go... currently blocked due to missing methods from the Go client. Waiting for information from the Prometheus maintainers to find a workaround. (apache#4175).

Fixed license check. (apache#4175)

Reverting the last two commits since they should go into a separate PR. (apache#4174).

Re-added test file that was accidentially deleted (apache#4175).

Added a few comments to make review easier (apache#4175).

Made minor (non-functional) changes as per PR review (apache#4175).

Fixed print statements (apache#4175).

Re-added comment after getting maven license formatting correct (apache#4175).

Removed comment that I forgot to remove (apache#4175).

Fixed formatting issues for style check (apache#4175).

Updated gRPC test to no longer use deprecated method (apache#4175).

Fixed more formatting issues by using goimports (apache#4175).

Fixed even more formatting issues (apache#4175).

Fixed yet even more formatting issues (apache#4175).

Added statistics functionality for supporting Prometheus and stats and status commands on Go functions. Needs testing. Also, needs review of specific locations of stats method calls to ensure we're collecting data in the right places. Also, still needs the 1m interval stats to be created. Upstream Prometheus changes prevented us from using the existing approaches for collecting these stats.

* Improved formatting of Go code. Also, added some required comments to get golint to pass. apache#6105

* Fixed more Go formatting issues. apache#6105

* Fixed more formatting issues. apache#6105

* Ran 'gofmt -s -w .' apache#6105

Co-authored-by: Devin Bost <[email protected]>
@Anonymitaet Anonymitaet removed the doc-required Your PR changes impact docs and you will update later. label Jun 28, 2020
@Anonymitaet
Copy link
Member

Have discussed w/ @wolfstudy, docs have been added here.
Will copy to Pulsar later.

huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
…tances for production readiness (apache#6105)

* Enabled grpc plugin to gRPC generate.sh script to fix issues causing omission of methods for gRPC server registration in generated gRPC files for Go. (apache#4175)

Generated updated gRPC files that contain service registration methods for creating gRPC service in Go. Also, upgraded proto version to 3. (apache#4175)

Fixed build errors by prefixing pulsar-function-go/pb with pb alias. (apache#4175).

Added instanceControlServicer.go as the servicer responsible for serving the gRPC service for the Go Function instances (apache#4175). Rough draft right now.

Added changes to show intent behind passing port value to Start in function.go. Also, added some code to support healthcheck and added methods to support instanceConrolServicer. Just needed to commit changes to allow reproducible test errors. (apache#4175).

Updated function.go Start method to make it more clear where we need to provide a port value (apache#4175).

Added port and expectedHealthCheckInterval to use of function context. Updated all references. (apache#4175)

Added Apache license to gRPC-generated files in attempt to get license check test to pass (apache#4175).

Created instanceControlServicer_test.go to test gRPC server and validate that HealthCheck method returns true as expected (apache#4175).

Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location (apache#4175).

Fixed bug in FunctionContext (and context_test.go) where the inputTopics field was being referenced when it wasn't getting populated. Updated GetInputTopics method to get input topics from the source location. (Should have been part of previous commit.) Also, added expectedHealthCheckInterval to conf.yaml for testing. (apache#4175).

Fixed license formatting by running mvn license:format (apache#4175).

Added logic and tests to allow healthCheck to kill instances that aren't receiving their regular health checks. Still needs an end-to-end test involving FunctionManager to check for possible issues that could kill instances incorrectly (apache#4175).

Removed inputTopics field from FunctionContext (apache#4175).

Adding the progress I've made so far on migrating the Prometheus code to Go... currently blocked due to missing methods from the Go client. Waiting for information from the Prometheus maintainers to find a workaround. (apache#4175).

Fixed license check. (apache#4175)

Reverting the last two commits since they should go into a separate PR. (apache#4174).

Re-added test file that was accidentially deleted (apache#4175).

Added a few comments to make review easier (apache#4175).

Made minor (non-functional) changes as per PR review (apache#4175).

Fixed print statements (apache#4175).

Re-added comment after getting maven license formatting correct (apache#4175).

Removed comment that I forgot to remove (apache#4175).

Fixed formatting issues for style check (apache#4175).

Updated gRPC test to no longer use deprecated method (apache#4175).

Fixed more formatting issues by using goimports (apache#4175).

Fixed even more formatting issues (apache#4175).

Fixed yet even more formatting issues (apache#4175).

Added statistics functionality for supporting Prometheus and stats and status commands on Go functions. Needs testing. Also, needs review of specific locations of stats method calls to ensure we're collecting data in the right places. Also, still needs the 1m interval stats to be created. Upstream Prometheus changes prevented us from using the existing approaches for collecting these stats.

* Improved formatting of Go code. Also, added some required comments to get golint to pass. apache#6105

* Fixed more Go formatting issues. apache#6105

* Fixed more formatting issues. apache#6105

* Ran 'gofmt -s -w .' apache#6105

Co-authored-by: Devin Bost <[email protected]>
codelipenghui pushed a commit that referenced this pull request Jan 30, 2021
Fixes #9177

### Motivation

go function added metrics collector by #6105, but havnt pass `metricsPort` to go function, also not init & start prometheus http server. As the result, function worker will keep trying to access to the metrics port to collect data, which will cause massive log errors in log history.

### Modifications

- expose `metricsPort` to go function
- add prometheus http server to go function

### Verifying this change

- [x] Make sure that the change passes the CI checks.
codelipenghui pushed a commit that referenced this pull request Feb 4, 2021
Fixes #9177

go function added metrics collector by #6105, but havnt pass `metricsPort` to go function, also not init & start prometheus http server. As the result, function worker will keep trying to access to the metrics port to collect data, which will cause massive log errors in log history.

- expose `metricsPort` to go function
- add prometheus http server to go function

- [x] Make sure that the change passes the CI checks.

(cherry picked from commit 211a125)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/function doc Your PR contains doc changes, no matter whether the changes are in markdown or code files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants