Description
Intention:
We intend to change how HTTP metrics are produced. Instead of each request generating 4 HTTP start/stop envelopes, gorouter will expose a histogram of request latencies per application on a Prometheus exposition endpoint. prom-scraper will poll this endpoint and emit the histograms into the loggregator pipeline in place of the multiple start/stop messages.
This will substantially reduce the amount of HTTP metrics data flowing through the system, because a single histogram can summarize many requests.
Problem:
- The volume of metrics produced by even a moderately sized foundation can require a lot of logging infrastructure and drive up logging costs: each HTTP request to an application or system component produces 4-5 loggregator envelopes.
- Interpreting HTTP metrics requires reading through numerous HTTP start/stop envelopes and either counting them individually or transforming them into latencies by subtracting the start time from the stop time.
- When this volume gets too high, loss occurs on the reader and within the system (especially in log-cache, where querying or calculating over this many metrics can be difficult). This introduces significant error into values calculated from these metrics.
Proposal
Each gorouter will collect request latencies and generate Prometheus-format latency histograms for applications that have received requests recently. The routing release will configure the loggregator agent "prom scraper" to scrape these metrics from gorouter and emit them into the loggregator system by creating a prom-scrape-config file (see https://github.com/cloudfoundry/loggregator-agent-release/blob/main/docs/prom-scraper.md and https://github.com/cloudfoundry/loggregator-agent-release/blob/main/jobs/loggr-forwarder-agent/templates/prom_scraper_config.yml.erb for examples of how to do so).
Throughput can then be calculated by taking the change in each gorouter's _count metric over the interval of interest and summing those changes across gorouters. New latency calculations can use the _bucket metrics from each gorouter.
Initially the new metrics format will be available alongside the old metrics format but will not be on by default. An operator will be able to enable or disable each format independently. We will encourage operators to use the new metrics format exclusively as soon as they are able.
Eventually (no earlier than 3 months after new metrics are available) the old metrics format will be removed from gorouter.
Risks and Mitigations
Downstream consumers may struggle or fail to move to the new format. This would make it difficult to deprecate the old format.
Having both types of metrics enabled at the same time will increase the overall metrics load, which may surprise operators. We could address this by never emitting both formats by default.