[HCP Telemetry] Periodic Refresh for Dynamic Telemetry Configuration#18168
[HCP Telemetry] Periodic Refresh for Dynamic Telemetry Configuration#18168
Conversation
agent/hcp/deps.go
Outdated
There was a problem hiding this comment.
In deps.go, the sink and exporter are now initialized with a HCP specific "TelemetryConfigProvider" which continuously fetches configuration from HCP and updates it as needed.
agent/hcp/telemetry/filter.go
Outdated
agent/hcp/telemetry/filter_test.go
Outdated
There was a problem hiding this comment.
Moved to client_test.go
agent/hcp/telemetry/otel_exporter.go
Outdated
There was a problem hiding this comment.
The OTELExporter now holds an EndpointProvider from which it obtains the endpoint where it needs to send metrics to.
The endpointProvider can be any struct that conforms to the interface. The OTELExporter uses the GetEndpoint() of an endpointProvider to obtain the url.
agent/hcp/telemetry/otel_sink.go
Outdated
There was a problem hiding this comment.
The OTELSink now holds an ConfigProvider from which it obtains the filters and labels to process metrics.
The ConfigProvider can be any struct that conforms to the interface. The OTELSink uses the GetFilters() and GetLabels() of an ConfigProvider.
agent/hcp/telemetry/otel_sink.go
Outdated
There was a problem hiding this comment.
Don't need labels / filter logic in here since the provider will give this information.
agent/hcp/telemetry/otel_sink.go
Outdated
There was a problem hiding this comment.
We now use a resource without attributes (labels). Once created, a resource can't change its attributes, so this won't work anymore.
To support dynamic configuration, i.e. changing labels, we will now send custom labels at the individual metric level.
CC @jjti : I'm assuming there is no impact on the prom side for this, but just tagging you for some 👀
There was a problem hiding this comment.
Okay yeah, understood. Payload size will increase, but there shouldn't be any backend change because of this since we wind up adding the attributes to each sample regardless during the OTLP > Prom conversion
There was a problem hiding this comment.
Sounds good, let me cut a ticket to look into this and gather info. If it does become a problem, I think we should fast track the compression plan that we had previously, which would mean:
- Consul/collector should compress export payloads
- TGW should be able to handle compression (think it currently doesn't?)
There was a problem hiding this comment.
TGW should be able to handle compression (think it currently doesn't?)
Correct that it currently doesn't. There's a trade-off where we'll up our CPU/mem usage in Consul + CTGW but for these repetitive sample labels I imagine it's worth it. When we address this it'd be nice to see how the increased cost of compression helps us w/ payload size.
We use snappy encoding already to compress payloads in the gateway, and I see that's a dependency of Consul already, that may be one route (adding Content-Encoding: snappy)
There was a problem hiding this comment.
Change from http to grpc would result in a better way of handling payload size as well. We should probably backlog this as the HIP client already uses grpc over traefik and we should probably do the same to reduce the egress charges for customers and for our own hcp clusters.
agent/hcp/telemetry/otel_sink.go
Outdated
There was a problem hiding this comment.
We need to handle adding the provider labels at the metric level, in addition to the go metrics label. This function merges both into OTEL attribute key value pairs.
40186a6 to
7b42fce
Compare
agent/hcp/client/client.go
Outdated
There was a problem hiding this comment.
Changes:
- Init client with context to add a logger
- Refactored out the
TelemetryConfig,MetricsConfig, newRefreshConfigand the conversions into thetelemetry_config.gofile
Why this refactor?
- This file was growing larger and larger
- Modified the endpoint, label and filters validation and parsing to be done when converting the response. This reduces the code over in
deps.goandtelemetry_config_provider.go. It also avoids code duplication.
agent/hcp/client/client.go
Outdated
There was a problem hiding this comment.
Moved to new file telemetry_config.go
agent/hcp/client/client.go
Outdated
There was a problem hiding this comment.
Moved conversions to new file telemetry_config.go
agent/hcp/client/client_test.go
Outdated
There was a problem hiding this comment.
Removed most tests in this file to put them in telemetry_config_test.go
Fixed the actual client test which was not mocking the TGW client properly, and test the FetchTelemetryConfig function including validations and conversion.
agent/hcp/client/telemetry_config.go
Outdated
There was a problem hiding this comment.
- This new file contains all logic pertaining to
TelemetryConfig, an object that contains telemetry config from HCP.
I settled on the following model for the object:
type TelemetryConfig struct {
MetricsConfig *MetricsConfig
RefreshConfig *RefreshConfig
}In the future, we can add fields here like TraceConfig *TraceConfig or LogsConfig *LogsConfig
I removed the upper level label and endpoint, instead we should perform this conversion and pass around the object as it should be used.
This greatly reduces the complexity of code in deps.go and telemetry_config.go, and removes any duplicated code.
- This file contains all conversions that are needed to get a final TelemetryObject, in particular, the default labels setting, the conversion of a string endpoint to a url, etc.
agent/hcp/deps.go
Outdated
There was a problem hiding this comment.
Changes:
- Init the
TelemetryConfigProvider - Remove any parsing that is now unnecessary with the client refactor (url parsing)
agent/hcp/deps.go
Outdated
There was a problem hiding this comment.
Now returns an error to test those properly.
agent/hcp/deps_test.go
Outdated
There was a problem hiding this comment.
Some of the tests were cleaned up thanks to the refactor of the URL parsing into the client.
New tests were added for the new TelemetryConfigProvider
agent/hcp/telemetry/otel_exporter.go
Outdated
There was a problem hiding this comment.
The OTELExporter now holds an EndpointProvider from which it obtains the filters and labels to process metrics.
The ConfigProvider can be any struct that conforms to the interface. The EndpointProvider uses the GetEndpoint() of an EndpointProvider.
There was a problem hiding this comment.
New TelemetryConfigProvider which holds dynamic configuration and provides labels, filters and the endpoint to consumers.
There was a problem hiding this comment.
pass in the ctx and refreshInterval to avoid needing a rw lock call here.
There was a problem hiding this comment.
All the Get* methods acquire a reader lock and defer the unlock before returning the config field
There was a problem hiding this comment.
Only modify config if it has changed. This is verified by comparing the hash of the old vs new config
go.mod
Outdated
There was a problem hiding this comment.
New version to bring in the RefreshConfig changes
6151673 to
5a1f5c2
Compare
fa60971 to
7ea6280
Compare
Description
New feature for HCP Telemetry that allows us to dynamically re-configure metrics export to HCP.
Background
Currently, we need to restart the Consul server to bring in any updates to filters, a new endpoint or new labels from the Consul Telemetry Gateway in HCP.
Ideally, we are able to modify these dynamically, which requires logic to continuously fetch new configuration and update it as necessary.
Implementation
This PR enables this functionality, with changes to :
TelemetryConfigProvider:time.Ticker.GetEndpoint(),GetFilters()andGetLabels()ConfigProviderinterface to access filters and labels.EndpointProviderinterface to access the endpoint.** Cleanup / Refactor ***
clientpackage was cleaned to separate out theTelemetryConfigobject to parse and convert objects.This refactor was necessary since now the TelemetryConfigProvider also fetches telemetry config. This avoid duplication of logic for the parsing of the endpoint, labels ,etc.
Testing & Reproduction steps
Links
PR Checklist