Skip to content

[HCP Telemetry] Periodic Refresh for Dynamic Telemetry Configuration#18168

Merged
Achooo merged 42 commits intomainfrom
cc-4960/hcp-telemetry-periodic-refresh
Aug 1, 2023
Merged

[HCP Telemetry] Periodic Refresh for Dynamic Telemetry Configuration#18168
Achooo merged 42 commits intomainfrom
cc-4960/hcp-telemetry-periodic-refresh

Conversation

@Achooo
Copy link
Copy Markdown
Contributor

@Achooo Achooo commented Jul 18, 2023

Description

New feature for HCP Telemetry that allows us to dynamically re-configure metrics export to HCP.

Background
Currently, we need to restart the Consul server to bring in any updates to filters, a new endpoint or new labels from the Consul Telemetry Gateway in HCP.

Ideally, we are able to modify these dynamically, which requires logic to continuously fetch new configuration and update it as necessary.

Implementation
This PR enables this functionality, with changes to :

  1. New object: TelemetryConfigProvider:
  • Periodically fetches telemetry configuration from HCP, using a time.Ticker.
  • Holds configuration and exposes it via methods : GetEndpoint(), GetFilters() and GetLabels()
  • Updates configuration, ONLY if it has changed. Comparisons are done using a hash function on the structs holding the config.
  • Handles concurrency access to read the config and modify the config.
  1. OTELSink: uses a new ConfigProvider interface to access filters and labels.
  • Since labels are now dynamic, we can't add them to the OTEL resource (immutable) so a new function is added to compute OTEL labels dynamically.
  • Handle dynamic filters
  1. OTELExporter: uses a new EndpointProvider interface to access the endpoint.

** Cleanup / Refactor ***

  1. The client package was cleaned to separate out the TelemetryConfig object to parse and convert objects.
  2. A logger was added to the client to log bad filter information

This refactor was necessary since now the TelemetryConfigProvider also fetches telemetry config. This avoid duplication of logic for the parsing of the endpoint, labels ,etc.

Testing & Reproduction steps

  • End to end tests:
  • Unit tests:

Links

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In deps.go, the sink and exporter are now initialized with a HCP specific "TelemetryConfigProvider" which continuously fetches configuration from HCP and updates it as needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to client.go

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to client_test.go

Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTELExporter now holds an EndpointProvider from which it obtains the endpoint where it needs to send metrics to.

The endpointProvider can be any struct that conforms to the interface. The OTELExporter uses the GetEndpoint() of an endpointProvider to obtain the url.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTELSink now holds an ConfigProvider from which it obtains the filters and labels to process metrics.

The ConfigProvider can be any struct that conforms to the interface. The OTELSink uses the GetFilters() and GetLabels() of an ConfigProvider.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need labels / filter logic in here since the provider will give this information.

Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now use a resource without attributes (labels). Once created, a resource can't change its attributes, so this won't work anymore.

To support dynamic configuration, i.e. changing labels, we will now send custom labels at the individual metric level.

CC @jjti : I'm assuming there is no impact on the prom side for this, but just tagging you for some 👀

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay yeah, understood. Payload size will increase, but there shouldn't be any backend change because of this since we wind up adding the attributes to each sample regardless during the OTLP > Prom conversion

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let me cut a ticket to look into this and gather info. If it does become a problem, I think we should fast track the compression plan that we had previously, which would mean:

  1. Consul/collector should compress export payloads
  2. TGW should be able to handle compression (think it currently doesn't?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TGW should be able to handle compression (think it currently doesn't?)

Correct that it currently doesn't. There's a trade-off where we'll up our CPU/mem usage in Consul + CTGW but for these repetitive sample labels I imagine it's worth it. When we address this it'd be nice to see how the increased cost of compression helps us w/ payload size.

We use snappy encoding already to compress payloads in the gateway, and I see that's a dependency of Consul already, that may be one route (adding Content-Encoding: snappy)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change from http to grpc would result in a better way of handling payload size as well. We should probably backlog this as the HIP client already uses grpc over traefik and we should probably do the same to reduce the egress charges for customers and for our own hcp clusters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handle adding the provider labels at the metric level, in addition to the go metrics label. This function merges both into OTEL attribute key value pairs.

@Achooo Achooo force-pushed the cc-4960/hcp-telemetry-periodic-refresh branch from 40186a6 to 7b42fce Compare July 25, 2023 17:27
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes:

  • Init client with context to add a logger
  • Refactored out the TelemetryConfig, MetricsConfig, new RefreshConfigand the conversions into the telemetry_config.go file

Why this refactor?

  1. This file was growing larger and larger
  2. Modified the endpoint, label and filters validation and parsing to be done when converting the response. This reduces the code over in deps.go and telemetry_config_provider.go. It also avoids code duplication.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to new file telemetry_config.go

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved conversions to new file telemetry_config.go

Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed most tests in this file to put them in telemetry_config_test.go
Fixed the actual client test which was not mocking the TGW client properly, and test the FetchTelemetryConfig function including validations and conversion.

Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This new file contains all logic pertaining to TelemetryConfig, an object that contains telemetry config from HCP.

I settled on the following model for the object:

type TelemetryConfig struct {
 	MetricsConfig *MetricsConfig
 	RefreshConfig *RefreshConfig
 }

In the future, we can add fields here like TraceConfig *TraceConfig or LogsConfig *LogsConfig

I removed the upper level label and endpoint, instead we should perform this conversion and pass around the object as it should be used.

This greatly reduces the complexity of code in deps.go and telemetry_config.go, and removes any duplicated code.

  1. This file contains all conversions that are needed to get a final TelemetryObject, in particular, the default labels setting, the conversion of a string endpoint to a url, etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes:

  1. Init the TelemetryConfigProvider
  2. Remove any parsing that is now unnecessary with the client refactor (url parsing)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now returns an error to test those properly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the tests were cleaned up thanks to the refactor of the URL parsing into the client.
New tests were added for the new TelemetryConfigProvider

Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTELExporter now holds an EndpointProvider from which it obtains the filters and labels to process metrics.

The ConfigProvider can be any struct that conforms to the interface. The EndpointProvider uses the GetEndpoint() of an EndpointProvider.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New TelemetryConfigProvider which holds dynamic configuration and provides labels, filters and the endpoint to consumers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass in the ctx and refreshInterval to avoid needing a rw lock call here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the Get* methods acquire a reader lock and defer the unlock before returning the config field

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only modify config if it has changed. This is verified by comparing the hash of the old vs new config

go.mod Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version to bring in the RefreshConfig changes

@Achooo Achooo force-pushed the cc-4960/hcp-telemetry-periodic-refresh branch from 6151673 to 5a1f5c2 Compare July 25, 2023 19:09
@Achooo Achooo force-pushed the cc-4960/hcp-telemetry-periodic-refresh branch from fa60971 to 7ea6280 Compare August 1, 2023 13:21