[SVLS-7934] feat: Support TLS certificate for trace/stats flusher#961
[SVLS-7934] feat: Support TLS certificate for trace/stats flusher#961
Conversation
bottlecap/src/config/yaml.rs
Outdated
| #[serde(deserialize_with = "deserialize_optional_string")] | ||
| pub http_protocol: Option<String>, | ||
| #[serde(deserialize_with = "deserialize_optional_string")] | ||
| pub ssl_ca_cert: Option<String>, |
There was a problem hiding this comment.
Where does the DD_SSL_CA_CERT environment variable comes from?
There was a problem hiding this comment.
It's handled by this code, which puts the value of env var
DD_SSL_CA_CERT to the field ssl_ca_cert. Is this what you are asking?
There was a problem hiding this comment.
No, I'm asking where does DD_SSL_CA_CERT name convention comes from, does this come from the Datadog Agent config? Does it come from the docs? Did you came up with it?
There was a problem hiding this comment.
Good question. I came up with it. Upon further search, I found tls_ca_cert here:
https://github.com/DataDog/integrations-core/blob/master/http_check/datadog_checks/http_check/data/conf.yaml.example#L477
so I'll rename the env var to DD_SSL_CA_CERT. Does this sound good to you?
There was a problem hiding this comment.
Yeah, although I'd confirm with the Agent team!
There was a problem hiding this comment.
Env var name LGTM from the perspective of agent-configuration team!
There was a problem hiding this comment.
Discussed offline. Changing to DD_TLS_CERT_FILE to be consistent with:
https://github.com/DataDog/datadog-agent/blob/0638dfce1e1f3a9ae336334d4df01cb2a5e35120/pkg/config/setup/config.go#L1410
The same config option has different names in different places. This PR just picks one of them.
duncanista
left a comment
There was a problem hiding this comment.
Great work!
One las thing, do logs/metrics/proxy clients need this change too?
@duncanista Maybe, but somehow I didn't see any error with logs/metrics client in my test. If this or future customer reports such error, we can reproduce it then fix it in a separate PR. |
## Problem A customer reported that their Lambda is behind a proxy, and the Rust-based extension encounters an error when sending logs and metrics to Datadog via the proxy. A previous PR #961 fixed this for traces and stats, but not for other things because the customer and I didn't see any error with them at that time. ## This PR Applies the env var `DD_TLS_CERT_FILE` to logs flusher and proxy flusher as well. Example: `DD_TLS_CERT_FILE=/opt/ca-cert.pem`, so the when the extension flushes logs or proxied data to Datadog, the HTTP client created can load and use this cert, and connect the proxy properly. ## Testing 1. Create a Lambda in a VPC with a proxy EC2 instance. 2. Connect to the proxy instance. With the help of ChatGPT, set up a custom-build nginx with `ngx_http_proxy_connect_module` 3. Save the CA certificate from the proxy server to local machine 4. In the CDK stack, add a layer to the Lambda, which includes the CA certificate `ca-cert.pem` 5. Set env vars: - `DD_TLS_CERT_FILE=/opt/ca-cert.pem` - `DD_PROXY_HTTPS=http://10.0.0.30:3128`, where `10.0.0.30` is the private IP of the proxy EC2 instance - `DD_LOG_LEVEL=debug` 6. Invoke the Lambda ## Result **Before:** Log flushing failed: > DD_EXTENSION | ERROR | LOGS | Failed to send request after 97 ms and 3 attempts: reqwest::Error { kind: Request, url: "https://http-intake.logs.datadoghq.com/api/v2/logs", source: hyper_util::client::legacy::Error(Connect, ConnectFailed(Custom { kind: Other, error: Custom { kind: InvalidData, error: InvalidCertificate(UnknownIssuer) } })) } **After:** No such error ## Next steps Do the same thing for dogstatsd metric flusher. Metric flusher is in a separate repo https://github.com/DataDog/serverless-components, so let's create separate PRs for that change. ## Notes Customer report issue: #919
…er (#984) ## Motivation A customer reported that their Lambda is behind a proxy, and the Rust-based extension encounters an error when sending logs and metrics to Datadog via the proxy. ## Overview Previous PRs in fixed this for traces/stats (#961) and logs/proxy (#979). DataDog/serverless-components#61 in `serverless-components` repo fixed this issue for dogstatsd metrics. This PR upgrades the dependency on `serverless-components` to patch this update. ## Testing #### Steps See #979 #### Result **Before:** Metrics flushing failed: > DD_EXTENSION | DEBUG | Flushing 0 series and 1 distributions DD_EXTENSION | DEBUG | Sending distributions: SketchPayload { ... } DD_EXTENSION | DEBUG | Request to https://api.datadoghq.com/api/beta/sketches took 144ms DD_EXTENSION | ERROR | Error shipping data: None Failed to send request after 3 attempts DD_EXTENSION | ERROR | Failed to flush some metrics due to shipping errors: 0 series and 1 sketches **After:** Metrics flushing succeeded: > DD_EXTENSION | DEBUG | Flushing 0 series and 1 distributions DD_EXTENSION | DEBUG | Sending distributions: SketchPayload { ... } DD_EXTENSION | DEBUG | Request to https://api.datadoghq.com/api/beta/sketches took 619ms DD_EXTENSION | DEBUG | Successfully flushed 0 series and 1 distributions ## Notes Customer report issue: #919
## Problem
A customer reported that their Lambda is behind a proxy, and the
Rust-based extension can't send traces to Datadog via the proxy, while
the previous go-based extension worked.
## This PR
Supports the env var `DD_TLS_CERT_FILE`: The path to a file of
concatenated CA certificates in PEM format.
Example: `DD_TLS_CERT_FILE=/opt/ca-cert.pem`, so the when the extension
flushes traces/stats to Datadog, the HTTP client created can load and
use this cert, and connect the proxy properly.
## Testing
### Steps
1. Create a Lambda in a VPC with an NGINX proxy.
2. Add a layer to the Lambda, which includes the CA certificate
`ca-cert.pem`
3. Set env vars:
- `DD_TLS_CERT_FILE=/opt/ca-cert.pem`
- `DD_PROXY_HTTPS=http://10.0.0.30:3128`, where `10.0.0.30` is the
private IP of the proxy EC2 instance
- `DD_LOG_LEVEL=debug`
4. Update routing rules of security groups so the Lambda can reach
`http://10.0.0.30:3128`
5. Invoke the Lambda
### Result
**Before**
Trace flush failed with error logs:
> DD_EXTENSION | ERROR | Max retries exceeded, returning request error
error=Network error: client error (Connect) attempts=1
DD_EXTENSION | ERROR | TRACES | Request failed: No requests sent
**After**
Trace flush is successful:
> DD_EXTENSION | DEBUG | TRACES | Flushing 1 traces
DD_EXTENSION | DEBUG | TRACES | Added root certificate from
/opt/ca-cert.pem
DD_EXTENSION | DEBUG | TRACES | Proxy connector created with proxy:
Some("http://10.0.0.30:3128")
DD_EXTENSION | DEBUG | Sending with retry
url=https://trace.agent.datadoghq.com/api/v0.2/traces payload_size=1120
max_retries=1
DD_EXTENSION | DEBUG | Received response status=202 Accepted attempt=1
DD_EXTENSION | DEBUG | Request succeeded status=202 Accepted attempts=1
DD_EXTENSION | DEBUG | TRACES | Flushing took 1609 ms
## Notes
This fix only covers trace flusher and stats flusher, which use
`ServerlessTraceFlusher::get_http_client()` to create the HTTP client.
It doesn't cover logs flusher and proxy flusher, which use a different
function (http.rs:get_client()) to create the HTTP client. However, logs
flushing was successful in my tests, even if no certificate was added.
We can come back to logs/proxy flusher if someone reports an error.
Problem
A customer reported that their Lambda is behind a proxy, and the Rust-based extension can't send traces to Datadog via the proxy, while the previous go-based extension worked.
This PR
Supports the env var
DD_TLS_CERT_FILE: The path to a file of concatenated CA certificates in PEM format.Example:
DD_TLS_CERT_FILE=/opt/ca-cert.pem, so the when the extension flushes traces/stats to Datadog, the HTTP client created can load and use this cert, and connect the proxy properly.Testing
Steps
ca-cert.pemDD_TLS_CERT_FILE=/opt/ca-cert.pemDD_PROXY_HTTPS=http://10.0.0.30:3128, where10.0.0.30is the private IP of the proxy EC2 instanceDD_LOG_LEVEL=debughttp://10.0.0.30:3128Result
Before
Trace flush failed with error logs:
After
Trace flush is successful:
Notes
This fix only covers trace flusher and stats flusher, which use
ServerlessTraceFlusher::get_http_client()to create the HTTP client. It doesn't cover logs flusher and proxy flusher, which use a different function (http.rs:get_client()) to create the HTTP client. However, logs flushing was successful in my tests, even if no certificate was added. We can come back to logs/proxy flusher if someone reports an error.