
Add Kafka Connect as a built‑in JMX metrics target#15561

Merged
laurit merged 36 commits into open-telemetry:main from aaaugustine29:main
Mar 10, 2026

Conversation

@aaaugustine29
Contributor

Overview:
This change introduces Kafka Connect as a first‑class JMX target system in the JMX metrics library. It adds a ruleset and documentation that cover both Apache Kafka Connect and Confluent Platform variants from the outset, so users can enable Kafka Connect monitoring without custom YAML.

Details:
Added kafka-connect.yaml JMX rules that map worker, rebalance, connector, task, source/sink task, and task-error MBeans into OpenTelemetry metrics, including Apache‑only metrics (e.g., worker rebalance protocol, per‑connector task counts, predicate/transform metadata, converter metadata, source transaction sizes, sink record lag max).
Defined connector and task status as state metrics using the superset of status values across Apache and Confluent, to avoid vendor‑specific enum mismatches.
Documented the new target in kafka-connect.md, including metric groups, attributes, and the dual‑vendor compatibility model (no renames; Apache list as a superset of Confluent docs).
Added self‑contained tests for the Kafka Connect rules that load the YAML, build metric definitions, and validate key state mappings and metric presence, ensuring the new target is ready to consume from day one.
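For readers unfamiliar with the jmx-metrics YAML rule format, here is a minimal sketch of what such a ruleset looks like. The bean name and mapping shown are illustrative only, not the exact shipped rules; see kafka-connect.yaml in this PR for the real definitions.

```yaml
# Illustrative excerpt only; field names follow the jmx-metrics rule format.
rules:
  - bean: kafka.connect:type=connect-worker-metrics
    prefix: kafka.connect.worker.
    mapping:
      task-count:
        metric: task.count
        type: updowncounter
        unit: "{task}"
        desc: The number of tasks running in this worker.
```

Each rule matches an MBean ObjectName pattern and maps its JMX attributes (here `task-count`) to OpenTelemetry metrics with a type, unit, and description.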

Testing:
./gradlew -Dorg.gradle.configuration-cache.parallel=false instrumentation:jmx-metrics:library:test

@aaaugustine29 requested a review from a team as a code owner December 6, 2025 18:08
@linux-foundation-easycla

linux-foundation-easycla Bot commented Dec 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@laurit
Contributor

laurit commented Dec 8, 2025

@SylvainJuge could you review this?

Contributor

@SylvainJuge left a comment


Hi @aaaugustine29, thanks for opening this!

There are quite a lot of metrics added here, which makes it quite challenging to review them all.

I don't have any expertise in Kafka Connect, so you are probably more knowledgeable here.

I would suggest:

  • implementing tests with a real instance of the target system, ideally both the Apache and Confluent variants
  • as a first step, focusing on the "essential" metrics rather than including everything that is available; this is where your knowledge might be useful
  • simplifying as much as possible by using metric attributes to provide a breakdown where the metrics represent a partition (for example on state).
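A sketch of that consolidation, using the attribute-mapping style quoted elsewhere in this thread. The bean, attribute names, and state keys are illustrative assumptions about the library's state-metric syntax, not the PR's actual rules:

```yaml
# Illustrative: one status metric broken down by a state attribute, with
# value lists covering both Apache and Confluent spellings, instead of a
# separate metric per status value.
rules:
  - bean: kafka.connect:type=connector-metrics,connector=*
    metricAttribute:
      kafka.connect.connector: param(connector)
    mapping:
      status:
        metric: kafka.connect.connector.status
        type: state
        desc: The status of the connector.
        metricAttribute:
          kafka.connect.connector.state:
            running: [running, RUNNING]
            failed: [failed, FAILED]
            unknown: "*"
```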

Five outdated comment threads on instrumentation/jmx-metrics/library/kafka-connect.md
@aaaugustine29
Contributor Author

aaaugustine29 commented Dec 8, 2025

@SylvainJuge Thanks for your help and guidance. At this point, the metrics have been reduced to the minimum set without losing any information. That said, it doesn't mean we need to keep everything. In particular, your previous comment points to an opportunity to consolidate some of them with metric attributes, though that would lose information for a niche, advanced group of users. What's your guidance on this?

And to clarify your comment about testing: tests that actually instantiate a Kafka Connect cluster would be very heavy. I could emulate what the Apache JMX server would produce; would that be sufficient?

@aaaugustine29
Contributor Author

Hi @SylvainJuge, I hope you had a good holiday season. I've tried to thematically include metric attributes and yaml tags where possible. Do you have any more comments or concerns?

@aaaugustine29
Contributor Author

Hi @SylvainJuge, I was wondering if you had any time to take a look at the current state of the PR?

@aaaugustine29
Contributor Author

(Quoting @SylvainJuge's review:)

Hi @aaaugustine29, sorry I hadn't had time to dedicate to reviewing this PR.

To properly validate the metrics, this PR is still lacking a real end-to-end test with a Kafka Connect instance. Without this we have no guarantee that any of the metrics described here will actually be properly captured when deployed on a real instance, which makes it very hard to approve for anyone without significant Kafka (and Kafka Connect) knowledge and experience.

Another thing here is that there are lots of metrics that are "derived" from other metrics; we should not capture these, as they should be computed on the backend. I would suggest re-evaluating the relevance of all the metrics with .max, .avg, .rate, or .min in their names. The goal of these pre-defined metrics is to capture the essential metrics for a given system, not all the possible metrics exposed; otherwise this will create lots of noise and will likely require users to discard most metrics or maintain their own metric definitions.

Hi @SylvainJuge,
Testing has been updated to instantiate a container that can emit the metrics. Only a few metrics can't be tested without instantiating multiple heavier containers; I can do that if you'd prefer.

I've also removed some derivable metrics, please give it a look whenever you get a chance. Thanks!

@aaaugustine29
Contributor Author

Hi @SylvainJuge, just checking in to see if you have had any time to review this. I think it should be in a good state.

@SylvainJuge
Contributor

Hi @aaaugustine29, I'm currently waiting for #16212 to be merged first. I think it provides a few hints and general recommendations that would be worth following for this type of PR that adds new JMX metrics.

@laurit laurit added this to the v2.26.0 milestone Feb 27, 2026
metric: task.count
type: updowncounter
unit: "{task}"
desc: The number of tasks run in this worker.
Contributor


This description is a bit unclear to me. Is it the total number of tasks that were run since startup? If so, then maybe it can be calculated by the backend as a sum of data points of kafka.connect.worker.task.startup? Also, in that case the type should be counter.

If this is the number of currently running tasks, then the description should reflect that clearly.

Contributor Author


It's the number of tasks (processes related to a type of connector) running on a worker (node). The number of tasks can go up and down depending on the rebalancing of that worker, but actually seeing that number go up or down in a useful way is a pretty advanced use case of Kafka Connect.

Contributor


So shouldn't this description look like this?
The number of tasks running in this worker.

@aaaugustine29
Contributor Author

Hi @robsunday, I really appreciate you taking the time to thoroughly review this PR. I have updated everything according to your comments, other than the two that I left responses to. I'm happy to change those after your response though. Thank you again!



MetricsVerifier verifier = MetricsVerifier.create().disableStrictMode();
for (String metricName : metricNames) {
  verifier.add(metricName, metric -> {});
Contributor


I assume you are going to fill this method with metrics verification, like it is done for other targets.
If needed, you will soon also be able to verify the values of retrieved metrics (see this PR).

restarting: [restarting, RESTARTING]
destroyed: [destroyed, DESTROYED]
unknown: "*"
# kafka.connect.task.class
Contributor


Please remove it

# kafka.connect.worker.rebalance.protocol
connect-protocol:
metric: protocol
type: updowncounter
Contributor


Is updowncounter a good type for protocol? I think gauge is more appropriate.
The question is whether this metric is really needed. Possibly it is just a configuration setting that will not change, but this is just my guess.

Contributor

@laurit Mar 4, 2026


@SylvainJuge should this be a state metric?

Contributor


Couldn't find type: state used in any of the existing yamls. Updowncounter vs gauge can often be decided by asking whether adding up the values for different attributes makes sense.
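That heuristic can be illustrated with two hypothetical mappings (not the PR's actual rules): task counts add up meaningfully across connectors, while a time-fraction does not:

```yaml
# Hypothetical mapping entries, for illustration of the heuristic only.
mapping:
  connector-total-task-count:
    metric: task.count          # summing across connectors is meaningful
    type: updowncounter
    unit: "{task}"
  running-ratio:
    metric: task.running.ratio  # summing fractions is meaningless
    type: gauge
    unit: "1"
```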

Contributor


@laurit you are right, there is currently no usage of state metrics in the metrics definitions that are included in this project.

I haven't looked closely, but having this defined as a state metric means that the value could change over the lifetime of the process.

Also worth asking about is the relevance of this metric: for example, what information does it provide about the health or state of the system? The description seems to indicate it's more a property or result of the cluster configuration; in that case I don't think it would be relevant to capture it as a metric, as it would be constant most of the time.

# kafka.connect.connector.status
status:
metric: status
type: updowncounter
Contributor


I think it should be gauge

metricAttribute:
kafka.connect.connector: param(connector)
mapping:
# kafka.connect.connector.status
Contributor


What is the unit here?

type: gauge
unit: "1"
desc: The fraction of time this task has spent in the running state.
# kafka.connect.task.status
Contributor


Please specify unit.

# kafka.connect.task.status
status:
metric: status
type: updowncounter
Contributor


Rather gauge?

rule -> {
  assertThat(rule.getMetricType())
      .isNotEqualTo(
          io.opentelemetry.instrumentation.jmx.internal.engine.MetricInfo.Type.STATE);
Contributor


could import io.opentelemetry.instrumentation.jmx.internal.engine.MetricInfo


@Test
void kafkaConnectRulesUseBasicMetricTypes() throws Exception {
io.opentelemetry.instrumentation.jmx.internal.yaml.JmxConfig config = loadKafkaConnectConfig();
Contributor


could import io.opentelemetry.instrumentation.jmx.internal.yaml.JmxConfig

Comment on lines +394 to +398
try {
  stream.close();
} catch (IOException ignored) {
  // best effort cleanup
}
Contributor


you could try using try-with-resources for the stream to avoid handling close here
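A minimal sketch of the suggestion, with a hypothetical readAll helper standing in for the test code (not the PR's actual method names). The try-with-resources statement closes the stream automatically, so the explicit close() inside a catch block is no longer needed:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TryWithResourcesDemo {

  // The stream is closed automatically when the try block exits,
  // even if readAllBytes() throws, so no manual cleanup is required.
  public static String readAll(InputStream in) throws IOException {
    try (InputStream stream = in) {
      return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    String content =
        readAll(new ByteArrayInputStream("rules: []".getBytes(StandardCharsets.UTF_8)));
    System.out.println(content);
  }
}
```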

});
}

private static HttpResponseData sendRequest(String method, String url, String body)
Contributor


Alternatively, you could consider using the armeria http client that is used in other tests. It could be a bit easier than using the HTTP URL connection.

import io.opentelemetry.testing.internal.armeria.client.WebClient;
import io.opentelemetry.testing.internal.armeria.common.AggregatedHttpRequest;
import io.opentelemetry.testing.internal.armeria.common.AggregatedHttpResponse;
import io.opentelemetry.testing.internal.armeria.common.HttpMethod;
import io.opentelemetry.testing.internal.armeria.common.MediaType;

private static final WebClient client = WebClient.of();

  private static void createConnector(String connectUrl, String connectorConfigJson) {
    AggregatedHttpResponse response =
        sendRequest(HttpMethod.POST, connectUrl + "/connectors", connectorConfigJson);
    assertThat(response.status().code()).isIn(200, 201, 409);
  }

  private static void awaitConnectorRunning(String connectUrl, String connectorName) {
    await()
        .atMost(Duration.ofMinutes(2))
        .pollInterval(Duration.ofSeconds(1))
        .untilAsserted(
            () -> {
              AggregatedHttpResponse response =
                  sendRequest(HttpMethod.GET, connectUrl + "/connectors/" + connectorName + "/status", null);
              assertThat(response.status().code()).isEqualTo(200);
              assertThat(response.contentUtf8()).contains("\"state\":\"RUNNING\"");
            });
  }

  private static AggregatedHttpResponse sendRequest(HttpMethod method, String url, String body) {
    AggregatedHttpRequest request =
        body != null
            ? AggregatedHttpRequest.of(method, url, MediaType.JSON, body)
            : AggregatedHttpRequest.of(method, url);
    return client.execute(request).aggregate().join();
  }

assertThat(metric.getMetricType())
    .isNotEqualTo(
        io.opentelemetry.instrumentation.jmx.internal.engine.MetricInfo.Type.STATE));
Contributor


curious why using state metrics would be a problem?

@laurit laurit merged commit 21432cf into open-telemetry:main Mar 10, 2026
93 checks passed
@otelbot
Contributor

otelbot Bot commented Mar 10, 2026

Thank you for your contribution @aaaugustine29! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey.

