Add throughput testing infrastructure and fix Python telemetry performance#2742

Merged
jmthomas merged 15 commits into main from cmd_tlm_test
Jan 29, 2026

Conversation

@jmthomas
Member

@jmthomas jmthomas commented Jan 18, 2026

Summary

This PR adds comprehensive throughput testing infrastructure and fixes a critical Python telemetry performance bottleneck, improving Python throughput by 10x.

Key Changes

  • Fix Python telemetry throughput bottleneck - JSONPath caching in JsonAccessor improves throughput from ~320 Hz to 3,545 Hz
  • Add throughput testing server - Standalone TCP/IP server for measuring COSMOS command/telemetry throughput
  • Add fire-and-forget command mode - Skip ACK waiting when timeout <= 0 for high-throughput scenarios
  • Add packet caching in TargetModel - Thread-safe caching with 10-second timeout reduces Redis lookups
  • Add UPDATE_INTERVAL tests - Verify queued writes functionality in both Ruby and Python

Performance Results

| Metric           | Before    | After     | Improvement |
|------------------|-----------|-----------|-------------|
| Python telemetry | ~320 Hz   | 3,545 Hz  | 10x         |
| Ruby telemetry   | ~2,700 Hz | ~2,700 Hz | baseline    |

Python now outperforms Ruby at high telemetry rates while maintaining zero packet loss.

Root Cause Analysis

The Python performance issue was caused by jsonpath_ng.parse() recompiling JSONPath expressions on every call (~2.7ms per call). When identifying packets in unique_id_mode, this caused ~5.7ms overhead per packet. Adding lru_cache to cache parsed expressions reduced this to ~0.12µs (23,000x speedup per call).
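
The fix is essentially memoizing the parse step. Here is a minimal sketch of the pattern, using a stand-in for `jsonpath_ng.parse` (the helper names are illustrative, not the actual `JsonAccessor` code):

```python
from functools import lru_cache

# Stand-in for jsonpath_ng.parse, which recompiles the expression on
# every call (~2.7 ms each in the measured case). Any deterministic,
# hashable-argument function can be cached the same way.
def expensive_parse(path: str):
    return tuple(path.split("."))  # placeholder for the compiled expression

@lru_cache(maxsize=128)
def parse_jsonpath(path: str):
    # First call per unique path pays the full parse cost; every repeat
    # is served from the cache at dictionary-lookup speed.
    return expensive_parse(path)

for _ in range(1000):
    parse_jsonpath("$.packet.id_value")

info = parse_jsonpath.cache_info()
print(info.hits, info.misses)  # 999 hits, 1 miss
```

Because packet identification re-parses the same small set of JSONPath expressions on every packet, even a small `maxsize` gives a near-100% hit rate.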

Files Changed

Performance Fixes:

  • openc3/python/openc3/accessors/json_accessor.py - Add JSONPath caching with lru_cache
  • openc3/python/openc3/microservices/interface_microservice.py - Fix missing self.queued initialization
  • openc3/python/pyproject.toml - Add orjson as optional dependency

Throughput Testing Infrastructure:

  • examples/throughput_server/ - New standalone throughput testing server
  • openc3-cosmos-demo/targets/INST/procedures/throughput_test.rb - Ruby throughput test
  • openc3-cosmos-demo/targets/INST2/procedures/throughput_test.py - Python throughput test
  • openc3-cosmos-demo/targets/*/screens/throughput.txt - Throughput monitoring screens

Command/Telemetry Optimizations:

  • openc3/lib/openc3/topics/command_topic.rb - Fire-and-forget mode
  • openc3/python/openc3/topics/command_topic.py - Fire-and-forget mode
  • openc3/lib/openc3/models/target_model.rb - Packet caching
  • openc3/python/openc3/models/target_model.py - Packet caching
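
The fire-and-forget convention (timeout <= 0 skips ACK waiting) can be sketched roughly as follows; `write_fn` and `ack_queue` are hypothetical stand-ins, not the actual `CommandTopic` API:

```python
import queue

ACK_TIMEOUT_DEFAULT = 5.0

def send_command(write_fn, ack_queue, command, timeout=ACK_TIMEOUT_DEFAULT):
    """Write the command, then wait for an ACK only when timeout is
    positive. timeout <= 0 means fire-and-forget (no ACK wait)."""
    write_fn(command)
    if timeout is not None and timeout <= 0:
        return None  # fire-and-forget: skip the round-trip entirely
    try:
        return ack_queue.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"No ACK within {timeout}s for {command!r}")
```

In burst scenarios the ACK round-trip dominates per-command latency, which is why skipping it raises command throughput.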

Test Coverage:

  • openc3/spec/microservices/interface_microservice_spec.rb - UPDATE_INTERVAL test
  • openc3/python/test/microservices/test_interface_microservice.py - UPDATE_INTERVAL tests
  • openc3/spec/models/target_model_spec.rb - Packet caching tests
  • openc3/python/test/models/test_target_model.py - Packet caching tests

Test plan

  • All Python protocol tests pass (211 tests)
  • Ruby interface_microservice tests pass (15 tests)
  • Python interface_microservice tests pass (10 tests)
  • Throughput tests verified with throughput_server
  • Manual testing with DEMO plugin

🤖 Generated with Claude Code

jmthomas and others added 6 commits January 16, 2026 12:00
Throughput Server (examples/throughput_server/):
- Standalone TCP/IP server for measuring COSMOS command/telemetry throughput
- Dual-port operation for INST (7778) and INST2 (7780) targets
- CCSDS packet encoding/decoding with configurable streaming rates
- Time-compensated streaming to maintain accurate rates up to 100kHz
- Pre-allocated buffers for minimal allocation in hot paths
- Raw TCP rate test scripts (Ruby/Python) achieving ~300-500k cmd/s

DEMO Plugin Changes:
- Add THROUGHPUT_STATUS telemetry packet with rate/count metrics
- Add throughput commands: START_STREAM, STOP_STREAM, GET_STATS, RESET_STATS
- Add throughput_test procedures for INST (Ruby) and INST2 (Python)
- Add throughput screen for real-time monitoring
- Add plugin variables to toggle between simulator and throughput server
- Configure LengthProtocol for CCSDS packet framing when using throughput server

Co-Authored-By: Claude Opus 4.5 <[email protected]>
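The time-compensated streaming mentioned above can be sketched like this: sleep until an absolute deadline rather than for a fixed interval, so per-iteration overhead does not accumulate as drift (a hypothetical helper, not the actual server code):

```python
import time

def stream(send, rate_hz: float, count: int):
    """Send `count` packets at `rate_hz`, scheduling each send against
    an absolute deadline derived from the start time. A naive
    sleep(period) loop drifts because send/loop overhead adds to every
    period; deadline-based sleeping self-corrects."""
    period = 1.0 / rate_hz
    start = time.perf_counter()
    for i in range(count):
        send(i)
        deadline = start + (i + 1) * period
        delay = deadline - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
```

At very high rates (toward 100 kHz) `time.sleep` granularity becomes the limit and a busy-wait on the deadline would be needed instead.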
- Add fire-and-forget mode to CommandTopic.send_command when timeout <= 0
  to skip ACK waiting for high-throughput command scenarios
- Add thread-safe packet caching in TargetModel with 10-second timeout
  to reduce Redis lookups for repeated packet access
- Cache is automatically invalidated when set_packet is called
- Add unit tests for packet caching in both Ruby and Python

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Root cause: jsonpath_ng.parse() was recompiling JSONPath expressions on
every call, taking ~2.7ms per call. When identifying packets in
unique_id_mode (required for INST2 due to mixed CCSDS/JSON packet types),
this caused ~5.7ms overhead per packet, limiting throughput to ~320 Hz.

Changes:
- Add lru_cache to JSONPath parsing in JsonAccessor (103x speedup)
- Add orjson as optional dependency for faster JSON parsing
- Fix missing self.queued initialization in Python interface_microservice
- Update throughput test scripts for both Ruby and Python

Results:
- Python telemetry: 320 Hz → 3,545 Hz (10x improvement)
- Python now outperforms Ruby at high rates (3,545 Hz vs 2,627 Hz)
- Zero packet loss maintained at all tested rates

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Revert the bytearray optimization in Python protocols to maintain
backward compatibility for custom protocol implementations. The change
from bytes to bytearray could break user code that:
- Type checks self.data expecting bytes
- Relies on immutability of self.data
- Uses bytes-specific operations

The JSONPath caching fix (the real 10x performance improvement) remains
intact.

Also adds tests for UPDATE_INTERVAL option in both Ruby and Python
interface_microservice to verify the queued writes functionality.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@codecov

codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 88.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.69%. Comparing base (0647fab) to head (7f83f8e).
⚠️ Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
openc3/lib/openc3/topics/command_topic.rb 50.00% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2742   +/-   ##
=======================================
  Coverage   78.68%   78.69%           
=======================================
  Files         671      671           
  Lines       54738    54796   +58     
  Branches      731      731           
=======================================
+ Hits        43072    43122   +50     
- Misses      11586    11594    +8     
  Partials       80       80           
Flag Coverage Δ
python 80.32% <ø> (+<0.01%) ⬆️
ruby-api 82.68% <ø> (ø)
ruby-backend 81.80% <88.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

```diff
 @interface.options.each do |option_name, option_values|
-  if option_name.upcase == 'OPTIMIZE_THROUGHPUT'
+  # OPTIMIZE_THROUGHPUT was changed to UPDATE_INTERVAL to better represent the setting
+  if option_name.upcase == 'UPDATE_INTERVAL' or option_name.upcase == 'OPTIMIZE_THROUGHPUT'
```
Member Author


This was just a bug ... missing keyword that we already changed in Python and in the docs

```diff
 while True:  # Loop until we get some data
     try:
-        data = self.read_socket.recv(4096, socket.MSG_DONTWAIT)
+        data = self.read_socket.recv(65535, socket.MSG_DONTWAIT)
```
Member Author


Not sure how much of an optimization this is, but it matches Ruby.

@jmthomas
Member Author

jmthomas commented Jan 18, 2026

Results calculated on my MacBook Pro M3 Max with 36GB RAM, with a Docker CPU limit of 10 and a memory limit of 16GB.

Ruby results:

2026/01/18 02:35:28.545 (throughput_test.rb:183): Command Throughput:
2026/01/18 02:35:28.545 (throughput_test.rb:184):   100 cmd burst:  442.5 cmd/s
2026/01/18 02:35:28.546 (throughput_test.rb:185):   500 cmd burst:  437.4 cmd/s
2026/01/18 02:35:28.546 (throughput_test.rb:186):   1000 cmd burst: 421.2 cmd/s
2026/01/18 02:35:28.547 (throughput_test.rb:188): 
2026/01/18 02:35:28.547 (throughput_test.rb:188): Telemetry Throughput:
2026/01/18 02:35:28.547 (throughput_test.rb:189):   100 Hz target:   99.0 Hz (0.0% loss)
2026/01/18 02:35:28.547 (throughput_test.rb:190):   1000 Hz target:  977.8 Hz (0.0% loss)
2026/01/18 02:35:28.547 (throughput_test.rb:191):   2000 Hz target:  1979.6 Hz (0.0% loss)
2026/01/18 02:35:28.547 (throughput_test.rb:192):   3000 Hz target:  2762.2 Hz (0% loss)
2026/01/18 02:35:28.548 (throughput_test.rb:193):   4000 Hz target:  2626.8 Hz (0% loss)

Python results:

2026-01-18T02:44:32.696048Z (throughput_test.py:191): Command Throughput:
2026-01-18T02:44:32.696222Z (throughput_test.py:192):   100 cmd burst:  586.0 cmd/s
2026-01-18T02:44:32.696430Z (throughput_test.py:193):   500 cmd burst:  623.3 cmd/s
2026-01-18T02:44:32.696663Z (throughput_test.py:194):   1000 cmd burst: 541.6 cmd/s
2026-01-18T02:44:32.696972Z (throughput_test.py:196): 
2026-01-18T02:44:32.696972Z (throughput_test.py:196): Telemetry Throughput:
2026-01-18T02:44:32.697271Z (throughput_test.py:197):   100 Hz target:   98.8 Hz (0% loss)
2026-01-18T02:44:32.697677Z (throughput_test.py:200):   1000 Hz target:  984.4 Hz (0% loss)
2026-01-18T02:44:32.697970Z (throughput_test.py:203):   2000 Hz target:  1978.2 Hz (0% loss)
2026-01-18T02:44:32.698207Z (throughput_test.py:206):   3000 Hz target:  2955.0 Hz (0% loss)
2026-01-18T02:44:32.699203Z (throughput_test.py:209):   4000 Hz target:  3460.0 Hz (0% loss)

@jmthomas jmthomas requested review from clayandgen and ryanmelt and removed request for ryanmelt January 20, 2026 15:32
```python
# This provides a ~1000x speedup for repeated accesses (2.7ms -> 2.7µs)
@lru_cache(maxsize=256)
def _parse_jsonpath(path):
    return parse(path)
```
Member Author


This was the huge win for Python telemetry performance ... it only affects JSON and CBOR but it affects everything if you have at least 1 JSON or CBOR packet defined.

@jmthomas
Member Author

The packet caching I added to target_model has a measurable impact on commanding. Python sees a 30-80% improvement and Ruby shows a 15-20% improvement. The caching optimization primarily benefits command operations where repeated packet lookups occur during burst sending.
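
A minimal sketch of a thread-safe TTL cache of the kind described (the class and method names here are hypothetical, not the actual TargetModel implementation):

```python
import threading
import time

class PacketCache:
    """Cache values for up to TIMEOUT seconds; stale entries fall
    through to the fetch function (e.g. the Redis lookup)."""
    TIMEOUT = 10.0

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}  # key -> (value, stored_at)

    def get(self, key, fetch):
        now = time.monotonic()
        with self._lock:
            entry = self._cache.get(key)
            if entry is not None and now - entry[1] < self.TIMEOUT:
                return entry[0]
        # Fetch outside the lock; a concurrent miss may double-fetch,
        # which is harmless for read-mostly data.
        value = fetch(key)
        with self._lock:
            self._cache[key] = (value, time.monotonic())
        return value

    def invalidate(self, key=None):
        """Drop one entry, or everything (e.g. when set_packet runs)."""
        with self._lock:
            if key is None:
                self._cache.clear()
            else:
                self._cache.pop(key, None)
```

Invalidating on write keeps the cache coherent, and the 10-second TTL bounds staleness for updates made outside the caching process.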

@jmthomas
Member Author

Comparing these changes (plus other previous changes) from 6.10.4 to now:

With only binary CCSDS telemetry (no JSON, CBOR, XML, HTML):

  • 2.2x - 2.6x improvement in Python command throughput
  • 1.2x - 1.3x improvement in Ruby command throughput
  • Slight regressions at the highest telemetry rates (most likely other factors)

With full telemetry (JSON, CBOR, XML, HTML):

  • 5x improvement in Python telemetry due to the JSONPath caching
  • Slight regressions at the highest telemetry rates (most likely other factors)

Contributor

@clayandgen clayandgen left a comment


Going to run it myself shortly, in the meantime, reminder to remove the cache statistics data!

Contributor

@clayandgen clayandgen left a comment


Converting the Throughput scripts to a suite could be nice for the "Test Results" formatting!

Results on an Apple M4 Max with 36GB RAM. Python commanding seems consistently higher performance than on the M3 Max; Ruby results are comparable 👍

============================================================
SUMMARY (Ruby)
============================================================
Command Throughput:
	100 cmd burst:  421.4 cmd/s
	500 cmd burst:  455.5 cmd/s
	1000 cmd burst: 486.5 cmd/s
 
Telemetry Throughput:
	100 Hz target:   100.0 Hz (0.0% loss)
	1000 Hz target:  1000.8 Hz (0.0% loss)
	2000 Hz target:  1959.8 Hz (0.0% loss)
	3000 Hz target:  2958.2 Hz (9.99% loss)
	4000 Hz target:  2940.6 Hz (0% loss)
	5000 Hz target:  2840.4 Hz (0% loss)


============================================================
SUMMARY (Python)
============================================================

Command Throughput:
	100 cmd burst:  705.4 cmd/s
	500 cmd burst:  664.8 cmd/s
	1000 cmd burst: 724.0 cmd/s

Telemetry Throughput:
	100 Hz target:   99.0 Hz (0% loss)
	1000 Hz target:  976.4 Hz (0% loss)
	2000 Hz target:  1963.6 Hz (0% loss)
	3000 Hz target:  2949.6 Hz (0% loss)
	4000 Hz target:  3895.6 Hz (0% loss)
	5000 Hz target:  3788.0 Hz (0% loss)


1. Start the throughput server:
```bash
python throughput_server.py
```
Contributor


nit: python examples/throughput_server/throughput_server.py

```ruby
packets_received = final_cosmos_count - initial_cosmos_count

# Calculate actual rate from test data (more accurate than server's TLM_SENT_RATE which is stale)
actual_rate = packets_sent.to_f / duration
```
Contributor


One observation: the Command test uses the actual system clock time (Time.now) as part of the calculation, whereas the Telemetry test uses the measured duration. The results are probably arbitrarily close, but I thought I'd note the discrepancy.
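
Either way, the derived metrics come down to the same arithmetic; a tiny sketch of the rate and loss calculation (hypothetical helper, not the actual test script):

```python
def throughput_stats(packets_sent: int, packets_received: int, duration_s: float):
    """Actual send rate (packets/s) and percentage packet loss
    for one test run."""
    actual_rate = packets_sent / duration_s
    loss_pct = 100.0 * (packets_sent - packets_received) / packets_sent
    return actual_rate, loss_pct

print(throughput_stats(1000, 990, 2.0))  # (500.0, 1.0)
```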

@sonarqubecloud

Quality Gate failed

Failed conditions
12 New issues

See analysis details on SonarQube Cloud


@jmthomas jmthomas merged commit a4ba7aa into main Jan 29, 2026
47 of 49 checks passed
@jmthomas jmthomas deleted the cmd_tlm_test branch January 29, 2026 01:42
jmthomas added a commit that referenced this pull request Mar 21, 2026
Add throughput testing infrastructure and fix Python telemetry performance
