Performance Optimization
HORUS is already fast by default. This guide helps you squeeze out extra performance when needed.
Cross-Platform Philosophy
HORUS is designed for development on any OS with production deployment on Linux:
| Phase | Supported Platforms | Performance |
|---|---|---|
| Development | Windows, macOS, Linux | Good (standard IPC) |
| Testing | Windows, macOS, Linux | Good (standard IPC) |
| Production | Linux (recommended) | Best (sub-100ns with RT) |
All performance features use graceful degradation - your code runs everywhere, with maximum performance on Linux. Advanced features like RtConfig (SCHED_FIFO, mlockall) and SIMD acceleration automatically fall back to safe defaults on unsupported platforms.
Why HORUS is Fast
Shared Memory Architecture
Zero network overhead: Data written to shared memory, read directly by subscribers
| Platform | Shared Memory Backend |
|---|---|
| Linux | /dev/shm (tmpfs) |
| macOS | POSIX shm (shm_open) |
| Windows | Named Shared Memory (CreateFileMapping) |
Zero serialization: Fixed-size structs copied directly to shared memory
Zero-copy loan pattern: Publishers write directly to shared memory slots
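The loan pattern above can be sketched roughly as follows. Note that the names and signatures here are illustrative, not the actual HORUS API: the publisher borrows a slot, writes in place, then commits, with no intermediate buffer and no extra copy.

```rust
// Hypothetical sketch of a loan-style publish API (not HORUS's
// actual signatures).
struct Slot<'a, T> {
    data: &'a mut T,
}

impl<'a, T> Slot<'a, T> {
    fn commit(self) {
        // A real implementation would publish the slot's sequence
        // number here so subscribers can see the new data.
    }
}

struct Publisher<T> {
    // One pre-allocated slot standing in for a shared-memory ring.
    storage: Box<T>,
}

impl<T: Default> Publisher<T> {
    fn new() -> Self {
        Publisher { storage: Box::new(T::default()) }
    }

    fn loan(&mut self) -> Slot<'_, T> {
        Slot { data: &mut *self.storage }
    }
}

fn main() {
    let mut publisher: Publisher<[f32; 4]> = Publisher::new();
    let slot = publisher.loan();
    slot.data[0] = 1.0; // write directly into the slot
    slot.commit();      // make it visible to subscribers
    assert_eq!(publisher.storage[0], 1.0);
}
```

The key property is that the message is constructed in its final location, so publishing is just a pointer handoff.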
Cache-Optimized Structures
64-byte alignment: Ring buffer headers are cache-line aligned to prevent false sharing
#[repr(C, align(64))] // Cache-line aligned
pub struct RingBufferHeader {
// Producer (head) and consumer (tail) on SEPARATE cache lines
}
Padding prevention: False sharing eliminated with explicit padding between producer and consumer fields
Atomic operations: Lock-free operations with appropriate memory ordering
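Putting the three points above together, a minimal sketch of such a header (field names and exact layout are illustrative, not HORUS's actual struct) looks like this:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative sketch: head and tail live on separate 64-byte cache
// lines, so a producer updating head never invalidates the consumer's
// cache line (and vice versa).
#[repr(C, align(64))]
struct RingBufferHeader {
    head: AtomicU64, // producer cursor, offset 0
    _pad: [u8; 56],  // pad out the rest of the first cache line
    tail: AtomicU64, // consumer cursor, offset 64
}

fn main() {
    let h = RingBufferHeader {
        head: AtomicU64::new(0),
        _pad: [0; 56],
        tail: AtomicU64::new(0),
    };
    // Lock-free update with release/acquire ordering
    h.head.store(1, Ordering::Release);
    assert_eq!(h.head.load(Ordering::Acquire), 1);

    // Verify head and tail really land on different cache lines
    let base = &h as *const _ as usize;
    let tail_off = &h.tail as *const _ as usize - base;
    assert_eq!(tail_off, 64);
}
```

With `repr(C)` the field order is fixed, so the 56-byte pad guarantees `tail` starts exactly one cache line after `head`.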
Wait-Free & Lock-Free Operations
Wait-free Topic (SPSC): 87ns send-only latency - no CAS loops, bounded constant time
Lock-free Topic (MPMC): 313ns send-only latency - CAS-based for multi-producer coordination
Per-consumer tracking: Each subscriber maintains independent position
Benchmark Results
Measured Latency
Measurement Note: All latencies are send-only (one-direction publish). For round-trip (send+receive), approximately double these values.
| Message Type | Size | MPMC (send-only) | SPSC (send-only) | ROS2 DDS | Speedup |
|---|---|---|---|---|---|
| CmdVel | 16B | ~313ns | 87ns | 50-100µs | 230-575x |
| IMU | 304B | ~500ns | ~160ns | 80-150µs | 160-940x |
| LaserScan | 1.5KB | ~2.2µs | ~400ns | 150-300µs | 68-750x |
| PointCloud | 120KB | ~360µs | ~120µs | 500µs-1ms | 1.4-8x |
Key insight: Latency scales roughly linearly with message size, so keep messages small.
Throughput
HORUS can handle:
- 12M+ messages/second for small messages (16B) with Topic (SPSC)
- 3M+ messages/second for small messages (16B) with Topic (MPMC)
- 1M+ messages/second for medium messages (1KB)
- 100K+ messages/second for large messages (100KB)
Build Optimization
Always Use Release Mode
Debug builds are 10-100x slower:
# SLOW: Debug build
horus run
# FAST: Release build
horus run --release
Why it matters:
- Debug: 50µs per tick
- Release: 500ns per tick
- 100x difference for the same code
Link-Time Optimization (LTO)
Enable LTO in your Cargo.toml for additional 10-20% speedup:
# Cargo.toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
Warning: Slower compilation, but faster execution.
Target CPU Features
CPU-Specific Optimizations:
HORUS compiles with standard Rust release-mode optimizations. The framework is tuned for modern x86-64 and ARM64 processors; for further CPU-specific tuning, see the SIMD Acceleration section below.
Gains: 5-15% from CPU-specific SIMD instructions (automatically enabled in release builds).
SIMD Acceleration
HORUS uses SIMD (Single Instruction Multiple Data) intrinsics for high-performance memory operations, achieving sub-100ns IPC latency for small messages.
Cross-Platform Support
SIMD acceleration works across platforms with automatic fallback:
| Platform | SIMD Support | Fallback |
|---|---|---|
| x86_64 (Linux/Windows/macOS) | AVX2 | Standard memcpy |
| ARM64 (macOS M1/M2, Linux) | NEON (future) | Standard memcpy |
| Other architectures | None | Standard memcpy |
Your code runs on any platform - SIMD just provides extra performance on supported hardware.
How SIMD is Used
On x86_64, the framework uses std::arch::x86_64 intrinsics for:
| Operation | SIMD Instruction | Speedup |
|---|---|---|
| Memory copy (4KB+) | AVX2 _mm256_* | 2-4x |
| Cache prefetch | _mm_prefetch | Reduces latency |
| Non-temporal stores | _mm256_stream_* | Bypasses cache for large writes |
Automatic Feature Detection
SIMD features are detected at runtime using is_x86_feature_detected!():
# Check SIMD features enabled for the default compilation target
rustc --print cfg | grep target_feature
# Common features on modern CPUs:
# target_feature="avx2"
# target_feature="sse4.2"
Enabling Maximum SIMD
For maximum performance, compile with native CPU features:
# Build for your specific CPU (enables all supported features)
RUSTFLAGS="-C target-cpu=native" cargo build --release
# Or specify AVX2 explicitly
RUSTFLAGS="-C target-feature=+avx2" cargo build --release
When SIMD Applies
SIMD acceleration automatically applies to:
- Topic send/recv: Message copying for messages >= 4KB
- Shared memory operations: Large buffer transfers using non-temporal streaming stores
For messages smaller than 4KB, standard ptr::copy_nonoverlapping is used (setup overhead exceeds SIMD benefit for small copies).
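The dispatch can be sketched like this (the threshold constant and function names are illustrative; HORUS's internal implementation may differ):

```rust
// Illustrative size-based dispatch: copies at or above 4KB take the
// AVX2 path when the CPU supports it; everything else uses the
// standard copy.
const SIMD_THRESHOLD: usize = 4096;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avx2_copy(src: *const u8, dst: *mut u8, len: usize) {
    use std::arch::x86_64::*;
    let mut i = 0;
    // 32 bytes per AVX2 register
    while i + 32 <= len {
        let v = _mm256_loadu_si256(src.add(i) as *const __m256i);
        _mm256_storeu_si256(dst.add(i) as *mut __m256i, v);
        i += 32;
    }
    // Remainder: standard copy
    std::ptr::copy_nonoverlapping(src.add(i), dst.add(i), len - i);
}

fn copy_message(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    #[cfg(target_arch = "x86_64")]
    {
        if src.len() >= SIMD_THRESHOLD && is_x86_feature_detected!("avx2") {
            // Safe: AVX2 presence was verified at runtime
            unsafe { avx2_copy(src.as_ptr(), dst.as_mut_ptr(), src.len()) };
            return;
        }
    }
    // Standard path for small messages and non-AVX2 CPUs
    dst.copy_from_slice(src);
}

fn main() {
    let src = vec![0xABu8; 8192];
    let mut dst = vec![0u8; 8192];
    copy_message(&src, &mut dst);
    assert_eq!(src, dst);
}
```

Because the feature check happens at runtime, the same binary works on CPUs with and without AVX2.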
SIMD Impact on Latency
Benchmarks on Intel Core i9-13900K (AVX2 enabled):
| Message Size | Without SIMD | With SIMD | Improvement |
|---|---|---|---|
| 16B (CmdVel) | ~150ns | ~87ns | 1.7x |
| 304B (IMU) | ~280ns | ~160ns | 1.75x |
| 1.5KB (LaserScan) | ~800ns | ~400ns | 2x |
Note: SIMD acceleration applies to messages >= 4KB; smaller messages use standard ptr::copy_nonoverlapping. The latency improvements above come from the overall Topic pipeline, including cache-line alignment and lock-free algorithms.
Fallback Behavior
On CPUs without AVX2, HORUS automatically falls back to:
- Standard memcpy/ptr::copy_nonoverlapping (always available)
Performance remains good, but peak throughput is reduced.
Verifying SIMD is Active
// At runtime, check CPU features
#[cfg(target_arch = "x86_64")]
{
if is_x86_feature_detected!("avx2") {
println!("AVX2 acceleration enabled");
} else {
println!("Using standard memcpy fallback");
}
}
Message Optimization
Use Fixed-Size Types
// FAST: Fixed-size array
pub struct LaserScan {
pub ranges: [f32; 360], // Stack-allocated
}
// SLOW: Dynamic vector
pub struct BadLaserScan {
pub ranges: Vec<f32>, // Heap-allocated
}
Impact: Fixed-size avoids heap allocations in hot path.
Choose Typed Messages Over Generic
// FAST: Small, fixed-size struct
let topic: Topic<Pose2D> = Topic::new("pose")?;
topic.send(Pose2D { x: 1.0, y: 2.0, theta: 0.5 });
// IPC latency: ~87-313ns depending on SPSC/MPMC
// SLOWER: Larger struct with more data
let topic: Topic<SensorBundle> = Topic::new("sensors")?;
// Latency scales linearly with message size
Rule: Use the smallest struct that represents your data. Avoid padding and unused fields.
Choose Appropriate Precision
// f32 (single precision) - sufficient for most robotics
pub struct FastPose {
pub x: f32, // 4 bytes
pub y: f32, // 4 bytes
}
// f64 (double precision) - scientific applications
pub struct PrecisePose {
pub x: f64, // 8 bytes
pub y: f64, // 8 bytes
}
Rule: Use f32 unless you need scientific precision.
Minimize Message Size
// GOOD: 8 bytes
struct CompactCmd {
linear: f32, // 4 bytes
angular: f32, // 4 bytes
}
// BAD: 1KB+ message
struct BloatedCmd {
linear: f32,
angular: f32,
metadata: [u8; 256], // Unused
debug_info: [u8; 768], // Unused
}
Every byte matters: Latency scales with message size.
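One way to enforce this is a compile-time size assertion, sketched here using the CompactCmd above: if someone adds a field and the struct grows, the build fails instead of latency silently degrading in production.

```rust
use std::mem::size_of;

#[repr(C)]
pub struct CompactCmd {
    pub linear: f32,  // 4 bytes
    pub angular: f32, // 4 bytes
}

// Const assertion: fails to compile if the struct grows past 8 bytes.
const _: () = assert!(size_of::<CompactCmd>() == 8);

fn main() {
    // prints "CompactCmd is 8 bytes"
    println!("CompactCmd is {} bytes", size_of::<CompactCmd>());
}
```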
Batch Small Messages
Instead of sending 100 separate f32 values:
// SLOW: 100 separate messages
for value in values {
topic.send(value); // 100 IPC operations
}
// FAST: One batched message
pub struct BatchedData {
values: [f32; 100],
}
topic.send(batched); // 1 IPC operation
Speedup: 50-100x for batched operations.
Node Optimization
Keep tick() Fast
Target: <1ms per tick for real-time control.
// GOOD: Fast tick
fn tick(&mut self) {
let data = self.read_sensor(); // Quick read
self.process_pub.send(data); // ~500ns
}
// BAD: Slow tick
fn tick(&mut self) {
let data = std::fs::read_to_string("config.yaml").unwrap(); // 1-10ms!
// ...
}
File I/O, network calls, sleeps = slow. Do these in init() or separate threads.
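A minimal sketch of the separate-thread approach, using a standard channel (this is plain Rust, not a HORUS API): the worker thread blocks, and the tick() side only ever calls the non-blocking try_recv().

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Worker thread performs the slow, blocking work and hands the
// result back through a channel.
fn spawn_blocking_loader() -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Stand-in for slow, blocking I/O (file read, network call, ...)
        thread::sleep(Duration::from_millis(50));
        let _ = tx.send(String::from("config contents"));
    });
    rx
}

fn main() {
    let rx = spawn_blocking_loader();
    // tick() equivalent: poll without blocking, keep doing other work
    let mut result = None;
    for _ in 0..200 {
        if let Ok(data) = rx.try_recv() {
            result = Some(data);
            break;
        }
        thread::sleep(Duration::from_millis(5)); // next "tick"
    }
    assert_eq!(result.as_deref(), Some("config contents"));
}
```

The tick loop never waits on I/O; in the worst case try_recv() returns Empty and the tick moves on.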
Pre-Allocate in init()
fn init(&mut self) -> Result<()> {
// Pre-allocate buffers
self.buffer = vec![0.0; 10000];
// Open connections
self.device = Device::open()?;
// Load configuration
self.config = Config::from_file("config.yaml")?;
Ok(())
}
fn tick(&mut self) {
// Use pre-allocated resources - no allocations here!
self.buffer[0] = self.device.read();
}
Allocations in tick() = slow. Move to init().
Avoid Unnecessary Cloning
// BAD: Unnecessary clone
fn tick(&mut self) {
if let Some(data) = self.sub.recv() {
let copy = data.clone(); // Unnecessary!
self.process(copy);
}
}
// GOOD: Direct use
fn tick(&mut self) {
if let Some(data) = self.sub.recv() {
self.process(data); // Already cloned by recv()
}
}
Topic::recv() already clones data. Don't clone again.
Minimize Logging
// BAD: Logging every tick
fn tick(&mut self) {
hlog!(debug, "Tick #{}", self.counter); // Slow!
self.counter += 1;
}
// GOOD: Conditional logging
fn tick(&mut self) {
if self.counter % 1000 == 0 { // Log every 1000 ticks
hlog!(info, "Reached tick #{}", self.counter);
}
self.counter += 1;
}
Logging is expensive. Log sparingly in hot paths.
Scheduler Optimization
Understanding Tick Rate
The default scheduler runs at 60 Hz (16ms per tick). Use Scheduler presets to change it:
// Default: 60 Hz
let scheduler = Scheduler::new();
// Preset: 10kHz for high-performance
let scheduler = Scheduler::high_performance();
Key Point: Keep individual node tick() methods fast (ideally <1ms) to maintain the target tick rate.
Use Priority Levels
// Critical tasks run first (order 0 = highest)
scheduler.add(safety).order(0).done();
// Logging runs last (order 100 = lowest)
scheduler.add(logger).order(100).done();
Predictable execution order = better performance. Use lower numbers for higher priority tasks.
Minimize Node Count
// BAD: 50 small nodes
for i in 0..50 {
scheduler.add(TinyNode::new(i)).order(50).done();
}
// GOOD: One aggregated node
scheduler.add(AggregatedNode::new()).order(50).done();
Fewer nodes = less scheduling overhead.
Ultra-Low-Latency Networking (Linux)
HORUS provides optional kernel bypass networking for sub-microsecond latency requirements.
Transport Options
| Transport | Latency (send-only) | Throughput | Requirements |
|---|---|---|---|
| Shared Memory (Topic SPSC) | ~87ns | 12M+ msg/s | Local only (wait-free) |
| Shared Memory (Topic MPMC) | ~313ns | 3M+ msg/s | Local only (lock-free) |
| io_uring | 2-3µs | 500K+ msg/s | Linux 5.1+ |
| Batch UDP | 3-5µs | 300K+ msg/s | Linux 3.0+ |
| Standard UDP | 5-10µs | 200K+ msg/s | Cross-platform |
Enable io_uring Transport
io_uring eliminates syscalls on the send path using kernel-side polling:
# Build with io_uring support (Cargo feature flag)
cargo build --release --features io-uring-net
Requirements:
- Linux 5.1+ (5.6+ recommended for SQ polling)
- CAP_SYS_NICE capability for SQ_POLL mode
Enable Batch UDP (Linux)
Batch UDP uses sendmmsg/recvmmsg syscalls for efficient batched network I/O:
# Batch UDP is automatically enabled on Linux - no extra dependencies needed
cargo build --release
Requirements:
- Linux 3.0+ (available on virtually all modern Linux systems)
Enable All Ultra-Low-Latency Features
# Build with all ultra-low-latency features (io_uring)
cargo build --release --features ultra-low-latency
Smart Transport Selection
For network topics, HORUS automatically selects the best transport based on available system features and kernel version. Configure network endpoints through topic configuration rather than the Topic::new() API (which creates local shared memory topics). See Network Backends for details.
Shared Memory Optimization
Check Available Space
df -h /dev/shm
Insufficient space = message drops.
Increase /dev/shm Size
# Increase to 4GB
sudo mount -o remount,size=4G /dev/shm
More space = larger buffer capacity.
Clean Up Stale Topics
Note: HORUS automatically cleans up sessions after each run. Manual cleanup is rarely needed.
# Clean all HORUS shared memory (if needed after crashes)
rm -rf /dev/shm/horus/
Stale topics from crashes can waste space, but auto-cleanup prevents this in normal operation.
Topic Memory Usage
Topics use shared memory slots proportional to message size. Keep messages small to reduce memory footprint:
// Small messages use less shared memory
let cmd: Topic<CmdVel> = Topic::new("cmd_vel")?; // 16B per slot
// Large messages use more shared memory
let cloud: Topic<PointCloud> = Topic::new("cloud")?; // 120KB per slot
Balance: Message size directly affects shared memory consumption.
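As a back-of-envelope model (the slot count here is an assumption for illustration, not a documented HORUS default):

```rust
// Per-topic shared memory is roughly slot_count × slot_size.
fn topic_footprint_bytes(slot_count: usize, message_size: usize) -> usize {
    slot_count * message_size
}

fn main() {
    // 16B CmdVel across 1024 slots: 16 KiB
    assert_eq!(topic_footprint_bytes(1024, 16), 16 * 1024);
    // 120KB PointCloud across 1024 slots: 120 MiB
    assert_eq!(topic_footprint_bytes(1024, 120 * 1024), 120 * 1024 * 1024);
}
```

A handful of large-message topics can therefore exhaust /dev/shm far faster than hundreds of small-message ones.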
Profiling and Measurement
Built-In Metrics
HORUS automatically tracks node performance metrics. Use horus monitor to view real-time performance data including tick duration, messages sent, and CPU usage.
Available metrics (on NodeMetrics):
- total_ticks: Total number of ticks
- avg_tick_duration_ms: Average tick time in milliseconds
- max_tick_duration_ms: Worst-case tick time in milliseconds
- messages_sent: Messages published
- messages_received: Messages received
- errors_count: Total error count
- uptime_seconds: Node uptime in seconds
IPC Latency Logging
HORUS automatically tracks IPC timing for each topic operation. The horus monitor web interface displays per-log-entry metrics:
Tick: 12μs | IPC: 296ns
Each log entry includes tick_us (node tick time in microseconds) and ipc_ns (IPC write time in nanoseconds).
Manual Profiling
use std::time::Instant;
fn tick(&mut self) {
let start = Instant::now();
self.expensive_operation();
let duration = start.elapsed();
println!("Operation took: {:?}", duration);
}
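To avoid printing every tick (printing in the hot path is itself slow), a small accumulator can track average and worst-case durations and report them occasionally; a sketch:

```rust
use std::time::Instant;

// Accumulates tick timing so reporting can happen rarely.
struct TickStats {
    count: u64,
    total_ns: u128,
    max_ns: u128,
}

impl TickStats {
    fn new() -> Self {
        TickStats { count: 0, total_ns: 0, max_ns: 0 }
    }

    fn record(&mut self, ns: u128) {
        self.count += 1;
        self.total_ns += ns;
        self.max_ns = self.max_ns.max(ns);
    }

    fn avg_ns(&self) -> u128 {
        if self.count == 0 { 0 } else { self.total_ns / self.count as u128 }
    }
}

fn main() {
    let mut stats = TickStats::new();
    for i in 0..1000u64 {
        let start = Instant::now();
        std::hint::black_box(i * i); // stand-in for tick work
        stats.record(start.elapsed().as_nanos());
    }
    // Report once, after the run
    println!("avg: {} ns, max: {} ns", stats.avg_ns(), stats.max_ns);
    assert!(stats.max_ns >= stats.avg_ns());
}
```

The max value is usually the number that matters for real-time deadlines; the average alone can hide rare slow ticks.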
CPU Profiling
Use perf on Linux:
# Profile your application
perf record --call-graph dwarf horus run --release
# View results
perf report
Hotspots show where CPU time is spent.
Common Performance Pitfalls
Pitfall: Using Debug Builds
# SLOW: 50µs/tick
horus run
# FAST: 500ns/tick
horus run --release
Fix: Always use --release for benchmarks and production.
Pitfall: Allocations in tick()
// BAD
fn tick(&mut self) {
let buffer = vec![0.0; 1000]; // Heap allocation every tick!
}
// GOOD
struct Node {
buffer: Vec<f32>, // Pre-allocated
}
fn init(&mut self) -> Result<()> {
self.buffer = vec![0.0; 1000]; // Allocate once
Ok(())
}
Fix: Pre-allocate in init().
Pitfall: Excessive Logging
// BAD: 60 logs per second
fn tick(&mut self) {
hlog!(debug, "Tick"); // Every 16ms!
}
// GOOD: 1 log per second
fn tick(&mut self) {
self.tick_count += 1;
if self.tick_count % 60 == 0 {
hlog!(info, "60 ticks completed");
}
}
Fix: Log sparingly.
Pitfall: Large Message Types
// BAD: 1MB per message
pub struct HugeMessage {
image: [u8; 1_000_000],
}
// GOOD: Compressed or separate channel
pub struct CompressedImage {
data: Vec<u8>, // JPEG compressed, ~50KB
}
Fix: Compress or split large data.
Pitfall: Synchronous I/O in tick()
// BAD: Blocking I/O
fn tick(&mut self) {
let data = std::fs::read("data.txt").unwrap(); // Blocks!
}
// GOOD: Async or pre-loaded
fn init(&mut self) -> Result<()> {
self.data = std::fs::read("data.txt")?; // Load once
Ok(())
}
Fix: Move I/O to init() or use async.
Performance Checklist
Before deployment, verify:
- Build in release mode (--release)
- Profile with perf or similar
- tick() completes in <1ms
- No allocations in tick()
- Messages use fixed-size types
- Logging is rate-limited
- /dev/shm has sufficient space
- IPC latency is <10µs
- Priority levels set correctly
Measuring Your Performance
Latency Measurement
use std::time::Instant;
struct BenchmarkNode {
pub_topic: Topic<f32>,
sub_topic: Topic<f32>,
start_time: Option<Instant>,
}
impl Node for BenchmarkNode {
fn tick(&mut self) {
// Publish
self.start_time = Some(Instant::now());
self.pub_topic.send(42.0);
// Receive
if let Some(data) = self.sub_topic.recv() {
if let Some(start) = self.start_time {
let latency = start.elapsed();
println!("Round-trip latency: {:?}", latency);
}
}
}
}
Throughput Measurement
struct ThroughputTest {
pub_topic: Topic<f32>,
message_count: u64,
start_time: Instant,
}
impl Node for ThroughputTest {
fn tick(&mut self) {
for _ in 0..1000 {
self.pub_topic.send(42.0);
self.message_count += 1;
}
if self.message_count % 100_000 == 0 {
let elapsed = self.start_time.elapsed().as_secs_f64();
let throughput = self.message_count as f64 / elapsed;
println!("Throughput: {:.0} msg/s", throughput);
}
}
}
Real-Time Configuration
For hard real-time applications requiring deterministic latency, HORUS provides system-level RT configuration:
use horus::prelude::*;
// Configure for hard real-time operation
let config = RtConfig::hard_realtime(Some(&[2, 3])); // Pin to isolated cores
config.apply()?;
// This enables:
// - mlockall() - No page faults
// - SCHED_FIFO priority 80 - Preempts normal processes
// - CPU affinity - No migration jitter
// - Stack prefaulting - No lazy allocation
For detailed configuration options, see the Real-Time Configuration Guide.
Next Steps
- Apply these optimizations to your Examples
- Configure Real-Time Settings for deterministic latency
- Set up Real-Time Nodes with WCET constraints
- Learn about Multi-Language Support
- Read the Core Concepts for deeper understanding
- Check the CLI Reference for build options