
Add trajectory smoothness metrics for evaluation#102

Open
lonexreb wants to merge 1 commit into NVlabs:main from lonexreb:feat/add-trajectory-smoothness-metrics

Conversation

@lonexreb
Contributor

@lonexreb lonexreb commented May 4, 2026

Why

ADE / FDE answer "is the trajectory accurate?" Smoothness metrics answer "is the trajectory drivable?" — and right now there is no easy way to ask the second question during evaluation.

The kinematic derivations already exist in finetune/rl/rewards/comfort_reward.py, but they're wrapped in a within-bound boolean used as an RL reward. A researcher who wants to compare two checkpoints on "did this model produce smoother trajectories?" has no metric for that. Several open issues mention model-output quality (e.g. #10 reports occasional erratic predictions) — smoothness numbers help triage that exact class of failure.

What

New module src/alpamayo_r1/metrics/smoothness_metrics.py exposing:

compute_smoothness_metrics(
    pred_xyz: Tensor,          # [B, N, K, T, 3]
    pred_rot: Tensor,          # [B, N, K, T, 3, 3]
    disable_summary: bool = False,
    planning_freq_hz: float = 10.0,
) -> dict[str, Tensor]

Returns per-batch tensors of shape [B] with both the RMS (typical signal level) and the absolute peak (worst-case) of:

  • smoothness/jerk_lon (longitudinal jerk, m/s³)
  • smoothness/accel_lon (longitudinal acceleration, m/s²)
  • smoothness/accel_lat (lateral acceleration, m/s²)
  • smoothness/yaw_rate (rad/s)
  • smoothness/yaw_accel (rad/s²)

Each signal yields a <key>_rms and a <key>_max entry. _std variants are added by summarize_metric when N > 1, exactly mirroring compute_minade / compute_minfde, so dashboards already grouping on *_std pick them up automatically.
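As a rough sketch of that reduction (how the N sample and K mode axes are collapsed to a per-batch value is an assumption here, not something the PR text pins down):

```python
import numpy as np


def summarize(signal: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Reduce a per-timestep signal [B, N, K, T] to per-batch _rms / _max.

    Averaging over N (samples) and K (modes) is illustrative only; the
    PR's summarize_metric may reduce those axes differently.
    """
    rms = np.sqrt((signal ** 2).mean(axis=-1))   # [B, N, K]
    peak = np.abs(signal).max(axis=-1)           # [B, N, K]
    return rms.mean(axis=(1, 2)), peak.mean(axis=(1, 2))  # each [B]
```

A signal of [3, -4, 0, 0] then reports rms 2.5 (typical level) but peak 4.0 (worst case), which is exactly why both are worth logging.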

Also exposes gather_dynamics(pred_xyz, pred_rot, planning_freq_hz) for callers that want the raw per-timestep signals.
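The raw-signal path can be illustrated with a self-contained finite-difference sketch (a simplification of what gather_dynamics presumably computes; the real module operates on [B, N, K, T, 3] tensors and uses pred_rot to split longitudinal from lateral components):

```python
import numpy as np


def longitudinal_signals(xyz: np.ndarray, planning_freq_hz: float = 10.0):
    """Finite-difference speed / accel / jerk along a single [T, 3] path.

    Simplified: uses path speed as the longitudinal component instead of
    projecting velocity onto the heading from the rotation matrices.
    """
    dt = 1.0 / planning_freq_hz
    vel = np.diff(xyz, axis=0) / dt        # [T-1, 3] velocity vectors
    speed = np.linalg.norm(vel, axis=-1)   # [T-1]   longitudinal speed
    accel = np.diff(speed) / dt            # [T-2]
    jerk = np.diff(accel) / dt             # [T-3]
    return speed, accel, jerk
```

A straight line sampled at 0.5 m per 0.1 s tick then reproduces the first test invariant below: speed 5 m/s, zero acceleration, zero jerk.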

Design notes

  • Pure additive change. comfort_reward.py is untouched. The small _diff / _diff_yaw helpers are deliberately duplicated so the metrics layer doesn't take a dependency on the finetune.rl package. The module docstring calls out the intentional duplication and the reason.
  • Yaw wraparound is handled so a heading transition across ±π does not report the spurious ~2π·freq_hz spike a naive diff would yield.
  • Both peak and RMS are reported because they answer different questions: peak captures worst-case discomfort, RMS captures sustained effort; we have seen models that are good on one and bad on the other.
  • Shape validation up front (pred_xyz must be [B, N, K, T, 3], pred_rot must be [B, N, K, T, 3, 3] with matching leading dims), raising ValueError rather than failing deep in a tensor op.
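The wraparound handling in the second note can be sketched as follows (a hypothetical helper; the PR's _diff_yaw may differ in detail):

```python
import numpy as np


def yaw_rate(yaw: np.ndarray, planning_freq_hz: float = 10.0) -> np.ndarray:
    """Finite-difference yaw rate with each step wrapped into [-pi, pi)."""
    d = np.diff(yaw)
    d = (d + np.pi) % (2.0 * np.pi) - np.pi  # wrap the step, not the heading
    return d * planning_freq_hz
```

A heading sequence stepping 0.06 rad per 10 Hz tick across +π then reports a steady 0.6 rad/s, instead of the ~2π·10 ≈ 62 rad/s spike a naive diff would produce at the wrap.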

Tests

src/alpamayo_r1/metrics/test_smoothness_metrics.py — 9 pytest cases, no GPU / no HF needed. Verified locally:

PASS: constant velocity gives v_lon=5, zero everywhere else
PASS: constant accel ax=2 (interior mean = 2.000000)
PASS: yaw wraparound (max |yaw_rate| = 0.400000, NOT ~62)
PASS: emits 10 expected keys, shape [B]
PASS: _std variants when N>1
PASS: disable_summary drops _std
PASS: constant-velocity reports near-zero on all signals
PASS: rejects xyz wrong shape
PASS: rejects rot wrong shape

Test fixtures use float64 to keep finite-difference noise at machine precision (twice-differentiating a 0.1 s step in float32 accumulates ~1e-4 jerk noise that would force loose tolerances and make the contract less crisp).
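The precision argument is easy to reproduce (a standalone numpy demo, not code from the PR): triple-differencing a quadratic position trace at 10 Hz should give exactly zero jerk, and float64 keeps the residual near machine precision while float32 does not.

```python
import numpy as np


def jerk_noise(dtype) -> float:
    """Max |jerk| from triple-differencing an analytically jerk-free path."""
    dt = np.asarray(0.1, dtype=dtype)
    t = np.arange(50, dtype=dtype) * dt
    x = t ** 2                              # constant accel -> zero jerk
    return float(np.abs(np.diff(x, n=3) / dt ** 3).max())
```

On a typical run jerk_noise(np.float64) sits many orders of magnitude below jerk_noise(np.float32), which is why the fixtures pin float64 rather than loosening the tolerances.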

Migration

None. New module, new file, new tests. Existing call sites untouched.

Signed-off-by: lonexreb <[email protected]>
