Add trajectory smoothness metrics for evaluation #102
Open
lonexreb wants to merge 1 commit into NVlabs:main from
Adds src/alpamayo_r1/metrics/smoothness_metrics.py with
compute_smoothness_metrics(pred_xyz, pred_rot, disable_summary,
planning_freq_hz) -> dict[str, Tensor] that reports per-batch RMS and
max magnitude of:
- jerk_lon (longitudinal jerk, m/s^3)
- accel_lon (longitudinal acceleration, m/s^2)
- accel_lat (lateral acceleration, m/s^2)
- yaw_rate (rad/s)
- yaw_accel (rad/s^2)
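For intuition, each reported signal comes from repeated finite differencing of the planned states, scaled by `planning_freq_hz`. A minimal plain-Python sketch (the helper name is illustrative; the actual module operates on batched torch tensors):

```python
def finite_diff(signal, freq_hz):
    # Forward difference scaled by the sampling frequency: d/dt ~ (x[i+1] - x[i]) * f.
    return [(b - a) * freq_hz for a, b in zip(signal, signal[1:])]

# Longitudinal positions sampled at 10 Hz from a constant 5 m/s trajectory:
freq_hz = 10.0
pos = [5.0 * i / freq_hz for i in range(8)]
vel = finite_diff(pos, freq_hz)    # longitudinal velocity, ~5.0 everywhere
acc = finite_diff(vel, freq_hz)    # longitudinal acceleration, ~0.0
jerk = finite_diff(acc, freq_hz)   # longitudinal jerk, ~0.0
```

This is also why the constant-velocity test case below expects v_lon = 5 and zero on every other signal: each extra difference of a linear position trace is identically zero.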
Why this is worth adding:
The kinematic derivations already exist in
finetune/rl/rewards/comfort_reward.py, but they are wrapped in a
within-bound boolean used as an RL reward. There is no easy way to
report the actual signal magnitudes during evaluation today, so a
researcher who wants to compare two checkpoints on "did this model
produce smoother trajectories?" has no metric for that.
This module surfaces the same physical signals as evaluation metrics
parallel to compute_ade / compute_fde. ADE/FDE answer "is the
trajectory accurate?"; smoothness metrics answer "is the trajectory
drivable?". Both are needed and they are independent.
Design notes:
- Pure additive change. comfort_reward.py is untouched. We deliberately
duplicate the small _diff helpers so the metrics layer doesn't take a
dependency on the RL training package; the docstring calls out the
intentional duplication and explains why.
- Returns per-batch tensors of shape [B], matching the convention of
every other metric in this module after summarize_metric. Per-key
_std variants are added when N > 1 (consistent with compute_minade
/ compute_minfde).
- Yaw rate uses pi-wraparound handling so heading transitions across
+/-pi do not report the spurious ~2*pi*freq_hz spike a naive diff
would yield.
- Both peak and RMS are reported because they answer different
questions: peak captures worst-case discomfort, RMS captures sustained
effort; we have seen models that are good on one and bad on the other.
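The wraparound point can be sketched in a few lines (hedged: `diff_yaw` here is illustrative, not necessarily the module's exact helper):

```python
import math

def diff_yaw(yaw_prev, yaw_next, freq_hz):
    # Wrap the heading delta into [-pi, pi) before scaling by the planning
    # frequency, so crossing the +/-pi seam reports the true small rate.
    d = (yaw_next - yaw_prev + math.pi) % (2.0 * math.pi) - math.pi
    return d * freq_hz

# Heading steps from +3.1 rad to -3.1 rad across the seam at 10 Hz:
naive = (-3.1 - 3.1) * 10.0          # -62 rad/s spike (~2*pi*freq_hz)
wrapped = diff_yaw(3.1, -3.1, 10.0)  # ~0.83 rad/s, the physical rate
```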
Also adds src/alpamayo_r1/metrics/test_smoothness_metrics.py with 9
pytest cases covering: returned keys + shapes, constant-velocity is
identically smooth, constant-acceleration recovers the input ax,
pi-wraparound is handled, _std gating on N and disable_summary, and
shape rejection on malformed inputs.
All 9 functional invariants verified locally against torch (no GPU /
HF needed) -- output:
PASS: constant velocity gives v_lon=5, zero everywhere else
PASS: constant accel ax=2 (interior mean = 2.000000)
PASS: yaw wraparound (max |yaw_rate| = 0.400000, NOT ~62)
PASS: emits 10 expected keys, shape [B]
PASS: _std variants when N>1
PASS: disable_summary drops _std
PASS: constant-velocity reports near-zero on all signals
PASS: rejects xyz wrong shape
PASS: rejects rot wrong shape
Signed-off-by: lonexreb <[email protected]>
Why
ADE / FDE answer "is the trajectory accurate?" Smoothness metrics answer "is the trajectory drivable?" — and right now there is no easy way to ask the second question during evaluation.
The kinematic derivations already exist in
finetune/rl/rewards/comfort_reward.py, but they're wrapped in a within-bound boolean used as an RL reward. A researcher who wants to compare two checkpoints on "did this model produce smoother trajectories?" has no metric for that. Several open issues mention model-output quality (e.g. #10 reports occasional erratic predictions); smoothness numbers help triage that exact class of failure.

What
New module src/alpamayo_r1/metrics/smoothness_metrics.py exposing compute_smoothness_metrics. Returns per-batch tensors of shape [B] with both the RMS (typical signal level) and the absolute peak (worst-case) of:
- smoothness/jerk_lon (longitudinal jerk, m/s³)
- smoothness/accel_lon (longitudinal acceleration, m/s²)
- smoothness/accel_lat (lateral acceleration, m/s²)
- smoothness/yaw_rate (rad/s)
- smoothness/yaw_accel (rad/s²)
Each signal yields <key>_rms and <key>_max. _std variants are added by summarize_metric when N > 1, exactly mirroring compute_minade / compute_minfde, so dashboards already grouping on *_std pick them up automatically.
Also exposes gather_dynamics(pred_xyz, pred_rot, planning_freq_hz) for callers that want the raw per-timestep signals.

Design notes
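A toy sketch of why both statistics are kept (plain Python, illustrative; the module computes these per batch element over the trajectory horizon):

```python
import math

def rms_and_peak(signal):
    # RMS summarizes sustained effort; peak captures the single worst sample.
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    peak = max(abs(x) for x in signal)
    return rms, peak

# Two lateral-acceleration traces with the same peak but different RMS:
spiky = [0.0, 0.0, 3.0, 0.0, 0.0]    # one brief 3 m/s^2 spike
rolly = [3.0, -3.0, 3.0, -3.0, 3.0]  # sustained 3 m/s^2 oscillation
spiky_rms, spiky_peak = rms_and_peak(spiky)
rolly_rms, rolly_peak = rms_and_peak(rolly)
```

Both traces share peak = 3, but the sustained oscillation has a much larger RMS, which is exactly the "good on one, bad on the other" case the design notes mention.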
comfort_reward.py is untouched. The small _diff / _diff_yaw helpers are deliberately duplicated so the metrics layer doesn't take a dependency on the finetune.rl package. The module docstring calls out the intentional duplication and the reason.
Inputs are validated up front (pred_xyz must be [B, N, K, T, 3], pred_rot must be [B, N, K, T, 3, 3] with matching leading dims), raising ValueError rather than failing deep in a tensor op.

Tests
src/alpamayo_r1/metrics/test_smoothness_metrics.py: 9 pytest cases, no GPU / no HF needed, verified locally.
Test fixtures use float64 to keep finite-difference noise at machine precision (twice-differentiating a 0.1 s step in float32 accumulates ~1e-4 jerk noise that would force loose tolerances and make the contract less crisp).
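The dtype effect is easy to reproduce; a sketch with NumPy (the real fixtures are torch tensors):

```python
import numpy as np

freq_hz = 10.0
t = np.arange(20) / freq_hz  # 0.1 s steps

def max_jerk_noise(dtype):
    # Constant-acceleration positions (ax = 2 m/s^2): the true jerk is 0,
    # so anything nonzero after triple differencing is rounding noise.
    pos = (0.5 * 2.0 * t * t).astype(dtype)
    vel = np.diff(pos) * freq_hz
    acc = np.diff(vel) * freq_hz
    jerk = np.diff(acc) * freq_hz
    return np.abs(jerk).max()

noise32 = max_jerk_noise(np.float32)  # roughly 1e-4, per the rationale above
noise64 = max_jerk_noise(np.float64)  # orders of magnitude smaller
```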
Migration

None. New module, new file, new tests. Existing call sites untouched.