Skip to content

Conversation

@nitin4real
Copy link
Contributor

No description provided.

@claude
Copy link

claude bot commented Dec 19, 2025

Code Review for PR #1901: Fix/sarvam audio start_ms

Summary

This PR fixes the calculation of start_ms for ASR transcripts in the Sarvam ASR extension by implementing proper timing based on total audio sent and transcript duration.


✅ Positive Aspects

  1. Solves a Real Problem: The PR addresses the TODO comment and implements the missing start_ms calculation logic
  2. Correct Implementation: The calculation start_ms = max(0, total_audio_sent_ms - duration_ms) is logically sound
  3. Good Documentation: The inline comments clearly explain the calculation logic
  4. Version Bump: Properly incremented the version from 0.1.0 to 0.1.1 in manifest.json
  5. Uses Existing Infrastructure: Leverages the audio_timeline tracking that's already in place

🔍 Code Quality Observations

extension.py:349-360

Good practices:

  • Clear variable naming (total_audio_sent_ms, duration_ms)
  • Use of max(0, ...) to prevent negative values
  • Multi-line expression formatting improves readability

Minor suggestion:
The calculation assumes that the transcript duration directly corresponds to the most recently sent audio. This works for streaming ASR where transcripts are returned in order, but consider documenting this assumption:

# Calculate start_ms: transcript corresponds to audio that was sent
# Assumption: Sarvam returns transcripts in-order for streaming audio
# Start is total audio sent minus the duration of this transcript
start_ms = max(0, total_audio_sent_ms - duration_ms)

🤔 Potential Issues & Questions

1. Edge Case: Reconnection Timing

In extension.py:194-197, when reconnecting:

self.sent_user_audio_duration_ms_before_last_reset += (
    self.audio_timeline.get_total_user_audio_duration()
)
self.audio_timeline.reset()

Question: What happens if a transcript arrives during or immediately after reconnection? The timing calculation should still be correct, but it would be good to verify this behavior is tested.

2. Precision and Audio Duration Accuracy

duration_ms = int(audio_duration * 1000) if audio_duration > 0 else 0

Observation: Converting float seconds to integer milliseconds may lose sub-millisecond precision. This is likely acceptable for ASR timestamps, but if Sarvam returns very precise durations, consider using round() instead of int() to minimize accumulated error.

3. Calculation Accuracy Concern

The calculation total_audio_sent_ms - duration_ms assumes that:

  • The transcript's audio_duration matches exactly the amount of audio consumed
  • There's no significant latency or buffering in the Sarvam service

Question: Does Sarvam's API documentation guarantee that metrics.audio_duration represents the exact duration of audio that was transcribed? If there's VAD (Voice Activity Detection) or silence trimming, the duration might not match the sent audio perfectly.

4. Test Coverage

Looking at tests/test_asr_result.py:147-150:

"metrics": {
    "audio_duration": 2.0,
    "processing_latency": 0.1,
},

Missing: The test doesn't validate that start_ms is calculated correctly. Consider adding assertions like:

self.stop_test_if_checking_failed(
    ten_env_tester,
    data_dict["start_ms"] >= 0,
    f"start_ms should be non-negative: {data_dict['start_ms']}"
)

🔒 Security Considerations

✅ No security concerns identified. The changes involve timing calculations only and don't introduce new attack surfaces.


⚡ Performance Considerations

✅ No performance impact. The changes add minimal arithmetic operations that are negligible compared to audio processing and network I/O.


🧪 Test Coverage Recommendations

  1. Add explicit start_ms validation in test_asr_result.py
  2. Test reconnection scenarios to ensure timing remains accurate after reconnection
  3. Test multiple consecutive transcripts to verify cumulative timing accuracy
  4. Test edge case where audio_duration is 0 or missing

📝 Suggested Improvements

1. Add logging for timing calculations (debugging aid):

# Calculate start_ms: transcript corresponds to audio that was sent
# Start is total audio sent minus the duration of this transcript
start_ms = max(0, total_audio_sent_ms - duration_ms)

if self.config.dump:  # Or use a debug log level
    self.ten_env.log_debug(
        f"Timing: total_sent={total_audio_sent_ms}ms, "
        f"duration={duration_ms}ms, start={start_ms}ms",
        category=LOG_CATEGORY_VENDOR,
    )

2. Consider defensive programming:

# Validate that start_ms is reasonable
if start_ms > total_audio_sent_ms:
    self.ten_env.log_warn(
        f"Suspicious timing: start_ms ({start_ms}) > total_sent ({total_audio_sent_ms})",
        category=LOG_CATEGORY_VENDOR,
    )
    start_ms = 0  # Fallback to safe value

✨ Overall Assessment

Rating: ✅ Approve with Minor Suggestions

This is a solid fix that resolves the TODO and implements the missing functionality correctly. The implementation is clean, well-documented, and follows the existing code patterns in the extension.

Recommendation:

  • ✅ Safe to merge after considering the test coverage suggestions
  • The minor suggestions above are optional improvements for robustness, not blockers

Great work on:

  • Clear, understandable code
  • Proper version management
  • Good inline documentation

Review performed by Claude Code - TEN Framework Code Review

@nitin4real nitin4real changed the title Fix/sarvam audio start_ms fix: correct audio start_ms handling Dec 19, 2025
@plutoless plutoless merged commit cd90ec2 into main Dec 19, 2025
34 of 35 checks passed
@plutoless plutoless deleted the fix/sarvam branch December 19, 2025 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants