Pipeline for converting recording-optimized OWAMcap files into training-optimized HuggingFace Datasets.
See DEMO.md for a complete walkthrough with example data.
Our pipeline converts 300+ hours of data from OWAMcap to FSL in under 1 hour by never reading or decoding media files during conversion.
| Stage | Script | Output | Format |
|---|---|---|---|
| 1 | `01_raw_to_event.py` | Event Dataset | RLDS-Event (timestamp + event per row) |
| 2 | `02_event_to_fsl.py` | FSL Dataset | FSL (tokens + images per row) |
For converting to traditional step-based formats compatible with RLDS or LeRobot, see `event_to_binned.py`.
Why this approach?
Existing data formats are optimized for either recording or training, but not both:
- Recording-oriented (rosbag, mcap): Great for capture, but not directly usable for ML training
- Training-oriented (TFDS, RLDS, LeRobot): Great for training, but impractical for recording raw sensor streams
Optimizing for both simultaneously is fundamentally impossible. Our solution: define multiple formats along the recording→training spectrum and convert progressively.
Our pipeline: OWAMcap → RLDS-Event → FSL Dataset
- RLDS-Event: Similar to RLDS, but each row is an event (with nanosecond timestamp) rather than a step. No information loss from binning/grouping.
- FSL Dataset (Fixed Sequence Length): Similar to conversational format commonly used in VLM fine-tuning—each row contains a sequence and its associated images. The difference is that FSL is pre-tokenized and episode-aware packed, eliminating runtime overhead.
| Feature | Our Pipeline | RLDS | LeRobotDataset |
|---|---|---|---|
| Episode-aware packing | ✓ | ✗ | ✗ |
| Video encoding | ✓ | ✗ | ✓ |
| Multi-rate sensor support | ✓ | ✗ | ✗ |
| Discrete event support | ✓ | ✗ | ✗ |
Our Pipeline = OWAMcap → RLDS-Event → FSL Dataset
Notes:
- Episode-aware packing: Sequence packing is a well-established technique (NVIDIA NeMo, HuggingFace TRL) that eliminates padding waste—NeMo reports up to 10x FLOPs improvement and 6x training time reduction. Standard packing concatenates unrelated samples; we make it episode-aware by concatenating temporally adjacent events within the same episode. This preserves sequential context, enabling models to learn from history (e.g., previous frames, prior actions); see the sketch after these notes.
- Video encoding: OWAMcap uses MediaRef to reference video-encoded frames without re-encoding.
- Multi-rate sensor / Discrete event support: Other formats using "step" as a row require a global fixed rate for the entire table, forcing binning/grouping. This prevents multi-rate sensors and discrete events from being stored as-is.
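To make episode-aware packing concrete, here is a minimal sketch. It is illustrative only: `pack_episode_aware` and the event field names are hypothetical, not this repository's API. Tokenized events are packed into fixed-length sequences, and the buffer is flushed at episode boundaries so that no sequence ever mixes two episodes.

```python
from itertools import groupby

def pack_episode_aware(events, max_len, pad_id=0):
    """Pack tokenized events into fixed-length sequences, one episode at a time.

    events: iterable of dicts with 'episode_path', 'timestamp_ns', 'token_ids'.
    """
    def pad(seq):
        # Pad (or, for a pathologically long event, truncate) to max_len.
        return (seq + [pad_id] * max_len)[:max_len]

    # Sort so events from the same episode are adjacent and time-ordered.
    events = sorted(events, key=lambda e: (e["episode_path"], e["timestamp_ns"]))
    for _, episode in groupby(events, key=lambda e: e["episode_path"]):
        buffer = []
        for event in episode:
            if buffer and len(buffer) + len(event["token_ids"]) > max_len:
                yield pad(buffer)  # flush the full sequence
                buffer = []
            buffer.extend(event["token_ids"])
        if buffer:
            yield pad(buffer)  # episode boundary: never pack across episodes
```

Standard packing would keep filling the buffer with whatever sample comes next; the per-episode flush is what preserves temporal context.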
Stage 1 (`01_raw_to_event.py`) converts raw MCAP files into a flat, event-oriented HuggingFace Dataset. Each row is a single event (screen frame, key press, mouse move, etc.) with a nanosecond timestamp.
```bash
python scripts/01_raw_to_event.py \
  --config configs/mcap_to_event_example.yaml \
  --input_dir /path/to/mcap/files \
  --output_dir /path/to/event-dataset
```

Schema:
| Column | Type | Description |
|---|---|---|
| `episode_path` | string | Source MCAP file path |
| `topic` | string | Event topic (screen, keyboard, mouse, etc.) |
| `timestamp_ns` | int64 | Timestamp in nanoseconds |
| `message_type` | string | Message type identifier |
| `mcap_message` | binary | Serialized message bytes |
Features: Rate limiting per topic, topic filtering, train/test splitting
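As a quick sanity check (a sketch, assuming the output path from the command above), the result loads like any HuggingFace dataset:

```python
from owa.data.datasets import load_from_disk

dataset = load_from_disk("/path/to/event-dataset")
row = dataset["train"][0]
# Columns from the schema above; `mcap_message` stays serialized bytes
# until a transform decodes it (see the transforms section below).
print(row["episode_path"], row["topic"], row["timestamp_ns"], row["message_type"])
```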
Stage 2 (`02_event_to_fsl.py`) converts the Event Dataset into the Fixed Sequence Length (FSL) format with pre-computed tokenization.
```bash
python scripts/02_event_to_fsl.py \
  --config configs/internvl3_example.yaml \
  --input_dir /path/to/event-dataset \
  --output_dir /path/to/fsl-dataset
```

Schema:
| Column | Type | Description |
|---|---|---|
| `input_ids` | sequence[int] | Pre-tokenized token IDs |
| `attention_mask` | sequence[int] | Attention mask (1 = valid, 0 = padding) |
| `texts` | string | Raw text (for debugging) |
| `images` | sequence[string] | Serialized ScreenCaptured messages (JSON) |
| `episode_path` | string | Source episode path |
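To verify the fixed-length property, a row can be inspected directly (a sketch under the same assumptions as above):

```python
from owa.data.datasets import load_from_disk

dataset = load_from_disk("/path/to/fsl-dataset")
row = dataset["train"][0]
# Each row carries a pre-tokenized sequence plus the image references it needs.
assert len(row["input_ids"]) == len(row["attention_mask"])
print(len(row["input_ids"]), "tokens,", len(row["images"]), "image refs")
```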
For compatibility with existing robotics frameworks (RLDS, LeRobot), you can convert the Event Dataset to a time-binned step format:
```bash
python scripts/event_to_binned.py \
  --input-dir /path/to/event-dataset \
  --output-dir /path/to/binned-dataset \
  --fps 10 \
  --filter-empty-actions
```

Schema:
| Column | Type | Description |
|---|---|---|
| `episode_path` | string | Source MCAP file path |
| `bin_idx` | int32 | Time bin index |
| `timestamp_ns` | int64 | Bin start timestamp |
| `state` | sequence[binary] | Screen events in this bin |
| `actions` | sequence[binary] | Action events in this bin |
When to use: If your training code expects state-action pairs similar to RLDS or LeRobot.
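A minimal consumption loop might look like the sketch below (illustrative only; `state` and `actions` hold serialized event bytes unless you apply the `binned` transform shown later):

```python
from owa.data.datasets import load_from_disk

dataset = load_from_disk("/path/to/binned-dataset")
for row in dataset["train"]:
    # One row per time bin: screen events as state, input events as actions.
    states, actions = row["state"], row["actions"]
    ...  # feed the (state, action) pair to an RLDS/LeRobot-style training loop
```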
Building custom pipelines? See ARCHITECTURE.md for the full API reference — high-level functions (`build_event_dataset`, `build_fsl_dataset`) and low-level components (`IntervalExtractor`, `EventResampler`, `EventEncoder`, `Tokenization`).
The `owa.data.tokenization` module converts MCAP events to/from token sequences for VLM training.
```python
from owa.data.tokenization import (
    ImageTokenConfig, EventTokenizationContext,
    expand_tokenizer_for_events, tokenize_event, decode_episode,
)
from owa.data.encoders import create_encoder

# Setup (once, with side effects)
encoder = create_encoder("factorized")
image_config = ImageTokenConfig(prefix="<img>", token="<IMG_CONTEXT>", length=256, suffix="</img>")
expand_tokenizer_for_events(tokenizer, encoder, image_config)

# Create immutable context for tokenization
ctx = EventTokenizationContext(encoder, tokenizer, image_config)

# Encode: McapMessage → TokenizedEvent
result = tokenize_event(ctx, mcap_msg)
print(result["token_ids"])  # List[int]

# Decode: token IDs → McapMessages
for msg in decode_episode(ctx, result["token_ids"]):
    print(msg.topic, msg.timestamp)
```

Design:
- `expand_tokenizer_for_events()` — Side-effect function, mutates the tokenizer (call once)
- `EventTokenizationContext` — Immutable container, validates that the tokenizer has the required tokens
- `tokenize_event()`, `decode_episode()` — Pure functions, no side effects
Raw datasets contain binary MCAP messages; transforms convert them to a training-ready format on the fly using HuggingFace's `set_transform()`.
```python
from owa.data.datasets import load_from_disk

# Event Dataset
dataset = load_from_disk("/path/to/event-dataset")
dataset["train"].auto_set_transform(stage="event", encoder_type="hierarchical", load_images=True)

# FSL Dataset
dataset = load_from_disk("/path/to/fsl-dataset")
dataset["train"].auto_set_transform(stage="fsl", load_images=True)

# Binned Dataset
dataset = load_from_disk("/path/to/binned-dataset")
dataset["train"].auto_set_transform(stage="binned", instruction="Complete the computer task")
```

Example loaders:

- `scripts/single_shuffle_loader.py` — Single-GPU training
- `scripts/multi_gpu_loader.py` — Distributed multi-GPU training
- nanoVLM Sequence Packing — Sequence packing reference
- olmo-core FSLDataset — FSL implementation reference
- HuggingFace Datasets — Dataset handling foundation
- RLDS, LeRobot — Robotics dataset formats
- rosbag, mcap — Recording formats