LargeBinary Migration Doesn't Update Datafusion Schema #11028

@ntjohnson1

Describe the bug
If a dataset was already registered on the cloud with list<list<uint8>> in its schema, accessing that column fails: the registered schema still expects the original type, but we now return list<large_binary>.
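
To illustrate why the two types describe the same bytes in different layouts, here is a minimal pure-Python sketch. It uses no Arrow or Rerun APIs: the helper names `as_binary` and `as_uint8_lists` are hypothetical, and plain `list[int]` and `bytes` values merely stand in for the `list<uint8>` and `large_binary` element types.

```python
def as_binary(blobs: list[list[int]]) -> list[bytes]:
    """Old layout (each blob is a list of uint8 values) -> new layout (each blob is a byte string)."""
    return [bytes(blob) for blob in blobs]


def as_uint8_lists(blobs: list[bytes]) -> list[list[int]]:
    """New layout (byte strings) -> old layout (lists of uint8 values)."""
    return [list(blob) for blob in blobs]


# One cell of the column holding two blobs, written in the old layout.
old_cell = [[0xFF, 0xD8, 0xFF], [0x89, 0x50]]
new_cell = as_binary(old_cell)

# The payload survives a round trip; only the declared type differs.
assert as_uint8_lists(new_cell) == old_cell
```

The data is convertible in both directions, so the failure is about the declared column type, not about the contents.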

To Reproduce
Simpler repro:

  1. Using dataplatform update to point to local rerun version
  2. pixi run dev
  3. Enter a pixi shell with rerun installed (pixi run -e examples) and run:
import rerun as rr
import os
CATALOG_URL = os.environ["REDAP_URI"]

client = rr.catalog.CatalogClient(CATALOG_URL, token=os.getenv('REDAP_TOKEN'))
dataset: rr.catalog.DatasetEntry = client.get_dataset_entry(name="droid:raw")

df = dataset.dataframe_query_view(index="real_time", contents="/thumbnail/camera/wrist").df()
print(df.schema())
bad_result = df.limit(1)
print(bad_result)

Steps to reproduce the behavior:

  1. Launch local notebook from cloud repo and point to existing dataset
  2. Query a column with blob type
>> dataset.dataframe_query_view(index="real_time", contents="/thumbnail/camera/wrist").df().limit(1)
Exception: DataFusion error: Execution("Arrow error: Invalid argument error: column types must match schema types, expected List(Field { name: \"item\", data_type: List(Field { name: \"item\", data_type: UInt8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: \"item\", data_type: LargeBinary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 4")
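
The error above comes from a check that every returned column has exactly the type declared in the schema. A rough pure-Python sketch of that kind of check follows. This is not Arrow's or DataFusion's implementation: the `Field` class, the `first_mismatch` helper, the type strings, and the column names are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str  # type rendered as a string, purely for illustration


def first_mismatch(expected: list[Field], found: list[Field]):
    """Return (index, expected_type, found_type) for the first differing column, or None."""
    for index, (exp, got) in enumerate(zip(expected, found)):
        if exp.data_type != got.data_type:
            return index, exp.data_type, got.data_type
    return None


# Schema stored when the dataset was registered (hypothetical column names).
registered = [Field("real_time", "timestamp"), Field("thumbnail", "list<list<uint8>>")]
# Types of the columns actually produced after the migration.
returned = [Field("real_time", "timestamp"), Field("thumbnail", "list<large_binary>")]

print(first_mismatch(registered, returned))
```

Because the comparison is exact, two types that carry the same bytes are still rejected unless the registered schema is updated or the data is converted back before it is returned.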

Expected behavior
Queries should still work on datasets that were registered before the format migration.

Rerun version

rerun-cli 0.25.0-alpha.1+dev (base map_view nasm native_viewer oss_server release_no_web_viewer) [rustc 1.88.0 (6b00bc388 2025-06-23), LLVM 20.1.5] aarch64-apple-darwin (debug)
Video features: av1 default ffmpeg nasm serde

commit: 241df80

Labels: dataplatform (Rerun Data Platform integration), 🪳 bug (Something isn't working)
