docs(engine): clarify bytes_scanned holds billing-relevant bytes, not scan volume #222

Merged
hugocorreia90 merged 1 commit into main from
docs/bytes-scanned-semantic-clarification
Apr 22, 2026

@hugocorreia90
Contributor

Summary

Follow-up clarification to #219.

bytes_scanned carries the adapter's billed bytes figure, not literal scan volume. For BigQuery that's totalBytesBilled (with the 10 MB per-query minimum floor), which matches the BigQuery console's "Bytes billed" field — not "Bytes processed". The previous field-level docs either omitted that distinction or buried it; anyone comparing Rocky's output to the warehouse console would have reached for the wrong column.
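To make the floor concrete, here is a minimal sketch of how a "Bytes billed" figure can diverge from "Bytes processed". The function name, the binary-megabyte constant, and the round-up-to-a-whole-MB step are assumptions for illustration (they follow BigQuery's published on-demand rounding as I understand it, and ignore the separate per-table minimum). Per the summary above, Rocky takes the value from totalBytesBilled rather than computing it this way.

```rust
/// Assumed size of one billing "MB" (binary megabyte).
const MIB: u64 = 1024 * 1024;
/// The 10 MB per-query minimum mentioned in the summary.
const MIN_BILLED_BYTES: u64 = 10 * MIB;

/// Illustrative only: approximates "Bytes billed" from "Bytes processed"
/// by rounding up to a whole MiB and then applying the per-query minimum.
fn approx_bytes_billed(bytes_processed: u64) -> u64 {
    if bytes_processed == 0 {
        // Assumption: a query that processes nothing (e.g. a cache hit)
        // is billed nothing, so the floor does not apply.
        return 0;
    }
    let rounded = (bytes_processed + MIB - 1) / MIB * MIB;
    rounded.max(MIN_BILLED_BYTES)
}

fn main() {
    // A 2 KiB scan lands on the 10 MiB floor.
    println!("{}", approx_bytes_billed(2_048));
    // One byte over 25 MiB rounds up to 26 MiB.
    println!("{}", approx_bytes_billed(25 * MIB + 1));
}
```

Under these assumptions a small query shows a few kilobytes under "Bytes processed" but 10 MB under "Bytes billed", which is why the two console columns cannot be used interchangeably when checking bytes_scanned.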

This PR adds adapter-state-neutral docstrings covering all four adapters (BigQuery, Databricks, Snowflake, DuckDB) to every bytes_scanned and bytes_written field declaration — 6 sites across ExecutionStats, ModelExecution, MaterializationOutput, ReplayModelOutput, TraceModelEntry, and PerModelCostHistorical.

The cascade is the whole point

schemars emits /// rustdoc as the description field in generated JSON schemas, so the documented semantic cascades through the just codegen recipe into:

  • schemas/{run,replay,trace,cost}.schema.json: description field
  • integrations/dagster/src/dagster_rocky/types_generated/{run,replay,trace,cost}_schema.py — Pydantic v2 Field(description=...)
  • editors/vscode/src/types/generated/{run,replay,trace,cost}.ts — TypeScript JSDoc

Net effect: the BQ-console-comparison nugget is now visible in VS Code hover (rustdoc), Dagster's Pydantic IDE integrations (Python docstrings), and any downstream JSON schema consumer.
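As a rough illustration of the source end of that cascade, the sketch below shows the shape of a field-level doc comment on a trimmed-down, hypothetical struct. The struct name and the exact wording are assumptions, not the text this PR adds, and the schemars derive is left out so the snippet builds without external crates.

```rust
/// Hypothetical, trimmed-down stand-in for one documented output struct.
/// The real struct would presumably also derive `schemars::JsonSchema`,
/// which is what carries the `///` comments below into the schema.
#[derive(Debug, Default)]
pub struct MaterializationOutputSketch {
    /// Billing-relevant bytes reported by the adapter, not literal scan volume.
    ///
    /// BigQuery: `totalBytesBilled`, which matches the console's
    /// "Bytes billed" field rather than "Bytes processed".
    /// Databricks: `None` until the adapter populates it.
    pub bytes_scanned: Option<u64>,

    /// Bytes written, when the adapter reports a figure (assumed wording).
    pub bytes_written: Option<u64>,
}

fn main() {
    // Only `bytes_scanned` is set; `bytes_written` stays `None`.
    let out = MaterializationOutputSketch {
        bytes_scanned: Some(10 * 1024 * 1024),
        ..Default::default()
    };
    println!("{out:?}");
}
```

Because the comment sits on the field itself, each generated surface can pick it up without a second hand-maintained copy of the wording.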

Files touched

Rust source (6 field sites, 3 files):

  • engine/crates/rocky-core/src/traits.rs: ExecutionStats.{bytes_scanned, bytes_written} (internal adapter-trait type; rustdoc only)
  • engine/crates/rocky-core/src/state.rs: ModelExecution.{bytes_scanned, bytes_written} (persisted mirror; rustdoc only)
  • engine/crates/rocky-cli/src/output.rs:
    • MaterializationOutput.{bytes_scanned, bytes_written} — cascades to run.*
    • ReplayModelOutput.{bytes_scanned, bytes_written} — cascades to replay.*
    • TraceModelEntry.{bytes_scanned, bytes_written} — cascades to trace.*
    • PerModelCostHistorical.{bytes_scanned, bytes_written} — cascades to cost.*

Generated (4 commands × 3 surfaces = 12 files): schemas/{run,replay,trace,cost}.schema.json, integrations/dagster/src/dagster_rocky/types_generated/{run,replay,trace,cost}_schema.py, editors/vscode/src/types/generated/{run,replay,trace,cost}.ts.

Test plan

  • cargo test --workspace — green (doc-only change)
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt --all --check — clean
  • uv run pytest in integrations/dagster/ — 312 passed
  • npm run compile in editors/vscode/ — clean
  • just regen-fixtures — no fixture diff (byte-stable; descriptions live in the schema, not the emitted payload)
  • Spot-checked that the "Bytes billed" / "Bytes processed" phrase appears in all 4 target schemas + 4 Pydantic files + 4 TS files

Notes

  • Zero behavior change. Only rustdoc + the expected codegen cascade.
  • Wording is adapter-state-neutral (e.g. "Databricks: when populated, byte count from the statement-execution manifest; None today until the manifest plumbing lands"), so it stays correct whether the in-flight Databricks bytes_scanned override lands before or after this PR.

… scan volume

bytes_scanned carries the adapter's *billed* bytes figure, not literal
scan volume — for BigQuery that's `totalBytesBilled` (with the 10 MB
per-query minimum floor), which matches the BigQuery console's
"Bytes billed" field, not "Bytes processed". The previous field-level
docs either omitted that distinction or buried it; anyone comparing
Rocky's output to the warehouse console would have reached for the
wrong column.

Added adapter-state-neutral docstrings covering all four adapters
(BigQuery, Databricks, Snowflake, DuckDB) to every `bytes_scanned`
and `bytes_written` field declaration — 6 sites across
`ExecutionStats`, `ModelExecution`, `MaterializationOutput`,
`ReplayModelOutput`, `TraceModelEntry`, and `PerModelCostHistorical`.

The rustdoc cascades via `just codegen` into `description` fields on
`run`/`replay`/`trace`/`cost` JSON schemas, Pydantic v2
`Field(description=...)`, and TypeScript JSDoc — so the documented
semantic is now visible in VS Code hover, Dagster's Pydantic IDE
integrations, and any downstream schema consumer. Zero behavior change
(`just regen-fixtures` confirmed byte-stable).
@hugocorreia90 hugocorreia90 merged commit 1450c50 into main Apr 22, 2026
15 checks passed
@hugocorreia90 hugocorreia90 deleted the docs/bytes-scanned-semantic-clarification branch April 22, 2026 14:55
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0
/ dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

Engine headlines (12 PRs):
- Arc 7 wave 2 complete — cached DESCRIBE end-to-end
  (#223 infra, #228 reads, #230 write tap, #231 discover warm-up,
  #232 state controls + --cache-ttl override)
- Arc 2 wave 3 complete — bytes_scanned / bytes_written on
  MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake
  deferred doc, #222 docstring cascade). Real $ on rocky cost for
  BQ + Databricks
- FR-005 Unity Catalog workspace-binding reconcile (#226)
- FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
- Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

Dagster headlines (4 PRs):
- FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor
  on RockyResource startup (#224)
- FR-003 RockyResource.state_health() (#227) + FR follow-up threading
  doctor(check=state_rw) for sub-second probes (#229)
- RockyResource.cost() wiring + fixture (#218)

VS Code: regenerated TS bindings for engine 1.14.0 type additions.
No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

36 fixtures picked up the new engine version string in their top-level
"version" field. No schema changes — just the version bump.