
feat(engine/rocky-databricks): override execute_statement_with_stats with total_byte_count #221

Merged

hugocorreia90 merged 1 commit into main from feat/bytes-scanned-databricks, Apr 22, 2026

Conversation

@hugocorreia90
Contributor

Summary

  • Overrides the default WarehouseAdapter::execute_statement_with_stats on DatabricksWarehouseAdapter so Databricks materializations now surface real byte accounting in MaterializationOutput.bytes_scanned instead of inheriting the all-None stub.
  • Extends Manifest to deserialize total_byte_count from the Databricks SQL Statement Execution response and adds DatabricksConnector::execute_statement_with_stats (stats-aware counterpart to execute_statement) plus a free stats_from_response helper.
  • execute_statement's signature and behaviour are unchanged — the default trait impl keeps delegating to it for callers that don't need stats. Databricks slice of Trust-system Arc 2 wave 3.
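The override shape described above can be sketched as follows. This is a hypothetical illustration, not the crate's real code: the actual `WarehouseAdapter` trait presumably has richer (likely async) signatures, and the hard-coded byte count stands in for reading the Statement Execution response.

```rust
// Hypothetical sketch of the override described in the summary. All
// signatures are simplified assumptions; only the names mirror the PR.

#[derive(Debug, Default, PartialEq)]
struct ExecutionStats {
    bytes_scanned: Option<u64>,
    bytes_written: Option<u64>,
    rows_affected: Option<u64>,
}

trait WarehouseAdapter {
    fn execute_statement(&self, sql: &str) -> Result<(), String>;

    // Default impl: delegate to execute_statement and return the
    // all-None stub. Callers that don't need stats are unaffected.
    fn execute_statement_with_stats(&self, sql: &str) -> Result<ExecutionStats, String> {
        self.execute_statement(sql)?;
        Ok(ExecutionStats::default())
    }
}

struct DatabricksWarehouseAdapter;

impl WarehouseAdapter for DatabricksWarehouseAdapter {
    fn execute_statement(&self, _sql: &str) -> Result<(), String> {
        Ok(())
    }

    // Override: surface a real byte count instead of inheriting the
    // stub. Hard-coded here; the real adapter would read it off the
    // Statement Execution response manifest.
    fn execute_statement_with_stats(&self, sql: &str) -> Result<ExecutionStats, String> {
        self.execute_statement(sql)?;
        Ok(ExecutionStats {
            bytes_scanned: Some(4096),
            ..ExecutionStats::default()
        })
    }
}

fn main() {
    let adapter = DatabricksWarehouseAdapter;
    let stats = adapter.execute_statement_with_stats("SELECT 1").unwrap();
    assert_eq!(stats.bytes_scanned, Some(4096));
    assert_eq!(stats.bytes_written, None);
    println!("bytes_scanned = {:?}", stats.bytes_scanned);
}
```

Because only the default method is overridden, existing callers of `execute_statement` compile and behave exactly as before.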

Why bytes_scanned holds total_byte_count

Matches the #219 naming convention: ExecutionStats.bytes_scanned is the billing-relevant bytes figure for the adapter. Databricks is DBU-priced (not bytes-priced), so total_byte_count isn't a cost driver the way BigQuery's totalBytesBilled is — it's the byte count Databricks natively reports for a statement, surfaced in the bytes_scanned slot so the cost pipeline stays free of adapter-specific branching. Documented inline on stats_from_response.

Scope notes

  • Databricks only. Snowflake slice to follow in a sibling PR. DuckDB / rocky-cli / run.rs unchanged — those were already wired in feat(engine/rocky-bigquery): plumb totalBytesBilled into bytes_scanned #219.
  • No WarehouseAdapter trait changes — only the override.
  • bytes_written stays None; Databricks doesn't expose a bytes-written figure on the Statement Execution response. rows_affected stays None — total_row_count is the result-row count, not the DML-affected-row count.
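The Some/None mapping in the notes above can be sketched as a free helper. `StatementResponse` and `Manifest` here are illustrative stand-ins for the Databricks response types, with only the field names taken from the PR description:

```rust
// Illustrative stand-ins for the Databricks SQL Statement Execution
// response; field names follow the PR description, the rest is assumed.
struct Manifest {
    total_byte_count: Option<u64>,
}

struct StatementResponse {
    manifest: Option<Manifest>,
}

#[derive(Debug, Default, PartialEq)]
struct ExecutionStats {
    bytes_scanned: Option<u64>,
    bytes_written: Option<u64>,
    rows_affected: Option<u64>,
}

// total_byte_count fills the bytes_scanned slot (the #219 convention).
// bytes_written and rows_affected stay None: the response exposes no
// bytes-written figure, and total_row_count counts result rows, not
// DML-affected rows.
fn stats_from_response(resp: &StatementResponse) -> ExecutionStats {
    ExecutionStats {
        bytes_scanned: resp.manifest.as_ref().and_then(|m| m.total_byte_count),
        bytes_written: None,
        rows_affected: None,
    }
}

fn main() {
    // Happy path: manifest carries total_byte_count.
    let resp = StatementResponse {
        manifest: Some(Manifest { total_byte_count: Some(1_048_576) }),
    };
    assert_eq!(stats_from_response(&resp).bytes_scanned, Some(1_048_576));

    // Missing manifest: everything stays None.
    let bare = StatementResponse { manifest: None };
    assert_eq!(stats_from_response(&bare), ExecutionStats::default());
    println!("ok");
}
```

The `and_then` chain keeps the helper total: a missing manifest or a manifest without the field both degrade to the all-None stub rather than erroring.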

Test plan

  • cargo test -p rocky-databricks -p rocky-core -p rocky-cli — all green. 4 new unit tests on stats_from_response + Manifest deserialization (happy path with total_byte_count, missing manifest, manifest-without-total_byte_count).
  • cargo clippy --workspace --all-targets -- -D warnings — clean.
  • cargo fmt --all --check — clean.
  • just codegen — no output-struct change, byte-stable (verified no-op).
  • just regen-fixtures — DuckDB playground doesn't exercise Databricks, byte-stable (verified no-op).
  • uv run pytest in integrations/dagster/ — 312 passed, unaffected.

Files touched

  • engine/crates/rocky-databricks/src/connector.rs — Manifest parses total_byte_count; new stats_from_response helper + DatabricksConnector::execute_statement_with_stats method; 4 new unit tests.
  • engine/crates/rocky-databricks/src/adapter.rs — override WarehouseAdapter::execute_statement_with_stats on DatabricksWarehouseAdapter.

feat(engine/rocky-databricks): override execute_statement_with_stats with total_byte_count

Override the default WarehouseAdapter::execute_statement_with_stats
(added in #219) on DatabricksWarehouseAdapter so Databricks
materializations surface real byte accounting in
MaterializationOutput.bytes_scanned instead of inheriting the all-None
stub.

total_byte_count is the byte count Databricks natively reports on the
Statement Execution response manifest; mapping it into ExecutionStats.
bytes_scanned matches the #219 convention (billing-relevant bytes slot,
even though Databricks is DBU-priced rather than bytes-priced).
execute_statement's signature is unchanged; the default trait impl
continues to delegate to it for callers that don't need stats.

Snowflake slice to follow in a sibling PR.
@hugocorreia90 hugocorreia90 merged commit 8837a27 into main Apr 22, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bytes-scanned-databricks branch April 22, 2026 14:51
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0
/ dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

Engine headlines (12 PRs):
- Arc 7 wave 2 complete — cached DESCRIBE end-to-end
  (#223 infra, #228 reads, #230 write tap, #231 discover warm-up,
  #232 state controls + --cache-ttl override)
- Arc 2 wave 3 complete — bytes_scanned / bytes_written on
  MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake
  deferred doc, #222 docstring cascade). Real $ on rocky cost for
  BQ + Databricks
- FR-005 Unity Catalog workspace-binding reconcile (#226)
- FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
- Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

Dagster headlines (4 PRs):
- FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor
  on RockyResource startup (#224)
- FR-003 RockyResource.state_health() (#227) + FR follow-up threading
  doctor(check=state_rw) for sub-second probes (#229)
- RockyResource.cost() wiring + fixture (#218)

VS Code: regenerated TS bindings for engine 1.14.0 type additions.
No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

36 fixtures picked up the new engine version string in their top-level
"version" field. No schema changes — just the version bump.
