Skip to content

Add setting to query Iceberg tables as of a specific timestamp#71072

Merged
20 commits merged intoClickHouse:masterfrom
bretthoerner:bretthoerner/iceberg-query-timestamp
Mar 19, 2025
Merged

Add setting to query Iceberg tables as of a specific timestamp#71072
20 commits merged intoClickHouse:masterfrom
bretthoerner:bretthoerner/iceberg-query-timestamp

Conversation

@bretthoerner
Copy link
Copy Markdown
Contributor

@bretthoerner bretthoerner commented Oct 25, 2024

Open questions:

Having this be a setting is probably not the correct solution, but if it's going to be part of the SQL syntax then I think I'd need some ideas for where it could fit in before I spin my wheels there. And it might affect much more than just the Iceberg backend...

If a setting is OK, I think it should probably change to a string that I parse as a DateTime? If so, I'm trying to figure out where exactly to do that translation, so I can weave the necessary timezone information down (I'm assuming I'd want to use something like parseDateTime64BestEffort?).


Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add setting to query Iceberg tables as of a specific timestamp

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
  • Exclude: AST fuzzers

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Oct 25, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-feature Pull request with new product feature label Oct 25, 2024
@robot-ch-test-poll3
Copy link
Copy Markdown
Contributor

robot-ch-test-poll3 commented Oct 25, 2024

This is an automated comment for commit f0e6bdf with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here❌ failure
Successful checks
Check nameDescriptionStatus
Docs checkBuilds and tests the documentation✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success

@divanik divanik self-assigned this Oct 28, 2024
@bretthoerner bretthoerner force-pushed the bretthoerner/iceberg-query-timestamp branch from 7b7d1aa to 89d9d50 Compare October 28, 2024 18:26
@bretthoerner bretthoerner changed the title Add option to query Iceberg tables as of a specific timestamp Add setting to query Iceberg tables as of a specific timestamp Oct 28, 2024
@bretthoerner bretthoerner force-pushed the bretthoerner/iceberg-query-timestamp branch 3 times, most recently from 911f4a8 to 7ba8b02 Compare October 29, 2024 16:02
Enabling this setting can lead to incorrect result as in case of evolved schema all data files will be read using the same schema.
:::
)", 0) \
DECLARE(Int64, iceberg_query_at_timestamp_ms, 0, R"(
Copy link
Copy Markdown
Contributor

@Enmk Enmk Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would be better to create a new setting field type (add class SettingFieldDateTime into SettingField.h ). With the new SettingFieldDateTime you have complete control over parsed of setting value from string, see SettingFieldNumber for some inspiration.

This line would be then changed to:

Suggested change
DECLARE(Int64, iceberg_query_at_timestamp_ms, 0, R"(
DECLARE(DateTime, iceberg_query_at_timestamp, 0, R"(

Copy link
Copy Markdown
Contributor Author

@bretthoerner bretthoerner Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this out and I'd like to do this but might need some more guidance on navigating the metaprogramming going on here. 🫠

I'll try again next week.

@melvynator
Copy link
Copy Markdown
Member

@melvynator melvynator mentioned this pull request Nov 3, 2024
divanik
divanik previously requested changes Nov 4, 2024
break;
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you, please, add chassert (CH version of assert) that schema_id == current_schema_id = metadata_object->getValue("current-schema-id")), please? Or you can throw a LOGICAL_ERROR in case it is not

Copy link
Copy Markdown
Contributor Author

@bretthoerner bretthoerner Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this, but it's breaking other tests that explicitly look for other errors:

assert "UNSUPPORTED_METHOD" in error

This is because we already have code (that still works as expected on my branch) ensuring the schema_id is the same and that no schema migrations have happened:

https://github.com/bretthoerner/ClickHouse/blob/d5e1798474b068546b86143dbf1631980c2df755/src/Storages/ObjectStorage/DataLakes/IcebergMetadata.cpp#L286-L309

If this makes sense to you, I'll remove my extra chassert? Thanks!

"SELECT number, toString(number + 1) FROM numbers(200)"
)

assert (
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, please, create a new test (you can do it directly in this file). Secondly, we need better testing here. You can insert several blocks with some lag and verify the table state between these insertions

{
const auto * iceberg_metadata = dynamic_cast<const IcebergMetadata *>(&other);
return iceberg_metadata && getVersion() == iceberg_metadata->getVersion();
return iceberg_metadata && getManifestListFile() == iceberg_metadata->getManifestListFile();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this change, tbh. I think that we shouldn't say that two metadata objects are equal if they have the same manifest lists but different table schema or partition schema

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use a tuple or something and also compare the schema and partition schema.

The problem with the code as it existed is that it only compared the metadata version, but the top-level metadata contains a history of many snapshots, schemas, etc. And so if you try to query an older version (via timestamp) it would do this comparison, decide the older snapshot you're trying to query is the same as the current snapshot, and then noop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a tuple of metadata_version, schema_id, snapshot_id now, although in some sense I think I could just use snapshot_id since it implies the rest. 🤷

@bretthoerner bretthoerner force-pushed the bretthoerner/iceberg-query-timestamp branch 2 times, most recently from 342f403 to 7749a8e Compare November 8, 2024 19:38
@bretthoerner bretthoerner force-pushed the bretthoerner/iceberg-query-timestamp branch from 7749a8e to d5e1798 Compare November 15, 2024 20:56
@bretthoerner bretthoerner requested a review from divanik November 18, 2024 19:02
@divanik
Copy link
Copy Markdown
Member

divanik commented Nov 25, 2024

Hi @bretthoerner! I noticed some integration tests need attention in your PR. Would you mind taking a look at them? You can find the details by clicking Click here in this box.

On the failed Integration Tests page, you'll see all the failing tests. No need to worry about the filesystem cache test - I'll look into that one myself.

I noticed there's a sanitizer issue in one of the failed tests. You can run tests with sanitizer enabled using:
cmake ....... -DSANITIZE=address/undefined/thread

It would be really helpful if you could run the integration tests locally as well. Here's our guide on how to do that.

Please don't hesitate to reach out if you have any questions - I'm happy to help!

@bretthoerner
Copy link
Copy Markdown
Contributor Author

@divanik Thanks, I missed the ASAN, I'll check it out now. Some of failures are known though and I called them out here, I'm curious for your thoughts. (Basically, the new assert is duplicating existing checks we have.)

@bretthoerner bretthoerner force-pushed the bretthoerner/iceberg-query-timestamp branch from d5e1798 to 1e6f59f Compare November 25, 2024 16:16
@bretthoerner
Copy link
Copy Markdown
Contributor Author

The current test failures look like an OOM/CI issue? At least, they don't seem related to Iceberg at all. Is it possible to rerun just those tests somehow?

@divanik
Copy link
Copy Markdown
Member

divanik commented Dec 9, 2024

Hi @bretthoerner,
I'm planning another big refactoring of this file this week. After that, you'll need to pull master and resolve conflicts (it wouldn't make sense to do it before the refactoring, imho). After that we will be able to merge the feature

@alexey-milovidov alexey-milovidov mentioned this pull request Dec 31, 2024
76 tasks
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Feb 18, 2025

Dear @divanik, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@divanik divanik self-assigned this Feb 24, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Feb 27, 2025

Workflow [PR], commit [6d86aac]

@divanik
Copy link
Copy Markdown
Member

divanik commented Mar 4, 2025

Hi. I did a separate PR because I read the docs and haven't understood how I can reveal info about schema evolution from snapshot-log so I preferred iteration through metadata (and all the logic is another one because of this fact). But this work was very useful, I reused some functions and integration tests from this PR, so thank you for your contribution.

@divanik divanik dismissed their stale review March 11, 2025 15:27

No longer relevant

@divanik divanik assigned divanik and unassigned divanik Mar 11, 2025
@github-merge-queue github-merge-queue bot closed this pull request by merging all changes into ClickHouse:master in a487db3 Mar 19, 2025
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants