
Read optimization for Iceberg tables using Iceberg metadata#90001

Closed
ianton-ru wants to merge 6 commits into ClickHouse:master from ianton-ru:optimize_count_in_datalake

Conversation

@ianton-ru
Contributor

@ianton-ru ianton-ru commented Nov 13, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Read optimization for Iceberg tables using Iceberg metadata

Documentation entry for user-facing changes

When an Iceberg table comes from an Iceberg catalog, the catalog returns additional metadata about the table, including the row count, null count, and min/max values for the columns in each file.
If some column in some file has zero nulls and its min value equals its max value, we can conclude that this column has the same value in every row of that specific file.
Then we can avoid reading this column from the file and create its values from metadata instead.
This reduces the number of requests to remote storage such as S3 or Azure and, as a result, reduces query execution time.

This PR introduces the setting allow_experimental_iceberg_read_optimization to turn on this optimization.
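The decision rule described above can be sketched as follows; ColumnStats and isConstantInFile are hypothetical names for illustration, not the PR's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical per-file column statistics as an Iceberg catalog might
// report them; illustrative only, not the actual ClickHouse structures.
struct ColumnStats
{
    uint64_t null_count = 0;
    std::string min_value;
    std::string max_value;
};

// A column can be materialized from metadata alone when the file has zero
// nulls for it and min == max, i.e. every row holds the same value.
bool isConstantInFile(const ColumnStats & stats)
{
    return stats.null_count == 0 && stats.min_value == stats.max_value;
}
```

Any column passing this check can be dropped from the read list and reconstructed purely from catalog metadata.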

@alesapin
Member

@ianton-ru I understand the optimization, but does it really optimize any real-world workloads? To me it looks like a very rare situation. Even if a user has such columns in some files, they will very likely read some other columns in the same query, and in that case the benefit will be minuscule.

Not really sure it's worth such specific and quite complex code. Maybe you can provide some context on it?

@ianton-ru
Contributor Author

@alesapin
Yes, the benefit mostly shows up when a query selects only constant columns from files, but it can be great. On our test instance, a query like SELECT value, count() FROM iceberg.table GROUP BY key, where value is constant within each file but differs between files, over nearly 4 billion rows, went from 60+ seconds down to 4 seconds.

@alesapin added the "can be tested" label (Allows running workflows for external contributors) on Nov 14, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Nov 14, 2025

Workflow [PR], commit [8513c3a]

Summary:

  • Build (amd_compat): failure; Cmake configuration: failure (cidb)
  • Stateless tests (arm_binary, parallel): failure; 03711_query_cache_http_introspection: FAIL (cidb)
  • Integration tests (amd_tsan, 5/6): failure; test_storage_nats/test_nats_jet_stream.py::test_nats_overloaded_insert: FAIL (cidb)
  • BuzzHouse (amd_debug): failure; Logical error: 'Inconsistent AST formatting: the query: FAIL (cidb)

@clickhouse-gh bot added the "pr-performance" label (Pull request with some performance improvements) on Nov 14, 2025
@divanik self-requested a review on November 14, 2025 11:40
@divanik self-assigned this on Nov 14, 2025
ianton-ru and others added 4 commits November 18, 2025 10:15
Member

@divanik divanik left a comment


I briefly skimmed the PR and left some comments, but the main question is here: #90001 (comment)

std::vector<Iceberg::ManifestFileEntry> equality_deletes_files;
std::exception_ptr exception;
std::mutex exception_mutex;
Int32 table_schema_id;
Member

Could you clarify why we need this new field here?

Contributor Author

When the schema was changed, some Parquet files can have the old schema and some the new one.
Say, initially the schema had a column name, and file1 was created with that schema.
Later the column was renamed to old_name, a new column name was added, and file2 was created with the new schema.
But the column index stays the same.
Iceberg metadata with min/max values does not contain column names, only indexes.
The query, on the contrary, contains only names, so I need a map column name => metadata.
So I take the following steps to get the correct column metadata:

  • get the current schema (with this table_schema_id)
  • get the column index for each column in the current schema (again with table_schema_id)
  • create a map column name => metadata instead of column index => metadata from the Iceberg metadata.
    All this happens in the new DataFileMetaInfo constructor; the parameter schema_id is this table_schema_id.
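The mapping steps above can be sketched roughly as follows, with SchemaColumn, ColumnMetadata, and mapStatsByName as hypothetical stand-ins for the real structures:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for a current-schema entry and per-column
// statistics; the real structures in the PR differ.
struct SchemaColumn
{
    std::string name;
    int32_t index; // column index in Iceberg metadata, stable across renames
};

struct ColumnMetadata
{
    std::string min_value;
    std::string max_value;
};

// Iceberg file statistics are keyed by column index, while the query refers
// to columns by name from the current schema (selected by table_schema_id),
// so we rebuild the map keyed by name.
std::map<std::string, ColumnMetadata> mapStatsByName(
    const std::vector<SchemaColumn> & current_schema,
    const std::map<int32_t, ColumnMetadata> & stats_by_index)
{
    std::map<std::string, ColumnMetadata> result;
    for (const auto & column : current_schema)
    {
        auto it = stats_by_index.find(column.index);
        if (it != stats_by_index.end())
            result.emplace(column.name, it->second);
    }
    return result;
}
```

In the rename scenario above, index 1 would now resolve to old_name and the new name would resolve to a different index, so stats stay attached to the right column.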

/// Object metadata: size, modification time, etc.
std::optional<ObjectMetadata> metadata;
/// Information about columns
std::optional<DataFileMetaInfoPtr> file_meta_info;
Member

Do we have any reason to store the field file_meta_info in the class RelativePathWithMetadata? If not, please move it to ObjectInfo.

Member

Let's move it either to DataLakeObjectMetadata or IcebergDataObject

Contributor Author

No, just historical reasons. Will move into DataLakeObjectMetadata

Contributor Author

Moved to ObjectInfo. data_lake_metadata in ObjectInfo is optional and can be absent independently.

std::unique_ptr<PullingPipelineExecutor> reader;

public:
std::map<size_t, ConstColumnWithValue> constant_columns_with_values;
Member

At least this field should be constant

Contributor Author

Can be changed in ReaderHolder::operator=

Comment on lines +70 to +85
struct DataFileInfo
{
std::string file_path;
std::optional<DataFileMetaInfoPtr> file_meta_info;

explicit DataFileInfo(const std::string & file_path_)
: file_path(file_path_) {}

explicit DataFileInfo(std::string && file_path_)
: file_path(std::move(file_path_)) {}

bool operator==(const DataFileInfo & rhs) const
{
return file_path == rhs.file_path;
}
};
Member

Seems like an odd abstraction level; let's remove it completely.

constexpr size_t FIELD_MASK_RECT = 0x4;
constexpr size_t FIELD_MASK_ALL = 0x7;

void DataFileMetaInfo::serialize(WriteBuffer & out) const
Member

You need to make these serialization/deserialization functions versioned. There are two approaches: either you introduce a separate protocol version for this structure, or you inherit the protocol version from the enclosing protocol. In the latter case, you name the functions serializeForClusterFunction/deserializeForClusterFunction and take the protocol version from the functions above.

Contributor Author

Sorry, I don't understand. serialize/deserialize is already called only when the protocol version is 5 or greater. Why do we need to check it inside one more time?

/// Not empty when allow_experimental_iceberg_read_optimization=true
/// and some columns were removed from read list as columns with constant values.
/// Restore data for these columns.
for (const auto & constant_column : reader.constant_columns_with_values)
Member

Use structure binding
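For illustration, the suggested structured binding would look roughly like this (ConstColumnWithValue here is a simplified stand-in for the PR's type):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-in for the ConstColumnWithValue type in the PR.
struct ConstColumnWithValue
{
    std::string name;
    std::string value;
};

// A structured binding names both parts of each map entry, avoiding
// opaque .first/.second accesses in the loop body.
std::string listColumnNames(const std::map<size_t, ConstColumnWithValue> & columns)
{
    std::string names;
    for (const auto & [position, column] : columns)
        names += column.name + ";";
    return names;
}
```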

std::map<size_t, ConstColumnWithValue> constant_columns_with_values;
std::unordered_set<String> constant_columns;

NamesAndTypesList requested_columns_copy = read_from_format_info.requested_columns;
Member

Make the copy later, when you actually need it.

Contributor Author

It can be changed in the next block: constant columns are removed from the list.
I can rename it to non_constant_requested_columns.

columns.emplace_back(type->createColumn(), type, name);
builder.init(Pipe(std::make_shared<ConstChunkGenerator>(
std::make_shared<const Block>(columns), *num_rows_from_cache, max_block_size)));
if (!constant_columns.empty())
Member

Could you explain these lines of code?

Contributor Author

Here I remove constant columns from the requested list so that only non-constant columns are read. Constant columns are restored later in StorageObjectStorageSource::generate() from metadata, without reading them from the source file.
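That restore step can be sketched roughly as follows; the container types and function name are simplified stand-ins for the actual chunk handling:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Column = std::vector<std::string>;

// Sketch of the restore step: a chunk was read without the constant columns,
// and each one is re-inserted at its original position, filled with the value
// taken from Iceberg metadata. Types and names here are illustrative.
std::vector<Column> restoreConstantColumns(
    std::vector<Column> chunk,
    const std::map<size_t, std::string> & constants, // position -> constant value
    size_t num_rows)
{
    // std::map iterates in ascending position order, so earlier inserts
    // do not shift the target positions of later ones.
    for (const auto & [position, value] : constants)
        chunk.insert(chunk.begin() + position, Column(num_rows, value));
    return chunk;
}
```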

builder.addSimpleTransform([&](const SharedHeader & header)
{
return std::make_shared<ExtractColumnsTransform>(header, read_from_format_info.requested_columns);
return std::make_shared<ExtractColumnsTransform>(header, requested_columns_copy);
Member

I am not sure this is indeed an optimization. Does this code really ensure that we don't read the constant columns from the file?

Contributor Author

Here requested_columns_copy contains only non-constant columns; the constant columns were removed from this list above.

@divanik marked this pull request as draft on November 20, 2025 15:48
@divanik
Member

divanik commented Nov 20, 2025

Found a major problem, so I am converting the PR to draft until it is resolved.

@divanik
Member

divanik commented Dec 29, 2025

Feel free to reopen the PR and tag me in a DM once the major problem is resolved.

@divanik closed this on Dec 29, 2025

Labels

  • can be tested: Allows running workflows for external contributors
  • pr-performance: Pull request with some performance improvements
