Skip to content

Do less requests for SHOW TABLES query in DataLakeCatalog #93835

@alesapin

Description

@alesapin

Right now when query SHOW TABLES is executed in DataLakeCatalog database we do two kind of requests:

  1. List all the tables, for example this one for IcebergRest: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L516.
  2. For each table from p.1 get metadata about each table, like: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L933.

We do this because of IDatabaseTablesIterator API: https://github.com/ClickHouse/ClickHouse/blob/master/src/Databases/IDatabase.h#L51. It should be able to produce a IStorage object. However for queries like SHOW TABLES and even for most queries for system.tables it's not needed. We already have "lightweight" method for iteration over entities in databases: https://github.com/ClickHouse/ClickHouse/blob/master/src/Databases/IDatabase.h#L273-L276

However the only difference for "lightweight" iterator are some flags for StorageObjectStorage object. We still do a request to catalog for each table:

for (const auto & table_name : iceberg_tables)
{
if (filter_by_table_name && !filter_by_table_name(table_name))
continue;
/// NOTE: There are one million of different ways how we can receive
/// weird response from different catalogs. tryGetTableImpl will not
/// throw only in case of expected errors, but sometimes we can receive
/// completely unexpected results for some objects which can be stored
/// in catalogs. But this function is used in SHOW TABLES query which
/// should return at least properly described tables. That is why we
/// have this try/catch here.
try
{
promises.emplace_back(std::make_shared<std::promise<StoragePtr>>());
futures.emplace_back(promises.back()->get_future());
pool.scheduleOrThrow(
[this, table_name, skip_not_loaded, context_, promise = promises.back()] mutable
{
StoragePtr storage = nullptr;
try
{
storage = tryGetTableImpl(table_name, context_, true, skip_not_loaded);
.

The proposal is to introduce simple structure like "StorageDescription" which will return only some parts of IStorage metadata. At least table name which will be sufficient for "SHOW TABLES" query. Also we will need to introduce new Iterator for this structure and adapt places in code where we already use "lightweight" iterator. Either use information from new really lightweight iterator, or switch to ordinary "getTablesIterator".

Metadata

Metadata

Labels

comp-datalakeData lake table formats (Iceberg/Delta/Hudi) integration.performance

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions