-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Do less requests for SHOW TABLES query in DataLakeCatalog #93835
Description
Right now when query SHOW TABLES is executed in DataLakeCatalog database we do two kind of requests:
- List all the tables, for example this one for IcebergRest: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L516.
- For each table from p.1 get metadata about each table, like: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L933.
We do this because of IDatabaseTablesIterator API: https://github.com/ClickHouse/ClickHouse/blob/master/src/Databases/IDatabase.h#L51. It should be able to produce a IStorage object. However for queries like SHOW TABLES and even for most queries for system.tables it's not needed. We already have "lightweight" method for iteration over entities in databases: https://github.com/ClickHouse/ClickHouse/blob/master/src/Databases/IDatabase.h#L273-L276
However the only difference for "lightweight" iterator are some flags for StorageObjectStorage object. We still do a request to catalog for each table:
ClickHouse/src/Databases/DataLake/DatabaseDataLake.cpp
Lines 658 to 681 in 3f85f2a
| for (const auto & table_name : iceberg_tables) | |
| { | |
| if (filter_by_table_name && !filter_by_table_name(table_name)) | |
| continue; | |
| /// NOTE: There are one million of different ways how we can receive | |
| /// weird response from different catalogs. tryGetTableImpl will not | |
| /// throw only in case of expected errors, but sometimes we can receive | |
| /// completely unexpected results for some objects which can be stored | |
| /// in catalogs. But this function is used in SHOW TABLES query which | |
| /// should return at least properly described tables. That is why we | |
| /// have this try/catch here. | |
| try | |
| { | |
| promises.emplace_back(std::make_shared<std::promise<StoragePtr>>()); | |
| futures.emplace_back(promises.back()->get_future()); | |
| pool.scheduleOrThrow( | |
| [this, table_name, skip_not_loaded, context_, promise = promises.back()] mutable | |
| { | |
| StoragePtr storage = nullptr; | |
| try | |
| { | |
| storage = tryGetTableImpl(table_name, context_, true, skip_not_loaded); |
The proposal is to introduce simple structure like "StorageDescription" which will return only some parts of IStorage metadata. At least table name which will be sufficient for "SHOW TABLES" query. Also we will need to introduce new Iterator for this structure and adapt places in code where we already use "lightweight" iterator. Either use information from new really lightweight iterator, or switch to ordinary "getTablesIterator".