Improve performance of hive path parsing by using extractKeyValuePairs instead of regex#79067
Conversation
|
There are a few things to be done:
Tests with a DEBUG build: Query execution time went down from 12.734 seconds to 6.843; Tests with std::chrono: Regex impl: Total of 4.909 seconds extractKeyValuePairs impl: Total of 0.379 seconds 13x times faster |
|
Some benchmarks on release build with AWS S3 public data, suffers from latency variations tho: |
|
Benchmark with dataset created by me using a build from #76802 with a few modifications to allow file table function to use hive as well. |
|
Release build chrono benchmark in Total 2.26542 seconds Total 0.229874 seconds |
I have just implemented this thing and ran a benchmark, the gain is very small (because a std::unordered_set still needs to be created to check for duplicates, but we can afford to create a set of references instead of copying the strings). The chrono benchmark for processing 10k files had a difference of 0.035511476 seconds. For 10k files I don't think this matters, but if we start talking about 1mi files, then it'll be a 3.5s difference. |
406f6f2 to
18b8175
Compare
|
@nickitat I am working on a faster implementation with less memory allocations, I'll update the PR soon |
This reverts commit 72214cc.
|
Hi @nickitat. This should be ready for review. The idea of this PR is: use the built-in The original implementation of In this particular case, we know for sure that the input still exists in memory when processing the key-value pairs, so we can afford to return references instead of entire strings. Therefore, I refactored |
|
@nickitat kind ping |
src/Core/Block.cpp
Outdated
|
|
||
|
|
||
| const ColumnWithTypeAndName * Block::findByName(const std::string & name, bool case_insensitive) const | ||
| const ColumnWithTypeAndName * Block::findByName(const std::string_view & name, bool case_insensitive) const |
There was a problem hiding this comment.
| const ColumnWithTypeAndName * Block::findByName(const std::string_view & name, bool case_insensitive) const | |
| const ColumnWithTypeAndName * Block::findByName(std::string_view name, bool case_insensitive) const |
- this one overload should be enough for both cases AFAIU
| { | ||
| extractor.extract(path_without_filename, key_values); | ||
| } | ||
| catch (const extractKV::DuplicateKeyFoundException & ex) |
There was a problem hiding this comment.
why do we throw an exception of this special type to then catch it and transform to a standard CH exception?
There was a problem hiding this comment.
Just so that we can give the user a better exception message since extractKeyValuePairs has no notion of hive partitioning
| for (const auto & item : map) | ||
| { | ||
| auto type = tryInferDataTypeByEscapingRule(item.second, format_settings, FormatSettings::EscapingRule::Raw); | ||
| const std::string key(item.first); |
There was a problem hiding this comment.
add_virtual, VirtualColumnsDescription and ColumnsDescription rely on strings. Specially the last two. I am not sure it is safe to refactor those classes to use references.
Plus, it is not worth the effort. Unlike the other parseHivePartitioningKeysAndValues call, this is inside the getVirtualsForFileLikeStorage function, which is called ocasionally upon storage creation for a single filepath entry.
No performance gain will be observed, I believe
This reverts commit ee9e4e5.
|
|
b8d7ddb
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250430) * Fix Build due to ClickHouse/ClickHouse#79067 * Fix build due to ClickHouse/ClickHouse#79417 --------- Co-authored-by: kyligence-git <[email protected]> Co-authored-by: Chang chen <[email protected]>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve performance of hive path parsing by using
extractKeyValuePairsinstead of regexDocumentation entry for user-facing changes