
Improve performance of hive path parsing by using extractKeyValuePairs instead of regex #79067

Merged
nickitat merged 38 commits into ClickHouse:master from arthurpassos:use_extract_key_value_pairs_for_hive on Apr 29, 2025

Conversation

@arthurpassos
Contributor

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improve performance of hive path parsing by using extractKeyValuePairs instead of regex

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos
Contributor Author

arthurpassos commented Apr 11, 2025

There are a few things to be done:

  1. Benchmarks
  2. Remove the bool setting; it exists only to make benchmarking easier
  3. Do not return std::unordered_map, use raw columns instead. Keeping the map for now just because it is easier to deal with

Tests with a DEBUG build:

arthur :) SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_extract_kvp_for_hive_path_parsing=0, optimize_count_from_files=0;

SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE (date >= '2025-01-01') AND (date <= '2025-01-31')
GROUP BY date
ORDER BY date ASC
SETTINGS use_extract_kvp_for_hive_path_parsing = 0, optimize_count_from_files = 0

Query id: 997340de-7be7-42af-9c05-0a4914df8440

    ┌─count()─┐
 1. │  292213 │
 2. │  402440 │
 3. │  409341 │
 4. │  432302 │
 5. │  433954 │
 6. │  366260 │
 7. │  352121 │
 8. │  399976 │
 9. │  534013 │
10. │  408769 │
11. │  361190 │
12. │  380525 │
13. │  408248 │
14. │  352684 │
15. │  354014 │
16. │  375439 │
17. │  425661 │
18. │  360666 │
19. │  388509 │
20. │  350291 │
21. │  324412 │
22. │  432369 │
23. │  326010 │
24. │  369243 │
25. │  338988 │
26. │  309651 │
27. │  332102 │
28. │  305953 │
29. │  355332 │
30. │  335134 │
31. │  328684 │
    └─────────┘

31 rows in set. Elapsed: 12.734 sec. Processed 11.55 million rows, 0.00 B (906.78 thousand rows/s., 0.00 B/s.)
Peak memory usage: 589.35 KiB.
arthur :) SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_extract_kvp_for_hive_path_parsing=1, optimize_count_from_files=0;

SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE (date >= '2025-01-01') AND (date <= '2025-01-31')
GROUP BY date
ORDER BY date ASC
SETTINGS use_extract_kvp_for_hive_path_parsing = 1, optimize_count_from_files = 0

Query id: 00df9211-93eb-4197-8eba-c3d4c1c7e142

    ┌─count()─┐
 1. │  292213 │
 2. │  402440 │
 3. │  409341 │
 4. │  432302 │
 5. │  433954 │
 6. │  366260 │
 7. │  352121 │
 8. │  399976 │
 9. │  534013 │
10. │  408769 │
11. │  361190 │
12. │  380525 │
13. │  408248 │
14. │  352684 │
15. │  354014 │
16. │  375439 │
17. │  425661 │
18. │  360666 │
19. │  388509 │
20. │  350291 │
21. │  324412 │
22. │  432369 │
23. │  326010 │
24. │  369243 │
25. │  338988 │
26. │  309651 │
27. │  332102 │
28. │  305953 │
29. │  355332 │
30. │  335134 │
31. │  328684 │
    └─────────┘

31 rows in set. Elapsed: 6.843 sec. Processed 11.55 million rows, 0.00 B (1.69 million rows/s., 0.00 B/s.)
Peak memory usage: 481.66 KiB.

arthur :)

Query execution time went down from 12.734 seconds to 6.843 seconds.

Tests with std::chrono:

Regex impl:

[arthur] 2025.04.11 16:34:22.232276 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 804 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:23.033714 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 795 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:23.819555 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 780 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:24.648519 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 823 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:25.487017 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 833 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:26.319090 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 826 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:26.369812 [ 1645631 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 48 milliseconds to parse hive partitioning using regex on 55 paths

Total of 4.909 seconds
Average of 810 ms per 1k paths

extractKeyValuePairs impl:

[arthur] 2025.04.11 16:36:01.141189 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 50 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:01.422195 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 56 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:01.707895 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 68 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.002515 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 70 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.281829 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 78 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.428536 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 51 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.438095 [ 1645651 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 6 milliseconds to parse hive partitioning using extractKeyValuePairs on 55 paths

Total of 0.379 seconds
Average of 62 ms per 1k paths

About 13x faster

@clickhouse-gh
Contributor

clickhouse-gh bot commented Apr 11, 2025

Workflow [PR], commit [b737d09]

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Apr 11, 2025
@nickitat nickitat self-assigned this Apr 12, 2025
@arthurpassos
Contributor Author

Some benchmarks on a release build with AWS S3 public data; these suffer from network latency variations, though:

laptop@arthur:~/work/use_extract_key_value_pairs_for_hive$ ./cmake-build-release/programs/clickhouse benchmark -i 10 --cumulative -q "select count() from s3('s3://daylight-openstreetmap/parquet/osm_features/**', NOSIGN, Parquet) where release='v1.54' SETTINGS use_extract_kvp_for_hive_path_parsing=0;"
Loaded 1 queries.

Queries executed: 1.

localhost:9000, queries: 1, QPS: 0.159, RPS: 200560445.161, MiB/s: 0.000, result RPS: 0.159, result MiB/s: 0.000.

0%		6.278 sec.	
10%		6.278 sec.	
20%		6.278 sec.	
30%		6.278 sec.	
40%		6.278 sec.	
50%		6.278 sec.	
60%		6.278 sec.	
70%		6.278 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 2.

localhost:9000, queries: 2, QPS: 0.176, RPS: 221440997.465, MiB/s: 0.000, result RPS: 0.176, result MiB/s: 0.000.

0%		5.091 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.091 sec.	
40%		5.091 sec.	
50%		6.278 sec.	
60%		6.278 sec.	
70%		6.278 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 3.

localhost:9000, queries: 3, QPS: 0.179, RPS: 225753367.290, MiB/s: 0.000, result RPS: 0.179, result MiB/s: 0.000.

0%		5.091 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.357 sec.	
40%		5.357 sec.	
50%		5.357 sec.	
60%		5.357 sec.	
70%		5.357 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 4.

localhost:9000, queries: 4, QPS: 0.183, RPS: 231189179.433, MiB/s: 0.000, result RPS: 0.183, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.049 sec.	
20%		5.091 sec.	
30%		5.091 sec.	
40%		5.091 sec.	
50%		5.357 sec.	
60%		5.357 sec.	
70%		5.357 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 5.

localhost:9000, queries: 5, QPS: 0.185, RPS: 233761637.227, MiB/s: 0.000, result RPS: 0.185, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.049 sec.	
20%		5.091 sec.	
30%		5.091 sec.	
40%		5.145 sec.	
50%		5.145 sec.	
60%		5.145 sec.	
70%		5.357 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 6.

localhost:9000, queries: 6, QPS: 0.186, RPS: 234970936.246, MiB/s: 0.000, result RPS: 0.186, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.145 sec.	
50%		5.219 sec.	
60%		5.219 sec.	
70%		5.357 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 7.

localhost:9000, queries: 7, QPS: 0.187, RPS: 236274438.551, MiB/s: 0.000, result RPS: 0.187, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.145 sec.	
50%		5.150 sec.	
60%		5.219 sec.	
70%		5.219 sec.	
80%		5.357 sec.	
90%		5.357 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 8.

localhost:9000, queries: 8, QPS: 0.182, RPS: 229735940.938, MiB/s: 0.000, result RPS: 0.182, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.150 sec.	
50%		5.219 sec.	
60%		5.219 sec.	
70%		5.357 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.542 sec.	
99%		6.542 sec.	
99.9%		6.542 sec.	
99.99%		6.542 sec.	



Queries executed: 10.

localhost:9000, queries: 10, QPS: 0.184, RPS: 231702525.701, MiB/s: 0.000, result RPS: 0.184, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.077 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.150 sec.	
50%		5.219 sec.	
60%		5.219 sec.	
70%		5.320 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.542 sec.	
99%		6.542 sec.	
99.9%		6.542 sec.	
99.99%		6.542 sec.	
laptop@arthur:~/work/use_extract_key_value_pairs_for_hive$ ./cmake-build-release/programs/clickhouse benchmark -i 10 --cumulative -q "select count() from s3('s3://daylight-openstreetmap/parquet/osm_features/**', NOSIGN, Parquet) where release='v1.54' SETTINGS use_extract_kvp_for_hive_path_parsing=1;"
Loaded 1 queries.

Queries executed: 1.

localhost:9000, queries: 1, QPS: 0.178, RPS: 224878983.229, MiB/s: 0.000, result RPS: 0.178, result MiB/s: 0.000.

0%		5.598 sec.	
10%		5.598 sec.	
20%		5.598 sec.	
30%		5.598 sec.	
40%		5.598 sec.	
50%		5.598 sec.	
60%		5.598 sec.	
70%		5.598 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 2.

localhost:9000, queries: 2, QPS: 0.204, RPS: 256598082.655, MiB/s: 0.000, result RPS: 0.204, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.212 sec.	
30%		4.212 sec.	
40%		4.212 sec.	
50%		5.598 sec.	
60%		5.598 sec.	
70%		5.598 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 3.

localhost:9000, queries: 3, QPS: 0.213, RPS: 268699134.012, MiB/s: 0.000, result RPS: 0.213, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.212 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.243 sec.	
60%		4.243 sec.	
70%		4.243 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 4.

localhost:9000, queries: 4, QPS: 0.218, RPS: 274512020.430, MiB/s: 0.000, result RPS: 0.218, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.243 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.288 sec.	
60%		4.288 sec.	
70%		4.288 sec.	
80%		4.288 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 5.

localhost:9000, queries: 5, QPS: 0.221, RPS: 278759436.251, MiB/s: 0.000, result RPS: 0.221, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.235 sec.	
30%		4.235 sec.	
40%		4.243 sec.	
50%		4.243 sec.	
60%		4.243 sec.	
70%		4.288 sec.	
80%		4.288 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 6.

localhost:9000, queries: 6, QPS: 0.223, RPS: 280933693.296, MiB/s: 0.000, result RPS: 0.223, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.235 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.288 sec.	
60%		4.288 sec.	
70%		4.301 sec.	
80%		4.301 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 7.

localhost:9000, queries: 7, QPS: 0.224, RPS: 282464949.698, MiB/s: 0.000, result RPS: 0.224, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.235 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.288 sec.	
60%		4.301 sec.	
70%		4.301 sec.	
80%		4.309 sec.	
90%		4.309 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 8.

localhost:9000, queries: 8, QPS: 0.216, RPS: 272610478.286, MiB/s: 0.000, result RPS: 0.216, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.235 sec.	
30%		4.243 sec.	
40%		4.288 sec.	
50%		4.301 sec.	
60%		4.301 sec.	
70%		4.309 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.747 sec.	
99%		5.747 sec.	
99.9%		5.747 sec.	
99.99%		5.747 sec.	



Queries executed: 10.

localhost:9000, queries: 10, QPS: 0.217, RPS: 273842130.509, MiB/s: 0.000, result RPS: 0.217, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.243 sec.	
30%		4.288 sec.
40%		4.301 sec.	
50%		4.309 sec.	
60%		4.309 sec.	
70%		4.399 sec.	
80%		4.527 sec.	
90%		5.598 sec.	
95%		5.747 sec.	
99%		5.747 sec.	
99.9%		5.747 sec.	
99.99%		5.747 sec.

@arthurpassos
Contributor Author

Benchmark with a dataset created by me, using a build from #76802 with a few modifications to allow the file table function to use hive partitioning as well.

arthur :) INSERT INTO TABLE FUNCTION file(
    'file_root',
    'Parquet',
    'year UInt16, country String, name String'
)
PARTITION BY (year, country)
SELECT
    toUInt16(2000 + number % 1000) AS year, 
    concat('Country', toString(number % 100)) AS country,
    concat('Name', toString(number)) AS name
FROM numbers(100000);
./cmake-build-release/programs/clickhouse benchmark -i 50 --cumulative -q "select count() from file('file_root/**', 'Parquet', 'year UInt16, country String, name String') where year=2980 SETTINGS use_extract_kvp_for_hive_path_parsing=0;"

Queries executed: 50.

localhost:9000, queries: 50, QPS: 2.323, RPS: 232.329, MiB/s: 0.000, result RPS: 2.323, result MiB/s: 0.000.

0%		0.356 sec.	
10%		0.378 sec.	
20%		0.404 sec.	
30%		0.415 sec.	
40%		0.422 sec.	
50%		0.430 sec.	
60%		0.433 sec.	
70%		0.440 sec.	
80%		0.446 sec.	
90%		0.461 sec.	
95%		0.488 sec.	
99%		0.534 sec.	
99.9%		0.534 sec.	
99.99%		0.534 sec.	
laptop@arthur:~/work/use_extract_key_value_pairs_for_hive$ ./cmake-build-release/programs/clickhouse benchmark -i 50 --cumulative -q "select count() from file('file_root/**', 'Parquet', 'year UInt16, country String, name String') where year=2980 SETTINGS use_extract_kvp_for_hive_path_parsing=1;"

Queries executed: 50.

localhost:9000, queries: 50, QPS: 5.436, RPS: 543.630, MiB/s: 0.000, result RPS: 5.436, result MiB/s: 0.000.

0%		0.157 sec.	
10%		0.164 sec.	
20%		0.166 sec.	
30%		0.169 sec.	
40%		0.174 sec.	
50%		0.178 sec.	
60%		0.179 sec.	
70%		0.181 sec.	
80%		0.186 sec.	
90%		0.191 sec.	
95%		0.213 sec.	
99%		0.281 sec.	
99.9%		0.281 sec.	
99.99%		0.281 sec.	

@arthurpassos
Contributor Author

Release build chrono benchmark (times in microseconds):

[arthur] 2025.04.12 10:44:38.763259 [ 2782422 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> S3Client: Found region us-west-2 for bucket daylight-openstreetmap
[arthur] 2025.04.12 10:44:41.258421 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 284537 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:41.526399 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 265389 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:41.786028 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 253837 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.051724 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 244544 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.334001 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 229650 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.618122 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 240470 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.898153 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 243169 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:43.188188 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 254853 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:43.439767 [ 2783550 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 248971 us to parse 1000 hive paths with regex

Total 2.26542 seconds

[arthur] 2025.04.12 10:43:32.816585 [ 2782422 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> S3Client: Resolving region for bucket daylight-openstreetmap
[arthur] 2025.04.12 10:43:33.466107 [ 2782422 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> S3Client: Found region us-west-2 for bucket daylight-openstreetmap
[arthur] 2025.04.12 10:43:35.746444 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 21277 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.042181 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 21721 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.331335 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 34715 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.588042 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 18957 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.871638 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 26434 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.143553 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 28175 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.478662 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 35328 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.755174 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 23737 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.990836 [ 2782395 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 19530 us to parse 1000 hive paths with extractKeyValuePairs

Total 0.229874 seconds

@arthurpassos
Contributor Author

  1. do not return std::unordered_map, use raw columns instead. Keeping the map for now just because it is easier to deal with

I have just implemented this and ran a benchmark; the gain is very small, because a std::unordered_set still needs to be created to check for duplicates. But we can afford to create a set of references instead of copying the strings.

The chrono benchmark for processing 10k files had a difference of 0.035511476 seconds.

For 10k files I don't think this matters, but if we start talking about 1 million files, then it becomes a 3.5-second difference.

@arthurpassos arthurpassos force-pushed the use_extract_key_value_pairs_for_hive branch from 406f6f2 to 18b8175 Compare April 12, 2025 19:16
@arthurpassos
Contributor Author

@nickitat I am working on a faster implementation with fewer memory allocations; I'll update the PR soon.

@arthurpassos
Contributor Author

Hi @nickitat. This should be ready for review.

The idea of this PR is to use the built-in extractKeyValuePairs to extract the hive partition columns instead of regex. Using extractKeyValuePairs alone already provided a big performance boost, but it was still not optimal.

The original implementation of extractKeyValuePairs would copy the key-value pairs into two String output columns (Map(String, String)).

In this particular case, we know for sure that the input still exists in memory while the key-value pairs are being processed, so we can afford to return references instead of entire strings.

Therefore, I refactored extractKeyValuePairs a bit to accommodate yet another "strategy": one that returns an absl::flat_hash_map<std::string_view, std::string_view> instead of a Map(String, String) column.

@arthurpassos
Contributor Author

@nickitat kind ping



const ColumnWithTypeAndName * Block::findByName(const std::string & name, bool case_insensitive) const
const ColumnWithTypeAndName * Block::findByName(const std::string_view & name, bool case_insensitive) const
Member


Suggested change
const ColumnWithTypeAndName * Block::findByName(const std::string_view & name, bool case_insensitive) const
const ColumnWithTypeAndName * Block::findByName(std::string_view name, bool case_insensitive) const
This one overload should be enough for both cases, AFAIU.

Contributor Author


Done

{
extractor.extract(path_without_filename, key_values);
}
catch (const extractKV::DuplicateKeyFoundException & ex)
Member


why do we throw an exception of this special type to then catch it and transform to a standard CH exception?

Contributor Author


Just so that we can give the user a better exception message, since extractKeyValuePairs has no notion of hive partitioning.

for (const auto & item : map)
{
auto type = tryInferDataTypeByEscapingRule(item.second, format_settings, FormatSettings::EscapingRule::Raw);
const std::string key(item.first);
Member


do they have to be strings?

Contributor Author


add_virtual, VirtualColumnsDescription and ColumnsDescription rely on strings, especially the last two. I am not sure it is safe to refactor those classes to use references.

Plus, it is not worth the effort. Unlike the other parseHivePartitioningKeysAndValues call, this one is inside the getVirtualsForFileLikeStorage function, which is called occasionally upon storage creation for a single filepath entry.

No performance gain would be observed, I believe.

@nickitat
Member

nickitat commented Apr 29, 2025

AST Fuzzer - #79451
Other jobs succeeded after restart, but their statuses were not updated.

@nickitat nickitat added this pull request to the merge queue Apr 29, 2025
Merged via the queue into ClickHouse:master with commit b8d7ddb Apr 29, 2025
117 of 122 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 29, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Apr 30, 2025
baibaichen pushed a commit to apache/gluten that referenced this pull request Apr 30, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250430)

* Fix Build due to ClickHouse/ClickHouse#79067

* Fix build due to ClickHouse/ClickHouse#79417

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
