
Improve performance of hive path parsing by using extractKeyValuePairs instead of regex #79067

Merged
nickitat merged 38 commits into ClickHouse:master from arthurpassos:use_extract_key_value_pairs_for_hive on Apr 29, 2025

Conversation

@arthurpassos
Contributor

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improve performance of hive path parsing by using extractKeyValuePairs instead of regex

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos
Contributor Author

arthurpassos commented Apr 11, 2025

There are a few things to be done:

  1. Benchmarks
  2. Remove the bool setting; it exists only to make benchmarking easier
  3. Do not return std::unordered_map, use raw columns instead. Keeping the map for now just because it is easier to deal with

Tests with a DEBUG build:

arthur :) SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_extract_kvp_for_hive_path_parsing=0, optimize_count_from_files=0;

SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE (date >= '2025-01-01') AND (date <= '2025-01-31')
GROUP BY date
ORDER BY date ASC
SETTINGS use_extract_kvp_for_hive_path_parsing = 0, optimize_count_from_files = 0

Query id: 997340de-7be7-42af-9c05-0a4914df8440

    ┌─count()─┐
 1. │  292213 │
 2. │  402440 │
 3. │  409341 │
 4. │  432302 │
 5. │  433954 │
 6. │  366260 │
 7. │  352121 │
 8. │  399976 │
 9. │  534013 │
10. │  408769 │
11. │  361190 │
12. │  380525 │
13. │  408248 │
14. │  352684 │
15. │  354014 │
16. │  375439 │
17. │  425661 │
18. │  360666 │
19. │  388509 │
20. │  350291 │
21. │  324412 │
22. │  432369 │
23. │  326010 │
24. │  369243 │
25. │  338988 │
26. │  309651 │
27. │  332102 │
28. │  305953 │
29. │  355332 │
30. │  335134 │
31. │  328684 │
    └─────────┘

31 rows in set. Elapsed: 12.734 sec. Processed 11.55 million rows, 0.00 B (906.78 thousand rows/s., 0.00 B/s.)
Peak memory usage: 589.35 KiB.
arthur :) SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_extract_kvp_for_hive_path_parsing=1, optimize_count_from_files=0;

SELECT count()
FROM s3('http://localhost:11111/aws-public-blockchain/v1.0/btc/transactions/**.parquet', NOSIGN)
WHERE (date >= '2025-01-01') AND (date <= '2025-01-31')
GROUP BY date
ORDER BY date ASC
SETTINGS use_extract_kvp_for_hive_path_parsing = 1, optimize_count_from_files = 0

Query id: 00df9211-93eb-4197-8eba-c3d4c1c7e142

    ┌─count()─┐
 1. │  292213 │
 2. │  402440 │
 3. │  409341 │
 4. │  432302 │
 5. │  433954 │
 6. │  366260 │
 7. │  352121 │
 8. │  399976 │
 9. │  534013 │
10. │  408769 │
11. │  361190 │
12. │  380525 │
13. │  408248 │
14. │  352684 │
15. │  354014 │
16. │  375439 │
17. │  425661 │
18. │  360666 │
19. │  388509 │
20. │  350291 │
21. │  324412 │
22. │  432369 │
23. │  326010 │
24. │  369243 │
25. │  338988 │
26. │  309651 │
27. │  332102 │
28. │  305953 │
29. │  355332 │
30. │  335134 │
31. │  328684 │
    └─────────┘

31 rows in set. Elapsed: 6.843 sec. Processed 11.55 million rows, 0.00 B (1.69 million rows/s., 0.00 B/s.)
Peak memory usage: 481.66 KiB.

arthur :)

Query execution time went down from 12.734 seconds to 6.843 seconds.

Tests with std::chrono:

Regex impl:

[arthur] 2025.04.11 16:34:22.232276 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 804 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:23.033714 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 795 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:23.819555 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 780 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:24.648519 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 823 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:25.487017 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 833 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:26.319090 [ 1645515 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 826 milliseconds to parse hive partitioning using regex on 1000 paths
[arthur] 2025.04.11 16:34:26.369812 [ 1645631 ] {e84c2ecf-6be9-4617-b9e5-14a5c8ebd0a5} <Information> HivePathParsing: Took 48 milliseconds to parse hive partitioning using regex on 55 paths

Total of 4.909 seconds
Average of 810 ms per 1k paths

extractKeyValuePairs impl:

[arthur] 2025.04.11 16:36:01.141189 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 50 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:01.422195 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 56 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:01.707895 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 68 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.002515 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 70 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.281829 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 78 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.428536 [ 1645220 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 51 milliseconds to parse hive partitioning using extractKeyValuePairs on 1000 paths
[arthur] 2025.04.11 16:36:02.438095 [ 1645651 ] {4fa9e444-b7c6-433a-97ec-a3b5ffc6ebfd} <Information> HivePathParsing: Took 6 milliseconds to parse hive partitioning using extractKeyValuePairs on 55 paths

Total of 0.379 seconds
Average of 62 ms per 1k paths

About 13x faster

@clickhouse-gh
Contributor

clickhouse-gh bot commented Apr 11, 2025

Workflow [PR], commit [b737d09]

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Apr 11, 2025
@nickitat nickitat self-assigned this Apr 12, 2025
@arthurpassos
Contributor Author

Some benchmarks on a release build with AWS S3 public data; these suffer from network latency variations, though:

laptop@arthur:~/work/use_extract_key_value_pairs_for_hive$ ./cmake-build-release/programs/clickhouse benchmark -i 10 --cumulative -q "select count() from s3('s3://daylight-openstreetmap/parquet/osm_features/**', NOSIGN, Parquet) where release='v1.54' SETTINGS use_extract_kvp_for_hive_path_parsing=0;"
Loaded 1 queries.

Queries executed: 1.

localhost:9000, queries: 1, QPS: 0.159, RPS: 200560445.161, MiB/s: 0.000, result RPS: 0.159, result MiB/s: 0.000.

0%		6.278 sec.	
10%		6.278 sec.	
20%		6.278 sec.	
30%		6.278 sec.	
40%		6.278 sec.	
50%		6.278 sec.	
60%		6.278 sec.	
70%		6.278 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 2.

localhost:9000, queries: 2, QPS: 0.176, RPS: 221440997.465, MiB/s: 0.000, result RPS: 0.176, result MiB/s: 0.000.

0%		5.091 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.091 sec.	
40%		5.091 sec.	
50%		6.278 sec.	
60%		6.278 sec.	
70%		6.278 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 3.

localhost:9000, queries: 3, QPS: 0.179, RPS: 225753367.290, MiB/s: 0.000, result RPS: 0.179, result MiB/s: 0.000.

0%		5.091 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.357 sec.	
40%		5.357 sec.	
50%		5.357 sec.	
60%		5.357 sec.	
70%		5.357 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 4.

localhost:9000, queries: 4, QPS: 0.183, RPS: 231189179.433, MiB/s: 0.000, result RPS: 0.183, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.049 sec.	
20%		5.091 sec.	
30%		5.091 sec.	
40%		5.091 sec.	
50%		5.357 sec.	
60%		5.357 sec.	
70%		5.357 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 5.

localhost:9000, queries: 5, QPS: 0.185, RPS: 233761637.227, MiB/s: 0.000, result RPS: 0.185, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.049 sec.	
20%		5.091 sec.	
30%		5.091 sec.	
40%		5.145 sec.	
50%		5.145 sec.	
60%		5.145 sec.	
70%		5.357 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 6.

localhost:9000, queries: 6, QPS: 0.186, RPS: 234970936.246, MiB/s: 0.000, result RPS: 0.186, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.145 sec.	
50%		5.219 sec.	
60%		5.219 sec.	
70%		5.357 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 7.

localhost:9000, queries: 7, QPS: 0.187, RPS: 236274438.551, MiB/s: 0.000, result RPS: 0.187, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.145 sec.	
50%		5.150 sec.	
60%		5.219 sec.	
70%		5.219 sec.	
80%		5.357 sec.	
90%		5.357 sec.	
95%		6.278 sec.	
99%		6.278 sec.	
99.9%		6.278 sec.	
99.99%		6.278 sec.	



Queries executed: 8.

localhost:9000, queries: 8, QPS: 0.182, RPS: 229735940.938, MiB/s: 0.000, result RPS: 0.182, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.091 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.150 sec.	
50%		5.219 sec.	
60%		5.219 sec.	
70%		5.357 sec.	
80%		6.278 sec.	
90%		6.278 sec.	
95%		6.542 sec.	
99%		6.542 sec.	
99.9%		6.542 sec.	
99.99%		6.542 sec.	



Queries executed: 10.

localhost:9000, queries: 10, QPS: 0.184, RPS: 231702525.701, MiB/s: 0.000, result RPS: 0.184, result MiB/s: 0.000.

0%		5.049 sec.	
10%		5.077 sec.	
20%		5.091 sec.	
30%		5.145 sec.	
40%		5.150 sec.	
50%		5.219 sec.	
60%		5.219 sec.	
70%		5.320 sec.	
80%		5.357 sec.	
90%		6.278 sec.	
95%		6.542 sec.	
99%		6.542 sec.	
99.9%		6.542 sec.	
99.99%		6.542 sec.	
laptop@arthur:~/work/use_extract_key_value_pairs_for_hive$ ./cmake-build-release/programs/clickhouse benchmark -i 10 --cumulative -q "select count() from s3('s3://daylight-openstreetmap/parquet/osm_features/**', NOSIGN, Parquet) where release='v1.54' SETTINGS use_extract_kvp_for_hive_path_parsing=1;"
Loaded 1 queries.

Queries executed: 1.

localhost:9000, queries: 1, QPS: 0.178, RPS: 224878983.229, MiB/s: 0.000, result RPS: 0.178, result MiB/s: 0.000.

0%		5.598 sec.	
10%		5.598 sec.	
20%		5.598 sec.	
30%		5.598 sec.	
40%		5.598 sec.	
50%		5.598 sec.	
60%		5.598 sec.	
70%		5.598 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 2.

localhost:9000, queries: 2, QPS: 0.204, RPS: 256598082.655, MiB/s: 0.000, result RPS: 0.204, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.212 sec.	
30%		4.212 sec.	
40%		4.212 sec.	
50%		5.598 sec.	
60%		5.598 sec.	
70%		5.598 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 3.

localhost:9000, queries: 3, QPS: 0.213, RPS: 268699134.012, MiB/s: 0.000, result RPS: 0.213, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.212 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.243 sec.	
60%		4.243 sec.	
70%		4.243 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 4.

localhost:9000, queries: 4, QPS: 0.218, RPS: 274512020.430, MiB/s: 0.000, result RPS: 0.218, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.243 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.288 sec.	
60%		4.288 sec.	
70%		4.288 sec.	
80%		4.288 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 5.

localhost:9000, queries: 5, QPS: 0.221, RPS: 278759436.251, MiB/s: 0.000, result RPS: 0.221, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.212 sec.	
20%		4.235 sec.	
30%		4.235 sec.	
40%		4.243 sec.	
50%		4.243 sec.	
60%		4.243 sec.	
70%		4.288 sec.	
80%		4.288 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 6.

localhost:9000, queries: 6, QPS: 0.223, RPS: 280933693.296, MiB/s: 0.000, result RPS: 0.223, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.235 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.288 sec.	
60%		4.288 sec.	
70%		4.301 sec.	
80%		4.301 sec.	
90%		5.598 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 7.

localhost:9000, queries: 7, QPS: 0.224, RPS: 282464949.698, MiB/s: 0.000, result RPS: 0.224, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.235 sec.	
30%		4.243 sec.	
40%		4.243 sec.	
50%		4.288 sec.	
60%		4.301 sec.	
70%		4.301 sec.	
80%		4.309 sec.	
90%		4.309 sec.	
95%		5.598 sec.	
99%		5.598 sec.	
99.9%		5.598 sec.	
99.99%		5.598 sec.	



Queries executed: 8.

localhost:9000, queries: 8, QPS: 0.216, RPS: 272610478.286, MiB/s: 0.000, result RPS: 0.216, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.235 sec.	
30%		4.243 sec.	
40%		4.288 sec.	
50%		4.301 sec.	
60%		4.301 sec.	
70%		4.309 sec.	
80%		5.598 sec.	
90%		5.598 sec.	
95%		5.747 sec.	
99%		5.747 sec.	
99.9%		5.747 sec.	
99.99%		5.747 sec.	



Queries executed: 10.

localhost:9000, queries: 10, QPS: 0.217, RPS: 273842130.509, MiB/s: 0.000, result RPS: 0.217, result MiB/s: 0.000.

0%		4.212 sec.	
10%		4.235 sec.	
20%		4.243 sec.	
30%		4.288 sec.
40%		4.301 sec.	
50%		4.309 sec.	
60%		4.309 sec.	
70%		4.399 sec.	
80%		4.527 sec.	
90%		5.598 sec.	
95%		5.747 sec.	
99%		5.747 sec.	
99.9%		5.747 sec.	
99.99%		5.747 sec.

@arthurpassos
Contributor Author

Benchmark with a dataset created by me, using a build from #76802 with a few modifications to allow the file table function to use hive partitioning as well.

arthur :) INSERT INTO TABLE FUNCTION file(
    'file_root',
    'Parquet',
    'year UInt16, country String, name String'
)
PARTITION BY (year, country)
SELECT
    toUInt16(2000 + number % 1000) AS year, 
    concat('Country', toString(number % 100)) AS country,
    concat('Name', toString(number)) AS name
FROM numbers(100000);
./cmake-build-release/programs/clickhouse benchmark -i 50 --cumulative -q "select count() from file('file_root/**', 'Parquet', 'year UInt16, country String, name String') where year=2980 SETTINGS use_extract_kvp_for_hive_path_parsing=0;"

Queries executed: 50.

localhost:9000, queries: 50, QPS: 2.323, RPS: 232.329, MiB/s: 0.000, result RPS: 2.323, result MiB/s: 0.000.

0%		0.356 sec.	
10%		0.378 sec.	
20%		0.404 sec.	
30%		0.415 sec.	
40%		0.422 sec.	
50%		0.430 sec.	
60%		0.433 sec.	
70%		0.440 sec.	
80%		0.446 sec.	
90%		0.461 sec.	
95%		0.488 sec.	
99%		0.534 sec.	
99.9%		0.534 sec.	
99.99%		0.534 sec.	
laptop@arthur:~/work/use_extract_key_value_pairs_for_hive$ ./cmake-build-release/programs/clickhouse benchmark -i 50 --cumulative -q "select count() from file('file_root/**', 'Parquet', 'year UInt16, country String, name String') where year=2980 SETTINGS use_extract_kvp_for_hive_path_parsing=1;"

Queries executed: 50.

localhost:9000, queries: 50, QPS: 5.436, RPS: 543.630, MiB/s: 0.000, result RPS: 5.436, result MiB/s: 0.000.

0%		0.157 sec.	
10%		0.164 sec.	
20%		0.166 sec.	
30%		0.169 sec.	
40%		0.174 sec.	
50%		0.178 sec.	
60%		0.179 sec.	
70%		0.181 sec.	
80%		0.186 sec.	
90%		0.191 sec.	
95%		0.213 sec.	
99%		0.281 sec.	
99.9%		0.281 sec.	
99.99%		0.281 sec.	

@arthurpassos
Contributor Author

Release build chrono benchmark (times in microseconds):

[arthur] 2025.04.12 10:44:38.763259 [ 2782422 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> S3Client: Found region us-west-2 for bucket daylight-openstreetmap
[arthur] 2025.04.12 10:44:41.258421 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 284537 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:41.526399 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 265389 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:41.786028 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 253837 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.051724 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 244544 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.334001 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 229650 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.618122 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 240470 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:42.898153 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 243169 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:43.188188 [ 2783559 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 254853 us to parse 1000 hive paths with regex
[arthur] 2025.04.12 10:44:43.439767 [ 2783550 ] {9d150efb-c4fd-4f26-946f-fe71817f4b86} <Information> Arthur: Took 248971 us to parse 1000 hive paths with regex

Total 2.26542 seconds

[arthur] 2025.04.12 10:43:32.816585 [ 2782422 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> S3Client: Resolving region for bucket daylight-openstreetmap
[arthur] 2025.04.12 10:43:33.466107 [ 2782422 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> S3Client: Found region us-west-2 for bucket daylight-openstreetmap
[arthur] 2025.04.12 10:43:35.746444 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 21277 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.042181 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 21721 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.331335 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 34715 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.588042 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 18957 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:36.871638 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 26434 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.143553 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 28175 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.478662 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 35328 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.755174 [ 2782393 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 23737 us to parse 1000 hive paths with extractKeyValuePairs
[arthur] 2025.04.12 10:43:37.990836 [ 2782395 ] {90979d57-62d5-4a6b-838b-5ee81aabd782} <Information> Arthur: Took 19530 us to parse 1000 hive paths with extractKeyValuePairs

Total 0.229874 seconds

@arthurpassos
Contributor Author

  1. do not return std::unordered_map, use raw columns instead. Keeping the map for now just because it is easier to deal with

I have just implemented this and ran a benchmark; the gain is very small, because a std::unordered_set still needs to be created to check for duplicates. But we can afford to create a set of references instead of copying the strings.

The chrono benchmark for processing 10k files had a difference of 0.035511476 seconds.

For 10k files I don't think this matters, but if we start talking about 1 million files, then it becomes a 3.5-second difference.

@arthurpassos arthurpassos force-pushed the use_extract_key_value_pairs_for_hive branch from 406f6f2 to 18b8175 Compare April 12, 2025 19:16
@arthurpassos
Contributor Author

@nickitat I am working on a faster implementation with fewer memory allocations; I'll update the PR soon.

@arthurpassos
Contributor Author

Hi @nickitat. This should be ready for review.

The idea of this PR is to use the built-in extractKeyValuePairs to extract the hive partition columns instead of regex. Using extractKeyValuePairs alone already provided a big performance boost, but it was still not optimal.

The original implementation of extractKeyValuePairs would copy the key-value pairs into two String output columns (Map(String, String)).

In this particular case, we know for sure that the input still exists in memory while the key-value pairs are being processed, so we can afford to return references instead of entire strings.

Therefore, I refactored extractKeyValuePairs a bit to accommodate yet another "strategy": one that returns an absl::flat_hash_map<std::string_view, std::string_view> instead of a Map(String, String) column.

@arthurpassos
Contributor Author

@nickitat kind ping



const ColumnWithTypeAndName * Block::findByName(const std::string & name, bool case_insensitive) const
const ColumnWithTypeAndName * Block::findByName(const std::string_view & name, bool case_insensitive) const
Member


Suggested change
const ColumnWithTypeAndName * Block::findByName(const std::string_view & name, bool case_insensitive) const
const ColumnWithTypeAndName * Block::findByName(std::string_view name, bool case_insensitive) const
This one overload should be enough for both cases, AFAIU.

Contributor Author


Done

{
extractor.extract(path_without_filename, key_values);
}
catch (const extractKV::DuplicateKeyFoundException & ex)
Member


why do we throw an exception of this special type to then catch it and transform to a standard CH exception?

Contributor Author


Just so that we can give the user a better exception message, since extractKeyValuePairs has no notion of hive partitioning.

for (const auto & item : map)
{
auto type = tryInferDataTypeByEscapingRule(item.second, format_settings, FormatSettings::EscapingRule::Raw);
const std::string key(item.first);
Member


do they have to be strings?

Contributor Author


add_virtual, VirtualColumnsDescription and ColumnsDescription rely on strings, especially the last two. I am not sure it is safe to refactor those classes to use references.

Plus, it is not worth the effort. Unlike the other parseHivePartitioningKeysAndValues call, this one is inside the getVirtualsForFileLikeStorage function, which is called occasionally upon storage creation for a single filepath entry.

No performance gain would be observed, I believe.

@nickitat
Member

nickitat commented Apr 29, 2025

AST Fuzzer - #79451
Other jobs succeeded after restart, but their statuses were not updated.

@nickitat nickitat added this pull request to the merge queue Apr 29, 2025
Merged via the queue into ClickHouse:master with commit b8d7ddb Apr 29, 2025
117 of 122 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 29, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Apr 30, 2025
baibaichen pushed a commit to apache/gluten that referenced this pull request Apr 30, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250430)

* Fix Build due to ClickHouse/ClickHouse#79067

* Fix build due to ClickHouse/ClickHouse#79417

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
