Escape index filenames to prevent broken parts#94079
Escape index filenames to prevent broken parts#94079Algunenano merged 9 commits intoClickHouse:masterfrom
Conversation
|
Workflow [PR], commit [8b3c8fb] Summary: ❌
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical issue where MergeTree indices with non-ASCII characters in their names could create broken data parts. The fix applies proper filename escaping to prevent filesystem-related issues with special characters.
Changes:
- Centralized filename escaping for index names by moving
escapeForFileNamelogic intoIMergeTreeIndex::getFileName() - Updated mutation and size calculation code to use the escaped filenames consistently
- Added a regression test with non-UTF8 and special characters in index names
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Storages/MergeTree/MergeTreeIndices.h | Changed getFileName() from inline to a separate implementation and updated documentation |
| src/Storages/MergeTree/MergeTreeIndices.cpp | Implemented getFileName() with proper filename escaping using escapeForFileName |
| src/Storages/MergeTree/IMergeTreeDataPart.cpp | Removed redundant escaping since it's now handled in getFileName() and added escaping to hasSecondaryIndex() |
| src/Storages/MergeTree/MutateTask.cpp | Updated to use escapeForFileName for index filename construction during mutations |
| tests/queries/0_stateless/03789_skip_index_bad_utf8.sql | Regression test with non-ASCII characters in index names to verify attach/detach works correctly |
| tests/queries/0_stateless/03789_skip_index_bad_utf8.reference | Expected output for the regression test |
Co-authored-by: Copilot <[email protected]>
|
Added a compatibility setting so indices keep working after upgrading (as long as you set |
| node.restart_with_latest_version() | ||
|
|
||
| node.query("SELECT count() FROM test_index_filename WHERE `column` = 24 SETTINGS use_query_condition_cache=0;") | ||
| node.wait_for_log_line("File for index `minmax_index_ESPAÑA` does not exist") |
There was a problem hiding this comment.
Looks like the test is flaky because of this line. Why? Does it fail to discard the index?
There was a problem hiding this comment.
Having a look. Thanks!
There was a problem hiding this comment.
The problem was that when running the test multiple times the test framework starts with the version it was left (master), not with the original declared in the cluster (25.12).
| { | ||
| {"min_columns_to_activate_adaptive_write_buffer", 500, 500, "New setting"}, | ||
| {"materialize_statistics_on_merge", true, true, "New setting"}, | ||
| {"escape_index_filenames", false, true, "Escape non-ascii characters in filenames created for indices"}, |
There was a problem hiding this comment.
I was thinking maybe we can have a better way how to preserve a compatibility e.g. automatically fallback to non-escaped file name in case we didn't find an escaped one on the disk, but no. You solution looks better. JFYI
There was a problem hiding this comment.
I thought the same thing, then I started doing it and it required too many changes in many disconnected places
a763c03
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Escape index filenames to prevent broken parts. With this change ClickHouse will fail to load indices with non-ascii characters in their name created by previous versions. To handle it you can use the merge tree setting
escape_index_filenames.Queries won't fail if indices cannot be loaded, you will see a
Debugmessage like:This message is the same as when the index is not materialized for a part. To workaround this change and be able to load old indices you can alter the table settings (
alter table test modify setting escape_index_filenames=0).Closes #91417
Documentation entry for user-facing changes