Skip to content

Escape index filenames to prevent broken parts#94079

Merged
Algunenano merged 9 commits intoClickHouse:masterfrom
Algunenano:idx_utf8
Jan 23, 2026
Merged

Escape index filenames to prevent broken parts#94079
Algunenano merged 9 commits intoClickHouse:masterfrom
Algunenano:idx_utf8

Conversation

@Algunenano
Copy link
Copy Markdown
Member

@Algunenano Algunenano commented Jan 13, 2026

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Escape index filenames to prevent broken parts. With this change ClickHouse will fail to load indices with non-ascii characters in their name created by previous versions. To handle it you can use the merge tree setting escape_index_filenames.

Queries won't fail if indices cannot be loaded, you will see a Debug message like:

File for index `minmax_index_ÑÑÑÑÑÑ` does not exist (./store/019/0199d890-b99b-4c29-97d1-b037a85fe65d/all_2_2_0/skp_idx_minmax_index_%C3%91%C3%91%C3%91%C3%91%C3%91%C3%91.*). Skipping it.

This message is the same as when the index is not materialized for a part. To workaround this change and be able to load old indices you can alter the table settings (alter table test modify setting escape_index_filenames=0).

Closes #91417

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 13, 2026

Workflow [PR], commit [8b3c8fb]

Summary:

job_name test_name status info comment
Integration tests (amd_tsan, 3/6) failure
test_restore_db_replica/test.py::test_query_after_restore_db_replica[alter table-no exists table-no restart] FAIL cidb IGNORED
BuzzHouse (amd_debug) failure
Logical error: 'Inconsistent AST formatting: the query: (STID: 1941-1bfa) FAIL cidb, issue ISSUE EXISTS
Stateless tests (arm_asan, targeted) error IGNORED
BuzzHouse (amd_ubsan) error IGNORED

@clickhouse-gh clickhouse-gh bot added the pr-backward-incompatible Pull request with backwards incompatible changes label Jan 13, 2026
@nikitamikhaylov nikitamikhaylov self-assigned this Jan 13, 2026
@Algunenano Algunenano requested a review from Copilot January 14, 2026 12:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical issue where MergeTree indices with non-ASCII characters in their names could create broken data parts. The fix applies proper filename escaping to prevent filesystem-related issues with special characters.

Changes:

  • Centralized filename escaping for index names by moving escapeForFileName logic into IMergeTreeIndex::getFileName()
  • Updated mutation and size calculation code to use the escaped filenames consistently
  • Added a regression test with non-UTF8 and special characters in index names

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/Storages/MergeTree/MergeTreeIndices.h Changed getFileName() from inline to a separate implementation and updated documentation
src/Storages/MergeTree/MergeTreeIndices.cpp Implemented getFileName() with proper filename escaping using escapeForFileName
src/Storages/MergeTree/IMergeTreeDataPart.cpp Removed redundant escaping since it's now handled in getFileName() and added escaping to hasSecondaryIndex()
src/Storages/MergeTree/MutateTask.cpp Updated to use escapeForFileName for index filename construction during mutations
tests/queries/0_stateless/03789_skip_index_bad_utf8.sql Regression test with non-ASCII characters in index names to verify attach/detach works correctly
tests/queries/0_stateless/03789_skip_index_bad_utf8.reference Expected output for the regression test

@Algunenano Algunenano requested a review from Copilot January 16, 2026 12:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.

@Algunenano
Copy link
Copy Markdown
Member Author

Added a compatibility setting so indices keep working after upgrading (as long as you set compatibility or escape_index_filenames correctly)

@Algunenano Algunenano marked this pull request as ready for review January 16, 2026 12:29
node.restart_with_latest_version()

node.query("SELECT count() FROM test_index_filename WHERE `column` = 24 SETTINGS use_query_condition_cache=0;")
node.wait_for_log_line("File for index `minmax_index_ESPAÑA` does not exist")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the test is flaky because of this line. Why? Does it fail to discard the index?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a look. Thanks!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was that when running the test multiple times the test framework starts with the version it was left (master), not with the original declared in the cluster (25.12).

{
{"min_columns_to_activate_adaptive_write_buffer", 500, 500, "New setting"},
{"materialize_statistics_on_merge", true, true, "New setting"},
{"escape_index_filenames", false, true, "Escape non-ascii characters in filenames created for indices"},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking maybe we can have a better way how to preserve a compatibility e.g. automatically fallback to non-escaped file name in case we didn't find an escaped one on the disk, but no. You solution looks better. JFYI

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same thing, then I started doing it and it required too many changes in many disconnected places

@Algunenano Algunenano added this pull request to the merge queue Jan 23, 2026
Merged via the queue into ClickHouse:master with commit a763c03 Jan 23, 2026
127 of 132 checks passed
@Algunenano Algunenano deleted the idx_utf8 branch January 23, 2026 14:39
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backward-incompatible Pull request with backwards incompatible changes pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Indices with bad utf-8 names break parts

4 participants