Skip to content

CASSANDRA-20389 prevent too long table names not fitting file names#4038

Closed
k-rus wants to merge 3 commits intoapache:trunkfrom
k-rus:cass-20389-long-table-name
Closed

CASSANDRA-20389 prevent too long table names not fitting file names#4038
k-rus wants to merge 3 commits intoapache:trunkfrom
k-rus:cass-20389-long-table-name

Conversation

@k-rus
Copy link
Copy Markdown
Contributor

@k-rus k-rus commented Apr 3, 2025

The length of table names was not controlled. This is likely due to confusion between validation functions with similar names. As result creating a table with too long name led to the too long file name exception during the table creation.

This commit adds a validation of table name lengths to avoid the too long file name errors. The validation length is based on how the table name is used to create file/directory name and needs to be exact to prevent the too long file name exception, but allow all other table names, which don't lead to the too long file name exception. Since this limit is different from the existing name length limit of 48 characters, the validation of lengths is separated from the common validation function, which checks for empty name and valid characters. This separation also allows to be specific with errors in number of cases.

Note that there will be more PRs with more cases for different name length:

  • Index names participate in file names together with index meta information such as index type, version and includes optional parameters, thus the length will vary on exact index created.

New constants are added for the table name length limit and for the file name length limit.
Two similar validation functions are replaced with one to avoid confusion. Name related constants are stored in the same utility class.

Table name related code are moved into methods in TableMetadata class, so their semantics are more clear and to allow reuse, e.g., in asserting the table name length constant.
Tests are added for the long table names and non-alphanumeric names.

Keyspace name validation function is shared between two classes and a unit test of it is added.

Other changes in this commit:

  • Fix few minor suggestions for code improvements in affected files
  • Add git-ignoring VS code generated files

patch by Ruslan Fomkin; reviewed by Piotr Kołaczkowski, Dmitry Konstantinov, Maxwell Guo for CASSANDRA-20389

@k-rus k-rus changed the title [WIP] CASSANDRA-20389 prevent too long table names to fit file names [WIP] CASSANDRA-20389 prevent too long table names not fitting file names Apr 4, 2025
@k-rus k-rus changed the title [WIP] CASSANDRA-20389 prevent too long table names not fitting file names CASSANDRA-20389 prevent too long table names not fitting file names Apr 4, 2025
@k-rus k-rus marked this pull request as ready for review April 4, 2025 09:28
Copy link
Copy Markdown
Contributor

@pkolaczk pkolaczk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few minor code-style nitpicks ;)

Comment thread src/java/org/apache/cassandra/schema/TableMetadata.java Outdated
Comment thread src/java/org/apache/cassandra/schema/SchemaConstants.java Outdated
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Apr 23, 2025

@smiklosovic @pkolaczk I pushed a commit to separate name length validations from the validation for emptiness and characters. The rational is:

  • Different names require different length validations and some validations will be more dynamic, e.g., table name in 4.x or index name.
  • It gives opportunity to separate errors on length and on characters.
    My understanding is that originally any name should fit 48 character, thus it made sense to have the length validation in the same function as the character validation. However, since the length validation was missed for some names, there can be longer names in existing databases, and thus 48 characters cannot be applied.

What do you think about the last implementation?

@k-rus k-rus force-pushed the cass-20389-long-table-name branch from 5f8ac66 to 6fc8152 Compare April 23, 2025 15:53
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Apr 23, 2025

I updated the PR title to include the latest changes.

@k-rus k-rus requested review from pkolaczk and smiklosovic April 23, 2025 20:21
@smiklosovic
Copy link
Copy Markdown
Contributor

i wont be available whole next week

@smiklosovic
Copy link
Copy Markdown
Contributor

smiklosovic commented Apr 25, 2025

@k-rus see? I love the current approach! I will try to review this, definitely after 5th May, maybe next week but as I said availability will be limited. You are free to ask another commiter for the second review if you want.

edit: I put some comments in.

Comment thread src/java/org/apache/cassandra/schema/TableMetadata.java Outdated
Comment thread src/java/org/apache/cassandra/schema/TableMetadata.java Outdated
Comment thread src/java/org/apache/cassandra/schema/SchemaConstants.java
Comment thread src/java/org/apache/cassandra/schema/KeyspaceMetadata.java Outdated
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Apr 26, 2025

@k-rus see? I love the current approach! I will try to review this, definitely after 5th May, maybe next week but as I said availability will be limited. You are free to ask another commiter for the second review if you want.

edit: I put some comments in.

I confirm that I understand that you will have possibility to review after next week. I think it's better to wait until you are back. Meanwhile I addressed some suggestions and replied to another.

if (!isValidName(name))
except("Table name must not be empty or not contain non-alphanumeric-underscore characters (got \"%s\")", name);

if (name.length() > TABLE_NAME_LENGTH)
Copy link
Copy Markdown
Contributor

@smiklosovic smiklosovic Apr 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little bit off, I would expect that if isValidTableName is true, then I will not be required to have yet another check where I check if name.length() > TABLE_NAME_LENGTH. That might be part of that already no?

What might happen is that a dev calls "isValidTableName(name)" and if it returns true, why would they check if length is smaller than TABLE_NAME_LENGTH. They just forget to do this. Because they just got "true" from "isValidTableName" .

In other words, by having the logic spread here like this, "isValidName" is not actually telling truth, does it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two checks and the assert are replaced with a single method call to validateTableName, which also encapsulates throwing an exception on invalid table name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @smiklosovic , do you agree with @k-rus 's fix ? changing the name from "!isValidName(name)" to "isValidCharsName"

if (name.length() > TABLE_NAME_LENGTH)
except("Table name must not be more than %d characters long (got %d characters for \"%s\")", TABLE_NAME_LENGTH, name.length(), name);

assert getTableDirectoryName().length() <= FILENAME_LENGTH : String.format("Generated directory name for a table of %d characters doesn't fit the max filename legnth of %s. This unexpectedly wasn't prevented by check of the table name length, %d, to fit %d characters (got table name \"%s\" and generated directory name \"%s\"", getTableDirectoryName().length(), FILENAME_LENGTH, name.length(), TABLE_NAME_LENGTH, name, getTableDirectoryName());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with this. If I got true from "isValidTableName" there is no way I would go and check this as well.

That means what we would move three "get" methods bellow to SchemaConstants as well and accommodate their signatures to reflect the changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with the suggestion of moving getTableDirectoryName into the utility class. As it operate on a table name and used with TableMetadata instance.

@k-rus k-rus force-pushed the cass-20389-long-table-name branch 4 times, most recently from a8f20f7 to 85e976b Compare May 10, 2025 11:18
@k-rus k-rus requested a review from smiklosovic May 10, 2025 13:58
swasik added a commit to swasik/scylladb that referenced this pull request May 28, 2025
Currently the table name length is limited to 48 characters, same way
as all the other names. This was compatible with Cassandra 3 but due
to a bug the validation of this constraint was removed in
Cassandra 4 and 5 (see CASSANDRA-20389). Nowadays, some of the usage
scenarios depends on the lack of limitation in Cassandra, such as
some feature store workflows generating long table names.

This path introduces the behavior that is compatible with
Cassandra 4 and 5. It limits table name to 222 characters becasue
when the existing code creates a directory to store the new table,
this directory's name is created by sstables::init_table_storage(),
which takes the table's full name, a dash (-), and a 32-byte UUID
string. Because most Linux filesystems limit filename components
to 255 bytes, this means that any table name longer than 222 bytes
will attempt to mkdir() a directory name longer than the allowed
255 bytes, and fail.

Exactly the same limit (222 characters) is proposed in the patch
fixing directory creation failures in Cassandra 4 and 5
(see github.com/apache/cassandra/pull/4038, SchemaConstants.java).
swasik added a commit to swasik/scylladb that referenced this pull request Jun 2, 2025
Currently the name length of tables, materialized views, keyspaces
and indexes is limited to 48 characters. This was compatible with
Cassandra 3 but due to a bug the validation of this constraints was
removed in Cassandra 4 and 5 (see CASSANDRA-20389). Nowadays,
some of the usage scenarios depends on the lack of limitation in
Cassandra, such as some feature store workflows generating long
table names.

This path introduces the behavior that is compatible with
Cassandra 4 and 5. It limits names to 218 characters because
when the existing code creates a directory to store the new table,
this directory's name is created by sstables::init_table_storage(),
which takes the table's full name, a dash (-), and a 32-byte UUID
string. Because most Linux filesystems limit filename components
to 255 bytes, this means that any table name longer than 222 bytes
will attempt to mkdir() a directory name longer than the allowed
255 bytes, and fail. Extra 4 characters are reserved for _cdc
suffix for CDC-generated tables making the limit equal 218 chars.

Similar limit (222 characters) is proposed in the patch
fixing directory creation failures in Cassandra 4 and 5
(see github.com/apache/cassandra/pull/4038, SchemaConstants.java).

The patch also fixes tests that were broken by introducing this
patch.
swasik added a commit to swasik/scylladb that referenced this pull request Jun 2, 2025
Currently the name length of tables, materialized views, keyspaces
and indexes is limited to 48 characters. This was compatible with
Cassandra 3 but due to a bug the validation of this constraints was
removed in Cassandra 4 and 5 (see CASSANDRA-20389). Nowadays,
some of the usage scenarios depends on the lack of limitation in
Cassandra, such as some feature store workflows generating long
table names.

This path introduces the behavior that is compatible with
Cassandra 4 and 5. It limits names to 218 characters because
when the existing code creates a directory to store the new table,
this directory's name is created by sstables::init_table_storage(),
which takes the table's full name, a dash (-), and a 32-byte UUID
string. Because most Linux filesystems limit filename components
to 255 bytes, this means that any table name longer than 222 bytes
will attempt to mkdir() a directory name longer than the allowed
255 bytes, and fail. Extra 4 characters are reserved for _cdc
suffix for CDC-generated tables making the limit equal 218 chars.

Similar limit (222 characters) is proposed in the patch
fixing directory creation failures in Cassandra 4 and 5
(see github.com/apache/cassandra/pull/4038, SchemaConstants.java).

The patch also fixes tests that were broken by introducing this
patch.
swasik added a commit to swasik/scylladb that referenced this pull request Jun 2, 2025
Currently the name length of tables, materialized views, keyspaces
and indexes is limited to 48 characters. This was compatible with
Cassandra 3 but due to a bug the validation of this constraints was
removed in Cassandra 4 and 5 (see CASSANDRA-20389). Nowadays,
some of the usage scenarios depends on the lack of limitation in
Cassandra, such as some feature store workflows generating long
table names.

This path introduces the behavior that is compatible with
Cassandra 4 and 5. It limits names to 218 characters because
when the existing code creates a directory to store the new table,
this directory's name is created by sstables::init_table_storage(),
which takes the table's full name, a dash (-), and a 32-byte UUID
string. Because most Linux filesystems limit filename components
to 255 bytes, this means that any table name longer than 222 bytes
will attempt to mkdir() a directory name longer than the allowed
255 bytes, and fail. Extra 4 characters are reserved for _cdc
suffix for CDC-generated tables making the limit equal 218 chars.

Similar limit (222 characters) is proposed in the patch
fixing directory creation failures in Cassandra 4 and 5
(see github.com/apache/cassandra/pull/4038, SchemaConstants.java).

The patch also fixes tests that were broken by introducing this
patch.
@Maxwell-Guo
Copy link
Copy Markdown
Contributor

@beobal would you mind take a look ?

@k-rus k-rus force-pushed the cass-20389-long-table-name branch from b906eb3 to d9941e7 Compare August 18, 2025 08:15
@Maxwell-Guo
Copy link
Copy Markdown
Contributor

I am + 1 if all my comments are addressed and ci is green . But i think it is better if @beobal or @ekaterinadimitrova2 take another look.

@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Aug 18, 2025

I am + 1 if all my comments are addressed and ci is green

Thank you for the review and finding my implementation for the ticket suitable as all the comments are not related to the main contribution. I hope it will be possible to find a reasonable way to deliver those changes as I find important to deliver them.

But i think it is better if @beobal or @ekaterinadimitrova2 take another look.

As new reviewers might disagree with the current main implementation for the ticket I will wait with running an extensive CI.

@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Aug 18, 2025

I am + 1 if all my comments are addressed and ci is green

Thank you for the review and finding my implementation for the ticket suitable as all the comments are not related to the main contribution. I hope it will be possible to find a reasonable way to deliver those changes as I find important to deliver them.

I can think that adding VScode to ignore can be done in a separate ticket and PR, but it will be blocker for this PR until it's committed to trunk and release branches. Other changes do not make much sense outside this PR and will be difficult to deliver, so I still think that the separate commit is a good tradeoff.
@Maxwell-Guo what do you think?

@ekaterinadimitrova2
Copy link
Copy Markdown
Contributor

I am + 1 if all my comments are addressed and ci is green . But i think it is better if @beobal or @ekaterinadimitrova2 take another look.

I don't mind reviewing, but we already have three committers doing a review here. Plus @pkolaczk, who is also a long-time contributor. Is there anything specific you are worried about and you want to discuss?

Copy link
Copy Markdown
Contributor

@Maxwell-Guo Maxwell-Guo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on green ci

@k-rus k-rus force-pushed the cass-20389-long-table-name branch 2 times, most recently from 812c2e1 to 31c8233 Compare August 18, 2025 15:26
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Aug 21, 2025

I ported this patch to 5.0, see #4325. It failed a test on the error message for keyspace of null when called for create table, so I adjusted validateKeyspaceName to not change the exception message.
This situation cannot happen on trunk, still I ported the change back in 0f7b756f4b9c6d59e22cf78b3d9d8ba20b2e890a
However, it cannot be tested.

@pkolaczk @netudima @Maxwell-Guo Do you want to review the commit and re-review the PR?

@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Aug 21, 2025

I ported this patch to 5.0, see #4325. It failed a test on the error message for keyspace of null when called for create table, so I adjusted validateKeyspaceName to not change the exception message. This situation cannot happen on trunk, still I ported the change back in 0f7b756 However, it cannot be tested.

@pkolaczk @netudima @Maxwell-Guo Do you want to review the commit and re-review the PR?

Actually I found another bug in this implementation. ire doesn't throw, so I need to add throw.

@k-rus k-rus marked this pull request as draft August 21, 2025 09:24
@k-rus k-rus marked this pull request as ready for review August 21, 2025 13:51
@k-rus k-rus force-pushed the cass-20389-long-table-name branch from 7efcb08 to c9321b5 Compare August 22, 2025 13:28
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Aug 22, 2025

Now I fixed that the keyspace validation throws the same exceptions as before this PR. See c9321b57c4fe078561cf9bcbed15dd2612277c6d

Unfortunately I don't find a good way to add tests on the error message in such case, since it's about internal calls as normally keyspaces are not created in the same API call for creating tables.

Also some formatting doesn't look great. I have double checked that this is in accordance with generate-idea-files. Since I am not fun of manual formatting I use auto-format on my changes and thus it doesn't work to format manually.

The result of running CI doesn't show any related failures: http://ci-cassandra.infra.datastax.com/job/cassandra/25/testReport/

@pkolaczk @netudima @Maxwell-Guo do you want to review the new commit? c9321b57c4fe078561cf9bcbed15dd2612277c6d

(the commit will be squashed after I am sure that the work is accepted)

@k-rus k-rus force-pushed the cass-20389-long-table-name branch 2 times, most recently from a71489f to f53e6cf Compare August 27, 2025 09:41
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Aug 27, 2025

While porting this patch to 4.1 I noticed one more place where I missed to throw exception on keyspace validation error. As result I refactored this PR to throw always from inside the keyspace validation and not rely on passed exception throwers from callers. See this commit: f53e6cfc798eb038bdf71539e9a0b945b876baf5
CI build can be found at http://ci-cassandra.infra.datastax.com/job/cassandra/29/testReport/ (no related errors to my understanding)
Let me know if you have any questions about it.

I plan to port this commit to 5.0 and 4.1 work, and will squash commits here.

cc @pkolaczk @netudima @Maxwell-Guo

Copy link
Copy Markdown
Contributor

@pkolaczk pkolaczk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.
Yes, moving throw to inside validate is much cleaner and less fragile than before.

@k-rus k-rus force-pushed the cass-20389-long-table-name branch from f53e6cf to 872fcc6 Compare September 1, 2025 19:07
Includes:
- Simplify passing methods as lambdas.
- Use chars instead of single char strings.
- Simplify assertion method calls.
- Remove unnecessary method throws declarations.
- Add VScode local folder to gitignore.
- Fix typo in a comment
Fixes CASSANDRA-20389

The length of table names was not controlled. This is likely due to
confusion between validation methods with similar names. As result
creating tables with too long names led to the too long file name
exceptions during table creations.

This commit adds a validation of table name lengths to avoid the too
long file name errors. The validation length is based on how the table
name is used to create file/directory names, and needs to be exact to
prevent the too long file name exception, but allow all other table
names, which didn't lead to the too long file name exception. This
length limit is different from the existing name length limit of 48
characters used by common validation functions.

Thus, this commit moves out the length validation from the validation
methods into a separate length validation method, so the errors on
names are more specific. The non-length validation methods combined
into a single method, which checks for empty names and valid characters.

New constants are added for the length limits.

Table name related code are moved into methods in TableMetadata class,
so their semantics are more clear and to allow reuse, e.g., in
asserting the table name length constant.
Tests are added for the long table names and non-alphanumeric names.

Keyspace name validation function is now shared between two classes and
a unit test of it is added.

Patch by Ruslan Fomkin; reviewed by Piotr Kołaczkowski, Dmitry
Konstantinov, Maxwell Guo for CASSANDRA-20389
@k-rus k-rus force-pushed the cass-20389-long-table-name branch from 872fcc6 to 150b528 Compare September 5, 2025 08:37
@k-rus
Copy link
Copy Markdown
Contributor Author

k-rus commented Sep 11, 2025

merged

@k-rus k-rus closed this Sep 11, 2025
@k-rus k-rus deleted the cass-20389-long-table-name branch September 11, 2025 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants