Skip to content

[DO NOT REVIEW] Introduce a separate limit for max table name length#24295

Closed
swasik wants to merge 1 commit intoscylladb:masterfrom
swasik:gh-4480
Closed

[DO NOT REVIEW] Introduce a separate limit for max table name length#24295
swasik wants to merge 1 commit intoscylladb:masterfrom
swasik:gh-4480

Conversation

@swasik
Copy link
Copy Markdown
Contributor

@swasik swasik commented May 28, 2025

Currently the table name length is limited to 48 characters, same way as all the other names. This was compatible with Cassandra 3 but due to a bug the validation of this constraint was removed in Cassandra 4 and 5 (see CASSANDRA-20389). Nowadays, some of the usage scenarios depends on the lack of limitation in Cassandra, such as some feature store workflows generating long table names.

This path introduces the behavior that is compatible with Cassandra 4 and 5. It limits table name to 222 characters because when the existing code creates a directory to store the new table, this directory's name is created by sstables::init_table_storage(), which takes the table's full name, a dash (-), and a 32-byte UUID string. Because most Linux filesystems limit filename components to 255 bytes, this means that any table name longer than 222 bytes will attempt to mkdir() a directory name longer than the allowed 255 bytes, and fail.

Exactly the same limit (222 characters) is proposed in the patch fixing directory creation failures in Cassandra 4 and 5 (see apache/cassandra/pull/4038, SchemaConstants.java).

Closes #4480, #10984.

New issue scylladb/scylla-drivers#75 was created to track the updates needed in drivers.

Backport: 2025.2 as this is a compatibility bug fix that is needed to make ScyllaDB work with some of the AI workflows.

@swasik swasik requested review from nyh and tgrabiec as code owners May 28, 2025 09:06
@swasik swasik self-assigned this May 28, 2025
@swasik swasik requested review from dawmd and nyh and removed request for nyh and tgrabiec May 28, 2025 09:08
@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

✅ - Framework test
✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 boost/cql_query_test
✅ - dtest with consistent topology changes
✅ - dtest with gossip topology changes
✅ - dtest with tablets
✅ - Unit Tests

Build Details:

  • Duration: 8 hr 16 min
  • Builder: spider8.cloudius-systems.com

@nyh
Copy link
Copy Markdown
Contributor

nyh commented May 29, 2025

@swasik in scylladb/scylla-cluster-tests#10984 it was noted we have the same problem for materialized-view names: They tried a 55-character MV name, which works in Cassandra but was rejected in Scylla.

I don't know if your fix fixes MV names as well (it should if the same validation function is used for both), but we at least need a test for the MV case - which I believe we are missing (I'm surprised I missed that!).

If we wanted to be really thorough here (and to be honest, I think we should), we should also consider keyspace length, index names, and other places where we may or may not have that old "48" limitation.

Copy link
Copy Markdown
Contributor

@nyh nyh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks mostly good, but I think you should change NAME_LENGTH to 222 and not try to "split" it to two constants, because MV names and probably keyspace names have exactly the same problem - I just never got around to writing the tests for them (tell me if you need help to write those tests, and I will). See more comments below.

Comment thread schema/schema.hh Outdated
// a dash (-), and a 32-byte UUID string. Because most Linux filesystems limit filename
// components to 255 bytes, this means that any table name longer than 222 bytes will attempt
// to mkdir() a directory name longer than the allowed 255 bytes, and fail.
static constexpr int32_t TABLE_NAME_LENGTH = 222;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should have just changed NAME_LENGTH to 222 and not have two constants:

I don't think there is any reason for this NAME_LENGTH vs. TABLE_NAME_LENGTH separation. You explain that NAME_LENGTH is now for "The maximum length of any name (excluding tables)" but what does "any name" refers? It certainly not names of columns, for example - these have completely different lengths. So I guess you meant length of keyspaces, materialized views and tables - I believe these are the only thing that used schema::NAME_LENGTH. But.... I believe all of them should be increased to 222, not just table names! In particular, scylladb/scylla-cluster-tests#10984 complains about the length of views also being to limited.

So I think in your patch you should increase all of them to 222, and I suggest to please add missing tests for length limits of materialized views and keyspace names. I even have a TODO about this in test/cqlpy/test_name.py. Let me know if you need help writing those tests, I always love to write such tests ;-)

Comment thread test/cqlpy/test_name.py Outdated
# The second test tries a 100-character name - this one passes on Cassandra 4
# and 5, fails on Cassandra 3, and on Scylla reproduces issue #4480.
@pytest.mark.xfail(reason="#4480")
# and 5, fails on Cassandra 3.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't remove the comment that a certain test "reproduces issue #4480". Such comments are actually very important for understanding regressions or missing backports. You should remove the xfail line on the next line, but not the mention on the comment.

Comment thread test/cqlpy/test_name.py
Comment thread test/cqlpy/test_name.py
n = padded_name(223)
with passes_or_raises(InvalidRequest, match=n):
with new_named_table(cql, test_keyspace, n, schema) as table:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really important for us to enshrine the fact that 223 is not allowed? I did the same test below with 500, instead of 223, because really 500 is much more than we can ever allow. But is it important to be sure we never allow 223?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as with 223 we will fail the table creation because of exceeding 255 filename length.

Comment thread test/cqlpy/test_name.py Outdated
# succeed, or fail gracefully. We mark this test cassandra_bug because
# Cassandra 5 hangs on this test (CASSANDRA-20425 and CASSANDRA-20389).
def test_table_name_length_500(cql, test_keyspace, cassandra_bug):
def test_table_name_length_500(cql, test_keyspace):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test that this test now passes on Cassandra (with test/cqlpy/run-cassandra)?
I explained in the comment that Cassandra has a bug that causes it to hang on this test. Did they fix it?

If they fixed it, please update the comment to either remove the "We mark this test cassandra_bug..." part completely, or change it to be past tense. Thanks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nyh so I think that I do not understand exactly what is the meaning of cassandra_bug parameter - could you explain? I though it is some leftover as it is not used inside the test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "cassanda_bug" fixture skips the test jf running on Cassandra. It also serves as a visual cue that this is a test that we know that passes on Scylla but not on Cassandra, but it's not a mistake - rather we consider Scylla's behavior to be correct and Cassandra's to be wrong. A comment should explain why to be the case - if possible linking to an open Cassandra issue.

The "scylla_only" fixture does the same thing but has a slightly different shade of meaning: scylla_only means we are testing a feature we know exists only on Scylla but it's not a Cassandra bug, it's just a extra feature we decided to implement.

All other tests, which have neither of these two tags, are expected when you run them against Cassandra. The easiest way to do that it with the test/cqlpy/run-cassandra script. The purpose and usage of that script is explained in test/cqlpy/README.md.

@nyh
Copy link
Copy Markdown
Contributor

nyh commented May 29, 2025

Also please change the "Closes #4480" in the cover letter to "Fixes #4480" (Github treats both the same, but Avi says that our automation doesn't so we need "Fixes" here).

@swasik
Copy link
Copy Markdown
Contributor Author

swasik commented May 30, 2025

Thanks. Looks mostly good, but I think you should change NAME_LENGTH to 222 and not try to "split" it to two constants, because MV names and probably keyspace names have exactly the same problem - I just never got around to writing the tests for them (tell me if you need help to write those tests, and I will). See more comments below.

Yes keyspace, materialized views and indexes. I will add the tests - you put even a TODO note for this.

Copy link
Copy Markdown
Contributor

@dawmd dawmd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also test against a CDC table, I think, which adds a suffix (_cdc IIRC) to the base table's name, which might take us beyond the limit. If I understand the rationale in the cover letter, creating it should not be allowed, but maybe it's more subtle.

@swasik swasik force-pushed the gh-4480 branch 2 times, most recently from 96887a4 to 9985664 Compare June 2, 2025 11:03
@swasik swasik marked this pull request as draft June 2, 2025 12:32
@swasik swasik changed the title Introduce a separate limit for max table name length [DO NOT REVIEW] Introduce a separate limit for max table name length Jun 2, 2025
@scylladb-promoter
Copy link
Copy Markdown
Contributor

Currently the name length of tables, materialized views, keyspaces
and indexes is limited to 48 characters. This was compatible with
Cassandra 3 but due to a bug the validation of this constraints was
removed in Cassandra 4 and 5 (see CASSANDRA-20389). Nowadays,
some of the usage scenarios depends on the lack of limitation in
Cassandra, such as some feature store workflows generating long
table names.

This path introduces the behavior that is compatible with
Cassandra 4 and 5. It limits names to 218 characters because
when the existing code creates a directory to store the new table,
this directory's name is created by sstables::init_table_storage(),
which takes the table's full name, a dash (-), and a 32-byte UUID
string. Because most Linux filesystems limit filename components
to 255 bytes, this means that any table name longer than 222 bytes
will attempt to mkdir() a directory name longer than the allowed
255 bytes, and fail. Extra 4 characters are reserved for _cdc
suffix for CDC-generated tables making the limit equal 218 chars.

Similar limit (222 characters) is proposed in the patch
fixing directory creation failures in Cassandra 4 and 5
(see github.com/apache/cassandra/pull/4038, SchemaConstants.java).

The patch also fixes tests that were broken by introducing this
patch.
@nyh
Copy link
Copy Markdown
Contributor

nyh commented Jun 3, 2025

We should also test against a CDC table, I think, which adds a suffix (_cdc IIRC) to the base table's name, which might take us beyond the limit. If I understand the rationale in the cover letter, creating it should not be allowed, but maybe it's more subtle.

This is an interesting point. Looking at cdc/log.cc I see that cdc_log_suffix = "_scylla_cdc_log" so if we want to ensure that we can later create a CDC on the new table, we need to subtract a further 15 characters from that 12 limit.
Alternatively (and this probably isn't a great alternatively), we could decide that it's fine to create a table whose name is too long to add CDC too. But if we decide that, we'll need to make sure the addition of CDC fails in a good way, not with an IO-error shutdown.

We'll need to add tests for all these cases. I admit this is getting a bit more difficult than just the table-name check that we started with, but it's not very difficult (add code checks, and tests, for table, view, keyspace and cdc) and I think it may be worth our time to fix this issue correctly. In any case, the current situation that tables and views whose name is longer than 48 characters are rejected is already causing our users real problems.

Finally, there is also the issue of index name length. When you create an index, it has a default name (whose length depends on the name of the table, the column, and a word like "index") but you can also override that name explicitly. The view created for the index adds a further "_index" (if I remember correctly). I hope that the view-name length check will be applied to this view as well, but this will need to be tested as well.

@swasik let me know if you want my help writing all the tests and you will "just" need to make them pass.

Comment thread schema/schema.hh
static constexpr int32_t NAME_LENGTH =
255 - 32 // 32 bytes for UUID
- 1 // 1 byte for the dash (-) between the name and the UUID
- 4; // 4 bytes for the "_cdc" suffix in case of CDC-enabled tables
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you see this "_cdc" suffix?
In cdc/log.cc I see

static const sstring cdc_log_suffix = "_scylla_cdc_log";

@swasik
Copy link
Copy Markdown
Contributor Author

swasik commented Jun 3, 2025

This is an interesting point. Looking at cdc/log.cc I see that cdc_log_suffix = "_scylla_cdc_log" so if we want to ensure that we can later create a CDC on the new table, we need to subtract a further 15 characters from that 12 limit. Alternatively (and this probably isn't a great alternatively), we could decide that it's fine to create a table whose name is too long to add CDC too. But if we decide that, we'll need to make sure the addition of CDC fails in a good way, not with an IO-error shutdown.

After thinking about it IMO it would be worth reserving the space just for UUID + dash and disallowing creation of CDC streams for tables with the names too long to create a CDC name. This will make code much simpler and easier to maintain - in the future we may have another code that will add something to the table name and making sure that each time we properly handle it in create table: 1) will make code much more complex 2) can be impossible as it will be impossible to decrease the already introduced limit.

So checking it when enabling CDC and terminating gracefully in case of error sounds like a good idea.

We'll need to add tests for all these cases. I admit this is getting a bit more difficult than just the table-name check that we started with, but it's not very difficult (add code checks, and tests, for table, view, keyspace and cdc) and I think it may be worth our time to fix this issue correctly. In any case, the current situation that tables and views whose name is longer than 48 characters are rejected is already causing our users real problems.

I agree it is worth solving it the right way. And I agree it is not very difficult - just some checks, proper error handling and tests. The hardest part is the large amount of tests that have to be written.

@swasik let me know if you want my help writing all the tests and you will "just" need to make them pass.

Thanks. I will check how urgent it is and I will let you know if it has to be solved before my PTO. If yes, the help will be really appreciated, else I will finish it after I am back.

@ScyllaPiotr
Copy link
Copy Markdown
Contributor

Without going into the details,
a PR to Introduce a separate limit for max table name length
is not actually fixing
an issue titled Allow unlimited table name length.

@github-actions
Copy link
Copy Markdown

@swasik, this PR has merge conflicts with the base branch. Please resolve the conflicts so we can merge it.

@swasik
Copy link
Copy Markdown
Contributor Author

swasik commented Jun 23, 2025

Closing as it was in the end solved by @knowack1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow longer table name length

6 participants