Skip to content

Conversation

@erezrokah
Copy link
Member

@erezrokah erezrokah commented Sep 1, 2023

Summary

Fixes #13486. The max length of an index is either 3072 bytes or 767 bytes.
Since we use utf8mb4 which can use up to 4 bytes per char, we max out at either 768 or 191 chars (the number in bytes divided by 4).
The current code allocates 191 chars per PK which is OK if a table has 4 PKs since 191*4= 764, but fails if a table has more than that. See list of tables here

The solution is to get the max number of bytes based on the row format, and evenly distribute the max length between PKs.
This is still not the best solution as we assign the same number of bytes to a UUID and a text column (for example CQId is UUID), but I don't think it's very common to have one UUID PK and another text PK and enough other PKs so the prefix is too short.

@erezrokah erezrokah requested review from a team, bbernays and mnorbury and removed request for a team and mnorbury September 1, 2023 09:23
@cq-bot cq-bot added the mysql label Sep 1, 2023
COLUMN_TYPE,
IS_NULLABLE,
constraint_type,
sub_part
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't understand why we need sub_part. Please comment if you know why

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw that comment and still don't understand the reason. It is because we want the migration to fail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume so.
I think @bbernays is the one who can answer this question precisely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, without it migrations wouldn't catch (and then update) the subpart for existing DBs. Not sure if this was the best tradeoff (as I agree that it inhibits the DB->DB) use case...

But by shortening the subpart, we are risking losing data in cases when the beginning of PKs are the same...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those won't really help source DB -> MySQL users

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will help all but mysql-mysql right? Which in that case we would not want to alter the length of PKs at all

Copy link
Member Author

@erezrokah erezrokah Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could force users to use pk_mode: cq-id-only and deterministic_cq_id, but I'm not sure that's the best default.
For example in Azure we hardly ever has more than 1 PK (it's always the ID), same with GCP and Name and AWS and arn, and so on.

Most tables do have a single field that acts as a PK, and in that case we would use the maximum possible length for the key with this PR. Without this fix we always use 191 which is not the maximum available value.

And it not just about mysql-mysql, it's also Postgres->MySQL and OracleDB->MySql, and maybe Firestore->MySQL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, with this fix we won't be shortening the prefix for tables, only making it longer. If the tables exists we would skip it (we don't detect the prefix change) and if it doesn't exist we create it.
It can't exist with a longer prefix, since that's exactly the thing that was preventing it from getting created.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we could detect the prefix change, but I believe that would be quite error prone

primaryKey = strings.Contains(*constraintType, "PRIMARY KEY")
}
// subpart only non nil for pks on blob/text columns
if subpart != nil && *subpart != maxPrefixLength {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we disable the PK for columns that has a subpart but it doesn't match our own number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could create issues in a source DB -> dest MySQL sync, for example MySQL -> MySQL

@erezrokah erezrokah force-pushed the fix/better_pk_prefix branch from 02ad33a to 2080191 Compare September 1, 2023 09:32
@erezrokah erezrokah force-pushed the fix/better_pk_prefix branch from 2080191 to 4e63367 Compare September 1, 2023 09:34
if len(primaryKeysIndices) > 0 {
builder.WriteString(",\n ")
builder.WriteString(" PRIMARY KEY (")
lengthPerPk := c.maxIndexLength / len(primaryKeysIndices)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things:

  1. This makes the assumption that all keys in the composite PK have the same length... So in cases where the PK is something like req_region, req_account_id and arn the first 2 columns will be vastly over provisioned as we know an account_id is always 12 characters.
  2. Don't we need to factor in the type of each column in the PK? If one column is an int, doesn't that take up a lot less space than an text/blob that requires the subpart?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, but we don't have any way of knowing that in advance unless the source adds a column length in bytes to each column. Please note that currently we are still provisioning the same length for each key which is 191, so this PR doesn't make it worse, only better.

  2. Yes it's not 100% yet, I commented on it in the PR description, see fix!: Split max index length between PKs #13569 (comment). Please note this probably only impacts tables with many PKs, so this optimization might not be needed yet.

@erezrokah
Copy link
Member Author

Going to merge and release this next week

@erezrokah erezrokah added the automerge Automatically merge once required checks pass label Sep 4, 2023
@kodiakhq kodiakhq bot merged commit b0ad9d4 into cloudquery:main Sep 4, 2023
kodiakhq bot pushed a commit that referenced this pull request Sep 4, 2023
🤖 I have created a release *beep* *boop*
---


## [4.0.0](plugins-destination-mysql-v3.0.5...plugins-destination-mysql-v4.0.0) (2023-09-04)


### ⚠ BREAKING CHANGES

* Split max index length between PKs ([#13569](#13569))

### Bug Fixes

* **deps:** Update github.com/99designs/go-keychain digest to 9cf53c8 ([#13561](#13561)) ([a170256](a170256))
* **deps:** Update github.com/cloudquery/arrow/go/v14 digest to cd3d411 ([#13598](#13598)) ([f22bfa6](f22bfa6))
* Split max index length between PKs ([#13569](#13569)) ([b0ad9d4](b0ad9d4))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge Automatically merge once required checks pass

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Failed to Write in MySQL Error 1071 (42000)

4 participants