-
Notifications
You must be signed in to change notification settings - Fork 547
feat(mysql): Update to SDK v2 and Arrow #10169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(mysql): Update to SDK v2 and Arrow #10169
Conversation
def106b to
e6342a1
Compare
yevgenypats
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the list/nested handling is incorrect here. Take a look here - https://github.com/cloudquery/cloudquery/blob/main/plugins/destination/postgresql/client/transformer.go#L122
OK, so I took a look and tried implementing the same approach to find out MySQL does not support array types. Does that make sense? I might be missing something with how we should handle arrays in Arrow, so happy to sync on it |
yevgenypats
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't aware mysql doesn't have nested types. Updated my review accordingly.
| } else { | ||
| recordBuilder.Field(i).(*array.TimestampBuilder).Append(arrow.Timestamp((*asTime).UnixMicro())) | ||
| } | ||
| case *types.UUIDType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed I believe as the default case should take care of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using the default only works if we use the string representation for UUID (text and not binary(16))
| case *arrow.BinaryType, *arrow.LargeBinaryType: | ||
| return "blob" | ||
| case *types.UUIDType: | ||
| return "binary(16)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use string I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few options for UUID:
binary(16)is the most efficientVARCHAR(36)orVARCHAR(255)which will show nicely in queriestextwhich is less efficient
The thing is that when we query the DB during migration we can't 100% say that VARCHAR(36) or VARCHAR(255) is UUID unless we try to parse the data.
Assuming binary(16) is UUID is safer (but also not 100%). So the choice for binary(16) was kind of a tradeoff to avoid using text.
We have a similar issue with bool which is an alias for tinyint(1), so we assume tinyint(1) is boolean
| case *arrow.BooleanType: | ||
| return "bool" | ||
| case *arrow.Int8Type, *arrow.Uint8Type, *arrow.Int16Type, *arrow.Uint16Type, *arrow.Int32Type, *arrow.Uint32Type, *arrow.Int64Type, *arrow.Uint64Type: | ||
| return "bigint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why all the same? It has tinyint,smallint,mediumint,int,bigint this would be very useful for db->db replication when the type is important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, mostly since this is what we had before in the old type system. I'll update it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split those and the floats in 7eef56a also opened cloudquery/plugin-sdk#805
| return "json" | ||
| case *arrow.StructType: | ||
| return "json" | ||
| case *types.InetType, *types.MacType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this one we can also use the default case so no special handling will be needed in reverse transform as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f9a1c30
| return "json" | ||
| case *types.InetType, *types.MacType: | ||
| return "nvarchar(255)" | ||
| default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add another case when the type is a List type then we can also use json and then use marshaljson in the transformer for those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point JSON for list types is a great solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f9a1c30 (seems like ValueStr already does JSON marshalling)
0d951c9 to
7eef56a
Compare
yevgenypats
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. had few non blocking questions.
| "json": schema.TypeJSON, | ||
| sqlTypeToSchemaType := map[string]arrow.DataType{ | ||
| "bool": arrow.FixedWidthTypes.Boolean, | ||
| "tinyint(1)": arrow.FixedWidthTypes.Boolean, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this one is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more a defensive measure I think in case someone is doing MySQL -> MySQL sync and defined bool as tinyint(1) (similar to https://github.com/cloudquery/cloudquery/pull/9143/files). I think we can drop it for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so the correct reason was that we return bool a bit below so when we normalized we need to support bool too.
ef4fdd1 should fix it.
See normalization code here https://github.com/cloudquery/cloudquery/pull/10169/files#diff-7a94fa4d0f7e4c8c80b1ad959ace684a43b44408b3d1d9cee3ea9ef2d87dd9a2R29
| "smallint": arrow.PrimitiveTypes.Int16, | ||
| "int": arrow.PrimitiveTypes.Int32, | ||
| "bigint": arrow.PrimitiveTypes.Int64, | ||
| "bigint(20)": arrow.PrimitiveTypes.Int64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this one is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in ef4fdd1
|
I updated the description with all the breaking changes. I think the |
Summary
Fixes #9930.
Still need to test a bit and I probably missed a bunch of stuff. Keeping as draft for now
BEGIN_COMMIT_OVERRIDE
feat: Update to plugin SDK v2
feat: Update to use Apache Arrow type system
BREAKING-CHANGE: This release introduces an internal change to our type system to use Apache Arrow. This should not have any visible breaking changes, however due to the size of the change we are introducing it under a major version bump to communicate that it might have some bugs that we weren't able to catch during our internal tests. If you encounter an issue during the upgrade, please submit a bug report.
feat: Timestamp fields type changed from
datetimetodatetime(6)to avoid losing fractional dataBREAKING-CHANGE: Timestamp fields type changed from
datetimetodatetime(6)to avoid losing fractional datafeat: Mac Address and Inet column type changed from
nvarchar(255)totextas we cannot assumenvarchar(255)columns represent valid Mac Address or InetBREAKING-CHANGE: Mac Address and Inet column type changed from
nvarchar(255)totextas we cannot assumenvarchar(255)columns represent valid Mac Address or Inetfeat: Float fields type changed from
floattodoublefor increased precisionBREAKING-CHANGE: Float fields type changed from
floattodoublefor increased precisionEND_COMMIT_OVERRIDE