Skip to content

Conversation

@erezrokah
Copy link
Member

@erezrokah erezrokah commented Apr 18, 2023

Summary

Fixes #9930.
Still need to test a bit and I probably missed a bunch of stuff. Keeping as draft for now

BEGIN_COMMIT_OVERRIDE

feat: Update to plugin SDK v2

feat: Update to use Apache Arrow type system
BREAKING-CHANGE: This release introduces an internal change to our type system to use Apache Arrow. This should not have any visible breaking changes, however due to the size of the change we are introducing it under a major version bump to communicate that it might have some bugs that we weren't able to catch during our internal tests. If you encounter an issue during the upgrade, please submit a bug report.

feat: Timestamp fields type changed from datetime to datetime(6) to avoid losing fractional data
BREAKING-CHANGE: Timestamp fields type changed from datetime to datetime(6) to avoid losing fractional data

feat: Mac Address and Inet column type changed from nvarchar(255) to text as we cannot assume nvarchar(255) columns represent valid Mac Address or Inet
BREAKING-CHANGE: Mac Address and Inet column type changed from nvarchar(255) to text as we cannot assume nvarchar(255) columns represent valid Mac Address or Inet

feat: Float fields type changed from float to double for increased precision
BREAKING-CHANGE: Float fields type changed from float to double for increased precision

END_COMMIT_OVERRIDE

@cq-bot cq-bot added the mysql label Apr 18, 2023
@erezrokah erezrokah force-pushed the feat/migration_mysql_sdk_v2 branch 2 times, most recently from def106b to e6342a1 Compare April 19, 2023 08:59
@erezrokah erezrokah marked this pull request as ready for review April 19, 2023 09:02
@erezrokah erezrokah requested review from a team and hermanschaaf and removed request for a team April 19, 2023 09:02
Copy link
Contributor

@yevgenypats yevgenypats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erezrokah
Copy link
Member Author

I think the list/nested handling is incorrect here. Take a look here - https://github.com/cloudquery/cloudquery/blob/main/plugins/destination/postgresql/client/transformer.go#L122

OK, so I took a look and tried implementing the same approach to find out MySQL does not support array types.
In that case we could use the string value or add some special serialization to handle them (e.g. string with some delimiter we later on split by). SQLite does not handle list type either.

Does that make sense? I might be missing something with how we should handle arrays in Arrow, so happy to sync on it

Copy link
Contributor

@yevgenypats yevgenypats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't aware mysql doesn't have nested types. Updated my review accordingly.

} else {
recordBuilder.Field(i).(*array.TimestampBuilder).Append(arrow.Timestamp((*asTime).UnixMicro()))
}
case *types.UUIDType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed I believe as the default case should take care of that.

Copy link
Member Author

@erezrokah erezrokah Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using the default only works if we use the string representation for UUID (text and not binary(16))

case *arrow.BinaryType, *arrow.LargeBinaryType:
return "blob"
case *types.UUIDType:
return "binary(16)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use string I think.

Copy link
Member Author

@erezrokah erezrokah Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few options for UUID:

  • binary(16) is the most efficient
  • VARCHAR(36) or VARCHAR(255) which will show nicely in queries
  • text which is less efficient

The thing is that when we query the DB during migration we can't 100% say that VARCHAR(36) or VARCHAR(255) is UUID unless we try to parse the data.

Assuming binary(16) is UUID is safer (but also not 100%). So the choice for binary(16) was kind of a tradeoff to avoid using text.

We have a similar issue with bool which is an alias for tinyint(1), so we assume tinyint(1) is boolean

case *arrow.BooleanType:
return "bool"
case *arrow.Int8Type, *arrow.Uint8Type, *arrow.Int16Type, *arrow.Uint16Type, *arrow.Int32Type, *arrow.Uint32Type, *arrow.Int64Type, *arrow.Uint64Type:
return "bigint"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why all the same? It has tinyint,smallint,mediumint,int,bigint this would be very useful for db->db replication when the type is important.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, mostly since this is what we had before in the old type system. I'll update it

Copy link
Member Author

@erezrokah erezrokah Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split those and the floats in 7eef56a also opened cloudquery/plugin-sdk#805

return "json"
case *arrow.StructType:
return "json"
case *types.InetType, *types.MacType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this one we can also use the default case so no special handling will be needed in reverse transform as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f9a1c30

return "json"
case *types.InetType, *types.MacType:
return "nvarchar(255)"
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add another case when the type is a List type then we can also use json and then use marshaljson in the transformer for those

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point JSON for list types is a great solution

Copy link
Member Author

@erezrokah erezrokah Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f9a1c30 (seems like ValueStr already does JSON marshalling)

@erezrokah erezrokah force-pushed the feat/migration_mysql_sdk_v2 branch from 0d951c9 to 7eef56a Compare April 19, 2023 13:03
@yevgenypats yevgenypats changed the title feat: Update to SDK v2 and Arrow feat(mysql): Update to SDK v2 and Arrow Apr 19, 2023
Copy link
Contributor

@yevgenypats yevgenypats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. had few non blocking questions.

"json": schema.TypeJSON,
sqlTypeToSchemaType := map[string]arrow.DataType{
"bool": arrow.FixedWidthTypes.Boolean,
"tinyint(1)": arrow.FixedWidthTypes.Boolean,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this one is needed?

Copy link
Member Author

@erezrokah erezrokah Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more a defensive measure I think in case someone is doing MySQL -> MySQL sync and defined bool as tinyint(1) (similar to https://github.com/cloudquery/cloudquery/pull/9143/files). I think we can drop it for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so the correct reason was that we return bool a bit below so when we normalized we need to support bool too.
ef4fdd1 should fix it.

See normalization code here https://github.com/cloudquery/cloudquery/pull/10169/files#diff-7a94fa4d0f7e4c8c80b1ad959ace684a43b44408b3d1d9cee3ea9ef2d87dd9a2R29

"smallint": arrow.PrimitiveTypes.Int16,
"int": arrow.PrimitiveTypes.Int32,
"bigint": arrow.PrimitiveTypes.Int64,
"bigint(20)": arrow.PrimitiveTypes.Int64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this one is needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ef4fdd1

@erezrokah erezrokah added automerge Automatically merge once required checks pass and removed automerge Automatically merge once required checks pass labels Apr 19, 2023
@erezrokah
Copy link
Member Author

I updated the description with all the breaking changes. I think the int changes should not be visible as we still map schema.TypeInt to arrow.PrimitiveTypes.Int64 so the mapping will be the same.

@erezrokah erezrokah added the automerge Automatically merge once required checks pass label Apr 19, 2023
@kodiakhq kodiakhq bot merged commit c5d3508 into cloudquery:main Apr 19, 2023
@erezrokah erezrokah deleted the feat/migration_mysql_sdk_v2 branch April 19, 2023 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge Automatically merge once required checks pass

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Migrate plugin/destination/mysql to github.com/cloudquery/plugin-sdk/v2

3 participants