Skip to content

Conversation

@yevgenypats
Copy link
Contributor

@yevgenypats yevgenypats commented Apr 24, 2023

Closes #10103

BEGIN_COMMIT_OVERRIDE
feat: Update to use Apache Arrow type system

BREAKING-CHANGE: This release introduces an internal change to our type system to use Apache Arrow. This should not have any visible breaking changes, however due to the size of the change we are introducing it under a major version bump to communicate that it might have some bugs that we weren't able to catch during our internal tests. If you encounter an issue during the upgrade, please submit a bug report.

END_COMMIT_OVERRIDE

@cq-bot cq-bot added the duckdb label Apr 24, 2023
@disq disq linked an issue Apr 24, 2023 that may be closed by this pull request
1 task
@yevgenypats yevgenypats marked this pull request as ready for review April 24, 2023 20:40
@yevgenypats yevgenypats requested review from a team and candiduslynx and removed request for a team April 24, 2023 20:41
@yevgenypats yevgenypats requested review from a team and hermanschaaf and removed request for candiduslynx April 24, 2023 20:41
Comment on lines 64 to 65
case "int", "int4", "integer":
return arrow.PrimitiveTypes.Uint32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right; it probably shouldn't be unsigned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also the ones above it)

case "blob", "bytea", "binary", "varbinary":
return arrow.BinaryTypes.Binary
case "date":
return arrow.FixedWidthTypes.Date64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return arrow.FixedWidthTypes.Date64
return arrow.FixedWidthTypes.Date32

We need to be careful with Date64 - it stores the date in millisecond precision, for reasons that I don't understand. I think what we need to use here is Date32, which stores the number of days since unix epoch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's weird. as far as I understand this is just the width of the array....I looked again but I don't see that in arrow...can you point me to that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems it's a python thing but yeah it should matter anyway. ok will change it if you were bitten by that already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had a hard-to-debug error at some point where two dates like 2021-01-01 and 2021-01-01 were coming back as different, and it turned out to be a difference in milliseconds that isn't printed by the date64 string output.

if err != nil {
return fmt.Errorf("failed to execute upsert: %w", err)
}
// func (c *Client) upsert(table *schema.Table, data []any) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented out section can probably be removed now

} else {
ind := strings.Index(strTimestamp, ".")
if len(strTimestamp[ind+1:]) < 6 {
tmp[w.schema.Field(j).Name] = strTimestamp + "000000"[len(strTimestamp[ind+1:]):]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I actually didn't know you can do "000000"[i:i+1] in Go without assigning a variable first 😄

An alternative would be to use strings.Repeat("0", 6 - len(strTimestamp[ind+1:])

Copy link
Member

@hermanschaaf hermanschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add destination.WithManagedWriter()

Copy link
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments (non blocking) and will run locally too

}
func (c *Client) Read(ctx context.Context, sc *arrow.Schema, sourceName string, res chan<- arrow.Record) error {
tableName := schema.TableName(sc)
f, err := os.CreateTemp("", fmt.Sprintf("%s-*.json", tableName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove this temporary file when we're done with it?

return "text[]"
case schema.TypeIntArray:
return "int[]"
case *arrow.Date64Type:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to find that it stores it in milliseconds in the code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? That's how the type is defined here https://arrow.apache.org/docs/python/generated/pyarrow.date64.html#pyarrow.date64 "milliseconds since UNIX epoch 1970-01-01"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a python thing but anyway changed it as prob there is no reason for date64 anyway as date32 should be enough

return arrow.PrimitiveTypes.Uint8
case "smallint":
return arrow.PrimitiveTypes.Uint16
case "int", "int4", "integer":
Copy link
Member

@erezrokah erezrokah Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need int, int4? Those are aliases for integer. And if we do need them, should we also added signed to this case, and all aliases to other cases as specified in https://duckdb.org/docs/sql/data_types/overview.html?

For example tinyint has an int1 alias

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the alias. Yeah it's better if handle the aliases as well (no reason not to handle them I think)

@yevgenypats yevgenypats added the automerge Automatically merge once required checks pass label Apr 25, 2023
@kodiakhq kodiakhq bot merged commit e38eae6 into main Apr 25, 2023
@kodiakhq kodiakhq bot deleted the feat/duckdb_arrow branch April 25, 2023 20:28
kodiakhq bot pushed a commit that referenced this pull request Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge Automatically merge once required checks pass

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Migrate plugin/destination/duckdb to github.com/cloudquery/plugin-sdk/v2

6 participants