-
Notifications
You must be signed in to change notification settings - Fork 547
feat(duckdb): Update to SDK V2 (Arrow) #10275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case "int", "int4", "integer": | ||
| return arrow.PrimitiveTypes.Uint32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right; it probably shouldn't be unsigned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also the ones above it)
| case "blob", "bytea", "binary", "varbinary": | ||
| return arrow.BinaryTypes.Binary | ||
| case "date": | ||
| return arrow.FixedWidthTypes.Date64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return arrow.FixedWidthTypes.Date64 | |
| return arrow.FixedWidthTypes.Date32 |
We need to be careful with Date64 - it stores the date in millisecond precision, for reasons that I don't understand. I think what we need to use here is Date32, which stores the number of days since unix epoch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird. as far as I understand this is just the width of the array....I looked again but I don't see that in arrow...can you point me to that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I'm basing it on: https://arrow.apache.org/docs/python/generated/pyarrow.date64.html
And e.g. here, when converting to time.Time, it throws away all the millisecond precision being stored anyway and just returns the date (year/month/day) https://github.com/apache/arrow/blob/72191880d3a66e6ecf65d9c249dcb2856a697f47/go/arrow/datatype_fixedwidth.go#L99-L102
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems it's a python thing but yeah it should matter anyway. ok will change it if you were bitten by that already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had a hard-to-debug error at some point where two dates like 2021-01-01 and 2021-01-01 were coming back as different, and it turned out to be a difference in milliseconds that isn't printed by the date64 string output.
| if err != nil { | ||
| return fmt.Errorf("failed to execute upsert: %w", err) | ||
| } | ||
| // func (c *Client) upsert(table *schema.Table, data []any) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commented out section can probably be removed now
| } else { | ||
| ind := strings.Index(strTimestamp, ".") | ||
| if len(strTimestamp[ind+1:]) < 6 { | ||
| tmp[w.schema.Field(j).Name] = strTimestamp + "000000"[len(strTimestamp[ind+1:]):] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I actually didn't know you can do "000000"[i:i+1] in Go without assigning a variable first 😄
An alternative would be to use strings.Repeat("0", 6 - len(strTimestamp[ind+1:])
hermanschaaf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add destination.WithManagedWriter()
erezrokah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments (non blocking) and will run locally too
| } | ||
| func (c *Client) Read(ctx context.Context, sc *arrow.Schema, sourceName string, res chan<- arrow.Record) error { | ||
| tableName := schema.TableName(sc) | ||
| f, err := os.CreateTemp("", fmt.Sprintf("%s-*.json", tableName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this temporary file when we're done with it?
| return "text[]" | ||
| case schema.TypeIntArray: | ||
| return "int[]" | ||
| case *arrow.Date64Type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| case *arrow.Date64Type: | |
| case *arrow.Date32Type: |
If we do https://github.com/cloudquery/cloudquery/pull/10275/files#r1176178054 cc @hermanschaaf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to find that it stores it in milliseconds in the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? That's how the type is defined here https://arrow.apache.org/docs/python/generated/pyarrow.date64.html#pyarrow.date64 "milliseconds since UNIX epoch 1970-01-01"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a python thing but anyway changed it as prob there is no reason for date64 anyway as date32 should be enough
| return arrow.PrimitiveTypes.Uint8 | ||
| case "smallint": | ||
| return arrow.PrimitiveTypes.Uint16 | ||
| case "int", "int4", "integer": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need int, int4? Those are aliases for integer. And if we do need them, should we also added signed to this case, and all aliases to other cases as specified in https://duckdb.org/docs/sql/data_types/overview.html?
For example tinyint has an int1 alias
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the alias. Yeah it's better if handle the aliases as well (no reason not to handle them I think)
Co-authored-by: Kemal <[email protected]>
Missed from my CR at #10275
Closes #10103
BEGIN_COMMIT_OVERRIDE
feat: Update to use Apache Arrow type system
BREAKING-CHANGE: This release introduces an internal change to our type system to use Apache Arrow. This should not have any visible breaking changes, however due to the size of the change we are introducing it under a major version bump to communicate that it might have some bugs that we weren't able to catch during our internal tests. If you encounter an issue during the upgrade, please submit a bug report.
END_COMMIT_OVERRIDE