
Conversation

@brucehsu (Contributor) commented May 17, 2023

Summary

This PR addresses #10749.

  • Migrate dependencies to plugin-sdk/v3
  • Deprecate ColumnCreationOptions in favour of inline fields
  • Migrate to Arrow types (TypeInt->Int64, TypeTimestamp->Timestamp_us)
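As a rough illustration of the type migration in the last bullet (a hypothetical, stdlib-only sketch: the real change swaps `schema.Type*` constants for Apache Arrow data types in Go code, not string names), the correspondences mentioned in this thread are:

```go
package main

import "fmt"

// cqTypeToArrow records the CQType -> Arrow type correspondences mentioned in
// this PR and thread. Illustrative only: the actual migration replaces
// schema.Type* constants with Arrow data types in the plugin code.
var cqTypeToArrow = map[string]string{
	"TypeInt":       "Int64",        // e.g. Arrow's 64-bit integer type
	"TypeTimestamp": "Timestamp_us", // microsecond-precision Arrow timestamp
	"TypeString":    "String",
	"TypeJSON":      "JSON", // a CloudQuery Arrow extension type
}

func main() {
	fmt.Printf("TypeInt -> %s\n", cqTypeToArrow["TypeInt"]) // prints "TypeInt -> Int64"
}
```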

BEGIN_COMMIT_OVERRIDE
feat: Update to use Apache Arrow type system (#10831)

BREAKING-CHANGE: This release introduces an internal change to our type system to use Apache Arrow. This should not cause any visible breaking changes; however, due to the size of the change, we are introducing it under a major version bump to communicate that it might have some bugs we weren't able to catch in our internal tests. If you encounter an issue during the upgrade, please submit a bug report. You will also need to update your destination plugin, depending on which one you use:

  • Azure Blob Storage >= v3.2.0
  • BigQuery >= v3.0.0
  • ClickHouse >= v3.1.1
  • DuckDB >= v1.1.6
  • Elasticsearch >= v2.0.0
  • File >= v3.2.0
  • Firehose >= v2.0.2
  • GCS >= v3.2.0
  • Gremlin >= v2.1.10
  • Kafka >= v3.0.1
  • Meilisearch >= v2.0.1
  • Microsoft SQL Server >= v4.2.0
  • MongoDB >= v2.0.1
  • MySQL >= v2.0.2
  • Neo4j >= v3.0.0
  • PostgreSQL >= v4.2.0
  • S3 >= v4.4.0
  • Snowflake >= v2.1.1
  • SQLite >= v2.2.0

END_COMMIT_OVERRIDE

@brucehsu brucehsu requested review from a team and amanenk and removed request for a team May 17, 2023 22:03

@candiduslynx (Contributor) commented:

@brucehsu the breaking change in sources is expected when performing the upgrade to plugin-sdk/v3.
I'll mark the PR as breaking accordingly.

@candiduslynx candiduslynx changed the title feat(hackernews): Migrate to SDKv3 feat(hackernews)!: Migrate to SDKv3 May 18, 2023
@candiduslynx candiduslynx requested review from hermanschaaf and yevgenypats and removed request for amanenk May 18, 2023 07:20
@hermanschaaf (Member) commented:

Nice one, thank you @brucehsu!

We will have to wait for all destinations to be migrated to v3 (and released) before merging this, because sources using SDK v3 will require a destination that also supports SDK v3.

The other thing is that I think we should try to update the "Breaking Change" detection script that commented above to accept Arrow type names as aliases of the old names, so the report looks a bit less scary and we can verify that there are no real breaking changes. Not saying you need to do this for this PR; I can take a look at it soon :)

@brucehsu (Contributor, author) commented:

@hermanschaaf Gotcha! I was about to ask how we cope with the change in types when, say, writing to an existing PostgreSQL database, since that would probably also change the types of columns.

@hermanschaaf (Member) commented:

> I was about to ask how we cope with the change in types when, say, writing to an existing PostgreSQL database, since that would probably also change the types of columns.

@brucehsu I can probably shed some light on this! Previously we had our own type system, which we refer to as CQTypes. As you know, we're now migrating to Arrow types. We've created Arrow extension types for some of the unique CQTypes, so there is a fairly straightforward mapping between CQTypes and Arrow types: for example, Int -> Int64, String -> String, JSON -> JSON, etc.

We're first migrating all the destination plugins to support plugin-sdk V3 right now. It's mostly easy at this point, as most of them already support Arrow from the v2 migration. Now when a destination plugin receives an Arrow type of Int64, it will translate this to the same column type that it was using before when it received an Int CQType. Destinations have already been doing this since V2. Because it's translating to the same column type as before, it doesn't break the schema for users.

The next step will be to have all the sources use and send Arrow types directly. We'll start out by translating CQTypes to their direct Arrow equivalents in the source (as you've done here by upgrading to v3 😄 ). Then in the next phase, we will start allowing more Arrow types to be used, which may result in new, or more specific, column types in the destinations. For example, we could convert nested Go structs to their equivalent Arrow struct types, and then send that over the wire instead of a JSON column. The destination can then decide how to handle this: BigQuery might write it as a RECORD column, while postgres might still write it as a JSON column. We will also introduce a setting that allows users to maintain the previous behavior, so that this again does not need to be breaking.
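The destination-side translation described above can be sketched as follows (a hypothetical, stdlib-only illustration; the column type names are assumptions for a PostgreSQL-like destination, and the real plugins switch on Arrow data types rather than strings):

```go
package main

import "fmt"

// arrowToColumnType sketches how a destination plugin might map incoming
// Arrow types to its own column types, reusing the column types it already
// produced for the old CQTypes so existing schemas are not broken.
// Hypothetical PostgreSQL-flavoured mapping, for illustration only.
func arrowToColumnType(arrowType string) string {
	switch arrowType {
	case "Int64":
		return "bigint" // same column type previously used for the Int CQType
	case "Timestamp_us":
		return "timestamp without time zone"
	case "JSON":
		return "jsonb" // nested data stays JSON unless richer types are enabled later
	default:
		return "text" // fallback for types not covered by this sketch
	}
}

func main() {
	fmt.Println(arrowToColumnType("Int64")) // prints "bigint"
}
```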

@cloudquery cloudquery deleted a comment from github-actions bot May 26, 2023
brucehsu and others added 3 commits May 26, 2023 16:06
- Migrate dependencies to plugin-sdk/v3
- Deprecate ColumnCreationOptions in favour of inline fields
- Migrate to Arrow types (TypeInt->Int64, TypeTimestamp->Timestamp_us)
@candiduslynx candiduslynx force-pushed the feat/10749_hackernews_sdk_v3_migration branch from 73a17ea to c192474 Compare May 26, 2023 13:06
@hermanschaaf hermanschaaf added the automerge Automatically merge once required checks pass label May 26, 2023
@kodiakhq kodiakhq bot merged commit ee3465f into cloudquery:main May 26, 2023
kodiakhq bot pushed a commit that referenced this pull request May 30, 2023
🤖 I have created a release *beep* *boop*
---


## [2.0.0](plugins-source-hackernews-v1.3.1...plugins-source-hackernews-v2.0.0) (2023-05-30)


### ⚠ BREAKING CHANGES

* This release introduces an internal change to our type system to use [Apache Arrow](https://arrow.apache.org/). This should not cause any visible breaking changes; however, due to the size of the change, we are introducing it under a major version bump to communicate that it might have some bugs we weren't able to catch in our internal tests. If you encounter an issue during the upgrade, please submit a [bug report](https://github.com/cloudquery/cloudquery/issues/new/choose). You will also need to update your destination plugin, depending on which one you use:
    - Azure Blob Storage >= v3.2.0
    - BigQuery >= v3.0.0
    - ClickHouse >= v3.1.1
    - DuckDB >= v1.1.6
    - Elasticsearch >= v2.0.0
    - File >= v3.2.0
    - Firehose >= v2.0.2
    - GCS >= v3.2.0
    - Gremlin >= v2.1.10
    - Kafka >= v3.0.1
    - Meilisearch >= v2.0.1
    - Microsoft SQL Server >= v4.2.0
    - MongoDB >= v2.0.1
    - MySQL >= v2.0.2
    - Neo4j >= v3.0.0
    - PostgreSQL >= v4.2.0
    - S3 >= v4.4.0
    - Snowflake >= v2.1.1
    - SQLite >= v2.2.0

### Features

* **deps:** Upgrade to Apache Arrow v13 (latest `cqmain`) ([#10605](#10605)) ([a55da3d](a55da3d))
* Update to use [Apache Arrow](https://arrow.apache.org/) type system ([#10831](#10831)) ([ee3465f](ee3465f))


### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-pb-go to v1.0.8 ([#10798](#10798)) ([27ff430](27ff430))
* **deps:** Update module github.com/cloudquery/plugin-sdk/v3 to v3.6.7 ([#11043](#11043)) ([3c6d885](3c6d885))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Successfully merging this pull request may close these issues.

feat: Migrate plugin/source/hackernews to github.com/cloudquery/plugin-sdk/v3