@windoze commented on Jul 15, 2022

This PR includes:

  1. Extended the former InputLocation classes to support both read and write operations, and renamed them to DataLocation to reflect this change.
  2. Added a GenericLocation that supports all Spark confs, modes, and options, so it can be used with virtually any connector supported by Spark.
  3. Added a format-specific patching mechanism to GenericLocation to work around the quirks of individual connectors, e.g. CosmosDb requires rows to have an id column with unique values.
  4. Updated FeathrGenJob and FeathrJoinJob so that they can take a JSON-encoded DataLocation string, instead of a plain path, as the input and output target.
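
To make items 3 and 4 more concrete, here is a minimal sketch in plain Python (not the Scala code in this PR, and without Spark). It shows one way a job argument could be decoded into a location object, and how a per-format patch could add a unique string id column before a write. Every name here (`SimplePathLocation`, `GenericLocation`, `parse_location`, `WRITE_PATCHES`, `ensure_string_id`) and the JSON field names are assumptions made for illustration, as is the fallback to a plain path when the argument is not a JSON object. Rows are modeled as plain dicts rather than a DataFrame.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Row = Dict[str, object]


@dataclass
class SimplePathLocation:
    """Hypothetical stand-in for a location given as a plain path."""
    path: str


@dataclass
class GenericLocation:
    """Hypothetical stand-in: a connector format plus free-form mode, options, confs."""
    format: str
    mode: str = "append"
    options: Dict[str, str] = field(default_factory=dict)
    conf: Dict[str, str] = field(default_factory=dict)


Location = Union[SimplePathLocation, GenericLocation]


def ensure_string_id(rows: List[Row]) -> List[Row]:
    """Return copies of the rows where every row has a string `id`.

    Missing ids are filled with a random UUID; existing ids are converted
    to strings and assumed to be unique already.
    """
    patched = []
    for row in rows:
        new_row = dict(row)
        existing = row.get("id")
        new_row["id"] = str(existing) if existing is not None else str(uuid.uuid4())
        patched.append(new_row)
    return patched


# Per-format write patches. The key is an illustrative format name.
WRITE_PATCHES: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "cosmos.oltp": ensure_string_id,
}


def parse_location(text: str) -> Location:
    """Decode a JSON object into a GenericLocation; treat anything else as a path."""
    try:
        spec = json.loads(text)
    except json.JSONDecodeError:
        return SimplePathLocation(text)
    if not isinstance(spec, dict):
        return SimplePathLocation(text)
    return GenericLocation(
        format=spec["format"],
        mode=spec.get("mode", "append"),
        options=spec.get("options", {}),
        conf=spec.get("conf", {}),
    )


def prepare_rows_for_write(location: Location, rows: List[Row]) -> List[Row]:
    """Apply the patch registered for the location's format, if there is one."""
    if isinstance(location, GenericLocation):
        patch = WRITE_PATCHES.get(location.format)
        if patch is not None:
            return patch(rows)
    return rows


loc = parse_location('{"format": "cosmos.oltp", "options": {"container": "features"}}')
rows = prepare_rows_for_write(loc, [{"user": "a"}, {"user": "b", "id": 7}])
```

Keeping the patches in a registry keyed by format means a new connector quirk can be handled by adding one entry, without touching the generic read/write path.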

Theoretically, Feathr core can support all Spark connectors with this patch, but we still need to run a series of compatibility tests to confirm the final list.

NOTE: This PR only involves Feathr core; the corresponding Feathr Client changes will follow in upcoming PRs.

@xiaoyongzhu

This PR looks good to me and is a good way of extending to other sources in the future. Thanks @windoze for the work!

@windoze added the "safe to test" label (tag to execute the build pipeline for a PR from a forked repo) on Jul 15, 2022
@windoze merged commit 671bae3 into main on Aug 13, 2022
@xiaoyongzhu deleted the windoze/generic-io branch on August 22, 2022 17:15
ahlag pushed a commit to ahlag/feathr that referenced this pull request Aug 26, 2022
* GenericLocation for DataFrame read/write

* WIP

* Generate id column

* Fix unit test

* Parse string into DataLocation

* Id column must be string

* Fix auth logic

* Fix unit test

* Fix id column generation

* CosmosDb Sink