
@windoze (Member) commented Jun 17, 2022

This is the refreshed PR for #124; the old one was stale for too long and could no longer be cleanly merged.
This PR does the following:

  1. Add a --system-properties command-line parameter to the Spark job, so the client can pass multiple secrets to the job.
  2. The JDBC-reading function already exists in the job code but had no public interface; it is now enabled when a JdbcSource appears in the feature definition file.
  3. On the client side, a JdbcSource is added to represent a JDBC data source with auth, along with the corresponding data model changes.
  4. Various fixes on both the job and client sides.
  5. An E2E test that gets offline features from a JDBC data source.
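A minimal sketch of item 1 above: how a client might fold several secrets into a single --system-properties argument. The encoding (comma-separated key=value pairs) and the function name are assumptions for illustration, not the actual Feathr implementation.

```python
# Hypothetical sketch: serialize per-source secrets into one
# --system-properties CLI argument for the Spark job. The real key names
# and serialization format live in the Feathr job/client code.

def build_system_properties_arg(secrets: dict) -> list:
    """Render a secrets dict as a --system-properties argument pair.

    Assumes a simple comma-separated key=value encoding, which may differ
    from the actual implementation.
    """
    rendered = ",".join(f"{k}={v}" for k, v in secrets.items())
    return ["--system-properties", rendered]

args = build_system_properties_arg({
    "nycTaxiBatchJdbcSource_USER": "feathr_user",       # placeholder value
    "nycTaxiBatchJdbcSource_PASSWORD": "example-pass",  # placeholder value
})
print(args)
```

Keeping the secrets in one argument lets the client pass an arbitrary number of them without the job's CLI schema changing per source.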

Currently the E2E test cannot run in the CI environment, because it requires a database with green_tripdata_2020-04.csv imported into a table called green_tripdata_2020_04, with auth configured correctly.

To run the E2E test locally, you need to:

  • Create a SQL database instance with user/pass auth on Azure; AAD/token auth is not covered by this test.
  • Import green_tripdata_2020-04.csv into the database with the table name set to green_tripdata_2020_04. Be aware of the column types: if you're using AzureSQL, all numeric columns such as prices or distances must be "FLOAT" and all ID columns must be "INT".
  • Set the following environment variables:
    • nycTaxiBatchJdbcSource_USER: the user name to log in to the database server
    • nycTaxiBatchJdbcSource_PASSWORD: the password to log in to the database server

Then you should be able to run test_sql_source.py directly or via pytest; both Databricks and Azure Synapse should work. You can use any user/pass combination for each JDBC data source; you are no longer limited to the previously globally unique JDBC_* settings.
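The env-var convention above (the JdbcSource name prefixed onto _USER/_PASSWORD) can be sketched as follows; the helper name is hypothetical and the credential values are placeholders set only for the demo:

```python
import os

# Sketch of the per-source credential lookup implied by the naming
# convention above: each JdbcSource name is prefixed onto _USER and
# _PASSWORD, so every source can carry its own credentials.

def jdbc_credentials(source_name: str):
    """Look up the user/password pair for a named JDBC source."""
    user = os.environ[f"{source_name}_USER"]
    password = os.environ[f"{source_name}_PASSWORD"]
    return user, password

# Placeholder values, normally set in the shell before running the test.
os.environ["nycTaxiBatchJdbcSource_USER"] = "feathr_user"
os.environ["nycTaxiBatchJdbcSource_PASSWORD"] = "example-pass"
print(jdbc_credentials("nycTaxiBatchJdbcSource"))
```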

@windoze added the "safe to test" label (tag to execute the build pipeline for a PR from a forked repo) Jun 17, 2022
@xiaoyongzhu (Member)

--system-properties sounds very generic and might be confusing for future developers. Assuming the intent is only to pass credentials, maybe we can call it something closer to that, like --jdbc-credentials or --client-secrets?

AtlasAttributeDef(
    name="url", typeName="string", cardinality=Cardinality.SINGLE),
AtlasAttributeDef(
    name="dbtable", typeName="string", cardinality=Cardinality.SINGLE),
@xiaoyongzhu (Member) commented Jun 17, 2022

Can we put those in a customized dict or something so it's more flexible, rather than adding them here in the attributes? The reasons:

  1. These schemas are hard (or impossible) to change.
  2. Not everyone is using JDBC.

Putting them in a map<string,string> gives us more flexibility.
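A minimal sketch of the map<string,string> idea, assuming a hypothetical registry entity class (the names below are illustrative, not the actual Feathr/Atlas schema):

```python
from dataclasses import dataclass, field

# Sketch contrasting fixed, typed attributes with a single generic
# map<string,string>. Class and field names are made up for illustration.

@dataclass
class SourceEntity:
    name: str
    # One open-ended map holds connector-specific options (url, dbtable,
    # auth, ...), so non-JDBC sources don't carry JDBC-only fields and the
    # registry schema never has to change to add a new option.
    options: dict = field(default_factory=dict)

jdbc = SourceEntity(
    name="nycTaxiBatchJdbcSource",
    options={"url": "jdbc:sqlserver://example.database.windows.net",
             "dbtable": "green_tripdata_2020_04"},
)
hdfs = SourceEntity(name="nycTaxiBatchHdfsSource")  # no JDBC-only fields
print(jdbc.options["dbtable"])
```

The trade-off is that a generic map loses per-field typing and validation, which is presumably why the PR used explicit optional attributes instead.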

@windoze (Member, Author) commented Jun 18, 2022

I don't plan to do a larger refactor of this part in this PR; Yihui is working on the data model re-design and will touch this as well, so let's keep this change as simple as possible to reduce the chance of conflicting with her changes.

BTW, all these fields are optional, so it won't break existing HDFS support.

@windoze (Member, Author) commented:

As for --system-properties: if you look at the code on the Scala side, you'll find that the name does serve its purpose.

@windoze merged commit 0d1157f into main Jun 20, 2022
@xiaoyongzhu deleted the windoze/job-sys-prop branch August 22, 2022 17:15
