
@windoze (Member) commented Jun 17, 2022

This is the refreshed PR for #124; the old one was stale for too long and could no longer be cleanly merged.
This PR does the following:

  1. Add a --system-properties command-line parameter to the Spark job, so the client can pass multiple secrets to the job.
  2. The JDBC-reading function already exists in the job code but had no public interface; it is now enabled when a JdbcSource appears in the feature definition file.
  3. On the client side, a JdbcSource is added to represent a JDBC data source with auth, along with the corresponding data model changes.
  4. Various fixes on both the job and client sides.
  5. An E2E test that gets offline features from a JDBC data source.
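A minimal sketch of item 1 above: how a client might fold several secrets into a single --system-properties argument. The encoding (comma-separated key=value pairs) and the function name are assumptions for illustration, not the actual Feathr implementation.

```python
# Hypothetical sketch: serialize per-source secrets into one
# --system-properties CLI argument for the Spark job. The real key names
# and serialization format live in the Feathr job/client code.

def build_system_properties_arg(secrets: dict) -> list:
    """Render a secrets dict as a --system-properties argument pair.

    Assumes a simple comma-separated key=value encoding, which may differ
    from the actual implementation.
    """
    rendered = ",".join(f"{k}={v}" for k, v in secrets.items())
    return ["--system-properties", rendered]

args = build_system_properties_arg({
    "nycTaxiBatchJdbcSource_USER": "feathr_user",       # placeholder value
    "nycTaxiBatchJdbcSource_PASSWORD": "example-pass",  # placeholder value
})
print(args)
```

Keeping the secrets in one argument lets the client pass an arbitrary number of them without the job's CLI schema changing per source.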

Currently the E2E test cannot run in the CI environment, because it requires a database with green_tripdata_2020-04.csv imported into a table called green_tripdata_2020_04, with auth configured correctly.

To run the E2E test locally, you need to:

  • Create a SQL database instance with user/pass auth on Azure; AAD/token auth is not covered by this test.
  • Import green_tripdata_2020-04.csv into the database with the table name set to green_tripdata_2020_04. Be aware of the column types: if you're using AzureSQL, all numeric columns such as prices or distances must be "FLOAT" and all ID columns must be "INT".
  • Set the following environment variables:
    • nycTaxiBatchJdbcSource_USER: the user name to log in to the database server
    • nycTaxiBatchJdbcSource_PASSWORD: the password to log in to the database server

Then you should be able to run test_sql_source.py directly or via pytest; both Databricks and Azure Synapse should work. You can use any user/pass combination for each JDBC data source; you are no longer limited to the previously globally unique JDBC_* settings.
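The env-var convention above (the JdbcSource name prefixed onto _USER/_PASSWORD) can be sketched as follows; the helper name is hypothetical and the credential values are placeholders set only for the demo:

```python
import os

# Sketch of the per-source credential lookup implied by the naming
# convention above: each JdbcSource name is prefixed onto _USER and
# _PASSWORD, so every source can carry its own credentials.

def jdbc_credentials(source_name: str):
    """Look up the user/password pair for a named JDBC source."""
    user = os.environ[f"{source_name}_USER"]
    password = os.environ[f"{source_name}_PASSWORD"]
    return user, password

# Placeholder values, normally set in the shell before running the test.
os.environ["nycTaxiBatchJdbcSource_USER"] = "feathr_user"
os.environ["nycTaxiBatchJdbcSource_PASSWORD"] = "example-pass"
print(jdbc_credentials("nycTaxiBatchJdbcSource"))
```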

@windoze added the "safe to test" label (tag to execute the build pipeline for a PR from a forked repo) Jun 17, 2022
@xiaoyongzhu (Member)

--system-properties sounds very generic and might be confusing for future developers. Assuming the intent is only to pass credentials, maybe we can call it something closer to that, like --jdbc-credentials or --client-secrets?

AtlasAttributeDef(
    name="url", typeName="string", cardinality=Cardinality.SINGLE),
AtlasAttributeDef(
    name="dbtable", typeName="string", cardinality=Cardinality.SINGLE),
@xiaoyongzhu (Member) commented Jun 17, 2022

Can we put those in a customized dict or something so it's more flexible, rather than adding them here in the attributes? The reasons:

  1. These schemas are hard (or impossible) to change.
  2. Not everyone is using JDBC.

Putting them in a map<string,string> gives us more flexibility.
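A minimal sketch of the map<string,string> idea, assuming a hypothetical registry entity class (the names below are illustrative, not the actual Feathr/Atlas schema):

```python
from dataclasses import dataclass, field

# Sketch contrasting fixed, typed attributes with a single generic
# map<string,string>. Class and field names are made up for illustration.

@dataclass
class SourceEntity:
    name: str
    # One open-ended map holds connector-specific options (url, dbtable,
    # auth, ...), so non-JDBC sources don't carry JDBC-only fields and the
    # registry schema never has to change to add a new option.
    options: dict = field(default_factory=dict)

jdbc = SourceEntity(
    name="nycTaxiBatchJdbcSource",
    options={"url": "jdbc:sqlserver://example.database.windows.net",
             "dbtable": "green_tripdata_2020_04"},
)
hdfs = SourceEntity(name="nycTaxiBatchHdfsSource")  # no JDBC-only fields
print(jdbc.options["dbtable"])
```

The trade-off is that a generic map loses per-field typing and validation, which is presumably why the PR used explicit optional attributes instead.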

@windoze (Member, Author) commented Jun 18, 2022

I don't plan to do a larger refactor of this part in this PR; Yihui is working on the data model re-design and will touch this as well, so let's keep this change as simple as possible to reduce the chance of conflicting with her changes.

BTW, all these fields are optional, so it won't break existing HDFS support.

@windoze (Member, Author) commented:

As for --system-properties: if you look at the code on the Scala side, you'll find that the name does serve its purpose.

@windoze merged commit 0d1157f into main Jun 20, 2022
@xiaoyongzhu deleted the windoze/job-sys-prop branch August 22, 2022 17:15
