Refreshed JDBC support #380
Conversation
src/main/scala/com/linkedin/feathr/offline/source/dataloader/BatchDataLoader.scala
src/main/scala/com/linkedin/feathr/offline/transformation/AnchorToDataSourceMapper.scala
```python
AtlasAttributeDef(
    name="url", typeName="string", cardinality=Cardinality.SINGLE),
AtlasAttributeDef(
    name="dbtable", typeName="string", cardinality=Cardinality.SINGLE),
```
Can we put these in a customized dict or something more flexible, rather than adding them here as fixed attributes? Reasons:
- These schemas are hard (or impossible) to change
- Not everyone is using JDBC

Putting them in a `map<string,string>` gives us more flexibility.
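To illustrate the suggestion, here is a hypothetical sketch (not the PR's actual schema): connector-specific settings go into a generic string-to-string map attribute, so adding a new connector never requires new fixed fields. All names and values below are illustrative.

```python
# Hypothetical source record: JDBC-only settings live in a generic "tags"
# map instead of dedicated "url"/"dbtable" attributes on the entity type.
source_attributes = {
    "name": "nycTaxiBatchJdbcSource",  # illustrative source name
    "type": "jdbc",
    "tags": {
        # Connector-specific settings; the entity schema itself stays
        # unchanged when a new connector needs new settings.
        "url": "jdbc:sqlserver://example.net:1433;database=feathr",
        "dbtable": "green_tripdata_2020_04",
    },
}

def get_tag(attributes, key, default=None):
    """Read an optional connector setting from the tags map."""
    return attributes.get("tags", {}).get(key, default)
```

Sources that don't use JDBC simply leave the map empty, so nothing breaks for existing HDFS sources.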
I don't plan to do a larger refactor of this part in this PR. Yihui is working on the data model re-design and will touch this area as well, so let's keep this change as simple as possible to reduce the chance of conflicts with her change.
BTW, all these fields are optional, so this won't break existing HDFS support.
As for `--system-properties`: if you look into the code on the Scala side, you'll find that the name does serve its purpose.
This is the refreshed PR for #124; the old one was stale for too long and could not be cleanly merged.

This PR does the following:
- Adds a `--system-properties` command line parameter to the Spark job, so the client can pass multiple secrets to the job.
- Adds `JdbcSource` to represent a data source loaded via JDBC, with auth, and the corresponding data model change.

Currently the E2E test cannot run in the CI environment because it requires a database with `green_tripdata_2020-04.csv` imported into a table called `green_tripdata_2020_04`, with auth configured correctly. To run the E2E test locally, you need to:
- Import `green_tripdata_2020-04.csv` into the database with the table name set to `green_tripdata_2020_04`. Be aware of the column types: if you're using AzureSQL, all numeric columns like prices or distances must be `FLOAT` and all ID columns must be `INT`.
- Set the following environment variables:
  - `nycTaxiBatchJdbcSource_USER`: the user name to log in to the database server
  - `nycTaxiBatchJdbcSource_PASSWORD`: the password to log in to the database server

Then you should be able to run `test_sql_source.py` directly or via `pytest`; both Databricks and Azure Synapse should work. You can use any user/password combination for each JDBC data source; you are no longer limited to the previously globally unique `JDBC_*` settings.
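As a minimal sketch of the local setup described above (the credential values are placeholders, and the exact test path depends on where the file lives in your checkout):

```python
import os
import subprocess

# Placeholder credentials for the per-source env vars the E2E test reads.
os.environ["nycTaxiBatchJdbcSource_USER"] = "my_db_user"
os.environ["nycTaxiBatchJdbcSource_PASSWORD"] = "my_db_password"

# Run the JDBC E2E test via pytest; the env vars above are inherited by
# the child process. The path is an assumption -- adjust to your checkout.
# subprocess.run(["pytest", "test_sql_source.py"], check=True)
```

Since each source reads its own `<sourceName>_USER`/`<sourceName>_PASSWORD` pair, two JDBC sources in one job can use different credentials.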