Install a global init script to collect HMS lineage for a few weeks. #513
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #513      +/-   ##
==========================================
+ Coverage   80.98%   81.12%   +0.13%
==========================================
  Files          31       33       +2
  Lines        3392     3438      +46
  Branches      658      667       +9
==========================================
+ Hits         2747     2789      +42
- Misses        491      493       +2
- Partials      154      156       +2
        return base64.b64encode(_init_script_content.encode()).decode()

    def _add_global_init_script(self):
        self._ws.global_init_scripts.create(
Do we check for existing init scripts? Can we append to existing scripts? Is there a risk?
Keep scripts separate, yes. But don't add the lineage enabler if it's already there. Manual reproduction: run the installer twice and verify only one script is created by ucx.
Hi @nfx - this is what the code does:
- If the spark config is present and the init script is enabled, it skips creating a new one.
- If the spark config is present but the init script is disabled, it asks whether the user wants to enable it, then enables it or leaves it as-is based on the response.
- If the spark config is not present, it creates a new enabled global init script.
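The three branches above can be sketched as a small pure decision function. This is a minimal sketch, not the actual installer code: `decide_action`, the `existing_script` dict, and `prompt_yes` are hypothetical stand-ins for the real workspace-API calls and prompt handling.

```python
def decide_action(existing_script, prompt_yes):
    """Hypothetical sketch of the installer's branching for the lineage init script.

    existing_script: {"enabled": bool} for an existing lineage script, or None.
    prompt_yes: the user's answer when asked whether to enable a disabled script.
    """
    if existing_script is None:
        return "create"  # no lineage config anywhere: create an enabled script
    if existing_script["enabled"]:
        return "skip"    # already present and enabled: nothing to do
    # present but disabled: act on the user's answer
    return "enable" if prompt_yes else "leave"


# Expected behavior for each branch:
assert decide_action(None, prompt_yes=False) == "create"
assert decide_action({"enabled": True}, prompt_yes=False) == "skip"
assert decide_action({"enabled": False}, prompt_yes=True) == "enable"
assert decide_action({"enabled": False}, prompt_yes=False) == "leave"
```

Running the installer twice is then idempotent by construction: the second run hits either the "skip" or the "enable"/"leave" branch, never "create".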
src/databricks/labs/ucx/runtime.py
Outdated
    @task("assessment", depends_on=[setup_schema])
    def enable_hms_lineage(cfg: WorkspaceConfig):
It'd be a better fit to roll the init script out in the installer; it'll be simpler to support. Ask it as a question in the installation flow.
added in install.py
        return base64.b64encode(_init_script_content.encode()).decode()

    def _add_global_init_script(self):
        self._ws.global_init_scripts.create(
nfx left a comment
Don't mark review comments as resolved unless you have really addressed them
        return created_script

    def add_spark_config_for_hms_lineage(self):
        created_script = self._add_global_init_script()
Add a mandatory check for any global init script with "spark.databricks.dataLineage.enabled" string to skip the addition of this init script
        self._add_sql_wh_config()
        return created_script.script_id

    def _add_sql_wh_config(self):
Don't add empty methods, they confuse people while reading the code
    def check_lineage_spark_config_exists(self) -> GlobalInitScriptDetailsWithContent:
        for script in self._ws.global_init_scripts.list():
            gscript = self._ws.global_init_scripts.get(script_id=script.script_id)
            if gscript:
Nit: Too much level of nesting, rewrite with "if gscript is None: continue"
Took care of it.
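The suggested early-`continue` refactor flattens the loop by one level. A minimal runnable sketch, with a hypothetical `FakeScripts` class standing in for `self._ws.global_init_scripts` (the real SDK client returns objects, not dicts):

```python
import base64

LINEAGE_CONF = "spark.databricks.dataLineage.enabled"


class FakeScripts:
    """Hypothetical stand-in for self._ws.global_init_scripts."""

    def __init__(self, items):
        self._items = items  # script_id -> {"script_id": ..., "script": base64 body}

    def list(self):
        return list(self._items.values())

    def get(self, script_id):
        return self._items.get(script_id)


def find_lineage_script(scripts):
    """Return the first global init script mentioning the lineage conf, else None."""
    for script in scripts.list():
        gscript = scripts.get(script_id=script["script_id"])
        if gscript is None:
            continue  # early continue instead of nesting the rest under `if gscript:`
        if LINEAGE_CONF in base64.b64decode(gscript["script"]).decode("utf-8"):
            return gscript
    return None
```

With the guard inverted, the match check sits at the loop's top indentation level, which is the readability win the reviewer is after.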
                if "spark.databricks.dataLineage.enabled" in base64.b64decode(gscript.script).decode("utf-8"):
                    return gscript

    def _get_init_script_content(self):
Keep the init script content in a file.
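Keeping the script body in a data file shipped next to the module, rather than an inline Python string, means it can be edited and reviewed as shell code. A minimal sketch; the filename `hms_lineage_init.sh` and the helper name are hypothetical:

```python
import base64
from pathlib import Path


def load_init_script(path: Path) -> str:
    """Read the init script body from disk and base64-encode it for the API.

    The global-init-scripts API expects the script content base64-encoded,
    which is why the file is round-tripped through b64encode here.
    """
    content = path.read_text(encoding="utf-8")
    return base64.b64encode(content.encode("utf-8")).decode("utf-8")
```

In a packaged library, the path would typically be resolved relative to the module (e.g. `Path(__file__).parent / "hms_lineage_init.sh"`) so the file travels with the wheel.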
src/databricks/labs/ucx/install.py
Outdated
    def _install_spark_config_for_hms_lineage(self):
        if (
            self._prompts
            and self._question(
Check first if the script exists, before asking the question.
**Breaking changes** (existing installations need to reinstall UCX and re-run assessment jobs)

* Switched local group migration component to rename groups instead of creating backup groups ([#450](#450)).
* Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together ([#512](#512)).

**New features**

* Added support for the experimental Databricks CLI launcher ([#517](#517)).
* Added support for external Hive Metastores including AWS Glue ([#400](#400)).
* Added more views to assessment dashboard ([#474](#474)).
* Added rate limit for creating backup group to increase stability ([#500](#500)).
* Added deduplication for mount point list ([#569](#569)).
* Added documentation to describe interaction with external Hive Metastores ([#473](#473)).
* Added failure injection for job failure message propagation ([#591](#591)).
* Added uniqueness in the new warehouse name to avoid conflicts on installation ([#542](#542)).
* Added a global init script to collect Hive Metastore lineage ([#513](#513)).
* Added retry set/update permissions when possible and assess the changes in the workspace ([#519](#519)).
* Use `~/.ucx/state.json` to store the state of both dashboards and jobs ([#561](#561)).

**Bug fixes**

* Fixed handling for `OWN` table permissions ([#571](#571)).
* Fixed handling of keys with and without values ([#514](#514)).
* Fixed integration test failures related to concurrent group delete ([#584](#584)).
* Fixed issue with workspace listing process on None type `object_type` ([#481](#481)).
* Fixed missing group entitlement migration bug ([#583](#583)).
* Fixed entitlement application for account-level groups ([#529](#529)).
* Fixed assessment throwing an error when the owner of an object is empty ([#485](#485)).
* Fixed installer to migrate between different configuration file versions ([#596](#596)).
* Fixed cluster policy crawler to be aware of deleted policies ([#486](#486)).
* Improved error message for not null constraints violated ([#532](#532)).
* Improved integration test resiliency ([#597](#597), [#594](#594), [#586](#586)).
* Introduced safer access to workspace objects' properties ([#530](#530)).
* Mitigated permissions loss in Table ACLs by running appliers with single thread ([#518](#518)).
* Running apply permission task before assessment should display message ([#487](#487)).
* Split integration tests from blocking the merge queue ([#496](#496)).
* Support more than one dashboard per step ([#472](#472)).
* Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0 ([#505](#505)).
* Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0 ([#575](#575)).


Fixes #324