Added crawling for init scripts on local files to assessment workflow by prajin-29 · Pull Request #960 · databrickslabs/ucx

prajin-29 · 2024-02-20T10:18:39Z

Changes

Scanning Init Script for local file and S3

Linked issues

#954

Resolves #954

Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

Tests

- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

codecov · 2024-02-20T10:20:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.48%. Comparing base (bc843c9) to head (6b5da21).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #960      +/-   ##
==========================================
+ Coverage   88.45%   88.48%   +0.03%     
==========================================
  Files          47       47              
  Lines        6157     6165       +8     
  Branches     1102     1105       +3     
==========================================
+ Hits         5446     5455       +9     
- Misses        472      473       +1     
+ Partials      239      237       -2


prajin-29 · 2024-02-20T10:22:39Z

@nfx I couldn't find options for GCP and abfss in `InitScriptInfo`. Can you suggest what can be done here?
For S3, I think we can connect with boto3 using an endpoint URL.
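
A minimal sketch of the S3 idea, not the code in this PR: it assumes the init script destination is an `s3://bucket/key` URI and that boto3 is available with credentials configured. The helper names `split_s3_uri` and `read_s3_init_script` are hypothetical; `boto3.client("s3", endpoint_url=...)` and `get_object` are the only boto3 calls used.

```python
from __future__ import annotations

from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/script.sh' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_s3_init_script(uri: str, endpoint_url: str | None = None) -> str:
    """Fetch an init script from S3 and return it as text (hypothetical helper)."""
    # Imported lazily so that URI parsing still works where boto3 is not installed.
    import boto3

    bucket, key = split_s3_uri(uri)
    # endpoint_url=None lets boto3 pick the default regional endpoint.
    client = boto3.client("s3", endpoint_url=endpoint_url)
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return body.decode("utf-8")
```

Keeping the URI parsing separate from the network call means the parsing can be unit tested without AWS access.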

github-actions · 2024-02-22T10:23:57Z

✅ 109/109 passed, 14 skipped, 1h10m25s total

Running from acceptance #1458

nfx · 2024-02-22T15:42:49Z

@prajin-29 this bug is not high priority enough, just cover this with a unit test and pick a more hi-prio one

nfx · 2024-02-22T15:43:23Z

src/databricks/labs/ucx/assessment/clusters.py

+                    with open(split[1], "r") as file:
+                        data = file.read()
+                    if data is not None:
+                        return base64.b64decode(data).decode("utf-8")


why do you think a local file is going to be base64 encoded?
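
One way to address the question above, as a sketch rather than the code that was eventually merged: a script on the local filesystem is plain text, so it can be read and returned directly with no base64 step. The function name `read_local_init_script` and the `file:` prefix handling are assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path


def read_local_init_script(destination: str) -> str | None:
    """Return the text of a local init script, or None if it cannot be read."""
    # Assumed convention: the destination may look like "file:/path/to/script.sh".
    if destination.startswith("file:"):
        path = destination.split(":", 1)[1]
    else:
        path = destination
    try:
        # Local files hold the script as-is, so no base64 decoding is applied.
        return Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        # Missing, unreadable, or non-UTF-8 files are reported as "no content".
        return None
```

Base64 decoding would only be appropriate for sources that return encoded content; applying it to plain script text would typically fail or produce garbage.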

gitguardian · 2024-02-27T10:26:54Z

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.

nfx

lgtm

* Added AWS IAM roles support to `databricks labs ucx migrate-credentials` command ([#973](#973)). This commit adds AWS Identity and Access Management (IAM) roles support to the `databricks labs ucx migrate-credentials` command, resolving issue [#862](#862) and being related to pull request [#874](#874). It includes the addition of a `load` function to `AWSResourcePermissions` to return identified instance profiles and the creation of an `IamRoleMigration` class under `aws/credentials.py` to migrate identified AWS instance profiles. Additionally, user documentation and a new CLI command `databricks labs ucx migrate-credentials` have been added, and the changes have been thoroughly tested with manual, unit, and integration tests. The functionality additions include new methods such as `add_uc_role_policy` and `update_uc_trust_role`, among others, designed to facilitate the migration process for AWS IAM roles.
* Added `create-catalogs-schemas` command to prepare destination catalogs and schemas before table migration ([#1028](#1028)). The Databricks Labs Unity Catalog (UCX) tool has been updated with a new `create-catalogs-schemas` command to facilitate the creation of destination catalogs and schemas prior to table migration. This command should be executed after the `create-table-mapping` command and is designed to prepare the workspace for migrating tables to UC. Additionally, a new `CatalogSchema` class has been added to the `hive_metastore` package to manage the creation of catalogs and schemas in the Hive metastore. This new functionality simplifies the process of preparing the destination Hive metastore for table migration, reducing the likelihood of user errors and ensuring that the metastore is properly configured. Unit tests have been added to the `tests/unit/hive_metastore` directory to verify the behavior of the `CatalogSchema` class and the new `create-catalogs-schemas` command. This command is intended for use in contexts where GCP is not supported.
* Added automated upgrade option to set up cluster policy ([#1024](#1024)). This commit introduces an automated upgrade option for setting up a cluster policy for older versions of UCX, separating the cluster creation policy from install.py to installer.policy.py and adding an upgrade script for older UCX versions. A new class, `ClusterPolicyInstaller`, is added to the `policy.py` file in the `installer` package to manage the creation and update of a Databricks cluster policy for Unity Catalog Migration. This class handles creating a new cluster policy with specific configurations, extracting external Hive Metastore configurations, and updating job policies. Additionally, the commit includes refactoring, removal of library references, and a new script, v0.15.0_added_cluster_policy.py, which contains the upgrade function. The changes are tested through manual and automated testing with unit tests and integration tests. This feature is intended for software engineers working with the project.
* Added crawling for init scripts on local files to assessment workflow ([#960](#960)). This commit introduces the ability to crawl init scripts stored on local files and S3 as part of the assessment workflow, resolving issue [#9](#9).
* Added database filter for the `assessment` workflow ([#989](#989)). In this release, we have added a new configuration option, `include_databases`, to the assessment workflow which allows users to specify a list of databases to include for migration, rather than crawling all the databases in the Hive Metastore. This feature is implemented in the `TablesCrawler`, `UdfsCrawler`, `GrantsCrawler` classes and the associated functions such as `_all_databases`, `getIncludeDatabases`, `_select_databases`. These changes aim to improve efficiency and reduce unnecessary crawling, and are accompanied by modifications to existing functionality, as well as the addition of unit and integration tests. The changes have been manually tested and verified on a staging environment.
* Estimate migration effort based on assessment database ([#1008](#1008)). In this release, a new functionality has been added to estimate the migration effort for each asset in the assessment database. The estimation is presented in days and is displayed on a new estimates dashboard with a summary widget for a global estimate per object type, along with assumptions and scope for each object type. A new `query` parameter has been added to the `SimpleQuery` class to support this feature. Additional changes include the update of the `_install_viz` and `_install_query` methods, the inclusion of the `data_source_id` in the query metadata, and the addition of tests to ensure the proper functioning of the new feature. A new fixture, `mock_installation_with_jobs`, has been added to support testing of the assessment estimates dashboard.
* Explicitly write to `hive_metastore` from `crawl_tables` task ([#1021](#1021)). In this release, we have improved the clarity and specificity of our handling of the `hive_metastore` in the `crawl_tables` task. Previously, the `df.write.saveAsTable` method was used without explicitly specifying the `hive_metastore` database, which could result in ambiguity. To address this issue, we have updated the `saveAsTable` method to include the `hive_metastore` database, ensuring that tables are written to the correct location in the Hive metastore. These changes are confined to the `src/databricks/labs/ucx/hive_metastore/tables.scala` file and affect the `crawl_tables` task. While no new methods have been added, the existing `saveAsTable` method has been modified to enhance the accuracy and predictability of our interaction with the Hive metastore.
* Improved documentation for `databricks labs ucx move` command ([#1025](#1025)). The `databricks labs ucx move` command has been updated with new improvements to its documentation, providing enhanced clarity and ease of use for developers and administrators. This command facilitates the movement of UC tables/table(s) from one schema to another, either in the same or different catalog, during the table upgrade process. A significant enhancement is the preservation of the source table's permissions when moving to a new schema or catalog, maintaining the original table's access controls, simplifying the management of table permissions, and streamlining the migration process. These improvements aim to facilitate a more efficient table migration experience, ensuring that developers and administrators can effectively manage their UC tables while maintaining the desired level of access control and security.
* Updated databricks-sdk requirement from ~=0.20.0 to ~=0.21.0 ([#1030](#1030)). In this update, the `databricks-sdk` package requirement has been updated to version `~=0.21.0` from `~=0.20.0`. This new version addresses several bugs and provides enhancements, including the fix for the `get_workspace_client` method in GCP, the use of the `all-apis` scope with the external browser, and an attempt to initialize all Databricks globals. Moreover, the API's settings nesting approach has changed, which may cause compatibility issues with previous versions. Several new services and dataclasses have been added to the API, and documentation and examples have been updated accordingly. There are no updates to the `databricks-labs-blueprint` and `PyYAML` dependencies in this commit.
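
The `include_databases` filter from the #989 entry above can be pictured roughly as follows. This is an illustrative sketch only: the function `select_databases` and its behavior (intersecting the requested names with what exists, falling back to everything when no filter is set) are assumptions, not the actual `_select_databases` implementation in UCX.

```python
from __future__ import annotations

from collections.abc import Iterable


def select_databases(
    all_databases: Iterable[str],
    include_databases: list[str] | None = None,
) -> list[str]:
    """Return the databases to crawl, honoring an optional include list."""
    if not include_databases:
        # No filter configured: crawl every database that was found.
        return list(all_databases)
    wanted = set(include_databases)
    # Keep the discovery order and skip names that do not exist.
    return [db for db in all_databases if db in wanted]
```

For example, `select_databases(["sales", "hr", "tmp"], ["hr", "unknown"])` keeps only `hr` under these assumptions.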



prajin-29 added 2 commits February 20, 2024 14:37

Adding local file read property for init scripts

2924c1f

Adding local file read property for init scripts

3a46c9e

prajin-29 had a problem deploying to account-admin February 20, 2024 10:18 — with GitHub Actions Failure

Merge branch 'main' into feature/init_script_changes

3927b7e

prajin-29 temporarily deployed to account-admin February 22, 2024 10:14 — with GitHub Actions Inactive

nfx requested changes Feb 22, 2024

Adding local file read property for init scripts

cbc4ec3

prajin-29 temporarily deployed to account-admin February 27, 2024 10:21 — with GitHub Actions Inactive

prajin-29 added 2 commits February 27, 2024 15:51

Merge branch 'main' into feature/init_script_changes

953a87f

Adding local file read property for init scripts

d596bb6

prajin-29 temporarily deployed to account-admin February 27, 2024 10:26 — with GitHub Actions Inactive

Adding local file read property for init scripts

085e46b

prajin-29 had a problem deploying to account-admin February 27, 2024 10:34 — with GitHub Actions Failure

nfx approved these changes Feb 27, 2024

nfx added the step/assessment go/uc/upgrade - Assessment Step label Mar 4, 2024

nfx changed the title ~~Scanning Init Script for local file and S3~~ Added crawling for init scripts on local files to assessment workflow Mar 4, 2024

prajin-29 added 2 commits March 5, 2024 07:37

Merge branch 'main' into feature/init_script_changes

8e1fbfc

Adding Integration Testing

d0db655

nfx approved these changes Mar 5, 2024

nfx marked this pull request as ready for review March 5, 2024 07:58

nfx requested review from a team and william-conti March 5, 2024 07:58

nfx temporarily deployed to account-admin March 5, 2024 07:58 — with GitHub Actions Inactive

Merge branch 'main' into feature/init_script_changes

6b5da21

prajin-29 had a problem deploying to account-admin March 7, 2024 05:37 — with GitHub Actions Failure

prajin-29 temporarily deployed to account-admin March 7, 2024 18:04 — with GitHub Actions Inactive

nfx added the ready to merge label Mar 7, 2024

nfx merged commit 40568e5 into main Mar 8, 2024

nfx deleted the feature/init_script_changes branch March 8, 2024 16:04

nfx mentioned this pull request Mar 8, 2024

Release v0.16.0 #1034

Merged

nfx merged 10 commits into main from feature/init_script_changes
