
Added AWS IAM role support to databricks labs ucx create-uber-principal command#993

Merged
nfx merged 15 commits into main from feature/uber-iam-profile
Mar 11, 2024

Conversation

@mwojtyczka
Contributor

@mwojtyczka mwojtyczka commented Feb 29, 2024

Changes

Added the CLI command `databricks labs ucx create-uber-principal` for creating an uber-IAM profile to perform external table migration on AWS.

Logic:

  • Stop if the UCX migration cluster policy is not found
  • Collect the paths of all locations used by external tables (via `external_location.snapshot`)
  • If the cluster policy already specifies an IAM instance profile/role, add/update the migration policy to grant it access to those locations
  • If the cluster policy does not specify an IAM instance profile/role, create a new IAM profile/role and migration policy, and add them to the cluster policy
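The steps above can be sketched roughly as follows. All class and method names here are illustrative assumptions, not the exact UCX implementation:

```python
# Illustrative sketch of the create-uber-principal flow described above.
# Class/method names are assumptions, not the exact UCX implementation.
def create_uber_principal(policies, locations, aws, config):
    # 1. Stop if the UCX migration cluster policy is not found
    policy = policies.get(config.policy_id)
    if policy is None:
        raise ValueError("UCX cluster policy not found, please reinstall UCX")
    # 2. Collect the paths of all locations used by external tables
    s3_paths = {loc.location for loc in locations.snapshot()}
    role_name = policy.instance_profile_role_name
    if role_name:
        # 3. Existing role on the policy: extend its migration policy
        aws.update_migration_policy(role_name, s3_paths)
    else:
        # 4. No role yet: create role + policy and attach to the cluster policy
        role_name = aws.create_migration_role(s3_paths)
        policies.attach_instance_profile(policy, role_name)
    return role_name
```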

Linked issues

Resolves #879

Related issues:

Functionality

  • added new CLI command

Tests

  • manually tested
  • added unit tests

TODO

  • added integration tests
  • verified on staging environment (screenshot attached)

@mwojtyczka mwojtyczka added the pr/do-not-merge this pull request is not ready to merge label Feb 29, 2024
@mwojtyczka mwojtyczka requested review from a team and fannijako February 29, 2024 15:02
@mwojtyczka mwojtyczka requested review from nfx and removed request for fannijako February 29, 2024 15:02
@mwojtyczka mwojtyczka removed the pr/do-not-merge this pull request is not ready to merge label Feb 29, 2024
@mwojtyczka mwojtyczka changed the title Added CLI command for creating uber-IAM profile for performing external table migration on AWS DRAFT: Added CLI command for creating uber-IAM profile for performing external table migration on AWS Feb 29, 2024
@mwojtyczka mwojtyczka changed the title DRAFT: Added CLI command for creating uber-IAM profile for performing external table migration on AWS Added CLI command for creating uber-IAM profile for performing external table migration on AWS Feb 29, 2024
@mwojtyczka mwojtyczka marked this pull request as draft February 29, 2024 15:06
@codecov

codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 82.94118%, with 58 lines in your changes missing coverage. Please review.

Project coverage is 88.64%. Comparing base (2b22656) to head (fc574d9).
Report is 1 commit behind head on main.

Files Patch % Lines
src/databricks/labs/ucx/aws/access.py 87.76% 18 Missing and 11 partials ⚠️
src/databricks/labs/ucx/cli.py 64.28% 15 Missing ⚠️
src/databricks/labs/ucx/assessment/aws.py 75.43% 10 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #993      +/-   ##
==========================================
- Coverage   88.98%   88.64%   -0.35%     
==========================================
  Files          51       52       +1     
  Lines        6501     6655     +154     
  Branches     1169     1194      +25     
==========================================
+ Hits         5785     5899     +114     
- Misses        466      500      +34     
- Partials      250      256       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions

github-actions bot commented Feb 29, 2024

✅ 109/109 passed, 19 skipped, 1h0m38s total

Running from acceptance #1559

@nfx nfx requested a review from nkvuong February 29, 2024 18:56

iam_policy_name = f"UCX_MIGRATION_POLICY_{config.inventory_database}"
prompts = Prompts()
if iam_role_name_in_cluster_policy and aws_permissions.is_role_exists(iam_role_name_in_cluster_policy):
Collaborator


This is too much logic for cli.py. Move it into an AWSPermissions method and pass it a Prompts instance, so that it's unit-testable. See service_principal_migration.run(prompts) as an example.
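The pattern the reviewer points at, mirroring `service_principal_migration.run(prompts)`, can be sketched like this. Names below are illustrative, not UCX's exact API:

```python
# Sketch: keep cli.py thin and put prompt-driven logic on the domain class,
# so it can be unit-tested with a fake Prompts object.
class AWSResourcePermissions:
    def __init__(self, aws):
        self._aws = aws

    def create_uber_principal(self, prompts):
        # all decision logic lives here, not in cli.py
        if not prompts.confirm("Create new migration role and policy?"):
            return None
        return self._aws.create_migration_role()


def create_uber_principal_cmd(permissions, prompts):
    # cli.py only wires dependencies and delegates
    return permissions.create_uber_principal(prompts)
```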

logger.info(f"Cluster policy \"{cluster_policy.name}\" updated successfully")


def _get_cluster_policy(w: WorkspaceClient, policy_id: str) -> Policy:
Collaborator


should be a private method in AWSResourcePermissions

return w.cluster_policies.get(policy_id=policy_id)
except NotFound as err:
msg = f"UCX Policy {policy_id} not found, please reinstall UCX"
logger.error(msg)
Collaborator


no need to log error - it's caught up the stack

return None


def _update_cluster_policy_with_aws_instance_profile(
Collaborator


private method in AWSResourcePermissions

@nfx nfx changed the title Added CLI command for creating uber-IAM profile for performing external table migration on AWS Added AWS IAM role support to databricks labs ucx create-uber-principal command Mar 4, 2024
@nfx nfx requested a review from HariGS-DB March 4, 2024 16:20
nfx pushed a commit that referenced this pull request Mar 5, 2024
…zure Service Principal for migration (#976)

## Changes
 - Added new CLI command `create-master-principal` in labs.yml and cli.py
 - Added a separate `AzureApiClient` class to separate out Azure API calls
 - Added logic to create the SPN, secret, and role assignment in resources, and to update the workspace config with the SPN client_id
 - Added logic to create the SPN, update RBAC of all storage accounts to grant that SPN access, and update the UCX cluster policy with the SPN secret for each storage account
 - Added unit and integration test cases
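The flow described above, create an SPN, grant it storage RBAC, and record it in the config, can be sketched as follows. The client objects and method names are illustrative stand-ins, not the real Azure/Databricks APIs:

```python
# Sketch of the Azure uber-SPN flow described above; all names are
# illustrative assumptions, not the real client APIs.
def create_master_principal(azure, config, storage_accounts):
    spn = azure.create_service_principal("ucx-uber-spn")
    secret = azure.create_secret(spn)
    for account in storage_accounts:
        # grant the SPN access to each storage account via a role assignment
        azure.assign_role(spn, account, "Storage Blob Data Contributor")
    # record the SPN client_id in the workspace config
    config.uber_spn_client_id = spn.client_id
    return spn, secret
```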

Resolves #881 

Related issues: 
- #993
- #693

### Functionality 

- [ ] added relevant user documentation
- [X] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [X] manually tested
- [X] added unit tests
- [X] added integration tests
- [ ] verified on staging environment (screenshot attached)
nkvuong added a commit that referenced this pull request Mar 6, 2024
author Vuong <[email protected]> 1709737244 +0000
committer Vuong <[email protected]> 1709739422 +0000


add trust relationship update

Fix integration tests on AWS (#978)

Update groups permissions validation to use Table ACL cluster (#979)

Renamed columns in assessment SQL queries to use actual names, not aliases (#983)

Aliases are usually not allowed in projections (they are replaced
later in the query execution phases). While DBSQL was smart enough
to handle the references via aliases, for some setups this results in an
error. Changing column references to use actual names fixes this.


Resolves #980

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`


- [x] manually tested
- [ ] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

Fixed `config.yml` upgrade from very old versions (#984)

Added `upgraded_from_workspace_id` property to migrated tables to indicate the source workspace. (#987)

Added table parameter `upgraded_from_ws` to migrated tables. The
parameter contains the source workspace ID.

Resolves #899
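Setting such a table property boils down to generating an `ALTER TABLE ... SET TBLPROPERTIES` statement. A minimal sketch, assuming the property keys (the exact SQL UCX emits may differ):

```python
# Sketch: generate the ALTER TABLE statement that records the source
# workspace id as a table property. Property keys here are assumptions
# based on the description above, not the exact UCX implementation.
def sql_alter_from(full_table_name: str, source_table: str, workspace_id: int) -> str:
    return (
        f"ALTER TABLE {full_table_name} SET TBLPROPERTIES "
        f"('upgraded_from' = '{source_table}', "
        f"'upgraded_from_ws' = '{workspace_id}');"
    )
```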

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`


- [x] manually tested
- [x] added unit tests
- [x] added integration tests
- [x] verified on staging environment (screenshot attached)

Added group members difference to the output of `validate-groups-membership` cli command (#995)

The `validate-groups-membership` command has been updated to include a
comparison of group memberships at both the account and workspace
levels, displaying the difference in members between the two levels in a
new column. This enhancement allows for a more detailed analysis of
group memberships, with the added functionality implemented in the
`validate_group_membership` function in the `groups.py` file located in
the `databricks/labs/ucx/workspace_access` directory. A new output
field, `group_members_difference`, has been added to represent the
difference in the number of members between a workspace group and an
associated account group. The corresponding unit test file,
`test_groups.py`, has been updated to include a new test case that
verifies the calculation of the `group_members_difference` value. This
change provides users with a more comprehensive view of their group
memberships and allows them to easily identify any discrepancies between
the account and workspace levels. The functionality of the other
commands remains unchanged.
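The new column, as described, reduces to a member-count difference per group pair; a rough sketch (the input shapes are assumptions):

```python
# Sketch: difference in member counts between a workspace group and its
# matching account group. A positive value means the workspace group has
# more members; negative means the account group has more.
def group_members_difference(workspace_members: set, account_members: set) -> int:
    return len(workspace_members) - len(account_members)
```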

Improved installation integration test flakiness (#998)

- improved `_infer_error_from_job_run` and `_infer_error_from_task_run`
to also catch `KeyError` and `ValueError`
- removed retries for `Unknown` errors for installation tests

Expanded end-user documentation with detailed descriptions for workflows and commands (#999)

The Databricks Labs UCX project has been updated with several new
features to assist in upgrading to Unity Catalog. These include various
workflows and command-line utilities, such as an assessment workflow
that generates a detailed compatibility report for workspace entities
and a group migration workflow to upgrade all Databricks workspace
assets. Additionally, new utility commands have been added for managing
cross-workspace installations, and users can now view deployed
workflows' status and repair failed workflows. A new end-user
documentation has also been introduced, featuring comprehensive
descriptions of workflows, commands, and an assessment report image. The
Assessment Report, generated from UCX tools, now includes a more
detailed summary of the assessment findings, table counts, database
summaries, and external locations. Improved documentation for external
Hive Metastore integration and a new debugging notebook are also
included in this release. Lastly, the workspace group migration feature
has been expanded to handle potential conflicts when migrating multiple
workspaces with locally scoped group names.

Release v0.14.0 (#1000)

* Added `upgraded_from_workspace_id` property to migrated tables to
indicate the source workspace
([#987](#987)). In this
release, updates have been made to the `_migrate_external_table`,
`_migrate_dbfs_root_table`, and `_migrate_view` methods in the
`table_migrate.py` file to include a new parameter `upgraded_from_ws` in
the SQL commands used to alter tables, views, or managed tables. This
parameter is used to store the source workspace ID in the migrated
tables, indicating the migration origin. A new utility method
`sql_alter_from` has been added to the `Table` class in `tables.py` to
generate the SQL command with the new parameter. Additionally, a new
class-level attribute `UPGRADED_FROM_WS_PARAM` has been added to the
`Table` class in `tables.py` to indicate the source workspace. A new
property `upgraded_from_workspace_id` has been added to migrated tables
to store the source workspace ID. These changes resolve issue
[#899](#899) and are tested
through manual testing, unit tests, and integration tests. No new CLI
commands, workflows, or tables have been added or modified, and there
are no changes to user documentation.
* Added a command to create account level groups if they do not exist
([#763](#763)). This commit
introduces a new feature that enables the creation of account-level
groups if they do not already exist in the account. A new command,
`create-account-groups`, has been added to the `databricks labs ucx`
tool, which crawls all workspaces in the account and creates
account-level groups if a corresponding workspace-local group is not
found. The feature supports various scenarios, including creating
account-level groups that exist in some workspaces but not in others,
and creating multiple account-level groups with the same name but
different members. Several new methods have been added to the
`account.py` file to support the new feature, and the `test_account.py`
file has been updated with new tests to ensure the correct behavior of
the `create_account_level_groups` method. Additionally, the `cli.py`
file has been updated to include the new `create-account-groups`
command. With these changes, users can easily manage account-level
groups and ensure that they are consistent across all workspaces in the
account, improving the overall user experience.
* Added assessment for the incompatible `RunSubmit` API usages
([#849](#849)). In this
release, the assessment functionality for incompatible `RunSubmit` API
usages has been significantly enhanced through various changes. The
'clusters.py' file has seen improvements in clarity and consistency with
the renaming of private methods `check_spark_conf` to
`_check_spark_conf` and `check_cluster_failures` to
`_check_cluster_failures`. The `_assess_clusters` method has been
updated to call the renamed `_check_cluster_failures` method for
thorough checks of cluster configurations, resulting in better
assessment functionality. A new `SubmitRunsCrawler` class has been added
to the `databricks.labs.ucx.assessment.jobs` module, implementing
`CrawlerBase`, `JobsMixin`, and `CheckClusterMixin` classes. This class
crawls and assesses job runs based on their submitted runs, ensuring
compatibility and identifying failure issues. Additionally, a new
configuration attribute, `num_days_submit_runs_history`, has been
introduced in the `WorkspaceConfig` class of the `config.py` module,
controlling the number of days for which submission history of
`RunSubmit` API calls is retained. Lastly, various new JSON files have
been added for unit testing, assessing the `RunSubmit` API usages
related to different scenarios like dbt task runs, Git source-based job
runs, JAR file runs, and more. These tests will aid in identifying and
addressing potential compatibility issues with the `RunSubmit` API.
* Added group members difference to the output of
`validate-groups-membership` cli command
([#995](#995)). The
`validate-groups-membership` command has been updated to include a
comparison of group memberships at both the account and workspace
levels. This enhancement is implemented through the
`validate_group_membership` function, which has been updated to
calculate the difference in members between the two levels and display
it in a new `group_members_difference` column. This allows for a more
detailed analysis of group memberships and easily identifies any
discrepancies between the account and workspace levels. The
corresponding unit test file, "test_groups.py," has been updated to
include a new test case that verifies the calculation of the
`group_members_difference` value. The functionality of the other
commands remains unchanged. The new `group_members_difference` value is
calculated as the difference in the number of members in the workspace
group and the account group, with a positive value indicating more
members in the workspace group and a negative value indicating more
members in the account group. The table template in the labs.yml file
has also been updated to include the new column for the group membership
difference.
* Added handling for empty `directory_id` if managed identity
encountered during the crawling of StoragePermissionMapping
([#986](#986)). This PR adds
a `type` field to the `StoragePermissionMapping` and `Principal`
dataclasses to differentiate between service principals and managed
identities, allowing `None` for the `directory_id` field if the
principal is not a service principal. During the migration to UC storage
credentials, managed identities are currently ignored. These changes
improve handling of managed identities during the crawling of
`StoragePermissionMapping`, prevent errors when creating storage
credentials with managed identities, and address issue
[#339](#339). The changes
are tested through unit tests, manual testing, and integration tests,
and only affect the `StoragePermissionMapping` class and related
methods, without introducing new commands, workflows, or tables.
* Added migration for Azure Service Principals with secrets stored in
Databricks Secret to UC Storage Credentials
([#874](#874)). In this
release, we have made significant updates to migrate Azure Service
Principals with their secrets stored in Databricks Secret to UC Storage
Credentials, enhancing security and management of storage access. The
changes include: Addition of a new `migrate_credentials` command in the
`labs.yml` file to migrate credentials for storage access to UC storage
credential. Modification of `secrets.py` to handle the case where a
secret has been removed from the backend and to log warning messages for
secrets with invalid Base64 bytes. Introduction of the
`StorageCredentialManager` and `ServicePrincipalMigration` classes in
`credentials.py` to manage Azure Service Principals and their associated
client secrets, and to migrate them to UC Storage Credentials. Addition
of a new `directory_id` attribute in the `Principal` class and its
associated dataclass in `resources.py` to store the directory ID for
creating UC storage credentials using a service principal. Creation of a
new pytest fixture, `make_storage_credential_spn`, in `fixtures.py` to
simplify writing tests requiring Databricks Storage Credentials with
Azure Service Principal auth. Addition of a new test file for the Azure
integration of the project, including new classes, methods, and test
cases for testing the migration of Azure Service Principals to UC
Storage Credentials. These improvements will ensure better security and
management of storage access using Azure Service Principals, while
providing more efficient and robust testing capabilities.
* Added permission migration support for feature tables and the root
permissions for models and feature tables
([#997](#997)). This commit
introduces support for migration of permissions related to feature
tables and sets root permissions for models and feature tables. New
functions such as `feature_store_listing`, `feature_tables_root_page`,
`models_root_page`, and `tokens_and_passwords` have been added to
facilitate population of a workspace access page with necessary
permissions information. The `factory` function in `manager.py` has been
updated to include new listings for models' root page, feature tables'
root page, and the feature store for enhanced management and access
control of models and feature tables. New classes and methods have been
implemented to handle permissions for these resources, utilizing
`GenericPermissionsSupport`, `AccessControlRequest`, and `MigratedGroup`
classes. Additionally, new test methods have been included to verify
feature tables listing functionality and root page listing functionality
for feature tables and registered models. The test manager method has
been updated to include `feature-tables` in the list of items to be
checked for permissions, ensuring comprehensive testing of permission
functionality related to these new feature tables.
* Added support for serving endpoints
([#990](#990)). In this
release, we have made significant enhancements to support serving
endpoints in our open-source library. The `fixtures.py` file in the
`databricks.labs.ucx.mixins` module has been updated with new classes
and functions to create and manage serving endpoints, accompanied by
integration tests to verify their functionality. We have added a new
listing for serving endpoints in the assessment's permissions crawling,
using the `ws.serving_endpoints.list` function and the
`serving-endpoints` category. A new integration test, "test_endpoints,"
has been added to verify that assessments now crawl permissions for
serving endpoints. This test demonstrates the ability to migrate
permissions from one group to another. The test suite has been updated
to ensure the proper functioning of the new feature and improve the
assessment of permissions for serving endpoints, ensuring compatibility
with the updated `test_manager.py` file.
* Expanded end-user documentation with detailed descriptions for
workflows and commands
([#999](#999)). The
Databricks Labs UCX project has been updated with several new features
to assist in upgrading to Unity Catalog, including an assessment
workflow that generates a detailed compatibility report for workspace
entities, a group migration workflow for upgrading all Databricks
workspace assets, and utility commands for managing cross-workspace
installations. The Assessment Report now includes a more detailed
summary of the assessment findings, table counts, database summaries,
and external locations. Additional improvements include expanded
workspace group migration to handle potential conflicts with locally
scoped group names, enhanced documentation for external Hive Metastore
integration, a new debugging notebook, and detailed descriptions of
table upgrade considerations, data access permissions, external storage,
and table crawler.
* Fixed `config.yml` upgrade from very old versions
([#984](#984)). In this
release, we've introduced enhancements to the configuration upgrading
process for `config.yml` in our open-source library. We've replaced the
previous `v1_migrate` class method with a new implementation that
specifically handles migration from version 1. The new method retrieves
the `groups` field, extracts the `selected` value, and assigns it to the
`include_group_names` key in the configuration. The
`backup_group_prefix` value from the `groups` field is assigned to the
`renamed_group_prefix` key, and the `groups` field is removed, with the
version number updated to 2. These changes simplify the code and improve
readability, enabling users to upgrade smoothly from version 1 of the
configuration. Furthermore, we've added new unit tests to the
`test_config.py` file to ensure backward compatibility. Two new tests,
`test_v1_migrate_zeroconf` and `test_v1_migrate_some_conf`, have been
added, utilizing the `MockInstallation` class and loading the
configuration using `WorkspaceConfig`. These tests enhance the
robustness and reliability of the migration process for `config.yml`.
* Renamed columns in assessment SQL queries to use actual names, not
aliases ([#983](#983)). In
this update, we have resolved an issue where aliases used for column
references in SQL queries caused errors in certain setups by renaming
them to use actual names. Specifically, for assessment SQL queries, we
have modified the definition of the `is_delta` column to use the actual
`table_format` name instead of the alias `format`. This change improves
compatibility and enhances the reliability of query execution. As a
software engineer, you will appreciate that this modification ensures
consistent interpretation of column references across various setups,
thereby avoiding potential errors caused by aliases. This change does
not introduce any new methods, but instead modifies existing
functionality to use actual column names, ensuring a more reliable and
consistent SQL query for the `05_0_all_tables` assessment.
* Updated groups permissions validation to use Table ACL cluster
([#979](#979)). In this
update, the `validate_groups_permissions` task has been modified to
utilize the Table ACL cluster, as indicated by the inclusion of
`job_cluster="tacl"`. This task is responsible for ensuring that all
crawled permissions are accurately applied to the destination groups by
calling the `permission_manager.apply_group_permissions` method during
the migration state. This modification enhances the validation of group
permissions by performing it on the Table ACL cluster, potentially
improving performance or functionality. If you are implementing this
project, it is crucial to comprehend the consequences of this change on
your permissions validation process and adjust your workflows
appropriately.

Update databricks-labs-blueprint requirement from ~=0.2.4 to ~=0.3.0 (#1001)

Updates the requirements on
[databricks-labs-blueprint](https://github.com/databrickslabs/blueprint)
to permit the latest version.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/databrickslabs/blueprint/releases">databricks-labs-blueprint's
releases</a>.</em></p>
<blockquote>
<h2>v0.3.0</h2>
<ul>
<li>Added automated upgrade framework (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/50">#50</a>).
This update introduces an automated upgrade framework for managing and
applying upgrades to the product, with a new <code>upgrades.py</code>
file that includes a <code>ProductInfo</code> class having methods for
version handling, wheel building, and exception handling. The test code
organization has been improved, and new test cases, functions, and a
directory structure for fixtures and unit tests have been added for the
upgrades functionality. The <code>test_wheels.py</code> file now checks
the version of the Databricks SDK and handles cases where the version
marker is missing or does not contain the <code>__version__</code>
variable. Additionally, a new <code>Application State Migrations</code>
section has been added to the README, explaining the process of seamless
upgrades from version X to version Z through version Y, addressing the
need for configuration or database state migrations as the application
evolves. Users can apply these upgrades by following an idiomatic usage
pattern involving several classes and functions. Furthermore,
improvements have been made to the <code>_trim_leading_whitespace</code>
function in the <code>commands.py</code> file of the
<code>databricks.labs.blueprint</code> module, ensuring accurate and
consistent removal of leading whitespace for each line in the command
string, leading to better overall functionality and
maintainability.</li>
<li>Added brute-forcing <code>SerdeError</code> with
<code>as_dict()</code> and <code>from_dict()</code> (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/58">#58</a>).
This commit introduces a brute-forcing approach for handling
<code>SerdeError</code> using <code>as_dict()</code> and
<code>from_dict()</code> methods in an open-source library. The new
<code>SomePolicy</code> class demonstrates the usage of these methods
for manual serialization and deserialization of custom classes. The
<code>as_dict()</code> method returns a dictionary representation of the
class instance, and the <code>from_dict()</code> method, decorated with
<code>@classmethod</code>, creates a new instance from the provided
dictionary. Additionally, the GitHub Actions workflow for acceptance
tests has been updated to include the <code>ready_for_review</code>
event type, ensuring that tests run not only for opened and synchronized
pull requests but also when marked as &quot;ready for review.&quot;
These changes provide developers with more control over the
deserialization process and facilitate debugging in cases where default
deserialization fails, but should be used judiciously to avoid brittle
code.</li>
<li>Fixed nightly integration tests run as service principals (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/52">#52</a>).
In this release, we have enhanced the compatibility of our codebase with
service principals, particularly in the context of nightly integration
tests. The <code>Installation</code> class in the
<code>databricks.labs.blueprint.installation</code> module has been
refactored, deprecating the <code>current</code> method and introducing
two new methods: <code>assume_global</code> and
<code>assume_user_home</code>. These methods enable users to install and
manage <code>blueprint</code> as either a global or user-specific
installation. Additionally, the <code>existing</code> method has been
updated to work with the new <code>Installation</code> methods. In the
test suite, the <code>test_installation.py</code> file has been updated
to correctly detect global and user-specific installations when running
as a service principal. These changes improve the testability and
functionality of our software, ensuring seamless operation with service
principals during nightly integration tests.</li>
<li>Made <code>test_existing_installations_are_detected</code> more
resilient (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/51">#51</a>).
In this release, we have added a new test function
<code>test_existing_installations_are_detected</code> that checks if
existing installations are correctly detected and retries the test for
up to 15 seconds if they are not. This improves the reliability of the
test by making it more resilient to potential intermittent failures. We
have also added an import from <code>databricks.sdk.retries</code> named
<code>retried</code> which is used to retry the test function in case of
an <code>AssertionError</code>. Additionally, the test function
<code>test_existing</code> has been renamed to
<code>test_existing_installations_are_detected</code> and the
<code>xfail</code> marker has been removed. We have also renamed the
test function <code>test_dataclass</code> to
<code>test_loading_dataclass_from_installation</code> for better
clarity. This change will help ensure that the library is correctly
detecting existing installations and improve the overall quality of the
codebase.</li>
</ul>
<p>Contributors: <a
href="https://github.com/nfx"><code>@nfx</code></a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/databrickslabs/blueprint/blob/main/CHANGELOG.md">databricks-labs-blueprint's
changelog</a>.</em></p>
<blockquote>
<h2>0.3.0</h2>
<ul>
<li>Added automated upgrade framework (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/50">#50</a>).
This update introduces an automated upgrade framework for managing and
applying upgrades to the product, with a new <code>upgrades.py</code>
file that includes a <code>ProductInfo</code> class having methods for
version handling, wheel building, and exception handling. The test code
organization has been improved, and new test cases, functions, and a
directory structure for fixtures and unit tests have been added for the
upgrades functionality. The <code>test_wheels.py</code> file now checks
the version of the Databricks SDK and handles cases where the version
marker is missing or does not contain the <code>__version__</code>
variable. Additionally, a new <code>Application State Migrations</code>
section has been added to the README, explaining the process of seamless
upgrades from version X to version Z through version Y, addressing the
need for configuration or database state migrations as the application
evolves. Users can apply these upgrades by following an idiomatic usage
pattern involving several classes and functions. Furthermore,
improvements have been made to the <code>_trim_leading_whitespace</code>
function in the <code>commands.py</code> file of the
<code>databricks.labs.blueprint</code> module, ensuring accurate and
consistent removal of leading whitespace for each line in the command
string, leading to better overall functionality and
maintainability.</li>
<li>Added brute-forcing <code>SerdeError</code> with
<code>as_dict()</code> and <code>from_dict()</code> (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/58">#58</a>).
This commit introduces a brute-forcing approach for handling
<code>SerdeError</code> using <code>as_dict()</code> and
<code>from_dict()</code> methods in an open-source library. The new
<code>SomePolicy</code> class demonstrates the usage of these methods
for manual serialization and deserialization of custom classes. The
<code>as_dict()</code> method returns a dictionary representation of the
class instance, and the <code>from_dict()</code> method, decorated with
<code>@classmethod</code>, creates a new instance from the provided
dictionary. Additionally, the GitHub Actions workflow for acceptance
tests has been updated to include the <code>ready_for_review</code>
event type, ensuring that tests run not only for opened and synchronized
pull requests but also when marked as &quot;ready for review.&quot;
These changes provide developers with more control over the
deserialization process and facilitate debugging in cases where default
deserialization fails, but should be used judiciously to avoid brittle
code.</li>
<li>Fixed nightly integration tests run as service principals (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/52">#52</a>).
In this release, we have enhanced the compatibility of our codebase with
service principals, particularly in the context of nightly integration
tests. The <code>Installation</code> class in the
<code>databricks.labs.blueprint.installation</code> module has been
refactored, deprecating the <code>current</code> method and introducing
two new methods: <code>assume_global</code> and
<code>assume_user_home</code>. These methods enable users to install and
manage <code>blueprint</code> as either a global or user-specific
installation. Additionally, the <code>existing</code> method has been
updated to work with the new <code>Installation</code> methods. In the
test suite, the <code>test_installation.py</code> file has been updated
to correctly detect global and user-specific installations when running
as a service principal. These changes improve the testability and
functionality of our software, ensuring seamless operation with service
principals during nightly integration tests.</li>
<li>Made <code>test_existing_installations_are_detected</code> more
resilient (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/51">#51</a>).
In this release, we have added a new test function
<code>test_existing_installations_are_detected</code> that checks if
existing installations are correctly detected and retries the test for
up to 15 seconds if they are not. This improves the reliability of the
test by making it more resilient to potential intermittent failures. We
have also added an import from <code>databricks.sdk.retries</code> named
<code>retried</code> which is used to retry the test function in case of
an <code>AssertionError</code>. Additionally, the test function
<code>test_existing</code> has been renamed to
<code>test_existing_installations_are_detected</code> and the
<code>xfail</code> marker has been removed. We have also renamed the
test function <code>test_dataclass</code> to
<code>test_loading_dataclass_from_installation</code> for better
clarity. This change will help ensure that the library is correctly
detecting existing installations and improve the overall quality of the
codebase.</li>
</ul>
<h2>0.2.5</h2>
<ul>
<li>Automatically enable workspace filesystem if the feature is disabled
(<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/42">#42</a>).</li>
</ul>
<h2>0.2.4</h2>
<ul>
<li>Added more integration tests for <code>Installation</code> (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/39">#39</a>).</li>
<li>Fixed <code>yaml</code> optional import error (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/38">#38</a>).</li>
</ul>
<h2>0.2.3</h2>
<ul>
<li>Added special handling for notebooks in
<code>Installation.upload(...)</code> (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/36">#36</a>).</li>
</ul>
<h2>0.2.2</h2>
<ul>
<li>Fixed issues with uploading wheels to DBFS and loading a
non-existing install state (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/34">#34</a>).</li>
</ul>
<h2>0.2.1</h2>
<ul>
<li>Aligned <code>Installation</code> framework with UCX project (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/32">#32</a>).</li>
</ul>
<h2>0.2.0</h2>
<ul>
<li>Added common install state primitives with strong typing (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/27">#27</a>).</li>
<li>Added documentation for Invoking Databricks Connect (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/28">#28</a>).</li>
<li>Added more documentation for Databricks CLI command router (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/30">#30</a>).</li>
<li>Enforced <code>pylint</code> standards (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/29">#29</a>).</li>
</ul>
<h2>0.1.0</h2>
<ul>
<li>Changed python requirement from 3.10.6 to 3.10 (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/25">#25</a>).</li>
</ul>
<h2>0.0.6</h2>
<ul>
<li>Make <code>find_project_root</code> more deterministic (<a
href="https://redirect.github.com/databrickslabs/blueprint/pull/23">#23</a>).</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
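The `as_dict()`/`from_dict()` escape hatch described in the 0.3.0 notes can be sketched as follows (a simplified `SomePolicy`, without blueprint's surrounding serialization machinery; the class name comes from the changelog, the rest is illustrative):

```python
# Minimal sketch of manual serialization for a class the generic dataclass
# serializer cannot handle: as_dict() emits a plain dict, and the
# @classmethod from_dict() rebuilds an instance from one.
class SomePolicy:
    def __init__(self, a: int, b: int):
        self._a = a
        self._b = b

    def as_dict(self) -> dict:
        # dictionary representation of the instance
        return {"a": self._a, "b": self._b}

    @classmethod
    def from_dict(cls, raw: dict) -> "SomePolicy":
        # rebuild an instance from the provided dictionary
        return cls(raw["a"], raw["b"])

    def __eq__(self, other) -> bool:
        return isinstance(other, SomePolicy) and self.as_dict() == other.as_dict()


policy = SomePolicy(1, 2)
assert SomePolicy.from_dict(policy.as_dict()) == policy
print(policy.as_dict())
```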
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/905e5ff5303a005d48bc98d101a613afeda15d51"><code>905e5ff</code></a>
Release v0.3.0 (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/59">#59</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/a029f6bb1ecf807017754e298ea685326dbedf72"><code>a029f6b</code></a>
Added brute-forcing <code>SerdeError</code> with <code>as_dict()</code>
and <code>from_dict()</code> (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/58">#58</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/c8a74f4129b4592d365aac9670eb86069f3517f7"><code>c8a74f4</code></a>
Added automated upgrade framework (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/50">#50</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/24e62ef4f060e43e02c92a7d082d95e8bc164317"><code>24e62ef</code></a>
Don't run integration tests on draft pull requests (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/55">#55</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/b4dd5abf4eaf8d022ae0b6ec7e659296ec3d2f37"><code>b4dd5ab</code></a>
Added tokei.rs badge (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/54">#54</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/01d9467f425763ab08035001270593253bce11f0"><code>01d9467</code></a>
Fixed nightly integration tests run as service principals (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/52">#52</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/aa5714179c65be8e13f54601e1d1fcd70548342d"><code>aa57141</code></a>
Made <code>test_existing_installations_are_detected</code> more
resilient (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/51">#51</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/9cbc6f863d3ea06659f37939cf1b97115dd873bd"><code>9cbc6f8</code></a>
Bump <code>databrickslabs/sandbox/acceptance</code> to v0.1.0 (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/48">#48</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/22fc1a8787b8e98de03048595202f88b7ddb9b94"><code>22fc1a8</code></a>
Use <code>databrickslabs/sandbox/acceptance</code> action (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/45">#45</a>)</li>
<li><a
href="https://github.com/databrickslabs/blueprint/commit/c7e47abd82b2f04e95b1d91f346cc1ea6df43961"><code>c7e47ab</code></a>
Release v0.2.5 (<a
href="https://redirect.github.com/databrickslabs/blueprint/issues/44">#44</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/databrickslabs/blueprint/compare/v0.2.4...v0.3.0">compare
view</a></li>
</ul>
</details>
<br />

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Run integration tests only for pull requests ready for review (#1002)

Tested on https://github.com/databrickslabs/blueprint

Reducing flakiness of create account groups (#1003)

Prompt user if Terraform utilised for deploying infrastructure (#1004)

Added the `is_terraform_used` prompt and stored the answer in the
`WorkspaceInstaller` config.

Resolves #393

---------

Co-authored-by: Serge Smertin <[email protected]>
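A minimal sketch of the new prompt, assuming the `is_terraform_used` field name from the commit message (the real installer uses `databricks-labs-blueprint` prompts; `ask_yes_no` and the trimmed `WorkspaceConfig` below are stand-ins):

```python
# Sketch: ask a yes/no question during install and persist the answer in the
# workspace config. `answer` stands in for interactive input in this sketch.
from dataclasses import dataclass


@dataclass
class WorkspaceConfig:  # simplified stand-in for the real UCX config
    inventory_database: str
    is_terraform_used: bool = False


def ask_yes_no(question: str, answer: str) -> bool:
    # treat "y"/"yes" (any case) as confirmation
    return answer.strip().lower() in ("y", "yes")


config = WorkspaceConfig(
    inventory_database="ucx",
    is_terraform_used=ask_yes_no(
        "Do you use Terraform to deploy your infrastructure?", "yes"
    ),
)
print(config.is_terraform_used)
```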

Update CONTRIBUTING.md (#1005)

Closes #850

Added `databricks labs ucx create-uber-principal` command to create Azure Service Principal for migration (#976)

 - Added a new CLI command, `create-uber-principal`, in `labs.yml` and `cli.py`
- Added a separate `AzureApiClient` class to isolate Azure API calls
- Added logic to create the SPN, secret, and role assignment in resources, and
to update the workspace config with the SPN `client_id`
- Added logic to create the SPN, update the RBAC of every storage account for
that SPN, and update the UCX cluster policy with the SPN secret for each
storage account
 - Added unit and integration test cases

Resolves #881

Related issues:
- #993
- #693

- [ ] added relevant user documentation
- [X] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

- [X] manually tested
- [X] added unit tests
- [X] added integration tests
- [ ] verified on staging environment (screenshot attached)
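The policy-update step above can be roughly illustrated with a hypothetical helper (the function name, the `ucx` secret scope, and the `uber_principal_secret` key are invented here; the `fs.azure.account.oauth2.*` key convention follows the standard ABFS driver, but the exact policy layout UCX writes may differ):

```python
# Sketch: extend a cluster policy definition with fixed OAuth Spark confs
# for each storage account, referencing the SPN secret via the
# {{secrets/<scope>/<key>}} syntax used in Databricks configurations.
import json


def add_spn_to_policy(policy: dict, client_id: str, secret_scope: str,
                      secret_key: str, storage_accounts: list) -> dict:
    for account in storage_accounts:
        suffix = f"{account}.dfs.core.windows.net"
        policy[f"spark_conf.fs.azure.account.oauth2.client.id.{suffix}"] = {
            "type": "fixed", "value": client_id}
        policy[f"spark_conf.fs.azure.account.oauth2.client.secret.{suffix}"] = {
            "type": "fixed",
            "value": "{{secrets/%s/%s}}" % (secret_scope, secret_key)}
    return policy


policy = add_spn_to_policy(
    {}, "00000000-appid", "ucx", "uber_principal_secret", ["stacct1"])
print(json.dumps(policy, indent=2))
```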

Fix gitguardian warning caused by "hello world" secret used in unit test (#1010)

Replaced the plain encoded string with one generated via `base64.b64encode`
to mitigate the GitGuardian warning.
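The fix can be sketched as follows; `hello world` is the fixture value named in the PR title:

```python
# Generating the encoded fixture at runtime keeps a literal secret-looking
# string out of the repository, which is what trips scanners such as
# GitGuardian.
import base64

encoded = base64.b64encode(b"hello world").decode("ascii")
print(encoded)  # aGVsbG8gd29ybGQ=
```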


Resolves #..

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`


- [ ] manually tested
- [ ] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

Create UC external locations in Azure based on migrated storage credentials (#992)

Handle widget delete on upgrade platform bug (#1011)
nkvuong added 3 commits March 8, 2024 19:29
# Conflicts:
#	labs.yml
#	src/databricks/labs/ucx/assessment/aws.py
#	src/databricks/labs/ucx/cli.py
@nkvuong nkvuong marked this pull request as ready for review March 8, 2024 22:37

@nfx nfx left a comment


Lgtm

nkvuong added 3 commits March 11, 2024 09:39
# Conflicts:
#	src/databricks/labs/ucx/assessment/aws.py
#	tests/integration/assessment/test_aws.py
#	tests/unit/assessment/test_aws.py
@nfx nfx merged commit 0474ef3 into main Mar 11, 2024
@nfx nfx deleted the feature/uber-iam-profile branch March 11, 2024 14:41
nfx added a commit that referenced this pull request Mar 15, 2024
* Added AWS IAM role support to `databricks labs ucx create-uber-principal` command ([#993](#993)). The `databricks labs ucx create-uber-principal` command now supports AWS Identity and Access Management (IAM) roles for external table migration. This new feature introduces a CLI command to create an `uber-IAM` profile, which checks for the UCX migration cluster policy and updates or adds the migration policy to provide access to the relevant table locations. If no IAM instance profile or role is specified in the cluster policy, a new one is created and the new migration policy is added. This change includes new methods and functions to handle AWS IAM roles, instance profiles, and related trust policies. Additionally, new unit and integration tests have been added and verified on the staging environment. The implementation also identifies all S3 buckets used by the Instance Profiles configured in the workspace.
* Added Dashboard widget to show the list of cluster policies along with DBR version ([#1013](#1013)). In this code revision, the `assessment` module of the 'databricks/labs/ucx' package has been updated to include a new `PoliciesCrawler` class, which fetches, assesses, and snapshots cluster policies. This class extends `CrawlerBase` and `CheckClusterMixin` and introduces the '_crawl', '_assess_policies', '_try_fetch', and `snapshot` methods. The `PolicyInfo` dataclass has been added to hold policy information, with a structure similar to the `ClusterInfo` dataclass. The `ClusterInfo` dataclass has been updated to include `spark_version` and `policy_id` attributes. A new table for policies has been added, and cluster policies along with the DBR version are loaded into this table. Relevant user documentation, tests, and a Dashboard widget have been added to support this feature. The `create` function in 'fixtures.py' has been updated to enable a Delta preview feature in Spark configurations, and a new SQL file has been included for querying cluster policies. Additionally, a new `crawl_cluster_policies` method has been added to scan and store cluster policies with matching configurations.
* Added `migration_status` table to capture a snapshot of migrated tables ([#1041](#1041)). A `migration_status` table has been added to track the status of migrated tables in the database, enabling improved management and tracking of migrations. The new `MigrationStatus` class, which is a dataclass that holds the source and destination schema, table, and updated timestamp, is added. The `TablesMigrate` class now has a new `_migration_status_refresher` attribute that is an instance of the new `MigrationStatusRefresher` class. This class crawls the `migration_status` table and returns a snapshot of the migration status, which is used to refresh the migration status and check if the table is upgraded. Additionally, the `_init_seen_tables` method is updated to get the seen tables from the `_migration_status_refresher` instead of fetching from the table properties. The `MigrationStatusRefresher` class fetches the migration status table and returns a snapshot of the migration status. This change also adds new test functions in the test file for the Hive metastore, which covers various scenarios such as migrating managed tables with and without caching, migrating external tables, and reverting migrated tables.
* Added a check for existing inventory database to avoid losing existing, inject installation objects in tests and try fetching existing installation before setting global as default ([#1043](#1043)). In this release, we have added a new method, `_check_inventory_database_exists`, to the `WorkspaceInstallation` class, which checks if an inventory database with a given name already exists in the Workspace. This prevents accidental overwriting of existing data and improves the robustness of handling inventory databases. The `validate_and_run` method has been updated to call `app.current_installation(workspace_client)`, allowing for a more flexible handling of installations. The `Installation` class import has been updated to include `SerdeError`, and the test suite has been updated to inject installation objects and check for existing installations before setting the global installation as default. A new argument `inventory_schema_suffix` has been added to the `factory` method for customization of the inventory schema name. We have also added a new method `check_inventory_database_exists` to the `WorkspaceInstaller` class, which checks if an inventory database already exists for a given installation type and raises an `AlreadyExists` error if it does. The behavior of the `download` method in the `WorkspaceClient` class has been mocked, and the `get_status` method has been updated to return `NotFound` in certain tests. These changes aim to improve the robustness, flexibility, and safety of the installation process in the Workspace.
* Added a check for external metastore in SQL warehouse configuration ([#1046](#1046)). In this release, we have added new functionality to the Unity Catalog (UCX) installation process to enable checking for and connecting to an external Hive metastore configuration. A new method, `_get_warehouse_config_with_external_hive_metastore`, has been introduced to retrieve the workspace warehouse config and identify if it is set up for an external Hive metastore. If so, and the user confirms the prompt, UCX will be configured to connect to the external metastore. Additionally, new methods `_extract_external_hive_metastore_sql_conf` and `test_cluster_policy_definition_<cloud_provider>_hms_warehouse()` have been added to handle the external metastore configuration for Azure, AWS, and GCP, and to handle the case when the data_access_config is empty. These changes provide more flexibility and ease of use when installing UCX with external Hive metastore configurations. The new imports `EndpointConfPair`, `GetWorkspaceWarehouseConfigResponse` from the `databricks.sdk.service.sql` package are used to handle the endpoint configuration of the SQL warehouse.
* Added integration tests for AWS - create locations ([#1026](#1026)). In this release, we have added comprehensive integration tests for AWS resources and their management in the `tests/unit/assessment/test_aws.py` file. The `AWSResources` class has been updated with new methods (AwsIamRole, add_uc_role, add_uc_role_policy, and validate_connection) and the regular expression for matching S3 resource ARN has been modified. The `create_external_locations` method now allows for creating external locations without validating them, and the `_identify_missing_external_locations` function has been enhanced to match roles with a wildcard pattern. The new tests include validating the integration of AWS services with the system, testing the CLI's behavior when it is missing, and introducing new configuration scenarios with the addition of a Key Management Service (KMS) key during the creation of IAM roles and policies. These changes improve the robustness and reliability of AWS resource integration and handling in our system.
* Bump Databricks SDK to v0.22.0 ([#1059](#1059)). In this release, we are bumping the Databricks SDK version to 0.22.0 and upgrading the `databricks-labs-lsql` package to ~0.2.2. The new dependencies for this release include `databricks-sdk==0.22.0`, `databricks-labs-lsql~=0.2.2`, `databricks-labs-blueprint~=0.4.3`, and `PyYAML>=6.0.0,<7.0.0`. In the `fixtures.py` file, we have added `PermissionLevel.CAN_QUERY` to the `CAN_VIEW` and `CAN_MANAGE` permissions in the `_path` function, allowing users to query the endpoint. Additionally, we have updated the `test_endpoints` function in the `test_generic.py` file as part of the integration tests for workspace access. This change updates the permission level for creating a serving endpoint from `CAN_MANAGE` to `CAN_QUERY`, meaning that the assigned group can now only query the endpoint. We have also included the `test_feature_tables` function in the commit, which tests the behavior of feature tables in the Databricks workspace. This change only affects the `test_endpoints` function and its assert statements, and does not impact the functionality of the `test_feature_tables` function.
* Changed default UCX installation folder to `/Applications/ucx` from `/Users/<me>/.ucx` to allow multiple users to utilise the same installation ([#854](#854)). In this release, we've added a new advanced feature that allows users to force the installation of UCX over an existing installation using the `UCX_FORCE_INSTALL` environment variable. This variable can take one of two values, `global` or `user`, providing more control and flexibility in installing UCX. The default UCX installation folder has been changed to `/Applications/ucx` from `/Users/<me>/.ucx` to enable multiple users to utilize the same installation. A table detailing the expected install location, `install_folder`, and mode for each combination of global and user values has been added to the README file. We've also added user prompts to confirm the installation if UCX is already installed and the `UCX_FORCE_INSTALL` variable is set to `user`. This feature is useful when users want to install UCX in a specific location or force the installation over an existing installation. However, it is recommended to use this feature with caution, as it can potentially break existing installations if not used correctly. Additionally, several changes to the implementation of the UCX installation process have been made, as well as new tests to ensure that the installation process works correctly in various scenarios.
* Fix: Recover lost fix for `webbrowser.open` mock ([#1052](#1052)). A fix has been implemented to address an issue related to the mock for `webbrowser.open` in the tests `test_repair_run` and `test_get_existing_installation_global`. This change prevents the `webbrowser.open` function from being called during these tests, which helps improve test stability and consistency. No new methods have been added, and the existing functionality of these tests has only been modified to include the `webbrowser.open` mock. This modification aims to enhance the reliability and predictability of these specific tests, ensuring accurate and consistent results.
* Improved table migrations logic ([#1050](#1050)). This change introduces improvements to table migrations logic by refactoring unit tests to load table mappings from JSON instead of inline structs, adding an `escape_sql_identifier` function where missing, and preparing for ACLs migration. The `uc_grant_sql` method in `grants.py` has been updated to accept optional `object_type` and `object_key` parameters, and the hive-to-UC mapping has been expanded to include mappings for views. Additionally, new JSON files for external source table configuration have been added, and new functions have been introduced for loading fixture data from JSON files and creating mocked `WorkspaceClient` and `TableMapping` objects for testing. The changes improve the maintainability and security of the codebase, prepare it for future migration tasks, and ensure that the code is more adaptable and robust. The changes have been manually tested and verified on the staging environment.
* Moved `SqlBackend` implementation to `databricks-labs-lsql` dependency ([#1042](#1042)). In this change, the `SqlBackend` implementation, including classes such as `StatementExecutionBackend` and `RuntimeBackend`, has been moved to a separate library, `databricks-labs-lsql`, which is managed at <https://github.com/databrickslabs/lsql>. This refactoring simplifies the current repository, promotes code reuse, and improves modularity by leveraging an external dependency. The modification includes adding a new line in the .gitignore file to exclude `*.out` files from version control.
* Prepare for a PyPI release ([#1038](#1038)). In preparation for a PyPI release, this change introduces a new GitHub Actions workflow that automates the package release process and ensures the integrity of the released packages by signing them with Sigstore. When a new git tag starting with `v` is pushed, this workflow is triggered, building wheels using hatch, drafting a new GitHub release, publishing the package distributions to PyPI, and signing the artifacts with Sigstore. The `pyproject.toml` file is now used for metadata, replacing `setup.cfg` and `setup.py`, and is cached to improve build performance. In addition, the `pyproject.toml` file has been updated with recent metadata in preparation for the release, including updates to the package's authors, development status, classifiers, and dependencies.
* Prevent fragile `mock.patch('databricks...')` in the test code ([#1037](#1037)). This change introduces a custom `pylint` checker to improve code flexibility and maintainability by preventing fragile `mock.patch` designs in test code. The new checker discourages the use of `MagicMock` and encourages the use of `create_autospec` to ensure that mocks have the same attributes and methods as the original class. This change has been implemented in multiple test files, including `test_cli.py`, `test_locations.py`, `test_mapping.py`, `test_table_migrate.py`, `test_table_move.py`, `test_workspace_access.py`, `test_redash.py`, `test_scim.py`, and `test_verification.py`, to improve the robustness and maintainability of the test code. Additionally, the commit removes the `verification.py` file, which contained a `VerificationManager` class for verifying applied permissions, scope ACLs, roles, and entitlements for various objects in a Databricks workspace.
* Removed `mocker.patch("databricks...)` from `test_cli` ([#1047](#1047)). In this release, we have made significant updates to the library's handling of Azure and AWS workspaces. We have added new parameters `azure_resource_permissions` and `aws_permissions` to the `_execute_for_cloud` function in `cli.py`, which are passed to the `func_azure` and `func_aws` functions respectively. The `create_uber_principal` and `principal_prefix_access` commands have also been updated to include these new parameters. Additionally, the `_azure_setup_uber_principal` and `_aws_setup_uber_principal` functions have been updated to accept the new `azure_resource_permissions` and `aws_resource_permissions` parameters. The `_azure_principal_prefix_access` and `_aws_principal_prefix_access` functions have also been updated similarly. We have also introduced a new `aws_resources` parameter in the `migrate_credentials` command, which is used to migrate Azure Service Principals in ADLS Gen2 locations to UC storage credentials. In terms of testing, we have replaced the `mocker.patch` calls with the creation of `AzureResourcePermissions` and `AWSResourcePermissions` objects, improving the code's readability and maintainability. Overall, these changes significantly enhance the library's functionality and maintainability in handling Azure and AWS workspaces.
* Require Hatch v1.9.4 on build machines ([#1049](#1049)). In this release, we have updated the Hatch package version to 1.9.4 on build machines, addressing issue [#1049](#1049). The changes include updating the toolchain dependencies and setup in the `.codegen.json` file, which simplifies the setup process and now relies on a pre-existing Hatch environment and Python 3. The acceptance workflow has also been updated to use the latest version of Hatch and the `databrickslabs/sandbox/acceptance` GitHub action version `v0.1.4`. Hatch is a Python package manager that simplifies package development and management, and this update provides new features and bug fixes that can help improve the reliability and performance of the acceptance workflow. This change requires version 1.9.4 of the Hatch package on build machines, and it will affect the build process for the project but will not have any impact on the functionality of the project itself. As a software engineer adopting this project, it's important to note this change to ensure that the build process runs smoothly and takes advantage of any new features or improvements in Hatch 1.9.4.
* Set acceptance tests to timeout after 45 minutes ([#1036](#1036)). As part of issue [#1036](#1036), the acceptance tests in this open-source library now have a 45-minute timeout configured, improving the reliability and stability of the testing environment. This change has been implemented in the `.github/workflows/acceptance.yml` file by adding the `timeout` parameter to the step where the `databrickslabs/sandbox/acceptance` action is called. This ensures that the acceptance tests will not run indefinitely and prevents any potential issues caused by long-running tests. By adopting this project, software engineers can now benefit from a more stable and reliable testing environment, with acceptance tests that are guaranteed to complete within a maximum of 45 minutes.
* Updated databricks-labs-blueprint requirement from ~0.4.1 to ~0.4.3 ([#1058](#1058)). In this release, the version requirement for the `databricks-labs-blueprint` library has been updated from ~0.4.1 to ~0.4.3 in the pyproject.toml file. This change is necessary to support issues [#1056](#1056) and [#1057](#1057). The code has been manually tested and is ready for further testing to ensure the compatibility and smooth functioning of the software. It is essential to thoroughly test the latest version of the `databricks-labs-blueprint` library with the existing codebase before deploying it to production. This includes running a comprehensive suite of tests such as unit tests, integration tests, and verification on the staging environment. This modification allows the software to use the latest version of the library, improving its functionality and overall performance.
* Use `MockPrompts.extend()` functionality in test_install to supply multiple prompts ([#1057](#1057)). This diff introduces the `MockPrompts.extend()` functionality in the `test_install` module to enable the supplying of multiple prompts for testing purposes. A new `base_prompts` dictionary with default prompts has been added and is extended with additional prompts for specific test cases. This allows for the testing of various scenarios, such as when UCX is already installed on the workspace and the user is prompted to choose between global or user installation. Additionally, new `force_user_environ` and `force_global_env` dictionaries have been added to simulate different installation environments. The functionality of the `WorkspaceInstaller` class and mocking of `webbrowser.open` are also utilized in the test cases. These changes aim to ensure the proper functioning of the configuration process for different installation scenarios.
@nfx nfx mentioned this pull request Mar 15, 2024
nfx added a commit that referenced this pull request Mar 15, 2024
* Added AWS IAM role support to `databricks labs ucx
create-uber-principal` command
([#993](#993)). The
`databricks labs ucx create-uber-principal` command now supports AWS
Identity and Access Management (IAM) roles for external table migration.
This new feature introduces a CLI command to create an `uber-IAM`
profile, which checks for the UCX migration cluster policy and updates
or adds the migration policy to provide access to the relevant table
locations. If no IAM instance profile or role is specified in the
cluster policy, a new one is created and the new migration policy is
added. This change includes new methods and functions to handle AWS IAM
roles, instance profiles, and related trust policies. Additionally, new
unit and integration tests have been added and verified on the staging
environment. The implementation also identifies all S3 buckets used by
the Instance Profiles configured in the workspace.
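The decision flow described above (stop without a policy, reuse an existing instance profile, otherwise create one, then grant access to every table location) can be sketched as a pure function. The `ClusterPolicy` shape and the ARN below are hypothetical stand-ins, not the real UCX types:

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ClusterPolicy:
    """Hypothetical stand-in for the UCX migration cluster policy."""
    instance_profile_arn: Optional[str] = None
    allowed_paths: Set[str] = field(default_factory=set)


def apply_uber_principal(policy: Optional[ClusterPolicy], table_locations: Set[str]) -> ClusterPolicy:
    # Stop if the UCX migration cluster policy is not found
    if policy is None:
        raise ValueError("UCX migration cluster policy not found")
    if policy.instance_profile_arn is None:
        # The real command creates a new IAM role/instance profile via AWS APIs;
        # this ARN is a placeholder for illustration only.
        policy.instance_profile_arn = "arn:aws:iam::123456789012:instance-profile/ucx-uber"
    # Update the migration policy to provide access to all table locations
    policy.allowed_paths |= table_locations
    return policy
```

An existing instance profile is left untouched; only the location grants are extended.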
* Added Dashboard widget to show the list of cluster policies along with
DBR version
([#1013](#1013)). In this
code revision, the `assessment` module of the `databricks/labs/ucx`
package has been updated to include a new `PoliciesCrawler` class, which
fetches, assesses, and snapshots cluster policies. This class extends
`CrawlerBase` and `CheckClusterMixin` and introduces the `_crawl`,
`_assess_policies`, `_try_fetch`, and `snapshot` methods. The
`PolicyInfo` dataclass has been added to hold policy information, with a
structure similar to the `ClusterInfo` dataclass. The `ClusterInfo`
dataclass has been updated to include `spark_version` and `policy_id`
attributes. A new table for policies has been added, and cluster
policies along with the DBR version are loaded into this table. Relevant
user documentation, tests, and a Dashboard widget have been added to
support this feature. The `create` function in `fixtures.py` has been
updated to enable a Delta preview feature in Spark configurations, and a
new SQL file has been included for querying cluster policies.
Additionally, a new `crawl_cluster_policies` method has been added to
scan and store cluster policies with matching configurations.
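A minimal sketch of the crawl-and-snapshot idea: a `PolicyInfo`-like dataclass filled from raw policy listings. The field names here are guessed from the description (the real dataclass lives in the `assessment` module and mirrors `ClusterInfo`):

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PolicyInfo:
    # Guessed fields; the real PolicyInfo may differ
    policy_id: str
    policy_name: str
    spark_version: Optional[str]


def snapshot_policies(raw: Iterable[dict]) -> List[PolicyInfo]:
    """Toy version of the _crawl step: map raw policy dicts to PolicyInfo rows."""
    return [
        PolicyInfo(
            policy_id=p["policy_id"],
            policy_name=p.get("name", ""),
            spark_version=p.get("spark_version"),
        )
        for p in raw
    ]
```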
* Added `migration_status` table to capture a snapshot of migrated
tables ([#1041](#1041)). A
`migration_status` table has been added to track the status of migrated
tables in the database, enabling improved management and tracking of
migrations. The new `MigrationStatus` class, which is a dataclass that
holds the source and destination schema, table, and updated timestamp,
is added. The `TablesMigrate` class now has a new
`_migration_status_refresher` attribute that is an instance of the new
`MigrationStatusRefresher` class. This class crawls the
`migration_status` table and returns a snapshot of the migration status,
which is used to refresh the migration status and check if the table is
upgraded. Additionally, the `_init_seen_tables` method is updated to get
the seen tables from the `_migration_status_refresher` instead of
fetching from the table properties. The `MigrationStatusRefresher` class
fetches the migration status table and returns a snapshot of the
migration status. This change also adds new test functions in the test
file for the Hive metastore, which covers various scenarios such as
migrating managed tables with and without caching, migrating external
tables, and reverting migrated tables.
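The snapshot-based upgrade check can be illustrated with a small dataclass holding the source/destination coordinates and timestamp, as the bullet describes. Field names are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MigrationStatus:
    # Assumed field names; the real dataclass holds source and destination
    # schema/table plus an updated timestamp
    src_schema: str
    src_table: str
    dst_catalog: Optional[str] = None
    dst_schema: Optional[str] = None
    dst_table: Optional[str] = None
    update_ts: Optional[str] = None


def is_upgraded(snapshot: List[MigrationStatus], schema: str, table: str) -> bool:
    """Consult the snapshot (instead of table properties) to decide whether
    a table has already been migrated."""
    return any(
        s.src_schema == schema and s.src_table == table and s.dst_table is not None
        for s in snapshot
    )
```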
* Added a check for existing inventory database to avoid losing
existing, inject installation objects in tests and try fetching existing
installation before setting global as default
([#1043](#1043)). In this
release, we have added a new method, `_check_inventory_database_exists`,
to the `WorkspaceInstallation` class, which checks if an inventory
database with a given name already exists in the Workspace. This
prevents accidental overwriting of existing data and improves the
robustness of handling inventory databases. The `validate_and_run`
method has been updated to call
`app.current_installation(workspace_client)`, allowing for a more
flexible handling of installations. The `Installation` class import has
been updated to include `SerdeError`, and the test suite has been
updated to inject installation objects and check for existing
installations before setting the global installation as default. A new
argument `inventory_schema_suffix` has been added to the `factory`
method for customization of the inventory schema name. We have also
added a new method `check_inventory_database_exists` to the
`WorkspaceInstaller` class, which checks if an inventory database
already exists for a given installation type and raises an
`AlreadyExists` error if it does. The behavior of the `download` method
in the `WorkspaceClient` class has been mocked, and the `get_status`
method has been updated to return `NotFound` in certain tests. These
changes aim to improve the robustness, flexibility, and safety of the
installation process in the Workspace.
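The existence check boils down to refusing to proceed when the inventory database is already taken. A sketch with a plain set standing in for the workspace lookup (the real check queries the workspace and raises the SDK's `AlreadyExists`):

```python
class AlreadyExists(Exception):
    """Stand-in for the SDK error raised when the inventory database exists."""


def check_inventory_database_exists(existing_schemas: set, inventory_database: str) -> None:
    # The real method inspects the workspace; a set of known schema names
    # stands in for that lookup here.
    if inventory_database in existing_schemas:
        raise AlreadyExists(
            f"Inventory database '{inventory_database}' already exists in another installation"
        )
```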
* Added a check for external metastore in SQL warehouse configuration
([#1046](#1046)). In this
release, we have added new functionality to the Unity Catalog (UCX)
installation process to enable checking for and connecting to an
external Hive metastore configuration. A new method,
`_get_warehouse_config_with_external_hive_metastore`, has been
introduced to retrieve the workspace warehouse config and identify if it
is set up for an external Hive metastore. If so, and the user confirms
the prompt, UCX will be configured to connect to the external metastore.
Additionally, new methods `_extract_external_hive_metastore_sql_conf`
and `test_cluster_policy_definition_<cloud_provider>_hms_warehouse()`
have been added to handle the external metastore configuration for
Azure, AWS, and GCP, and to handle the case when the `data_access_config`
is empty. These changes provide more flexibility and ease of use when
installing UCX with external Hive metastore configurations. The new
imports `EndpointConfPair`, `GetWorkspaceWarehouseConfigResponse` from
the `databricks.sdk.service.sql` package are used to handle the endpoint
configuration of the SQL warehouse.
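Detecting an external metastore amounts to filtering the warehouse's data-access configuration for Hive-metastore-related Spark conf keys. The prefixes below are typical for external-HMS setups but are an assumption, not the exact list UCX uses:

```python
from typing import Dict, List, Tuple


def extract_external_hms_conf(data_access_config: List[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only spark conf entries that point at an external Hive metastore.
    Key prefixes are guessed from common external-HMS configurations."""
    prefixes = (
        "spark.sql.hive.metastore",
        "spark.hadoop.javax.jdo.option",
        "spark.databricks.hive.metastore",
    )
    return {k: v for k, v in data_access_config if k.startswith(prefixes)}
```

An empty result corresponds to the "data_access_config is empty / no external metastore" case the bullet mentions.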
* Added integration tests for AWS - create locations
([#1026](#1026)). In this
release, we have added comprehensive integration tests for AWS resources
and their management in the `tests/unit/assessment/test_aws.py` file.
The `AWSResources` class has been updated with new methods (`AwsIamRole`,
`add_uc_role`, `add_uc_role_policy`, and `validate_connection`) and the
regular expression for matching S3 resource ARN has been modified. The
`create_external_locations` method now allows for creating external
locations without validating them, and the
`_identify_missing_external_locations` function has been enhanced to
match roles with a wildcard pattern. The new tests include validating
the integration of AWS services with the system, testing the CLI's
behavior when it is missing, and introducing new configuration scenarios
with the addition of a Key Management Service (KMS) key during the
creation of IAM roles and policies. These changes improve the robustness
and reliability of AWS resource integration and handling in our system.
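The bullet mentions a modified regular expression for matching S3 resource ARNs. One plausible pattern is shown below; the real expression lives in `AWSResources` and may differ:

```python
import re

# Assumed pattern: captures the bucket name from an S3 resource ARN,
# with an optional key/prefix suffix (possibly containing wildcards).
S3_ARN = re.compile(r"arn:aws:s3:::([a-zA-Z0-9.\-_]+)(/.*)?$")


def bucket_of(arn: str):
    """Return the bucket name from an S3 ARN, or None for non-S3 ARNs."""
    m = S3_ARN.match(arn)
    return m.group(1) if m else None
```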
* Bump Databricks SDK to v0.22.0
([#1059](#1059)). In this
release, we are bumping the Databricks SDK version to 0.22.0 and
upgrading the `databricks-labs-lsql` package to ~0.2.2. The new
dependencies for this release include `databricks-sdk==0.22.0`,
`databricks-labs-lsql~=0.2.2`, `databricks-labs-blueprint~=0.4.3`, and
`PyYAML>=6.0.0,<7.0.0`. In the `fixtures.py` file, we have added
`PermissionLevel.CAN_QUERY` to the `CAN_VIEW` and `CAN_MANAGE`
permissions in the `_path` function, allowing users to query the
endpoint. Additionally, we have updated the `test_endpoints` function in
the `test_generic.py` file as part of the integration tests for
workspace access. This change updates the permission level for creating
a serving endpoint from `CAN_MANAGE` to `CAN_QUERY`, meaning that the
assigned group can now only query the endpoint. We have also included
the `test_feature_tables` function in the commit, which tests the
behavior of feature tables in the Databricks workspace. This change only
affects the `test_endpoints` function and its assert statements, and
does not impact the functionality of the `test_feature_tables` function.
* Changed default UCX installation folder to `/Applications/ucx` from
`/Users/<me>/.ucx` to allow multiple users utilising the same
installation ([#854](#854)).
In this release, we've added a new advanced feature that allows users to
force the installation of UCX over an existing installation using the
`UCX_FORCE_INSTALL` environment variable. This variable can take two
values `global` and `user`, providing more control and flexibility in
installing UCX. The default UCX installation folder has been changed to
`/Applications/ucx` from `/Users/<me>/.ucx` to enable multiple users to
utilize the same installation. A table detailing the expected install
location, `install_folder`, and mode for each combination of global and
user values has been added to the README file. We've also added user
prompts to confirm the installation if UCX is already installed and the
`UCX_FORCE_INSTALL` variable is set to `user`. This feature is useful
when users want to install UCX in a specific location or force the
installation over an existing installation. However, it is recommended
to use this feature with caution, as it can potentially break existing
installations if not used correctly. Additionally, several changes to
the implementation of the UCX installation process have been made, as
well as new tests to ensure that the installation process works
correctly in various scenarios.
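The folder-resolution rule described above can be sketched as a small helper; the function name and `env` parameter are illustrative, not the installer's actual API:

```python
from typing import Dict


def install_folder(username: str, env: Dict[str, str]) -> str:
    """Resolve the target folder as the docs describe: user-forced installs
    land under the user's home folder, everything else in the shared location."""
    force = env.get("UCX_FORCE_INSTALL")
    if force == "user":
        return f"/Users/{username}/.ucx"
    # default (and UCX_FORCE_INSTALL=global) use the shared location
    return "/Applications/ucx"
```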
* Fix: Recover lost fix for `webbrowser.open` mock
([#1052](#1052)). A fix has
been implemented to address an issue related to the mock for
`webbrowser.open` in the tests `test_repair_run` and
`test_get_existing_installation_global`. This change prevents the
`webbrowser.open` function from being called during these tests, which
helps improve test stability and consistency. No new methods have been
added, and the existing functionality of these tests has only been
modified to include the `webbrowser.open` mock. This modification aims
to enhance the reliability and predictability of these specific tests,
ensuring accurate and consistent results.
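A minimal sketch of the `webbrowser.open` mock: replacing the stdlib function with an autospec'd mock so the test exercises the code path without launching a browser. `open_docs` and the URL are hypothetical examples of code under test:

```python
import webbrowser
from unittest.mock import create_autospec, patch


def open_docs() -> bool:
    # hypothetical code under test that would normally launch a browser
    webbrowser.open("https://example.com/docs")
    return True


# Replace webbrowser.open for the duration of the block; create_autospec
# keeps the mock's signature in sync with the real function.
with patch.object(webbrowser, "open", create_autospec(webbrowser.open)) as fake_open:
    result = open_docs()

calls_made = fake_open.call_count
```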
* Improved table migrations logic
([#1050](#1050)). This
change introduces improvements to table migrations logic by refactoring
unit tests to load table mappings from JSON instead of inline structs,
adding an `escape_sql_identifier` function where missing, and preparing
for ACLs migration. The `uc_grant_sql` method in `grants.py` has been
updated to accept optional `object_type` and `object_key` parameters,
and the hive-to-UC mapping has been expanded to include mappings for
views. Additionally, new JSON files for external source table
configuration have been added, and new functions have been introduced
for loading fixture data from JSON files and creating mocked
`WorkspaceClient` and `TableMapping` objects for testing. The changes
improve the maintainability and security of the codebase, prepare it for
future migration tasks, and ensure that the code is more adaptable and
robust. The changes have been manually tested and verified on the
staging environment.
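The `escape_sql_identifier` idea can be sketched as backtick-quoting each dot-separated part, a common Spark SQL convention; the real helper's behavior (e.g. quoting only when needed) may differ:

```python
def escape_sql_identifier(name: str) -> str:
    """Quote each dot-separated part with backticks, doubling any embedded
    backticks. A sketch of the convention, not the exact UCX implementation."""
    parts = name.split(".")
    return ".".join("`" + p.replace("`", "``") + "`" for p in parts)
```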
* Moved `SqlBackend` implementation to `databricks-labs-lsql` dependency
([#1042](#1042)). In this
change, the `SqlBackend` implementation, including classes such as
`StatementExecutionBackend` and `RuntimeBackend`, has been moved to a
separate library, `databricks-labs-lsql`, which is managed at
<https://github.com/databrickslabs/lsql>. This refactoring simplifies
the current repository, promotes code reuse, and improves modularity by
leveraging an external dependency. The modification includes adding a
new line in the .gitignore file to exclude `*.out` files from version
control.
* Prepare for a PyPI release
([#1038](#1038)). In
preparation for a PyPI release, this change introduces a new GitHub
Actions workflow that automates the package release process and ensures
the integrity of the released packages by signing them with Sigstore.
When a new git tag starting with `v` is pushed, this workflow is
triggered, building wheels using hatch, drafting a new GitHub release,
publishing the package distributions to PyPI, and signing the artifacts
with Sigstore. The `pyproject.toml` file is now used for metadata,
replacing `setup.cfg` and `setup.py`, and is cached to improve build
performance. In addition, the `pyproject.toml` file has been updated
with recent metadata in preparation for the release, including updates
to the package's authors, development status, classifiers, and
dependencies.
* Prevent fragile `mock.patch('databricks...')` in the test code
([#1037](#1037)). This
change introduces a custom `pylint` checker to improve code flexibility
and maintainability by preventing fragile `mock.patch` designs in test
code. The new checker discourages the use of `MagicMock` and encourages
the use of `create_autospec` to ensure that mocks have the same
attributes and methods as the original class. This change has been
implemented in multiple test files, including `test_cli.py`,
`test_locations.py`, `test_mapping.py`, `test_table_migrate.py`,
`test_table_move.py`, `test_workspace_access.py`, `test_redash.py`,
`test_scim.py`, and `test_verification.py`, to improve the robustness
and maintainability of the test code. Additionally, the commit removes
the `verification.py` file, which contained a `VerificationManager`
class for verifying applied permissions, scope ACLs, roles, and
entitlements for various objects in a Databricks workspace.
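The difference the checker enforces can be shown in a few lines: a `MagicMock` silently accepts any attribute, while `create_autospec` rejects calls that don't exist on the real class. `ExternalLocations` here is a simplified stand-in:

```python
from unittest.mock import MagicMock, create_autospec


class ExternalLocations:
    """Simplified stand-in for a class under test."""

    def snapshot(self) -> list:
        return []


loose = MagicMock()
loose.snapsht()  # typo goes unnoticed: MagicMock invents any attribute

strict = create_autospec(ExternalLocations)
strict.snapshot()  # fine: the method exists on the spec
typo_rejected = False
try:
    strict.snapsht()  # rejected: not an attribute of ExternalLocations
except AttributeError:
    typo_rejected = True
```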
* Removed `mocker.patch("databricks...)` from `test_cli`
([#1047](#1047)). In this
release, we have made significant updates to the library's handling of
Azure and AWS workspaces. We have added new parameters
`azure_resource_permissions` and `aws_permissions` to the
`_execute_for_cloud` function in `cli.py`, which are passed to the
`func_azure` and `func_aws` functions respectively. The
`create_uber_principal` and `principal_prefix_access` commands have also
been updated to include these new parameters. Additionally, the
`_azure_setup_uber_principal` and `_aws_setup_uber_principal` functions
have been updated to accept the new `azure_resource_permissions` and
`aws_resource_permissions` parameters. The
`_azure_principal_prefix_access` and `_aws_principal_prefix_access`
functions have also been updated similarly. We have also introduced a
new `aws_resources` parameter in the `migrate_credentials` command,
which is used to migrate Azure Service Principals in ADLS Gen2 locations
to UC storage credentials. In terms of testing, we have replaced the
`mocker.patch` calls with the creation of `AzureResourcePermissions` and
`AWSResourcePermissions` objects, improving the code's readability and
maintainability. Overall, these changes significantly enhance the
library's functionality and maintainability in handling Azure and AWS
workspaces.
* Require Hatch v1.9.4 on build machines
([#1049](#1049)). In this
release, we have updated the Hatch package version to 1.9.4 on build
machines, addressing issue
[#1049](#1049). The changes
include updating the toolchain dependencies and setup in the
`.codegen.json` file, which simplifies the setup process and now relies
on a pre-existing Hatch environment and Python 3. The acceptance
workflow has also been updated to use the latest version of Hatch and
the `databrickslabs/sandbox/acceptance` GitHub action version `v0.1.4`.
Hatch is a Python package manager that simplifies package development
and management, and this update provides new features and bug fixes that
can help improve the reliability and performance of the acceptance
workflow. This change requires version 1.9.4 of the Hatch package on
build machines, and it will affect the build process for the project but
will not have any impact on the functionality of the project itself. As
a software engineer adopting this project, it's important to note this
change to ensure that the build process runs smoothly and takes
advantage of any new features or improvements in Hatch 1.9.4.
* Set acceptance tests to timeout after 45 minutes
([#1036](#1036)). As part of
issue [#1036](#1036), the
acceptance tests in this open-source library now have a 45-minute
timeout configured, improving the reliability and stability of the
testing environment. This change has been implemented in the
`.github/workflows/acceptance.yml` file by adding the `timeout`
parameter to the step where the `databrickslabs/sandbox/acceptance`
action is called. This ensures that the acceptance tests will not run
indefinitely and prevents any potential issues caused by long-running
tests. By adopting this project, software engineers can now benefit from
a more stable and reliable testing environment, with acceptance tests
that are guaranteed to complete within a maximum of 45 minutes.
* Updated databricks-labs-blueprint requirement from ~0.4.1 to ~0.4.3
([#1058](#1058)). In this
release, the version requirement for the `databricks-labs-blueprint`
library has been updated from ~0.4.1 to ~0.4.3 in the pyproject.toml
file. This change is necessary to support issues
[#1056](#1056) and
[#1057](#1057). The code has
been manually tested and is ready for further testing to ensure the
compatibility and smooth functioning of the software. It is essential to
thoroughly test the latest version of the `databricks-labs-blueprint`
library with the existing codebase before deploying it to production.
This includes running a comprehensive suite of tests such as unit tests,
integration tests, and verification on the staging environment. This
modification allows the software to use the latest version of the
library, improving its functionality and overall performance.
* Use `MockPrompts.extend()` functionality in test_install to supply
multiple prompts
([#1057](#1057)). This diff
introduces the `MockPrompts.extend()` functionality in the
`test_install` module to enable the supplying of multiple prompts for
testing purposes. A new `base_prompts` dictionary with default prompts
has been added and is extended with additional prompts for specific test
cases. This allows for the testing of various scenarios, such as when
UCX is already installed on the workspace and the user is prompted to
choose between global or user installation. Additionally, new
`force_user_environ` and `force_global_env` dictionaries have been added
to simulate different installation environments. The functionality of
the `WorkspaceInstaller` class and mocking of `webbrowser.open` are also
utilized in the test cases. These changes aim to ensure the proper
functioning of the configuration process for different installation
scenarios.
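The base-plus-extension pattern can be mirrored with plain dicts. `MockPrompts` itself comes from `databricks-labs-blueprint`; this sketch only illustrates the `extend()` idea with assumed prompt patterns and answers:

```python
from typing import Dict

# Assumed base prompt-pattern -> answer mapping shared by the tests
base_prompts: Dict[str, str] = {
    r".*Inventory Database.*": "ucx",
    r".*Log level.*": "INFO",
}


def extend(base: Dict[str, str], extra: Dict[str, str]) -> Dict[str, str]:
    """Return a new mapping where scenario-specific answers win over defaults,
    leaving the base mapping untouched."""
    merged = dict(base)
    merged.update(extra)
    return merged


# Scenario: UCX already installed, user chooses a per-user installation
user_install = extend(base_prompts, {r".*global or user.*": "user"})
```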
dmoore247 pushed a commit that referenced this pull request Mar 23, 2024
…zure Service Principal for migration (#976)

## Changes
- Added new CLI command `create-master-principal` in `labs.yml` and `cli.py`
- Added a separate `AzureApiClient` class to separate out Azure API calls
- Added logic to create the SPN, secret, and role assignment in resources, and
to update the workspace config with the SPN `client_id`
- Added logic to create the SPN, update the RBAC of all storage accounts to
that SPN, and update the UCX cluster policy with the SPN secret for each
storage account
- Added unit and integration test cases

Resolves #881 

Related issues: 
- #993
- #693

### Functionality 

- [ ] added relevant user documentation
- [X] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [X] manually tested
- [X] added unit tests
- [X] added integration tests
- [ ] verified on staging environment (screenshot attached)
dmoore247 pushed a commit that referenced this pull request Mar 23, 2024
…pal` command (#993)

## Changes
Added CLI command `databricks labs ucx create-uber-principal` for
creating uber-IAM profile for performing external table migration on
AWS.

Logic:
* Stop if UCX migration cluster policy is not found
* Collect paths of all locations/paths used in tables (call
`external_location.snapshot`)
* If cluster policy has an existing iam instance profile/role specified,
then add/update migration policy providing access to the locations
* If cluster policy does not have iam instance profile/role specified,
then create new iam profile/role and migration policy, and add it to the
cluster policy

### Linked issues

Resolves #879 

Related issues:
- #976
- #693

### Functionality 

- [x] added new CLI command

### Tests

- [x] manually tested
- [x] added unit tests

### TODO
- [x] added integration tests
- [x] verified on staging environment (screenshot attached)

---------

Co-authored-by: Vuong <[email protected]>
dmoore247 pushed a commit that referenced this pull request Mar 23, 2024
* Added AWS IAM role support to `databricks labs ucx
create-uber-principal` command
([#993](#993)). The
`databricks labs ucx create-uber-principal` command now supports AWS
Identity and Access Management (IAM) roles for external table migration.
This new feature introduces a CLI command to create an `uber-IAM`
profile, which checks for the UCX migration cluster policy and updates
or adds the migration policy to provide access to the relevant table
locations. If no IAM instance profile or role is specified in the
cluster policy, a new one is created and the new migration policy is
added. This change includes new methods and functions to handle AWS IAM
roles, instance profiles, and related trust policies. Additionally, new
unit and integration tests have been added and verified on the staging
environment. The implementation also identifies all S3 buckets used by
the Instance Profiles configured in the workspace.
* Added Dashboard widget to show the list of cluster policies along with
DBR version
([#1013](#1013)). In this
code revision, the `assessment` module of the 'databricks/labs/ucx'
package has been updated to include a new `PoliciesCrawler` class, which
fetches, assesses, and snapshots cluster policies. This class extends
`CrawlerBase` and `CheckClusterMixin` and introduces the '_crawl',
'_assess_policies', '_try_fetch', and `snapshot` methods. The
`PolicyInfo` dataclass has been added to hold policy information, with a
structure similar to the `ClusterInfo` dataclass. The `ClusterInfo`
dataclass has been updated to include `spark_version` and `policy_id`
attributes. A new table for policies has been added, and cluster
policies along with the DBR version are loaded into this table. Relevant
user documentation, tests, and a Dashboard widget have been added to
support this feature. The `create` function in 'fixtures.py' has been
updated to enable a Delta preview feature in Spark configurations, and a
new SQL file has been included for querying cluster policies.
Additionally, a new `crawl_cluster_policies` method has been added to
scan and store cluster policies with matching configurations.
* Added `migration_status` table to capture a snapshot of migrated
tables ([#1041](#1041)). A
`migration_status` table has been added to track the status of migrated
tables in the database, enabling improved management and tracking of
migrations. The new `MigrationStatus` class, which is a dataclass that
holds the source and destination schema, table, and updated timestamp,
is added. The `TablesMigrate` class now has a new
`_migration_status_refresher` attribute that is an instance of the new
`MigrationStatusRefresher` class. This class crawls the
`migration_status` table and returns a snapshot of the migration status,
which is used to refresh the migration status and check if the table is
upgraded. Additionally, the `_init_seen_tables` method is updated to get
the seen tables from the `_migration_status_refresher` instead of
fetching from the table properties. The `MigrationStatusRefresher` class
fetches the migration status table and returns a snapshot of the
migration status. This change also adds new test functions in the test
file for the Hive metastore, which covers various scenarios such as
migrating managed tables with and without caching, migrating external
tables, and reverting migrated tables.
* Added a check for existing inventory database to avoid losing
existing, inject installation objects in tests and try fetching existing
installation before setting global as default
([#1043](#1043)). In this
release, we have added a new method, `_check_inventory_database_exists`,
to the `WorkspaceInstallation` class, which checks if an inventory
database with a given name already exists in the Workspace. This
prevents accidental overwriting of existing data and improves the
robustness of handling inventory databases. The `validate_and_run`
method has been updated to call
`app.current_installation(workspace_client)`, allowing for a more
flexible handling of installations. The `Installation` class import has
been updated to include `SerdeError`, and the test suite has been
updated to inject installation objects and check for existing
installations before setting the global installation as default. A new
argument `inventory_schema_suffix` has been added to the `factory`
method for customization of the inventory schema name. We have also
added a new method `check_inventory_database_exists` to the
`WorkspaceInstaller` class, which checks if an inventory database
already exists for a given installation type and raises an
`AlreadyExists` error if it does. The behavior of the `download` method
in the `WorkspaceClient` class has been mocked, and the `get_status`
method has been updated to return `NotFound` in certain tests. These
changes aim to improve the robustness, flexibility, and safety of the
installation process in the Workspace.
* Added a check for external metastore in SQL warehouse configuration
([#1046](#1046)). In this
release, we have added new functionality to the Unity Catalog (UCX)
installation process to enable checking for and connecting to an
external Hive metastore configuration. A new method,
`_get_warehouse_config_with_external_hive_metastore`, has been
introduced to retrieve the workspace warehouse config and identify if it
is set up for an external Hive metastore. If so, and the user confirms
the prompt, UCX will be configured to connect to the external metastore.
Additionally, new methods `_extract_external_hive_metastore_sql_conf`
and `test_cluster_policy_definition_<cloud_provider>_hms_warehouse()`
have been added to handle the external metastore configuration for
Azure, AWS, and GCP, and to handle the case when the data_access_config
is empty. These changes provide more flexibility and ease of use when
installing UCX with external Hive metastore configurations. The new
imports `EndpointConfPair`, `GetWorkspaceWarehouseConfigResponse` from
the `databricks.sdk.service.sql` package are used to handle the endpoint
configuration of the SQL warehouse.
* Added integration tests for AWS - create locations
([#1026](#1026)). In this
release, we have added comprehensive integration tests for AWS resources
and their management in the `tests/unit/assessment/test_aws.py` file.
The `AWSResources` class has been updated with new methods (AwsIamRole,
add_uc_role, add_uc_role_policy, and validate_connection) and the
regular expression for matching S3 resource ARN has been modified. The
`create_external_locations` method now allows for creating external
locations without validating them, and the
`_identify_missing_external_locations` function has been enhanced to
match roles with a wildcard pattern. The new tests include validating
the integration of AWS services with the system, testing the CLI's
behavior when it is missing, and introducing new configuration scenarios
with the addition of a Key Management Service (KMS) key during the
creation of IAM roles and policies. These changes improve the robustness
and reliability of AWS resource integration and handling in our system.
* Bump Databricks SDK to v0.22.0
([#1059](#1059)). In this
release, we are bumping the Databricks SDK version to 0.22.0 and
upgrading the `databricks-labs-lsql` package to ~0.2.2. The new
dependencies for this release include `databricks-sdk==0.22.0`,
`databricks-labs-lsql~=0.2.2`, `databricks-labs-blueprint~=0.4.3`, and
`PyYAML>=6.0.0,<7.0.0`. In the `fixtures.py` file, we have added
`PermissionLevel.CAN_QUERY` to the `CAN_VIEW` and `CAN_MANAGE`
permissions in the `_path` function, allowing users to query the
endpoint. Additionally, we have updated the `test_endpoints` function in
the `test_generic.py` file as part of the integration tests for
workspace access. This change updates the permission level for creating
a serving endpoint from `CAN_MANAGE` to `CAN_QUERY`, meaning that the
assigned group can now only query the endpoint. We have also included
the `test_feature_tables` function in the commit, which tests the
behavior of feature tables in the Databricks workspace. This change only
affects the `test_endpoints` function and its assert statements, and
does not impact the functionality of the `test_feature_tables` function.
* Changed default UCX installation folder to `/Applications/ucx` from
`/Users/<me>/.ucx` to allow multiple users to utilise the same
installation ([#854](#854)).
In this release, we've added a new advanced feature that allows users to
force the installation of UCX over an existing installation using the
`UCX_FORCE_INSTALL` environment variable. This variable can take two
values, `global` and `user`, providing more control and flexibility in
installing UCX. The default UCX installation folder has been changed to
`/Applications/ucx` from `/Users/<me>/.ucx` to enable multiple users to
utilize the same installation. A table detailing the expected install
location, `install_folder`, and mode for each combination of global and
user values has been added to the README file. We've also added user
prompts to confirm the installation if UCX is already installed and the
`UCX_FORCE_INSTALL` variable is set to `user`. This feature is useful
when users want to install UCX in a specific location or force the
installation over an existing installation. However, it is recommended
to use this feature with caution, as it can potentially break existing
installations if not used correctly. Additionally, several changes to
the implementation of the UCX installation process have been made, as
well as new tests to ensure that the installation process works
correctly in various scenarios.
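The folder-selection rule described above can be sketched as a small pure function. This is an illustrative reduction of the behaviour, not the installer's actual code, and the `install_folder` helper name is hypothetical:

```python
from typing import Optional

def install_folder(username: str, force: Optional[str]) -> str:
    """Pick the UCX install folder.

    force is the value of the UCX_FORCE_INSTALL environment variable:
    - "user": force a per-user install under the user's home folder
    - "global" or None: the shared /Applications/ucx folder (the new default)
    """
    if force == "user":
        return f"/Users/{username}/.ucx"
    return "/Applications/ucx"

print(install_folder("me", "user"))  # /Users/me/.ucx
print(install_folder("me", None))    # /Applications/ucx
```

In the real installer the `user` path additionally triggers a confirmation prompt when an existing installation is detected.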
* Fix: Recover lost fix for `webbrowser.open` mock
([#1052](#1052)). A fix has
been implemented to address an issue related to the mock for
`webbrowser.open` in the tests `test_repair_run` and
`test_get_existing_installation_global`. This change prevents the
`webbrowser.open` function from being called during these tests, which
helps improve test stability and consistency. No new methods have been
added, and the existing functionality of these tests has only been
modified to include the `webbrowser.open` mock. This modification aims
to enhance the reliability and predictability of these specific tests,
ensuring accurate and consistent results.
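The pattern the fix restores is standard `unittest.mock` usage: patch `webbrowser.open` for the duration of the test so no real browser launches, then assert on the call. A minimal self-contained example (the `open_docs` function and URL are illustrative, not UCX code):

```python
from unittest.mock import patch

def open_docs():
    # Code under test: would launch a real browser window if not mocked.
    import webbrowser
    webbrowser.open("https://example.com/docs")

# Patching webbrowser.open keeps the test hermetic and lets us
# assert on the URL that would have been opened.
with patch("webbrowser.open") as mock_open:
    open_docs()
    mock_open.assert_called_once_with("https://example.com/docs")
```

Losing such a mock is exactly the kind of regression the commit recovers: the tests still pass locally but pop browser windows and behave differently in CI.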
* Improved table migrations logic
([#1050](#1050)). This
change introduces improvements to table migrations logic by refactoring
unit tests to load table mappings from JSON instead of inline structs,
adding an `escape_sql_identifier` function where missing, and preparing
for ACLs migration. The `uc_grant_sql` method in `grants.py` has been
updated to accept optional `object_type` and `object_key` parameters,
and the hive-to-UC mapping has been expanded to include mappings for
views. Additionally, new JSON files for external source table
configuration have been added, and new functions have been introduced
for loading fixture data from JSON files and creating mocked
`WorkspaceClient` and `TableMapping` objects for testing. The changes
improve the maintainability and security of the codebase, prepare it for
future migration tasks, and ensure that the code is more adaptable and
robust. The changes have been manually tested and verified on the
staging environment.
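For illustration, an `escape_sql_identifier`-style helper typically wraps each dot-separated part of a table path in backticks so names with dashes or spaces survive SQL generation. This is a minimal sketch of that idea, assuming backtick-doubling for embedded backticks; the actual UCX implementation may differ:

```python
def escape_sql_identifier(path: str) -> str:
    """Wrap each dot-separated part in backticks, doubling embedded backticks."""
    return ".".join(f"`{part.replace('`', '``')}`" for part in path.split("."))

print(escape_sql_identifier("hive_metastore.my-db.table name"))
# `hive_metastore`.`my-db`.`table name`
```

Escaping at every point where identifiers are interpolated into SQL is what makes the generated `GRANT` statements robust against unusual object names.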
* Moved `SqlBackend` implementation to `databricks-labs-lsql` dependency
([#1042](#1042)). In this
change, the `SqlBackend` implementation, including classes such as
`StatementExecutionBackend` and `RuntimeBackend`, has been moved to a
separate library, `databricks-labs-lsql`, which is managed at
<https://github.com/databrickslabs/lsql>. This refactoring simplifies
the current repository, promotes code reuse, and improves modularity by
leveraging an external dependency. The modification includes adding a
new line in the .gitignore file to exclude `*.out` files from version
control.
* Prepare for a PyPI release
([#1038](#1038)). In
preparation for a PyPI release, this change introduces a new GitHub
Actions workflow that automates the package release process and ensures
the integrity of the released packages by signing them with Sigstore.
When a new git tag starting with `v` is pushed, this workflow is
triggered, building wheels using hatch, drafting a new GitHub release,
publishing the package distributions to PyPI, and signing the artifacts
with Sigstore. The `pyproject.toml` file is now used for metadata,
replacing `setup.cfg` and `setup.py`, and is cached to improve build
performance. In addition, the `pyproject.toml` file has been updated
with recent metadata in preparation for the release, including updates
to the package's authors, development status, classifiers, and
dependencies.
* Prevent fragile `mock.patch('databricks...')` in the test code
([#1037](#1037)). This
change introduces a custom `pylint` checker to improve code flexibility
and maintainability by preventing fragile `mock.patch` designs in test
code. The new checker discourages the use of `MagicMock` and encourages
the use of `create_autospec` to ensure that mocks have the same
attributes and methods as the original class. This change has been
implemented in multiple test files, including `test_cli.py`,
`test_locations.py`, `test_mapping.py`, `test_table_migrate.py`,
`test_table_move.py`, `test_workspace_access.py`, `test_redash.py`,
`test_scim.py`, and `test_verification.py`, to improve the robustness
and maintainability of the test code. Additionally, the commit removes
the `verification.py` file, which contained a `VerificationManager`
class for verifying applied permissions, scope ACLs, roles, and
entitlements for various objects in a Databricks workspace.
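The difference the checker enforces is easy to demonstrate: a bare `MagicMock` happily invents any attribute, so a typo in a test silently passes, while `create_autospec` constrains the mock to the real class's surface. The `TableMapping` class below is a minimal stand-in for illustration:

```python
from unittest.mock import MagicMock, create_autospec

class TableMapping:
    """Minimal stand-in for a class under test."""
    def load(self) -> list:
        return []

loose = MagicMock()
loose.laod()  # typo goes unnoticed: MagicMock invents any attribute

strict = create_autospec(TableMapping)
strict.load()  # fine: the attribute exists on the real class
try:
    strict.laod()  # the same typo now fails fast
except AttributeError as err:
    print(f"caught: {err}")
```

Because autospecced mocks also check call signatures, refactoring the real class breaks the affected tests instead of leaving them green against a stale interface.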
* Removed `mocker.patch("databricks...")` from `test_cli`
([#1047](#1047)). In this
release, we have made significant updates to the library's handling of
Azure and AWS workspaces. We have added new parameters
`azure_resource_permissions` and `aws_permissions` to the
`_execute_for_cloud` function in `cli.py`, which are passed to the
`func_azure` and `func_aws` functions respectively. The
`create_uber_principal` and `principal_prefix_access` commands have also
been updated to include these new parameters. Additionally, the
`_azure_setup_uber_principal` and `_aws_setup_uber_principal` functions
have been updated to accept the new `azure_resource_permissions` and
`aws_resource_permissions` parameters. The
`_azure_principal_prefix_access` and `_aws_principal_prefix_access`
functions have also been updated similarly. We have also introduced a
new `aws_resources` parameter in the `migrate_credentials` command,
which is used to migrate Azure Service Principals in ADLS Gen2 locations
to UC storage credentials. In terms of testing, we have replaced the
`mocker.patch` calls with the creation of `AzureResourcePermissions` and
`AWSResourcePermissions` objects, improving the code's readability and
maintainability. Overall, these changes significantly enhance the
library's functionality and maintainability in handling Azure and AWS
workspaces.
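The refactoring amounts to dependency injection: the cloud-specific permissions objects become parameters, so tests pass in doubles instead of patching module globals. A hypothetical sketch of the dispatch shape (names mirror the description above but are not the exact UCX signatures):

```python
def execute_for_cloud(cloud, func_azure, func_aws,
                      azure_resource_permissions, aws_resource_permissions):
    """Dispatch to the cloud-specific handler, passing its permissions object."""
    if cloud == "azure":
        return func_azure(azure_resource_permissions)
    if cloud == "aws":
        return func_aws(aws_resource_permissions)
    raise ValueError(f"unsupported cloud: {cloud}")

# In a test, the permissions arguments can be create_autospec() doubles;
# plain values suffice to show the wiring.
result = execute_for_cloud(
    "aws",
    func_azure=lambda perms: ("azure", perms),
    func_aws=lambda perms: ("aws", perms),
    azure_resource_permissions="azure-perms",
    aws_resource_permissions="aws-perms",
)
print(result)  # ('aws', 'aws-perms')
```

With this shape, no `mocker.patch("databricks...")` is needed: the seam is the parameter list itself.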
* Require Hatch v1.9.4 on build machines
([#1049](#1049)). In this
release, we have updated the Hatch package version to 1.9.4 on build
machines, addressing issue
[#1049](#1049). The changes
include updating the toolchain dependencies and setup in the
`.codegen.json` file, which simplifies the setup process and now relies
on a pre-existing Hatch environment and Python 3. The acceptance
workflow has also been updated to use the latest version of Hatch and
the `databrickslabs/sandbox/acceptance` GitHub action version `v0.1.4`.
Hatch is a Python package manager that simplifies package development
and management, and this update provides new features and bug fixes that
can help improve the reliability and performance of the acceptance
workflow. This change requires Hatch 1.9.4 on build machines; it affects
the project's build process but has no impact on the functionality of
the project itself. Engineers adopting the project should note this
change so that the build process runs smoothly and benefits from the
fixes and improvements in Hatch 1.9.4.
* Set acceptance tests to timeout after 45 minutes
([#1036](#1036)). As part of
issue [#1036](#1036), the
acceptance tests in this open-source library now have a 45-minute
timeout configured, improving the reliability and stability of the
testing environment. This change has been implemented in the
`.github/workflows/acceptance.yml` file by adding the `timeout`
parameter to the step where the `databrickslabs/sandbox/acceptance`
action is called. This ensures that the acceptance tests will not run
indefinitely and prevents any potential issues caused by long-running
tests. By adopting this project, software engineers can now benefit from
a more stable and reliable testing environment, with acceptance tests
that are guaranteed to complete within a maximum of 45 minutes.
* Updated databricks-labs-blueprint requirement from ~0.4.1 to ~0.4.3
([#1058](#1058)). In this
release, the version requirement for the `databricks-labs-blueprint`
library has been updated from ~0.4.1 to ~0.4.3 in the pyproject.toml
file. This change is necessary to support issues
[#1056](#1056) and
[#1057](#1057). The code has
been manually tested and is ready for further testing to ensure the
compatibility and smooth functioning of the software. It is essential to
thoroughly test the latest version of the `databricks-labs-blueprint`
library with the existing codebase before deploying it to production.
This includes running a comprehensive suite of tests such as unit tests,
integration tests, and verification on the staging environment. This
modification allows the software to use the latest version of the
library, improving its functionality and overall performance.
* Use `MockPrompts.extend()` functionality in test_install to supply
multiple prompts
([#1057](#1057)). This diff
introduces the `MockPrompts.extend()` functionality in the
`test_install` module to enable the supplying of multiple prompts for
testing purposes. A new `base_prompts` dictionary with default prompts
has been added and is extended with additional prompts for specific test
cases. This allows for the testing of various scenarios, such as when
UCX is already installed on the workspace and the user is prompted to
choose between global or user installation. Additionally, new
`force_user_environ` and `force_global_env` dictionaries have been added
to simulate different installation environments. The functionality of
the `WorkspaceInstaller` class and mocking of `webbrowser.open` are also
utilized in the test cases. These changes aim to ensure the proper
functioning of the configuration process for different installation
scenarios.
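The `extend()` pattern keeps a shared base of canned answers and layers scenario-specific ones on top. The class below is a tiny self-contained stand-in for `MockPrompts` (which maps question-matching regexes to answers in `databricks-labs-blueprint`), written to illustrate the mechanics rather than reproduce the library's API exactly:

```python
import re

class FakePrompts:
    """Tiny stand-in for MockPrompts: maps question regexes to canned answers."""
    def __init__(self, patterns: dict):
        self._patterns = dict(patterns)

    def extend(self, more: dict) -> "FakePrompts":
        # Return a new prompt set with extra/overriding answers, leaving
        # the base dictionary reusable across test cases.
        return FakePrompts({**self._patterns, **more})

    def question(self, text: str) -> str:
        for pattern, answer in self._patterns.items():
            if re.search(pattern, text):
                return answer
        raise ValueError(f"no canned answer for: {text}")

base = FakePrompts({r"Inventory database": "ucx"})
forced_user = base.extend({r"UCX is already installed": "user"})
print(forced_user.question("UCX is already installed. Install again?"))  # user
```

Each test case then extends `base` with only the prompts it cares about, such as choosing between global and user installation, instead of duplicating the whole dictionary.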