
Update 01_backup_catalog.py #1

Closed
aemrob wants to merge 1 commit into databrickslabs:main from aemrob:patch-1

Conversation


@aemrob aemrob commented Apr 3, 2023

`get_external_location` is a string value, so the conditional statement always executes regardless of the value of `get_external_location`

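The comment above describes a classic truthiness bug. A minimal sketch (variable name taken from the comment; the surrounding notebook code is assumed): widget values arrive as strings, and any non-empty string is truthy, so testing the variable directly makes the branch unconditional.

```python
# Sketch of the reported bug: dbutils.widgets.get() returns a string,
# so even the value "False" is truthy and the branch always runs.
get_external_location = "False"

if get_external_location:  # always True for any non-empty string
    always_ran = True

# The fix is to compare against the expected string value explicitly:
uses_external_location = get_external_location == "True"
print(always_ran, uses_external_location)  # -> True False
```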
@HariGS-DB HariGS-DB closed this Aug 9, 2023
pritishpai added a commit that referenced this pull request Dec 29, 2023
pritishpai added a commit that referenced this pull request Jan 3, 2024
pritishpai added a commit that referenced this pull request Jan 4, 2024
@nfx nfx mentioned this pull request May 8, 2024
nfx added a commit that referenced this pull request May 8, 2024
* Added DBSQL queries & dashboard migration ([#1532](#1532)). The Databricks Labs UCX project has been updated with two new experimental commands: `migrate-dbsql-dashboards` and `revert-dbsql-dashboards`. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The `migrate-dbsql-dashboards` command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with `migrated by UCX` and backing up original queries. The `revert-dbsql-dashboards` command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a `--dashboard-id` flag for migrating or reverting a specific dashboard. Additionally, two new functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`, have been added to the `cli.py` file, and new classes have been added to interact with Redash for data visualization and querying. The `make_dashboard` fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards.
* Added UDFs assessment ([#1610](#1610)). A User Defined Function (UDF) assessment feature has been introduced, addressing issue [#1610](#1610). A new method, `DESCRIBE_FUNCTION`, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges, and ensuring system reliability. The UDF constructor has been updated with a new parameter `comment`, initially left blank in the test function. Additionally, two new columns, `success` and `failures`, have been added to the `udf` table in the inventory database to store assessment data for UDFs. The `UdfsCrawler` class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the `$inventory.udfs` table, with a widget displaying this information as a counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to create the missing UC roles in AWS ([#1495](#1495)). The `databricks labs ucx` tool now includes a new command, `create-missing-principals`, which creates missing Universal Catalog (UC) roles in AWS for S3 locations that lack a UC compatible role. This command is implemented using `IamRoleCreation` from `databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new command only supports AWS and does not affect Azure. The existing `migrate_credentials` function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including `AWSUCRoleCandidate` in `aws.py`, and `create_missing_principals` and `list_uc_roles` methods in `access.py`. The `create_uc_roles_cli` method in `access.py` has been refactored and renamed to `list_uc_roles`. New unit tests have been implemented to test the functionality of `create_missing_principals` for AWS and Azure, as well as testing the behavior when the command is not approved.
* Added baseline for workflow linter ([#1613](#1613)). This change introduces the `WorkflowLinter` class in the `application.py` file of the `databricks.labs.ucx.source_code.jobs` package. The class is used to lint workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as `workspace_client`, `dependency_resolver`, `path_lookup`, and `migration_index`. Several properties have been moved from `dependency_resolver` to the `CliContext` class, and the `NotebookLoader` class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The `generic` and `redash` modules from `databricks.labs.ucx.workspace_access` and the `GroupManager` class from `databricks.labs.ucx.workspace_access.groups` are used. The `VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from `databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class from `databricks.labs.ucx.installer.workflows` are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access ([#1606](#1606)). A new `AstHelper` class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the `AstHelper` class, which has been moved to a separate module. A new file, 'spark_connect.py', introduces a linter with three matchers to ensure conformance to best practices and catch potential issues early in the development process related to RDD usage and JVM access. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments.
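The RDD/JVM linter above works by walking Python abstract syntax trees. A hedged sketch of the idea (not UCX's actual `AstHelper` or matchers, which are more elaborate): flag attribute accesses that suggest RDD usage or JVM access.

```python
# Illustrative AST-based check: walk the parse tree and flag attribute
# accesses named "rdd" or "_jvm", two patterns that do not work on
# shared/serverless clusters. Matcher names and rules are assumed.
import ast

def find_rdd_and_jvm_access(source: str) -> list:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute):
            if node.attr == "rdd":
                findings.append("rdd access")
            elif node.attr == "_jvm":
                findings.append("jvm access")
    return findings

code = "df.rdd.map(f); spark._jvm.com.example.Thing()"
findings = find_rdd_and_jvm_access(code)
print(sorted(findings))  # flags both patterns
```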
* Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in migrate_table workflow ([#1621](#1621)). The `migrate_tables` workflow in `workflows.py` has been enhanced to support a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables stored in DBFS root from the Hive Metastore to the Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the AclMigrationWhat.PRINCIPAL strategy. The `migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and `migrate_views` tasks now incorporate the new ACL migration strategy. These changes have been thoroughly tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities.
* Added "seen tables" feature ([#1465](#1465)). The `seen tables` feature has been introduced, allowing for better handling of existing tables in the hive metastore and supporting their migration to UC. This enhancement includes the addition of a `snapshot` method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The `_crawl` function has been updated to check for and skip existing tables in the current workspace. New methods such as '_get_tables_paths_from_assessment', '_overwrite_records', and `_get_table_location` have been included to facilitate these improvements. In the testing realm, a new test `test_mount_listing_seen_tables` has been implemented, replacing 'test_partitioned_csv_jsons'. This test checks the behavior of the TablesInMounts class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the 'locations.py' file in the databricks/labs/ucx directory, related to the hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` CLI command ([#1660](#1660)). This commit adds support for the `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` command, which checks for external tables that cannot be synced and prompts the user to run the `migrate-tables-ctas` workflow. Two new methods, `test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts, ctx=ctx)`, have been added. The first method checks if the `migrate-external-tables-ctas` workflow is called correctly, while the second method runs the workflow after prompting the user. The method `test_migrate_external_hiveserde_tables_in_place(ws)` has been modified to test if the `migrate-external-hiveserde-tables-in-place-experimental` workflow is called correctly. No new methods or significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature are software engineers who adopt the project.
* Added support for migrating external location permissions from interactive cluster mounts ([#1487](#1487)). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the AzureACL class, granting necessary permissions to each cluster principal for each location. The existing `databricks labs ucx` command is modified, with the addition of the new method `create_external_locations` and thorough testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues [#1192](#1192) and [#1193](#1193), ensuring a more robust and controlled user experience with interactive clusters.
* Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN ([#1631](#1631)). In this release, we've implemented new features to enhance the security and control over data access during the migration process for the SQL warehouse data access configuration. The `databricks labs ucx create-uber-principal` command now creates a service principal with read-only access to all the storage used by tables in the workspace. The UCX Cluster Policy and SQL Warehouse data access configuration will be updated to use this service principal for migration workflows. A new method, `_update_sql_dac_with_instance_profile`, has been introduced in the `access.py` file to update the SQL data access configuration with the provided AWS instance profile, ensuring a more streamlined management of instance profiles within the SQL data access configuration during the creation of an uber service principal (SPN). Additionally, new methods and tests have been added to the sql module of the databricks.sdk.service package to improve Azure resource permissions, handling different scenarios related to creating a global SPN in the presence or absence of various conditions, such as storage, cluster policies, or secrets.
* Addressed issue with disabled features in certain regions ([#1618](#1618)). In this release, we have implemented improvements to address an issue where certain features were disabled in specific regions. We have added error handling when listing serving endpoints to raise a NotFound error if a feature is disabled, preventing the code from failing silently and providing better error messages. A new method, test_serving_endpoints_not_enabled, has been added, which creates a mock WorkspaceClient and raises a NotFound error if serving endpoints are not enabled for a shard. The GenericPermissionsSupport class uses this method to get crawler tasks, and if serving endpoints are not enabled, an error message is logged. These changes increase the reliability and robustness of the codebase by providing better error handling and messaging for this particular issue. Additionally, the change includes unit tests and manual testing to ensure the proper functioning of the new features.
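The error-handling pattern described above can be sketched as follows. This is a hedged stand-in, not the UCX implementation: `NotFound` here is a local stub for `databricks.sdk.errors.NotFound`, and the listing function simulates a region where serving endpoints are disabled.

```python
# Sketch: treat a regionally disabled feature as NotFound, log a warning,
# and continue, instead of failing silently. All names are stand-ins.
import logging

logger = logging.getLogger("ucx.sketch")

class NotFound(Exception):
    """Stand-in for databricks.sdk.errors.NotFound."""

def list_serving_endpoints(enabled: bool):
    if not enabled:
        raise NotFound("serving endpoints are not enabled for this shard")
    return ["endpoint-a"]

def crawler_tasks(enabled: bool) -> list:
    try:
        return list_serving_endpoints(enabled)
    except NotFound as err:
        logger.warning("Skipping serving endpoints: %s", err)
        return []

print(crawler_tasks(False))  # -> []
print(crawler_tasks(True))   # -> ['endpoint-a']
```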
* Aggregate UCX output across workspaces with CLI command ([#1596](#1596)). A new `report-account-compatibility` command has been added to the `databricks labs ucx` tool, enabling users to evaluate the readiness and compatibility of an entire Azure Databricks account for UCX. Run at the account level, it generates a readiness report by querying various aspects of the account such as clusters, configurations, and data formats. It uses Azure CLI authentication with AAD tokens and accepts a profile as an argument. The output includes warnings for workspaces that do not have UCX installed, and provides information about unsupported cluster types, unsupported configurations, data format compatibility, and more. The existing `manual-workspace-info` command remains unchanged. These changes simplify the process of checking UCX compatibility across an entire account.
* Assert if group name is in cluster policy ([#1665](#1665)). In this release, we have implemented a change to ensure the presence of the display name of a specific workspace group (ws_group_a) in the cluster policy. This is to prevent a key error previously encountered. The cluster policy is now loaded as a dictionary, and the group name is checked to confirm its presence. If the group is not found, a message is raised alerting users. Additionally, the permission level for the group is verified to ensure it is set to CAN_USE. No new methods have been added, and existing functionality remains unchanged. The test file test_ext_hms.py has been updated to include the new assertion and has undergone both unit tests and manual testing to ensure proper implementation. This change is intended for software engineers who adopt the project.
* Automatically retrying with `auth_type=azure-cli` when constructing `workspace_clients` on Azure ([#1650](#1650)). This commit introduces automatic retrying with 'auth_type=azure-cli' when constructing `workspace_clients` on Azure, resolving TODO items for `AccountWorkspaces` and adding relevant suggestions in 'troubleshooting.md'. It closes issues [#1574](#1574) and [#1430](#1430), and includes new methods for generating readiness reports in `AccountAggregate` and testing the `get_accessible_workspaces` method in 'test_workspaces.py'. User documentation has been updated and the changes have been manually verified in a staging environment. For macOS and Windows users, explicit auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure storage account for principal-prefix-access ([#1576](#1576)). This release introduces several enhancements to the identification of service principals with custom roles on Azure storage accounts for principal-prefix-access. New methods such as `_get_permission_level`, `_get_custom_role_privilege`, and `_get_role_privilege` have been added to improve the functionality of the module. Additionally, two new classes, AzureRoleAssignment and AzureRoleDetails, have been added to enable more detailed management and access control for custom roles on Azure storage accounts. The 'test_access.py' file has been updated to include tests for saving custom roles in Azure storage accounts and ensuring the correct identification of service principals with custom roles. A new unit test function, test_role_assignments_custom_storage(), has also been added to verify the behavior of custom roles in Azure storage accounts. Overall, these changes provide a more efficient and fine-grained way to manage and control custom roles on Azure storage accounts.
* Clarified unsupported config in compute crawler ([#1656](#1656)). In this release, we have made significant changes to clarify and improve the handling of unsupported configurations in our compute crawler related to the Hive metastore. We have expanded error messages for unsupported configurations and provided detailed recommendations for remediation. Additionally, we have added relevant user documentation and manually tested the changes. The changes include updates to the configuration for external Hive metastore and passthrough security model for Unity Catalog, which are incompatible with the current configurations. We recommend removing or altering the configs while migrating existing tables and views using UCX or other compatible clusters, and mapping the passthrough security model to a security model compatible with Unity Catalog. The code modifications include the addition of new methods for checking cluster init script and Spark configurations, as well as refining the error messages for unsupported configurations. We also added a new assertion in the `test_cluster_with_multiple_failures` unit test to check for the presence of a specific message regarding the use of the `spark.databricks.passthrough.enabled` configuration. This release is not yet verified on the staging environment.
* Created a unique default schema when External Hive Metastore is detected ([#1579](#1579)). A new default database `ucx` is introduced for storing inventory in the hive metastore, with a suffix consisting of the workspace's client ID to ensure uniqueness when an external hive metastore is detected. The `has_ext_hms()` method is added to the `InstallationPolicy` class to detect external HMS and thereby create a unique default schema. The `_prompt_for_new_installation` method's default value for the `Inventory Database stored in hive_metastore` prompt is updated to use the new default database name, modified to include the workspace's client ID if external HMS is detected. Additionally, a test function `test_save_config_ext_hms` is implemented to demonstrate the `WorkspaceInstaller` class's behavior with external HMS, creating a unique default schema for improved system functionality and customization. This change is part of issue [#1579](#1579).
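The naming scheme described above can be sketched in a few lines. The exact suffix format is assumed here; the bullet only states that the workspace's client ID is appended when an external Hive metastore is detected.

```python
# Sketch (assumed format): derive a unique default inventory schema name
# when an external Hive metastore (HMS) is detected, so multiple
# workspaces sharing one HMS do not collide on the same "ucx" database.
def default_inventory_schema(has_ext_hms: bool, client_id: str) -> str:
    return f"ucx_{client_id}" if has_ext_hms else "ucx"

print(default_inventory_schema(True, "1234"))   # -> ucx_1234
print(default_inventory_schema(False, "1234"))  # -> ucx
```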
* Extend service principal migration to create storage credentials for access connectors created for each storage account ([#1426](#1426)). This commit extends the service principal migration to create storage credentials for access connectors associated with each storage account, resolving issues [#1384](#1384) and [#875](#875). The update includes modifications to the existing `databricks labs ucx` command for creating access connectors, adds a new CLI command for creating storage credentials, and updates the documentation. A new workflow has been added for creating credentials for access connectors and service principals, and updates have been made to existing workflows. The commit includes manual, unit, and integration tests, and no new or modified methods are specified in the diff. The focus is on the feature description and its impact on the project's functionality. The commit has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to access Azure Storage Accounts behind firewall ([#1589](#1589)). In this release, we have introduced a new feature to improve access to Azure Storage Accounts that are protected by firewalls. Due to limitations with service principals in such scenarios, we have developed Access Connectors with Managed Identities for more reliable connectivity. This change includes updates to the 'credentials.py' file, which introduces new methods for managing the migration of service principals to Access Connectors using Managed Identities. Users are warned that migrating to this new feature may cause issues when transitioning to UC, and are advised to validate external locations after running the migration command. This update enhances the security and functionality of the system, providing a more dependable method for accessing Azure Storage Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have different target schemas ([#1581](#1581)). In this release, we have implemented a fix to address an issue where catalog/schema grants were not being handled correctly when tables with the same source schema had different target schemas. This was causing problems with granting appropriate permissions to users. We have modified the prepare_test function to include an additional test case with a different target schema for the same source table. Furthermore, we have updated the test_catalog_schema_acl function to ensure that grants are being created correctly for all catalogs, schemas, and tables. We have also added an extra query to grant use schema permissions for catalog2.schema3 to user1. Additionally, we have introduced a new `SchemaInfo` class to store information about catalogs and schemas, and refactored the `_get_database_source_target_mapping` method to return a dictionary that maps source databases to a list of `SchemaInfo` objects instead of a single dictionary. These changes ensure that grants are being handled correctly for catalogs, schemas, and tables, even when tables with the same source schema have different target schemas. This will improve the overall functionality and reliability of the system, making it easier for users to manage their catalogs and schemas.
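The mapping change at the heart of this fix can be sketched with a minimal dataclass: a source database now maps to a *list* of (catalog, schema) targets rather than a single one. Field names are assumed; the real `SchemaInfo` may differ.

```python
# Sketch: one source schema can fan out to several target catalogs/schemas,
# so grants must be generated per (catalog, schema) pair, not per source.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaInfo:
    catalog: str
    schema: str

rules = [
    ("src_db", "catalog1", "schema1"),
    ("src_db", "catalog2", "schema3"),  # same source, different target
]

mapping = defaultdict(list)
for src, catalog, schema in rules:
    mapping[src].append(SchemaInfo(catalog, schema))

print(mapping["src_db"])
```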
* Fixed Spark configuration parameter referencing secret ([#1635](#1635)). In this release, the code related to the Spark configuration parameter reference for a secret has been updated in the `access.py` file, specifically within the `_update_cluster_policy_definition` method. The change modifies the method to retrieve the OAuth client secret for a given storage account using an f-string to reference the secret, replacing the previous concatenation operator. This enhancement is aimed at improving the readability and maintainability of the code while preserving its functionality. Furthermore, the commit includes additional changes, such as new methods `test_create_global_spn` and "cluster_policies.edit", which may be related to this fix. These changes address the secret reference issue, ensuring secure access control and improved integration, particularly with the Spark configuration, benefiting engineers utilizing this project for handling sensitive information and managing clusters securely and effectively.
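For context on the f-string change: Spark configuration can reference a Databricks secret with the `{{secrets/<scope>/<key>}}` syntax. A sketch of building such a reference with an f-string (scope and key names here are made up), noting that literal braces must be doubled inside an f-string:

```python
# Build a Spark-conf secret reference with an f-string. Each literal
# "{" or "}" in the output requires "{{" or "}}" in the f-string.
scope, key = "ucx", "uber_principal_secret"
conf_value = f"{{{{secrets/{scope}/{key}}}}}"
print(conf_value)  # -> {{secrets/ucx/uber_principal_secret}}
```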
* Fixed `migration-locations` and `assign-metastore` definitions in `labs.yml` ([#1627](#1627)). In this release, the `migration-locations` command in the `labs.yml` file has been updated to include new flags `subscription-id` and `aws-profile`. The `subscription-id` flag allows users to specify the subscription to scan the storage account in, and the `aws-profile` flag allows for authentication using a specified AWS Profile. The `assign-metastore` command has also been updated with a new description: "Enable Unity Catalog features on a workspace by assigning a metastore to it." The `is_account_level` parameter remains unchanged, and the new optional flag `workspace-id` has been added, allowing users to specify the Workspace ID to assign a metastore to. This change enhances the functionality of the `migration-locations` and `assign-metastore` commands, providing more options for users to customize their storage scanning and metastore assignment processes. The `migration-locations` and `assign-metastore` definitions in the `labs.yml` file have been fixed in this release.
* Fixed prompt for using external metastore ([#1668](#1668)). A fix has been implemented in the `create` function of the `policy.py` file to correctly prompt users for using an external metastore. Previously, a missing period and space in the prompt caused potential confusion. The updated prompt now includes a clarifying sentence and the `_prompts.confirm` method has been modified to check if the user wants to set UCX to connect to an external metastore in two scenarios: when one or more cluster policies are set up for an external metastore, and when the workspace warehouse is configured for an external metastore. If the user chooses to set up an external metastore, an informational message will be recorded in the logger. This change ensures clear and precise communication with users during the external metastore setup process.
* Fixed storage account network ACLs retrieved from properties ([#1620](#1620)). This release includes a fix to the storage account network ACLs retrieval in the open-source library, addressing issue [#1](#1). Previously, the network ACLs were being retrieved from an incorrect location, but this commit corrects that by obtaining the network ACLs from the storage account's properties.networkAcls field. The `StorageAccount` class has been updated to modify the way default network action is retrieved, with a new value `Unknown` added to the previous values `Deny` and "Allow". The `from_raw_resource` class method has also been updated to retrieve the default network action from the `properties.networkAcls` field instead of the `networkAcls` field. This change may affect any functionality that relies on network ACL information and impacts the existing command `databricks labs ucx ...`. Relevant tests, including a new test `test_azure_resource_storage_accounts_list_non_zero`, have been added and manually and unit tested to ensure the fix is functioning correctly.
* Fully refresh table migration status in table migration workflow ([#1630](#1630)). This release introduces a new method, `index_full_refresh()`, to the table migration workflow for fully refreshing the migration status, addressing an oversight from a previous commit ([#1623](#1623)) and resolving issue [#1628](#1628). The new method resets the `_migration_status_refresher` before computing the index, ensuring the latest migration status is used for determining whether view dependencies have been migrated. The `index()` method was previously used to refresh the migration status, but it only provided a partial refresh. With this update, `index_full_refresh()` is utilized for a comprehensive refresh, affecting the `refresh_migration_status` task in multiple workflows such as `migrate_views`, `scan_tables_in_mounts_experimental`, and others. This change ensures a more accurate migration report, presenting the updated migration status.
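The difference between the partial and full refresh above can be sketched with a toy cache. Class and method names follow the bullet; the internals are assumed.

```python
# Sketch: index() may serve a cached (stale) migration status, while
# index_full_refresh() resets the refresher first so the index is rebuilt
# from the latest state.
class MigrationStatusRefresher:
    def __init__(self):
        self._cache = ["stale"]
    def reset(self):
        self._cache = []
    def snapshot(self):
        if not self._cache:
            self._cache = ["fresh"]  # simulate re-crawling current state
        return self._cache

class TableMigrationIndex:
    def __init__(self, refresher):
        self._refresher = refresher
    def index(self):
        return self._refresher.snapshot()       # may return cached rows
    def index_full_refresh(self):
        self._refresher.reset()                 # drop the cache first
        return self._refresher.snapshot()

idx = TableMigrationIndex(MigrationStatusRefresher())
print(idx.index())               # -> ['stale']
print(idx.index_full_refresh())  # -> ['fresh']
```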
* Ignore existing corrupted installations when refreshing ([#1605](#1605)). A recent update has enhanced the error handling during the loading of installations in the `install.py` file. Specifically, the `installation.load` function now handles certain errors, including `PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by logging a warning message and skipping the corrupted installation instead of raising an error. This behavior has been incorporated into both the `configure` and `_check_inventory_database_exists` functions, allowing the installation process to continue even in the presence of issues with existing installations, while providing improved error messages. This change resolves issue [#1601](#1601) and introduces a new test case for a corrupted installation configuration, as well as an updated existing test case for `test_save_config` that includes a mock installation.
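The skip-and-warn behavior described above follows a common pattern; a hedged sketch (the `SerdeError` class and `load_config` function below are stand-ins for the real installation API, not UCX code):

```python
# Sketch: when scanning existing installations, log and skip any whose
# config cannot be deserialized instead of aborting the whole scan.
import logging

logger = logging.getLogger("ucx.sketch")

class SerdeError(ValueError):
    """Stand-in for the real deserialization error."""

def load_config(raw: str) -> dict:
    if raw == "corrupted":
        raise SerdeError("cannot deserialize config")
    return {"inventory_database": raw}

def existing_installations(raws: list) -> list:
    found = []
    for raw in raws:
        try:
            found.append(load_config(raw))
        except (SerdeError, ValueError, AttributeError) as err:
            logger.warning("Skipping corrupted installation: %s", err)
    return found

print(existing_installations(["ucx", "corrupted", "ucx_dev"]))
```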
* Improved exception handling ([#1584](#1584)). In this release, the exception handling during the upload of a wheel file to DBFS has been significantly improved. Previously, only PermissionDenied errors were caught and handled. Now, both BadRequest and PermissionDenied exceptions will be caught and logged as a warning. This change enhances the robustness of the code by handling a wider range of exceptions during the upload process. In addition, cluster overrides have been configured and DBFS write permissions have been set up. The specific changes made to the code include updating the import statement for NotFound to include BadRequest and modifying the except block in the _get_init_script_data method to catch both NotFound and BadRequest exceptions. These improvements ensure that the code can handle more types of errors, providing more helpful error messages and preventing crash scenarios, thereby enhancing the reliability and robustness of the code.
* Improved exception handling for `migrate_acl` ([#1590](#1590)). In this release, the `migrate_acl` functionality has been enhanced to improve exception handling, addressing a flakiness issue in the `test_migrate_managed_tables_with_acl` test. Previously, unhandled `not found` exceptions during parallel test execution caused the flakiness. This release resolves this issue ([#1549](#1549)) by introducing error handling in the `test_migrate_acls_should_produce_proper_queries` test. A controlled error is now introduced to simulate a failed grant migration due to a `TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise testing of error handling and logging mechanisms when migration fails for specific objects, ensuring a more reliable testing environment for the `migrate_acl` functionality.
* Improved reliability of table migration status refresher ([#1623](#1623)). This release introduces improvements to the table migration status refresher in the open-source library, enhancing its reliability and robustness. The `table_migrate` function has been updated to ensure that the table migration status is always reset when requesting the latest snapshot, addressing issues [#1623](#1623), [#1622](#1622), and [#1615](#1615). Additionally, the function now handles `NotFound` errors when refreshing migration status. The `get_seen_tables` function has been modified to convert the returned iterator to a list and raise a `NotFound` exception if the schema does not exist, which is then caught and logged as a warning. Furthermore, the migration status reset behavior has been improved, and the `migration_status_refresher` parameter type in the `TableMigrate` class constructor has been modified. New private methods `_index_with_reset()` and updated `_migrate_views()` and `_view_can_be_migrated()` methods have been added to ensure a more accurate and consistent table migration process. The changes have been thoroughly tested and are ready for review.
* Refresh migration status at the end of the `migrate_tables` workflows ([#1599](#1599)). In this release, updates have been made to the migration status at the end of the `migrate_tables` workflows, with no new or modified tables or methods introduced. The `_migration_status_refresher.reset()` method has been added in two locations to ensure accurate migration status updates. A new `refresh_migration_status` method has been included in the `RuntimeContext` class in the `databricks.labs.ucx.hive_metastore.workflows` module, which refreshes the migration status for presentation in the dashboard. The changes also include the addition of the `refresh_migration_status` task in `migrate_views`, `migrate_views_with_acl`, and `scan_tables_in_mounts_experimental` workflows, and the `migration_report` method is now dependent on the `refresh_migration_status` task. Thorough testing has been conducted, including the creation of a new integration test in the file `tests/integration/hive_metastore/test_workflows.py` to verify that the migration status is refreshed after the migration job is run. These changes aim to ensure that the migration status is up-to-date and accurately presented in the dashboard.
* Removed DBFS library installations ([#1554](#1554)). In this release, the "configure.py" file has been removed, which previously contained the `ConfigureClusterOverrides` class with methods for validating cluster IDs, distinguishing between classic and Table Access Control (TACL) clusters, and building a prompt for users to select a valid active cluster ID. The removal of this file signifies that these functionalities are no longer available. This change is part of a larger commit that also removes DBFS library installations and updates the Estimates Dashboard to remove metastore assignment, addressing issue [#1098](#1098). The commit has been tested via integration tests and manual installation and running of UCX on a no-uc environment. Please note that the `create_jobs` method in the `install.py` file has been updated to reflect these changes, ensuring a more straightforward installation experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt ([#1664](#1664)). In this release, we have removed the `is_terraform_used` prompt from the configuration file and the installation process in the ucx package. This prompt was not being utilized and had been a source of confusion for some users. Although the variable that stored its outcome will be retained for backwards compatibility, no new methods or modifications to existing functionality have been introduced. No tests have been added or modified as part of this change. The removal of this prompt simplifies the configuration process and aligns with the project's future plans to eliminate the use of Terraform state for ucx migration. Manual testing has been conducted to ensure that the removal of the prompt does not affect the functionality of other properties in the configuration file or the installation process.
* Resolve relative paths when building dependency graph ([#1608](#1608)). This commit introduces support for resolving relative paths when building a dependency graph in the UCX project, addressing issues [#1202](#1202), [#1499](#1499), and [#1287](#1287). The `SysPathProvider` now includes a `cwd` attribute, and a new class, `LocalNotebookLoader`, has been implemented to handle local files and folders. The `PathLookup` class is used to resolve paths, and new methods have been added to support these changes. Unit tests have been provided to ensure the correct functioning of the new functionality. This commit replaces issue [#1593](#1593) and enhances the project's ability to handle local files and folders, resulting in a more robust and reliable dependency graph.
* Show tables migration status in migration dashboard ([#1507](#1507)). A migration dashboard has been added to display the status of data object migrations, addressing issue [#323](#323). This new feature includes a query to show the migration status of tables, a new CLI command, and a modification to an existing command. The `migrataion-*` workflow has been updated to include a refresh migration dashboard option. The `mock_installation` function has been modified with an updated state.json file. The changes consist of manual testing and can be found in the `migrations/main` directory as a new SQL query file. This migration dashboard provides users with an easier way to monitor the progress and status of their data migration tasks.
* Simulate loading of local files or notebooks after manipulation of `sys.path` ([#1633](#1633)). This commit updates the PathLookup process during the construction of the dependency graph, addressing issues [#1202](#1202) and [#1468](#1468). It simplifies the DependencyGraphBuilder by directly using the DependencyResolver with resolvers and lookup passed as arguments, and removes the DependencyGraphBuilder. The changes include new methods for handling compatibility checks, but no new user-facing features or changes to command-line interfaces or existing workflows are introduced. Unit tests are included to ensure correct behavior. The modifications aim to improve the internal handling of dependency resolution and compatibility checks.
* Test if `create-catalogs-schemas` works with tables defined as mount paths ([#1578](#1578)). This release includes a new unit test for the `create-catalogs-schemas` logic that verifies the correct creation and management of catalogs and schemas defined as mount paths. The test checks the storage location of catalogs, ensures non-existing schemas are properly created, and prevents the creation of catalogs without a storage location. It also verifies the catalog schema ACL is set correctly. Using the `CatalogSchema` class and various test functions, the test creates and grants permissions to catalogs and schemas. This change resolves issue [#1039](#1039) without modifying any existing commands or workflows. The release contains no new CLI commands or user documentation, but includes unit tests and assertion calls to validate the behavior of the `create_all_catalogs_schemas` method.
* Upgraded `databricks-sdk` to 0.27 ([#1626](#1626)). In this release, the `databricks-sdk` package has been upgraded to version 0.27, bringing updated methods for Redash objects. The `_install_query` method in the `dashboards.py` file has been updated to include a `tags` parameter, set to `None`, when calling `self._ws.queries.update` and `self._ws.queries.create`. This ensures that the updated SDK version is used and that tags are not applied during query updates and creation. Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint` packages have been updated to versions 0.4.0 and 0.4.3 respectively, and the dependency for PyYAML has been updated to a version between 6.0.0 and 7.0.0. These updates may impact the functionality of the project. The changes have been manually tested, but there is no verification on a staging environment.
* Use stack of dependency resolvers ([#1560](#1560)). This pull request introduces a stack-based implementation of resolvers, resolving issues [#1202](#1202), [#1499](#1499), and [#1421](#1421), and implements an initial version of SysPathProvider, while eliminating previous hacks. The new functionality includes modified existing commands, a new workflow, and the addition of unit tests. No new documentation or CLI commands have been added. The `problem_collector` parameter is not addressed in this PR and has been moved to a separate issue. The changes include renaming and moving a Python file, as well as modifications to the `Notebook` class and its related methods for handling notebook dependencies and dependency checking. The code has been tested, but manual testing and integration tests are still pending.
nfx added a commit that referenced this pull request May 8, 2024
* Added DBSQL queries & dashboard migration
([#1532](#1532)). The
Databricks Labs Unified Command Extensions (UCX) project has been
updated with two new experimental commands: `migrate-dbsql-dashboards`
and `revert-dbsql-dashboards`. These commands are designed for migrating
and reverting the migration of Databricks SQL dashboards in the
workspace. The `migrate-dbsql-dashboards` command transforms all
Databricks SQL dashboards in the workspace after table migration,
tagging migrated dashboards and queries with `migrated by UCX` and
backing up original queries. The `revert-dbsql-dashboards` command
returns migrated Databricks SQL dashboards to their original state
before migration. Both commands accept a `--dashboard-id` flag for
migrating or reverting a specific dashboard. Additionally, two new
functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`,
have been added to the `cli.py` file, and new classes have been added to
interact with Redash for data visualization and querying. The
`make_dashboard` fixture has been updated to enhance testing
capabilities, and new unit tests have been added for migrating and
reverting DBSQL dashboards.
* Added UDFs assessment
([#1610](#1610)). A User
Defined Function (UDF) assessment feature has been introduced,
addressing issue
[#1610](#1610). A new
method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed
information about UDFs, including function description, input
parameters, and return types. This method has been integrated into
existing test cases, enhancing the validation of UDF metadata and
associated privileges, and ensuring system reliability. The UDF
constructor has been updated with a new parameter `comment`, initially
left blank in the test function. Additionally, two new columns,
`success` and `failures`, have been added to the `udfs` table in the
inventory database to store assessment data for UDFs. The UdfsCrawler
class has been updated to return a list of UDF objects, and the
assertions in the test have been updated accordingly. Furthermore, a new
SQL file has been added to calculate the total count of UDFs in the
$inventory.udfs table, with a widget displaying this information as a
counter visualization named "Total UDF Count".
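As a rough illustration of how `DESCRIBE FUNCTION` style output could feed such a record, here is a minimal sketch; the `Udf` fields, the mapping of the `Usage:` line to `comment`, and the parsing itself are simplified assumptions:

```python
from dataclasses import dataclass

@dataclass
class Udf:
    """Hypothetical slimmed-down UDF record; the real class has more fields."""
    name: str
    comment: str = ""
    success: int = 1
    failures: str = ""

def parse_describe_function(lines):
    # DESCRIBE FUNCTION emits "Key: value" lines (e.g. Function:, Usage:);
    # collect them into a dict keyed by the lower-cased label.
    info = {}
    for line in lines:
        key, _, value = line.partition(":")
        info[key.strip().lower()] = value.strip()
    return Udf(name=info.get("function", ""), comment=info.get("usage", ""))
```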
* Added `databricks labs ucx create-missing-principals` command to
create the missing UC roles in AWS
([#1495](#1495)). The
`databricks labs ucx` tool now includes a new command,
`create-missing-principals`, which creates missing Universal Catalog
(UC) roles in AWS for S3 locations that lack a UC compatible role. This
command is implemented using `IamRoleCreation` from
`databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with
the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new
command only supports AWS and does not affect Azure. The existing
`migrate_credentials` function has been updated to handle Azure Service
Principals migration. Additionally, new classes and methods have been
added, including `AWSUCRoleCandidate` in `aws.py`, and
`create_missing_principals` and `list_uc_roles` methods in `access.py`.
The `create_uc_roles_cli` method in `access.py` has been refactored and
renamed to `list_uc_roles`. New unit tests have been implemented to test
the functionality of `create_missing_principals` for AWS and Azure, as
well as testing the behavior when the command is not approved.
* Added baseline for workflow linter
([#1613](#1613)). This
change introduces the `WorkflowLinter` class in the `application.py`
file of the `databricks.labs.ucx.source_code.jobs` package. The class is
used to lint workflows by checking their dependencies and ensuring they
meet certain criteria, taking in arguments such as `workspace_client`,
`dependency_resolver`, `path_lookup`, and `migration_index`. Several
properties have been moved from `dependency_resolver` to the
`CliContext` class, and the `NotebookLoader` class has been moved to a
new location. Additionally, several classes and methods have been
introduced to build a dependency graph, resolve dependencies, and manage
allowed dependencies, site packages, and supported programming
languages. The `generic` and `redash` modules from
`databricks.labs.ucx.workspace_access` and the `GroupManager` class from
`databricks.labs.ucx.workspace_access.groups` are used. The
`VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from
`databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class
from `databricks.labs.ucx.installer.workflows` are also used. This
commit is part of a larger effort to improve workflow linting and
addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access
([#1606](#1606)). A new
`AstHelper` class has been added to provide utility functions for
working with abstract syntax trees (ASTs) in Python code, including
methods for extracting attribute and function call node names.
Additionally, a linter has been integrated to check for RDD use and JVM
access, utilizing the `AstHelper` class, which has been moved to a
separate module. A new file, `spark_connect.py`, introduces a linter
with three matchers to ensure conformance to best practices and catch
potential issues early in the development process related to RDD usage
and JVM access. The linter is environment-aware, accommodating shared
cluster and serverless configurations, and includes new test methods to
validate its functionality. These improvements enhance codebase quality,
promote reusability, and ensure performance and stability in Spark
cluster environments.
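The matcher idea is easy to reproduce with Python's built-in `ast` module. This sketch flags `.rdd` and `._jvm` attribute accesses, a simplified stand-in for the real linter's matchers, with invented message text:

```python
import ast

def find_spark_connect_issues(code: str):
    """Return (line, message) pairs for RDD use and JVM access."""
    problems = []
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            # df.rdd, sc.parallelize(...).rdd, etc.
            if node.attr == "rdd":
                problems.append((node.lineno, "RDD APIs are not supported"))
            # spark._jvm.SomeClass(), sc._jvm...
            if node.attr == "_jvm":
                problems.append((node.lineno, "JVM access is not supported"))
    return problems
```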
* Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in
migrate_table workflow
([#1621](#1621)). The
`migrate_tables` workflow in `workflows.py` has been enhanced to support
a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables
stored in DBFS root from the Hive Metastore to the Unity Catalog using
CTAS. Additionally, the ACL migration strategy has been updated to
include the AclMigrationWhat.PRINCIPAL strategy. The
`migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and
`migrate_views` tasks now incorporate the new ACL migration strategy.
These changes have been thoroughly tested through unit tests and
integration tests, ensuring the continued functionality of the existing
workflow while expanding its capabilities.
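A CTAS migration of a non-Delta DBFS-root table boils down to a statement of roughly this shape; the exact SQL the workflow emits may differ:

```python
def ctas_statement(src_schema, src_table, catalog, dst_schema, dst_table):
    # Copy the Hive metastore table's data into a new Unity Catalog table,
    # since non-Delta DBFS-root tables cannot be synced in place.
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{dst_schema}.{dst_table} "
        f"AS SELECT * FROM hive_metastore.{src_schema}.{src_table}"
    )
```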
* Added "seen tables" feature
([#1465](#1465)). The `seen
tables` feature has been introduced, allowing for better handling of
existing tables in the hive metastore and supporting their migration to
UC. This enhancement includes the addition of a `snapshot` method that
fetches and crawls table inventory, appending or overwriting records
based on assessment results. The `_crawl` function has been updated to
check for and skip existing tables in the current workspace. New methods
such as `_get_tables_paths_from_assessment`, `_overwrite_records`, and
`_get_table_location` have been included to facilitate these
improvements. In the testing realm, a new test
`test_mount_listing_seen_tables` has been implemented, replacing
`test_partitioned_csv_jsons`. This test checks the behavior of the
`TablesInMounts` class when enumerating tables in mounts for a specific
context, accounting for different table formats and managing external
and managed tables. The diff modifies the `locations.py` file in the
databricks/labs/ucx directory, related to the hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks
labs ucx migrate-tables` CLI command
([#1660](#1660)). This
commit adds support for the `migrate-tables-ctas` workflow in the
`databricks labs ucx migrate-tables` command, which checks for external
tables that cannot be synced and prompts the user to run the
`migrate-tables-ctas` workflow. Two new methods,
`test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts,
ctx=ctx)`, have been added. The first method checks if the
`migrate-external-tables-ctas` workflow is called correctly, while the
second method runs the workflow after prompting the user. The method
`test_migrate_external_hiveserde_tables_in_place(ws)` has been modified
to test if the `migrate-external-hiveserde-tables-in-place-experimental`
workflow is called correctly. No new methods or significant
modifications to existing functionality have been made in this commit.
The changes include updated unit tests and user documentation. The
target audience for this feature are software engineers who adopt the
project.
* Added support for migrating external location permissions from
interactive cluster mounts
([#1487](#1487)). This
commit adds support for migrating external location permissions from
interactive cluster mounts in Databricks Labs' UCX project, enhancing
security and access control. It retrieves interactive cluster locations
and user mappings from the AzureACL class, granting necessary
permissions to each cluster principal for each location. The existing
`databricks labs ucx` command is modified, with the addition of the new
method `create_external_locations` and thorough testing through manual,
unit, and integration tests. This feature is developed by vuong-nguyen
and Vuong and addresses issues
[#1192](#1192) and
[#1193](#1193), ensuring a
more robust and controlled user experience with interactive clusters.
* Added uber principal spn details in SQL warehouse data access
configuration when creating uber-SPN
([#1631](#1631)). In this
release, we've implemented new features to enhance the security and
control over data access during the migration process for the SQL
warehouse data access configuration. The `databricks labs ucx
create-uber-principal` command now creates a service principal with
read-only access to all the storage used by tables in the workspace. The
UCX Cluster Policy and SQL Warehouse data access configuration will be
updated to use this service principal for migration workflows. A new
method, `_update_sql_dac_with_instance_profile`, has been introduced in
the `access.py` file to update the SQL data access configuration with
the provided AWS instance profile, ensuring a more streamlined
management of instance profiles within the SQL data access configuration
during the creation of an uber service principal (SPN). Additionally,
new methods and tests have been added to the sql module of the
databricks.sdk.service package to improve Azure resource permissions,
handling different scenarios related to creating a global SPN in the
presence or absence of various conditions, such as storage, cluster
policies, or secrets.
* Addressed issue with disabled features in certain regions
([#1618](#1618)). In this
release, we have implemented improvements to address an issue where
certain features were disabled in specific regions. We have added error
handling when listing serving endpoints to raise a NotFound error if a
feature is disabled, preventing the code from failing silently and
providing better error messages. A new method,
test_serving_endpoints_not_enabled, has been added, which creates a mock
WorkspaceClient and raises a NotFound error if serving endpoints are not
enabled for a shard. The GenericPermissionsSupport class uses this
method to get crawler tasks, and if serving endpoints are not enabled,
an error message is logged. These changes increase the reliability and
robustness of the codebase by providing better error handling and
messaging for this particular issue. Additionally, the change includes
unit tests and manual testing to ensure the proper functioning of the
new features.
* Aggregate UCX output across workspaces with CLI command
([#1596](#1596)). A new
`report-account-compatibility` command has been added to the `databricks
labs ucx` tool, enabling users to evaluate the compatibility of an
entire Azure Databricks account with UCX. This
command generates a readiness report for an Azure Databricks account,
specifically for evaluating compatibility with UCX, by querying various
aspects of the account such as clusters, configurations, and data
formats. It uses Azure CLI authentication with AAD tokens for
authentication and accepts a profile as an argument. The output includes
warnings for workspaces that do not have UCX installed, and provides
information about unsupported cluster types, unsupported configurations,
data format compatibility, and more. Additionally, a new feature has
been added to aggregate UCX output across workspaces in an account
through a new CLI command, "report-account-compatibility", which can be
run at the account level. The existing `manual-workspace-info` command
remains unchanged. These changes will help assess the readiness and
compatibility of an Azure Databricks account for UCX integration and
simplify the process of checking compatibility across an entire account.
* Assert if group name is in cluster policy
([#1665](#1665)). In this
release, we have implemented a change to ensure the presence of the
display name of a specific workspace group (ws_group_a) in the cluster
policy. This is to prevent a key error previously encountered. The
cluster policy is now loaded as a dictionary, and the group name is
checked to confirm its presence. If the group is not found, an assertion
error is raised to alert users. Additionally, the permission level for the group
is verified to ensure it is set to CAN_USE. No new methods have been
added, and existing functionality remains unchanged. The test file
test_ext_hms.py has been updated to include the new assertion and has
undergone both unit tests and manual testing to ensure proper
implementation. This change is intended for software engineers who adopt
the project.
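The check described above amounts to something like this sketch; the policy JSON shape and the permissions mapping are simplified assumptions for illustration:

```python
import json

def check_group_in_policy(policy_json: str, group: str, permissions: dict) -> bool:
    """Assert a workspace group appears in the policy and holds CAN_USE."""
    policy = json.loads(policy_json)
    # the group's display name must appear somewhere in the policy definition,
    # avoiding the KeyError the old test could hit
    if group not in json.dumps(policy):
        raise AssertionError(f"group {group} not found in cluster policy")
    # and its permission level on the policy must be CAN_USE
    if permissions.get(group) != "CAN_USE":
        raise AssertionError(f"group {group} does not have CAN_USE")
    return True
```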
* Automatically retrying with `auth_type=azure-cli` when constructing
`workspace_clients` on Azure
([#1650](#1650)). This
commit introduces automatic retrying with 'auth_type=azure-cli' when
constructing `workspace_clients` on Azure, resolving TODO items for
`AccountWorkspaces` and adding relevant suggestions in
'troubleshooting.md'. It closes issues
[#1574](#1574) and
[#1430](#1430), and includes
new methods for generating readiness reports in `AccountAggregate` and
testing the `get_accessible_workspaces` method in 'test_workspaces.py'.
User documentation has been updated and the changes have been manually
verified in a staging environment. For macOS and Windows users, explicit
auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure
storage account for principal-prefix-access
([#1576](#1576)). This
release introduces several enhancements to the identification of service
principals with custom roles on Azure storage accounts for
principal-prefix-access. New methods such as `_get_permission_level`,
`_get_custom_role_privilege`, and `_get_role_privilege` have been added
to improve the functionality of the module. Additionally, two new
classes, AzureRoleAssignment and AzureRoleDetails, have been added to
enable more detailed management and access control for custom roles on
Azure storage accounts. The 'test_access.py' file has been updated to
include tests for saving custom roles in Azure storage accounts and
ensuring the correct identification of service principals with custom
roles. A new unit test function, test_role_assignments_custom_storage(),
has also been added to verify the behavior of custom roles in Azure
storage accounts. Overall, these changes provide a more efficient and
fine-grained way to manage and control custom roles on Azure storage
accounts.
* Clarified unsupported config in compute crawler
([#1656](#1656)). In this
release, we have made significant changes to clarify and improve the
handling of unsupported configurations in our compute crawler related to
the Hive metastore. We have expanded error messages for unsupported
configurations and provided detailed recommendations for remediation.
Additionally, we have added relevant user documentation and manually
tested the changes. The changes include updates to the configuration for
external Hive metastore and passthrough security model for Unity
Catalog, which are incompatible with the current configurations. We
recommend removing or altering the configs while migrating existing
tables and views using UCX or other compatible clusters, and mapping the
passthrough security model to a security model compatible with Unity
Catalog. The code modifications include the addition of new methods for
checking cluster init script and Spark configurations, as well as
refining the error messages for unsupported configurations. We also
added a new assertion in the `test_cluster_with_multiple_failures` unit
test to check for the presence of a specific message regarding the use
of the `spark.databricks.passthrough.enabled` configuration. This
release is not yet verified on the staging environment.
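The config check reduces to scanning a cluster's Spark conf for known-incompatible keys. A hedged sketch, with the message text invented for illustration:

```python
INCOMPATIBLE_SPARK_CONF = {
    # illustrative message; the real crawler's wording differs
    "spark.databricks.passthrough.enabled": (
        "passthrough security model is incompatible with Unity Catalog"
    ),
}

def check_spark_conf(conf: dict) -> list:
    """Return a failure message for each incompatible config that is enabled."""
    failures = []
    for key, message in INCOMPATIBLE_SPARK_CONF.items():
        if conf.get(key, "false").lower() == "true":
            failures.append(f"unsupported config: {key} ({message})")
    return failures
```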
* Created a unique default schema when External Hive Metastore is
detected ([#1579](#1579)). A
new default database `ucx` is introduced for storing inventory in the
hive metastore, with a suffix consisting of the workspace's client ID to
ensure uniqueness when an external hive metastore is detected. The
`has_ext_hms()` method is added to the `InstallationPolicy` class to
detect external HMS and thereby create a unique default schema. The
`_prompt_for_new_installation` method's default value for the `Inventory
Database stored in hive_metastore` prompt is updated to use the new
default database name, modified to include the workspace's client ID if
external HMS is detected. Additionally, a test function
`test_save_config_ext_hms` is implemented to demonstrate the
`WorkspaceInstaller` class's behavior with external HMS, creating a
unique default schema for improved system functionality and
customization. This change is part of issue
[#1579](#1579).
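The naming rule can be captured in one helper; the function name and exact suffix format are assumptions based on the description:

```python
def default_inventory_database(workspace_client_id: str, has_external_hms: bool) -> str:
    # Suffix the default `ucx` database with the workspace's client ID only
    # when an external Hive metastore is detected, to keep names unique.
    return f"ucx_{workspace_client_id}" if has_external_hms else "ucx"
```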
* Extend service principal migration to create storage credentials for
access connectors created for each storage account
([#1426](#1426)). This
commit extends the service principal migration to create storage
credentials for access connectors associated with each storage account,
resolving issues
[#1384](#1384) and
[#875](#875). The update
includes modifications to the existing `databricks labs ucx` command for
creating access connectors, adds a new CLI command for creating storage
credentials, and updates the documentation. A new workflow has been
added for creating credentials for access connectors and service
principals, and updates have been made to existing workflows. The commit
includes manual, unit, and integration tests, and no new or modified
methods are specified in the diff. The focus is on the feature
description and its impact on the project's functionality. The commit
has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to
access Azure Storage Accounts behind firewall
([#1589](#1589)). In this
release, we have introduced a new feature to improve access to Azure
Storage Accounts that are protected by firewalls. Due to limitations
with service principals in such scenarios, we have developed Access
Connectors with Managed Identities for more reliable connectivity. This
change includes updates to the 'credentials.py' file, which introduces
new methods for managing the migration of service principals to Access
Connectors using Managed Identities. Users are warned that migrating to
this new feature may cause issues when transitioning to UC, and are
advised to validate external locations after running the migration
command. This update enhances the security and functionality of the
system, providing a more dependable method for accessing Azure Storage
Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have
different target schemas
([#1581](#1581)). In this
release, we have implemented a fix to address an issue where
catalog/schema grants were not being handled correctly when tables with
the same source schema had different target schemas. This was causing
problems with granting appropriate permissions to users. We have
modified the prepare_test function to include an additional test case
with a different target schema for the same source table. Furthermore,
we have updated the test_catalog_schema_acl function to ensure that
grants are being created correctly for all catalogs, schemas, and
tables. We have also added an extra query to grant use schema
permissions for catalog2.schema3 to user1. Additionally, we have
introduced a new `SchemaInfo` class to store information about catalogs
and schemas, and refactored the `_get_database_source_target_mapping`
method to return a dictionary that maps source databases to a list of
`SchemaInfo` objects instead of a single dictionary. These changes
ensure that grants are being handled correctly for catalogs, schemas,
and tables, even when tables with the same source schema have different
target schemas. This will improve the overall functionality and
reliability of the system, making it easier for users to manage their
catalogs and schemas.
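The refactored mapping, from a source database to a list of `SchemaInfo` targets, can be sketched as follows; the rule shape is an assumption:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class SchemaInfo:
    catalog: str
    schema: str

def database_to_schemas(rules):
    """Build source-db -> [SchemaInfo], so one source schema can fan out
    to several target schemas without losing any of them."""
    mapping = defaultdict(list)
    for src_db, catalog, dst_schema in rules:
        info = SchemaInfo(catalog, dst_schema)
        if info not in mapping[src_db]:  # dedupe repeated targets
            mapping[src_db].append(info)
    return dict(mapping)
```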
* Fixed Spark configuration parameter referencing secret
([#1635](#1635)). In this
release, the code related to the Spark configuration parameter reference
for a secret has been updated in the `access.py` file, specifically
within the `_update_cluster_policy_definition` method. The change
modifies the method to retrieve the OAuth client secret for a given
storage account using an f-string to reference the secret, replacing the
previous concatenation operator. This enhancement is aimed at improving
the readability and maintainability of the code while preserving its
functionality. Furthermore, the commit includes additional changes, such
as new methods `test_create_global_spn` and `cluster_policies.edit`,
which may be related to this fix. These changes address the secret
reference issue, ensuring secure access control and improved
integration, particularly with the Spark configuration, benefiting
engineers utilizing this project for handling sensitive information and
managing clusters securely and effectively.
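For context, Databricks Spark confs reference secrets with a `{{secrets/<scope>/<key>}}` placeholder, which an f-string builds cleanly; the scope and key names below are examples, not the project's actual values:

```python
def secret_conf_value(scope: str, key: str) -> str:
    # Databricks resolves this placeholder to the secret's value at
    # cluster start; doubled braces escape the literal {{ and }}.
    return f"{{{{secrets/{scope}/{key}}}}}"
```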
* Fixed `migration-locations` and `assign-metastore` definitions in
`labs.yml` ([#1627](#1627)).
In this release, the `migration-locations` command in the `labs.yml`
file has been updated to include new flags `subscription-id` and
`aws-profile`. The `subscription-id` flag allows users to specify the
subscription to scan the storage account in, and the `aws-profile` flag
allows for authentication using a specified AWS Profile. The
`assign-metastore` command has also been updated with a new description:
"Enable Unity Catalog features on a workspace by assigning a metastore
to it." The `is_account_level` parameter remains unchanged, and the new
optional flag `workspace-id` has been added, allowing users to specify
the Workspace ID to assign a metastore to. This change enhances the
functionality of the `migration-locations` and `assign-metastore`
commands, providing more options for users to customize their storage
scanning and metastore assignment processes. The `migration-locations`
and `assign-metastore` definitions in the `labs.yml` file have been
fixed in this release.
* Fixed prompt for using external metastore
([#1668](#1668)). A fix has
been implemented in the `create` function of the `policy.py` file to
correctly prompt users for using an external metastore. Previously, a
missing period and space in the prompt caused potential confusion. The
updated prompt now includes a clarifying sentence and the
`_prompts.confirm` method has been modified to check if the user wants
to set UCX to connect to an external metastore in two scenarios: when
one or more cluster policies are set up for an external metastore, and
when the workspace warehouse is configured for an external metastore. If
the user chooses to set up an external metastore, an informational
message will be recorded in the logger. This change ensures clear and
precise communication with users during the external metastore setup
process.
* Fixed storage account network ACLs retrieved from properties
([#1620](#1620)). This
release includes a fix to the storage account network ACLs retrieval in
the open-source library, addressing issue
[#1](#1). Previously, the
network ACLs were being retrieved from an incorrect location, but this
commit corrects that by obtaining the network ACLs from the storage
account's properties.networkAcls field. The `StorageAccount` class has
been updated to modify the way default network action is retrieved, with
a new value `Unknown` added to the previous values `Deny` and "Allow".
The `from_raw_resource` class method has also been updated to retrieve
the default network action from the `properties.networkAcls` field
instead of the `networkAcls` field. This change may affect any
functionality that relies on network ACL information and impacts the
existing command `databricks labs ucx ...`. Relevant tests, including a
new test `test_azure_resource_storage_accounts_list_non_zero`, have been
added and manually and unit tested to ensure the fix is functioning
correctly.
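The corrected lookup reads the ARM resource's `properties.networkAcls` field, roughly as follows (a sketch, assuming the raw resource is a plain dict):

```python
def default_network_action(raw: dict) -> str:
    # Network ACLs live under properties.networkAcls, not at the top
    # level; fall back to "Unknown" when the field is absent.
    acls = raw.get("properties", {}).get("networkAcls", {})
    return acls.get("defaultAction", "Unknown")
```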
* Fully refresh table migration status in table migration workflow
([#1630](#1630)). This
release introduces a new method, `index_full_refresh()`, to the table
migration workflow for fully refreshing the migration status, addressing
an oversight from a previous commit
([#1623](#1623)) and
resolving issue
[#1628](#1628). The new
method resets the `_migration_status_refresher` before computing the
index, ensuring the latest migration status is used for determining
whether view dependencies have been migrated. The `index()` method was
previously used to refresh the migration status, but it only provided a
partial refresh. With this update, `index_full_refresh()` is utilized
for a comprehensive refresh, affecting the `refresh_migration_status`
task in multiple workflows such as `migrate_views`,
`scan_tables_in_mounts_experimental`, and others. This change ensures a
more accurate migration report, presenting the updated migration status.
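The difference between the partial and full refresh can be sketched with hypothetical stand-in classes (the real `TablesMigrator` and refresher have a richer interface):

```python
# Minimal sketch of partial vs full refresh of the migration status index.
class MigrationStatusRefresher:
    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = None

    def reset(self):
        self._cache = None

    def snapshot(self):
        # lazily computed and then cached
        if self._cache is None:
            self._cache = self._fetch()
        return self._cache


class TablesMigrator:
    def __init__(self, refresher):
        self._refresher = refresher

    def index(self):
        # may serve a stale cached snapshot
        return self._refresher.snapshot()

    def index_full_refresh(self):
        # reset first, so the snapshot is recomputed from the latest state
        self._refresher.reset()
        return self._refresher.snapshot()
```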
* Ignore existing corrupted installations when refreshing
([#1605](#1605)). A recent
update has enhanced the error handling during the loading of
installations in the `install.py` file. Specifically, the
`installation.load` function now handles certain errors, including
`PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by
logging a warning message and skipping the corrupted installation
instead of raising an error. This behavior has been incorporated into
both the `configure` and `_check_inventory_database_exists` functions,
allowing the installation process to continue even in the presence of
issues with existing installations, while providing improved error
messages. This change resolves issue
[#1601](#1601) and
introduces a new test case for a corrupted installation configuration,
as well as an updated existing test case for `test_save_config` that
includes a mock installation.
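The skip-and-continue behavior can be sketched as follows; the error types and function shape are simplified stand-ins for the SDK exceptions the real code catches.

```python
# Sketch: skip corrupted installations with a warning instead of failing.
import logging

logger = logging.getLogger("ucx.install")


def load_installations(installations, loader):
    """Yield configs for loadable installations, skipping corrupted ones."""
    for install in installations:
        try:
            yield loader(install)
        except (PermissionError, ValueError, AttributeError) as err:
            # previously any of these aborted the whole install process
            logger.warning(f"Existing installation at {install} is corrupted, skipping: {err}")
            continue
```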
* Improved exception handling
([#1584](#1584)). In this
release, the exception handling during the upload of a wheel file to
DBFS has been significantly improved. Previously, only PermissionDenied
errors were caught and handled. Now, both BadRequest and
PermissionDenied exceptions will be caught and logged as a warning. This
change enhances the robustness of the code by handling a wider range of
exceptions during the upload process. In addition, cluster overrides
have been configured and DBFS write permissions have been set up. The
specific changes made to the code include updating the import statement
for NotFound to include BadRequest and modifying the except block in the
_get_init_script_data method to catch both NotFound and BadRequest
exceptions. These improvements ensure that the code can handle more
types of errors, providing more helpful error messages and preventing
crash scenarios, thereby enhancing the reliability and robustness of the
code.
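The widened except clause can be sketched like this; `BadRequest` and `PermissionDenied` are defined locally as stand-ins for the SDK error classes.

```python
# Sketch: catch both BadRequest and PermissionDenied during a DBFS upload.
import logging

logger = logging.getLogger("ucx.install")


class BadRequest(Exception):
    """Stand-in for the SDK's BadRequest error."""


class PermissionDenied(Exception):
    """Stand-in for the SDK's PermissionDenied error."""


def upload_wheel(upload, path: str, data: bytes) -> bool:
    try:
        upload(path, data)
        return True
    except (BadRequest, PermissionDenied) as err:
        # previously only PermissionDenied was handled; BadRequest crashed the install
        logger.warning(f"Cannot upload wheel to {path}: {err}")
        return False
```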
* Improved exception handling for `migrate_acl`
([#1590](#1590)). In this
release, the `migrate_acl` functionality has been enhanced to improve
exception handling, addressing a flakiness issue in the
`test_migrate_managed_tables_with_acl` test. Previously, unhandled `not
found` exceptions during parallel test execution caused the flakiness.
This release resolves this issue
([#1549](#1549)) by
introducing error handling in the
`test_migrate_acls_should_produce_proper_queries` test. A controlled
error is now introduced to simulate a failed grant migration due to a
`TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise
testing of error handling and logging mechanisms when migration fails
for specific objects, ensuring a more reliable testing environment for
the `migrate_acl` functionality.
* Improved reliability of table migration status refresher
([#1623](#1623)). This
release introduces improvements to the table migration status refresher
in the open-source library, enhancing its reliability and robustness.
The `table_migrate` function has been updated to ensure that the table
migration status is always reset when requesting the latest snapshot,
addressing issues
[#1623](#1623),
[#1622](#1622), and
[#1615](#1615).
Additionally, the function now handles `NotFound` errors when refreshing
migration status. The `get_seen_tables` function has been modified to
convert the returned iterator to a list and raise a `NotFound` exception
if the schema does not exist, which is then caught and logged as a
warning. Furthermore, the migration status reset behavior has been
improved, and the `migration_status_refresher` parameter type in the
`TableMigrate` class constructor has been modified. New private methods
`_index_with_reset()` and updated `_migrate_views()` and
`_view_can_be_migrated()` methods have been added to ensure a more
accurate and consistent table migration process. The changes have been
thoroughly tested and are ready for review.
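The iterator materialization described above matters because a lazy iterator would defer the schema lookup past the point where the error can be handled. A simplified sketch (the stand-in `NotFound` and the in-place handling are assumptions; in the real code the exception propagates to a caller that logs it):

```python
# Sketch: force the lazy table listing so a missing schema surfaces
# as a handled NotFound rather than a late, unhandled failure.
import logging

logger = logging.getLogger("ucx.migrate")


class NotFound(Exception):
    """Stand-in for the SDK's NotFound error."""


def get_seen_tables(list_tables, schema: str) -> list:
    try:
        tables = list(list_tables(schema))  # materialize the iterator here
    except NotFound:
        logger.warning(f"Schema {schema} no longer exists")
        return []
    return tables
```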
* Refresh migration status at the end of the `migrate_tables` workflows
([#1599](#1599)). In this
release, updates have been made to the migration status at the end of
the `migrate_tables` workflows, with no new or modified tables or
methods introduced. The `_migration_status_refresher.reset()` method has
been added in two locations to ensure accurate migration status updates.
A new `refresh_migration_status` method has been included in the
`RuntimeContext` class in the
`databricks.labs.ucx.hive_metastore.workflows` module, which refreshes
the migration status for presentation in the dashboard. The changes also
include the addition of the `refresh_migration_status` task in
`migrate_views`, `migrate_views_with_acl`, and
`scan_tables_in_mounts_experimental` workflows, and the
`migration_report` method is now dependent on the
`refresh_migration_status` task. Thorough testing has been conducted,
including the creation of a new integration test in the file
`tests/integration/hive_metastore/test_workflows.py` to verify that the
migration status is refreshed after the migration job is run. These
changes aim to ensure that the migration status is up-to-date and
accurately presented in the dashboard.
* Removed DBFS library installations
([#1554](#1554)). In this
release, the "configure.py" file has been removed, which previously
contained the `ConfigureClusterOverrides` class with methods for
validating cluster IDs, distinguishing between classic and Table Access
Control (TACL) clusters, and building a prompt for users to select a
valid active cluster ID. The removal of this file signifies that these
functionalities are no longer available. This change is part of a larger
commit that also removes DBFS library installations and updates the
Estimates Dashboard to remove metastore assignment, addressing issue
[#1098](#1098). The commit
has been tested via integration tests and manual installation and
running of UCX on a no-uc environment. Please note that the
`create_jobs` method in the `install.py` file has been updated to
reflect these changes, ensuring a more straightforward installation
experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt
([#1664](#1664)). In this
release, we have removed the `is_terraform_used` prompt from the
configuration file and the installation process in the ucx package. This
prompt was not being utilized and had been a source of confusion for
some users. Although the variable that stored its outcome will be
retained for backwards compatibility, no new methods or modifications to
existing functionality have been introduced. No tests have been added or
modified as part of this change. The removal of this prompt simplifies
the configuration process and aligns with the project's future plans to
eliminate the use of Terraform state for ucx migration. Manual testing
has been conducted to ensure that the removal of the prompt does not
affect the functionality of other properties in the configuration file
or the installation process.
* Resolve relative paths when building dependency graph
([#1608](#1608)). This
commit introduces support for resolving relative paths when building a
dependency graph in the UCX project, addressing issues
[#1202](#1202), [#1499](#1499), and
[#1287](#1287). The SysPathProvider now includes a `cwd` attribute, and a new
class, LocalNotebookLoader, has been implemented to handle local files
and folders. The PathLookup class is used to resolve paths, and new
methods have been added to support these changes. Unit tests have been
provided to ensure the correct functioning of the new functionality.
This commit replaces issue [#1593](#1593) and enhances the project's ability to
handle local files and folders, resulting in a more robust and reliable
dependency graph.
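The `cwd`-based resolution can be sketched with a hypothetical `PathLookup` shape (the actual class carries more state and falls back to the `sys.path` entries):

```python
# Illustrative sketch: resolve a relative path against the lookup's cwd.
from pathlib import PurePosixPath


class PathLookup:
    def __init__(self, cwd: str, sys_paths: list):
        self.cwd = PurePosixPath(cwd)
        self.sys_paths = [PurePosixPath(p) for p in sys_paths]

    def resolve(self, path: str) -> PurePosixPath:
        candidate = PurePosixPath(path)
        if candidate.is_absolute():
            return candidate
        # relative paths resolve against cwd first (the new behavior);
        # the real implementation then tries each sys.path entry
        return self.cwd / candidate
```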
* Show tables migration status in migration dashboard
([#1507](#1507)). A
migration dashboard has been added to display the status of data object
migrations, addressing issue
[#323](#323). This new
feature includes a query to show the migration status of tables, a new
CLI command, and a modification to an existing command. The
`migration-*` workflow has been updated to include a refresh migration
dashboard option. The `mock_installation` function has been modified
with an updated state.json file. The changes consist of manual testing
and can be found in the `migrations/main` directory as a new SQL query
file. This migration dashboard provides users with an easier way to
monitor the progress and status of their data migration tasks.
* Simulate loading of local files or notebooks after manipulation of
`sys.path` ([#1633](#1633)).
This commit updates the PathLookup process during the construction of
the dependency graph, addressing issues
[#1202](#1202) and
[#1468](#1468). It
simplifies the DependencyGraphBuilder by directly using the
DependencyResolver with resolvers and lookup passed as arguments, and
removes the DependencyGraphBuilder. The changes include new methods for
handling compatibility checks, but no new user-facing features or
changes to command-line interfaces or existing workflows are introduced.
Unit tests are included to ensure correct behavior. The modifications
aim to improve the internal handling of dependency resolution and
compatibility checks.
* Test if `create-catalogs-schemas` works with tables defined as mount
paths ([#1578](#1578)). This
release includes a new unit test for the `create-catalogs-schemas` logic
that verifies the correct creation and management of catalogs and
schemas defined as mount paths. The test checks the storage location of
catalogs, ensures non-existing schemas are properly created, and
prevents the creation of catalogs without a storage location. It also
verifies the catalog schema ACL is set correctly. Using the
`CatalogSchema` class and various test functions, the test creates and
grants permissions to catalogs and schemas. This change resolves issue
[#1039](#1039) without
modifying any existing commands or workflows. The release contains no
new CLI commands or user documentation, but includes unit tests and
assertion calls to validate the behavior of the
`create_all_catalogs_schemas` method.
* Upgraded `databricks-sdk` to 0.27
([#1626](#1626)). In this
release, the `databricks-sdk` package has been upgraded to version 0.27,
bringing updated methods for Redash objects. The `_install_query` method
in the `dashboards.py` file has been updated to include a `tags`
parameter, set to `None`, when calling `self._ws.queries.update` and
`self._ws.queries.create`. This ensures that the updated SDK version is
used and that tags are not applied during query updates and creation.
Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint`
packages have been updated to versions 0.4.0 and 0.4.3 respectively, and
the dependency for PyYAML has been updated to a version between 6.0.0
and 7.0.0. These updates may impact the functionality of the project.
The changes have been manually tested, but there is no verification on a
staging environment.
* Use stack of dependency resolvers
([#1560](#1560)). This pull
request introduces a stack-based implementation of resolvers, resolving
issues [#1202](#1202),
[#1499](#1499), and
[#1421](#1421), and
implements an initial version of SysPathProvider, while eliminating
previous hacks. The new functionality includes modified existing
commands, a new workflow, and the addition of unit tests. No new
documentation or CLI commands have been added. The `problem_collector`
parameter is not addressed in this PR and has been moved to a separate
issue. The changes include renaming and moving a Python file, as well as
modifications to the `Notebook` class and its related methods for
handling notebook dependencies and dependency checking. The code has
been tested, but manual testing and integration tests are still pending.
* Added handling for legacy ACL `DENY` permission in group migration ([#1815](#1815)). In this release, the handling of `DENY` permissions during group migrations in our legacy ACL table has been improved. Previously, `DENY` operations were denoted with a `DENIED` prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the _apply_grant_sql method to check for the presence of `DENIED` in the action_type, removing the prefix, and enclosing the action type in backticks to prevent syntax errors. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue [#1803](#1803). A new test function, test_hive_deny_sql(), has also been added to test the behavior of the `DENY` permission.
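The prefix-stripping and backtick-quoting can be sketched as below; the helper name and statement shapes are simplified assumptions about the real `_apply_grant_sql` method.

```python
# Sketch: legacy ACLs encode denies as e.g. "DENIED_SELECT"; strip the
# prefix, emit DENY instead of GRANT, and backtick the action to avoid
# syntax errors.
def apply_grant_sql(action_type: str, object_type: str,
                    object_key: str, principal: str) -> str:
    if "DENIED" in action_type:
        action = action_type.replace("DENIED_", "")
        return f"DENY `{action}` ON {object_type} {object_key} TO `{principal}`"
    return f"GRANT `{action_type}` ON {object_type} {object_key} TO `{principal}`"
```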
* Added handling for parsing corrupted log files ([#1817](#1817)). The `logs.py` file in the `src/databricks/labs/ucx/installer` directory has been updated to improve the handling of corrupted log files. A new block of code has been added to check if the logs match the expected format, and if they don't, a warning message is logged and the function returns, preventing further processing and potential production of incorrect results. The changes include a new method `test_parse_logs_warns_for_corrupted_log_file` that verifies the expected warning message and corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files.
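The bail-out-on-mismatch behavior can be sketched as follows; the regex is an assumed log format, not the actual UCX pattern.

```python
# Sketch: stop parsing with a warning when a log line does not match the
# expected format, instead of producing incorrect results.
import logging
import re

logger = logging.getLogger("ucx.logs")
# assumed format: "07:32:01 INFO [module] message"
_LOG_FORMAT = re.compile(r"\d{2}:\d{2}:\d{2}\s+\w+\s+\[.+\]")


def parse_logs(lines):
    for line in lines:
        if not _LOG_FORMAT.match(line):
            logger.warning(
                f"Logs do not match expected format, stopping parsing: {line.rstrip()}")
            return
        yield line
```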
* Added known problems with `pyspark` package ([#1813](#1813)). In this release, updates have been made to the `src/databricks/labs/ucx/source_code/known.json` file to document known issues with the `pyspark` package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. A new `KnownProblem` dataclass has been added to the `known.py` file, which includes methods for converting the object to a dictionary for better encoding of problems. The `_analyze_file` method has also been updated to use a `known_problems` set of `KnownProblem` objects, improving readability and management of known problems within the application. These changes address issue [#1813](#1813) and improve the documentation of known issues with `pyspark`.
* Added library linting for jobs launched on shared clusters ([#1689](#1689)). This release includes an update to add library linting for jobs launched on shared clusters, addressing issue [#1637](#1637). A new function, `_register_existing_cluster_id(graph: DependencyGraph)`, has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the `test_jobs.py` file in the `tests/integration/source_code` directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the `jobs` and `compute` modules from the `databricks.sdk.service` package. Additionally, a new `WorkflowTaskContainer` method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by ensuring that jobs run smoothly on shared clusters by checking for and handling missing libraries. Software engineers will benefit from these improvements as it will reduce the occurrence of errors due to missing libraries on shared clusters.
* Added linters to check for spark logging and configuration access ([#1808](#1808)). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters and updates have been made to existing ones. The new linters have been added to the `SparkConnectLinter` class and are executed as part of the `databricks labs ucx` command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected.
* Added list of known dependency compatibilities and regeneration infrastructure for it ([#1747](#1747)). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the `known.json` file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library.
* Added more known libraries from Databricks Runtime ([#1812](#1812)). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. These libraries are now integrated in the known libraries file in the JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios.
* Added more known packages from Databricks Runtime ([#1814](#1814)). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility.
* Added support for `.egg` Python libraries in jobs ([#1789](#1789)). This commit adds support for `.egg` Python libraries in jobs by registering egg library dependencies to DependencyGraph for linting, addressing issue [#1643](#1643). It includes the addition of a new method, `PythonLibraryResolver`, which replaces the old `PipResolver`, and is used to register egg library dependencies in the `DependencyGraph`. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the 'test_dependencies.py' file, specifically in the import section where `PipResolver` is replaced with `PythonLibraryResolver` from the 'databricks.labs.ucx.source_code.python_libraries' package. These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from `.egg` files.
* Added table migration workflow guide ([#1607](#1607)). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, and Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and improved user experience.
* Added workflow linter for spark python tasks ([#1810](#1810)). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the `_register_spark_python_task` method in the `jobs.py` file. If the task is not a Spark Python task, an empty list is returned, and if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. The `test_job_spark_python_task_linter_happy_path` test checks the linter on a valid job configuration where all required libraries are specified, while the `test_job_spark_python_task_linter_unhappy_path` test checks the linter on an invalid job configuration where required libraries are not specified. These tests ensure that the workflow linter for Spark Python tasks is functioning correctly and can help identify any potential issues in job configurations.
* Connect all linters to `LinterContext` and add functional testing framework ([#1811](#1811)). This commit connects all linters, including those related to JVM, to the critical path for improved code linting, and introduces a functional testing framework to simplify the writing of code linting verification tests. The `pyproject.toml` file has been updated to include a new configuration for the `ignore-paths` option, utilizing a regular expression to exclude certain files or directories from linting. The testing framework is particularly useful for verifying the correct functioning of linters, reducing the risk of errors and improving the overall development experience. These changes will help to improve the reliability and efficiency of the linting process, making it easier to write and maintain high-quality code.
* Deduplicate errors emitted by Spark Connect linter ([#1824](#1824)). This pull request introduces error deduplication for the Spark Connect linter and adds new functional tests using an updated framework. The modifications include the addition of user documentation and unit tests, as well as alterations to existing commands and workflows. Specifically, a new CLI command has been added, and the command `databricks labs ucx ...` has been modified. Additionally, a new workflow has been implemented, and an existing workflow has been updated. No new tables or modifications to existing tables are present. Testing has been conducted through manual testing and new unit tests, with no integration tests or staging environment tests specified. The `verify` method in the `test_functional.py` file has been updated to sort the actual problems list before comparing it to the expected problems list, ensuring consistent ordering of results. The changes aim to improve the functionality and usability of the Spark Connect linter for our software engineer audience.
* Download wheel dependency locally to register it to the dependency graph ([#1704](#1704)). A new feature has been implemented in the open-source library to enhance dependency management for wheel files. Previously, when the library type was wheel, a `not-yet-implemented` DependencyProblem would be yielded. Now, the system downloads the wheel file from a remote location, saves it to a temporary directory, and registers the local file to the dependency graph. This allows for more comprehensive handling of wheel dependencies, as they are now downloaded and registered instead of simply being flagged as "not-yet-implemented". Additionally, new functions for creating jobs, making notebooks, and generating random values have been added to enable more comprehensive testing of the workflow linter. New tests have been implemented to check the linter's behavior when there is a missing library dependency and to verify that the linter correctly handles wheel dependencies. These changes improve the testing capabilities of the workflow linter and ensure that all dependencies are properly accounted for and managed within the system. A new test method, 'test_workflow_task_container_builds_dependency_graph_for_python_wheel', has been added to ensure that the dependency graph is built correctly for Python wheels and to improve test coverage.
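The download-then-register flow can be sketched as below; the function signature and callbacks are hypothetical, standing in for the workspace client and `DependencyGraph` registration.

```python
# Sketch: copy a remote wheel into a temp dir and register the local file,
# instead of yielding a "not-yet-implemented" DependencyProblem.
import shutil
import tempfile
from pathlib import Path


def register_wheel(graph_register, open_remote, remote_path: str) -> Path:
    """Download a remote wheel locally and register it on the graph."""
    tmp_dir = Path(tempfile.mkdtemp())
    local_path = tmp_dir / Path(remote_path).name
    with open_remote(remote_path) as src, local_path.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    graph_register(local_path)  # register the local file, not the remote path
    return local_path
```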
* Drop pyspark `register` lint matcher ([#1818](#1818)). In the latest release, the `register` lint matcher has been removed from pyspark, indicating that the specific usage pattern for the `register` method in UDTFRegistration is no longer required. This change affects the linting process during code reviews, but does not impact the functionality of the code directly. Other matchers for DataFrame, DataFrameReader, DataFrameWriter, and direct filesystem access remain unchanged. The `register` method, which was likely used to register a temporary table or view in pyspark, is no longer considered a best practice or necessary feature. If you previously relied on the `register` method in your pyspark code, you will need to find an alternative solution. This update aims to improve the quality and consistency of pyspark code by removing outdated or unnecessary functionality.
* Enabled joining an existing installation to a collection ([#1799](#1799)). This change introduces several new features and modifications to the open-source library, aimed at enhancing the management and organization of workspaces within a collection. A new command `join-collection` has been added to allow a workspace to join a collection using its workspace ID. The `report-account-compatibility` command has been updated with a new flag `--workspace-ids`, and the `alias` command has been updated with a new description. Two new commands `principal-prefix-access` and `create-missing-principals` have been introduced for AWS, and a new command `create-uber-principal` has been introduced for Azure to handle the creation of service principals with STORAGE BLOB READER access for storage accounts used by tables in the workspace. The code's readability and maintainability have been improved by modifying the method `_can_administer` to `can_administer` and `_load_workspace_info` to `load_workspace_info` in the `workspaces.py` file. A new `join_collection` command has been added to the `ucx` application instance to enable joining an existing installation to a collection. Additionally, modifications to the `install.py` file and `test_installation.py` file have been made to facilitate the integration of existing installations into a collection. The tests have been updated to ensure that the joining process works correctly in various scenarios. Overall, these changes provide more flexibility and ease of use for users and improve the interoperability and security of the system.
* Fixed `migrate-credential` cli command on AWS ([#1732](#1732)). In this release, the `migrate-credential` CLI command for AWS has been improved and fixed. The command now includes changes to the `access.py` file in the `databricks/labs/ucx/aws` directory. Notable updates are the refactoring of the `role_name` method into a dataclass called `AWSCredentialCandidate`, the addition of the method `_aws_role_trust_doc`, and the removal of the `_databricks_trust_statement` method. The `_aws_s3_policy` method has been updated to include `s3:PutObjectAcl` in the allowed actions, and methods `_create_role` and `_get_role_access_task` have been updated to use `arn` instead of `role_name`. Additionally, the `create_uc_role` and `update_uc_trust_role` methods have been combined into a single `update_uc_role` method. The `migrate-credentials` command in the `cli.py` file has also been updated to support migration of AWS Instance Profiles to UC storage credentials. These improvements resolve issue [#1726](#1726) and enhance the functionality and reliability of the `migrate-credential` command for AWS.
* Fixed crasher when running migrate-local-code ([#1794](#1794)). In this release, we have addressed a crasher issue that occurred when running the `migrate-local-code` command. The change involves modifying the `local_file_migrator` property in the `LocalCheckoutContext` class to use a lambda function instead of directly passing `self.languages`. This ensures that the languages are loaded only when the `local_file_migrator` property is accessed, preventing unnecessary load and potential crashes. The change does not introduce any new functionalities, but instead modifies existing commands related to local file migration. Comprehensive manual testing and unit tests have been conducted to ensure the fix works as expected without negatively impacting other parts of the system.
* Fixed inconsistent behavior in `%pip` cell handling ([#1785](#1785)). This PR addresses inconsistent behavior in `%pip` cell handling by modifying Python library installation to occur in a designated path lookup, rather than deep within the library tree. These changes impact various components, such as the `PipResolver` class, which no longer requires a `FileLoader` instance as an argument and now takes a `Whitelist` instance directly. Additionally, tests like `test_detect_s3fs_import` and `test_detect_s3fs_import_in_dependencies` are affected by these modifications. Overall, these changes streamline the `%pip` feature, improving library installation efficiency and consistency.
* Fixed issue when creating view using `WITH` clause ([#1809](#1809)). In this release, we have addressed an issue that occurred when creating a view using a `WITH` clause, which was causing potential errors or incorrect results due to improper handling of aliases. A new method, `_read_aliases`, has been introduced to read and store aliases from the `WITH` clause as a set, and during view dependency analysis, if an old table's name matches an alias, it is now skipped to prevent double-counting. This ensures improved accuracy and reliability of view creation with `WITH` clauses. Moreover, the commit includes adjustments to import statements, addition of unit tests, and the introduction of a new class `TableView` in the `databricks.labs.ucx.hive_metastore.view_migrate` module to test whether a view with a local dataset should be skipped. This release also includes a test for migrating a view with columns, ensuring that views with local datasets are now handled correctly. The fix resolves issue [#1798](#1798).
* Fixed linting for non-UTF8 encoded files ([#1804](#1804)). This commit addresses linting issues for files that are not encoded in UTF-8, improving compatibility with non-UTF-8 encoded files in the databricks labs ucx project. Previously, the linter and fixer tools were unable to process non-UTF-8 encoded files, causing them to fail. This issue has been resolved by adding a check for file encoding during linting and handling the case where the file is not encoded in UTF-8 by returning a failure message. A new method, `getpreferredencoding(False)`, has been introduced to determine the file's encoding, ensuring UTF-8 compatibility. Additionally, a new test method, `test_file_linter_lints_non_ascii_encoded_file`, has been added to check the linter's behavior with non-ASCII encoded files. This enhancement simplifies the linting process, allowing for better file handling of non-UTF-8 encoded files, and is supported by manual testing and unit tests.
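The encoding fallback can be sketched as follows; this is a simplified stand-in for the linter's behavior, which reports a failure advice rather than returning `None`.

```python
# Sketch: try UTF-8 first, then the locale's preferred encoding; give up
# gracefully (the real linter emits a failure message) if neither decodes.
import locale


def read_source(data: bytes):
    """Return (text, encoding), or None when the file cannot be decoded."""
    for encoding in ("utf-8", locale.getpreferredencoding(False)):
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    return None
```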
* Further fix for DENY permissions ([#1834](#1834)). This commit addresses issue [#1834](#1834) by implementing a fix for handling DENY permissions in the legacy TACL migration logic. Previously, all permissions were grouped in a single GRANT statement, but they have now been updated to be split into separate GRANT and DENY statements. This change improves the clarity and maintainability of the code and also increases test coverage with the addition of unit tests and integration tests. A new test function `test_tacl_applier_deny_and_grant()` has been added to demonstrate the use of the updated logic for handling DENY permissions. The resulting SQL queries now include both GRANT and DENY statements, reflecting the updated logic. These changes ensure that the DENY permissions are correctly applied, increasing the overall test coverage and confidence in the code.
* Removed false warning on DataFrame.insertInto() about the default format changing from parquet to delta ([#1823](#1823)). This pull request removes a false warning related to the use of DataFrameWriter.insertInto(), which had been incorrectly flagging a potential issue due to the default format change from Parquet to Delta. The warning is now suppressed as it is no longer relevant, since the operation ignores any specified format and uses the existing format of the underlying table. Additionally, an unnecessary linting suppression has been removed. These changes improve the accuracy of the warning system and eliminate confusion for users, with no impact on functionality, usability, or performance. The changes have been manually tested and do not require any new unit or integration tests, CLI commands, workflows, or tables.
* Support linting python wheel tasks ([#1821](#1821)). This release introduces support for linting Python wheel tasks, addressing issue [#1](#1).
* Updated linting checks for Spark table methods ([#1816](#1816)). This commit updates linting checks for PySpark's Spark table methods, focusing on improving handling of migrated tables and deprecating direct filesystem references in favor of the Unity Catalog. New tests and examples include literal and variable references to known and unknown tables, as well as cases with extra or out-of-position arguments. The commit also highlights false positives and trivial references in unrelated contexts. These changes aim to ensure proper usage of Spark table methods, improve codebase consistency, and minimize potential issues related to migrations and format changes.

Dependency updates:

 * Updated sqlglot requirement from <24.1,>=23.9 to >=23.9,<24.2 ([#1819](#1819)).
nfx added a commit that referenced this pull request Jun 4, 2024
* Added handling for legacy ACL `DENY` permission in group migration
([#1815](#1815)). In this
release, the handling of `DENY` permissions during group migrations in
our legacy ACL table has been improved. Previously, `DENY` operations
were denoted with a `DENIED` prefix and were not being applied correctly
during migrations. This issue has been resolved by adding a condition in
the _apply_grant_sql method to check for the presence of `DENIED` in the
action_type, removing the prefix, and enclosing the action type in
backticks to prevent syntax errors. These changes have been thoroughly
tested through manual testing, unit tests, integration tests, and
verification on the staging environment, and resolve issue
[#1803](#1803). A new test
function, test_hive_deny_sql(), has also been added to test the behavior
of the `DENY` permission.
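A hedged sketch of the condition described above; the function signature and the exact SQL shape are assumptions, not the actual UCX code:

```python
def apply_grant_sql(action_type: str, object_type: str, object_key: str, principal: str) -> str:
    """Render a legacy ACL row as SQL; a DENIED prefix marks deny entries."""
    if "DENIED" in action_type:
        # strip the prefix and backtick the action type to avoid syntax errors
        action = action_type.replace("DENIED_", "", 1)
        return f"DENY `{action}` ON {object_type} {object_key} TO `{principal}`"
    return f"GRANT {action_type} ON {object_type} {object_key} TO `{principal}`"
```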
* Added handling for parsing corrupted log files
([#1817](#1817)). The
`logs.py` file in the `src/databricks/labs/ucx/installer` directory has
been updated to improve the handling of corrupted log files. A new block
of code has been added to check if the logs match the expected format,
and if they don't, a warning message is logged and the function returns,
preventing further processing and potential production of incorrect
results. The changes include a new method
`test_parse_logs_warns_for_corrupted_log_file` that verifies the
expected warning message and corrupt log line are present in the last
log message when a corrupted log file is detected. These enhancements
increase the robustness of the log parsing functionality by introducing
error handling for corrupted log files.
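The guard described above might look like the following; the log-line regex here is a stand-in, not the actual format UCX parses:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical log-line shape: "12:34:56 INFO [module] message"
_LOG_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}\s+\w+\s+\[[^\]]+\]\s")


def parse_logs(lines: list[str]) -> list[str]:
    """Return well-formed log lines, stopping with a warning on corruption."""
    messages: list[str] = []
    for line in lines:
        if not _LOG_LINE.match(line):
            logger.warning("Logs do not match expected format, skipping: %s", line)
            return messages  # bail out instead of producing incorrect results
        messages.append(line.rstrip("\n"))
    return messages
```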
* Added known problems with `pyspark` package
([#1813](#1813)). In this
release, updates have been made to the
`src/databricks/labs/ucx/source_code/known.json` file to document known
issues with the `pyspark` package when running on UC Shared Clusters.
These issues include not being able to access the Spark Driver JVM,
using legacy contexts, or using RDD APIs. A new `KnownProblem` dataclass
has been added to the `known.py` file, which includes methods for
converting the object to a dictionary for better encoding of problems.
The `_analyze_file` method has also been updated to use a
`known_problems` set of `KnownProblem` objects, improving readability
and management of known problems within the application. These changes
address issue [#1813](#1813)
and improve the documentation of known issues with `pyspark`.
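The `KnownProblem` idea can be sketched as a frozen dataclass, which is hashable and therefore deduplicates naturally inside a set; the field names and problem codes below are illustrative assumptions:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen instances are hashable, so a set removes duplicates
class KnownProblem:
    code: str
    message: str

    def as_dict(self) -> dict[str, str]:
        """Dictionary form, convenient for JSON-encoding into a known-problems file."""
        return asdict(self)


known_problems = {
    KnownProblem("jvm-access-in-shared-clusters", "Cannot access Spark Driver JVM"),
    KnownProblem("jvm-access-in-shared-clusters", "Cannot access Spark Driver JVM"),
}
```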
* Added library linting for jobs launched on shared clusters
([#1689](#1689)). This
release includes an update to add library linting for jobs launched on
shared clusters, addressing issue
[#1637](#1637). A new
function, `_register_existing_cluster_id(graph: DependencyGraph)`, has
been introduced to retrieve libraries installed on a specified existing
cluster and register them in the dependency graph. If the existing
cluster ID is not present in the task, the function returns early. This
feature also includes changes to the `test_jobs.py` file in the
`tests/integration/source_code` directory, such as the addition of new
methods for linting jobs and handling libraries, and the inclusion of
the `jobs` and `compute` modules from the `databricks.sdk.service`
package. Additionally, a new `WorkflowTaskContainer` method has been
added to build a dependency graph for job tasks. These changes improve
the reliability and efficiency of the service by ensuring that jobs run
smoothly on shared clusters by checking for and handling missing
libraries. Software engineers will benefit from these improvements as it
will reduce the occurrence of errors due to missing libraries on shared
clusters.
* Added linters to check for spark logging and configuration access
([#1808](#1808)). This
commit introduces new linters to check for the use of Spark logging,
Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. The
changes address one issue and enhance three others related to RDDs in
shared clusters and the use of deprecated code. Additionally, new tests
have been added for the linters and updates have been made to existing
ones. The new linters have been added to the `SparkConnectLinter` class
and are executed as part of the `databricks labs ucx` command. This
commit also includes documentation for the new functionality. The
modifications are thoroughly tested through manual tests and unit tests
to ensure no existing functionality is affected.
* Added list of known dependency compatibilities and regeneration
infrastructure for it
([#1747](#1747)). This
change introduces an automated system for regenerating known Python
dependencies to ensure compatibility with Unity Catalog (UC), resolving
import issues during graph generation. The changes include a script
entry point for adding new libraries, manual trimming of unnecessary
information in the `known.json` file, and integration of package data
with the Whitelist. This development practice prioritizes using standard
libraries and provides guidelines for contributing to the project,
including debugging, fixtures, and IDE setup. The target audience for
this feature is software engineers contributing to the open-source
library.
* Added more known libraries from Databricks Runtime
([#1812](#1812)). In this
release, we've expanded the Databricks Runtime's capabilities by
incorporating a variety of new libraries. These libraries include
absl-py, aiohttp, and grpcio, which enhance networking functionalities.
For improved data processing, we've added aiosignal, anyio, appdirs, and
others. The suite of cloud computing libraries has been bolstered with
the addition of google-auth, google-cloud-bigquery,
google-cloud-storage, and many more. These libraries are now integrated
in the known libraries file in the JSON format, enhancing the platform's
overall functionality and performance in networking, data processing,
and cloud computing scenarios.
* Added more known packages from Databricks Runtime
([#1814](#1814)). In this
release, we have added a significant number of new packages to the known
packages file in the Databricks Runtime, including astor, audioread,
azure-core, and many others. These additions include several new modules
and sub-packages for some of the existing packages, significantly
expanding the library's capabilities. The new packages are expected to
provide new functionality and improve compatibility with the existing
packages. However, it is crucial to thoroughly test the new packages to
ensure they work as expected and do not introduce any issues. We
encourage all software engineers to familiarize themselves with the new
packages and integrate them into their workflows to take full advantage
of the improved functionality and compatibility.
* Added support for `.egg` Python libraries in jobs
([#1789](#1789)). This
commit adds support for `.egg` Python libraries in jobs by registering
egg library dependencies to DependencyGraph for linting, addressing
issue [#1643](#1643). It
includes the addition of a new method, `PythonLibraryResolver`, which
replaces the old `PipResolver`, and is used to register egg library
dependencies in the `DependencyGraph`. The changes also involve adding
user documentation, a new CLI command, and a new workflow, as well as
modifying an existing workflow and table. The tests include manual
testing, unit tests, and integration tests. The diff includes changes to
the 'test_dependencies.py' file, specifically in the import section
where `PipResolver` is replaced with `PythonLibraryResolver` from the
'databricks.labs.ucx.source_code.python_libraries' package. These
changes aim to improve test coverage and ensure the correct resolution
of dependencies, including those from `.egg` files.
* Added table migration workflow guide
([#1607](#1607)). UCX is a
new open-source library that simplifies the process of upgrading to
Unity Catalog in Databricks workspaces. After installation, users can
trigger the assessment workflow, which identifies any incompatible
entities and provides information necessary for planning migration. Once
the assessment is complete, users can initiate the group migration
workflow to upgrade various Databricks workspace assets, including
Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters,
Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live
Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries,
SQL Alerts, and Token and Password usage permissions set on the
workspace level, Secret scopes, Notebooks, Directories, Repos, and
Files. Additionally, the group migration workflow creates a debug
notebook and logs for debugging purposes, providing added convenience
and improved user experience.
* Added workflow linter for spark python tasks
([#1810](#1810)). A linter
for workflows related to Spark Python tasks has been implemented,
ensuring proper implementation of workflows for Spark Python tasks and
avoiding errors for tasks that are not yet implemented. The changes are
limited to the `_register_spark_python_task` method in the `jobs.py`
file. If the task is not a Spark Python task, an empty list is returned,
and if it is, the entrypoint is logged and the notebook is registered.
Additionally, two new tests have been implemented to demonstrate the
functionality of this linter. The
`test_job_spark_python_task_linter_happy_path` test checks the linter on
a valid job configuration where all required libraries are specified,
while the `test_job_spark_python_task_linter_unhappy_path` test checks
the linter on an invalid job configuration where required libraries are
not specified. These tests ensure that the workflow linter for Spark
Python tasks is functioning correctly and can help identify any
potential issues in job configurations.
* Connect all linters to `LinterContext` and add functional testing
framework ([#1811](#1811)).
This commit connects all linters, including those related to JVM, to the
critical path for improved code linting, and introduces a functional
testing framework to simplify the writing of code linting verification
tests. The `pyproject.toml` file has been updated to include a new
configuration for the `ignore-paths` option, utilizing a regular
expression to exclude certain files or directories from linting. The
testing framework is particularly useful for verifying the correct
functioning of linters, reducing the risk of errors and improving the
overall development experience. These changes will help to improve the
reliability and efficiency of the linting process, making it easier to
write and maintain high-quality code.
* Deduplicate errors emitted by Spark Connect linter
([#1824](#1824)). This pull
request introduces error deduplication for the Spark Connect linter and
adds new functional tests using an updated framework. The modifications
include the addition of user documentation and unit tests, as well as
alterations to existing commands and workflows. Specifically, a new CLI
command has been added, and the command `databricks labs ucx ...` has
been modified. Additionally, a new workflow has been implemented, and an
existing workflow has been updated. No new tables or modifications to
existing tables are present. Testing has been conducted through manual
testing and new unit tests, with no integration tests or staging
environment tests specified. The `verify` method in the
`test_functional.py` file has been updated to sort the actual problems
list before comparing it to the expected problems list, ensuring
consistent ordering of results. The changes aim to improve the
functionality and usability of the Spark Connect linter for our software
engineer audience.
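Sorting before comparison and deduplicating through a set are both one-liners; a sketch of the pattern, where the `(code, message)` tuple shape is an assumption:

```python
def normalize(problems: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop duplicate (code, message) pairs and impose a stable order,
    so test comparisons do not depend on emission order."""
    return sorted(set(problems))
```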
* Download wheel dependency locally to register it to the dependency
graph ([#1704](#1704)). A
new feature has been implemented in the open-source library to enhance
dependency management for wheel files. Previously, when the library type
was wheel, a `not-yet-implemented` DependencyProblem would be yielded.
Now, the system downloads the wheel file from a remote location, saves
it to a temporary directory, and registers the local file to the
dependency graph. This allows for more comprehensive handling of wheel
dependencies, as they are now downloaded and registered instead of
simply being flagged as "not-yet-implemented". Additionally, new
functions for creating jobs, making notebooks, and generating random
values have been added to enable more comprehensive testing of the
workflow linter. New tests have been implemented to check the linter's
behavior when there is a missing library dependency and to verify that
the linter correctly handles wheel dependencies. These changes improve
the testing capabilities of the workflow linter and ensure that all
dependencies are properly accounted for and managed within the system. A
new test method,
'test_workflow_task_container_builds_dependency_graph_for_python_wheel',
has been added to ensure that the dependency graph is built correctly
for Python wheels and to improve test coverage.
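The download-and-register flow might be sketched as below; `fetch_library` is a hypothetical name, and real code would also handle HTTP errors and clean up the temporary directory:

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_library(url: str) -> Path:
    """Download a remote wheel into a temporary directory and return its local path."""
    target_dir = Path(tempfile.mkdtemp(prefix="ucx-wheels-"))
    local_file = target_dir / url.rsplit("/", 1)[-1]
    with urlopen(url) as response:  # handles http(s):// and file:// URLs
        local_file.write_bytes(response.read())
    return local_file  # this local path is what gets registered in the dependency graph
```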
* Drop pyspark `register` lint matcher
([#1818](#1818)). In the
latest release, the `register` lint matcher has been removed from
pyspark, indicating that the specific usage pattern for the `register`
method in UDTFRegistration is no longer required. This change affects
the linting process during code reviews, but does not impact the
functionality of the code directly. Other matchers for DataFrame,
DataFrameReader, DataFrameWriter, and direct filesystem access remain
unchanged. The `register` method, which was likely used to register a
temporary table or view in pyspark, is no longer considered a best
practice or necessary feature. If you previously relied on the
`register` method in your pyspark code, you will need to find an
alternative solution. This update aims to improve the quality and
consistency of pyspark code by removing outdated or unnecessary
functionality.
* Enabled joining an existing installation to a collection
([#1799](#1799)). This
change introduces several new features and modifications to the
open-source library, aimed at enhancing the management and organization
of workspaces within a collection. A new command `join-collection` has
been added to allow a workspace to join a collection using its workspace
ID. The `report-account-compatibility` command has been updated with a
new flag `--workspace-ids`, and the `alias` command has been updated
with a new description. Two new commands `principal-prefix-access` and
`create-missing-principals` have been introduced for AWS, and a new
command `create-uber-principal` has been introduced for Azure to handle
the creation of service principals with STORAGE BLOB READER access for
storage accounts used by tables in the workspace. The code's readability
and maintainability have been improved by modifying the method
`_can_administer` to `can_administer` and `_load_workspace_info` to
`load_workspace_info` in the `workspaces.py` file. A new
`join_collection` command has been added to the `ucx` application
instance to enable joining an existing installation to a collection.
Additionally, modifications to the `install.py` file and
`test_installation.py` file have been made to facilitate the integration
of existing installations into a collection. The tests have been updated
to ensure that the joining process works correctly in various scenarios.
Overall, these changes provide more flexibility and ease of use for
users and improve the interoperability and security of the system.
* Fixed `migrate-credential` cli command on AWS
([#1732](#1732)). In this
release, the `migrate-credential` CLI command for AWS has been improved
and fixed. The command now includes changes to the `access.py` file in
the `databricks/labs/ucx/aws` directory. Notable updates are the
refactoring of the `role_name` method into a dataclass called
`AWSCredentialCandidate`, the addition of the method
`_aws_role_trust_doc`, and the removal of the
`_databricks_trust_statement` method. The `_aws_s3_policy` method has
been updated to include `s3:PutObjectAcl` in the allowed actions, and
methods `_create_role` and `_get_role_access_task` have been updated to
use `arn` instead of `role_name`. Additionally, the `create_uc_role` and
`update_uc_trust_role` methods have been combined into a single
`update_uc_role` method. The `migrate-credentials` command in the
`cli.py` file has also been updated to support migration of AWS Instance
Profiles to UC storage credentials. These improvements resolve issue
[#1726](#1726) and enhance
the functionality and reliability of the `migrate-credential` command
for AWS.
* Fixed crasher when running migrate-local-code
([#1794](#1794)). In this
release, we have addressed a crasher issue that occurred when running
the `migrate-local-code` command. The change involves modifying the
`local_file_migrator` property in the `LocalCheckoutContext` class to
use a lambda function instead of directly passing `self.languages`. This
ensures that the languages are loaded only when the
`local_file_migrator` property is accessed, preventing unnecessary load
and potential crashes. The change does not introduce any new
functionalities, but instead modifies existing commands related to local
file migration. Comprehensive manual testing and unit tests have been
conducted to ensure the fix works as expected without negatively
impacting other parts of the system.
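The lazy-loading pattern behind this fix can be sketched as follows; the class and attribute names mirror the description but are simplified assumptions:

```python
class LocalCheckoutContext:
    """Sketch: defer an expensive load until the property is first accessed."""

    def __init__(self, load_languages):
        self._load_languages = load_languages  # a callable, e.g. a lambda
        self._migrator = None

    @property
    def local_file_migrator(self):
        if self._migrator is None:
            self._migrator = self._load_languages()  # runs only on first access
        return self._migrator
```

`functools.cached_property` achieves the same effect with less ceremony when the loader can live on the class itself.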