Closed
`get_external_location` is a string value, so the conditional statement is always executed irrespective of the value of `get_external_location`.
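The failure mode described above can be illustrated with a minimal sketch. The original code and its language are not shown here, so Python, the helper name `parse_flag`, and the sample value `"false"` are assumptions made purely for illustration. The point is that a non-empty string is always truthy, so a bare `if` on it runs even when the text says "false"; comparing against an explicit set of accepted spellings avoids that.

```python
def parse_flag(value: str) -> bool:
    """Interpret a textual flag; only a few explicit spellings count as true."""
    return value.strip().lower() in {"true", "1", "yes"}


# Hypothetical value, standing in for the string described above.
get_external_location = "false"

# Buggy check: any non-empty string is truthy, so this is True even for "false".
buggy_result = bool(get_external_location)

# Fixed check: compare the text against the accepted "true" spellings.
fixed_result = parse_flag(get_external_location)

print(buggy_result, fixed_result)
```

Which spellings should count as true is a design choice; the set used here is only one reasonable option.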
pritishpai added a commit that referenced this pull request on Dec 29, 2023
pritishpai added a commit that referenced this pull request on Jan 3, 2024
pritishpai added a commit that referenced this pull request on Jan 4, 2024
nfx added a commit that referenced this pull request on May 8, 2024
* Added DBSQL queries & dashboard migration ([#1532](#1532)). The Databricks Labs UCX project has been updated with two new experimental commands: `migrate-dbsql-dashboards` and `revert-dbsql-dashboards`. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The `migrate-dbsql-dashboards` command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with `migrated by UCX` and backing up original queries. The `revert-dbsql-dashboards` command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a `--dashboard-id` flag for migrating or reverting a specific dashboard. Additionally, two new functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`, have been added to the `cli.py` file, and new classes have been added to interact with Redash for data visualization and querying. The `make_dashboard` fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards.
* Added UDFs assessment ([#1610](#1610)). A User Defined Function (UDF) assessment feature has been introduced, addressing issue [#1610](#1610). A new method, `DESCRIBE_FUNCTION`, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges. The UDF constructor has been updated with a new parameter `comment`, initially left blank in the test function. Additionally, two new columns, `success` and `failures`, have been added to the udf table in the inventory database to store assessment data for UDFs.
The `UdfsCrawler` class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the `$inventory.udfs` table, with a widget displaying this information as a counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to create the missing UC roles in AWS ([#1495](#1495)). The `databricks labs ucx` tool now includes a new command, `create-missing-principals`, which creates missing Unity Catalog (UC) roles in AWS for S3 locations that lack a UC-compatible role. This command is implemented using `IamRoleCreation` from `databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new command only supports AWS and does not affect Azure. The existing `migrate_credentials` function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including `AWSUCRoleCandidate` in `aws.py`, and `create_missing_principals` and `list_uc_roles` methods in `access.py`. The `create_uc_roles_cli` method in `access.py` has been refactored and renamed to `list_uc_roles`. New unit tests have been implemented to test the functionality of `create_missing_principals` for AWS and Azure, as well as the behavior when the command is not approved.
* Added baseline for workflow linter ([#1613](#1613)). This change introduces the `WorkflowLinter` class in the `application.py` file of the `databricks.labs.ucx.source_code.jobs` package. The class is used to lint workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as `workspace_client`, `dependency_resolver`, `path_lookup`, and `migration_index`.
Several properties have been moved from `dependency_resolver` to the `CliContext` class, and the `NotebookLoader` class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The `generic` and `redash` modules from `databricks.labs.ucx.workspace_access` and the `GroupManager` class from `databricks.labs.ucx.workspace_access.groups` are used. The `VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from `databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class from `databricks.labs.ucx.installer.workflows` are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access ([#1606](#1606)). A new `AstHelper` class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the `AstHelper` class, which has been moved to a separate module. A new file, `spark_connect.py`, introduces a linter with three matchers to ensure conformance to best practices and catch potential issues early in the development process related to RDD usage and JVM access. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments.
* Added non-Delta DBFS table migration (`What.DBFS_ROOT_NON_DELTA`) in migrate_table workflow ([#1621](#1621)).
The `migrate_tables` workflow in `workflows.py` has been enhanced to support a new scenario, `DBFS_ROOT_NON_DELTA`, which covers migrating non-Delta tables stored in DBFS root from the Hive Metastore to Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the `AclMigrationWhat.PRINCIPAL` strategy. The `migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and `migrate_views` tasks now incorporate the new ACL migration strategy. These changes have been tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities.
* Added "seen tables" feature ([#1465](#1465)). The `seen tables` feature has been introduced, allowing for better handling of existing tables in the Hive metastore and supporting their migration to UC. This enhancement includes the addition of a `snapshot` method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The `_crawl` function has been updated to check for and skip existing tables in the current workspace. New methods such as `_get_tables_paths_from_assessment`, `_overwrite_records`, and `_get_table_location` have been included to facilitate these improvements. A new test, `test_mount_listing_seen_tables`, has been implemented, replacing `test_partitioned_csv_jsons`. This test checks the behavior of the `TablesInMounts` class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the `locations.py` file in the databricks/labs/ucx directory, related to the Hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` CLI command ([#1660](#1660)).
This commit adds support for the `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` command, which checks for external tables that cannot be synced and prompts the user to run the `migrate-tables-ctas` workflow. Two new methods, `test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts, ctx=ctx)`, have been added. The first method checks if the `migrate-external-tables-ctas` workflow is called correctly, while the second method runs the workflow after prompting the user. The method `test_migrate_external_hiveserde_tables_in_place(ws)` has been modified to test if the `migrate-external-hiveserde-tables-in-place-experimental` workflow is called correctly. Beyond these, no significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature is software engineers who adopt the project.
* Added support for migrating external location permissions from interactive cluster mounts ([#1487](#1487)). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the `AzureACL` class, granting necessary permissions to each cluster principal for each location. The existing `databricks labs ucx` command is modified, with the addition of the new method `create_external_locations` and testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues [#1192](#1192) and [#1193](#1193).
* Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN ([#1631](#1631)).
In this release, we've implemented new features to enhance the security and control over data access during the migration process for the SQL warehouse data access configuration. The `databricks labs ucx create-uber-principal` command now creates a service principal with read-only access to all the storage used by tables in the workspace. The UCX Cluster Policy and SQL Warehouse data access configuration will be updated to use this service principal for migration workflows. A new method, `_update_sql_dac_with_instance_profile`, has been introduced in the `access.py` file to update the SQL data access configuration with the provided AWS instance profile, ensuring a more streamlined management of instance profiles within the SQL data access configuration during the creation of an uber service principal (SPN). Additionally, new methods and tests have been added to the `sql` module of the `databricks.sdk.service` package to improve Azure resource permissions, handling different scenarios related to creating a global SPN in the presence or absence of various conditions, such as storage, cluster policies, or secrets.
* Addressed issue with disabled features in certain regions ([#1618](#1618)). In this release, we have implemented improvements to address an issue where certain features were disabled in specific regions. We have added error handling when listing serving endpoints to raise a `NotFound` error if a feature is disabled, preventing the code from failing silently and providing better error messages. A new method, `test_serving_endpoints_not_enabled`, has been added, which creates a mock `WorkspaceClient` and raises a `NotFound` error if serving endpoints are not enabled for a shard. The `GenericPermissionsSupport` class uses this method to get crawler tasks, and if serving endpoints are not enabled, an error message is logged. These changes increase the reliability and robustness of the codebase by providing better error handling and messaging for this particular issue.
Additionally, the change includes unit tests and manual testing to ensure the proper functioning of the new features.
* Aggregate UCX output across workspaces with CLI command ([#1596](#1596)). A new `report-account-compatibility` command has been added to the `databricks labs ucx` tool, enabling users to evaluate the compatibility of an entire Azure Databricks account with UCX. The command runs at the account level, aggregates UCX output across the workspaces in the account, and generates a readiness report by querying aspects of the account such as clusters, configurations, and data formats. It authenticates through the Azure CLI with AAD tokens and accepts a profile as an argument. The output includes warnings for workspaces that do not have UCX installed, and provides information about unsupported cluster types, unsupported configurations, data format compatibility, and more. The existing `manual-workspace-info` command remains unchanged. These changes will help assess the readiness and compatibility of an Azure Databricks account for UCX integration and simplify the process of checking compatibility across an entire account.
* Assert if group name is in cluster policy ([#1665](#1665)). In this release, we have implemented a change to ensure the presence of the display name of a specific workspace group (`ws_group_a`) in the cluster policy. This prevents a key error previously encountered. The cluster policy is now loaded as a dictionary, and the group name is checked to confirm its presence. If the group is not found, a message is raised alerting users. Additionally, the permission level for the group is verified to ensure it is set to `CAN_USE`. No new methods have been added, and existing functionality remains unchanged.
The test file `test_ext_hms.py` has been updated to include the new assertion and has undergone both unit tests and manual testing to ensure proper implementation. This change is intended for software engineers who adopt the project.
* Automatically retrying with `auth_type=azure-cli` when constructing `workspace_clients` on Azure ([#1650](#1650)). This commit introduces automatic retrying with `auth_type=azure-cli` when constructing `workspace_clients` on Azure, resolving TODO items for `AccountWorkspaces` and adding relevant suggestions in `troubleshooting.md`. It closes issues [#1574](#1574) and [#1430](#1430), and includes new methods for generating readiness reports in `AccountAggregate` and testing the `get_accessible_workspaces` method in `test_workspaces.py`. User documentation has been updated and the changes have been manually verified in a staging environment. For macOS and Windows users, explicit auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure storage account for principal-prefix-access ([#1576](#1576)). This release introduces several enhancements to the identification of service principals with custom roles on Azure storage accounts for principal-prefix-access. New methods such as `_get_permission_level`, `_get_custom_role_privilege`, and `_get_role_privilege` have been added to improve the functionality of the module. Additionally, two new classes, `AzureRoleAssignment` and `AzureRoleDetails`, have been added to enable more detailed management and access control for custom roles on Azure storage accounts. The `test_access.py` file has been updated to include tests for saving custom roles in Azure storage accounts and ensuring the correct identification of service principals with custom roles. A new unit test function, `test_role_assignments_custom_storage()`, has also been added to verify the behavior of custom roles in Azure storage accounts.
Overall, these changes provide a more efficient and fine-grained way to manage and control custom roles on Azure storage accounts.
* Clarified unsupported config in compute crawler ([#1656](#1656)). In this release, we have made significant changes to clarify and improve the handling of unsupported configurations in our compute crawler related to the Hive metastore. We have expanded error messages for unsupported configurations and provided detailed recommendations for remediation. Additionally, we have added relevant user documentation and manually tested the changes. The changes include updates to the configuration for external Hive metastore and passthrough security model for Unity Catalog, which are incompatible with the current configurations. We recommend removing or altering the configs while migrating existing tables and views using UCX or other compatible clusters, and mapping the passthrough security model to a security model compatible with Unity Catalog. The code modifications include the addition of new methods for checking cluster init script and Spark configurations, as well as refining the error messages for unsupported configurations. We also added a new assertion in the `test_cluster_with_multiple_failures` unit test to check for the presence of a specific message regarding the use of the `spark.databricks.passthrough.enabled` configuration. This release is not yet verified on the staging environment.
* Created a unique default schema when External Hive Metastore is detected ([#1579](#1579)). A new default database `ucx` is introduced for storing inventory in the Hive metastore, with a suffix consisting of the workspace's client ID to ensure uniqueness when an external Hive metastore is detected. The `has_ext_hms()` method is added to the `InstallationPolicy` class to detect external HMS and thereby create a unique default schema.
The `_prompt_for_new_installation` method's default value for the `Inventory Database stored in hive_metastore` prompt is updated to use the new default database name, modified to include the workspace's client ID if external HMS is detected. Additionally, a test function `test_save_config_ext_hms` is implemented to demonstrate the `WorkspaceInstaller` class's behavior with external HMS, creating a unique default schema. This change is part of issue [#1579](#1579).
* Extend service principal migration to create storage credentials for access connectors created for each storage account ([#1426](#1426)). This commit extends the service principal migration to create storage credentials for access connectors associated with each storage account, resolving issues [#1384](#1384) and [#875](#875). The update includes modifications to the existing `databricks labs ucx` command for creating access connectors, adds a new CLI command for creating storage credentials, and updates the documentation. A new workflow has been added for creating credentials for access connectors and service principals, and updates have been made to existing workflows. The commit includes manual, unit, and integration tests, and no new or modified methods are specified in the diff. The focus is on the feature description and its impact on the project's functionality. The commit has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to access Azure Storage Accounts behind firewall ([#1589](#1589)). In this release, we have introduced a new feature to improve access to Azure Storage Accounts that are protected by firewalls. Due to limitations with service principals in such scenarios, we have developed Access Connectors with Managed Identities for more reliable connectivity.
This change includes updates to the `credentials.py` file, which introduces new methods for managing the migration of service principals to Access Connectors using Managed Identities. Users are warned that migrating to this new feature may cause issues when transitioning to UC, and are advised to validate external locations after running the migration command. This update provides a more dependable method for accessing Azure Storage Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have different target schemas ([#1581](#1581)). In this release, we have implemented a fix to address an issue where catalog/schema grants were not being handled correctly when tables with the same source schema had different target schemas. This was causing problems with granting appropriate permissions to users. We have modified the `prepare_test` function to include an additional test case with a different target schema for the same source table. Furthermore, we have updated the `test_catalog_schema_acl` function to ensure that grants are being created correctly for all catalogs, schemas, and tables. We have also added an extra query to grant use schema permissions for `catalog2.schema3` to `user1`. Additionally, we have introduced a new `SchemaInfo` class to store information about catalogs and schemas, and refactored the `_get_database_source_target_mapping` method to return a dictionary that maps source databases to a list of `SchemaInfo` objects instead of a single dictionary. These changes ensure that grants are being handled correctly for catalogs, schemas, and tables, even when tables with the same source schema have different target schemas.
* Fixed Spark configuration parameter referencing secret ([#1635](#1635)).
In this release, the code related to the Spark configuration parameter reference for a secret has been updated in the `access.py` file, specifically within the `_update_cluster_policy_definition` method. The change modifies the method to retrieve the OAuth client secret for a given storage account using an f-string to reference the secret, replacing the previous concatenation operator. This enhancement is aimed at improving the readability and maintainability of the code while preserving its functionality. Furthermore, the commit includes additional changes, such as new methods `test_create_global_spn` and `cluster_policies.edit`, which may be related to this fix. These changes address the secret reference issue in the Spark configuration.
* Fixed `migration-locations` and `assign-metastore` definitions in `labs.yml` ([#1627](#1627)). In this release, the `migration-locations` command in the `labs.yml` file has been updated to include new flags `subscription-id` and `aws-profile`. The `subscription-id` flag allows users to specify the subscription to scan the storage account in, and the `aws-profile` flag allows for authentication using a specified AWS Profile. The `assign-metastore` command has also been updated with a new description: "Enable Unity Catalog features on a workspace by assigning a metastore to it." The `is_account_level` parameter remains unchanged, and the new optional flag `workspace-id` has been added, allowing users to specify the Workspace ID to assign a metastore to. This change enhances the functionality of the `migration-locations` and `assign-metastore` commands, providing more options for users to customize their storage scanning and metastore assignment processes.
The `migration-locations` and `assign-metastore` definitions in the `labs.yml` file have been fixed in this release.
* Fixed prompt for using external metastore ([#1668](#1668)). A fix has been implemented in the `create` function of the `policy.py` file to correctly prompt users for using an external metastore. Previously, a missing period and space in the prompt caused potential confusion. The updated prompt now includes a clarifying sentence and the `_prompts.confirm` method has been modified to check if the user wants to set UCX to connect to an external metastore in two scenarios: when one or more cluster policies are set up for an external metastore, and when the workspace warehouse is configured for an external metastore. If the user chooses to set up an external metastore, an informational message will be recorded in the logger.
* Fixed storage account network ACLs retrieved from properties ([#1620](#1620)). This release includes a fix to the storage account network ACLs retrieval in the open-source library, addressing issue [#1](#1). Previously, the network ACLs were being retrieved from an incorrect location, but this commit corrects that by obtaining the network ACLs from the storage account's `properties.networkAcls` field. The `StorageAccount` class has been updated to modify the way default network action is retrieved, with a new value `Unknown` added to the previous values `Deny` and `Allow`. The `from_raw_resource` class method has also been updated to retrieve the default network action from the `properties.networkAcls` field instead of the `networkAcls` field. This change may affect any functionality that relies on network ACL information and impacts the existing command `databricks labs ucx ...`.
Relevant tests, including a new test `test_azure_resource_storage_accounts_list_non_zero`, have been added, and the fix has been verified manually and with unit tests.
* Fully refresh table migration status in table migration workflow ([#1630](#1630)). This release introduces a new method, `index_full_refresh()`, to the table migration workflow for fully refreshing the migration status, addressing an oversight from a previous commit ([#1623](#1623)) and resolving issue [#1628](#1628). The new method resets the `_migration_status_refresher` before computing the index, ensuring the latest migration status is used for determining whether view dependencies have been migrated. The `index()` method was previously used to refresh the migration status, but it only provided a partial refresh. With this update, `index_full_refresh()` is utilized for a comprehensive refresh, affecting the `refresh_migration_status` task in multiple workflows such as `migrate_views`, `scan_tables_in_mounts_experimental`, and others. This change ensures a more accurate migration report, presenting the updated migration status.
* Ignore existing corrupted installations when refreshing ([#1605](#1605)). A recent update has enhanced the error handling during the loading of installations in the `install.py` file. Specifically, the `installation.load` function now handles certain errors, including `PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by logging a warning message and skipping the corrupted installation instead of raising an error. This behavior has been incorporated into both the `configure` and `_check_inventory_database_exists` functions, allowing the installation process to continue even in the presence of issues with existing installations, while providing improved error messages.
This change resolves issue [#1601](#1601) and introduces a new test case for a corrupted installation configuration, as well as an updated existing test case for `test_save_config` that includes a mock installation.
* Improved exception handling ([#1584](#1584)). In this release, the exception handling during the upload of a wheel file to DBFS has been significantly improved. Previously, only `PermissionDenied` errors were caught and handled. Now, both `BadRequest` and `PermissionDenied` exceptions will be caught and logged as a warning. This change enhances the robustness of the code by handling a wider range of exceptions during the upload process. In addition, cluster overrides have been configured and DBFS write permissions have been set up. The specific changes made to the code include updating the import statement for `NotFound` to include `BadRequest` and modifying the except block in the `_get_init_script_data` method to catch both `NotFound` and `BadRequest` exceptions. These improvements ensure that the code can handle more types of errors, providing more helpful error messages and preventing crash scenarios.
* Improved exception handling for `migrate_acl` ([#1590](#1590)). In this release, the `migrate_acl` functionality has been enhanced to improve exception handling, addressing a flakiness issue in the `test_migrate_managed_tables_with_acl` test. Previously, unhandled `not found` exceptions during parallel test execution caused the flakiness. This release resolves this issue ([#1549](#1549)) by introducing error handling in the `test_migrate_acls_should_produce_proper_queries` test. A controlled error is now introduced to simulate a failed grant migration due to a `TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise testing of error handling and logging mechanisms when migration fails for specific objects, ensuring a more reliable testing environment for the `migrate_acl` functionality.
* Improved reliability of table migration status refresher ([#1623](#1623)). This release introduces improvements to the table migration status refresher in the open-source library, enhancing its reliability and robustness. The `table_migrate` function has been updated to ensure that the table migration status is always reset when requesting the latest snapshot, addressing issues [#1623](#1623), [#1622](#1622), and [#1615](#1615). Additionally, the function now handles `NotFound` errors when refreshing migration status. The `get_seen_tables` function has been modified to convert the returned iterator to a list and raise a `NotFound` exception if the schema does not exist, which is then caught and logged as a warning. Furthermore, the migration status reset behavior has been improved, and the `migration_status_refresher` parameter type in the `TableMigrate` class constructor has been modified. A new private method, `_index_with_reset()`, has been added, and the `_migrate_views()` and `_view_can_be_migrated()` methods have been updated to ensure a more accurate and consistent table migration process. The changes have been thoroughly tested and are ready for review.
* Refresh migration status at the end of the `migrate_tables` workflows ([#1599](#1599)). In this release, the migration status is refreshed at the end of the `migrate_tables` workflows, with no new or modified tables introduced. Calls to `_migration_status_refresher.reset()` have been added in two locations to ensure accurate migration status updates. A new `refresh_migration_status` method has been included in the `RuntimeContext` class in the `databricks.labs.ucx.hive_metastore.workflows` module, which refreshes the migration status for presentation in the dashboard.
The changes also include the addition of the `refresh_migration_status` task in `migrate_views`, `migrate_views_with_acl`, and `scan_tables_in_mounts_experimental` workflows, and the `migration_report` method is now dependent on the `refresh_migration_status` task. Testing includes a new integration test in the file `tests/integration/hive_metastore/test_workflows.py` that verifies the migration status is refreshed after the migration job is run. These changes aim to ensure that the migration status is up-to-date and accurately presented in the dashboard.
* Removed DBFS library installations ([#1554](#1554)). In this release, the `configure.py` file has been removed, which previously contained the `ConfigureClusterOverrides` class with methods for validating cluster IDs, distinguishing between classic and Table Access Control (TACL) clusters, and building a prompt for users to select a valid active cluster ID. The removal of this file signifies that these functionalities are no longer available. This change is part of a larger commit that also removes DBFS library installations and updates the Estimates Dashboard to remove metastore assignment, addressing issue [#1098](#1098). The commit has been tested via integration tests and manual installation and running of UCX on a no-uc environment. Please note that the `create_jobs` method in the `install.py` file has been updated to reflect these changes, ensuring a more straightforward installation experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt ([#1664](#1664)). In this release, we have removed the `is_terraform_used` prompt from the configuration file and the installation process in the ucx package. This prompt was not being utilized and had been a source of confusion for some users.
The variable that stored its outcome is retained for backwards compatibility, and no new methods or modifications to existing functionality have been introduced. No tests have been added or modified as part of this change. Removing the prompt simplifies configuration and aligns with the project's plan to eliminate the use of Terraform state for ucx migration. Manual testing confirmed that the removal does not affect other properties in the configuration file or the installation process.
* Resolve relative paths when building dependency graph ([#1608](#1608)). This commit adds support for resolving relative paths when building a dependency graph in the UCX project, addressing issues 1202, 1499, and 1287. The `SysPathProvider` now includes a `cwd` attribute, and a new class, `LocalNotebookLoader`, handles local files and folders. The `PathLookup` class is used to resolve paths, and new methods have been added to support these changes. Unit tests cover the new functionality. This commit replaces issue 1593 and improves the project's handling of local files and folders, making the dependency graph more reliable.
* Show tables migration status in migration dashboard ([#1507](#1507)). A migration dashboard has been added to display the status of data object migrations, addressing issue [#323](#323). The feature includes a query that shows the migration status of tables, a new CLI command, and a modification to an existing command. The `migration-*` workflow has been updated to include a refresh migration dashboard option. The `mock_installation` function has been modified with an updated state.json file. The changes were tested manually, and the new SQL query file is in the `migrations/main` directory.
This migration dashboard provides users with an easier way to monitor the progress and status of their data migration tasks. * Simulate loading of local files or notebooks after manipulation of `sys.path` ([#1633](#1633)). This commit updates the PathLookup process during the construction of the dependency graph, addressing issues [#1202](#1202) and [#1468](#1468). It simplifies the DependencyGraphBuilder by directly using the DependencyResolver with resolvers and lookup passed as arguments, and removes the DependencyGraphBuilder. The changes include new methods for handling compatibility checks, but no new user-facing features or changes to command-line interfaces or existing workflows are introduced. Unit tests are included to ensure correct behavior. The modifications aim to improve the internal handling of dependency resolution and compatibility checks. * Test if `create-catalogs-schemas` works with tables defined as mount paths ([#1578](#1578)). This release includes a new unit test for the `create-catalogs-schemas` logic that verifies the correct creation and management of catalogs and schemas defined as mount paths. The test checks the storage location of catalogs, ensures non-existing schemas are properly created, and prevents the creation of catalogs without a storage location. It also verifies the catalog schema ACL is set correctly. Using the `CatalogSchema` class and various test functions, the test creates and grants permissions to catalogs and schemas. This change resolves issue [#1039](#1039) without modifying any existing commands or workflows. The release contains no new CLI commands or user documentation, but includes unit tests and assertion calls to validate the behavior of the `create_all_catalogs_schemas` method. * Upgraded `databricks-sdk` to 0.27 ([#1626](#1626)). In this release, the `databricks-sdk` package has been upgraded to version 0.27, bringing updated methods for Redash objects. 
The `_install_query` method in the `dashboards.py` file has been updated to include a `tags` parameter, set to `None`, when calling `self._ws.queries.update` and `self._ws.queries.create`. This ensures that the updated SDK version is used and that tags are not applied during query updates and creation. Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint` packages have been updated to versions 0.4.0 and 0.4.3 respectively, and the dependency for PyYAML has been updated to a version between 6.0.0 and 7.0.0. These updates may impact the functionality of the project. The changes have been manually tested, but there is no verification on a staging environment. * Use stack of dependency resolvers ([#1560](#1560)). This pull request introduces a stack-based implementation of resolvers, resolving issues [#1202](#1202), [#1499](#1499), and [#1421](#1421), and implements an initial version of SysPathProvider, while eliminating previous hacks. The new functionality includes modified existing commands, a new workflow, and the addition of unit tests. No new documentation or CLI commands have been added. The `problem_collector` parameter is not addressed in this PR and has been moved to a separate issue. The changes include renaming and moving a Python file, as well as modifications to the `Notebook` class and its related methods for handling notebook dependencies and dependency checking. The code has been tested, but manual testing and integration tests are still pending.
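The "stack of dependency resolvers" idea in the last item can be illustrated with a minimal sketch in which resolvers are tried in order and the first match wins. Every name below (`Resolved`, `make_resolver`, `ResolverStack`) is hypothetical and chosen for illustration; this is not the UCX implementation, only one plausible reading of the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical result type for illustration only; not a UCX class.
@dataclass(frozen=True)
class Resolved:
    name: str
    source: str  # which resolver produced the match


# A resolver maps a dependency name to a result, or None if it cannot help.
Resolver = Callable[[str], Optional[Resolved]]


def make_resolver(source: str, known: set[str]) -> Resolver:
    """Build a toy resolver that only recognises the names in `known`."""
    def resolve(name: str) -> Optional[Resolved]:
        return Resolved(name, source) if name in known else None
    return resolve


class ResolverStack:
    """Try each resolver in order and return the first successful match."""

    def __init__(self, resolvers: list[Resolver]):
        self._resolvers = list(resolvers)

    def resolve(self, name: str) -> Optional[Resolved]:
        for resolver in self._resolvers:
            found = resolver(name)
            if found is not None:
                return found
        return None  # no resolver recognised the dependency


# Assumed ordering: notebooks first, then local files, then installed packages.
stack = ResolverStack([
    make_resolver("notebook", {"etl/load"}),
    make_resolver("local-file", {"helpers.py"}),
    make_resolver("site-packages", {"pandas"}),
])
```

Calling `stack.resolve("helpers.py")` returns the match from the local-file resolver, while an unknown name yields `None`, which a caller could then report as an unresolved dependency.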
nfx added a commit that referenced this pull request on May 8, 2024
* Added DBSQL queries & dashboard migration ([#1532](#1532)). The Databricks Labs Unified Command Extensions (UCX) project has been updated with two new experimental commands: `migrate-dbsql-dashboards` and `revert-dbsql-dashboards`. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The `migrate-dbsql-dashboards` command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with `migrated by UCX` and backing up original queries. The `revert-dbsql-dashboards` command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a `--dashboard-id` flag for migrating or reverting a specific dashboard. Additionally, two new functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`, have been added to the `cli.py` file, and new classes have been added to interact with Redash for data visualization and querying. The `make_dashboard` fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards. * Added UDFs assessment ([#1610](#1610)). A User Defined Function (UDF) assessment feature has been introduced, addressing issue [#1610](#1610). A new method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges, and ensuring system reliability. The UDF constructor has been updated with a new parameter 'comment', initially left blank in the test function. Additionally, two new columns, `success` and 'failures', have been added to the udf table in the inventory database to store assessment data for UDFs. 
The `UdfsCrawler` class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the `$inventory.udfs` table, with a widget displaying this information as a counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to create the missing UC roles in AWS ([#1495](#1495)). The `databricks labs ucx` tool now includes a new command, `create-missing-principals`, which creates missing Unity Catalog (UC) roles in AWS for S3 locations that lack a UC-compatible role. This command is implemented using `IamRoleCreation` from `databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new command only supports AWS and does not affect Azure. The existing `migrate_credentials` function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including `AWSUCRoleCandidate` in `aws.py`, and `create_missing_principals` and `list_uc_roles` methods in `access.py`. The `create_uc_roles_cli` method in `access.py` has been refactored and renamed to `list_uc_roles`. New unit tests cover `create_missing_principals` for AWS and Azure, as well as the behavior when the command is not approved.
* Added baseline for workflow linter ([#1613](#1613)). This change introduces the `WorkflowLinter` class in the `application.py` file of the `databricks.labs.ucx.source_code.jobs` package. The class lints workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as `workspace_client`, `dependency_resolver`, `path_lookup`, and `migration_index`.
Several properties have been moved from `dependency_resolver` to the `CliContext` class, and the `NotebookLoader` class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The `generic` and `redash` modules from `databricks.labs.ucx.workspace_access` and the `GroupManager` class from `databricks.labs.ucx.workspace_access.groups` are used. The `VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from `databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class from `databricks.labs.ucx.installer.workflows` are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests. * Added linter to check for RDD use and JVM access ([#1606](#1606)). A new `AstHelper` class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the `AstHelper` class, which has been moved to a separate module. A new file, 'spark_connect.py', introduces a linter with three matchers to ensure conformance to best practices and catch potential issues early in the development process related to RDD usage and JVM access. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments. * Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in migrate_table workflow ([#1621](#1621)). 
The `migrate_tables` workflow in `workflows.py` has been enhanced to support a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables stored in DBFS root from the Hive Metastore to the Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the AclMigrationWhat.PRINCIPAL strategy. The `migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and `migrate_views` tasks now incorporate the new ACL migration strategy. These changes have been thoroughly tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities. * Added "seen tables" feature ([#1465](#1465)). The `seen tables` feature has been introduced, allowing for better handling of existing tables in the hive metastore and supporting their migration to UC. This enhancement includes the addition of a `snapshot` method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The `_crawl` function has been updated to check for and skip existing tables in the current workspace. New methods such as '_get_tables_paths_from_assessment', '_overwrite_records', and `_get_table_location` have been included to facilitate these improvements. In the testing realm, a new test `test_mount_listing_seen_tables` has been implemented, replacing 'test_partitioned_csv_jsons'. This test checks the behavior of the TablesInMounts class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the 'locations.py' file in the databricks/labs/ucx directory, related to the hive metastore. * Added support for `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` CLI command ([#1660](#1660)). 
This commit adds support for the `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` command, which checks for external tables that cannot be synced and prompts the user to run the `migrate-tables-ctas` workflow. Two new methods, `test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts, ctx=ctx)`, have been added. The first method checks if the `migrate-external-tables-ctas` workflow is called correctly, while the second method runs the workflow after prompting the user. The method `test_migrate_external_hiveserde_tables_in_place(ws)` has been modified to test if the `migrate-external-hiveserde-tables-in-place-experimental` workflow is called correctly. No new methods or significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature are software engineers who adopt the project. * Added support for migrating external location permissions from interactive cluster mounts ([#1487](#1487)). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the AzureACL class, granting necessary permissions to each cluster principal for each location. The existing `databricks labs ucx` command is modified, with the addition of the new method `create_external_locations` and thorough testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues [#1192](#1192) and [#1193](#1193), ensuring a more robust and controlled user experience with interactive clusters. * Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN ([#1631](#1631)). 
In this release, we've implemented new features to enhance the security and control over data access during the migration process for the SQL warehouse data access configuration. The `databricks labs ucx create-uber-principal` command now creates a service principal with read-only access to all the storage used by tables in the workspace. The UCX Cluster Policy and SQL Warehouse data access configuration will be updated to use this service principal for migration workflows. A new method, `_update_sql_dac_with_instance_profile`, has been introduced in the `access.py` file to update the SQL data access configuration with the provided AWS instance profile, ensuring a more streamlined management of instance profiles within the SQL data access configuration during the creation of an uber service principal (SPN). Additionally, new methods and tests have been added to the sql module of the databricks.sdk.service package to improve Azure resource permissions, handling different scenarios related to creating a global SPN in the presence or absence of various conditions, such as storage, cluster policies, or secrets. * Addressed issue with disabled features in certain regions ([#1618](#1618)). In this release, we have implemented improvements to address an issue where certain features were disabled in specific regions. We have added error handling when listing serving endpoints to raise a NotFound error if a feature is disabled, preventing the code from failing silently and providing better error messages. A new method, test_serving_endpoints_not_enabled, has been added, which creates a mock WorkspaceClient and raises a NotFound error if serving endpoints are not enabled for a shard. The GenericPermissionsSupport class uses this method to get crawler tasks, and if serving endpoints are not enabled, an error message is logged. These changes increase the reliability and robustness of the codebase by providing better error handling and messaging for this particular issue. 
Additionally, the change includes unit tests and manual testing to ensure the proper functioning of the new features.
* Aggregate UCX output across workspaces with CLI command ([#1596](#1596)). A new `report-account-compatibility` command has been added to the `databricks labs ucx` tool, enabling users to evaluate the compatibility of an entire Azure Databricks account with UCX. The command runs at the account level, aggregates UCX output across the workspaces in the account, and generates a readiness report by querying aspects of the account such as clusters, configurations, and data formats. It authenticates through the Azure CLI with AAD tokens and accepts a profile as an argument. The output includes warnings for workspaces that do not have UCX installed, and provides information about unsupported cluster types, unsupported configurations, data format compatibility, and more. The existing `manual-workspace-info` command remains unchanged. These changes simplify checking readiness and compatibility across an entire account.
* Assert if group name is in cluster policy ([#1665](#1665)). In this release, we have implemented a change to ensure that the display name of a specific workspace group (`ws_group_a`) is present in the cluster policy. This prevents a key error encountered previously. The cluster policy is now loaded as a dictionary, and the group name is checked to confirm its presence. If the group is not found, a message is raised alerting users. Additionally, the permission level for the group is verified to ensure it is set to `CAN_USE`. No new methods have been added, and existing functionality remains unchanged.
The test file test_ext_hms.py has been updated to include the new assertion and has undergone both unit tests and manual testing to ensure proper implementation. This change is intended for software engineers who adopt the project. * Automatically retrying with `auth_type=azure-cli` when constructing `workspace_clients` on Azure ([#1650](#1650)). This commit introduces automatic retrying with 'auth_type=azure-cli' when constructing `workspace_clients` on Azure, resolving TODO items for `AccountWorkspaces` and adding relevant suggestions in 'troubleshooting.md'. It closes issues [#1574](#1574) and [#1430](#1430), and includes new methods for generating readiness reports in `AccountAggregate` and testing the `get_accessible_workspaces` method in 'test_workspaces.py'. User documentation has been updated and the changes have been manually verified in a staging environment. For macOS and Windows users, explicit auth type settings are required for command line utilities. * Changes to identify service principal with custom roles on Azure storage account for principal-prefix-access ([#1576](#1576)). This release introduces several enhancements to the identification of service principals with custom roles on Azure storage accounts for principal-prefix-access. New methods such as `_get_permission_level`, `_get_custom_role_privilege`, and `_get_role_privilege` have been added to improve the functionality of the module. Additionally, two new classes, AzureRoleAssignment and AzureRoleDetails, have been added to enable more detailed management and access control for custom roles on Azure storage accounts. The 'test_access.py' file has been updated to include tests for saving custom roles in Azure storage accounts and ensuring the correct identification of service principals with custom roles. A new unit test function, test_role_assignments_custom_storage(), has also been added to verify the behavior of custom roles in Azure storage accounts. 
Overall, these changes provide a more efficient and fine-grained way to manage and control custom roles on Azure storage accounts. * Clarified unsupported config in compute crawler ([#1656](#1656)). In this release, we have made significant changes to clarify and improve the handling of unsupported configurations in our compute crawler related to the Hive metastore. We have expanded error messages for unsupported configurations and provided detailed recommendations for remediation. Additionally, we have added relevant user documentation and manually tested the changes. The changes include updates to the configuration for external Hive metastore and passthrough security model for Unity Catalog, which are incompatible with the current configurations. We recommend removing or altering the configs while migrating existing tables and views using UCX or other compatible clusters, and mapping the passthrough security model to a security model compatible with Unity Catalog. The code modifications include the addition of new methods for checking cluster init script and Spark configurations, as well as refining the error messages for unsupported configurations. We also added a new assertion in the `test_cluster_with_multiple_failures` unit test to check for the presence of a specific message regarding the use of the `spark.databricks.passthrough.enabled` configuration. This release is not yet verified on the staging environment. * Created a unique default schema when External Hive Metastore is detected ([#1579](#1579)). A new default database `ucx` is introduced for storing inventory in the hive metastore, with a suffix consisting of the workspace's client ID to ensure uniqueness when an external hive metastore is detected. The `has_ext_hms()` method is added to the `InstallationPolicy` class to detect external HMS and thereby create a unique default schema. 
The `_prompt_for_new_installation` method's default value for the `Inventory Database stored in hive_metastore` prompt is updated to use the new default database name, modified to include the workspace's client ID if external HMS is detected. Additionally, a test function `test_save_config_ext_hms` is implemented to demonstrate the `WorkspaceInstaller` class's behavior with external HMS, creating a unique default schema for improved system functionality and customization. This change is part of issue [#1579](#1579). * Extend service principal migration to create storage credentials for access connectors created for each storage account ([#1426](#1426)). This commit extends the service principal migration to create storage credentials for access connectors associated with each storage account, resolving issues [#1384](#1384) and [#875](#875). The update includes modifications to the existing `databricks labs ucx` command for creating access connectors, adds a new CLI command for creating storage credentials, and updates the documentation. A new workflow has been added for creating credentials for access connectors and service principals, and updates have been made to existing workflows. The commit includes manual, unit, and integration tests, and no new or modified methods are specified in the diff. The focus is on the feature description and its impact on the project's functionality. The commit has been co-authored by Serge Smertin and vuong-nguyen. * Suggest users to create Access Connector(s) with Managed Identity to access Azure Storage Accounts behind firewall ([#1589](#1589)). In this release, we have introduced a new feature to improve access to Azure Storage Accounts that are protected by firewalls. Due to limitations with service principals in such scenarios, we have developed Access Connectors with Managed Identities for more reliable connectivity. 
This change includes updates to the 'credentials.py' file, which introduces new methods for managing the migration of service principals to Access Connectors using Managed Identities. Users are warned that migrating to this new feature may cause issues when transitioning to UC, and are advised to validate external locations after running the migration command. This update enhances the security and functionality of the system, providing a more dependable method for accessing Azure Storage Accounts protected by firewalls. * Fixed catalog/schema grants when tables with same source schema have different target schemas ([#1581](#1581)). In this release, we have implemented a fix to address an issue where catalog/schema grants were not being handled correctly when tables with the same source schema had different target schemas. This was causing problems with granting appropriate permissions to users. We have modified the prepare_test function to include an additional test case with a different target schema for the same source table. Furthermore, we have updated the test_catalog_schema_acl function to ensure that grants are being created correctly for all catalogs, schemas, and tables. We have also added an extra query to grant use schema permissions for catalog2.schema3 to user1. Additionally, we have introduced a new `SchemaInfo` class to store information about catalogs and schemas, and refactored the `_get_database_source_target_mapping` method to return a dictionary that maps source databases to a list of `SchemaInfo` objects instead of a single dictionary. These changes ensure that grants are being handled correctly for catalogs, schemas, and tables, even when tables with the same source schema have different target schemas. This will improve the overall functionality and reliability of the system, making it easier for users to manage their catalogs and schemas. * Fixed Spark configuration parameter referencing secret ([#1635](#1635)). 
In this release, the code related to the Spark configuration parameter reference for a secret has been updated in the `access.py` file, specifically within the `_update_cluster_policy_definition` method. The change modifies the method to retrieve the OAuth client secret for a given storage account using an f-string to reference the secret, replacing the previous concatenation operator. This enhancement is aimed at improving the readability and maintainability of the code while preserving its functionality. Furthermore, the commit includes additional changes, such as new methods `test_create_global_spn` and "cluster_policies.edit", which may be related to this fix. These changes address the secret reference issue, ensuring secure access control and improved integration, particularly with the Spark configuration, benefiting engineers utilizing this project for handling sensitive information and managing clusters securely and effectively. * Fixed `migration-locations` and `assign-metastore` definitions in `labs.yml` ([#1627](#1627)). In this release, the `migration-locations` command in the `labs.yml` file has been updated to include new flags `subscription-id` and `aws-profile`. The `subscription-id` flag allows users to specify the subscription to scan the storage account in, and the `aws-profile` flag allows for authentication using a specified AWS Profile. The `assign-metastore` command has also been updated with a new description: "Enable Unity Catalog features on a workspace by assigning a metastore to it." The `is_account_level` parameter remains unchanged, and the new optional flag `workspace-id` has been added, allowing users to specify the Workspace ID to assign a metastore to. This change enhances the functionality of the `migration-locations` and `assign-metastore` commands, providing more options for users to customize their storage scanning and metastore assignment processes. 
The `migration-locations` and `assign-metastore` definitions in the `labs.yml` file have been fixed in this release.
* Fixed prompt for using external metastore ([#1668](#1668)). A fix has been implemented in the `create` function of the `policy.py` file to correctly prompt users for using an external metastore. Previously, a missing period and space in the prompt caused potential confusion. The updated prompt now includes a clarifying sentence, and the `_prompts.confirm` method has been modified to check if the user wants to set UCX to connect to an external metastore in two scenarios: when one or more cluster policies are set up for an external metastore, and when the workspace warehouse is configured for an external metastore. If the user chooses to set up an external metastore, an informational message will be recorded in the logger. This change ensures clear and precise communication with users during the external metastore setup process.
* Fixed storage account network ACLs retrieved from properties ([#1620](#1620)). This release includes a fix to the storage account network ACLs retrieval in the open-source library, addressing issue [#1](#1). Previously, the network ACLs were being retrieved from an incorrect location; this commit corrects that by obtaining them from the storage account's `properties.networkAcls` field. The `StorageAccount` class has been updated to modify the way the default network action is retrieved, with a new value `Unknown` added to the previous values `Deny` and `Allow`. The `from_raw_resource` class method has also been updated to retrieve the default network action from the `properties.networkAcls` field instead of the `networkAcls` field. This change may affect any functionality that relies on network ACL information and impacts the existing command `databricks labs ucx ...`.
Relevant tests, including a new test `test_azure_resource_storage_accounts_list_non_zero`, have been added and manually and unit tested to ensure the fix is functioning correctly. * Fully refresh table migration status in table migration workflow ([#1630](#1630)). This release introduces a new method, `index_full_refresh()`, to the table migration workflow for fully refreshing the migration status, addressing an oversight from a previous commit ([#1623](#1623)) and resolving issue [#1628](#1628). The new method resets the `_migration_status_refresher` before computing the index, ensuring the latest migration status is used for determining whether view dependencies have been migrated. The `index()` method was previously used to refresh the migration status, but it only provided a partial refresh. With this update, `index_full_refresh()` is utilized for a comprehensive refresh, affecting the `refresh_migration_status` task in multiple workflows such as `migrate_views`, `scan_tables_in_mounts_experimental`, and others. This change ensures a more accurate migration report, presenting the updated migration status. * Ignore existing corrupted installations when refreshing ([#1605](#1605)). A recent update has enhanced the error handling during the loading of installations in the `install.py` file. Specifically, the `installation.load` function now handles certain errors, including `PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by logging a warning message and skipping the corrupted installation instead of raising an error. This behavior has been incorporated into both the `configure` and `_check_inventory_database_exists` functions, allowing the installation process to continue even in the presence of issues with existing installations, while providing improved error messages. 
This change resolves issue [#1601](#1601) and introduces a new test case for a corrupted installation configuration, as well as an updated existing test case for `test_save_config` that includes a mock installation. * Improved exception handling ([#1584](#1584)). In this release, the exception handling during the upload of a wheel file to DBFS has been significantly improved. Previously, only PermissionDenied errors were caught and handled. Now, both BadRequest and PermissionDenied exceptions will be caught and logged as a warning. This change enhances the robustness of the code by handling a wider range of exceptions during the upload process. In addition, cluster overrides have been configured and DBFS write permissions have been set up. The specific changes made to the code include updating the import statement for NotFound to include BadRequest and modifying the except block in the _get_init_script_data method to catch both NotFound and BadRequest exceptions. These improvements ensure that the code can handle more types of errors, providing more helpful error messages and preventing crash scenarios, thereby enhancing the reliability and robustness of the code. * Improved exception handling for `migrate_acl` ([#1590](#1590)). In this release, the `migrate_acl` functionality has been enhanced to improve exception handling, addressing a flakiness issue in the `test_migrate_managed_tables_with_acl` test. Previously, unhandled `not found` exceptions during parallel test execution caused the flakiness. This release resolves this issue ([#1549](#1549)) by introducing error handling in the `test_migrate_acls_should_produce_proper_queries` test. A controlled error is now introduced to simulate a failed grant migration due to a `TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise testing of error handling and logging mechanisms when migration fails for specific objects, ensuring a more reliable testing environment for the `migrate_acl` functionality. 
* Improved reliability of table migration status refresher ([#1623](#1623)). This release introduces improvements to the table migration status refresher in the open-source library, enhancing its reliability and robustness. The `table_migrate` function has been updated to ensure that the table migration status is always reset when requesting the latest snapshot, addressing issues [#1623](#1623), [#1622](#1622), and [#1615](#1615). Additionally, the function now handles `NotFound` errors when refreshing migration status. The `get_seen_tables` function has been modified to convert the returned iterator to a list and raise a `NotFound` exception if the schema does not exist, which is then caught and logged as a warning. Furthermore, the migration status reset behavior has been improved, and the `migration_status_refresher` parameter type in the `TableMigrate` class constructor has been modified. New private methods `_index_with_reset()` and updated `_migrate_views()` and `_view_can_be_migrated()` methods have been added to ensure a more accurate and consistent table migration process. The changes have been thoroughly tested and are ready for review. * Refresh migration status at the end of the `migrate_tables` workflows ([#1599](#1599)). In this release, updates have been made to the migration status at the end of the `migrate_tables` workflows, with no new or modified tables or methods introduced. The `_migration_status_refresher.reset()` method has been added in two locations to ensure accurate migration status updates. A new `refresh_migration_status` method has been included in the `RuntimeContext` class in the `databricks.labs.ucx.hive_metastore.workflows` module, which refreshes the migration status for presentation in the dashboard. 
The changes also include the addition of the `refresh_migration_status` task in `migrate_views`, `migrate_views_with_acl`, and `scan_tables_in_mounts_experimental` workflows, and the `migration_report` method is now dependent on the `refresh_migration_status` task. Thorough testing has been conducted, including the creation of a new integration test in the file `tests/integration/hive_metastore/test_workflows.py` to verify that the migration status is refreshed after the migration job is run. These changes aim to ensure that the migration status is up-to-date and accurately presented in the dashboard. * Removed DBFS library installations ([#1554](#1554)). In this release, the "configure.py" file has been removed, which previously contained the `ConfigureClusterOverrides` class with methods for validating cluster IDs, distinguishing between classic and Table Access Control (TACL) clusters, and building a prompt for users to select a valid active cluster ID. The removal of this file signifies that these functionalities are no longer available. This change is part of a larger commit that also removes DBFS library installations and updates the Estimates Dashboard to remove metastore assignment, addressing issue [#1098](#1098). The commit has been tested via integration tests and manual installation and running of UCX on a no-uc environment. Please note that the `create_jobs` method in the `install.py` file has been updated to reflect these changes, ensuring a more straightforward installation experience and usage of the Estimates Dashboard. * Removed the `Is Terraform used` prompt ([#1664](#1664)). In this release, we have removed the `is_terraform_used` prompt from the configuration file and the installation process in the ucx package. This prompt was not being utilized and had been a source of confusion for some users. 
Although the variable that stored its outcome will be retained for backwards compatibility, no new methods or modifications to existing functionality have been introduced. No tests have been added or modified as part of this change. The removal of this prompt simplifies the configuration process and aligns with the project's future plans to eliminate the use of Terraform state for ucx migration. Manual testing has been conducted to ensure that the removal of the prompt does not affect the functionality of other properties in the configuration file or the installation process. * Resolve relative paths when building dependency graph ([#1608](#1608)). This commit introduces support for resolving relative paths when building a dependency graph in the UCX project, addressing issues 1202, 1499, and 1287. The SysPathProvider now includes a `cwd` attribute, and a new class, LocalNotebookLoader, has been implemented to handle local files and folders. The PathLookup class is used to resolve paths, and new methods have been added to support these changes. Unit tests have been provided to ensure the correct functioning of the new functionality. This commit replaces issue 1593 and enhances the project's ability to handle local files and folders, resulting in a more robust and reliable dependency graph. * Show tables migration status in migration dashboard ([#1507](#1507)). A migration dashboard has been added to display the status of data object migrations, addressing issue [#323](#323). This new feature includes a query to show the migration status of tables, a new CLI command, and a modification to an existing command. The `migration-*` workflow has been updated to include a refresh migration dashboard option. The `mock_installation` function has been modified with an updated state.json file. The changes were tested manually, and the new SQL query file can be found in the `migrations/main` directory. 
This migration dashboard provides users with an easier way to monitor the progress and status of their data migration tasks. * Simulate loading of local files or notebooks after manipulation of `sys.path` ([#1633](#1633)). This commit updates the PathLookup process during the construction of the dependency graph, addressing issues [#1202](#1202) and [#1468](#1468). It simplifies the DependencyGraphBuilder by directly using the DependencyResolver with resolvers and lookup passed as arguments, and removes the DependencyGraphBuilder. The changes include new methods for handling compatibility checks, but no new user-facing features or changes to command-line interfaces or existing workflows are introduced. Unit tests are included to ensure correct behavior. The modifications aim to improve the internal handling of dependency resolution and compatibility checks. * Test if `create-catalogs-schemas` works with tables defined as mount paths ([#1578](#1578)). This release includes a new unit test for the `create-catalogs-schemas` logic that verifies the correct creation and management of catalogs and schemas defined as mount paths. The test checks the storage location of catalogs, ensures non-existing schemas are properly created, and prevents the creation of catalogs without a storage location. It also verifies the catalog schema ACL is set correctly. Using the `CatalogSchema` class and various test functions, the test creates and grants permissions to catalogs and schemas. This change resolves issue [#1039](#1039) without modifying any existing commands or workflows. The release contains no new CLI commands or user documentation, but includes unit tests and assertion calls to validate the behavior of the `create_all_catalogs_schemas` method. * Upgraded `databricks-sdk` to 0.27 ([#1626](#1626)). In this release, the `databricks-sdk` package has been upgraded to version 0.27, bringing updated methods for Redash objects. 
The `_install_query` method in the `dashboards.py` file has been updated to include a `tags` parameter, set to `None`, when calling `self._ws.queries.update` and `self._ws.queries.create`. This ensures that the updated SDK version is used and that tags are not applied during query updates and creation. Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint` packages have been updated to versions 0.4.0 and 0.4.3 respectively, and the dependency for PyYAML has been updated to a version between 6.0.0 and 7.0.0. These updates may impact the functionality of the project. The changes have been manually tested, but there is no verification on a staging environment. * Use stack of dependency resolvers ([#1560](#1560)). This pull request introduces a stack-based implementation of resolvers, resolving issues [#1202](#1202), [#1499](#1499), and [#1421](#1421), and implements an initial version of SysPathProvider, while eliminating previous hacks. The new functionality includes modified existing commands, a new workflow, and the addition of unit tests. No new documentation or CLI commands have been added. The `problem_collector` parameter is not addressed in this PR and has been moved to a separate issue. The changes include renaming and moving a Python file, as well as modifications to the `Notebook` class and its related methods for handling notebook dependencies and dependency checking. The code has been tested, but manual testing and integration tests are still pending.
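The "Ignore existing corrupted installations when refreshing" entry above describes a log-and-skip pattern. A minimal sketch of that idea follows, assuming a hypothetical `load_all` helper and a hypothetical `fake_load` loader; the locally defined `SerdeError` and the built-in `PermissionError` are stand-ins for the exception types named in the notes, not the actual UCX or SDK classes.

```python
import logging

logger = logging.getLogger(__name__)


class SerdeError(Exception):
    """Hypothetical stand-in for a (de)serialization error."""


def load_all(candidates, load):
    """Load every candidate, skipping the ones that fail with a known error."""
    loaded = []
    for name in candidates:
        try:
            loaded.append(load(name))
        except (PermissionError, SerdeError, ValueError, AttributeError) as err:
            # Assumption: warn and continue instead of aborting the whole refresh.
            logger.warning("Skipping corrupted installation %s: %s", name, err)
    return loaded


def fake_load(name):
    # Hypothetical loader: names containing "broken" simulate a corrupted config.
    if "broken" in name:
        raise SerdeError(f"cannot parse config for {name}")
    return {"installation": name}


result = load_all(["alice", "broken-bob", "carol"], fake_load)
```

In this sketch the corrupted entry is logged and dropped, so the caller still receives the two installations that loaded cleanly.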
Merged
nfx
added a commit
that referenced
this pull request
Jun 4, 2024
* Added handling for legacy ACL `DENY` permission in group migration ([#1815](#1815)). In this release, the handling of `DENY` permissions during group migrations in our legacy ACL table has been improved. Previously, `DENY` operations were denoted with a `DENIED` prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the _apply_grant_sql method to check for the presence of `DENIED` in the action_type, removing the prefix, and enclosing the action type in backticks to prevent syntax errors. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue [#1803](#1803). A new test function, test_hive_deny_sql(), has also been added to test the behavior of the `DENY` permission. * Added handling for parsing corrupted log files ([#1817](#1817)). The `logs.py` file in the `src/databricks/labs/ucx/installer` directory has been updated to improve the handling of corrupted log files. A new block of code has been added to check if the logs match the expected format, and if they don't, a warning message is logged and the function returns, preventing further processing and potential production of incorrect results. The changes include a new method `test_parse_logs_warns_for_corrupted_log_file` that verifies the expected warning message and corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files. * Added known problems with `pyspark` package ([#1813](#1813)). In this release, updates have been made to the `src/databricks/labs/ucx/source_code/known.json` file to document known issues with the `pyspark` package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. 
A new `KnownProblem` dataclass has been added to the `known.py` file, which includes methods for converting the object to a dictionary for better encoding of problems. The `_analyze_file` method has also been updated to use a `known_problems` set of `KnownProblem` objects, improving readability and management of known problems within the application. These changes address issue [#1813](#1813) and improve the documentation of known issues with `pyspark`. * Added library linting for jobs launched on shared clusters ([#1689](#1689)). This release includes an update to add library linting for jobs launched on shared clusters, addressing issue [#1637](#1637). A new function, `_register_existing_cluster_id(graph: DependencyGraph)`, has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the `test_jobs.py` file in the `tests/integration/source_code` directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the `jobs` and `compute` modules from the `databricks.sdk.service` package. Additionally, a new `WorkflowTaskContainer` method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by ensuring that jobs run smoothly on shared clusters by checking for and handling missing libraries. Software engineers will benefit from these improvements as it will reduce the occurrence of errors due to missing libraries on shared clusters. * Added linters to check for spark logging and configuration access ([#1808](#1808)). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. 
The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters and updates have been made to existing ones. The new linters have been added to the `SparkConnectLinter` class and are executed as part of the `databricks labs ucx` command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected. * Added list of known dependency compatibilities and regeneration infrastructure for it ([#1747](#1747)). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the `known.json` file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library. * Added more known libraries from Databricks Runtime ([#1812](#1812)). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. 
These libraries are now integrated in the known libraries file in the JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios. * Added more known packages from Databricks Runtime ([#1814](#1814)). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility. * Added support for `.egg` Python libraries in jobs ([#1789](#1789)). This commit adds support for `.egg` Python libraries in jobs by registering egg library dependencies to DependencyGraph for linting, addressing issue [#1643](#1643). It includes the addition of a new class, `PythonLibraryResolver`, which replaces the old `PipResolver` and is used to register egg library dependencies in the `DependencyGraph`. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the `test_dependencies.py` file, specifically in the import section where `PipResolver` is replaced with `PythonLibraryResolver` from the `databricks.labs.ucx.source_code.python_libraries` package. 
These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from `.egg` files. * Added table migration workflow guide ([#1607](#1607)). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, and Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and improved user experience. * Added workflow linter for spark python tasks ([#1810](#1810)). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the `_register_spark_python_task` method in the `jobs.py` file. If the task is not a Spark Python task, an empty list is returned, and if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. 
The `test_job_spark_python_task_linter_happy_path` test checks the linter on a valid job configuration where all required libraries are specified, while the `test_job_spark_python_task_linter_unhappy_path` test checks the linter on an invalid job configuration where required libraries are not specified. These tests ensure that the workflow linter for Spark Python tasks is functioning correctly and can help identify any potential issues in job configurations. * Connect all linters to `LinterContext` and add functional testing framework ([#1811](#1811)). This commit connects all linters, including those related to JVM, to the critical path for improved code linting, and introduces a functional testing framework to simplify the writing of code linting verification tests. The `pyproject.toml` file has been updated to include a new configuration for the `ignore-paths` option, utilizing a regular expression to exclude certain files or directories from linting. The testing framework is particularly useful for verifying the correct functioning of linters, reducing the risk of errors and improving the overall development experience. These changes will help to improve the reliability and efficiency of the linting process, making it easier to write and maintain high-quality code. * Deduplicate errors emitted by Spark Connect linter ([#1824](#1824)). This pull request introduces error deduplication for the Spark Connect linter and adds new functional tests using an updated framework. The modifications include the addition of user documentation and unit tests, as well as alterations to existing commands and workflows. Specifically, a new CLI command has been added, and the command `databricks labs ucx ...` has been modified. Additionally, a new workflow has been implemented, and an existing workflow has been updated. No new tables or modifications to existing tables are present. 
Testing has been conducted through manual testing and new unit tests, with no integration tests or staging environment tests specified. The `verify` method in the `test_functional.py` file has been updated to sort the actual problems list before comparing it to the expected problems list, ensuring consistent ordering of results. The changes aim to improve the functionality and usability of the Spark Connect linter for our software engineer audience. * Download wheel dependency locally to register it to the dependency graph ([#1704](#1704)). A new feature has been implemented in the open-source library to enhance dependency management for wheel files. Previously, when the library type was wheel, a `not-yet-implemented` DependencyProblem would be yielded. Now, the system downloads the wheel file from a remote location, saves it to a temporary directory, and registers the local file to the dependency graph. This allows for more comprehensive handling of wheel dependencies, as they are now downloaded and registered instead of simply being flagged as "not-yet-implemented". Additionally, new functions for creating jobs, making notebooks, and generating random values have been added to enable more comprehensive testing of the workflow linter. New tests have been implemented to check the linter's behavior when there is a missing library dependency and to verify that the linter correctly handles wheel dependencies. These changes improve the testing capabilities of the workflow linter and ensure that all dependencies are properly accounted for and managed within the system. A new test method, 'test_workflow_task_container_builds_dependency_graph_for_python_wheel', has been added to ensure that the dependency graph is built correctly for Python wheels and to improve test coverage. * Drop pyspark `register` lint matcher ([#1818](#1818)). 
In the latest release, the `register` lint matcher has been removed from pyspark, indicating that the specific usage pattern for the `register` method in UDTFRegistration is no longer required. This change affects the linting process during code reviews, but does not impact the functionality of the code directly. Other matchers for DataFrame, DataFrameReader, DataFrameWriter, and direct filesystem access remain unchanged. The `register` method, which was likely used to register a temporary table or view in pyspark, is no longer considered a best practice or necessary feature. If you previously relied on the `register` method in your pyspark code, you will need to find an alternative solution. This update aims to improve the quality and consistency of pyspark code by removing outdated or unnecessary functionality. * Enabled joining an existing installation to a collection ([#1799](#1799)). This change introduces several new features and modifications to the open-source library, aimed at enhancing the management and organization of workspaces within a collection. A new command `join-collection` has been added to allow a workspace to join a collection using its workspace ID. The `report-account-compatibility` command has been updated with a new flag `--workspace-ids`, and the `alias` command has been updated with a new description. Two new commands `principal-prefix-access` and `create-missing-principals` have been introduced for AWS, and a new command `create-uber-principal` has been introduced for Azure to handle the creation of service principals with STORAGE BLOB READER access for storage accounts used by tables in the workspace. The code's readability and maintainability have been improved by modifying the method `_can_administer` to `can_administer` and `_load_workspace_info` to `load_workspace_info` in the `workspaces.py` file. 
A new `join_collection` command has been added to the `ucx` application instance to enable joining an existing installation to a collection. Additionally, modifications to the `install.py` file and `test_installation.py` file have been made to facilitate the integration of existing installations into a collection. The tests have been updated to ensure that the joining process works correctly in various scenarios. Overall, these changes provide more flexibility and ease of use for users and improve the interoperability and security of the system. * Fixed `migrate-credential` cli command on AWS ([#1732](#1732)). In this release, the `migrate-credential` CLI command for AWS has been improved and fixed. The command now includes changes to the `access.py` file in the `databricks/labs/ucx/aws` directory. Notable updates are the refactoring of the `role_name` method into a dataclass called `AWSCredentialCandidate`, the addition of the method `_aws_role_trust_doc`, and the removal of the `_databricks_trust_statement` method. The `_aws_s3_policy` method has been updated to include `s3:PutObjectAcl` in the allowed actions, and methods `_create_role` and `_get_role_access_task` have been updated to use `arn` instead of `role_name`. Additionally, the `create_uc_role` and `update_uc_trust_role` methods have been combined into a single `update_uc_role` method. The `migrate-credentials` command in the `cli.py` file has also been updated to support migration of AWS Instance Profiles to UC storage credentials. These improvements resolve issue [#1726](#1726) and enhance the functionality and reliability of the `migrate-credential` command for AWS. * Fixed crasher when running migrate-local-code ([#1794](#1794)). In this release, we have addressed a crasher issue that occurred when running the `migrate-local-code` command. 
The change involves modifying the `local_file_migrator` property in the `LocalCheckoutContext` class to use a lambda function instead of directly passing `self.languages`. This ensures that the languages are loaded only when the `local_file_migrator` property is accessed, preventing unnecessary load and potential crashes. The change does not introduce any new functionalities, but instead modifies existing commands related to local file migration. Comprehensive manual testing and unit tests have been conducted to ensure the fix works as expected without negatively impacting other parts of the system. * Fixed inconsistent behavior in `%pip` cell handling ([#1785](#1785)). This PR addresses inconsistent behavior in `%pip` cell handling by modifying Python library installation to occur in a designated path lookup, rather than deep within the library tree. These changes impact various components, such as the `PipResolver` class, which no longer requires a `FileLoader` instance as an argument and now takes a `Whitelist` instance directly. Additionally, tests like `test_detect_s3fs_import` and `test_detect_s3fs_import_in_dependencies` are affected by these modifications. Overall, these changes streamline the `%pip` feature, improving library installation efficiency and consistency. * Fixed issue when creating view using `WITH` clause ([#1809](#1809)). In this release, we have addressed an issue that occurred when creating a view using a `WITH` clause, which was causing potential errors or incorrect results due to improper handling of aliases. A new method, `_read_aliases`, has been introduced to read and store aliases from the `WITH` clause as a set, and during view dependency analysis, if an old table's name matches an alias, it is now skipped to prevent double-counting. This ensures improved accuracy and reliability of view creation with `WITH` clauses. 
Moreover, the commit includes adjustments to import statements, addition of unit tests, and the introduction of a new class `TableView` in the `databricks.labs.ucx.hive_metastore.view_migrate` module to test whether a view with a local dataset should be skipped. This release also includes a test for migrating a view with columns, ensuring that views with local datasets are now handled correctly. The fix resolves issue [#1798](#1798). * Fixed linting for non-UTF8 encoded files ([#1804](#1804)). This commit addresses linting issues for files that are not encoded in UTF-8, improving compatibility with non-UTF-8 encoded files in the databricks labs ucx project. Previously, the linter and fixer tools were unable to process non-UTF-8 encoded files, causing them to fail. This issue has been resolved by adding a check for file encoding during linting and handling the case where the file is not encoded in UTF-8 by returning a failure message. The fix uses a `getpreferredencoding(False)` call to determine the encoding, ensuring UTF-8 compatibility. Additionally, a new test method, `test_file_linter_lints_non_ascii_encoded_file`, has been added to check the linter's behavior with non-ASCII encoded files. This enhancement simplifies the linting process, allowing for better file handling of non-UTF-8 encoded files, and is supported by manual testing and unit tests. * Further fix for DENY permissions ([#1834](#1834)). This commit addresses issue [#1834](#1834) by implementing a fix for handling DENY permissions in the legacy TACL migration logic. Previously, all permissions were grouped in a single GRANT statement; they are now split into separate GRANT and DENY statements. This change improves the clarity and maintainability of the code and also increases test coverage with the addition of unit tests and integration tests. 
A new test function `test_tacl_applier_deny_and_grant()` has been added to demonstrate the use of the updated logic for handling DENY permissions. The resulting SQL queries now include both GRANT and DENY statements, reflecting the updated logic. These changes ensure that the DENY permissions are correctly applied, increasing the overall test coverage and confidence in the code. * Removed false warning on DataFrame.insertInto() about the default format changing from parquet to delta ([#1823](#1823)). This pull request removes a false warning related to the use of DataFrameWriter.insertInto(), which had been incorrectly flagging a potential issue due to the default format change from Parquet to Delta. The warning is now suppressed as it is no longer relevant, since the operation ignores any specified format and uses the existing format of the underlying table. Additionally, an unnecessary linting suppression has been removed. These changes improve the accuracy of the warning system and eliminate confusion for users, with no impact on functionality, usability, or performance. The changes have been manually tested and do not require any new unit or integration tests, CLI commands, workflows, or tables. * Support linting python wheel tasks ([#1821](#1821)). This release introduces support for linting python wheel tasks, addressing issue [#1](#1). * Updated linting checks for Spark table methods ([#1816](#1816)). This commit updates linting checks for PySpark's Spark table methods, focusing on improving handling of migrated tables and deprecating direct filesystem references in favor of the Unity Catalog. New tests and examples include literal and variable references to known and unknown tables, as well as cases with extra or out-of-position arguments. The commit also highlights false positives and trivial references in unrelated contexts. 
These changes aim to ensure proper usage of Spark table methods, improve codebase consistency, and minimize potential issues related to migrations and format changes. Dependency updates: * Updated sqlglot requirement from <24.1,>=23.9 to >=23.9,<24.2 ([#1819](#1819)).
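To illustrate the "Further fix for DENY permissions" entry above, here is a minimal sketch of splitting a mixed set of legacy permissions into separate GRANT and DENY statements. The function name `split_grant_and_deny`, the `DENIED_` prefix convention, and the exact statement layout are assumptions made for illustration; this is not the UCX implementation.

```python
from typing import List

# Assumption: denied actions are stored with this prefix, e.g. "DENIED_SELECT".
DENIED_PREFIX = "DENIED_"


def split_grant_and_deny(
    action_types: List[str], object_type: str, object_key: str, principal: str
) -> List[str]:
    """Build one GRANT and one DENY statement instead of a single GRANT.

    Hypothetical helper: names and SQL layout are illustrative only.
    """
    grants = [a for a in action_types if not a.startswith(DENIED_PREFIX)]
    # Strip the prefix so that "DENIED_MODIFY" becomes the action "MODIFY".
    denies = [a[len(DENIED_PREFIX):] for a in action_types if a.startswith(DENIED_PREFIX)]

    statements = []
    if grants:
        actions = ", ".join(grants)
        statements.append(f"GRANT {actions} ON {object_type} {object_key} TO `{principal}`")
    if denies:
        # Backticks around each denied action guard against syntax errors.
        actions = ", ".join("`" + d + "`" for d in denies)
        statements.append(f"DENY {actions} ON {object_type} {object_key} TO `{principal}`")
    return statements
```

Called with `["SELECT", "DENIED_MODIFY"]`, the sketch yields two statements rather than one GRANT that silently includes the denied action.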
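The non-UTF8 linting fix described above can be pictured along these lines: decode the file contents with the platform's preferred encoding and report a failure message instead of raising. The helper name `decode_source` and the wording of the failure message are hypothetical; only `locale.getpreferredencoding(False)` comes from the release note.

```python
import locale
from typing import Optional, Tuple


def decode_source(raw: bytes, encoding: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, failure_message) on failure.

    Hypothetical helper: the real linter's types and messages may differ.
    """
    # Fall back to the platform's preferred encoding when none is given.
    encoding = encoding or locale.getpreferredencoding(False)
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError:
        # Report the problem as a lint failure rather than propagating the error.
        return None, f"unsupported-encoding: cannot decode source as {encoding}"
```

Returning a message rather than raising lets a caller keep linting the remaining files after hitting one it cannot decode.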
nfx
added a commit
that referenced
this pull request
Jun 4, 2024