Skip to content

Add databricks labs ucx validate-table-locations command #673

@nfx

Description

@nfx

Description

databricks labs ucx validate-table-locations --workspace-profiles profile-one,profile-two,profile-three to detect conflicts for table migration across the specified set of workspaces.

Implementation

Create an exception list at the account level containing tables that show up on more than on workspace:

  • Pointing to the same cloud storage location
  • With different metadata
  • With different ACLs

Addressing table conflicts/duplication require special processing we have the following options

  • Define a "master" and create derivative objects as views
  • Flag and skip the dupes
  • Duplicate the data and create dupes
  • Consider upgrading a workspace at a time. Highlight the conflict with prior upgrades.

Now for tables, there also needs to be a report on table/db inconsistency - like:

Workspace Table(s)
A db1.tbl1, db1.tbl3
B db1.tbl2

And the team(s) that are driving UC Migration within account would make a decision after some time in review (of excel spreadsheet). By the way, we can split UCX installation across different Azure Subscriptions. And every installation would just focus on defining target catalog mapping per database.

Questions

Situation

  • Two workspaces, same dbs, all different tables and columns (all managed tables, effectively)
  • Two workspaces, same dbs, 90% same tables, 10% are different tables
  • Two workspaces, two different dbs

We can technically support both db_to_catalog and workspace_to_catalog, and even at the same time, but db_to_catalog will override workspace_to_catalog. We also need default_catalog_for_workspace, if workspace_to_catalog is set (default catalog for all workspaces is set per metastore)..

We can also do another override for tables, but we have unanswered questions:

  • What if same db, same workspace, same table, but different columns/order/types? Do we ignore and keep in hive metastore? And then rerun the scan for tables and grants?
  • What if during migration catalog/database/table were deleted either from hms and/or uc?

Speaking of metastores, in the beginning, there needs to be workspace_to_metastore mapping with default_metastore_for_workspace:

  • Can we come up with a good default mapping here?
    • Coarse or fine grained?
    • Select between the two?
    • Ask for inline input?
  • How many conflicts we expect to justify the need to create/support custom mapping?
    • In what what future-proof configuration format might we need for this mapping.

Thinking of an incremental approach here:

  • Add workspace-id to the schema-name
    • eg: ucx-##########
    • This can be retrieved via the workspace URL; prob best added to the SDK
    • This ensures that each workspace inventory is isolated; regardless of if there is an external HMS or not
  • No need to alter "destroy" scripts
  • Add workspace-id as a column to every inventory table to treat tables as multi-tenant
    • eg: workspace_id
    • This ensures that data is always segmented; even if a user overrides the config for schema name
  • Add metastore object to inventory
    • Not sure if there is a natural identifier or not
    • Capture if this metastore is internal, external, glue, config, etc.
  • Add metastore-id column to any related inventory object
  • Create Workflow to switch inventory from HMS to UC
  • These tasks would ideally be included at the end of the workflow for Step 1 "Assign Metastore to Workspace"
  • Copy inventory data from HMS ucx-########## schema to ucx CATALOG in the same set of tables
  • Since each table has workspace-id column and same schema, this is a simple INSERT INTO SELECT * FROM
  • Run Destroy workflow to purge data from HMS
  • Update config file to redirect ucx location from HMS to UC
  • From this point forward, UCX in this workspace will reference the ucx CATALOG
  • When running the "mapping" portion for any object, UCX logic will examine the data in the UC tables and highlight if there are any mappings which must be reviewed and approved manually
    Each object will have a mapping table from old to new
    Exporting / Importing to a google sheet would be a good interface for this
    The same process can be run for each new workspace that gets added
    Migration of objects is performed based on the mapping tables

Related issues:

Upstream dependencies:

Downstream dependencies:

There are 4 possible actions in case of failure:

Metadata

Metadata

Assignees

Labels

feat/account-levelcross-workspace installationsfeat/cliCLI commandsfeat/migration-indexmapping of databases to catalog or potentially other databasesmigrate/managedgo/uc/upgrade Upgrade Managed Tables and Jobs

Type

No type

Projects

Status

No status

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions