Create index on `run_uuid` columns to improve SQL operations by harupy · Pull Request #5443 · mlflow/mlflow

harupy · 2022-03-03T01:07:37Z

Signed-off-by: harupy [email protected]

What changes are proposed in this pull request?

Fixes [BUG] No index on foreign keys, in postgres store #3785
Create index on run_uuid columns for PostgreSQL to improve SQL operations.

How is this patch tested?

Thank you @mberr for investigating the impact of creating an index: #3785 (comment)

Does this PR change the documentation?

No. You can skip the rest of this section.
Yes. Make sure the changed pages / sections render correctly by following the steps below.

Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
next step, otherwise fix it.
Click Details on the right to open the job page of CircleCI.
Click the Artifacts tab.
Click docs/build/html/index.html.
Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: harupy <[email protected]>

harupy · 2022-03-03T02:23:03Z

mlflow/store/db_migrations/versions/bd07f7e963c5_create_index_on_run_uuid_for_postgres.py

+        for table in ["params", "metrics", "latest_metrics", "tags"]:
+            op.create_index(f"index_{table}_run_uuid", table, ["run_uuid"])


These are unindexed foreign keys (source: #3785 (comment)):

| table | columns | size | constraint | referenced_table |
|---|---|---|---|---|
| runs | experiment_id | 8192 bytes | runs_experiment_id_fkey | experiments |
| ✅ tags | run_uuid | 8192 bytes | tags_run_uuid_fkey | runs |
| ✅ metrics | run_uuid | 0 bytes | metrics_run_uuid_fkey | runs |
| ✅ params | run_uuid | 0 bytes | params_run_uuid_fkey | runs |
| experiment_tags | experiment_id | 0 bytes | experiment_tags_experiment_id_fkey | experiments |
| ✅ latest_metrics | run_uuid | 0 bytes | latest_metrics_run_uuid_fkey | runs |
| registered_model_tags | name | 0 bytes | registered_model_tags_name_fkey | registered_models |
| model_version_tags | name,version | 0 bytes | model_version_tags_name_version_fkey | model_versions |

This migration script creates an index on the foreign key column of each table marked with ✅.

PostgreSQL log looks like this:

...
db-postgres-1 | 2022-03-03 03:15:54.405 UTC [69] LOG: statement:
db-postgres-1 | CREATE TABLE model_version_tags (
db-postgres-1 | key VARCHAR(250) NOT NULL,
db-postgres-1 | value VARCHAR(5000),
db-postgres-1 | name VARCHAR(256) NOT NULL,
db-postgres-1 | version INTEGER NOT NULL,
db-postgres-1 | CONSTRAINT model_version_tag_pk PRIMARY KEY (key, name, version),
db-postgres-1 | FOREIGN KEY(name, version) REFERENCES model_versions (name, version) ON UPDATE cascade
db-postgres-1 | )
db-postgres-1 |
db-postgres-1 |
db-postgres-1 | 2022-03-03 03:15:54.410 UTC [69] LOG: statement: ALTER TABLE model_versions ADD COLUMN run_link VARCHAR(500)
db-postgres-1 | 2022-03-03 03:15:54.412 UTC [69] LOG: statement: ALTER TABLE model_versions ALTER COLUMN run_id DROP NOT NULL
db-postgres-1 | 2022-03-03 03:15:54.415 UTC [69] LOG: statement: ALTER TABLE latest_metrics ALTER COLUMN is_nan TYPE BOOLEAN
db-postgres-1 | 2022-03-03 03:15:54.415 UTC [69] LOG: statement: ALTER TABLE latest_metrics ALTER COLUMN is_nan SET NOT NULL
db-postgres-1 | 2022-03-03 03:15:54.416 UTC [69] LOG: statement: ALTER TABLE metrics ALTER COLUMN is_nan TYPE BOOLEAN
db-postgres-1 | 2022-03-03 03:15:54.417 UTC [69] LOG: statement: ALTER TABLE metrics ALTER COLUMN is_nan SET NOT NULL
👇👇👇👇👇👇👇👇👇👇👇👇👇👇👇👇
db-postgres-1 | 2022-03-03 03:15:54.420 UTC [69] LOG: statement: CREATE INDEX index_params_run_uuid ON params (run_uuid)
db-postgres-1 | 2022-03-03 03:15:54.423 UTC [69] LOG: statement: CREATE INDEX index_metrics_run_uuid ON metrics (run_uuid)
db-postgres-1 | 2022-03-03 03:15:54.425 UTC [69] LOG: statement: CREATE INDEX index_latest_metrics_run_uuid ON latest_metrics (run_uuid)
db-postgres-1 | 2022-03-03 03:15:54.428 UTC [69] LOG: statement: CREATE INDEX index_tags_run_uuid ON tags (run_uuid)
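
The four `CREATE INDEX` statements in the log can be reproduced outside Alembic. The following is a minimal sketch, assuming an in-memory SQLite database and heavily simplified stand-ins for the four MLflow tables (the real schemas have more columns and constraints). It issues the same statements, with the same `index_<table>_run_uuid` naming, and then lists the index names that exist afterwards.

```python
import sqlite3

TABLES = ["params", "metrics", "latest_metrics", "tags"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_uuid TEXT PRIMARY KEY)")
for table in TABLES:
    # Simplified stand-in schema (assumption): only the columns needed here.
    conn.execute(
        f"CREATE TABLE {table} ("
        "key TEXT NOT NULL, "
        "value TEXT, "
        "run_uuid TEXT NOT NULL REFERENCES runs (run_uuid))"
    )

# Same naming scheme as the migration: index_<table>_run_uuid.
for table in TABLES:
    conn.execute(f"CREATE INDEX index_{table}_run_uuid ON {table} (run_uuid)")

# Collect the explicitly created indexes from SQLite's schema table.
created_indexes = sorted(
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name LIKE 'index_%'"
    )
)
print(created_indexes)
```

In the real migration the statements are emitted by `op.create_index`, so the exact SQL may differ slightly between database backends.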

Do we want to add an index on other columns?

It certainly seems reasonable to create an index on experiment_id since we do have a number of joins employing that and a restrictive filter query on updates to the name. @dbczumar thoughts?

BenWilson2

LGTM!

harupy · 2022-03-03T06:34:24Z

mlflow/store/db_migrations/versions/bd07f7e963c5_create_index_on_run_uuid_for_postgres.py

+    # As a fix for https://github.com/mlflow/mlflow/issues/3785, create indexes on run_uuid columns
+    # to speed up SQL operations.
+    bind = op.get_bind()
+    if bind.engine.name == "postgresql":


We might need to add sqlite. https://www.sqlite.org/foreignkeys.html#fk_indexes says:

an index should be created on the child key columns of each foreign key constraint.

Could we add this for all DBs, including mysql? Is this specific to Postgres?
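
To illustrate why the SQLite documentation recommends indexing child key columns, here is a rough sketch using Python's built-in `sqlite3` module. The `params` schema is a simplified assumption (a composite primary key with `key` first, so the primary key index cannot serve a lookup by `run_uuid` alone), and the run ID is a placeholder. It compares the query plan for a lookup by `run_uuid` before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_uuid TEXT PRIMARY KEY)")
# Simplified stand-in for the real `params` table (assumption).
conn.execute(
    "CREATE TABLE params ("
    "key TEXT NOT NULL, "
    "value TEXT NOT NULL, "
    "run_uuid TEXT NOT NULL REFERENCES runs (run_uuid), "
    "PRIMARY KEY (key, run_uuid))"
)

QUERY = "SELECT value FROM params WHERE run_uuid = ?"


def plan_details(connection):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("some-run",))
    return [row[-1] for row in rows]


plan_before = plan_details(conn)
conn.execute("CREATE INDEX index_params_run_uuid ON params (run_uuid)")
plan_after = plan_details(conn)

# The exact wording of the plan varies between SQLite versions.
print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists, the plan should describe a scan of `params`; afterwards it should name `index_params_run_uuid`. The same reasoning applies to parent-row deletes and updates, which have to find matching child rows by `run_uuid`.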

Signed-off-by: harupy <[email protected]>

dbczumar

LGTM! Thanks @harupy !

harupy · 2022-03-04T02:57:32Z

@dbczumar @BenWilson2 I checked how the migration changes indices for MySQL:

Before the migration:

mysql> SHOW INDEX FROM mlflowdb.params;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| params |          0 | PRIMARY  |            1 | key         | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          0 | PRIMARY  |            2 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          1 | run_uuid |            1 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

After the migration:

mysql> SHOW INDEX FROM mlflowdb.params;
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| params |          0 | PRIMARY               |            1 | key         | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          0 | PRIMARY               |            2 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          1 | index_params_run_uuid |            1 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

It looks like the migration renames the index run_uuid to index_params_run_uuid so there is no duplicate index.

dbczumar · 2022-03-04T03:30:15Z

Great!

Signed-off-by: harupy <[email protected]>

harupy · 2022-03-04T07:09:13Z

tests/store/tracking/test_sqlalchemy_store_schema.py

+    # `diff` contains several `remove_index` operations because `Base.metadata` does not contain
+    # index metadata but `mc` does. Note this doesn't mean the MLflow database is missing indexes
+    # as tested in `test_create_index_on_run_uuid`.
+    diff = [d for d in diff if d[0] != "remove_index"]
    assert len(diff) == 0


@dbczumar @BenWilson2 Added a workaround to make this test pass.
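
The effect of the workaround can be sketched in isolation. This example assumes the diff has the shape Alembic's `compare_metadata` returns, where each entry is a tuple whose first element names the operation; the entries themselves are made up, with plain strings standing in for the `Index` and `Column` objects a real diff would carry, and `drop_remove_index` is a hypothetical helper name.

```python
# Hypothetical diff entries; strings stand in for real Index/Column objects.
diff_indexes_only = [
    ("remove_index", "index_params_run_uuid"),
    ("remove_index", "index_metrics_run_uuid"),
    ("remove_index", "index_latest_metrics_run_uuid"),
    ("remove_index", "index_tags_run_uuid"),
]
# A diff that also contains a genuine schema difference (made up).
diff_with_real_drift = diff_indexes_only + [
    ("add_column", None, "params", "hypothetical_column"),
]


def drop_remove_index(diff):
    """Discard `remove_index` entries, keeping every other difference."""
    return [d for d in diff if d[0] != "remove_index"]


remaining_ok = drop_remove_index(diff_indexes_only)
remaining_drift = drop_remove_index(diff_with_real_drift)

print(remaining_ok)     # nothing left, so `len(diff) == 0` holds
print(remaining_drift)  # the non-index difference is still reported
```

The filter only hides index removals, so the schema test would still fail on any other mismatch between the models and the migrated database.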

Add migration script

52d86de

Signed-off-by: harupy <[email protected]>

github-actions bot added the rn/bug-fix label (Mention under Bug Fixes in Changelogs.) Mar 3, 2022

fix

9f2d217

Signed-off-by: harupy <[email protected]>

BenWilson2 approved these changes Mar 3, 2022

harupy requested a review from dbczumar March 3, 2022 06:10

harupy added 3 commits March 3, 2022 17:10

add sqlite

7da506d

Signed-off-by: harupy <[email protected]>

improve comment

19baacc

Signed-off-by: harupy <[email protected]>

apply migration to all databases

cbf7498

Signed-off-by: harupy <[email protected]>

dbczumar approved these changes Mar 4, 2022

harupy added 2 commits March 4, 2022 12:39

remove postgres

6ae95dd

Signed-off-by: harupy <[email protected]>

fix tests

81d1682

Signed-off-by: harupy <[email protected]>

harupy merged commit ab8ee45 into mlflow:master Mar 4, 2022

harupy deleted the add-index-postgres branch March 4, 2022 08:27

harupy changed the title ~~Create index on run_uuid columns for PostgreSQL to improve SQL operations~~ Create index on run_uuid columns to improve SQL operations Mar 7, 2022

harupy mentioned this pull request Mar 7, 2022

[BUG] http.client.RemoteDisconnected for get_metric_history #3334

Closed

23 tasks

harupy mentioned this pull request Jul 27, 2022

Add index for run UUID to sqlalchemy models #6347

Merged

31 tasks

harupy merged 7 commits into mlflow:master from harupy:add-index-postgres
