Skip to content

Create index on run_uuid columns to improve SQL operations#5443

Merged
harupy merged 7 commits intomlflow:masterfrom
harupy:add-index-postgres
Mar 4, 2022
Merged

Create index on run_uuid columns to improve SQL operations#5443
harupy merged 7 commits intomlflow:masterfrom
harupy:add-index-postgres

Conversation

@harupy
Copy link
Copy Markdown
Member

@harupy harupy commented Mar 3, 2022

Signed-off-by: harupy [email protected]

What changes are proposed in this pull request?

How is this patch tested?

Thank you @mberr for investigating the impact of creating an index: #3785 (comment)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: harupy <[email protected]>
@github-actions github-actions bot added the rn/bug-fix Mention under Bug Fixes in Changelogs. label Mar 3, 2022
Signed-off-by: harupy <[email protected]>
Comment on lines +25 to +26
for table in ["params", "metrics", "latest_metrics", "tags"]:
op.create_index(f"index_{table}_run_uuid", table, ["run_uuid"])
Copy link
Copy Markdown
Member Author

@harupy harupy Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are unidexed foreign keys (source: #3785 (comment)):

table columns size constraint referenced_table
runs experiment_id 8192 bytes runs_experiment_id_fkey experiments
✅ tags run_uuid 8192 bytes tags_run_uuid_fkey runs
✅ metrics run_uuid 0 bytes metrics_run_uuid_fkey runs
✅ params run_uuid 0 bytes params_run_uuid_fkey runs
experiment_tags experiment_id 0 bytes experiment_tags_experiment_id_fkey experiments
✅ latest_metrics run_uuid 0 bytes latest_metrics_run_uuid_fkey runs
registered_model_tags name 0 bytes registered_model_tags_name_fkey registered_models
model_version_tags name,version 0 bytes model_version_tags_name_version_fkey model_versions

This migration script updates rows marked with ✅.

Copy link
Copy Markdown
Member Author

@harupy harupy Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostgreSQL log looks like this:

...
db-postgres-1  | 2022-03-03 03:15:54.405 UTC [69] LOG:  statement: 
db-postgres-1  | 	CREATE TABLE model_version_tags (
db-postgres-1  | 		key VARCHAR(250) NOT NULL, 
db-postgres-1  | 		value VARCHAR(5000), 
db-postgres-1  | 		name VARCHAR(256) NOT NULL, 
db-postgres-1  | 		version INTEGER NOT NULL, 
db-postgres-1  | 		CONSTRAINT model_version_tag_pk PRIMARY KEY (key, name, version), 
db-postgres-1  | 		FOREIGN KEY(name, version) REFERENCES model_versions (name, version) ON UPDATE cascade
db-postgres-1  | 	)
db-postgres-1  | 	
db-postgres-1  | 	
db-postgres-1  | 2022-03-03 03:15:54.410 UTC [69] LOG:  statement: ALTER TABLE model_versions ADD COLUMN run_link VARCHAR(500)
db-postgres-1  | 2022-03-03 03:15:54.412 UTC [69] LOG:  statement: ALTER TABLE model_versions ALTER COLUMN run_id DROP NOT NULL
db-postgres-1  | 2022-03-03 03:15:54.415 UTC [69] LOG:  statement: ALTER TABLE latest_metrics ALTER COLUMN is_nan TYPE BOOLEAN 
db-postgres-1  | 2022-03-03 03:15:54.415 UTC [69] LOG:  statement: ALTER TABLE latest_metrics ALTER COLUMN is_nan SET NOT NULL
db-postgres-1  | 2022-03-03 03:15:54.416 UTC [69] LOG:  statement: ALTER TABLE metrics ALTER COLUMN is_nan TYPE BOOLEAN 
db-postgres-1  | 2022-03-03 03:15:54.417 UTC [69] LOG:  statement: ALTER TABLE metrics ALTER COLUMN is_nan SET NOT NULL
👇👇👇👇👇👇👇👇👇👇👇👇👇👇👇👇
db-postgres-1  | 2022-03-03 03:15:54.420 UTC [69] LOG:  statement: CREATE INDEX index_params_run_uuid ON params (run_uuid)
db-postgres-1  | 2022-03-03 03:15:54.423 UTC [69] LOG:  statement: CREATE INDEX index_metrics_run_uuid ON metrics (run_uuid)
db-postgres-1  | 2022-03-03 03:15:54.425 UTC [69] LOG:  statement: CREATE INDEX index_latest_metrics_run_uuid ON latest_metrics (run_uuid)
db-postgres-1  | 2022-03-03 03:15:54.428 UTC [69] LOG:  statement: CREATE INDEX index_tags_run_uuid ON tags (run_uuid)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add an index on other columns?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly seems reasonable to create an index on experiment_id since we do have a number of joins employing that and a restrictive filter query on updates to the name. @dbczumar thoughts?

Copy link
Copy Markdown
Member

@BenWilson2 BenWilson2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@harupy harupy requested a review from dbczumar March 3, 2022 06:10
# As a fix for https://github.com/mlflow/mlflow/issues/3785, create indexes on run_uuid columns
# to speed up SQL operations.
bind = op.get_bind()
if bind.engine.name == "postgresql":
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to add sqlite. https://www.sqlite.org/foreignkeys.html#fk_indexes says:

an index should be created on the child key columns of each foreign key constraint.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add this for all DBs, including mysql? Is this specific to Postgres?

harupy added 3 commits March 3, 2022 17:10
Signed-off-by: harupy <[email protected]>
Signed-off-by: harupy <[email protected]>
Copy link
Copy Markdown
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @harupy !

@harupy
Copy link
Copy Markdown
Member Author

harupy commented Mar 4, 2022

@dbczumar @BenWilson2 I checked how the migration changes indices for MySQL:

Before the migration:

mysql> SHOW INDEX FROM mlflowdb.params;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| params |          0 | PRIMARY  |            1 | key         | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          0 | PRIMARY  |            2 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          1 | run_uuid |            1 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

After the migration:

mysql> SHOW INDEX FROM mlflowdb.params;
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| params |          0 | PRIMARY               |            1 | key         | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          0 | PRIMARY               |            2 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          1 | index_params_run_uuid |            1 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

It looks like the migration renames the index run_uuid to index_params_run_uuid so there is no duplicate index.

@dbczumar
Copy link
Copy Markdown
Collaborator

dbczumar commented Mar 4, 2022

@dbczumar @BenWilson2 I checked how the migration changes indices for MySQL:

Before the migration:

mysql> SHOW INDEX FROM mlflowdb.params;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| params |          0 | PRIMARY  |            1 | key         | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          0 | PRIMARY  |            2 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          1 | run_uuid |            1 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

After the migration:

mysql> SHOW INDEX FROM mlflowdb.params;
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| params |          0 | PRIMARY               |            1 | key         | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          0 | PRIMARY               |            2 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| params |          1 | index_params_run_uuid |            1 | run_uuid    | A         |           1 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

It looks like the migration renames the index run_uuid to index_params_run_uuid so there is no duplicate index.

Great!

harupy added 2 commits March 4, 2022 12:39
Signed-off-by: harupy <[email protected]>
Signed-off-by: harupy <[email protected]>
Comment on lines +130 to 134
# `diff` contains several `remove_index` operations because `Base.metadata` does not contain
# index metadata but `mc` does. Note this doesn't mean the MLflow database is missing indexes
# as tested in `test_create_index_on_run_uuid`.
diff = [d for d in diff if d[0] != "remove_index"]
assert len(diff) == 0
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbczumar @BenWilson2 Added a workaround to make this test pass.

@harupy harupy merged commit ab8ee45 into mlflow:master Mar 4, 2022
@harupy harupy deleted the add-index-postgres branch March 4, 2022 08:27
@harupy harupy changed the title Create index on run_uuid columns for PostgreSQL to improve SQL operations Create index on run_uuid columns to improve SQL operations Mar 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rn/bug-fix Mention under Bug Fixes in Changelogs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] No index on foreign keys, in postgres store

3 participants