Create index on run_uuid columns to improve SQL operations#5443
Create index on run_uuid columns to improve SQL operations#5443harupy merged 7 commits intomlflow:masterfrom
run_uuid columns to improve SQL operations#5443Conversation
Signed-off-by: harupy <[email protected]>
Signed-off-by: harupy <[email protected]>
| for table in ["params", "metrics", "latest_metrics", "tags"]: | ||
| op.create_index(f"index_{table}_run_uuid", table, ["run_uuid"]) |
There was a problem hiding this comment.
These are unidexed foreign keys (source: #3785 (comment)):
| table | columns | size | constraint | referenced_table |
|---|---|---|---|---|
| runs | experiment_id | 8192 bytes | runs_experiment_id_fkey | experiments |
| ✅ tags | run_uuid | 8192 bytes | tags_run_uuid_fkey | runs |
| ✅ metrics | run_uuid | 0 bytes | metrics_run_uuid_fkey | runs |
| ✅ params | run_uuid | 0 bytes | params_run_uuid_fkey | runs |
| experiment_tags | experiment_id | 0 bytes | experiment_tags_experiment_id_fkey | experiments |
| ✅ latest_metrics | run_uuid | 0 bytes | latest_metrics_run_uuid_fkey | runs |
| registered_model_tags | name | 0 bytes | registered_model_tags_name_fkey | registered_models |
| model_version_tags | name,version | 0 bytes | model_version_tags_name_version_fkey | model_versions |
This migration script updates rows marked with ✅.
There was a problem hiding this comment.
PostgreSQL log looks like this:
...
db-postgres-1 | 2022-03-03 03:15:54.405 UTC [69] LOG: statement:
db-postgres-1 | CREATE TABLE model_version_tags (
db-postgres-1 | key VARCHAR(250) NOT NULL,
db-postgres-1 | value VARCHAR(5000),
db-postgres-1 | name VARCHAR(256) NOT NULL,
db-postgres-1 | version INTEGER NOT NULL,
db-postgres-1 | CONSTRAINT model_version_tag_pk PRIMARY KEY (key, name, version),
db-postgres-1 | FOREIGN KEY(name, version) REFERENCES model_versions (name, version) ON UPDATE cascade
db-postgres-1 | )
db-postgres-1 |
db-postgres-1 |
db-postgres-1 | 2022-03-03 03:15:54.410 UTC [69] LOG: statement: ALTER TABLE model_versions ADD COLUMN run_link VARCHAR(500)
db-postgres-1 | 2022-03-03 03:15:54.412 UTC [69] LOG: statement: ALTER TABLE model_versions ALTER COLUMN run_id DROP NOT NULL
db-postgres-1 | 2022-03-03 03:15:54.415 UTC [69] LOG: statement: ALTER TABLE latest_metrics ALTER COLUMN is_nan TYPE BOOLEAN
db-postgres-1 | 2022-03-03 03:15:54.415 UTC [69] LOG: statement: ALTER TABLE latest_metrics ALTER COLUMN is_nan SET NOT NULL
db-postgres-1 | 2022-03-03 03:15:54.416 UTC [69] LOG: statement: ALTER TABLE metrics ALTER COLUMN is_nan TYPE BOOLEAN
db-postgres-1 | 2022-03-03 03:15:54.417 UTC [69] LOG: statement: ALTER TABLE metrics ALTER COLUMN is_nan SET NOT NULL
👇👇👇👇👇👇👇👇👇👇👇👇👇👇👇👇
db-postgres-1 | 2022-03-03 03:15:54.420 UTC [69] LOG: statement: CREATE INDEX index_params_run_uuid ON params (run_uuid)
db-postgres-1 | 2022-03-03 03:15:54.423 UTC [69] LOG: statement: CREATE INDEX index_metrics_run_uuid ON metrics (run_uuid)
db-postgres-1 | 2022-03-03 03:15:54.425 UTC [69] LOG: statement: CREATE INDEX index_latest_metrics_run_uuid ON latest_metrics (run_uuid)
db-postgres-1 | 2022-03-03 03:15:54.428 UTC [69] LOG: statement: CREATE INDEX index_tags_run_uuid ON tags (run_uuid)
There was a problem hiding this comment.
Do we want to add an index on other columns?
There was a problem hiding this comment.
It certainly seems reasonable to create an index on experiment_id since we do have a number of joins employing that and a restrictive filter query on updates to the name. @dbczumar thoughts?
| # As a fix for https://github.com/mlflow/mlflow/issues/3785, create indexes on run_uuid columns | ||
| # to speed up SQL operations. | ||
| bind = op.get_bind() | ||
| if bind.engine.name == "postgresql": |
There was a problem hiding this comment.
We might need to add sqlite. https://www.sqlite.org/foreignkeys.html#fk_indexes says:
an index should be created on the child key columns of each foreign key constraint.
There was a problem hiding this comment.
Could we add this for all DBs, including mysql? Is this specific to Postgres?
Signed-off-by: harupy <[email protected]>
Signed-off-by: harupy <[email protected]>
Signed-off-by: harupy <[email protected]>
|
@dbczumar @BenWilson2 I checked how the migration changes indices for MySQL: Before the migration:After the migration:It looks like the migration renames the index |
Great! |
Signed-off-by: harupy <[email protected]>
Signed-off-by: harupy <[email protected]>
| # `diff` contains several `remove_index` operations because `Base.metadata` does not contain | ||
| # index metadata but `mc` does. Note this doesn't mean the MLflow database is missing indexes | ||
| # as tested in `test_create_index_on_run_uuid`. | ||
| diff = [d for d in diff if d[0] != "remove_index"] | ||
| assert len(diff) == 0 |
There was a problem hiding this comment.
@dbczumar @BenWilson2 Added a workaround to make this test pass.
run_uuid columns for PostgreSQL to improve SQL operationsrun_uuid columns to improve SQL operations
Signed-off-by: harupy [email protected]
What changes are proposed in this pull request?
run_uuidcolumns for PostgreSQL to improve SQL operations.How is this patch tested?
Thank you @mberr for investigating the impact of creating an index: #3785 (comment)
Does this PR change the documentation?
ci/circleci: build_doccheck. If it's successful, proceed to thenext step, otherwise fix it.
Detailson the right to open the job page of CircleCI.Artifactstab.docs/build/html/index.html.Release Notes
Is this a user-facing change?
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts: Artifact stores and artifact loggingarea/build: Build and test infrastructure for MLflowarea/docs: MLflow documentation pagesarea/examples: Example codearea/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registryarea/models: MLmodel format, model serialization/deserialization, flavorsarea/projects: MLproject format, project running backendsarea/scoring: MLflow Model server, model deployment tools, Spark UDFsarea/server-infra: MLflow Tracking server backendarea/tracking: Tracking Service, tracking client APIs, autologgingInterface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Modelsarea/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registryarea/windows: Windows supportLanguage
language/r: R APIs and clientslanguage/java: Java APIs and clientslanguage/new: Proposals for new client languagesIntegrations
integrations/azure: Azure and Azure ML integrationsintegrations/sagemaker: SageMaker integrationsintegrations/databricks: Databricks integrationsHow should the PR be classified in the release notes? Choose one:
rn/breaking-change- The PR will be mentioned in the "Breaking Changes" sectionrn/none- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/feature- A new user-facing feature worth mentioning in the release notesrn/bug-fix- A user-facing bug fix worth mentioning in the release notesrn/documentation- A user-facing documentation change worth mentioning in the release notes