Skip to content

Fix MySQL loadTables performance#6820

Closed
cx-ronit-steinberg wants to merge 51 commits intotypeorm:masterfrom
checkmarx-ltd:mysql-multi-tenancy-performance
Closed

Fix MySQL loadTables performance#6820
cx-ronit-steinberg wants to merge 51 commits intotypeorm:masterfrom
checkmarx-ltd:mysql-multi-tenancy-performance

Conversation

@cx-ronit-steinberg
Copy link
Contributor

@cx-ronit-steinberg cx-ronit-steinberg commented Oct 1, 2020

This change will fix a performance issue in MySQL loadTables, which is especially bad in multi-tenanted environments.

loadTables performs queries to INFORMATION_SCHEMA, which is not optimized.
MySQL recommends adding a where clause with the schema and table for better performance - https://dev.mysql.com/doc/refman/5.7/en/information-schema-optimization.html.

There are 2 queries with a JOIN that have a where clause on the schema and table, but the where clause is only for one of the tables in the join.

For large databases with many schemas, this can be a crucial improvement.
In our database, for example, it improved each query from about 7.5 seconds to around 200 milliseconds.

@imnotjames
Copy link
Contributor

Is it possible to split these two changes apart? The performance vs the FK query check?

It's nicer to have that from a changelog perspective & so if we have to roll things back we can isolate what we're rolling back.

@cx-ronit-steinberg
Copy link
Contributor Author

cx-ronit-steinberg commented Oct 2, 2020

@imnotjames Sure, I split only the foreign key fix into another PR

@imnotjames
Copy link
Contributor

imnotjames commented Oct 4, 2020

Is the subquery needed if we're already doing this in the where clause in the main query?

Does the EXPLAIN show this problem? If so, can you post them as a reply - just for documentation purposes if we ever come back to the change via a git blame, get to this PR & wonder "huh - what was this trying to fix exactly?"

Copy link
Contributor

@imnotjames imnotjames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the test for this a duplicate of the one for 6168?

@cx-ronit-steinberg
Copy link
Contributor Author

cx-ronit-steinberg commented Oct 5, 2020

The subquery is needed for the indices because it's a left join. If it were an inner join, like the foreign keys, then we could use the main where clause, but since it's a left join we also need in the results the rows where REFERENTIAL_CONSTRAINTS columns are null. I also tried adding it in the main where clause with an OR IS NULL, but that doesn't get the performance improvement.

The explain shows the problem. For example for the foreign key query it has 2 rows, one for kcu and one for rc.
For kcu we get: Using where; Open_full_table; Scanned 0 databases
For rc we get: Using where; Open_full_table; **Scanned all databases**; Using join buffer (Block Nested Loop)
After adding the where clause for the table, we get: Using where; Open_full_table; **Scanned 1 database**; Using join buffer (Block Nested Loop)
I'll add screenshots of both explains

@cx-ronit-steinberg
Copy link
Contributor Author

cx-ronit-steinberg commented Oct 5, 2020

The test isn't exactly a duplicate, because the 6168 PR only touched the foreign keys, so I only tested the foreign keys there. Here I also added index checks to see they're unaffected by the change.
For example the index checks will fail if we add to the main where clause instead of a subquery, because no indices will be found.

@cx-ronit-steinberg
Copy link
Contributor Author

Before - scanned all databases:
Before - scanned all databases

After - scanned 1 database:
After - scanned 1 database

@cx-ronit-steinberg
Copy link
Contributor Author

@imnotjames @pleerock Hi, I noticed you merged this refactor yesterday.
I think adding the condition in the ON will make it impossible to have a performance improvement, even with the extra conditions in the WHERE clause (not to mention they're duplicate of one another).

See explain only with the WHERE clause conditions (rc scanned 0 databases):
Screen Shot 2020-10-06 at 9 36 42

Explain with both WHERE clause and ON conditions (rc scanned all databases):
Screen Shot 2020-10-06 at 9 36 19

However I did just notice that I used KEY_COLUMN_USAGE's columns TABLE_SCHEMA and CONSTRAINT_SCHEMA interchangeably, which is a possible bug. If TABLE_SCHEMA !== CONSTRAINT_SCHEMA then it's wrong to add the rc constraints inside the foreignKeysCondition like I did here. So I'm not sure how it's possible to add this performance improvement...

@imnotjames imnotjames self-requested a review October 9, 2020 09:36
@imnotjames
Copy link
Contributor

imnotjames commented Oct 11, 2020

I'm finding that having an OR in the WHERE clause at all forces a data directory scan. This is even with a simpler query like:

        SELECT
            *
        FROM `INFORMATION_SCHEMA`.`KEY_COLUMN_USAGE` `kcu`
        WHERE
            (`kcu`.`TABLE_SCHEMA` = 'test' AND `kcu`.`TABLE_NAME` = 'question')
           OR
            (`kcu`.`TABLE_SCHEMA` = 'test' AND `kcu`.`TABLE_NAME` = 'category')

Causes:

Using where; Open_full_table; Scanned all databases

However,

EXPLAIN SELECT
  *
FROM `INFORMATION_SCHEMA`.`KEY_COLUMN_USAGE` `kcu`
WHERE
    `kcu`.`TABLE_SCHEMA` = 'test' AND `kcu`.`TABLE_NAME` = 'question'

Leaves me with

Using where; Open_full_table; Scanned 0 databases

Hypothesis: It might be faster to either run multiple queries - one per database & schema - or to construct a nightmarish frankenstein-query that unions a bunch of these together with single lookups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants