Skip to content

Improve SQL query performance in Indexable_Post_Indexation_Action#20215

Closed
av3nger wants to merge 1 commit intoYoast:trunkfrom
av3nger:feature/improve-sql-query-performance
Closed

Improve SQL query performance in Indexable_Post_Indexation_Action#20215
av3nger wants to merge 1 commit intoYoast:trunkfrom
av3nger:feature/improve-sql-query-performance

Conversation

@av3nger
Copy link
Copy Markdown

@av3nger av3nger commented Apr 26, 2023

Context

Improve performance of Yoast on wp-admin dashboard, when working with indexables, on sites with large number of posts.

Summary

The following query in get_select_query() method in the Indexable_Post_Indexation_Action class can be improved:

return $this->wpdb->prepare(
	"
	SELECT P.ID
	FROM {$this->wpdb->posts} AS P
	WHERE P.post_type IN (" . \implode( ', ', \array_fill( 0, \count( $post_types ), '%s' ) ) . ')
	AND P.post_status NOT IN (' . \implode( ', ', \array_fill( 0, \count( $excluded_post_statuses ), '%s' ) ) . ")
	AND P.ID not in (
		SELECT I.object_id from $indexable_table as I
		WHERE I.object_type = 'post'
		AND I.version = %d )
	$limit_query",
	$replacements
);

When running EXPLAIN on the query, the subquery goes over all the rows, which makes this significantly slower on larger data sets:

EXPLAIN results before

If the query is changed to:

return $this->wpdb->prepare(
	"
	SELECT P.ID
	FROM {$this->wpdb->posts} AS P
	LEFT JOIN $indexable_table AS I
		ON P.ID = I.object_id
		AND I.object_type = 'post'
		AND I.version = %d
	WHERE P.post_type IN (" . \implode( ', ', \array_fill( 0, \count( $post_types ), '%s' ) ) . ')
		AND P.post_status NOT IN (' . \implode( ', ', \array_fill( 0, \count( $excluded_post_statuses ), '%s' ) ) . ")
		AND I.object_id IS NULL
	$limit_query",
	$replacements
);

we can simplify the query and get rid of the scan across the whole table:

EXPLAIN results after

This PR can be summarized in the following changelog entry:

  • Improve SQL performance on wp-admin

Test instructions for QA when the code is in the RC

Make sure that the output results from old and new queries match.

Impact check

This PR affects how unindexed posts are selected.

Quality assurance

  • I have tested this code to the best of my abilities.
  • During testing, I had activated all plugins Yoast SEO provides integrations for.
  • I have added unit tests to verify the code works as intended.
  • If any part of the code is behind a feature flag, my test instructions also cover cases where the feature flag is switched off.
  • I have written this PR in accordance with my team's definition of done.

Innovation

  • No innovation project is applicable for this PR.
  • This PR falls under an innovation project. I have attached the innovation label and noted the work hours.

@leonidasmi
Copy link
Copy Markdown
Contributor

@av3nger We have created internally a task to assess this and added it on our board.

As per our processes, we will open a separate PR (that will link to this one) in order to tweak your solution and potentially merge that instead, but if we do we will give you props in our changelog.

@leonidasmi
Copy link
Copy Markdown
Contributor

I've also added a fix regarding using the new order of placeholders because otherwise the resulting queries would get malformed :)

@leonidasmi
Copy link
Copy Markdown
Contributor

@av3nger I have been investigating your PR and I'm actually not seeing any noticeable improvements when benchmarking both solutions 🤔 In most setups I actually see a tiny regression even, in terms of average time for the new query to complete, compared to the production one.

This is my methodology for benchmarking the queries btw:

  • Add the following snippet in a mu-plugin:
add_filter( 'shutdown', 'benchmark_yoast_code', 10 );
function benchmark_yoast_code() {
    if ( ! get_option( 'benchmark_ran' ) ) {
        update_option( 'benchmark_ran', true );
        $indexation_action = YoastSEO()->classes->get( Yoast\WP\SEO\Actions\Indexing\Indexable_Post_Indexation_Action::class );
        // Define the number of times the function should be executed
        $numRuns = 100;
        
        // Start the benchmark
        $start = microtime(true);
        
        // Execute the function multiple times
        for ($i = 0; $i < $numRuns; $i++) {
            delete_transient( 'wpseo_total_unindexed_posts' );
            delete_transient( 'wpseo_total_unindexed_posts_limited' );
            $indexation_action->get_limited_unindexed_count(25);
        }
        
        // End the benchmark
        $end = microtime(true);
        
        // Calculate the elapsed time
        $elapsed = $end - $start;
        
        error_log( "Elapsed time: " . $elapsed . " seconds\n" );
        error_log( "Average time per run: " . ($elapsed / $numRuns) . " seconds\n" );
    }
}
  • Every time you want to benchmark again, you should delete the benchmark_ran option from your db (I added that to avoid multiple runs)

If you try the same methodology and get opposite results, I'd be interested to look further on any potential differences our setups might have. (you do have to apply this fix on your PR as well first though 🙂 )

@av3nger
Copy link
Copy Markdown
Author

av3nger commented May 2, 2023

@leonidasmi, what is the select_type for the second query set to on your setup? Is it SUBQUERY or something else? If the subquery is being cached, then, you are right, the difference is not noticeable. However, if the second query is executed as a SUBQUERY then this could become a potential bottleneck, when we have a large number of posts, especially when the db cache hits limits. Also, won't the same query be cached on the SQL server, so running this 100 times will not really be helpful without flushing the caches on SQL?

@leonidasmi
Copy link
Copy Markdown
Contributor

@av3nger I'm getting this for the original query:
image

and that is on a website with ~48k posts. And when I compare it with this PR's query:
image

I'm noticing that the original query is slightly more performant in that site too. And that can be explained by the ref column, as the first query has a const there compared to the new query that is not completely constant.

@av3nger av3nger closed this by deleting the head repository Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants