Each purge batch and associated records are removed atomically #23381
Conversation
# Purge each batch with their associated records
# in a transaction to avoid leaving orphaned references.
transaction do
app/models/mixins/purging_mixin.rb (Outdated)
  batch_ids = query
end

unless batch_ids.empty?
Convert an early break into a simple conditional.
This might be a problem... in the case of the else condition, we don't want it to execute the query, but instead stay as a scope.
Ah, good eye. I wanted to test that else condition.
break if count == 0

total += count

purge_associated_records(batch_ids) if respond_to?(:purge_associated_records)
I wonder if we can move this line up under `count = unscoped.where(:id => batch_ids).delete_all` and run it regardless of the count that's returned. Then you can end the transaction at that point?
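To make the suggested restructure concrete, here is a minimal plain-Ruby simulation of that loop shape, with arrays standing in for the parent and associated tables (none of these names are the real PurgingMixin API, and the transaction boundary is only indicated in a comment):

```ruby
# Stand-ins for the parent (miq_report_results) and associated tables.
PARENTS  = (1..250).to_a
CHILDREN = PARENTS.dup
BATCH = 100

def delete_batch(ids)
  before = PARENTS.size
  PARENTS.reject! { |id| ids.include?(id) }
  before - PARENTS.size # mimic delete_all returning a row count
end

def purge_associated_records(ids)
  CHILDREN.reject! { |id| ids.include?(id) }
end

total = 0
loop do
  batch_ids = PARENTS.first(BATCH)
  # A transaction would wrap the next two calls; the associated records
  # are purged right after the parent delete, regardless of the count.
  count = delete_batch(batch_ids)
  purge_associated_records(batch_ids)
  total += count
  break if count == 0 # early break kept outside the "transaction" body
end

puts total # => 250
```

The early `break` still terminates the loop, but only after the batch's delete and associated-record purge have both run, so no batch can be left half-done.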
Yeah, I can try it. I was debating removing the guard/conditions entirely: `count == 0` for `purge_associated_records`, and also `batch_ids.empty?`. I honestly am not sure how `batch_ids.empty?` is ever true. Is it only there to guard against running one transaction of this, which does nothing?

count = unscoped.where(:id => batch_ids).delete_all
purge_associated_records(batch_ids) if respond_to?(:purge_associated_records)

It feels like it would be safer to do one run of those two lines even if nothing is done, to avoid these extra conditionals.
tldr; I think it's ok to remove those conditionals.
tldr; + 1: Ok, so it's just a wasted set of queries in the situation in which nothing is purged. I think that's fine to eliminate some complexity for an edge case that should never be a performance issue.
To recreate it, I asked it to purge_by_remaining with 10 when I had 6 report results, meaning nothing would be deleted.
Ok, with the guard clause to avoid `batch_ids.empty?`, it does something like this:
==> log/evm.log <==
[----] I, [2025-03-17T18:00:08.774832#17137:35d4] INFO -- evm: MIQ(MiqReportResult.purge_by_remaining) Purging Miq report results older than last 10 results...
==> log/test.log <==
[----] D, [2025-03-17T18:00:08.775384#17137:35d4] DEBUG -- test: TRANSACTION (0.1ms) SAVEPOINT active_record_1
[----] D, [2025-03-17T18:00:08.776041#17137:35d4] DEBUG -- test: MiqReportResult Pluck (0.7ms) SELECT "id" FROM (SELECT "id", RANK() OVER(
PARTITION BY "miq_report_id"
ORDER BY "id" DESC
) AS "rank"
FROM "miq_report_results") miq_report_results WHERE (rank > 10) LIMIT $1 [["LIMIT", 100]]
[----] D, [2025-03-17T18:00:08.776164#17137:35d4] DEBUG -- test: TRANSACTION (0.1ms) RELEASE SAVEPOINT active_record_1
==> log/evm.log <==
[----] I, [2025-03-17T18:00:08.776199#17137:35d4] INFO -- evm: MIQ(MiqReportResult.purge_by_remaining) Purging Miq report results older than last 10 results...Complete - Deleted 0 records
Now, in the situation in which you will not be purging anything anyway, removing the conditional does this:
[----] I, [2025-03-17T17:57:54.677557#16882:35d4] INFO -- evm: MIQ(MiqReportResult.purge_by_remaining) Purging Miq report results older than last 10 results...
==> log/test.log <==
[----] D, [2025-03-17T17:57:54.678205#16882:35d4] DEBUG -- test: TRANSACTION (0.0ms) SAVEPOINT active_record_1
[----] D, [2025-03-17T17:57:54.678472#16882:35d4] DEBUG -- test: MiqReportResult Pluck (0.3ms) SELECT "id" FROM (SELECT "id", RANK() OVER(
PARTITION BY "miq_report_id"
ORDER BY "id" DESC
) AS "rank"
FROM "miq_report_results") miq_report_results WHERE (rank > 10) LIMIT $1 [["LIMIT", 100]]
==> log/evm.log <==
[----] I, [2025-03-17T17:57:54.678526#16882:35d4] INFO -- evm: MIQ(MiqReportResult.purge_in_batches) Purging 0 Miq report results.
==> log/test.log <==
[----] D, [2025-03-17T17:57:54.678733#16882:35d4] DEBUG -- test: MiqReportResult Delete All (0.1ms) DELETE FROM "miq_report_results" WHERE 1=0
[----] D, [2025-03-17T17:57:54.678892#16882:35d4] DEBUG -- test: MiqReportResultDetail Delete All (0.1ms) DELETE FROM "miq_report_result_details" WHERE 1=0
[----] D, [2025-03-17T17:57:54.679013#16882:35d4] DEBUG -- test: TRANSACTION (0.0ms) RELEASE SAVEPOINT active_record_1
==> log/evm.log <==
[----] I, [2025-03-17T17:57:54.679045#16882:35d4] INFO -- evm: MIQ(MiqReportResult.purge_by_remaining) Purging Miq report results older than last 10 results...Complete - Deleted 0 records
I used this to test that... (ignoring whitespace)
diff --git a/app/models/mixins/purging_mixin.rb b/app/models/mixins/purging_mixin.rb
index cb56d52f4d..96a41d6274 100644
--- a/app/models/mixins/purging_mixin.rb
+++ b/app/models/mixins/purging_mixin.rb
@@ -209,16 +209,19 @@ module PurgingMixin
# pull back ids - will slow performance
batch_ids = query.pluck(:id)
current_window = batch_ids.size
+ exit_early = true if batch_ids.empty?
else
batch_ids = query
end
+ unless exit_early
_log.info("Purging #{current_window} #{table_name.humanize}.")
count = unscoped.where(:id => batch_ids).delete_all
purge_associated_records(batch_ids) if respond_to?(:purge_associated_records)
total += count
end
+ end
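The control flow in that diff can be simulated in a few lines of plain Ruby; here `purge_once` fakes the delete with `Array#size`, and the point is only that an empty batch sets a flag and skips the purge body rather than breaking out of the transaction early:

```ruby
# Simulation of the diff's exit_early flag: an empty batch skips the
# purge body instead of breaking out of the transaction mid-way.
def purge_once(batch_ids)
  exit_early = true if batch_ids.empty?
  count = 0
  unless exit_early
    # stand-in for: unscoped.where(:id => batch_ids).delete_all
    # followed by purge_associated_records(batch_ids)
    count = batch_ids.size
  end
  count
end

puts purge_once([1, 2, 3]) # => 3
puts purge_once([])        # => 0
```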
Ok, this is ready to go I think.
Force-pushed from ee0846f to a783168
We wrap each batch deletion in a transaction, ensuring the whole batch is deleted together or not at all. Note, any early breaks out of the loop needed to be removed or moved outside of the transaction so that the transaction completes successfully, deleting the whole batch or nothing at all. In a demo environment database, we found some miq_report_result_details and binary_blobs were left orphaned, as they referenced miq_report_result_id values that no longer existed. The provided spec demonstrates the problem, but it likely occurred in production whenever a timeout happened anywhere in purge_associated_records.
Force-pushed from a783168 to 1afea62
Closing for now. After discussing this with @kbrock, he had a clever idea that really addresses our concerns but without using a transaction: purge the associated tables before the parent table. In this way, we can continue to purge tables each run through the purge timer and will only remove from the parent table once all the associated records are removed.
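A rough sketch of that children-first ordering, again with plain arrays standing in for the tables (illustrative names only, not the actual PurgingMixin code). Because the associated rows go first, an interrupted run can leave extra children behind but never orphans, and the next run simply continues the cleanup:

```ruby
parents  = (1..120).to_a # stand-in for the parent table's ids
children = parents.dup   # stand-in for the associated rows

batches = parents.each_slice(50).to_a
total = 0

batches.each do |batch_ids|
  # 1. purge the associated rows for this batch first
  children.reject! { |id| batch_ids.include?(id) }

  # 2. only then remove the parent rows, so a crash between these two
  #    steps leaves stale children (cleaned up next run), never orphans
  parents.reject! { |id| batch_ids.include?(id) }
  total += batch_ids.size
end

puts total # => 120
```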
Alternative to ManageIQ#23381, where each purge batch was done in a transaction. Actually, we only care about eventual consistency, so we can purge the associated records first, and if that completes, we purge the parent table's rows for that batch. This allows us to deal with timeouts or other errors by continuing a prior run by cleaning up associated rows and only removing the parent when there are no more associated rows in other tables that remain. Note, there is no recourse for delete_all that has a failure as it's a vanilla DELETE FROM where ... so it's very unlikely to fail for anything we care about. Conversely, destroy_all instantiates objects, each of which could have callbacks that fail. Additionally, each instantiation takes time and increases the likelihood that if a timeout occurs, it would be in this area. Therefore, we verify the rows are actually destroyed? and bail if not to avoid deleting the parent and leaving orphans in other tables.
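The destroy_all safeguard can be illustrated with a tiny stand-in class instead of ActiveRecord (all names here are hypothetical): `destroy` may be vetoed, as a failing callback would do, and we check `destroyed?` on every child before allowing the parent delete.

```ruby
# Minimal stand-in for an AR object whose callbacks may veto destruction.
class FakeChild
  def initialize(destroyable)
    @destroyable = destroyable
    @destroyed = false
  end

  def destroy
    @destroyed = true if @destroyable # a failing callback leaves this false
    self
  end

  def destroyed?
    @destroyed
  end
end

def delete_parents_if_children_gone(children, parent_ids)
  destroyed = children.map(&:destroy)
  # Bail without touching the parent table if any child survived, so we
  # never delete a parent and leave orphans in other tables.
  return 0 unless destroyed.all?(&:destroyed?)
  parent_ids.size # stand-in for the parent delete_all row count
end

clean = Array.new(3) { FakeChild.new(true) }
stuck = [FakeChild.new(true), FakeChild.new(false)]

puts delete_parents_if_children_gone(clean, [1, 2, 3]) # => 3
puts delete_parents_if_children_gone(stuck, [4, 5])    # => 0
```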