
Conversation

@JoaoJandre
Contributor

Description

ACS has a job that periodically removes records from the vm_stats table that are older than the retention time (setting vm.stats.max.retention.time). In environments that collect a large volume of statistics and use a long retention time, shortening the retention time can cause the table cleanup to fail: deleting the resulting huge number of records in a single query leads to a query timeout.

The vm.stats.remove.batch.size configuration has been added to define the batch size of statistics removal queries. This way, instead of trying to remove all records in a single query, ACS removes them in batches, with the batch size defined by the vm.stats.remove.batch.size setting. The default value is -1, meaning no limit is applied; thus, the current behavior is maintained.
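For illustration, the batching boils down to repeating a LIMIT-ed DELETE until a batch comes up short. Below is a minimal standalone sketch of that loop; the JDBC plumbing, connection URL, credentials, and class/variable names are illustrative assumptions, not the actual patch (which lives in VmStatsDaoImpl and GenericDaoBase):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;

public class BatchedVmStatsCleanup {
    public static void main(String[] args) throws Exception {
        int batchSize = 100_000; // value of vm.stats.remove.batch.size; -1 keeps the old single-query behavior
        Instant cutoff = Instant.now().minusSeconds(60); // rows older than vm.stats.max.retention.time
        String sql = batchSize > 0
                ? "DELETE FROM vm_stats WHERE timestamp < ? LIMIT " + batchSize
                : "DELETE FROM vm_stats WHERE timestamp < ?";
        // Connection details below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/cloud", "cloud", "password")) {
            int deleted;
            do {
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setTimestamp(1, Timestamp.from(cutoff));
                    deleted = ps.executeUpdate(); // each batch is its own short statement
                }
            } while (batchSize > 0 && deleted == batchSize); // a short batch means the table is clean
        }
    }
}
```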

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

In a local lab, I artificially generated 60 million rows on vm_stats and set the retention time to one minute.

Before the change, the delete job would time out and no rows would be removed.
After applying the change and setting vm.stats.remove.batch.size to 100000, the table was successfully cleaned. At that batch size, the 60 million rows are removed in roughly 600 small delete queries, each fast enough to finish well within the timeout.
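For reference, one quick way to produce a comparable dataset is to seed one row and double the table repeatedly (2^26 ≈ 67 million rows). This is only a sketch: the vm_stats column names are assumed from the 4.17+ schema and the connection details are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VmStatsLoadGenerator {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/cloud", "cloud", "password");
             Statement st = conn.createStatement()) {
            // Seed a single row (column names assumed; check your vm_stats schema).
            st.executeUpdate("INSERT INTO vm_stats (vm_id, mgmt_server_id, timestamp, vm_stats_data) "
                    + "VALUES (1, 1, NOW(), '{}')");
            // Each pass doubles the row count; 26 doublings give 2^26 ≈ 67 million rows.
            for (int i = 0; i < 26; i++) {
                st.executeUpdate("INSERT INTO vm_stats (vm_id, mgmt_server_id, timestamp, vm_stats_data) "
                        + "SELECT vm_id, mgmt_server_id, timestamp, vm_stats_data FROM vm_stats");
            }
        }
    }
}
```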

@codecov

codecov bot commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 9.52381% with 19 lines in your changes missing coverage. Please review.

Project coverage is 14.95%. Comparing base (00fe25a) to head (d401488).

Files Patch % Lines
...src/main/java/com/cloud/vm/dao/VmStatsDaoImpl.java 0.00% 12 Missing ⚠️
...c/main/java/com/cloud/utils/db/GenericDaoBase.java 0.00% 6 Missing ⚠️
...src/main/java/com/cloud/server/StatsCollector.java 66.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #8740      +/-   ##
============================================
- Coverage     14.95%   14.95%   -0.01%     
+ Complexity    11017    11014       -3     
============================================
  Files          5378     5378              
  Lines        469935   469951      +16     
  Branches      59493    58858     -635     
============================================
- Hits          70301    70294       -7     
- Misses       391847   391871      +24     
+ Partials       7787     7786       -1     
Flag Coverage Δ
uitests 4.29% <ø> (ø)
unittests 15.66% <9.52%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Contributor

@FelipeM525 left a comment


CLGTM!

Contributor

@BryanMLima left a comment


CLGTM

@JoaoJandre
Contributor Author

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9623

@JoaoJandre
Contributor Author

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9730

@JoaoJandre
Contributor Author

@sureshanaparti @DaanHoogland could we run the CI here?

@weizhouapache
Member

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-10312)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 44434 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8740-t10312-kvm-rocky8.zip
Smoke tests completed. 131 look OK, 0 have errors, 0 did not run
Only failed and skipped test results are shown below:

Test Result Time (s) Test File

@JoaoJandre
Contributor Author

@sureshanaparti I think we are ready to merge this one

Member

@weizhouapache left a comment


@JoaoJandre
It seems Vishesh had some concerns, not sure if they are addressed

@JoaoJandre
Contributor Author

> @JoaoJandre It seems Vishesh had some concerns, not sure if they are addressed

@weizhouapache I answered @vishesh92's concerns a few months ago; as he did not respond further, I assumed he was OK with my reasoning.

@vishesh92
Member

> @JoaoJandre It seems Vishesh had some concerns, not sure if they are addressed

My only concern is the increasing number of global settings. I would personally prefer a generic global setting that can be reused for multiple tables instead of a table-specific one. I don't see any benefit in having a separate deletion setting per table.
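For concreteness, the two alternatives would look roughly like this as global-setting declarations; the ConfigKey constructor arguments below follow the usual (category, type, name, default, description, isDynamic) pattern but are an assumption, not code copied from the patch:

```java
import org.apache.cloudstack.framework.config.ConfigKey;

public class BatchSizeSettingsSketch {
    // Table-specific setting, as implemented in this PR:
    public static final ConfigKey<Integer> VM_STATS_REMOVE_BATCH_SIZE = new ConfigKey<>(
            "Advanced", Integer.class, "vm.stats.remove.batch.size", "-1",
            "Batch size for vm_stats cleanup; -1 removes all expired rows in a single query.", true);

    // Generic variant suggested above, reusable by any table's cleanup job:
    public static final ConfigKey<Integer> DELETE_BATCH_QUERY_SIZE = new ConfigKey<>(
            "Advanced", Integer.class, "delete.batch.query.size", "-1",
            "Batch size for record-deletion queries across tables; -1 disables batching.", true);
}
```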

@DaanHoogland
Contributor

@vishesh92, is this something that would block this PR, or can it be added as a future improvement? cc @JoaoJandre.
In any case, do you agree on a way forward?

@JoaoJandre
Contributor Author

@vishesh92 @DaanHoogland I agree that having a configuration for each table would be too much. However, the configuration that @vishesh92 proposed to reuse (detail.batch.query.size) is meant for listing, not removal; thus, I would not use it for the purpose of this PR.

As this is a broader discussion, I think we should move it somewhere else to agree on which configurations should exist; then, when/if something is decided, we can create a PR to normalize all the required tables. I feel this is out of scope for this PR, as we would have to change a lot of code unrelated to the vm_stats table and metrics deletion, which are the focus of this PR.

@vishesh92
Member

vishesh92 commented Jun 11, 2024

> @vishesh92 @DaanHoogland I agree that having a configuration for each table would be too much. However, the configuration that @vishesh92 proposed to reuse (detail.batch.query.size) is meant for listing, not removal; thus, I would not use it for the purpose of this PR.
>
> As this is a broader discussion, I think we should move it somewhere else to agree on which configurations should exist; then, when/if something is decided, we can create a PR to normalize all the required tables. I feel this is out of scope for this PR, as we would have to change a lot of code unrelated to the vm_stats table and metrics deletion, which are the focus of this PR.

@JoaoJandre The PR looks good. Can we rename the global setting to something more generic, like delete.batch.query.size, for deletions across tables? If not, what is our way forward when we have a lot of these?

Also, the Build GitHub Action is failing due to test failures. Please check.

@JoaoJandre
Contributor Author

> > @vishesh92 @DaanHoogland I agree that having a configuration for each table would be too much. However, the configuration that @vishesh92 proposed to reuse (detail.batch.query.size) is meant for listing, not removal; thus, I would not use it for the purpose of this PR.
> > As this is a broader discussion, I think we should move it somewhere else to agree on which configurations should exist; then, when/if something is decided, we can create a PR to normalize all the required tables. I feel this is out of scope for this PR, as we would have to change a lot of code unrelated to the vm_stats table and metrics deletion, which are the focus of this PR.
>
> @JoaoJandre The PR looks good. Can we rename the global setting to something more generic, like delete.batch.query.size, for deletions across tables? If not, what is our way forward when we have a lot of these?
>
> Also, the Build GitHub Action is failing due to test failures. Please check.

As I said in my previous message, this PR focuses only on the vm_stats table, as it is the one that caused issues. Creating a generic configuration is out of scope for this PR. If we face a situation where we need lots of these configurations, we can discuss a generic one; for now, this is the only one that exists.

Also, the tests should be fixed now :)

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre
Contributor Author

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 10000

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10009

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Contributor

@vishesh92 @JoaoJandre can you guys agree? Personally, I have a slight preference to start out with settings as generic as we can and specialise as we go. But I cannot be bothered much.

@blueorangutan

[SF] Trillian test result (tid-10506)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43116 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8740-t10506-kvm-centos7.zip
Smoke tests completed. 130 look OK, 1 have errors, 0 did not run
Only failed and skipped test results are shown below:

Test Result Time (s) Test File
test_02_trigger_shutdown Failure 341.62 test_safe_shutdown.py

@JoaoJandre
Contributor Author

> @vishesh92 @JoaoJandre can you guys agree? Personally, I have a slight preference to start out with settings as generic as we can and specialise as we go. But I cannot be bothered much.

Creating a generic config would significantly broaden the scope of this PR.

I could just rename the config that I created; however, as I won't touch other tables in this PR, it would be misleading to have a configuration with a generic name that only affects a single table. Thus, I would rather we keep this name, as it is more intuitive and descriptive.

A future PR that wants to do something similar for other tables could change this configuration's name/description to be more generic and reuse it, if its author wishes.

@vishesh92
Member

Let's merge this PR for now, and create a follow-up PR later to make the configuration setting generic.
