Skip to content

Conversation

@yadvr
Copy link
Member

@yadvr yadvr commented Jun 10, 2024

The user_vm_view can end up not picking the right index to join against the user_ip_address table causing full table scan on the user_ip_address table. This could be related to a MySQL bug
https://bugs.mysql.com/bug.php?id=41220

In a test environment with 20k shared networks and over 20M IPs, the listVirtualMachines API was found to take over 17s to return list of just 10 VMs. However, with this fix it would now take under 200ms to return the list. (MySQL Server: 8.0.36)

MySQL slow query logging showed ~nearly 20M table scans of the IP address table:

# User@Host: cloud[cloud] @ localhost [127.0.0.1]  Id:    39
# Query_time: 8.227541  Lock_time: 0.000014 Rows_sent: 12  Rows_examined: 19,667,235
SET timestamp=1715410270;
SELECT user_vm_view.id, user_vm_view.name /*snipped*/ FROM user_vm_view
WHERE user_vm_view.id IN (4,6,7,8,9,10,11,12,13,14,15,16);

Visual explain:
328905494-72e44291-a657-49da-adcd-5803a2fa91f9

This may not be worse than what it already is, I'm open to more ideas and suggestions.

Addresses #7910

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

…sting

The user_vm_view can end up not picking the right index to join against
the user_ip_address table causing full table scan on the user_ip_address
table. This could be related to a MySQL bug
https://bugs.mysql.com/bug.php?id=41220

In a test environment with 20k shared networks and over 20M IPs, the
listVirtualMachines API was found to take over 17s to return list of
just 10 VMs. However, with this fix it would now take under 200ms to
return the list.

MySQL slow query logging showed ~nearly 20M table scans of the IP
address table:
```
SET timestamp=1715410270;
SELECT user_vm_view.id, user_vm_view.name /*snipped*/ FROM user_vm_view
WHERE user_vm_view.id IN (4,6,7,8,9,10,11,12,13,14,15,16);
```

Signed-off-by: Rohit Yadav <[email protected]>
Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code lgtm

@yadvr
Copy link
Member Author

yadvr commented Jun 10, 2024

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link

codecov bot commented Jun 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.30%. Comparing base (3f2761e) to head (5ff6e6f).

Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #9198       +/-   ##
============================================
- Coverage     14.96%   4.30%   -10.66%     
============================================
  Files          5373     363     -5010     
  Lines        469248   29302   -439946     
  Branches      58782    5116    -53666     
============================================
- Hits          70210    1261    -68949     
+ Misses       391266   27898   -363368     
+ Partials       7772     143     -7629     
Flag Coverage Δ
uitests 4.30% <ø> (ø)
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9841

@yadvr
Copy link
Member Author

yadvr commented Jun 10, 2024

@blueorangutan test

@blueorangutan
Copy link

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-10407)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42731 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9198-t10407-kvm-centos7.zip
Smoke tests completed. 131 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@yadvr
Copy link
Member Author

yadvr commented Jun 12, 2024

Tested locally on a 4.19/KVM env homelab with 3 hosts and less than 20 user VMs (across accounts & projects), with this changes my own listing of 10 VMs reduces from about 1s to under 50ms.

@yadvr
Copy link
Member Author

yadvr commented Jun 12, 2024

cc @NuxRo - if you have any insights from a production env (user/customer)? Thanks.

@yadvr
Copy link
Member Author

yadvr commented Jun 13, 2024

Tested this on a 4.18 and 4.19 env; with large number of simulator hosts and networks in one env; and in case of my KVM-based homelab with MySQL 8.0.36 on Ubuntu 22.04.

After the view/change was applied, this is the explain analyze:

explain analyze select * from user_vm_view where id in (1,2,3,4,5);
...
| -> Nested loop left join  (cost=75824 rows=305) (actual time=0.161..0.213 rows=2 loops=1)
    -> Nested loop left join  (cost=50766 rows=297) (actual time=0.158..0.208 rows=2 loops=1)
        -> Nested loop left join  (cost=26363 rows=289) (actual time=0.155..0.204 rows=2 loops=1)
            -> Nested loop left join  (cost=2597 rows=281) (actual time=0.153..0.2 rows=2 loops=1)
                -> Left hash join (autoscale_vmgroup_vm_map.instance_id = vm_instance.id)  (cost=2569 rows=281) (actual time=0.151..0.198 rows=2 loops=1)
                    -> Nested loop left join  (cost=2243 rows=281) (actual time=0.14..0.186 rows=2 loops=1)
                        -> Left hash join (affinity_group_vm_map.instance_id = vm_instance.id)  (cost=2215 rows=281) (actual time=0.139..0.185 rows=2 loops=1)
                            -> Nested loop left join  (cost=8670 rows=281) (actual time=0.135..0.18 rows=2 loops=1)
                                -> Left hash join (resource_tags.resource_id = vm_instance.id)  (cost=1607 rows=214) (actual time=0.124..0.165 rows=2 loops=1)
                                    -> Nested loop left join  (cost=17858 rows=214) (actual time=0.118..0.158 rows=2 loops=1)
                                        -> Nested loop left join  (cost=285 rows=208) (actual time=0.112..0.148 rows=2 loops=1)
                                            -> Nested loop left join  (cost=8.34 rows=0.00322) (actual time=0.108..0.141 rows=2 loops=1)
                                                -> Nested loop left join  (cost=8.32 rows=0.00322) (actual time=0.107..0.14 rows=2 loops=1)
                                                    -> Nested loop left join  (cost=8.31 rows=0.00322) (actual time=0.102..0.135 rows=2 loops=1)
                                                        -> Nested loop left join  (cost=8.1 rows=0.0022) (actual time=0.0945..0.118 rows=2 loops=1)
                                                            -> Nested loop left join  (cost=8.08 rows=0.0022) (actual time=0.0934..0.117 rows=2 loops=1)
                                                                -> Nested loop left join  (cost=8.07 rows=0.0022) (actual time=0.0925..0.116 rows=2 loops=1)
                                                                    -> Nested loop left join  (cost=7.85 rows=0.0022) (actual time=0.0907..0.113 rows=2 loops=1)
                                                                        -> Nested loop left join  (cost=7.83 rows=0.0022) (actual time=0.0869..0.109 rows=2 loops=1)
                                                                            -> Nested loop left join  (cost=7.78 rows=0.0022) (actual time=0.086..0.108 rows=2 loops=1)
                                                                                -> Nested loop left join  (cost=7.76 rows=0.0022) (actual time=0.0837..0.105 rows=2 loops=1)
                                                                                    -> Nested loop left join  (cost=7.75 rows=0.0022) (actual time=0.0804..0.102 rows=2 loops=1)
                                                                                        -> Nested loop left join  (cost=7.73 rows=0.0022) (actual time=0.0771..0.0969 rows=2 loops=1)
                                                                                            -> Nested loop left join  (cost=7.59 rows=0.0015) (actual time=0.0686..0.0846 rows=2 loops=1)
                                                                                                -> Nested loop left join  (cost=7.58 rows=0.0015) (actual time=0.0674..0.0832 rows=2 loops=1)
                                                                                                    -> Nested loop left join  (cost=7.57 rows=0.0015) (actual time=0.063..0.0784 rows=2 loops=1)
                                                                                                        -> Nested loop left join  (cost=7.56 rows=0.0015) (actual time=0.0588..0.0739 rows=2 loops=1)
                                                                                                            -> Nested loop left join  (cost=7.54 rows=0.0015) (actual time=0.0539..0.0687 rows=2 loops=1)
                                                                                                                -> Nested loop left join  (cost=7.49 rows=0.0015) (actual time=0.0529..0.0675 rows=2 loops=1)
                                                                                                                    -> Nested loop inner join  (cost=7.34 rows=0.0015) (actual time=0.0508..0.0645 rows=2 loops=1)
                                                                                                                        -> Nested loop left join  (cost=7.29 rows=0.0015) (actual time=0.0473..0.0596 rows=2 loops=1)
                                                                                                                            -> Nested loop left join  (cost=7.26 rows=0.0015) (actual time=0.044..0.0536 rows=2 loops=1)
                                                                                                                                -> Nested loop left join  (cost=7.24 rows=0.0015) (actual time=0.0408..0.05 rows=2 loops=1)
                                                                                                                                    -> Nested loop inner join  (cost=7.23 rows=0.0015) (actual time=0.0379..0.0467 rows=2 loops=1)
                                                                                                                                        -> Inner hash join (vm_instance.domain_id = domain.id)  (cost=7.22 rows=0.0015) (actual time=0.0333..0.04 rows=2 loops=1)
                                                                                                                                            -> Filter: ((vm_instance.id in (1,2,3,4,5)) and (vm_instance.removed is null))  (cost=2.22 rows=0.05) (actual time=0.0165..0.0225 rows=2 loops=1)
                                                                                                                                                -> Index range scan on vm_instance using PRIMARY over (id = 1) OR (id = 2) OR (3 more)  (cost=2.22 rows=5) (actual time=0.0107..0.0207 rows=5 loops=1)
                                                                                                                                            -> Hash
                                                                                                                                                -> Table scan on domain  (cost=0.55 rows=3) (actual time=0.00857..0.0105 rows=3 loops=1)
                                                                                                                                        -> Single-row index lookup on account using PRIMARY (id=vm_instance.account_id)  (cost=0.75 rows=1) (actual time=0.00288..0.00289 rows=1 loops=2)
                                                                                                                                    -> Single-row index lookup on guest_os using PRIMARY (id=vm_instance.guest_os_id)  (cost=0.75 rows=1) (actual time=0.00147..0.00149 rows=1 loops=2)
                                                                                                                                -> Single-row index lookup on host_pod_ref using PRIMARY (id=vm_instance.pod_id)  (cost=0.75 rows=1) (actual time=0.00161..0.00163 rows=1 loops=2)
                                                                                                                            -> Index lookup on projects using fk_projects__project_account_id (project_account_id=vm_instance.account_id)  (cost=0.917 rows=1) (actual time=0.00257..0.00277 rows=0.5 loops=2)
                                                                                                                        -> Single-row index lookup on user_vm using PRIMARY (id=vm_instance.id)  (cost=0.994 rows=1) (actual time=0.00227..0.00228 rows=1 loops=2)
                                                                                                                    -> Index lookup on instance_group_vm_map using fk_instance_group_vm_map___instance_id (instance_id=vm_instance.id)  (cost=1.67 rows=1) (actual time=0.00128..0.00128 rows=0 loops=2)
                                                                                                                -> Single-row index lookup on instance_group using PRIMARY (id=instance_group_vm_map.group_id)  (cost=1 rows=1) (actual time=480e-6..480e-6 rows=0 loops=2)
                                                                                                            -> Single-row index lookup on data_center using PRIMARY (id=vm_instance.data_center_id)  (cost=0.75 rows=1) (actual time=0.00241..0.00243 rows=1 loops=2)
                                                                                                        -> Single-row index lookup on host using PRIMARY (id=vm_instance.host_id)  (cost=0.75 rows=1) (actual time=0.0021..0.00211 rows=1 loops=2)
                                                                                                    -> Single-row index lookup on vm_template using PRIMARY (id=vm_instance.vm_template_id)  (cost=0.75 rows=1) (actual time=0.00223..0.00224 rows=1 loops=2)
                                                                                                -> Single-row index lookup on iso using PRIMARY (id=user_vm.iso_id)  (cost=0.75 rows=1) (actual time=516e-6..516e-6 rows=0 loops=2)
                                                                                            -> Index lookup on volumes using i_volumes__instance_id (instance_id=vm_instance.id)  (cost=1.89 rows=1.47) (actual time=0.00561..0.00597 rows=1 loops=2)
                                                                                        -> Single-row index lookup on service_offering using PRIMARY (id=vm_instance.service_offering_id)  (cost=0.511 rows=1) (actual time=0.00235..0.00236 rows=1 loops=2)
                                                                                    -> Single-row covering index lookup on svc_disk_offering using PRIMARY (id=volumes.disk_offering_id)  (cost=0.538 rows=1) (actual time=0.00163..0.00164 rows=1 loops=2)
                                                                                -> Single-row index lookup on disk_offering using PRIMARY (id=volumes.disk_offering_id)  (cost=0.538 rows=1) (actual time=0.00116..0.00117 rows=1 loops=2)
                                                                            -> Single-row index lookup on backup_offering using PRIMARY (id=vm_instance.backup_offering_id)  (cost=0.682 rows=1) (actual time=448e-6..448e-6 rows=0 loops=2)
                                                                        -> Single-row index lookup on storage_pool using PRIMARY (id=volumes.pool_id)  (cost=0.538 rows=1) (actual time=0.00186..0.00188 rows=1 loops=2)
                                                                    -> Index lookup on security_group_vm_map using fk_security_group_vm_map___instance_id (instance_id=vm_instance.id)  (cost=1.45 rows=1) (actual time=0.00113..0.00113 rows=0 loops=2)
                                                                -> Single-row index lookup on security_group using PRIMARY (id=security_group_vm_map.security_group_id)  (cost=0.511 rows=1) (actual time=422e-6..422e-6 rows=0 loops=2)
                                                            -> Single-row index lookup on user_data using PRIMARY (id=user_vm.user_data_id)  (cost=0.511 rows=1) (actual time=550e-6..550e-6 rows=0 loops=2)
                                                        -> Filter: (nics.removed is null)  (cost=1.61 rows=1.46) (actual time=0.00757..0.00803 rows=1 loops=2)
                                                            -> Index lookup on nics using fk_nics__instance_id (instance_id=vm_instance.id)  (cost=1.61 rows=1.46) (actual time=0.0071..0.00775 rows=1.5 loops=2)
                                                    -> Single-row index lookup on networks using PRIMARY (id=nics.network_id)  (cost=0.368 rows=1) (actual time=0.00215..0.00219 rows=1 loops=2)
                                                -> Filter: (vpc.removed is null)  (cost=0.368 rows=1) (actual time=572e-6..572e-6 rows=0 loops=2)
                                                    -> Single-row index lookup on vpc using PRIMARY (id=networks.vpc_id)  (cost=0.368 rows=1) (actual time=473e-6..473e-6 rows=0 loops=2)
                                            -> Index lookup on user_ip_address using fk_user_ip_address__vm_id (vm_id=vm_instance.id)  (cost=20896 rows=64676) (actual time=0.00279..0.00279 rows=0 loops=2)
                                        -> Index lookup on ssh_details using i_name_vm_id (vm_id=vm_instance.id, name='SSH.KeyPairNames')  (cost=0.843 rows=1.03) (actual time=0.0045..0.0051 rows=1 loops=2)
                                    -> Hash
                                        -> Filter: (resource_tags.resource_type = 'UserVm')  (cost=0.0742 rows=1) (actual time=0.00301..0.00301 rows=0 loops=1)
                                            -> Table scan on resource_tags  (cost=0.0742 rows=1) (actual time=0.00283..0.00283 rows=0 loops=1)
                                -> Filter: (async_job.job_status = 0)  (cost=0.329 rows=1.32) (actual time=0.00744..0.00744 rows=0 loops=2)
                                    -> Index lookup on async_job using i_async__instance_type_id (instance_type='VirtualMachine', instance_id=vm_instance.id)  (cost=0.329 rows=1.32) (actual time=0.00641..0.007 rows=1 loops=2)
                            -> Hash
                                -> Table scan on affinity_group_vm_map  (cost=0.0777 rows=1) (actual time=0.00142..0.00142 rows=0 loops=1)
                        -> Single-row index lookup on affinity_group using PRIMARY (id=affinity_group_vm_map.affinity_group_id)  (cost=5.33e-6 rows=1) (actual time=496e-6..496e-6 rows=0 loops=2)
                    -> Hash
                        -> Table scan on autoscale_vmgroup_vm_map  (cost=0.0903 rows=1) (actual time=0.00693..0.00693 rows=0 loops=1)
                -> Single-row index lookup on autoscale_vmgroups using PRIMARY (id=autoscale_vmgroup_vm_map.vmgroup_id)  (cost=4e-6 rows=1) (actual time=484e-6..484e-6 rows=0 loops=2)
            -> Index lookup on custom_cpu using i_name_vm_id (vm_id=vm_instance.id, name='CpuNumber')  (cost=0.843 rows=1.03) (actual time=0.00205..0.00205 rows=0 loops=2)
        -> Index lookup on custom_speed using i_name_vm_id (vm_id=vm_instance.id, name='CpuSpeed')  (cost=0.843 rows=1.03) (actual time=0.00172..0.00172 rows=0 loops=2)
    -> Index lookup on custom_ram_size using i_name_vm_id (vm_id=vm_instance.id, name='memory')  (cost=0.843 rows=1.03) (actual time=0.00216..0.00216 rows=0 loops=2)

For comparison, when the index is used the number of actual rows that could have been scanned reduces dramatically:

 -> Index lookup on user_ip_address using fk_user_ip_address__vm_id (vm_id=vm_instance.id)  (cost=20896 rows=64676) (actual time=0.00279..0.00279 rows=0 loops=2)

@yadvr yadvr merged commit 0f8a839 into apache:4.19 Jun 13, 2024
@yadvr yadvr deleted the uservm-force-index branch June 13, 2024 06:59
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 17, 2024
…sting (apache#9198)

The user_vm_view can end up not picking the right index to join against
the user_ip_address table causing full table scan on the user_ip_address
table. This could be related to a MySQL bug
https://bugs.mysql.com/bug.php?id=41220

In a test environment with 20k shared networks and over 20M IPs, the
listVirtualMachines API was found to take over 17s to return list of
just 10 VMs. However, with this fix it would now take under 200ms to
return the list.

MySQL slow query logging showed ~nearly 20M table scans of the IP
address table:
```
# User@Host: cloud[cloud] @ localhost [127.0.0.1]  Id:    39
# Query_time: 8.227541  Lock_time: 0.000014 Rows_sent: 12  Rows_examined: 19,667,235
SET timestamp=1715410270;
SELECT user_vm_view.id, user_vm_view.name /*snipped*/ FROM user_vm_view
WHERE user_vm_view.id IN (4,6,7,8,9,10,11,12,13,14,15,16);
```

Signed-off-by: Rohit Yadav <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

GUI - search instance always give the first VM name and GUI is very slow

5 participants