Optimize volume integral kernels by huiyuxie · Pull Request #102 · trixi-gpu/TrixiCUDA.jl

huiyuxie · 2024-12-26T01:40:33Z

This PR addresses kernel optimization for the volume integral process in the semidiscretization.

Update: The original flux_kernel! and weak_form_kernel! are fused into a single flux_weak_form_kernel!. Dynamic shared memory is allocated to apply the tiling algorithm (only one tile here, due to the constraints of the high-dimensional matrix) for further speedup. The 1D, 2D, and 3D cases all gain a speedup.

Update 2: A CUDA library could be tried later for some of the matrix multiplication.
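As a rough illustration of the fusion idea, here is a minimal CPU-side sketch in plain Python. It is not the kernel code from this PR: the function names, the array layout, and the scalar linear flux are all assumptions chosen for a 1D example. The unfused version stores every flux value in an intermediate array before applying the derivative matrix, while the fused version keeps the fluxes of one element in a small local tile, which plays the role of shared memory.

```python
# Hypothetical 1D sketch: names, shapes, and the linear flux are assumptions.

def flux(u_node, advection_velocity=1.0):
    """Linear advection flux f(u) = a * u for one scalar value."""
    return advection_velocity * u_node


def volume_integral_unfused(u, dhat):
    """Two passes: store every flux value, then apply the derivative matrix."""
    n_elements, n_nodes = len(u), len(dhat)
    # Pass 1 (stand-in for flux_kernel!): full intermediate flux array.
    flux_arr = [[flux(u[k][i]) for i in range(n_nodes)] for k in range(n_elements)]
    # Pass 2 (stand-in for weak_form_kernel!):
    # du[k][j] = sum_i dhat[j][i] * flux_arr[k][i]
    return [
        [sum(dhat[j][i] * flux_arr[k][i] for i in range(n_nodes)) for j in range(n_nodes)]
        for k in range(n_elements)
    ]


def volume_integral_fused(u, dhat):
    """One pass: each element keeps its fluxes in a small local tile."""
    n_nodes = len(dhat)
    du = []
    for u_elem in u:  # one GPU block per element in the analogy
        tile = [flux(u_i) for u_i in u_elem]  # plays the role of shared memory
        du.append(
            [sum(dhat[j][i] * tile[i] for i in range(n_nodes)) for j in range(n_nodes)]
        )
    return du


# Illustrative data: 2 elements with 3 nodes each.
dhat = [[-1.0, 1.0, 0.0], [-0.5, 0.0, 0.5], [0.0, -1.0, 1.0]]
u = [[1.0, 2.0, 4.0], [0.5, 0.25, 0.125]]
assert volume_integral_fused(u, dhat) == volume_integral_unfused(u, dhat)
```

Both versions perform the same arithmetic in the same order, so the results match exactly. The difference on a GPU would be that the fused form avoids writing the intermediate flux array to global memory and reading it back.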

huiyuxie · 2024-12-29T19:45:07Z

Benchmark on 1D advection basic equation before optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.200 μs …  1.489 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.297 μs ± 39.720 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▆▄▂▂▂▂▂▂▂▁▁▁▁▂▂▃▃▃▂▂▁▁             ▁▁  ▂▂                ▂
  ███████████████████████████▇▇█▇██▆▇▇▇████████▇▇▆▅▆▅▅▅▅▅▄▅▅▅ █
  24.2 μs      Histogram: log(frequency) by time       110 μs <

 Memory estimate: 3.27 KiB, allocs estimate: 60.

Benchmark on 1D advection basic equation after optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  19.200 μs … 267.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.067 μs ±  11.579 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▆▇▇█▆▄▃▃▁▁        ▂▄▃▃▃▁▁                                   ▂
  ███████████████████████████▇▇█▇▇▇▇▇▇▇▇▇▆▆▇▅▄▆▅▂▄▆▆▅▇▆▆▅▅▄▄▄▄ █
  19.2 μs       Histogram: log(frequency) by time      73.4 μs <

 Memory estimate: 1.88 KiB, allocs estimate: 32.

huiyuxie · 2024-12-29T19:45:41Z

Benchmark on 2D advection basic equation before optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.100 μs … 755.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.577 μs ±  31.178 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █
  ▂███▄▃▄▄▄▆▅▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  28.1 μs         Histogram: frequency by time          191 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 71.

Benchmark on 2D advection basic equation after optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.400 μs … 755.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.092 μs ±  16.107 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▆▆▃▁▂▄▅▃▁▁ ▁ ▁▁▁      ▁▁                      ▁▁          ▂
  ████████████████████▇▇▅▆████▇▇▇▇▆▆▆▅▇▆▆▆▆▇▆▇▇▆▅▆████▇▆▆▆▇▅▄▅ █
  18.4 μs       Histogram: log(frequency) by time      73.4 μs <

 Memory estimate: 1.91 KiB, allocs estimate: 32.

huiyuxie · 2024-12-29T20:36:15Z

Benchmark on 3D advection basic equation before optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  127.200 μs …  48.062 ms  ┊ GC (min … max): 0.00% … 21.84%
 Time  (median):     133.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   149.011 μs ± 480.095 μs  ┊ GC (mean ± σ):  0.70% ±  0.22%

  ▃▆██▆▄▄▄▄▃▂▁ ▁          ▁▁▁▁    ▁▁          ▁▁                ▂
  ██████████████████▇██▇▇█████████████▇▇▇▅▆██▇███▆▇▆▄▆▆▆▆▆▅▅▄▄▆ █
  127 μs        Histogram: log(frequency) by time        243 μs <

 Memory estimate: 5.31 KiB, allocs estimate: 82.

Benchmark on 3D advection basic equation after optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.500 μs … 543.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.314 μs ±  19.512 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ██
  ▃▆██▇▄▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  23.5 μs         Histogram: frequency by time          100 μs <

 Memory estimate: 4.89 KiB, allocs estimate: 80.
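To summarize the three benchmark pairs above, the median times can be turned into speedup factors. The short script below is only a convenience sketch. The numbers are copied from the BenchmarkTools output above, while the dictionary layout and the helper name are made up for illustration.

```python
# Median times in microseconds (before, after), copied from the output above.
median_us = {
    "1D": (28.2, 23.2),
    "2D": (44.3, 21.2),
    "3D": (133.8, 28.9),
}


def speedup(before, after):
    """Speedup factor: how many times faster the optimized run is."""
    return before / after


speedups = {dim: speedup(before, after) for dim, (before, after) in median_us.items()}

for dim, factor in speedups.items():
    print(f"{dim}: {factor:.2f}x")
```

By the medians, this gives roughly 1.2x in 1D, 2.1x in 2D, and 4.6x in 3D.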

huiyuxie · 2024-12-29T20:43:32Z

A conditional branch has been added to the volume integral to select different kernel optimizations based on the sizes of du and u.

Essentially, the choice depends on the maximum number of threads that can be launched per block.
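A minimal sketch of how such a size-based branch might look, written in plain Python for illustration. The mapping of one block per element and one thread per entry of an element, the limit of 1024 threads, and all names here are assumptions, not the logic actually merged in this PR.

```python
# Hypothetical sketch: the limit, names, and launch layout are assumptions.
MAX_THREADS_PER_BLOCK = 1024  # common limit on NVIDIA GPUs; query the device in real code


def select_volume_integral_kernel(du_shape, max_threads=MAX_THREADS_PER_BLOCK):
    """Pick a kernel strategy from the shape of du = (variables, nodes..., elements)."""
    *per_element_dims, n_elements = du_shape
    # Assumed layout: one block per element, one thread per entry of that element.
    threads_per_block = 1
    for extent in per_element_dims:
        threads_per_block *= extent
    if threads_per_block <= max_threads:
        # The whole element fits in one block, so the fused kernel can be used.
        return {"kernel": "fused", "threads": threads_per_block, "blocks": n_elements}
    # Too many entries per element for a single block: fall back to separate kernels.
    return {"kernel": "separate", "threads": None, "blocks": None}


# A 3D case with 1 variable and 4 nodes per direction: 64 threads per block.
print(select_volume_integral_kernel((1, 4, 4, 4, 8)))
# A 3D case with 5 variables and 8 nodes per direction: 2560 entries, over the limit.
print(select_volume_integral_kernel((5, 8, 8, 8, 100)))
```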

Update: The device property queries before the kernel launch introduce some overhead. Consider optimizing by constructing a struct to store hardware properties, avoiding repetitive queries.
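The caching idea in the update above could look something like the following sketch. It is plain Python rather than the package's own code. The struct fields, the function names, and the property values are hypothetical, and the driver query is simulated so that the effect of caching is visible.

```python
from dataclasses import dataclass
from functools import lru_cache

query_log = []  # records each (simulated) driver query, to show caching works


@dataclass(frozen=True)
class DeviceProperties:
    """Hypothetical container for hardware limits needed at kernel launch."""
    max_threads_per_block: int
    max_shared_memory_per_block: int  # in bytes


def query_device_properties():
    """Stand-in for a slow driver query; the values are illustrative only."""
    query_log.append("query")
    return DeviceProperties(
        max_threads_per_block=1024,
        max_shared_memory_per_block=48 * 1024,
    )


@lru_cache(maxsize=None)
def device_properties():
    """Query once, then reuse the stored struct on every later call."""
    return query_device_properties()


# Three "kernel launches" trigger only a single underlying query.
for _ in range(3):
    props = device_properties()
print(props.max_threads_per_block, len(query_log))
```

Storing the properties once trades a small amount of memory for the removal of a query on every launch. It assumes the device does not change during the run.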

huiyuxie added 2 commits December 23, 2024 21:58

Start

bf0992e

Start

9eba3a6

huiyuxie self-assigned this Dec 27, 2024

huiyuxie added the performance label Dec 27, 2024

huiyuxie added 2 commits December 26, 2024 18:33

Complete 1D

ce1e70d

dynamic 1D and 2D

ff4fd3b

huiyuxie added 2 commits December 29, 2024 10:37

dynamic 3D

c17b38c

Fix typo

10e372b

huiyuxie added 3 commits December 29, 2024 20:50

Add checks

7eeaf02

Update README.md

e678e4c

Change flux kernel

8f18698

huiyuxie merged commit 630187c into main Dec 30, 2024

huiyuxie deleted the optimize branch January 1, 2025 06:24

This was referenced Jan 1, 2025

Optimize volume integral kernel for flux differencing (frequent use) #105

Merged

Switch to less parallelism to avoid redundant computation #107

Merged

huiyuxie mentioned this pull request Jan 15, 2025

Optimization patch for volume integral kernels #115

Merged

3 tasks

huiyuxie merged 9 commits into main from optimize