Optimize volume integral kernels #102

Merged
huiyuxie merged 9 commits into main from optimize
Dec 30, 2024

huiyuxie commented Dec 26, 2024

This PR optimizes the GPU kernels used for the volume integral step of the semidiscretization.

Update: The original flux_kernel! and weak_form_kernel! are fused into a single flux_weak_form_kernel!. Dynamic shared memory is allocated to apply the tiling algorithm (only one tile here, due to the constraints of the high-dimensional arrays) for further speedup. The 1D, 2D, and 3D cases all run faster.
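
To illustrate the fusion idea, here is a minimal CPU sketch in Python rather than the actual Julia/CUDA kernels from this PR. It assumes a 1D linear advection flux f(u) = a * u and a weak-form update du[i][e] = sum_j dhat[i][j] * f(u[j][e]). The function names, the `dhat` matrix, and the per-element `tile` buffer (standing in for shared memory) are all hypothetical.

```python
# CPU sketch of kernel fusion; NOT the PR's CUDA kernels.
# Assumptions: linear advection flux f(u) = a * u and a weak-form update
# du[i][e] = sum_j dhat[i][j] * f(u[j][e]). All names are hypothetical.

def flux(u, a=1.0):
    """Linear advection flux (assumed for illustration)."""
    return a * u

def volume_integral_unfused(u, dhat):
    n, n_elem = len(u), len(u[0])
    # "Kernel" 1: write every flux value to an intermediate global array.
    flux_arr = [[flux(u[j][e]) for e in range(n_elem)] for j in range(n)]
    # "Kernel" 2: read the intermediate array back and apply the weak form.
    return [[sum(dhat[i][j] * flux_arr[j][e] for j in range(n))
             for e in range(n_elem)] for i in range(n)]

def volume_integral_fused(u, dhat):
    n, n_elem = len(u), len(u[0])
    du = [[0.0] * n_elem for _ in range(n)]
    for e in range(n_elem):  # one element plays the role of one thread block
        # Stage this element's fluxes once in a small buffer (the single tile);
        # on a GPU this buffer would live in shared memory.
        tile = [flux(u[j][e]) for j in range(n)]
        for i in range(n):
            du[i][e] = sum(dhat[i][j] * tile[j] for j in range(n))
    return du

dhat = [[1.0, 2.0], [3.0, 4.0]]
u = [[1.0, 0.5], [2.0, 1.5]]
print(volume_integral_fused(u, dhat) == volume_integral_unfused(u, dhat))
```

The fused version never materializes the full flux array, which is the part of the change that removes one kernel launch and one round trip through global memory.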

Update 2: A CUDA library could be tried later for some of the matrix multiplications.

huiyuxie self-assigned this Dec 27, 2024
huiyuxie added the performance label Dec 27, 2024
@huiyuxie

Benchmark on the 1D basic advection equation, before optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.200 μs …  1.489 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.297 μs ± 39.720 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▆▄▂▂▂▂▂▂▂▁▁▁▁▂▂▃▃▃▂▂▁▁             ▁▁  ▂▂                ▂
  ███████████████████████████▇▇█▇██▆▇▇▇████████▇▇▆▅▆▅▅▅▅▅▄▅▅▅ █
  24.2 μs      Histogram: log(frequency) by time       110 μs <

 Memory estimate: 3.27 KiB, allocs estimate: 60.

Benchmark on the 1D basic advection equation, after optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  19.200 μs … 267.400 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     23.200 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   27.067 μs ±  11.579 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 ▄▆▇▇█▆▄▃▃▁▁        ▂▄▃▃▃▁▁                                   ▂
 ███████████████████████████▇▇█▇▇▇▇▇▇▇▇▇▆▆▇▅▄▆▅▂▄▆▆▅▇▆▆▅▅▄▄▄▄ █
 19.2 μs       Histogram: log(frequency) by time      73.4 μs <

Memory estimate: 1.88 KiB, allocs estimate: 32.

@huiyuxie

Benchmark on the 2D basic advection equation, before optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.100 μs … 755.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.577 μs ±  31.178 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █
  ▂███▄▃▄▄▄▆▅▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  28.1 μs         Histogram: frequency by time          191 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 71.

Benchmark on the 2D basic advection equation, after optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.400 μs … 755.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.092 μs ±  16.107 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▆▆▃▁▂▄▅▃▁▁ ▁ ▁▁▁      ▁▁                      ▁▁          ▂
  ████████████████████▇▇▅▆████▇▇▇▇▆▆▆▅▇▆▆▆▆▇▆▇▇▆▅▆████▇▆▆▆▇▅▄▅ █
  18.4 μs       Histogram: log(frequency) by time      73.4 μs <

 Memory estimate: 1.91 KiB, allocs estimate: 32.

@huiyuxie

Benchmark on the 3D basic advection equation, before optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  127.200 μs …  48.062 ms  ┊ GC (min … max): 0.00% … 21.84%
 Time  (median):     133.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   149.011 μs ± 480.095 μs  ┊ GC (mean ± σ):  0.70% ±  0.22%

  ▃▆██▆▄▄▄▄▃▂▁ ▁          ▁▁▁▁    ▁▁          ▁▁                ▂
  ██████████████████▇██▇▇█████████████▇▇▇▅▆██▇███▆▇▆▄▆▆▆▆▆▅▅▄▄▆ █
  127 μs        Histogram: log(frequency) by time        243 μs <

 Memory estimate: 5.31 KiB, allocs estimate: 82.

Benchmark on the 3D basic advection equation, after optimization

[ Info: Time for reset_du! and volume_integral! on GPU
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  23.500 μs … 543.100 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     28.900 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   38.314 μs ±  19.512 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ██
 ▃▆██▇▄▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
 23.5 μs         Histogram: frequency by time          100 μs <

Memory estimate: 4.89 KiB, allocs estimate: 80.
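
For a quick comparison, the median times reported above work out to speedups of roughly 1.2x (1D), 2.1x (2D), and 4.6x (3D). The short Python snippet below only redoes that arithmetic from the numbers in the benchmark output. It is not part of the PR.

```python
# Median times in microseconds (before, after), copied from the benchmark output above.
medians_us = {"1D": (28.2, 23.2), "2D": (44.3, 21.2), "3D": (133.8, 28.9)}

# Speedup = median before / median after, rounded to two decimals.
speedups = {dim: round(before / after, 2)
            for dim, (before, after) in medians_us.items()}

for dim, s in speedups.items():
    print(f"{dim}: {s}x")
```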

huiyuxie commented Dec 29, 2024

A conditional branch has been added to the volume integral to select between kernel optimizations based on the sizes of du and u.

In essence, the choice depends on the maximum number of threads that can be launched per block.
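
A rough sketch of such a dispatch, written in Python for readability rather than taken from the PR, could look like the following. The criterion used here (one thread per node of an element, so `n_nodes ** n_dims` threads must fit in one block), the default limit of 1024, and the returned kernel labels are all assumptions made for illustration.

```python
def select_volume_kernel(n_nodes, n_dims, max_threads_per_block=1024):
    """Pick a kernel variant from the problem size (hypothetical criterion).

    Assumes one thread per node of an element, so a block needs
    n_nodes ** n_dims threads. 1024 is a common per-block limit on
    NVIDIA GPUs, but the real value should come from the device.
    """
    threads_needed = n_nodes ** n_dims
    if threads_needed <= max_threads_per_block:
        # The whole element fits in one block: use the fused, shared-memory path.
        return "fused_shared_memory"
    # Too large for one block: fall back to the separate kernels.
    return "separate_kernels"

print(select_volume_kernel(4, 3))   # 64 threads needed
print(select_volume_kernel(11, 3))  # 1331 threads needed
```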

Update: The device property queries before the kernel launch introduce some overhead. Consider optimizing by constructing a struct to store hardware properties, avoiding repetitive queries.
