Cuda Dynamic Parallelism Performance

vishwashanika · July 13, 2016, 4:34am

Hi Guys,

I am new to CUDA and am currently comparing the performance of dynamic parallelism against host launching.
I have three kernels, and I measured each one when launched from the host and when launched from the device (dynamic parallelism).

The parent kernel for dynamic parallelism is launched with these dimensions:
grid - (1, 1, 1)
block - (1, 1, 1)
The dimensions of each child kernel are listed below. Each child is launched after the previous kernel has completed (not recursively). I have set cudaLimitDevRuntimeSyncDepth to 2 and cudaLimitDevRuntimePendingLaunchCount to 1024 * 128.

The host-launched kernels use the same dimensions as the child kernels.
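
The launch structure described above can be sketched as follows. This is a minimal sketch, not the actual code from the tests: the kernel names (`childA`, `childB`, `childC`, `parentKernel`), their bodies, and the buffer size are hypothetical placeholders. It also assumes the children are launched by one thread into the same default stream, so they run in order. Whether the original code additionally calls device-side `cudaDeviceSynchronize()` is not stated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the map, reduce, and sort kernels.
// Each one just adds 1 to the elements its threads cover.
__global__ void childA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}
__global__ void childB(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}
__global__ void childC(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Parent kernel, launched with a single thread. The three child launches
// are issued by that thread into the same stream, so they execute in order.
__global__ void parentKernel(float *data, int n) {
    childA<<<1024, 1024>>>(data, n);
    childB<<<1024, 1024>>>(data, n);
    childC<<<1, 512>>>(data, n);
}

int main() {
    const int n = 1024 * 1024;  // assumed problem size
    float *data = nullptr;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // Device runtime limits with the values mentioned in the post.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 1024 * 128);

    parentKernel<<<1, 1>>>(data, n);
    cudaError_t err = cudaDeviceSynchronize();  // waits for parent and children
    printf("parent + children: %s\n", cudaGetErrorString(err));

    cudaFree(data);
    return 0;
}
```

Assuming a file name such as `cdp_sketch.cu` (hypothetical), this could be built with `nvcc -arch=sm_52 -rdc=true cdp_sketch.cu -lcudadevrt`, since dynamic parallelism needs relocatable device code and the device runtime library.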

The table below lists the kernel dimensions and the execution time for host launching and device launching.

| Kernel   | Calculation type | Grid dimension | Block dimension | Host launch | Device launch |
|----------|------------------|----------------|-----------------|-------------|---------------|
| Kernel 1 | Map operation    | 1024           | 1024            | 52.7 us     | 119.2 us      |
| Kernel 2 | Reduce operation | 1024           | 1024            | 183.7 us    | 334.9 us      |
| Kernel 3 | Sort operation   | 1              | 512             | 221.7 us    | 383.3 us      |

I found some more details [here](http://users.ece.gatech.edu/~sudha/academic/class/ece8823/Lectures/Module-6-Microarchitecture/cuda-dyn-par.pdf).

The presentation explains that dynamic parallelism has some synchronization overhead, and it says the kernel execution time itself should be the same.
But I observed that the kernel execution time with dynamic parallelism is higher than with host launching.

I am not sure whether these results are expected. Am I doing something wrong?

Test Environment
GPU - GeForce GTX 980
OS - Red Hat Enterprise Linux Server release 6.6 (Linux k7-1 2.6.32-504.el6.x86_64)
CPU - Intel(R) Core™ i7-4770 CPU @ 3.40GHz
The timestamps were taken after the second iteration.
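
One way to take such timings for the host-launched case is with CUDA events. This is a minimal sketch under assumptions, not the measurement code used for the numbers above: `mapKernel`, the buffer size, the iteration count, and the `WARMUP` constant are all hypothetical, and skipping the first two iterations is only one reading of "after the second iteration".

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the measured kernels.
__global__ void mapKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main() {
    const int n = 1024 * 1024;  // assumed problem size
    const int WARMUP = 2;       // assumed: iterations not reported
    const int ITERS = 5;        // assumed: total iterations

    float *data = nullptr;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < ITERS; ++iter) {
        cudaEventRecord(start);              // marker before the launch
        mapKernel<<<1024, 1024>>>(data, n);  // same dimensions as Kernel 1
        cudaEventRecord(stop);               // marker after the launch
        cudaEventSynchronize(stop);          // wait until the kernel is done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        if (iter >= WARMUP)
            printf("iteration %d: %.1f us\n", iter, ms * 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return 0;
}
```

Timing the device-launched case the same way would wrap the parent kernel launch instead, so the measured interval would also include the device-side launch overhead of the children.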

Thank you in advance.

Vishwa

BulatZiganshin · July 13, 2016, 12:47pm

you can use the code tag (last one in toolbox above edit box) to nicely format your table

Robert_Crovella · July 13, 2016, 2:07pm

cross posted (where the formatting is better):

[url]http://stackoverflow.com/questions/38343526/cuda-dynamic-parallelism-performance[/url]

vishwashanika · July 14, 2016, 5:19am

I will use the mentioned feature next time. Thank you :). Actually, I went to Stack Overflow because of that formatting issue.

I have updated the question there with more details.
