nvidia.com: A CUDA Dynamic Parallelism Case Study, PANDA

nvidia.com: A CUDA Dynamic Parallelism Case Study, PANDA

Dynamic parallelism is better than host streams for 3 reasons.

  • Avoiding extra PCI-e data transfers. The launch configuration of each subsequent kernel depends on results of the previous kernel, which are stored in device memory. For the dynamic parallelism version, reading these results requires just a single memory access, while for host streams a data transfer from device to host is needed.
  • Achieving higher launch throughput with dynamic parallelism, when compared to launching from host using multiple streams.
  • Reducing false dependencies between kernel launches. All host streams are served by a single thread, which means that when waiting on a single stream, work cannot be queued into other streams, even if the data for launch configuration is available. For dynamic parallelism, there are multiple thread blocks queuing kernels executed asynchronously.

The dynamic parallelism version is also more straightforward to implement, and leads to cleaner and more readable code. So in the case of PANDA, using dynamic parallelism improved both performance and productivity.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s