Dynamic parallelism is better than host streams for 3 reasons.
- Avoiding extra PCI-e data transfers. The launch configuration of each subsequent kernel depends on results of the previous kernel, which are stored in device memory. For the dynamic parallelism version, reading these results requires just a single memory access, while for host streams a data transfer from device to host is needed.
- Achieving higher launch throughput with dynamic parallelism, when compared to launching from host using multiple streams.
- Reducing false dependencies between kernel launches. All host streams are served by a single thread, which means that when waiting on a single stream, work cannot be queued into other streams, even if the data for launch configuration is available. For dynamic parallelism, there are multiple thread blocks queuing kernels executed asynchronously.
The dynamic parallelism version is also more straightforward to implement, and leads to cleaner and more readable code. So in the case of PANDA, using dynamic parallelism improved both performance and productivity.