pyfr.org: Python Flux Reconstruction

PyFR is an open-source Python-based framework for solving advection-diffusion type problems on streaming architectures using the Flux Reconstruction approach of Huynh. The framework is designed to solve a range of governing systems on mixed unstructured grids containing various element types. It is also designed to target a range of hardware platforms through an in-built domain-specific language derived from the Mako templating engine. The current release (PyFR 1.0.0) has the following capabilities:

Governing Equations – Euler, Navier-Stokes
Dimensionality – 2D, 3D
Element Types – Triangles, Quadrilaterals, Hexahedra, Prisms, Tetrahedra, Pyramids
Platforms – CPU Clusters, Nvidia GPU Clusters, AMD GPU Clusters
Spatial Discretisation – High-Order Flux Reconstruction
Temporal Discretisation – Explicit Runge-Kutta
Precision – Single, Double
Mesh Files Imported – Gmsh (.msh)
Solution Files Exported – Unstructured VTK (.vtu, .pvtu)

PyFR is being developed in the Vincent Lab, Department of Aeronautics, Imperial College London, UK.

Development of PyFR is supported by the Engineering and Physical Sciences Research Council, Innovate UK, the European Commission, BAE Systems, and Airbus. We are also grateful for hardware donations from Nvidia, Intel, and AMD.

Overview
PyFR 1.0.0 has a hard dependency on Python 3.3+ and the following Python packages:

h5py >= 2.5
mako >= 1.0.0
mpi4py >= 1.3
mpmath >= 0.18
numpy >= 1.8
pytools >= 2014.3

Note that, due to a bug in numpy, PyFR is not compatible with 32-bit Python distributions.
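
The requirements above can be checked before installation with a short script. The following is a minimal sketch, not part of PyFR: the helper names (`version_tuple`, `meets_minimum`, `check_requirements`, `is_64bit_python`) are hypothetical, the version comparison is deliberately simplified to leading numeric components, and `importlib.metadata` assumes a much newer Python than the 3.3 minimum listed above. It also assumes the packages are registered under the distribution names used in the list.

```python
import re
import struct
from importlib import metadata

# Minimum versions copied from the dependency list above.
MIN_VERSIONS = {
    "h5py": "2.5",
    "mako": "1.0.0",
    "mpi4py": "1.3",
    "mpmath": "0.18",
    "numpy": "1.8",
    "pytools": "2014.3",
}


def version_tuple(text):
    """Turn '1.8.0rc1' into (1, 8, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MIN_VERSIONS, get_version=metadata.version):
    """Map each package name to (installed version or None, requirement satisfied)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, meets_minimum(installed, minimum))
    return report


def is_64bit_python():
    """True when the interpreter uses 64-bit pointers."""
    return struct.calcsize("P") * 8 == 64
```

Calling `check_requirements()` returns a dictionary that can be printed or inspected, and `is_64bit_python()` covers the 32-bit restriction noted above.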

CUDA Backend
The CUDA backend targets NVIDIA GPUs with a compute capability of 2.0 or greater. The backend requires:

CUDA >= 4.2
pycuda >= 2011.2

OpenCL Backend
The OpenCL backend targets a range of accelerators including GPUs from AMD and NVIDIA. The backend requires:

OpenCL
pyopencl >= 2013.2
clBLAS

OpenMP Backend
The OpenMP backend targets multi-core CPUs. The backend requires:

GCC >= 4.7
A BLAS library compiled as a shared library (e.g. OpenBLAS)

Running in Parallel
To partition meshes for running in parallel it is also necessary to have one of the following partitioners installed:

metis >= 5.0
scotch >= 6.0
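
Because the OpenMP backend and the partitioners rely on shared libraries rather than Python packages, it can help to confirm that the system linker can find them. The sketch below uses the standard library's `ctypes.util.find_library`. The library names tried (`openblas`, `blas`, `metis`, `scotch`) are assumptions about how these libraries are named on a given system, and the source does not say how PyFR itself locates them, so this is only a user-side pre-flight check. `locate_libraries` is a hypothetical helper.

```python
from ctypes.util import find_library

# Candidate linker names per component; these names are assumptions and
# may differ between platforms and distributions.
CANDIDATES = {
    "BLAS": ("openblas", "blas"),
    "METIS": ("metis",),
    "SCOTCH": ("scotch",),
}


def locate_libraries(candidates=CANDIDATES, finder=find_library):
    """Return the first library found for each component, or None if none is found."""
    found = {}
    for label, names in candidates.items():
        found[label] = next((hit for hit in map(finder, names) if hit), None)
    return found
```

A `None` entry for both METIS and SCOTCH suggests that neither partitioner is visible to the linker, although a library installed in a non-standard location may still be usable.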

nvidia.com: A CUDA Dynamic Parallelism Case Study, PANDA

Dynamic parallelism is better than host streams for three reasons:

  • Avoiding extra PCIe data transfers. The launch configuration of each subsequent kernel depends on the results of the previous kernel, which are stored in device memory. For the dynamic parallelism version, reading these results requires just a single memory access, while for host streams a data transfer from device to host is needed.
  • Achieving higher launch throughput than launching from the host using multiple streams.
  • Reducing false dependencies between kernel launches. All host streams are served by a single thread, which means that when waiting on a single stream, work cannot be queued into other streams, even if the data for launch configuration is available. For dynamic parallelism, there are multiple thread blocks queuing kernels executed asynchronously.
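
The false-dependency effect in the third point can be illustrated with a toy timing model. This is a sketch under strong assumptions, not a measurement of PANDA: kernel durations are hypothetical unit-less numbers, launch and transfer costs are ignored, each chain is non-empty, and the host is assumed to visit streams in round-robin order, blocking on each one before queuing its next kernel. The function names are hypothetical.

```python
def host_stream_makespan(chains):
    """Finish time when one host thread blocks on each stream before relaunching.

    `chains` holds one list of kernel durations per stream; each kernel may
    only be queued after the previous kernel in its chain has finished.
    """
    # The first kernel of every chain is queued at time 0.
    finish = [chain[0] for chain in chains]
    next_idx = [1] * len(chains)
    host_time = 0
    pending = True
    while pending:
        pending = False
        for s, chain in enumerate(chains):
            if next_idx[s] < len(chain):
                # The host waits on stream s; no other stream is served meanwhile.
                host_time = max(host_time, finish[s])
                finish[s] = host_time + chain[next_idx[s]]
                next_idx[s] += 1
                pending = True
    return max(finish)


def device_side_makespan(chains):
    """Finish time when every chain relaunches itself independently on the device."""
    return max(sum(chain) for chain in chains)


chains = [[4, 1], [1, 5]]
print(host_stream_makespan(chains))   # 9
print(device_side_makespan(chains))   # 6
```

In this model the second chain's first kernel finishes at time 1, but its successor is not queued until time 4 because the host is still waiting on the first stream. With device-side launches, each chain proceeds as soon as its own data is ready.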

The dynamic parallelism version is also more straightforward to implement, and leads to cleaner and more readable code. So in the case of PANDA, using dynamic parallelism improved both performance and productivity.