Python Flux Reconstruction Python Flux Reconstruction

PyFR is an open-source Python based framework for solving advection-diffusion type problems on streaming architectures using the Flux Reconstruction approach of Huynh. The framework is designed to solve a range of governing systems on mixed unstructured grids containing various element types. It is also designed to target a range of hardware platforms via use of an in-built domain specific language derived from the Mako templating engine. The current release (PyFR 1.0.0) has the following capabilities:

Governing Equations – Euler, Navier Stokes
Dimensionality – 2D, 3D
Element Types – Triangles, Quadrilaterals, Hexahedra, Prisms, Tetrahedra, Pyramids
Platforms – CPU Clusters, Nvidia GPU Clusters, AMD GPU Clusters
Spatial Discretisation – High-Order Flux Reconstruction
Temporal Discretisation – Explicit Runge-Kutta
Precision – Single, Double
Mesh Files Imported – Gmsh (.msh)
Solution Files Exported – Unstructured VTK (.vtu, .pvtu)

PyFR is being developed in the Vincent Lab, Department of Aeronautics, Imperial College London, UK.

Development of PyFR is supported by the Engineering and Physical Sciences Research Council, Innovate UK, the European Commission, BAE Systems, and Airbus. We are also grateful for hardware donations from Nvidia, Intel, and AMD.

PyFR 1.0.0 has a hard dependency on Python 3.3+ and the following Python packages:

h5py >= 2.5
mako >= 1.0.0
mpi4py >= 1.3
mpmath >= 0.18
numpy >= 1.8
pytools >= 2014.3
Note that due to a bug in numpy PyFR is not compatible with 32-bit Python distributions.

CUDA Backend
The CUDA backend targets NVIDIA GPUs with a compute capability of 2.0 or greater. The backend requires:

CUDA >= 4.2
pycuda >= 2011.2
OpenCL Backend
The OpenCL backend targets a range of accelerators including GPUs from AMD and NVIDIA. The backend requires:

pyopencl >= 2013.2
OpenMP Backend
The OpenMP backend targets multi-core CPUs. The backend requires:

GCC >= 4.7
A BLAS library compiled as a shared library (e.g. OpenBLAS)
Running in Parallel
To partition meshes for running in parallel it is also necessary to have one of the following partitioners installed:

metis >= 5.0
scotch >= 6.0 A CUDA Dynamic Parallelism Case Study, PANDA A CUDA Dynamic Parallelism Case Study, PANDA

Dynamic parallelism is better than host streams for 3 reasons.

  • Avoiding extra PCI-e data transfers. The launch configuration of each subsequent kernel depends on results of the previous kernel, which are stored in device memory. For the dynamic parallelism version, reading these results requires just a single memory access, while for host streams a data transfer from device to host is needed.
  • Achieving higher launch throughput with dynamic parallelism, when compared to launching from host using multiple streams.
  • Reducing false dependencies between kernel launches. All host streams are served by a single thread, which means that when waiting on a single stream, work cannot be queued into other streams, even if the data for launch configuration is available. For dynamic parallelism, there are multiple thread blocks queuing kernels executed asynchronously.

The dynamic parallelism version is also more straightforward to implement, and leads to cleaner and more readable code. So in the case of PANDA, using dynamic parallelism improved both performance and productivity. Mayya Tokman Mayya Tokman


A new class of exponential propagation iterative methods of Runge-Kutta type (EPIRK)
M. Tokman, Journal of Computational Physics, 230 (2011) 8762-8778.

New adaptive exponential propagation iterative Runge-Kutta- type (EPIRK) methods
M. Tokman, P. Tranquilli, and J. Loffeld. Submitted, 2011.

Comparative performance of exponential, implicit, and explicit integrators for stiff systems of ODEs.
J. Loffeld and M. Tokman. Submitted, 2010.

Efficient design of exponential-Krylov integrators for large scale computing
M. Tokman and J. Loffeld. Proceedings of the 10th International Conference on Computational Science, Procedia Computer Science, 1(1), pp. 229-237, (2010).

Computational aspects of mucus propulsion by cilated epithelium
R. Chatelin, P. Poncet and M. Tokman, Proceedings of the 2nd European Conference on Microfluidics, Toulouse 2010.

Automated assessment of short free-text responses in computer science using latent semantic analysis
R. Klein, A. Kyrilov & M. Tokman, ITiCSE’ 11 Proceedings of the 16th annual joint conference on Innovation and technology in computer science education, 158-162, Darmstadt (2011).

Efficient integration of large stiff systems of ODEs with exponential propagation iterative (EPI) methods
M. Tokman, Journal of Computational Physics 213 (2006) 748–776.

Three-dimensional Model of the Structure and Evolution of the Coronal Mass Ejections
M. Tokman, P. Bellan, Astrophysical Journal, 567(2), pp. 1202, 2002.

Investigations into the Relationship between Spheromak, Solar and Astrophysical Plasmas
P.M. Bellan, S.C. Hsu, J.F. Hansen, M. Tokman, S.E. Pracko, C.A. Romero-Talamas, Proceedings on 19th International Atomic Energy Agency Fusion Energy Conference, Lyon, 2002.

Emmanuel Goossaert

Emmanuel Goossaert

I am a software engineer, with strong interests in distributed systems, storage systems, data structures and algorithms.

I work at in Amsterdam, Netherlands.

In my free time I hack on KingDB, a fast on-disk persistent key-value store.

Here is a selection of articles from my blog. For the full list, you can refer to the blog categories: Programming, Business, and Thoughts.
How to implement an algorithm from a scientific paper

Coding for Solid State Drives
Part 1: Introduction and Table of Contents
Part 2: Architecture of an SSD and Benchmarking
Part 3: Pages, Blocks, and the Flash Translation Layer
Part 4: Advanced Functionalities and Internal Parallelism
Part 5: Access Patterns and System Optimizations
Part 6: A Summary – What every programmer should know about solid-state drives

Implementing a Key-Value Store
Table of Content
Part 1: What are key-value stores, and why implement one?
Part 2: Using existing key-value stores as models
Part 3: Comparative Analysis of the Architectures of Kyoto Cabinet and LevelDB
Part 4: API Design
Part 5: Hash table implementations
Part 6: Open-Addressing Hash Tables
Part 7: Optimizing Data Structures for SSDs

Hashing algorithms
Cuckoo hashing
Hopscotch hashing
Robin Hood hashing
Robin Hood hashing: backward-shift deletion

KingDB is a fast persistent key-value store. More information:

FileFacade: Content classification and search
FileFacade is a web-based media center. It scans folders on a computer and automatically identify the movies corresponding to files. The prototype took 15 days to complete, and after testing the idea with users, I decided not to pursue any further.

Computer Vision
Face transformation: iPhone app and web app based on my previously developed AAM library. Learn more on my blog.
Hair Analysis and Detection: implementation of Computer Vision algorithms for the detection and analysis of hair in a picture. The resulting library is available for licencing. Learn more on my blog
Active Appearance Models: implementation of AAMs using OpenCV and C++ to track facial features in a photo.
Content Aggregation
Automatic content aggregation using Natural Language Processing algorithms, and generation of structured websites based on aggregated data.

Window management software that increases the usable screen space on netbooks running Microsoft Windows.

Digital Rights Management (DRM) platform for the payment and activation of licenses of applications on Microsoft Windows.

Extra-light API to enable Hadoop to use Python methods (now deprecated).

Automatic generation of in-code documentation block templates for Python

SciANN (Google Summer of Code 2009)
Gateway generator to automatically parse C-based library and generate gateways for the SciLab computation platform.

Art class project using Atmel Mega8 microcontrollers. The devices exchange information through infrared, and merge color states when coming closer.

Machine Learning: NPY
Python implementation of a back propagation artificial neural network. The goal was not performance but high flexibility.

Machine Learning: Precipitation Classification

Machine Learning class project. Implemented decision tree classifier in Java. Ranked #1 student team at the AMS Artificial Intelligence Competition 2008.

Map Matching
Implemented algorithms to match GPS points to vectorized road maps and determine the path of a vehicle, as part of an internship. The code is still in use at the company as of today.

Created C++ libraries to build abstract syntax trees and perform semantic analysis for a simplified Pascal language.

PiSE, Php Intersnet Site Engine is a CMS coded with PHP. I used PiSE to power the website of a small non-profit organization, which stayed up online for a year.

Audio Streaming
First year project at the ISIMA College. Used multiple buffering to multicast MP3 songs from a server to a host.

Raw socket
Various kernel and raw socket tools: packet forger, stealth file transfer through ICMP, FTP password sniffer, anti-rootkit module. Robin Hood hashing — backward shift deletion Robin Hood hashing — backward shift deletion

3. Experimental protocol

In order to test the effect of backward shift deletion on performance, I am going to use the same test cases that I used in my previous article about Robin Hood hashing [1]. The only difference is that since I observed that there was no difference between the tables of size 10k and 100k, this time I am only plotting the results for tables of size 10k. For more details regarding the test cases, take a look that the “Experiment protocol” section in the my previous article [1].

4. Results

The results are presented in the four figures below, each figure showing the same statistic across the different test cases:
Figure 2 is the mean DIB
Figure 3 is the median of DIB
Figure 4 is the 95th percentile of DIB
Figure 5 is the variance of DIB
Each of these figures is holding sub-figures, each for a different test case:
(a) is the “batch” test case, with LFM=0.8 and LFR=0.1
(b) is the “batch” test case, with LFM=0.8 and LFR=0.8
(c) is the “ripple” test case, with LFM=0.8 and LFR=0.1
(d) is the “loading” test case
The mean DIBs are adjusted for graphs of the the Robin Hood with tombstones, in such a way that the minimum DIB is shifted to zero, and this in order to make a fair comparison with the other algorithms. Indeed, because the implementation of Robin Hood hashing with tombstones is considering only probes between the minimum and maximum DIBs, the probing of an item never starts at DIB 0 but at the minimum DIB. The graphs for Robin Hood hashing with backward shift deletion and the graphs for basic linear probing are not shifted down.

5. Discussion

In the results presented above, it is clear that Robin Hood hashing with backward shift deletion outperforms both basic linear probing and Robin Hood hashing with tombstones. In addition, the mean DIB and variance of DIB are constant, and this even after a large number of insertion and deletion operations, which is consistent with the results presented in the thesis of Celis [4].
The most striking results are is Figure 4, where the 95th percentile for Robin Hood hashing with backward shift deletion remains at a value of around 7, which proves that even in the worst cases, the number of probes to find an entry will be very small.

6. Conclusion

The algorithm I had implemented in my first version of Robin Hood hashing using tombstones [1] was the one described by the pseudo-code of the original thesis [4]. Yet, I was unable to reproduce the results presented in that same thesis. The reason is that the results were describing the theoretical performance based on the mathematical model. And if the math were right, the given pseudo-code was doing it wrong. Thanks to Paul Khuong and his suggestion of using a backward shift on deletion, the practical results now match the theoretical results.
An interesting follow-up would be to experiment to see how costly the backward shift is for the CPU. With these new results, my feeling is now that Robin Hood hashing with backward shift deletion is definitely an interesting algorithm, and given its very linear memory access pattern, it would be worth investigating further for an on-disk key-value store.