Openwall HPC Village

HPC Village from Openwall is an opportunity for HPC (High Performance Computing) hobbyists and professionals alike to program for a heterogeneous (hybrid) HPC platform. Participants are provided with remote access (via the SSH protocol) to a server with multi-core CPUs and HPC accelerator cards of different kinds – Intel MIC (Xeon Phi), AMD GPU, NVIDIA GPU – as well as with pre-installed and configured drivers and development tools (SDKs).

We provide, within one machine, access to the four types of computing devices mentioned above, including OpenCL support for all of them, as well as support for development tools and usage models specific to some of them (OpenMP on CPU, OpenMP offload from CPU to MIC, CUDA on NVIDIA GPU). Although it is uncommon to use more than two types of computing devices within one node in real-world HPC setups, such a configuration is convenient for getting acquainted with the different technologies, for trying out and comparing them on specific tasks, and for developing portable software (including debugging and optimization).


The current hardware configuration is as follows:

Supermicro GPU SuperWorkstation 7047GR-TPRF workstation/server platform with MCP-290-00059-0B rackmount rail set
4U chassis
Two 1620W PSUs
Dual socket 2011 motherboard with IPMI, 16 memory sockets, four PCIe 3.0 x16 slots for full-length dual-width PCIe cards and a fifth slot for a shorter card
A full set of cooling fans, including those pulling hot air out of passively-cooled accelerator cards
Two 8-core Intel Xeon E5-2670 CPUs
Sandy Bridge-EP; supports AVX and AES-NI
A total of 16 CPU cores seen as 32 logical CPUs (two hardware threads per core), at a clock rate of at least 2.6 GHz
Turbo boost to up to 3.0 GHz with all cores in use or 3.3 GHz with few cores in use
128 GB DDR3-1600 ECC RAM
8x 16 GB DDR3-1600 ECC Registered modules on 8 channels (4 channels per CPU)
Theoretical bandwidth 102.4 GB/s, actual measured bandwidth ~85 GB/s (cumulative from 32 threads)
Intel Xeon Phi 5110P coprocessor module
Intel Many Integrated Core (MIC) architecture, Knights Corner
60 cores (x86-ish with 512-bit SIMD units) seen as 240 logical CPUs (four hardware threads per core), 1053 MHz, 8 GB GDDR5 ECC RAM on a 512-bit bus, 320 GB/s
Peak performance of about 2 TFLOPS single-precision, 1 TFLOPS double-precision
AMD Radeon HD 7990 gaming graphics card
AMD GCN architecture
Two “Tahiti” GPUs, providing 2×2048 SPs, 6 GB GDDR5 RAM on two 384-bit buses, 576 GB/s
Custom core clock rates: 501 MHz for GPU0 (heavily underclocked), 997.5 MHz to 1050 MHz for GPU1 (almost the same as HD 7970 GE)
Peak performance of over 6 TFLOPS single-precision, about 1.5 TFLOPS double-precision
This is a budget replacement for the FirePro S10000 GPU card intended for servers (which would cost at least 3 times more, but would offer ECC RAM)
NVIDIA GTX TITAN gaming graphics card (Zotac GeForce GTX TITAN AMP! Edition)
NVIDIA Kepler architecture
One GK110 GPU with 2688 SPs at 902 MHz to 954 MHz in single-precision mode, 6 GB GDDR5 RAM on a 384-bit bus, 317.2 GB/s
Peak performance of over 5 TFLOPS single-precision, from 1.3 to 1.5 TFLOPS double-precision in the corresponding mode
This is a budget replacement for the TESLA K20X GPU card intended for workstations and servers (which would cost at least 3 times more and would run considerably slower at single-precision and integer code, but would offer ECC RAM)
NVIDIA GTX Titan X gaming graphics card (reference design, manufactured by Gigabyte)
NVIDIA Maxwell architecture
One GM200 GPU with 3072 SPs at 1000 MHz to 1076 MHz, 12 GB GDDR5 RAM on a 384-bit bus, 336 GB/s
Peak performance of over 6 TFLOPS single-precision, 0.2 TFLOPS double-precision
AMD Radeon HD 5750/6750 gaming graphics card marketed as “PowerColor Radeon HD 6770 Green Edition (AX6770 1GBD5-HV4)”, one half of an HD 5850
AMD TeraScale 2 (VLIW5) architecture
One Juniper PRO GPU with 720 SPs at 700 MHz, 1 GB GDDR5 RAM on a 128-bit bus, 73.6 GB/s
A short card that fits into this motherboard’s 5th dual-width PCIe slot
Not a high performance card, but usable for testing/benchmarking on the old VLIW5 architecture, such as to avoid performance regressions for users with older cards like this (HD 5000 and 6000 series up to and including 6870)
Peak performance of over 1 TFLOPS single-precision
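As a rough cross-check of the per-device figures above (and of the total that follows), peak single-precision throughput can be estimated as lanes × 2 FLOPs per cycle × clock. This is only a sketch: the lane counts and clocks are taken from the list above, the CPU line assumes 8-wide AVX with separate add and multiply pipes, and the HD 7990's GPU1 is approximated at 1.0 GHz.

```python
# Back-of-the-envelope peak single-precision GFLOPS for the devices listed above.
# peak ~= SP lanes x 2 FLOPs/cycle (FMA, or separate add+mul on CPU) x clock (GHz)

devices = {
    # name: (SP lanes, clock in GHz)
    "2x Xeon E5-2670 (16 cores x 8-wide AVX)": (16 * 8, 2.6),
    "Xeon Phi 5110P (60 cores x 16-wide SIMD)": (60 * 16, 1.053),
    "Radeon HD 7990 GPU0 (2048 SPs, underclocked)": (2048, 0.501),
    "Radeon HD 7990 GPU1 (2048 SPs)": (2048, 1.000),
    "GTX TITAN (2688 SPs)": (2688, 0.954),
    "GTX Titan X (3072 SPs)": (3072, 1.076),
    "Radeon HD 5750 (720 SPs)": (720, 0.700),
}

total = 0.0
for name, (lanes, ghz) in devices.items():
    gflops = lanes * 2 * ghz
    total += gflops
    print(f"{name}: {gflops:.0f} GFLOPS")

print(f"Total: {total / 1000:.1f} TFLOPS single-precision")
```

The sum comes out to roughly 21.6 TFLOPS, consistent with the "over 20 TFLOPS" total stated below.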
Total peak performance is over 20 TFLOPS single-precision, about 4 TFLOPS double-precision.

AmgX

AmgX provides a simple path to accelerated core solver technology on NVIDIA GPUs. AmgX provides up to 10x acceleration to the computationally intense linear solver portion of simulations, and is especially well suited for implicit unstructured methods.

It is a high-performance, state-of-the-art library and includes a flexible solver composition system that allows a user to easily construct complex nested solvers and preconditioners.

AmgX is available with a commercial and a free license. The free license is limited to CUDA Registered Developers and non-commercial use.

Key Features

  1. Flexible configuration allows for nested solvers, smoothers, and preconditioners
  2. Ruge-Stüben algebraic multigrid
  3. Unsmoothed aggregation algebraic multigrid
  4. Krylov methods: PCG, GMRES, BiCGStab, and flexible variants
  5. Smoothers: Block-Jacobi, Gauss-Seidel, incomplete LU, Polynomial, dense LU
  6. Scalar or coupled block systems
  7. MPI support
  8. OpenMP support
  9. Flexible and simple high level C API
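To illustrate the nested solver composition of feature 1, here is a hypothetical JSON configuration in the style AmgX uses, nesting an AMG preconditioner (with a Block-Jacobi smoother) inside a PCG Krylov solver. The key names and values are illustrative assumptions for this sketch, not copied from the AmgX documentation:

```json
{
  "config_version": 2,
  "solver": {
    "solver": "PCG",
    "max_iters": 100,
    "tolerance": 1e-06,
    "preconditioner": {
      "solver": "AMG",
      "algorithm": "AGGREGATION",
      "smoother": "BLOCK_JACOBI",
      "cycle": "V",
      "max_levels": 10
    }
  }
}
```

The point of the composition system is that the preconditioner block is itself a full solver description, so solvers, smoothers, and preconditioners can be nested to arbitrary depth.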

AmgX is free for non-commercial use and is available for download now for CUDA Registered Developers. As a registered developer you can download the latest version of AmgX, access the support forum, and file bug reports. If you have not yet registered, do so today.

Python Flux Reconstruction

PyFR is an open-source, Python-based framework for solving advection-diffusion type problems on streaming architectures using the Flux Reconstruction approach of Huynh. The framework is designed to solve a range of governing systems on mixed unstructured grids containing various element types. It is also designed to target a range of hardware platforms via an in-built domain-specific language derived from the Mako templating engine. The current release (PyFR 1.0.0) has the following capabilities:

Governing Equations – Euler, Navier–Stokes
Dimensionality – 2D, 3D
Element Types – Triangles, Quadrilaterals, Hexahedra, Prisms, Tetrahedra, Pyramids
Platforms – CPU Clusters, Nvidia GPU Clusters, AMD GPU Clusters
Spatial Discretisation – High-Order Flux Reconstruction
Temporal Discretisation – Explicit Runge-Kutta
Precision – Single, Double
Mesh Files Imported – Gmsh (.msh)
Solution Files Exported – Unstructured VTK (.vtu, .pvtu)

PyFR is being developed in the Vincent Lab, Department of Aeronautics, Imperial College London, UK.

Development of PyFR is supported by the Engineering and Physical Sciences Research Council, Innovate UK, the European Commission, BAE Systems, and Airbus. We are also grateful for hardware donations from Nvidia, Intel, and AMD.

PyFR 1.0.0 has a hard dependency on Python 3.3+ and the following Python packages:

h5py >= 2.5
mako >= 1.0.0
mpi4py >= 1.3
mpmath >= 0.18
numpy >= 1.8
pytools >= 2014.3
Note that due to a bug in numpy, PyFR is not compatible with 32-bit Python distributions.
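As a minimal sketch of how these minimum versions might be checked before installing PyFR, using only the Python standard library (this helper is not part of PyFR, and the parsing handles plain dotted numeric versions only):

```python
# Minimum dependency versions for PyFR 1.0.0, from the list above.
REQUIREMENTS = {
    "h5py": "2.5", "mako": "1.0.0", "mpi4py": "1.3",
    "mpmath": "0.18", "numpy": "1.8", "pytools": "2014.3",
}

def version_tuple(v):
    """'1.10.2' -> (1, 10, 2); only handles plain dotted numeric versions."""
    return tuple(int(part) for part in v.split("."))

def satisfies(installed, minimum):
    """True if the installed version meets the minimum (component-wise)."""
    return version_tuple(installed) >= version_tuple(minimum)

for pkg, minimum in REQUIREMENTS.items():
    try:
        installed = __import__(pkg).__version__
    except ImportError:
        print(f"{pkg}: not installed (need >= {minimum})")
        continue
    status = "ok" if satisfies(installed, minimum) else f"too old (need >= {minimum})"
    print(f"{pkg} {installed}: {status}")
```

Tuple comparison handles multi-digit components correctly (1.10 is newer than 1.3), which naive string comparison would get wrong.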

CUDA Backend
The CUDA backend targets NVIDIA GPUs with a compute capability of 2.0 or greater. The backend requires:

CUDA >= 4.2
pycuda >= 2011.2
OpenCL Backend
The OpenCL backend targets a range of accelerators including GPUs from AMD and NVIDIA. The backend requires:

pyopencl >= 2013.2
OpenMP Backend
The OpenMP backend targets multi-core CPUs. The backend requires:

GCC >= 4.7
A BLAS library compiled as a shared library (e.g. OpenBLAS)
Running in Parallel
To partition meshes for running in parallel it is also necessary to have one of the following partitioners installed:

metis >= 5.0
scotch >= 6.0

A CUDA Dynamic Parallelism Case Study, PANDA

Dynamic parallelism is better than host streams for three reasons:

  • Avoiding extra PCIe data transfers. The launch configuration of each subsequent kernel depends on results of the previous kernel, which are stored in device memory. For the dynamic parallelism version, reading these results requires just a single memory access, while for host streams a data transfer from device to host is needed.
  • Achieving higher launch throughput with dynamic parallelism, when compared to launching from host using multiple streams.
  • Reducing false dependencies between kernel launches. All host streams are served by a single thread, which means that while it waits on one stream, work cannot be queued into other streams, even if the data for the launch configuration is available. With dynamic parallelism, many thread blocks queue kernels asynchronously and in parallel.

The dynamic parallelism version is also more straightforward to implement, and leads to cleaner and more readable code. So in the case of PANDA, using dynamic parallelism improved both performance and productivity.

NVIDIA GPUDirect

Using GPUDirect, multiple GPUs, third-party network adapters, solid-state drives (SSDs), and other devices can directly read and write CUDA host and device memory. This eliminates unnecessary memory copies, dramatically lowers CPU overhead, and reduces latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla™ and Quadro™ products.

GPUDirect peer-to-peer transfers and memory access are supported natively by the CUDA Driver. All you need is CUDA Toolkit v4.0 and R270 drivers (or later) and a system with two or more Fermi- or Kepler-architecture GPUs on the same PCIe bus.

11th International Conference on Parallel Processing and Applied Mathematics in Krakow

September 6-9, 2015, Krakow, Poland

WLPP 2015 is a full-day workshop to be held at PPAM 2015, focusing on high-level programming for large-scale parallel systems and multicore processors, with special emphasis on component architectures and models. Its goal is to bring together researchers working in the areas of applications, computational models, language design, compilers, system architecture, and programming tools to discuss new developments in programming clouds and parallel systems. The workshop focuses on any language-based programming model, such as OpenMP, Intel TBB and Ct, Microsoft .NET 4.0 parallel extensions (TPL and PPL), Java parallel extensions, HPCS languages (Chapel, X10, and Fortress), Unified Parallel C (UPC), Co-Array Fortran (CAF), and GPGPU language-based programming models such as CUDA. Contributions on other high-level programming models and supportive environments for parallel and distributed systems are equally welcome.

Drop-in Acceleration of GNU Octave

A well-known trick to skip the time-consuming rebuilding step is to dynamically intercept and substitute relevant library symbols with high-performing analogs. On Linux systems, the LD_PRELOAD environment variable allows us to do exactly that. Now we will try OpenBLAS, built with OpenMP and Advanced Vector Extensions (AVX) support. Assuming the library is in LD_LIBRARY_PATH, ‘OMP_NUM_THREADS=20 octave ./sgemm.m‘ yields 765 GFLOPs. We observe a 2.5x speedup in SGEMM versus the dual-socket Ivy Bridge. But the overall speedup is 1.56x. The CPU fraction does not scale, since it is executed in a single thread, and NVBLAS always uses one CPU thread per GPU. The acceleration is still significant, but we start to be limited by Amdahl’s law.
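The closing observation can be made concrete with Amdahl's law: if a fraction p of the runtime is in the accelerated portion and that portion speeds up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The parallel fraction below is back-solved from the reported 2.5x and 1.56x figures; it is not stated in the original text.

```python
# Amdahl's law: overall speedup when a fraction p of runtime is accelerated by s.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# A 2.5x speedup of the SGEMM portion yielding a 1.56x overall speedup
# implies roughly 60% of the runtime was in SGEMM (assumed, back-solved).
s = 2.5
p = 0.6
print(f"overall speedup: {amdahl(p, s):.2f}x")  # ~1.56x

# Even with an infinite SGEMM speedup, the serial 40% caps the gain at 2.5x.
print(f"limit as s -> inf: {amdahl(p, 10**9):.2f}x")
```

This is why the text notes the acceleration is "limited by Amdahl's law": further speeding up SGEMM alone has rapidly diminishing returns.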

Karl Rupp: Computational semiconductor scientist, TU Wien

I’m a postdoctoral appointee at the Vienna University of Technology working on numerical methods and efficient implementations for semiconductor device simulations. I’m also working on improved accelerator support (including, but not limited to, GPUs) in PETSc.

Institute for Microelectronics
Gußhausstraße 27-29/E360
A-1040 Wien, Austria, Europe
Phone: 0043 1 58801 36027
Email: karl.rupp ( at )
Twitter: karlrupp
LinkedIn: My profile
Google+: My profile