birs.ca: High Performance Computing in Multibody Dynamics

HIGH PERFORMANCE COMPUTING IN MULTIBODY DYNAMICS
Dan Negrut
Vilas Associate Professor
Nvidia CUDA Fellow
University of Wisconsin-Madison
People Whose Work/Ideas Shaped This Presentation
• Current and past close collaborators:
• Radu Serban, University of Wisconsin-Madison
• Alessandro Tasora, University of Parma, Italy
• Mihai Anitescu, University of Chicago, Argonne National Lab
• Students from University of Wisconsin-Madison, listed alphabetically:
• Omkar Deshmukh, Toby Heyn, Hammad Mazhar, Daniel Melanz, Arman Pazouki, Andrew Seidl
University of Wisconsin 2
Acknowledgements: Funding Sources
• National Science Foundation
• US Army
• NVIDIA
• Caterpillar
• Simertis GmbH
• MSC.Software
• FunctionBay, S. Korea
University of Wisconsin 3
About this presentation
• Summary of my understanding of the advanced computing topic
• Title of talk somewhat misleading
• High Performance Computing (HPC) is oftentimes associated with big supercomputers
• This talk: what one can do to speed up a simulation in multibody dynamics
• Supercomputers can sometimes be an option
University of Wisconsin 4
Multibody Dynamics: Commercial Solution
University of Wisconsin 5
Multibody Dynamics: Commercial Solution
University of Wisconsin 6
[Q] Why am I interested in this?
[A] Large dynamics problems: e.g., terrain simulation
• How does the Rover move along a slope covered with granular material?
• What wheel geometry is more effective?
• How much power is needed to move it?
• At what grade will it get stuck?
• And so on…
University of Wisconsin 7
On wigs, ties, and t-shirts
University of Wisconsin 8
On Computing Speeds: The Price of 1 Mflop/second
• 1961:
• Combine 17 million IBM-1620 computers
• At $64K apiece, when adjusted for inflation, this would cost $7 trillion
• 2000:
• About $1,000
• 2013:
• Less than 20 cents out of the value of a workstation
University of Wisconsin 9
More Good News: Chip Feature Length & Consequences
• Moore’s law at work
• 2013 – 22 nm
• 2015 – 14 nm
• 2017 – 10 nm
• 2019 – 7 nm
• 2021 – 5 nm
• 2023 – ??? (carbon nanotubes?)
• More transistors = more computational units
• October 2013:
• Intel Xeon w/ 12 cores – 3 billion transistors
• Projecting ahead, estimates:
• 2015 – 24 cores
• 2017 – 32 cores
• 2019 – 64 cores
• 2021 – 124 cores
University of Wisconsin 10
Frictional Contact Simulation
[Commercial Software Simulation – 2007]
• Model Parameters:
• Spheres: 60 mm diameter and mass 0.882 kg
• Penalty Approach: stiffness of 1E5, force exponent of 2.2, damping coefficient of 10.0
• Simulation length: 3 seconds
University of Wisconsin 11
Frictional Contact Simulation
[Commercial Software Simulation – 2013]
• Same problem tested in 2013
• Simulation time reduced by a factor of six
• Simulation times still prohibitively long
• Conclusion: faster computers alone do not change the bottom line
University of Wisconsin 12
Should you decide it is time to look for the exit…
• The first of the two important points of this talk is this:
• Performing math operations is basically free
• Procuring the operands is very expensive
• In terms of energy
• In terms of time
• Corollary: a program that leverages spatial or temporal locality in data accesses is a fast program
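To make the corollary concrete, here is a minimal sketch (a hypothetical example, not from the talk): both functions below perform exactly the same additions, yet the row-major traversal streams through memory with unit stride and typically runs several times faster than the column-major one, which strides N doubles between consecutive accesses.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

/* Sum all entries of an N x N matrix stored in row-major order. */
static double sum_row_major(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)        /* walk each row... */
        for (int j = 0; j < N; j++)    /* ...contiguously in memory */
            s += a[i * N + j];
    return s;
}

static double sum_col_major(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)        /* walk each column... */
        for (int i = 0; i < N; i++)    /* ...striding N doubles per access */
            s += a[i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof(double));
    for (size_t k = 0; k < (size_t)N * N; k++) a[k] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_row_major(a);
    clock_t t1 = clock();
    double s2 = sum_col_major(a);
    clock_t t2 = clock();

    printf("row-major: %.3f s  col-major: %.3f s  (sums %g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a);
    return 0;
}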
University of Wisconsin 13
NVIDIA’s Fermi GPU Architecture: Quick Facts
• Lots of ALUs (green in the die diagram), relatively little control unit (CU) logic (orange)
• Explains why GPUs are fast for high arithmetic intensity applications
• Arithmetic intensity: high when many operations are performed per word of memory moved
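• A rough worked example (numbers mine, not from the slide): multiplying two n x n matrices performs about 2n^3 operations while touching 3n^2 words, so arithmetic intensity grows with n; adding two vectors performs n operations on 3n words, an intensity below one, which leaves performance limited by memory bandwidth rather than by the ALUs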
University of Wisconsin 14
The Fermi GPU Architecture
• Late 2009, early 2010
• 40 nm technology
• Three billion transistors
• 512 Scalar Processors (SP, “shaders”)
• 64 KB L1 cache
• 768 KB L2 uniform cache (shared by all SMs)
• Up to 6 GB of global memory
• Operates at several clock rates
• Memory
• Scheduler
• Execution unit
• High memory bandwidth
• Close to 200 GB/s
University of Wisconsin 15
Fermi: cost of arithmetic vs. cost of memory access
• Fact 1: an SM performs 32 single-precision FMA (fused multiply-add) operations in one clock cycle
• Fact 2: one global memory request takes 400-600 cycles to service unless it hits the L1 or L2 cache
• Conclusions
1. Hundreds of times more expensive to bring data into SM than to compute something with it
2. Arithmetic is free, bringing data over is expensive
University of Wisconsin 16
GPU Computing Requires High Bandwidth
• Required bandwidth:
• 32 (SPs) x 3 (operands) x 4 (bytes) x 1125 MHz x 15 (SMs) = 6,480 GB/s
• Available bandwidth:
• About 200 GB/s, i.e., more than 30 times less than required
• Two things save the day
• Caching
• High arithmetic intensity algorithms
University of Wisconsin 17
My Lamborghini can drive at 250 mph
[I drive it to get groceries]
University of Wisconsin 18
My Lamborghini can drive at 250 mph
[I drive it to get groceries]
University of Wisconsin 19
Bandwidth in a CPU-GPU System
[Robert Strzodka, Max Planck Institute, Germany]

[Figure: bandwidths in a CPU-GPU system, drawn with line widths proportional to bandwidth; the connection to the GPU is labeled 1-8 GB/s]
University of Wisconsin 20
Intel Haswell
• Released in June 2013
• 22 nm technology
• Transistor budget: 1.4 billion
• Tri-gate, 3D transistors
• Typically comes in four cores
• Has an integrated GPU
• Deep pipeline – 16 stages
• Strong Instruction Level Parallelism (ILP) support
• Superscalar
• Supports HTT (hyper-threading technology)
University of Wisconsin 21
Haswell Layout: “System on Chip” (SoC) Paradigm
Three clocks:
• A core’s clock ticks at 2.7 to 3.0 GHz, adjustable up to 3.7-3.9 GHz
• The graphics processor ticks at 400 MHz, adjustable up to 1.3 GHz
• The ring bus and the shared L3 cache tick at a frequency close, but not necessarily identical, to that of the cores
[Intel]
University of Wisconsin 22
Caches
• Data:
• L1 – 32 KB per core
• L2 – 512 KB or 1024 KB per core
• L3 – 8 MB per CPU
• Instruction:
• L0 – storage for about 1500 microoperations (uops) per core
• L1 – 32 KB per core
University of Wisconsin 23
Running fast on one workstation
• Several options
• Vectorization using AVX
• Leverage multiple cores, up to 16 per CPU (e.g., AMD Opteron 6274)
• Have four CPUs on a node (64 cores)
• Intel Xeon Phi (60 cores)
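As a concrete illustration of the first two options above, a minimal sketch (hypothetical code, not from the talk): one OpenMP pragma spreads the loop over the cores, and the unit-stride body is left for the compiler to vectorize with AVX (e.g., compiled with -O3 -mavx -fopenmp).

#include <stddef.h>

/* y = a*x + y, parallelized across cores with a single OpenMP pragma;
   the simple unit-stride body is a good target for AVX auto-vectorization */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] = a * x[i] + y[i];
}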
University of Wisconsin 24
Intel Xeon E5-2690 v2 Ivy Bridge-EP 3.0GHz 25MB L3 Cache 10-Core
Tesla K20X (Kepler architecture)
Intel® Xeon Phi™ Coprocessor 5110P (8GB, 1.053 GHz, 60 core)
How can you drive this HW?
• GPU:
• CUDA (NVIDIA proprietary software ecosystem, freely distributed)
• OpenCL (standard supporting parallel computing in hardware-agnostic fashion)
• x86
• pthreads
• OpenMP
• MPI
University of Wisconsin 25
Reflections on picking the winning horse
• The hardware is complex
• The compiler flags are numerous
• The problems can be formulated in so many ways
• The second point of this talk:
• The proof of the pudding is in the eating
University of Wisconsin 26
Next two slides:
Intel MKL – bewildering results at times
“In our factory, we make lipstick. In our Advertising, we sell hope.”
Charles Revson, cosmetics magnate, Revlon
University of Wisconsin 27
DGEMM : C = alpha*A*B + beta*C using Intel MKL 11.1
(alpha = 1, beta = 0)
A(2000×200), B(200×1000)
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
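For reference, the benchmarked operation boils down to a single MKL call; a minimal sketch (my code, with the dimensions from the slide and row-major storage assumed):

#include <mkl.h>  /* Intel MKL CBLAS interface */

void run_dgemm(const double *A, const double *B, double *C)
{
    const int m = 2000, k = 200, n = 1000;   /* A is m x k, B is k x n */
    const double alpha = 1.0, beta = 0.0;
    /* C = alpha*A*B + beta*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
}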
University of Wisconsin 28
DGEMM on Intel Xeon Phi (60 Cores with 512-bit SIMD vector registers)
Intel’s MKL 11.1 – used in native mode
University of Wisconsin 29
Next slides:
Picking the winning horse is not obvious
University of Wisconsin 30
2D Convolution
for (int yOut = 0; yOut < nHeight; yOut++) { // Image Y-dimension
  const int yInTopLeft = yOut;
  for (int xOut = 0; xOut < nWidth; xOut++) { // Image X-dimension
    const int xInTopLeft = xOut;
    float sum = 0.f;
    for (int r = 0; r < nFilterWidth; r++) { // Kernel Y-dimension
      const int idxFtmp = r * nFilterWidth;
      const int yIn = yInTopLeft + r;
      const int idxIntmp = yIn * nInWidth + xInTopLeft;
      for (int c = 0; c < nFilterWidth; c++) { // Kernel X-dimension
        const int idxF = idxFtmp + c;
        const int idxIn = idxIntmp + c;
        sum += pFilter[idxF]*pInput[idxIn];
      }
    }
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
  }
}
University of Wisconsin 31
2D Convolution
[Image size: 8192 x 8192]
• Intel® Core™ i5-3337U, 3M Cache, 1.90 GHz
• Dual core chip, supports 256-bit AVX capable of 8 FMA per clock-cycle
• Middle of the road laptop, single socket configuration
• Used for AVX acceleration; implementation by Omkar Deshmukh
• Vectorization with OpenCL built-in float4 data type
• AMD Opteron™ 6274, 2MB, 2.2 GHz
• Has 16 physical cores
• Used with OpenMP 4/16/64 Threads
University of Wisconsin 32
Vectorization with AVX
• AVX: Advanced Vector Extensions – difficult to program by hand, and compilers provide only limited automatic vectorization
• Examples of AVX intrinsics
• Create and initialize variables that map to AVX registers
__m256 prod __attribute__ ((aligned (32))) = _mm256_set1_ps(0.0f);
• Carry out AVX multiplication using 256 bit wide registers
prod = _mm256_mul_ps(data, kernel);
• The multiplication eventually maps to a single assembly instruction
c5 f4 59 c0 vmulps ymm0,ymm1,ymm0
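Putting the intrinsics above together, a minimal sketch (my code, not the implementation behind the results that follow) of an 8-wide single-precision multiply-accumulate loop, the building block of the convolution inner loop; n is assumed to be a multiple of 8 and the pointers 32-byte aligned:

#include <immintrin.h>

float dot_avx(const float *a, const float *b, int n)
{
    __m256 acc = _mm256_set1_ps(0.0f);       /* 8 running partial sums */
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   /* aligned 256-bit loads */
        __m256 vb = _mm256_load_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float tmp[8] __attribute__ ((aligned (32)));
    _mm256_store_ps(tmp, acc);               /* spill and reduce the 8 lanes */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}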
University of Wisconsin 33
2D Convolution
[Image size: 8192 x 8192]
University of Wisconsin 34
2D Convolution: Scaling w/ image size
[Kernel Size: 8 x 8]
University of Wisconsin 35
2D Convolution
GPU Results – Tesla K20x, 5GB, 0.73 GHz
University of Wisconsin 36
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
Reference: one core of Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Next slides: pthreads is not it
University of Wisconsin 37
Stencil-Type Operation
• Five-point stencil: average grid[j][i] together with its four neighbors grid[j±1][i] and grid[j][i±1]
• Hardware Setup: Dual socket, 4-core Intel Xeon CPUs with hyper-threading
• Shows up as 16 virtual cores
• Software approaches
• Pthreads – do-it-yourself and low-level: you handle barriers, locking, and synchronization yourself
• OpenMP – essentially a single line of code (one pragma; see the sketch after the listing below)
• MPI – middle ground; uses processes that communicate with each other via messages
University of Wisconsin 38
// One pass of the averaging sweep; grid, xdim, ydim, and timesteps are set up
// by the caller, and Barrier() is the hand-rolled synchronization point shared
// by the worker threads.
for (int t=0; t < timesteps/2; t++) {
  int flag = 1;
  int i, j;
  // First half-sweep: update one color of the checkerboard
  for (j=1; j < ydim-1; j++) {
    for (i=flag; i < xdim-1; i+=2) {
      grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
    }
    flag = (flag == 1 ? 2 : 1);
  }
  // Second half-sweep: update the other color
  flag = 2;
  for (j=1; j < ydim-1; j++) {
    for (i=flag; i < xdim-1; i+=2) {
      grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
    }
    flag = (flag == 1 ? 2 : 1);
  }
  Barrier();  // wait for all threads before starting the next iteration
}
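For comparison, a sketch of the "single line of code" OpenMP route (my code, not necessarily what produced the timings), reusing grid, xdim, and ydim from the listing above; because each half-sweep touches only one color of the checkerboard, the rows are independent and the pragma's implicit barrier replaces the hand-rolled Barrier():

/* one checkerboard half-sweep, parallelized over rows */
#pragma omp parallel for
for (int j = 1; j < ydim-1; j++) {
    int start = (j % 2 == 1) ? 1 : 2;   /* alternate the starting column per row */
    for (int i = start; i < xdim-1; i += 2) {
        grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
    }
}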
University of Wisconsin 39
University of Wisconsin 40
University of Wisconsin 41
Accelerators
Multicore chips
Many-node configurations
University of Wisconsin 42
Our Lab’s Cluster
University of Wisconsin 43
Lab’s Cluster
• More than 1,200 CPU cores
• Mellanox Infiniband Interconnect (QDR), 40Gb/sec
• Memory: about 2.7 TB of RAM
• More than 10 TFlops Double Precision out of x86 hardware
• 60 GPU cards (K40, K20, GTX480) – more than 15 Tflops
• BTW: you can get an account if interested
University of Wisconsin 44
Rover Simulation
University of Wisconsin 45
Distributed Computing Dynamics: Is it worth it?
• The cons
• Communication is the major bottleneck – latencies are high; bandwidth is OK for dynamics
• Software design is complex
• The pros
• Access to lots of memory
• Can put more cores to work
• Conclusion
• Justifiable if one node is not large enough to store the entire problem
University of Wisconsin 46
Breaking Up Dynamics for Distributed Computing
• Simulation divided into chunks executed on different cores
• Elements leave one chunk (subdomain) to move to a different one
• Key issues:
• Dynamic load balancing
• Establish a dynamic data exchange (DDE) protocol to implement migration at run time (a minimal sketch follows)
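A minimal sketch of such a protocol between two neighboring subdomains (my code, not Chrono's actual implementation): the ranks first exchange how many bodies are migrating, then ship the packed body states.

#include <mpi.h>

#define BODY_DOUBLES 6   /* per-body state shipped here: position + velocity */

/* Sends n_out migrating bodies (packed in 'outgoing') to 'neighbor' and
   receives that rank's migrants into 'incoming' (assumed large enough);
   returns the number of bodies received. */
int exchange_bodies(int neighbor, const double *outgoing, int n_out,
                    double *incoming, MPI_Comm comm)
{
    int n_in = 0;
    MPI_Status status;

    /* step 1: exchange counts so each side knows how much data to expect */
    MPI_Sendrecv(&n_out, 1, MPI_INT, neighbor, 0,
                 &n_in,  1, MPI_INT, neighbor, 0, comm, &status);

    /* step 2: exchange the packed body states */
    MPI_Sendrecv(outgoing, n_out * BODY_DOUBLES, MPI_DOUBLE, neighbor, 1,
                 incoming, n_in  * BODY_DOUBLES, MPI_DOUBLE, neighbor, 1,
                 comm, &status);
    return n_in;
}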
[Figure: partitioning of the simulation domain into subdomains v1-v5]
University of Wisconsin 47
Computation Using Multiple CPUs & MPI
University of Wisconsin 48
0.5 Million Bodies on 64 Cores
[Penalty Approach, MPI-based]
University of Wisconsin 49
Project Chrono: Using HPC in Multibody Dynamics
• Open source, distributed under BSD-3 license
• 100,000 lines of code
https://github.com/projectchrono/chrono
University of Wisconsin 50
Tracked Vehicle Simulation – on GPU
University of Wisconsin 51
Chrono::Fluid
University of Wisconsin 52
Chrono::Flex [GPU+CPU]
University of Wisconsin 53
Additive Manufacturing (3D Printing)
Selective Laser Sintering (SLS) machine
University of Wisconsin 54
Additive Manufacturing (3D Printing)
University of Wisconsin 55
Surface Roughness: Before and After Leveling
University of Wisconsin 56
Put things in perspective: parallel computing options
• Accelerators (GPU/CUDA, Intel Xeon/Phi)
• Up one level: multicore
• Up one more level: multi-node
University of Wisconsin 57
Wrapping Things Up
• We are in the midst of a democratization of parallel computing
• Many alternatives, landscape very fluid
• Lots of hardware options: GPU, CPU (Intel/AMD), Phi, clusters, FPGAs, etc
• Lots of software ecosystems: CUDA, OpenCL, OpenMP, OpenACC, MPI, etc.
• Parallel computing can be a game changer
• Can pursue new physics or new problems
• Provides impetus for the development of new algorithms that expose more parallelism
University of Wisconsin 58
Two pieces of practical advice
• Moving data around is hurting performance badly
• Fixing this calls for rethinking the implementation and maybe even the numerical algorithm
• The landscape is fluid and the technology is changing fast – picking the best solution is a bit of a guessing game
• The proof of the pudding is in the eating
• “80% of success is showing up” (Woody Allen, mentioned at breakfast by David Stewart)
University of Wisconsin 59
Looking ahead
• Integration of the CPU and GPU – the “system on a chip” (SoC) paradigm takes over
• Intel
• Haswell, yet no clear strategy for CPU-GPU integration
• NVIDIA
• Maxwell and the Denver project – CUDA 6 introduces a unified CPU-GPU memory space
• AMD
• Kaveri chip, Heterogeneous System Architecture (HSA) – pushing OpenCL to leverage the architecture
University of Wisconsin 60
Thank You.
negrut@wisc.edu
Simulation Based Engineering Lab
Wisconsin Applied Computing Center
University of Wisconsin-Madison
University of Wisconsin 61
Next two slides – Intel MKL:
What you get is not quite what's advertised
“The very first law in advertising is to avoid the concrete promise and
cultivate the delightfully vague.”~ Bill Cosby
University of Wisconsin 62
