Deep learning is one of the fastest-growing segments of the machine learning/artificial intelligence field and a key area of innovation in computing. With researchers creating new deep learning algorithms and industries producing and collecting unprecedented amounts of data, computational capability is the key to unlocking insights from data.

GPUs have brought tremendous value to deep learning research over the past couple of years. To continue innovating and to apply deep learning to our own goals, NVIDIA engineers built the world’s fastest deskside deep learning machine: the DIGITS DevBox. We’re making it easy to get started quickly with our Dev Program. US customers can apply to purchase directly from NVIDIA using the DevBox Access Program link below. If you want to build your own, and don’t mind troubleshooting and supporting the software image yourself, you can learn more by registering using the Build Your Own DevBox link.

The DIGITS DevBox combines the world’s best hardware, software, and systems engineering:

Four TITAN X GPUs, each with 7 TFLOPS of single-precision performance, 336.5 GB/s of memory bandwidth, and 12 GB of memory
NVIDIA DIGITS software providing powerful design, training, and visualization of deep neural networks for image classification
Pre-installed standard Ubuntu 14.04 w/ Caffe, Torch, Theano, BIDMach, cuDNN v2, and CUDA 7.0
A single deskside machine that plugs into a standard wall outlet, with superior PCIe topology

The end result is faster turnaround times for experiments, the freedom to explore multiple network architectures, and accelerated dataset manipulation, all in one powerful, energy-efficient, cool, and quiet solution that fits under your desk.

DIGITS DevBox includes:

Four TITAN X GPUs with 12GB of memory per GPU
Asus X99-E WS workstation class motherboard with 4-way PCI-E Gen3 x16 support
Core i7-5930K 6-core 3.5GHz desktop processor
Three 3TB SATA 6Gb/s 3.5” enterprise hard drives in RAID 5
512GB PCI-E M.2 SSD cache for RAID
250GB SATA 6Gb Internal SSD
1600W Power Supply Unit from premium suppliers including EVGA
Ubuntu 14.04
NVIDIA-qualified driver
NVIDIA® CUDA® Toolkit 7.0
Caffe, Theano, Torch, BIDMach

devblogs.nvidia.com: Drop-in Acceleration of GNU Octave

A well-known trick to skip the time-consuming rebuilding step is to dynamically intercept relevant library symbols and substitute them with high-performing analogs. On Linux systems, the LD_PRELOAD environment variable lets us do exactly that. Now we will try OpenBLAS, built with OpenMP and Advanced Vector Extensions (AVX) support. Assuming the library is in LD_LIBRARY_PATH, `OMP_NUM_THREADS=20 LD_PRELOAD=libopenblas.so octave ./sgemm.m` yields 765 GFLOP/s. We observe a 2.5X speedup in SGEMM versus the dual-socket Ivy Bridge baseline, but the overall speedup is 1.56X. The CPU fraction does not scale, since it executes in a single thread, and NVBLAS always uses one CPU thread per GPU. The acceleration is still significant, but we start to be limited by Amdahl’s law.
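The gap between the 2.5X SGEMM speedup and the 1.56X overall speedup is exactly what Amdahl's law predicts. A minimal sketch (the roughly 60% SGEMM fraction is an assumption inferred from the numbers above, not stated by the article):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated
    by a factor s; the remaining (1 - p) runs at its original speed."""
    return 1.0 / ((1.0 - p) + p / s)

# Assumed split: roughly 60% of the Octave run is spent in SGEMM.
# A 2.5x SGEMM speedup then caps the overall gain near 1.56x.
print(round(amdahl_speedup(0.6, 2.5), 2))  # → 1.56
```

No matter how fast the accelerated fraction becomes, the serial remainder bounds the total gain, which is why the single-threaded CPU portion dominates here.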

github.com: CUDA Simulation and GLSL Visualization PART 1: CUDA NBody Simulation

Features:

1. Use shared memory to accumulate the effects of all bodies, tile by tile.
2. A softening factor, described in http://www.scholarpedia.org/article/N-body_simulations_(gravitational), is added to the force calculation to prevent the force from exploding when two bodies get too close.

Performance Profiling: As expected, the speedup from the shared-memory computation is proportional to the number of bodies involved in force accumulation. GPU throughput is best when the work is compute-intensive.
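The tile-by-tile force accumulation with a softening factor can be sketched as a CPU analogue in NumPy (the function name, tile size, and eps value are illustrative, not from the repo; on the GPU each tile would be staged through shared memory instead of a slice):

```python
import numpy as np

def nbody_forces(pos, mass, G=1.0, eps=1e-2, tile=64):
    """Accumulate gravitational accelerations tile by tile, mirroring the
    shared-memory tiling of the CUDA kernel. eps is the softening factor
    that keeps the force finite when two bodies nearly overlap."""
    n = len(pos)
    acc = np.zeros_like(pos)
    for start in range(0, n, tile):            # one "tile" of source bodies
        p = pos[start:start + tile]
        m = mass[start:start + tile]
        d = p[None, :, :] - pos[:, None, :]    # pairwise displacements
        r2 = (d * d).sum(-1) + eps * eps       # softened squared distance
        inv_r3 = r2 ** -1.5
        acc += G * (d * (m * inv_r3)[:, :, None]).sum(axis=1)
    return acc
```

Note that the softening term also makes a body's interaction with itself harmless: the displacement is zero, so the self-contribution vanishes instead of dividing by zero.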

blogs.nvidia.com: GPUs Further Russia’s Supercomputing Efforts, Accelerate Its Fastest System

Overall, GPUs power three of the top 10 and nearly one-third of the top 50 systems on the list, which is issued twice yearly. It’s a remarkable stat considering that no GPU-accelerated systems were on the list just three years ago. Moscow State University’s Lomonosov supercomputer glows green with NVIDIA Tesla GPUs inside. Thanks to an upgrade with NVIDIA Tesla GPUs in 2011, Lomonosov claims its spot easily. It delivers 1.7 petaflops of peak performance, making it the fastest accelerator-based supercomputer not just in Russia but in all of Europe.

AnandTech – Inside the Titan Supercomputer: 299K AMD x86 Cores and 18.6K NVIDIA GPUs

Titan is the latest supercomputer to be deployed at Oak Ridge, although it’s technically a significant upgrade rather than a brand new installation. Jaguar, the supercomputer being upgraded, featured 18,688 compute nodes – each with a 12-core AMD Opteron CPU. Titan takes the Jaguar base, maintaining the same number of compute nodes, but moves to 16-core Opteron CPUs paired with an NVIDIA Kepler K20X GPU per node. The result is 18,688 CPUs and 18,688 GPUs, all networked together to make a supercomputer that should be capable of landing at or near the top of the TOP500 list.

David Luebke

Spring 2006: Real-Time Rendering & Game Technology [S04] [F02]

Fall 2005: Introduction to Computer Science

Spring 2005: Introduction to Computer Graphics [S03][S00][F99]

Fall 2004: Computer Graphics for Film Production

Spring 2003: Computer Science Seminar [S03] [S02]

Spring 2003: Interactive Ray Tracing

Spring 2002: Introduction to Algorithms [S00]

Fall 2001: 3-D Animation and Special Effects

Spring 2001: Advanced Computer Graphics [S99]

Spring 2001: Modern Research in Computer Graphics [F98]