“We need to find a path forward for life after Moore’s Law,” Nvidia CEO Jen-Hsun Huang said at the beginning of his annual GPU Technology Conference keynote. But Nvidia isn’t hesitant to throw more iron at the problem to make its ferocious graphics processors even more so, as evidenced by the reveal of the Tesla V100, the first product based on Nvidia’s badass next-gen Volta GPU.
This beastly GPU, both in size and capabilities, boasts a whopping 21 billion transistors, with 5,120 CUDA cores humming along at 1,455MHz boost clock speeds in the Tesla V100, all built on a 12-nanometer manufacturing process more advanced than the 16nm process behind Nvidia’s current GPUs. By comparison, today’s Pascal flagship, the 16nm Tesla P100, is built on a GP100 die with 3,840 CUDA cores (3,584 of them enabled) and 15 billion transistors, while the GeForce GTX 1060 has a quarter as many CUDA cores as the Tesla V100, at 1,280. The full GV100 die actually packs 5,376 CUDA cores, each capable of executing both FP32 and INT32 instructions, alongside 2,688 FP64 (double-precision) cores, 672 Tensor Cores, and 336 texture units; the Tesla V100 ships with a slightly cut-down configuration of that silicon.
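Those core counts map to peak throughput in a simple way: each CUDA core can retire one fused multiply-add (two floating-point ops) per clock, so peak TFLOPS is roughly cores × 2 × clock speed. A minimal sketch of the math, assuming the shipping Tesla P100 enables 3,584 cores and boosts to 1,480MHz (figures not quoted in this article):

```python
# Rough peak-throughput estimate: cores x 2 ops/clock (one FMA) x clock.
def peak_tflops(cores: int, clock_mhz: float, ops_per_clock: int = 2) -> float:
    """Return estimated peak TFLOPS for a GPU at a given boost clock."""
    return cores * ops_per_clock * clock_mhz * 1e6 / 1e12

v100_fp32 = peak_tflops(5120, 1455)   # Tesla V100: 5,120 FP32 cores at 1,455MHz
p100_fp32 = peak_tflops(3584, 1480)   # Tesla P100: 3,584 enabled cores at 1,480MHz
print(f"V100 FP32: {v100_fp32:.1f} TFLOPS")  # ~14.9 TFLOPS
print(f"P100 FP32: {p100_fp32:.1f} TFLOPS")  # ~10.6 TFLOPS
```

That ~15 TFLOPS FP32 figure for the V100 (and half that for FP64) lines up with the generational jump Nvidia is claiming over the P100.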
Volta is “at the limits of photolithography,” Huang said with a smirk, created using an R&D budget of over $3 billion.
Nvidia says it’s redesigned Volta’s streaming multiprocessor (SM) architecture to be 50 percent more energy efficient than Pascal’s, which is damned impressive if it proves true. That enables “major boosts in FP32 and FP64 performance in the same power envelope,” Nvidia says. The Tesla V100 also includes new “Tensor Cores” built specifically for deep learning, providing 12 times the teraflops throughput of the Pascal-based Tesla P100 for training, Huang said.
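The 12x figure follows from how a Tensor Core works: each one performs a 4×4×4 matrix multiply-accumulate per clock, which is 64 fused multiply-adds, or 128 floating-point operations. A back-of-the-envelope sketch, assuming the V100’s 640 enabled Tensor Cores, the 1,455MHz boost clock quoted above, and the P100’s ~10.6 TFLOPS FP32 peak as the baseline for Huang’s comparison:

```python
# Each Tensor Core performs a 4x4x4 matrix multiply-accumulate per clock:
# 4*4*4 = 64 fused multiply-adds = 128 floating-point operations.
OPS_PER_TENSOR_CORE = 4 * 4 * 4 * 2

def tensor_tflops(tensor_cores: int, clock_mhz: float) -> float:
    """Estimated peak mixed-precision TFLOPS from the Tensor Cores alone."""
    return tensor_cores * OPS_PER_TENSOR_CORE * clock_mhz * 1e6 / 1e12

v100_tensor = tensor_tflops(640, 1455)
print(f"V100 Tensor: {v100_tensor:.0f} TFLOPS")  # ~119 TFLOPS
# Against the Tesla P100's ~10.6 TFLOPS FP32 peak, that is roughly the
# "12 times the throughput" Huang cited for deep learning training.
print(f"Speedup vs P100 FP32: {v100_tensor / 10.6:.1f}x")  # ~11.2x
```

The exact multiple shifts with clock speed, but the order of magnitude comes straight from packing a whole matrix operation into each Tensor Core clock.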
The Tesla V100 utilizes 16GB of ultra-fast, 4,096-bit high-bandwidth memory (HBM2) to process data quickly. It’s unknown whether Volta-based consumer graphics cards will feature HBM2, however. Radeon Vega does, but the tech is still relatively new and pricey. The GeForce GTX 10-series graphics cards debuted with new GDDR5X technology based on classic memory designs, and SK Hynix recently said that it’s “planning to mass produce the product for a client to release high-end graphics card by early 2018 equipped with high performance GDDR6 DRAMs.”
That HBM2 memory hits 900GB/s speeds, Nvidia says, and the Tesla V100 features a second-gen version of Nvidia’s NVLink technology. At 300GB/s transfer speeds, Huang claims NVLink is now 10 times faster than standard PCIe connections.
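Both of those headline numbers can be sanity-checked from the per-part figures. HBM2 bandwidth is just bus width times per-pin data rate, and NVLink’s 300GB/s is the bidirectional aggregate of six 25GB/s links; the PCIe 3.0 x16 baseline (~15.8GB/s per direction) is my assumption for what Huang’s “10 times faster” compares against:

```python
# HBM2: bandwidth (GB/s) = bus width (bits) / 8 * per-pin rate (GT/s).
# Solving for the per-pin rate implied by 900 GB/s on V100's 4096-bit bus:
hbm2_pin_rate = 900 / (4096 / 8)
print(f"Implied HBM2 pin rate: {hbm2_pin_rate:.2f} GT/s")  # ~1.76 GT/s

# NVLink 2.0 on GV100: up to 6 links, each 25 GB/s per direction.
nvlink_total = 6 * 25 * 2                        # bidirectional aggregate
print(f"NVLink aggregate: {nvlink_total} GB/s")  # 300 GB/s

# PCIe 3.0 x16 baseline: 16 lanes at 8 GT/s with 128b/130b encoding.
pcie_per_dir = 16 * 8 * (128 / 130) / 8          # ~15.75 GB/s per direction
print(f"NVLink vs PCIe 3.0 x16: {nvlink_total / (pcie_per_dir * 2):.1f}x")  # ~9.5x
```

The ~9.5x result lands close to the “10 times faster” Huang quoted; the exact multiple depends on whether raw or effective PCIe rates are used in the comparison.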
NVIDIA Volta GV100 GPU Key Features:
Key compute features of the NVIDIA Volta GV100 based Tesla V100 include the following:
- New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning: Volta features a major new redesign of the SM processor architecture at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous-generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPS for training. With independent, parallel integer and floating-point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 data cache and shared memory subsystem significantly improves performance while also simplifying programming.
- Second-Generation NVLink: The second generation of NVIDIA’s NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to six NVLink links, each at 25 GB/s per direction, for a total of 300 GB/s of bidirectional bandwidth. NVLink now supports CPU mastering and cache coherence capabilities with IBM Power 9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
- HBM2 Memory (Faster, Higher Efficiency): Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec of peak memory bandwidth. The combination of a new generation of HBM2 memory from Samsung and a new generation of memory controller in Volta provides 1.5x the delivered memory bandwidth of Pascal GP100, with greater than 95% memory bandwidth efficiency on many workloads.
- Volta Multi-Process Service: Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture that provides hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients, from 16 on Pascal to 48 on Volta.
- Enhanced Unified Memory and Address Translation Services: Unified Memory technology in Volta GV100 includes new access counters that allow more accurate migration of memory pages to the processor that accesses them most frequently, improving efficiency for memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU’s page tables directly.
- Cooperative Groups and New Cooperative Launch APIs: Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads communicate, helping them write richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that enable synchronization amongst CUDA thread blocks, and Volta adds support for new synchronization patterns.
- Maximum Performance and Maximum Efficiency Modes: In Maximum Performance mode, the Tesla V100 accelerator operates unconstrained up to its TDP (thermal design power) level of 300W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency mode lets data center managers tune the power usage of their Tesla V100 accelerators for optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically while still delivering excellent rack performance.
- Volta-Optimized Software: New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU-accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and high-performance computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features for even easier programmability.
NVIDIA has stated that the Tesla V100, based on the Volta GV100 GPU, will start shipping in 2017. With availability expected in the second half of 2017, consumer variants could be ready for launch in early 2018.