Nvidia Tesla V100 with Volta GV100 GPU announced

by Mark Tyson on 10 May 2017, 21:35

Tags: NVIDIA (NASDAQ:NVDA)

Quick Link: HEXUS.net/qadhc2

A few hours ago at GTC 2017, Nvidia CEO Jensen Huang took the wraps off the Tesla V100 accelerator. The launch marks several milestones for Nvidia, not least the introduction of its first product based on a Volta architecture GPU. As those familiar with Nvidia's nomenclature will be aware, the Tesla branding means this first Volta product is an accelerator targeting complex problem solving; in headline terms, Nvidia calls the Tesla V100 an "AI computing and HPC powerhouse".

Following hot on the heels of its Q1 financials, which showed stellar performance in its datacentre business, Nvidia looks to be keeping up the pressure in this sector. It hasn't been afraid to invest, either, spending $3 billion on R&D to develop Volta.

The new Volta-based Nvidia Tesla V100 packs a significantly weightier punch than the Pascal-based Tesla P100, and for the first time Nvidia is making performance comparisons in Peak Tensor Core TFLOP/s (I'm not sure whether this measurement is analogous to the TOPS figure Google uses to describe the performance of its Tensor Processing Unit).

Peak computation rates (based on GPU Boost clock rate) are:

  • 7.5 TFLOP/s of double precision floating-point (FP64) performance;
  • 15 TFLOP/s of single precision (FP32) performance;
  • 120 Tensor TFLOP/s of mixed-precision matrix-multiply-and-accumulate.
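
Those headline figures follow directly from the core counts and the 1455MHz boost clock listed in the specification table further down. As a rough sanity check, here is the arithmetic as a minimal sketch, assuming one fused multiply-add (two FLOPs) per CUDA core per clock and, per Nvidia's Inside Volta post, one 4×4×4 matrix multiply-accumulate (64 FMAs, or 128 FLOPs) per Tensor Core per clock:

```cpp
#include <cstdio>

int main()
{
    // Tesla V100 figures from the specification table below.
    const double boost_ghz    = 1.455; // GPU Boost clock, 1455 MHz
    const double fp32_cores   = 5120;
    const double fp64_cores   = 2560;
    const double tensor_cores = 640;

    // One fused multiply-add per core per clock counts as two FLOPs.
    printf("Peak FP32:   %.1f TFLOP/s\n", fp32_cores * 2 * boost_ghz / 1000);   // ~14.9
    printf("Peak FP64:   %.1f TFLOP/s\n", fp64_cores * 2 * boost_ghz / 1000);   // ~7.4

    // Each Tensor Core performs a 4x4x4 matrix multiply-accumulate per
    // clock: 64 FMAs, i.e. 128 floating-point operations.
    printf("Peak Tensor: %.0f TFLOP/s\n", tensor_cores * 128 * boost_ghz / 1000); // ~119
    return 0;
}
```

Nvidia rounds these to the quoted 15, 7.5 and 120 TFLOP/s respectively.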

It's interesting to see that Nvidia's Volta GV100 architecture offers dedicated Tensor Cores to compete with accelerators from the likes of Google. There are eight Tensor Cores per SM in the Volta GV100, making 640 in total, and they provide a significant performance uplift in training neural networks. "Tesla V100's Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications," notes Nvidia. If the units are indeed comparable, that means the GV100 leapfrogs Google's TPU ASIC, which is capable of 90 TOPS.
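
For the programmer's view, CUDA 9 exposes the Tensor Cores through a warp-level matrix multiply-accumulate (WMMA) API. What follows is a minimal sketch of my own, not Nvidia's sample code: one warp computing a single 16×16×16 tile with FP16 inputs and FP32 accumulation, the mixed-precision mode behind the 120 TFLOPS figure.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16x16 tile.
// FP16 multiplicands, FP32 accumulator. Compile with: nvcc -arch=sm_70
__global__ void tile_mma(const half *a, const half *b, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                // start with C = 0
    wmma::load_matrix_sync(a_frag, a, 16);              // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag); // runs on the Tensor Cores
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp: tile_mma<<<1, 32>>>(d_a, d_b, d_d);
```

Larger matrices are tiled across many warps, and Nvidia's libraries (cuBLAS, cuDNN) do that behind the scenes, so most deep learning frameworks should pick up the Tensor Cores without hand-written code like the above.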

The Volta GV100 GPU powering Nvidia's latest accelerator product has some mighty specs. This GPU packs in 21.1 billion transistors and is fabricated using TSMC's 12nm FFN process. The chip is considerably larger than the last gen, measuring 815mm² against the GP100's 610mm². Among the computing components inside a full GV100 GPU are: 84 SMs, a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units. In the Tesla V100, 80 SMs are enabled, giving 5120 CUDA cores at work; the count has probably been reduced from the maximum possible to improve yields and to leave room for next-generation Titan headlines.
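
Those full-chip totals are just the per-SM building blocks multiplied out, which also shows exactly what disabling four SMs costs. A quick check, using only figures from the paragraph above:

```cpp
#include <cstdio>

int main()
{
    // Per-SM building blocks in Volta GV100.
    const int fp32_per_sm   = 64;
    const int fp64_per_sm   = 32;
    const int tensor_per_sm = 8;

    const int full_sms    = 84; // complete GV100 die
    const int enabled_sms = 80; // Tesla V100 configuration

    printf("Full GV100: %d FP32, %d FP64, %d Tensor Cores\n",
           full_sms * fp32_per_sm, full_sms * fp64_per_sm, full_sms * tensor_per_sm);
    // -> 5376 FP32, 2688 FP64, 672 Tensor Cores

    printf("Tesla V100: %d FP32, %d FP64, %d Tensor Cores\n",
           enabled_sms * fp32_per_sm, enabled_sms * fp64_per_sm, enabled_sms * tensor_per_sm);
    // -> 5120 FP32, 2560 FP64, 640 Tensor Cores
    return 0;
}
```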

Supporting the Volta chip in the Tesla V100, Nvidia has architected an SXM2 board featuring:

  • second-generation NVLink high-speed interconnect technology, providing links of up to 300GB/s;
  • 16GB of HBM2 memory from Samsung, delivering 1.5x the memory bandwidth of the Pascal GP100;
  • Maximum Performance and Maximum Efficiency operating modes;
  • and that all-important optimised software support, with GPU-accelerated libraries.
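
On the software side, a Volta part identifies itself to CUDA as compute capability 7.0, and the stock device-properties query reports the SM count and memory interface described above. A minimal sketch (the printout formatting is mine):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // A GV100-based part should report compute capability 7.0.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("SMs:              %d\n", prop.multiProcessorCount); // 80 on Tesla V100
    printf("Memory bus width: %d-bit\n", prop.memoryBusWidth);  // 4096-bit HBM2
    printf("Global memory:    %.0f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));   // 16 GB
    return 0;
}
```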

Nvidia has published quite a lengthy blog post about the arrival of Volta and the Tesla V100 accelerator. You can head on over there to read more about the technicalities and nuances of the new architecture: Inside Volta.

Those interested in deploying Volta-based solutions should know that the first server and deep learning products based upon the Tesla V100 will become available in Q3 2017. A DGX-1 system powered by the new GPUs, for example, will set you back $149,000.

| Tesla Product | Tesla K40 | Tesla M40 | Tesla P100 | Tesla V100 |
|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GV100 (Volta) |
| SMs | 15 | 24 | 56 | 80 |
| TPCs | 15 | 24 | 28 | 40 |
| FP32 Cores / SM | 192 | 128 | 64 | 64 |
| FP32 Cores / GPU | 2880 | 3072 | 3584 | 5120 |
| FP64 Cores / SM | 64 | 4 | 32 | 32 |
| FP64 Cores / GPU | 960 | 96 | 1792 | 2560 |
| Tensor Cores / SM | NA | NA | NA | 8 |
| Tensor Cores / GPU | NA | NA | NA | 640 |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | 1455 MHz |
| Peak FP32 TFLOP/s* | 5.04 | 6.8 | 10.6 | 15 |
| Peak FP64 TFLOP/s* | 1.68 | 2.1 | 5.3 | 7.5 |
| Peak Tensor Core TFLOP/s* | NA | NA | NA | 120 |
| Texture Units | 240 | 192 | 224 | 320 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 6144 KB |
| Shared Memory Size / SM | 16 KB/32 KB/48 KB | 96 KB | 64 KB | Configurable up to 96 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB | 20480 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion | 21.1 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 815 mm² |
| Manufacturing Process | 28 nm | 28 nm | 16 nm FinFET+ | 12 nm FFN |

Tesla V100 Compared to Prior Generation Tesla Accelerators. (* Peak TFLOP/s rates are based on GPU Boost clock.)



HEXUS Forums :: 14 Comments

$149,000?

I wonder what a used Tesla K40 fetches these days?

UseItNow:
> $149,000?
>
> I wonder what a used Tesla K40 fetches these days?

Quite a lot, given the Tesla K40 is a generic PCIe add-in card you can bung into any modern workstation or server, whereas the $149,000 DGX-1 is an ultra-high-end proprietary supercomputer module with eight Tesla V100s in it.

Wow, that die is simply monstrous. I don't even recall Intel making something that large! I assume they're talking about the logic die rather than the interposer, but I thought TSMC's single-exposure reticle size was somewhere around the 600mm² mark, hence both AMD and Nvidia hitting that wall on 28nm. IIRC they can get around that limit for things like interposers by using multiple exposures, but that wouldn't work for a logic die (I could imagine a die somehow divided down the middle working, but then it would make no sense to leave it as a single die vs MCM).

I wonder what the real motivation is behind something like this? It seems like it's more marketing and halo effect than anything - they could have gone marginally smaller with a relatively insubstantial difference in performance but massive improvements to costs and yield. And for the sort of application this seems to be targeted at, even Nvidia are showing it must scale well across nodes given they're sticking eight of these in a box. Are they just going for no-expense-spared bragging rights against the Xeon Phi for die size and performance? Although it must be noted Intel are being very secretive about Phi's die size, which some have estimated to be in the area of 700mm² IIRC. Then again, it's not targeting exactly the same market, as Phi has access to far more memory than this. This seems like something with a very niche market, but I guess there must be some demand for it to pour this much money into producing it!

Am I right in thinking that Volta GeForce GPU cards can't be far off? 3 months away at a guess.

qasdfdsaq:
> UseItNow:
> > $149,000?
> >
> > I wonder what a used Tesla K40 fetches these days?
>
> Quite a lot, given the Tesla K40 is a generic PCIe add-in card you can bung into any modern workstation or server, whereas the $149,000 DGX-1 is an ultra-high-end proprietary supercomputer module with eight Tesla V100s in it.

Haha, probably should have worded that better…

Did Google the DGX-1; just wondering what they'll price those V100s at?