Grace, Hopper, NVSwitch detailed at Hot Chips

In four talks over two days, experienced NVIDIA engineers will describe innovations in accelerated computing for modern data centers and systems at the network edge.

Speaking at a virtual Hot Chips event, an annual gathering of processor and system architects, they will reveal performance figures and other technical details for NVIDIA’s first server processor, the Hopper GPU, the latest version of the NVSwitch interconnect chip, and the NVIDIA Jetson Orin system-on-module (SoM).

The presentations provide new insights into how the NVIDIA platform will achieve new levels of performance, efficiency, scalability, and security.

Specifically, the discussions demonstrate a design philosophy of innovating across the entire chip, system, and software stack where GPUs, CPUs, and DPUs act as peer processors. Together, they create a platform that already runs AI, data analytics, and high-performance computing across cloud service providers, supercomputing centers, enterprise data centers, and autonomous systems.

Inside NVIDIA’s First Server Processor

Data centers require flexible clusters of CPUs, GPUs, and other accelerators sharing huge pools of memory to deliver the power-efficient performance that today’s workloads demand.

To address this need, Jonathon Evans, Distinguished Engineer and 15-year veteran at NVIDIA, will describe the NVIDIA NVLink-C2C. It connects CPUs and GPUs at 900 gigabytes per second with 5 times the power efficiency of the existing PCIe Gen 5 standard, thanks to data transfers that consume only 1.3 picojoules per bit.
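
As a rough sanity check (a back-of-the-envelope sketch, not NVIDIA’s methodology), the energy-per-bit figure translates directly into link power:

# Back-of-the-envelope estimate of NVLink-C2C interface power from the
# published 1.3 pJ/bit figure. The PCIe Gen 5 comparison simply applies the
# quoted 5x efficiency ratio; it is an illustration, not measured data.
BANDWIDTH_BYTES_PER_S = 900e9            # 900 GB/s aggregate CPU-GPU bandwidth
ENERGY_PER_BIT_J = 1.3e-12               # 1.3 picojoules per bit

bits_per_second = BANDWIDTH_BYTES_PER_S * 8
nvlink_power_w = bits_per_second * ENERGY_PER_BIT_J
print(f"NVLink-C2C at full rate: ~{nvlink_power_w:.1f} W")                    # ~9.4 W
print(f"Same traffic at 5x the energy per bit: ~{nvlink_power_w * 5:.1f} W")  # ~46.8 W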

NVLink-C2C connects two CPU chips to create the NVIDIA Grace processor with 144 Arm Neoverse cores. It’s a processor designed to solve the world’s biggest computing problems.

For maximum efficiency, the Grace processor uses LPDDR5X memory. It delivers one terabyte per second of memory bandwidth while keeping power consumption for the entire complex at 500 watts.

One Link, Many Uses

NVLink-C2C also bridges Grace CPU and Hopper GPU chips as memory sharing peers in the NVIDIA Grace Hopper Superchip, providing maximum acceleration for performance-intensive tasks such as AI training.

Anyone can create custom chiplets using NVLink-C2C to seamlessly connect to NVIDIA GPUs, CPUs, DPUs, and SoCs, expanding this new class of embedded products. The interconnect will support the AMBA CHI and CXL protocols used by Arm and x86 processors, respectively.

First memory benchmarks for Grace and Grace Hopper.

To scale at the system level, the new NVIDIA NVSwitch connects multiple servers into a single AI supercomputer. It uses NVLink interconnects running at 900 gigabytes per second, more than 7 times the bandwidth of PCIe Gen 5.
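
A quick ratio check (assuming roughly 128 gigabytes per second for a bidirectional PCIe Gen 5 x16 link, a commonly quoted figure rather than one from the talk) shows where the 7x comparison comes from:

# Rough arithmetic behind the "more than 7x PCIe Gen 5" comparison.
# The PCIe figure assumes an x16 link at ~64 GB/s per direction.
NVLINK_GB_PER_S = 900        # NVLink bandwidth per GPU
PCIE5_X16_GB_PER_S = 128     # assumed PCIe Gen 5 x16, both directions combined

print(f"~{NVLINK_GB_PER_S / PCIE5_X16_GB_PER_S:.1f}x")   # ~7.0x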

NVSwitch allows users to link 32 NVIDIA DGX H100 systems into an AI supercomputer that delivers an exaflop of peak AI performance.

Alexander Ishii and Ryan Wells, both veteran NVIDIA engineers, will describe how the switch allows users to build systems with up to 256 GPUs to tackle demanding workloads like training AI models that have more than 1 trillion parameters.
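
The exaflop figure follows from simple arithmetic, sketched below under the assumption of roughly 4 petaflops of peak FP8 Tensor Core throughput per H100 with sparsity (a commonly cited peak, not a number from these talks):

# How 32 DGX H100 systems add up to roughly an exaflop of peak AI compute.
# The per-GPU FP8 figure is an assumed peak (with sparsity), for illustration only.
GPUS_PER_DGX = 8
SYSTEMS = 32
FP8_PFLOPS_PER_GPU = 4.0      # assumed ~4 PFLOPS peak FP8 per H100

total_gpus = GPUS_PER_DGX * SYSTEMS
peak_eflops = total_gpus * FP8_PFLOPS_PER_GPU / 1000
print(total_gpus, "GPUs ->", f"~{peak_eflops:.2f} EFLOPS peak")   # 256 GPUs -> ~1.02 EFLOPS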

The switch includes engines that accelerate data transfers using the NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol. SHARP is an in-network computing capability that debuted on NVIDIA Quantum InfiniBand networks. It can double the data throughput of communication-intensive AI applications.
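
The intuition is that reducing data inside the switch roughly halves the traffic each GPU must inject compared with reducing on the GPUs themselves. The simplified traffic model below (a conceptual sketch, not a description of NVSwitch internals) shows where a ~2x gain comes from:

# Conceptual model of why in-network reduction can roughly double effective
# bandwidth for an allreduce. Per-GPU traffic, ignoring protocol overheads;
# illustrative only, not NVIDIA data.
def ring_allreduce_bytes_per_gpu(msg_bytes, n_gpus):
    # Classic ring allreduce: each GPU sends about 2*(N-1)/N of the message.
    return 2.0 * (n_gpus - 1) / n_gpus * msg_bytes

def in_network_allreduce_bytes_per_gpu(msg_bytes):
    # Reduction in the switch: each GPU sends its contribution up once and
    # receives the combined result once.
    return msg_bytes

msg = 1e9    # 1 GB gradient buffer
n = 256      # GPUs in the NVLink domain
print(ring_allreduce_bytes_per_gpu(msg, n) / in_network_allreduce_bytes_per_gpu(msg))  # ~2.0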

NVSwitch systems enable exascale-class AI supercomputers.

Jack Choquette, a senior distinguished engineer with 14 years at the company, will give an in-depth tour of the NVIDIA H100 Tensor Core GPU, aka Hopper.

In addition to using the new interconnects, it incorporates many advanced features that improve the accelerator’s performance, efficiency, and security.

Hopper’s new Transformer Engine and improved Tensor Cores deliver a 30x speedup over the previous generation on AI inference with the world’s largest neural network models. And it uses the world’s first HBM3 memory system to deliver a whopping 3 terabytes per second of memory bandwidth, the biggest generational boost ever by NVIDIA.

Choquette, one of the main chip designers on the Nintendo 64 system early in his career, will also describe the parallel computing techniques underlying some of Hopper’s advances.

Michael Ditty, chief architect for Orin with a 17-year tenure at the company, will deliver new performance specifications for NVIDIA Jetson AGX Orin, an engine for edge AI, robotics, and advanced autonomous machines.

It integrates 12 Arm Cortex-A78 cores and an NVIDIA Ampere architecture GPU to deliver up to 275 trillion operations per second on AI inference jobs. That’s up to 8 times more performance with 2.3 times higher energy efficiency than the previous generation.
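
That roughly matches the headline TOPS numbers, assuming the commonly cited ~32 INT8 TOPS for the previous-generation Jetson AGX Xavier (an assumption for illustration, not a figure from this talk):

# Rough check of the generational claim; the Xavier number is an assumed
# ballpark for the prior-generation module.
ORIN_TOPS = 275        # Jetson AGX Orin, INT8 TOPS
XAVIER_TOPS = 32       # assumed Jetson AGX Xavier, INT8 TOPS

print(f"~{ORIN_TOPS / XAVIER_TOPS:.1f}x raw TOPS")   # ~8.6x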

The latest production module contains up to 32 gigabytes of memory and is part of a compatible family that scales down to pocket-sized 5-watt Jetson Nano developer kits.

Performance benchmarks for NVIDIA Orin

All the new chips support the NVIDIA software stack, which accelerates over 700 applications and is used by 2.5 million developers.

Based on the CUDA programming model, it includes dozens of NVIDIA SDKs for vertical markets such as automotive (DRIVE) and healthcare (Clara), as well as technologies such as recommendation systems (Merlin) and conversational AI (Riva).
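
For a flavor of the programming model those SDKs build on, here is a minimal CUDA-style kernel written with Numba’s CUDA support in Python; this is an illustrative sketch (Numba is one of several ways to write CUDA kernels, and it requires an NVIDIA GPU), not part of any SDK named above:

# Minimal sketch of the CUDA programming model: a grid of threads, each
# handling one array element. Requires the numba package and an NVIDIA GPU.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # global index of this thread in the launch grid
    if i < x.shape[0]:        # guard threads that fall past the array end
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](2.0, x, y, out)   # host arrays are copied to and from the GPU
print(np.allclose(out, 2.0 * x + y))               # True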

The NVIDIA AI platform is available from every major cloud service provider and system maker.
