Hopper Architecture Explained: From SMs to DPX Instructions

The Hopper architecture, introduced by NVIDIA in March 2022, is a breakthrough in GPU technology. As the successor to the Ampere architecture, Hopper represents NVIDIA’s next big step in powering advanced AI models, scientific simulations, and data center computing. Named after Grace Hopper, a pioneer in computer science, this AI GPU architecture brings significant innovations such as redesigned Streaming Multiprocessors (SMs), fourth-generation Tensor Cores, a powerful Transformer Engine, advanced memory subsystems, and new DPX instructions. These features make Hopper-based GPUs, like the NVIDIA H100 GPU, ideal for large-scale AI training, high-performance computing (HPC), and complex AI inference workloads.


In this article, we’ll delve deep into the Hopper microarchitecture, examining its key components, the technology driving its performance, and how it is optimized for AI and HPC workloads. We will also explore its benefits, challenges, and real-world applications.

What is Hopper Architecture?

Hopper is NVIDIA’s GPU microarchitecture designed specifically for AI and HPC workloads. It powers the NVIDIA H100 GPU and introduces many firsts in GPU design, including FP8 precision, the Transformer Engine, and DPX instructions. Hopper is built using the TSMC 4N process—a custom 4nm fabrication node that packs 80 billion transistors into a single die, maximizing performance per watt and transistor density.

NVIDIA Hopper Architecture

Streaming Multiprocessors (SMs) in Hopper Architecture

Streaming Multiprocessors (SMs) are the heart of any GPU architecture. In Hopper, the SMs have been redesigned to handle more concurrent threads and deliver higher throughput. The full GH100 die contains 144 SMs, of which the H100 SXM5 enables 132, each capable of supporting 2,048 concurrent threads. SMs are analogous to CPU cores in function but are optimized for massively parallel processing. Each SM includes:

  • Integer and floating-point ALUs
  • Load/store units
  • Register files
  • Shared memory/L1 cache
  • Tensor Cores

The SM architecture in Hopper GPUs allows for higher concurrency and parallelism, which is essential for large-scale AI training and scientific computing. The new instruction scheduler and memory hierarchy ensure that SMs remain fully utilized across diverse workloads.
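To make the thread hierarchy concrete, here is a minimal sketch (assuming a single H100-class GPU at device ordinal 0, with busyKernel as a hypothetical placeholder workload) that queries these SM limits through the CUDA runtime and sizes a grid so that every SM receives work:

```cpp
// Minimal sketch: query the SM count and per-SM thread limit, then launch
// enough blocks to occupy every SM. busyKernel is a placeholder workload.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data)
{
    // Each thread writes one element; blocks and threads map onto the SMs.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = idx * 0.5f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, max threads per SM: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

    // Enough blocks to fill every SM up to its 2,048-resident-thread limit.
    int threadsPerBlock = 256;
    int blocks = prop.multiProcessorCount *
                 (prop.maxThreadsPerMultiProcessor / threadsPerBlock);

    float *d_data;
    cudaMalloc(&d_data, (size_t)blocks * threadsPerBlock * sizeof(float));
    busyKernel<<<blocks, threadsPerBlock>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```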

Tensor Cores and Mixed Precision Computing

Hopper introduces fourth-generation Tensor Cores that support multiple precision formats: FP64, FP32, FP16, and the new FP8. This flexibility allows AI models to use the optimal data format for each layer or operation, striking a balance between performance and accuracy.
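As a concrete illustration of mixed precision on the Tensor Cores, the sketch below uses CUDA's WMMA API to multiply a single 16x16 tile with FP16 inputs and an FP32 accumulator; the kernel name and tile shape are illustrative choices, and FP8 matrix math is normally reached through libraries such as cuBLASLt rather than hand-written WMMA code.

```cpp
// Minimal sketch: one warp multiplies a 16x16 FP16 tile and accumulates in FP32
// on the Tensor Cores via CUDA's WMMA API. Launch with a single warp, e.g.
// tileGemm<<<1, 32>>>(dA, dB, dC);
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tileGemm(const half *A, const half *B, float *C)
{
    // FP16 input fragments, FP32 accumulator: the classic mixed-precision pairing.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // load the 16x16 FP16 tiles (leading dimension 16)
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);     // one Tensor Core matrix-multiply-accumulate
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```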


FP8 Precision:

One of the most transformative aspects of Hopper is support for FP8, a new low-precision format ideal for AI workloads. FP8 enables faster computation and roughly halves memory usage compared to FP16, making it well suited for training large language models (LLMs) with minimal loss of model accuracy.
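To show what the format itself looks like in code, here is a hedged sketch assuming CUDA 11.8 or newer, where the cuda_fp8.h header and the __nv_fp8_e4m3 type are available; it simply rounds one FP32 value into 8-bit E4M3 storage and converts it back. Production FP8 matrix math goes through libraries such as cuBLASLt or Transformer Engine.

```cpp
// Hedged sketch (assumes CUDA 11.8+, which ships cuda_fp8.h): round a float
// into the 8-bit E4M3 format (4 exponent bits, 3 mantissa bits) and back.
#include <cstdio>
#include <cuda_fp8.h>

int main()
{
    float original = 3.14159f;

    // Construct an FP8 value from FP32; this rounds to the nearest representable E4M3 number.
    __nv_fp8_e4m3 compact(original);

    // Convert back to FP32; the difference is the precision traded for a 4x smaller footprint than FP32.
    float recovered = float(compact);

    printf("original %.5f -> fp8 round-trip %.5f\n", original, recovered);
    return 0;
}
```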

Tensor Cores in Hopper deliver:

  • Up to 9x faster AI training vs. A100.
  • Up to 30x faster AI inference.
  • Roughly 2,000 TFLOPS of dense FP8 throughput (nearly 4,000 TFLOPS with structured sparsity).

The Transformer Engine

The Transformer Engine is a purpose-built unit in Hopper designed to accelerate Transformer-based models, including GPT, BERT, and T5. It manages precision dynamically using FP8 and FP16 to maximize performance without compromising on model quality. This engine is particularly impactful in large-scale NLP models where speed and efficiency are critical. In benchmark tests, the H100 using the Transformer Engine can generate tokens up to twice as fast as the A100 GPU.

NVLink 4.0 and GPU Interconnect

Another cornerstone of Hopper’s performance is NVLink 4.0. This fourth-generation interconnect enables ultra-fast communication between GPUs in multi-GPU configurations.

Key Features:

  • Up to 900 GB/s bandwidth per GPU.
  • 18 NVLink links at 50 GB/s each.
  • Improved latency and reduced bottlenecks.

NVLink 4.0 is essential for building exascale systems using hundreds of H100 GPUs. It ensures each GPU can share data seamlessly, enabling massive parallel processing for AI model training and simulation workloads.
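From a programmer's point of view, NVLink is the path that direct GPU-to-GPU copies take once peer access is enabled. The sketch below assumes two peer-capable GPUs at ordinals 0 and 1; on machines without NVLink the same calls still work but fall back to PCIe.

```cpp
// Minimal sketch of a direct device-to-device copy (assumes GPUs 0 and 1 are peers).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 map GPU 1's memory?

    size_t bytes = 256u << 20;                   // 256 MB test buffer
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Device-to-device copy; with peer access over NVLink it never touches host memory.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
    cudaDeviceSynchronize();

    printf("peer access: %s\n", canAccess ? "enabled" : "unavailable");
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```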


MIG (Multi-Instance GPU)

Hopper GPUs feature second-generation MIG (Multi-Instance GPU) technology, allowing a single H100 GPU to be partitioned into up to 7 separate instances. Each instance operates with isolated compute, cache, and memory resources.

Benefits:

  • Improved resource utilization.
  • Enhanced security and isolation for cloud workloads.
  • Tailored performance profiles for different users.

MIG is ideal for data centers and enterprises offering AI-as-a-Service, as it allows secure, efficient GPU sharing.

DPX Instructions

Hopper introduces DPX instructions, which accelerate the dynamic programming algorithms used in fields like:

  • Genomics.
  • Graph analytics.
  • Supply chain optimization.
  • Disease modeling.

These new DPX instructions are implemented in hardware and significantly reduce the runtime of dynamic programming algorithms, which are typically memory- and compute-intensive. For example, DPX can accelerate Smith-Waterman and Needleman-Wunsch algorithms used in bioinformatics.
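To illustrate the kind of recurrence involved, the sketch below updates one anti-diagonal of a Needleman-Wunsch-style score matrix; the kernel name, the array layout, and the mention of the __vimax3_s32 intrinsic are illustrative assumptions rather than a verified DPX recipe. The key point is the fused add-then-three-way-max pattern, which Hopper can execute with DPX instructions.

```cpp
// Illustrative sketch: one anti-diagonal step of a Needleman-Wunsch-style
// alignment. prevRow holds n+1 scores from the previous anti-diagonal,
// prevDiag holds n scores from two anti-diagonals back (layout assumed for illustration).
#include <cuda_runtime.h>

__global__ void nwAntidiagonal(const int *prevDiag, const int *prevRow,
                               const int *substScore, int *curRow,
                               int gapPenalty, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Three candidate scores: match/mismatch, gap in one sequence, gap in the other.
    int matchPath = prevDiag[i] + substScore[i];
    int gapUp     = prevRow[i]     - gapPenalty;
    int gapLeft   = prevRow[i + 1] - gapPenalty;

    // Add-then-three-way-max: the pattern DPX accelerates (CUDA 12+ also exposes
    // intrinsics such as __vimax3_s32 for it; treat the name as an assumption).
    curRow[i] = max(matchPath, max(gapUp, gapLeft));
}
```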

Memory System and HBM3

Hopper GPUs feature a cutting-edge memory subsystem built around HBM3 (High Bandwidth Memory 3). The SXM version of the NVIDIA H100 GPU includes:

  • 80 GB of HBM3.
  • 3.35 TB/s bandwidth.

Other memory features:

  • Larger L2 cache (50 MB).
  • Per-SM L1 cache and shared memory.
  • Higher bandwidth for memory-intensive tasks.

This high-speed memory system supports fast data movement, essential for AI workloads with large datasets.
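Most of these figures can be read back at runtime; the short sketch below queries device 0 and prints whatever the installed GPU reports, so the numbers will match those above only on an H100 SXM part.

```cpp
// Quick sketch: read the memory hierarchy figures from the CUDA runtime (device 0).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Global memory        : %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("L2 cache             : %.1f MB\n", prop.l2CacheSize / 1e6);
    printf("Shared memory per SM : %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Memory bus width     : %d bits\n", prop.memoryBusWidth);
    return 0;
}
```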

CUDA Hierarchy and Software Ecosystem

Hopper GPUs continue to support the CUDA programming model, which has evolved to better utilize their architectural advancements. CUDA developers can take advantage of:

  • Warp-level primitives.
  • Cooperative groups.
  • Improved shared memory usage.
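The first two of these are illustrated in the minimal sketch below: each warp is wrapped in a cooperative-groups tile and reduces its values with register-to-register shuffles, with the per-warp partial sums combined through an atomic add (the output pointer is assumed to be zero-initialized by the caller).

```cpp
// Minimal sketch: warp-level reduction using cooperative groups and shuffle instructions.
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void warpSum(const float *in, float *out, int n)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] : 0.0f;

    // Butterfly reduction within the warp: no shared memory, just register shuffles.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

    // Lane 0 of each warp holds the partial sum for its 32 elements.
    if (warp.thread_rank() == 0)
        atomicAdd(out, v);   // *out is assumed to be zero-initialized by the caller
}
```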

NVIDIA also provides an ecosystem of optimized libraries and tools, including:

  • cuBLAS.
  • cuDNN.
  • TensorRT.
  • Nsight Systems and Compute.

These tools help developers debug, profile, and optimize their Hopper-powered applications efficiently.
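As one hedged example of what calling into this ecosystem looks like, the sketch below runs a single-precision GEMM through cuBLAS (the gemm helper, its parameters, and the square-matrix assumption are illustrative choices); on Hopper, the equivalent TF32, FP16, and FP8 paths in cuBLAS and cuBLASLt route the same kind of call onto the Tensor Cores.

```cpp
// Hedged sketch: C = A * B with cuBLAS (column-major, n x n, device pointers).
// Link with -lcublas.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm(const float *dA, const float *dB, float *dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;

    // Column-major GEMM: C = alpha * A * B + beta * C.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, dA, n, dB, n,
                &beta, dC, n);

    cublasDestroy(handle);
}
```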

Use Cases and Real-World Applications

Hopper architecture is ideal for:

  • Training large language models (LLMs).
  • Generative AI (text, image, video).
  • Deep reinforcement learning.
  • Weather and climate simulations.
  • Drug discovery and genomics.
  • Real-time fraud detection in fintech.
  • Autonomous vehicle training.

Its ability to accelerate both inference and training makes Hopper a one-stop solution for companies deploying large-scale AI systems.

Pros and Cons of Hopper Architecture

Pros:

  • Exceptional AI and HPC performance.
  • FP8 precision GPU and Transformer Engine support.
  • Powerful interconnect (NVLink 4.0).
  • Excellent scalability with MIG (Multi-Instance GPU).
  • Specialized hardware instructions (DPX).

Cons:

  • High power consumption (up to 700W).
  • Expensive.
  • Requires advanced infrastructure (cooling, PSU, interconnects).

The Hopper architecture is a game-changer in GPU computing. With innovations like FP8 precision, fourth-generation Tensor Cores, NVLink 4.0, and DPX instructions, it pushes the boundaries of what’s possible in AI and HPC. It not only delivers major speedups in training and inference, but also improves efficiency through MIG and hardware-accelerated dynamic programming. NVIDIA’s Hopper is more than just a GPU architecture—it’s the foundation for the future of accelerated computing.