AMD has detailed its Instinct MI300X “CDNA 3” GPUs ahead of the MI325X launch next quarter, offering a closer look at the GPU architecture designed for AI workloads.
AMD Instinct MI300X “CDNA 3” AI GPU Brings 320 Compute Units On Full Chip, MI325X With Updated HBM3e Coming In October
AMD’s MI300X is the third iteration of its Instinct accelerators designed for the AI computing segment. The chip also comes in the MI300A flavor, an exascale-optimized APU that combines Zen 4 CPU cores across three chiplets with the remaining chiplets powered by the CDNA 3 GPU.
AMD has broken down the entire Instinct MI300X to give us an accurate representation of what’s under the hood of this massive AI product. For starters, the AMD Instinct MI300X features a total of 153 billion transistors, built on a mix of 5nm and 6nm TSMC FinFET process nodes. The eight XCD chiplets each feature four shader engines, and each shader engine contains 10 compute units.
That works out to 32 shader engines across the entire chip, with 40 compute units on a single XCD and 320 in total across the package. Each XCD has its own dedicated L2 cache, and the package also incorporates the Infinity Fabric links, 8 HBM3 IO sites, and a single PCIe Gen 5.0 link with 128GB/s of total bandwidth that connects the MI300X to an AMD EPYC CPU.
AMD is using the fourth-generation Infinity Fabric in its Instinct MI300X chip, which offers up to 896GB/s of bandwidth. The chip also incorporates an Infinity Fabric Advanced Package Link that connects all the chips using 4.8TB/s of bisection bandwidth, while the XCD/IOD interface is rated at 2.1TB/s bandwidth.
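As a rough sanity check on the 896GB/s figure, here is a minimal sketch. It assumes the MI300X OAM exposes seven Infinity Fabric links running at 128GB/s each, the same per-link rate as the PCIe Gen 5 connection; the link count is an assumption, not something stated above:

```python
# Back-of-envelope check on the Infinity Fabric bandwidth figure.
# Assumption: 7 IF links per MI300X OAM at 128 GB/s each.
if_links = 7
link_bw_gbps = 128          # GB/s per link, matching the PCIe Gen 5 rate

total_if_bw = if_links * link_bw_gbps
print(f"Aggregate Infinity Fabric bandwidth: {total_if_bw} GB/s")  # 896 GB/s
```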
Diving into the CDNA 3 architecture itself, the latest design includes:
- 2x low-precision matrix operations per clock per CU
- 2:4 structured sparsity support for INT8, FP8, FP16, BF16
- 2x additional performance with sparsity enabled (see the sketch after this list)
- Support for TF32 and FP8 numeric formats
- Co-issue of FP16/FP32/INT32 alongside FP16/FP32/FP64
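To illustrate what 2:4 structured sparsity means in practice, here is a minimal NumPy sketch that prunes a weight matrix so every group of four consecutive values keeps only its two largest-magnitude entries; hardware can then skip the zeroed pairs, which is where the claimed 2x uplift comes from. The function name and magnitude-based pruning policy are illustrative, not AMD's implementation:

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero 2 of every 4 consecutive values, keeping the 2 largest by magnitude.

    Illustrative only; real 2:4 sparsity is applied during model pruning/fine-tuning.
    """
    flat = weights.reshape(-1, 4)                  # view as groups of 4
    # Indices of the 2 smallest-magnitude entries in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)   # zero them out
    return pruned.reshape(weights.shape)

w = np.random.randn(4, 8).astype(np.float16)
w_sparse = prune_2_4(w)
# At most 2 nonzero values remain in every group of 4 (the 2:4 pattern)
assert (w_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```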
The full block diagram of the MI300X architecture is shared below, and you can see that each XCD has two compute units disabled, leaving 304 CUs of the full 320 CU design. The full chip is configured with 20,480 cores, while the MI300X is configured with 19,456 cores. There is also 256 MB of dedicated Infinity Cache onboard the chip.
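The CU and core counts follow directly from the chiplet topology described earlier; a quick sketch of the arithmetic, assuming AMD's usual 64 stream processors per CU:

```python
# Deriving the MI300X shader counts from the chiplet topology described above.
xcds = 8
shader_engines_per_xcd = 4
cus_per_shader_engine = 10
cores_per_cu = 64          # stream processors per CU (standard for AMD GPUs)

full_cus = xcds * shader_engines_per_xcd * cus_per_shader_engine          # 320
active_cus = xcds * (shader_engines_per_xcd * cus_per_shader_engine - 2)  # 304

print(xcds * shader_engines_per_xcd)           # 32 shader engines total
print(full_cus, full_cus * cores_per_cu)       # 320 CUs, 20480 cores (full chip)
print(active_cus, active_cus * cores_per_cu)   # 304 CUs, 19456 cores (MI300X)
```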
The complete analysis of the cache and memory hierarchy in the MI300X is visualized below:
Each CDNA 3 compute unit is composed of a scheduler, local data share, vector registers, vector units, a matrix core, and L1 cache. Coming to the performance numbers, the MI300X delivers the following speedups (a rough sanity check of these figures follows the list):
- 1.7x speedup over MI250X in Vector FP64
- 3.4x speedup over MI250X in Vector FP32
- 1.7x speedup over MI250X in Matrix FP64
- 1.7x speedup over MI250X in Matrix FP32
- 3.4x speedup over MI250X in Matrix FP16
- 3.4x speedup over MI250X in Matrix BF16
- 6.8x speedup over MI250X in Matrix INT8
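Those multipliers line up with a simple scaling argument. Assuming 304 active CUs at roughly 2.1 GHz for the MI300X versus 220 CUs at roughly 1.7 GHz for the MI250X (publicly listed peak clocks, but treated as assumptions here), the baseline throughput ratio is about 1.7x, and the 3.4x and 6.8x figures fall out of CDNA 3's doubled matrix ops per clock and additional INT8 packing on top of that:

```python
# Rough sanity check on AMD's quoted speedups (all inputs are assumptions).
mi300x_cus, mi300x_clk = 304, 2.10   # CUs, assumed peak engine clock in GHz
mi250x_cus, mi250x_clk = 220, 1.70   # CUs, assumed peak clock in GHz

base = (mi300x_cus * mi300x_clk) / (mi250x_cus * mi250x_clk)
print(f"Base CU*clock ratio:  {base:.2f}x")      # ~1.7x -> FP64/FP32 matrix, FP64 vector
print(f"With 2x ops/clk/CU:   {base * 2:.2f}x")  # ~3.4x -> FP16/BF16 matrix, FP32 vector
print(f"With 4x INT8 packing: {base * 4:.2f}x")  # ~6.8x -> INT8 matrix
```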
AMD’s Instinct MI300X is also the first accelerator to feature an 8-stack HBM3 memory design, with NVIDIA following later this year with its Blackwell GPUs. The new 8-stack design allowed AMD to achieve 1.5x greater capacity, while the new HBM3 standard provided a 1.6x increase in bandwidth compared to the MI250X.
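A minimal sketch of that memory math, assuming 24GB per HBM3 stack on the MI300X and taking the MI250X's 128GB of HBM2e at roughly 3.28TB/s as the baseline:

```python
# HBM capacity/bandwidth uplift vs. MI250X (stack size and baseline are assumptions).
stacks, gb_per_stack = 8, 24                 # 8 HBM3 stacks, 24 GB each (assumed)
mi300x_capacity = stacks * gb_per_stack      # 192 GB
mi300x_bw_tbps = 5.3                         # TB/s, AMD's quoted figure

mi250x_capacity, mi250x_bw_tbps = 128, 3.28  # GB and TB/s for HBM2e baseline

print(f"Capacity uplift:  {mi300x_capacity / mi250x_capacity:.1f}x")  # 1.5x
print(f"Bandwidth uplift: {mi300x_bw_tbps / mi250x_bw_tbps:.1f}x")    # 1.6x
```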
AMD also claims that its larger and faster memory configuration on the Instinct MI300X allows it to handle larger LLM (FP16) model sizes of up to 70B parameters in training and 680B in inference, whereas NVIDIA HGX H100 systems can only sustain model sizes of up to 30B in training and 290B in inference.
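Those ceilings track with simple memory-footprint math. A minimal sketch, assuming an 8-GPU MI300X platform (8 x 192GB = 1.5TB), 2 bytes per FP16 parameter for inference, and roughly 16 bytes per parameter for training once gradients and Adam optimizer states are included; the 16-byte figure is a common rule of thumb, not AMD's number:

```python
# Rough fit check for the quoted model-size limits (rules of thumb, not AMD's math).
platform_hbm_gb = 8 * 192   # 8x MI300X per platform -> 1536 GB of HBM3

def footprint_gb(params_b: float, bytes_per_param: float) -> float:
    """Memory footprint in GB for a model with params_b billion parameters."""
    return params_b * bytes_per_param   # 1B params * 1 byte ~= 1 GB

print(footprint_gb(680, 2))   # FP16 inference: ~1360 GB, fits in 1536 GB
print(footprint_gb(70, 16))   # training w/ grads + Adam states: ~1120 GB, fits
```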
One interesting feature of the Instinct MI300X is AMD’s spatial partitioning, which allows users to partition XCDs based on the demands of their workloads. By default, all XCDs operate together as a single processor, but they can also be partitioned and grouped to appear as multiple GPUs.
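Conceptually, partitioning groups the eight XCDs into equal sets that each enumerate as a separate logical GPU. A purely illustrative sketch follows; the mode names echo AMD's compute-partitioning terminology (SPX meaning a single partition, CPX meaning one partition per XCD), but which intermediate modes a given product supports varies, and the grouping logic here is mine, not AMD's driver code:

```python
# Illustrative model of MI300X compute partitioning (not driver code).
XCDS = list(range(8))   # the eight accelerator chiplets

# Partition modes: how many logical GPUs the 8 XCDs split into (assumed set).
MODES = {"SPX": 1, "DPX": 2, "QPX": 4, "CPX": 8}

def partition(mode: str) -> list[list[int]]:
    """Split the XCDs into equal groups, one group per logical GPU."""
    n = MODES[mode]
    size = len(XCDS) // n
    return [XCDS[i * size:(i + 1) * size] for i in range(n)]

print(partition("SPX"))  # [[0, 1, ..., 7]]       one big GPU
print(partition("CPX"))  # [[0], [1], ..., [7]]   eight single-XCD GPUs
```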
AMD will update its Instinct platform with the MI325X in October, which will feature HBM3e memory and increased capacities of up to 288 GB. Some of the features of the MI325X include (a rough breakdown of the math follows the list):
- 2x Memory
- 1.3x memory bandwidth
- 1.3x Theoretical Peak FP16
- 1.3x Theoretical Peak FP8
- 2x Model Size per Server
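For context on those multipliers, a quick sketch, assuming 36GB HBM3e stacks and that AMD's comparisons are drawn against NVIDIA's H200 (141GB at roughly 4.8TB/s); both the stack size and the baseline are assumptions, since neither is stated above:

```python
# Where the MI325X multipliers could come from (all baselines are assumptions).
stacks, gb_per_stack = 8, 36            # HBM3e, 36 GB per stack (assumed)
mi325x_capacity = stacks * gb_per_stack # 288 GB, matching AMD's quoted figure
mi325x_bw = 6.0                         # TB/s (AMD's quoted HBM3e bandwidth)

h200_capacity, h200_bw = 141, 4.8       # GB and TB/s (assumed comparison baseline)

print(f"Memory:    {mi325x_capacity / h200_capacity:.1f}x")  # ~2.0x
print(f"Bandwidth: {mi325x_bw / h200_bw:.2f}x")              # ~1.25x, quoted as 1.3x
```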
NVIDIA’s response will arrive next year in the form of Blackwell Ultra with 288GB of HBM3e, so AMD will once again remain ahead for the time being in this crucial AI market, where ever-larger AI models require greater memory capacities to support billions or trillions of parameters.