SC22: CXL3.0, the Future of HPC, and Frontier vs. Fugaku Interconnects | insideHPC

By Adrian Cockcroft, Partner and Analyst, OrionX

HPC luminary Jack Dongarra’s fascinating comments at SC22 on the low efficiency of leading-class supercomputers, highlighted by the latest High Performance Conjugate Gradients (HPCG) benchmark results, will, I believe, influence the next generation of supercomputer architectures to better optimize sparse matrix calculations. The upcoming technology that will help solve this problem is CXL. Next-generation architectures will use CXL3.0 switches to connect compute nodes, pooled memory, and I/O resources into very large cohesive fabrics within a rack, and use Ethernet between racks. I call this a “Petalith” architecture (explained below), and I believe CXL will play an important and growing role in shaping this emerging development in the high-performance interconnect space.

This story begins 20 years ago, when I was a Distinguished Engineer at Sun Microsystems and Shahin Khan, now a partner at technology consultancy OrionX, asked me to be chief architect of the High Performance Technical Computing team that he ran. When I retired from my VP role at Amazon earlier this year and was looking for a way to keep up an advisory and analyst role, I contacted Shahin again and joined OrionX.net.

At SC22, Shahin and I went to the Top500 Report press conference. Dongarra reviewed the latest results and pointed out the low efficiency of some important workloads. Since the discussion made no mention of interconnect trends, I asked whether the solution lay in new approaches to interconnect efficiency. Two slides he shared later in the conference, during his Turing Award talk as the event’s keynote and during a panel discussion on “Reinventing HPC,” address the interconnect issue:

The Top500 HPL (High Performance LINPACK) benchmark results are now led by Oak Ridge National Laboratory’s Frontier system, an HPE/Cray machine delivering more than one exaflop on 64-bit floating-point dense matrix factorization.

HPCG Top 10 – Jack Dongarra’s Turing Award Lecture

But there are plenty of important workloads represented by the HPCG benchmark, where Frontier only achieves 14.1 petaflops, or 0.8% of its peak capacity. The HPCG list is led by Japan’s RIKEN Fugaku system at 16 petaflops, or 3% of its peak capacity. That’s almost four times better on a relative basis, but even so it’s clear that most of the machine’s compute resources are sitting idle. In contrast, the LINPACK benchmark on Frontier runs at 68% of peak capacity and could probably do better with further optimization.
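For readers who want to check the arithmetic, here is a minimal sketch using approximate Rpeak values from the November 2022 lists (the exact peak figures are assumptions and vary slightly between list revisions):

```c
// A quick sanity check of the HPCG efficiency figures quoted above.
// Rpeak values are approximate, taken from the November 2022 lists.
#include <stdio.h>

int main(void) {
    const double frontier_peak_pf = 1685.0;  // Frontier Rpeak: ~1.69 exaflops
    const double fugaku_peak_pf   = 537.2;   // Fugaku Rpeak:   ~537 petaflops
    const double frontier_hpcg_pf = 14.1;    // Frontier HPCG result
    const double fugaku_hpcg_pf   = 16.0;    // Fugaku HPCG result

    double frontier_eff = 100.0 * frontier_hpcg_pf / frontier_peak_pf;
    double fugaku_eff   = 100.0 * fugaku_hpcg_pf   / fugaku_peak_pf;

    printf("Frontier HPCG efficiency: %.1f%% of peak\n", frontier_eff);
    printf("Fugaku   HPCG efficiency: %.1f%% of peak\n", fugaku_eff);
    printf("Relative advantage for Fugaku: %.1fx\n", fugaku_eff / frontier_eff);
    return 0;
}
```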

Most of the top supercomputers are similar to Frontier, using AMD or Intel CPUs with AMD, Intel, or NVIDIA GPU accelerators, and Cray Slingshot or Infiniband networks in a Dragonfly+ configuration. Fugaku is very different: it uses an Arm-based CPU (Fujitsu’s A64FX) with built-in vector processing, connected via a 6D torus interconnect. The vector units avoid some memory bottlenecks and can be exploited more easily and automatically by a compiler, and the interconnect also helps.

Over time, we see more and more applications with more complex physics and simulation features, which means that HPCG becomes more and more relevant as a benchmark.

During a number of briefings and conversations, we asked why other vendors weren’t copying Fugaku, and we asked about upcoming architecture enhancements. Some people said they had tried to pitch Fugaku-like architectures, but customers want Frontier-style systems. Others saw it as a choice between specialized technology and more off-the-shelf technology, and felt that leveraging the much larger investment in off-the-shelf components was the way to win in the long run.

But I think the trend is toward custom designs, with Apple, Amazon, Google and others building processors optimized for their own use. This week, AWS announced an HPC-optimized Graviton3E CPU, which confirms this trend. If I’m right about that, it positions Fugaku as the first of a new mainstream, rather than a special-purpose outlier.

I’ve always been particularly interested in the interconnects and protocols used to create clusters, as well as the latency and bandwidth of the different offerings available. I presented a keynote for Sun at Supercomputing 2003 in Phoenix, Arizona and have included the slide below.

The four categories still make sense: kernel-managed network sockets, user-mode messaging libraries, coherent memory interfaces, and on-chip communication.

If we look at Ethernet first, over the last 20 years we’ve gone from an era where 1 Gbit was common and 10 Gbit was the best available, to 100 Gbit being common, with many options at 200 Gbit, some at 400 Gbit, 800 Gbit launched a year ago, and this week’s announcement of 1,600 Gbit of network bandwidth for a single AWS instance. Ethernet latency has also been optimized: the HPE/Cray Slingshot interconnect used in Frontier is a heavily customized 200 Gbit Ethernet.

Twenty years ago there were a variety of commercial interconnects like Myrinet, and several Infiniband vendors and chipsets. Over the years they consolidated into Mellanox, which is now part of NVIDIA, and OmniPath, which Intel sold to Cornelis. In addition to being a highly mature and reliable interconnect at scale and for a variety of uses, part of the appeal of Infiniband is that it can be accessed from a user-mode library like MPI, rather than requiring a call into the kernel. Minimum latency hasn’t dropped much in 20 years, but Infiniband now runs at 400 Gbps.

The most exciting new development this year is that the industry has consolidated several different next-generation interconnect standards around Compute Express Link – CXL, and the CXL3.0 specification was released a few months ago.

CXL3.0 doubles the speed and adds many features to the existing CXL2.0 specification, which is starting to appear as working silicon, such as XConn’s 16-port CXL2.0 switch shown on the expo floor.

CXL is a memory protocol, as shown in the third block of my diagram from 2003. It provides cache-coherent access at latencies around 200 ns, with a maximum reach of about 2 meters. That is enough to wire the systems in a rack together into a single CXL3.0 fabric. CXL3.0 has broad industry support: it will be integrated into future processor designs from ARM, and earlier versions of CXL are already being integrated into upcoming processors from Intel and others. The physical layer and connector specification for CXL3.0 is the same as PCIe 6.0, and in many cases the same silicon will speak both protocols, so the same interface can either connect conventional PCIe I/O devices, or use CXL to communicate with I/O devices and pooled or shared memory banks.

XConn 16-Port CXL2.0 Switch

The capacity of each CXL3.0 port is 16 lanes wide at 64 GT/s, which works out to 128 GB/s in each direction, or 256 GB/s in total. We can expect processors to have two or four ports. Each transfer is a 256-byte flow control unit (FLIT) that carries error control and header information along with over 200 bytes of data.
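As a rough back-of-the-envelope check of those numbers, here is a small sketch; the 236-byte payload figure is an assumption based on a PCIe 6.0-style 256-byte flit layout rather than the exact CXL3.0 breakdown:

```c
// Back-of-the-envelope CXL3.0 port bandwidth, per the figures quoted above.
// The flit payload size used for the efficiency estimate is an assumption
// (roughly a PCIe 6.0-style 256-byte flit with ~20 bytes of header/CRC/FEC).
#include <stdio.h>

int main(void) {
    const double gt_per_s   = 64.0;   // 64 GT/s per lane (PCIe 6.0 PHY)
    const int    lanes      = 16;     // x16 port
    const double flit_bytes = 256.0;  // flow control unit size
    const double data_bytes = 236.0;  // assumed payload per flit

    double gbytes_per_dir = gt_per_s * lanes / 8.0;   // 128 GB/s per direction
    double gbytes_total   = 2.0 * gbytes_per_dir;     // 256 GB/s both directions
    double efficiency     = data_bytes / flit_bytes;  // flit payload fraction

    printf("Per direction: %.0f GB/s, both directions: %.0f GB/s\n",
           gbytes_per_dir, gbytes_total);
    printf("Payload bandwidth per direction: ~%.0f GB/s (%.0f%% flit efficiency)\n",
           gbytes_per_dir * efficiency, 100.0 * efficiency);
    return 0;
}
```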

The way CXL2.0 can be used is to have pools of memory and I/O devices behind a switch, and then allocate the necessary capacity to each node from the pool. With CXL3.0, memory can also be configured as coherent shared memory and used to communicate between processors.
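As a purely speculative sketch of what consuming that shared memory might look like from software, assuming the operating system exposes a fabric-attached region as a device-DAX node (the /dev/dax0.0 path is hypothetical, and real OS support for CXL3.0 shared memory is still being defined):

```c
// Speculative sketch: two hosts communicating through a shared,
// fabric-attached memory region, assuming it is exposed to Linux as a
// device-DAX node. The device path is hypothetical, and real code would
// need fencing/flush semantics appropriate to the fabric.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (2UL * 1024 * 1024)  // map one 2 MiB aligned chunk

int main(void) {
    int fd = open("/dev/dax0.0", O_RDWR);  // hypothetical CXL-backed device
    if (fd < 0) { perror("open"); return 1; }

    // Both hosts map the same shared, cache-coherent region.
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // One host writes a message; a peer mapping the same region reads it.
    strcpy(region, "hello from the petal");
    printf("wrote: %s\n", region);

    munmap(region, REGION_SIZE);
    close(fd);
    return 0;
}
```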

I’ve updated my latency vs. bandwidth diagram and also included NVIDIA’s proprietary NVLINK interconnect that they use to connect GPUs together (although I couldn’t find a reference to NVLINK latency).

Comparing CXL3.0 to Ethernet, Slingshot, and Infiniband, it offers lower latency and higher bandwidth, but its reach is limited to connections inside a rack. The way current Slingshot networks are laid out in Dragonfly+ configurations, a group of CPU/GPU nodes in a rack is fully connected through switch groups, with relatively fewer connections between switch groups. A possible CXL3.0 architecture could replace the local copper-wired connections and use Ethernet for the fiber-optic connections between racks. Unlike Dragonfly, where local clusters have a fixed configuration of memory, CPU, GPU, and connections, the flexibility of a CXL3.0 fabric allows the cluster to be configured with the right balance of each for a specific workload. As mentioned, I call this architecture Petalith: each rack-scale fabric is a much larger “petal” containing hundreds of CPUs and GPUs and tens to hundreds of terabytes of memory, and a cluster of racks would be connected by many 800 Gbit Ethernet links.

When I asked CXL experts whether they saw CXL3.0 competing with local interconnects within a rack, they agreed that was part of the plan. However, when I asked what the programming model would be, message passing with MPI or shared memory with OpenMP, it appeared that this question would be worked out later, as the overall CXL roadmap is defined and the various efforts that have been consolidated into CXL are streamlined. There are several different programming model approaches I can think of, but I think it’s also worth looking at the work on the Twizzler memory-oriented operating system happening at UC Santa Cruz. The way I think this could play out is that the rack-level fabric would be reconfigured as each workload is deployed, to provide the optimal “petal” size for running OpenMP shared memory for that workload, with the petals connected via MPI. Sometimes it would look like a plumeria with a few big petals, and other times like a sunflower with lots of little petals.
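To make the petal idea concrete, here is a minimal hybrid sketch of how such a workload might be expressed with today’s tools: one MPI rank per petal (a coherent shared-memory domain on the rack fabric), OpenMP threads inside each petal, and message passing between petals over the inter-rack network. This is only an illustration of the programming-model question, not anything defined by the CXL roadmap.

```c
// Hedged sketch: one MPI rank per "petal" (a coherent shared-memory domain),
// OpenMP threads inside the petal, MPI messages between petals.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;
    // Shared-memory parallelism inside the petal.
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++) {
        local_sum += 1.0 / (1.0 + i + rank);
    }

    // Message passing between petals over the inter-rack network.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("petals=%d threads/petal=%d sum=%f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

In this model, resizing a petal changes the OpenMP thread count per rank rather than the MPI decomposition, which is what a reconfigurable rack-level fabric would make practical.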

This week, AWS announced its first HPC-optimized processor design, and I expect all of this to play out over the next two to three years. I expect roadmap-level architectural concepts along these lines at SC23, first prototypes with CXL3.0 silicon at SC24, and hopefully some interesting HPCG results at SC25. I’ll be there to see what happens, and to see whether Petalith catches on.


Adrian Cockcroft is a partner and analyst at technology consultancy OrionX.

