HPC networks are generally quite different from the internet or typical local area networks, both in the hardware and the software involved. The software and hardware layers in HPC typically need to provide more stringent guarantees for data delivery. Routing between nodes in a supercomputer, for example, is usually statically determined.
Topology (how nodes are interconnected) is one of the major performance concerns in the design of HPC networks. A topology must take into account various factors, including packaging constraints and wire length. Most importantly, it must provide good performance for a wide array of applications (unless the system is application-specific); that is, a good physical network topology will accommodate many application topologies.
A topology should be chosen for good path diversity, i.e., there should be more than one minimal path between any given source and destination.
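As a small illustration of path diversity (not from any particular system), consider a 2D mesh: a minimal route between two nodes is any interleaving of the required X-steps and Y-steps, so the number of distinct minimal paths is the binomial coefficient C(dx + dy, dx). The sketch below just computes that count; the coordinates are hypothetical.

```c
/* Path diversity illustration: in a 2D mesh, the number of distinct minimal
 * paths between two nodes that are dx hops apart in X and dy hops apart in Y
 * is C(dx + dy, dx), since a minimal route is any interleaving of dx X-steps
 * and dy Y-steps. Illustrative sketch only. */
#include <stdio.h>
#include <stdlib.h>

/* Count minimal paths between (x1,y1) and (x2,y2) in a 2D mesh. */
static unsigned long long minimal_paths(int x1, int y1, int x2, int y2)
{
    int dx = abs(x2 - x1), dy = abs(y2 - y1);
    unsigned long long paths = 1;
    /* Compute C(dx + dy, dx) incrementally to avoid huge factorials. */
    for (int i = 1; i <= dx; i++)
        paths = paths * (dy + i) / i;
    return paths;
}

int main(void)
{
    /* Adjacent nodes have a single minimal path; farther pairs have many. */
    printf("(0,0) -> (1,0): %llu minimal path(s)\n", minimal_paths(0, 0, 1, 0));
    printf("(0,0) -> (3,2): %llu minimal path(s)\n", minimal_paths(0, 0, 3, 2));
    return 0;
}
```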
High-performance networks are typically statically routed. Adaptive routing is rare because of its performance cost and the risk of introducing deadlock (and livelock); it is typically used only for fault tolerance.
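One common example of a static, deadlock-free scheme is deterministic dimension-order (XY) routing in a 2D mesh: packets are routed fully in X first, then in Y, so all packets obey the same turn restrictions and no cyclic channel dependency can form. The sketch below is illustrative; the port names and coordinates are made up.

```c
/* Sketch of deterministic dimension-order (XY) routing in a 2D mesh:
 * every packet is routed fully in the X dimension first, then in Y.
 * Because all packets follow the same turn discipline, the routing
 * function is deadlock-free. Illustrative only. */
#include <stdio.h>

enum port { PORT_LOCAL, PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS };

/* Pick the output port at node (x, y) for a packet headed to (dst_x, dst_y). */
static enum port xy_route(int x, int y, int dst_x, int dst_y)
{
    if (dst_x > x) return PORT_XPLUS;   /* correct X first */
    if (dst_x < x) return PORT_XMINUS;
    if (dst_y > y) return PORT_YPLUS;   /* then correct Y */
    if (dst_y < y) return PORT_YMINUS;
    return PORT_LOCAL;                  /* arrived: eject to the local node */
}

int main(void)
{
    static const char *names[] = { "local", "x+", "x-", "y+", "y-" };
    /* Walk a packet from (0,0) to (2,3), printing the port taken at each hop. */
    int x = 0, y = 0;
    while (x != 2 || y != 3) {
        enum port p = xy_route(x, y, 2, 3);
        printf("at (%d,%d) take %s\n", x, y, names[p]);
        if (p == PORT_XPLUS) x++;
        else if (p == PORT_XMINUS) x--;
        else if (p == PORT_YPLUS) y++;
        else if (p == PORT_YMINUS) y--;
    }
    printf("arrived at (%d,%d)\n", x, y);
    return 0;
}
```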
HPC network hardware tends to be custom designed; the top-tier systems even have proprietary interconnects, with custom software and hardware. One example is the new Cray Gemini interconnect. System boards carry custom ASIC router chips, which connect to the processors over links such as PCIe or HyperTransport.
A big design point here is the radix of the router, i.e., its number of I/O ports; with recent technology advances, high-radix routers have become more popular. Other questions include how many input/output buffers to provide per port (i.e., virtual channels), what interconnect to use between the ports (crossbar, bus, etc.), and whether the router stages (route computation, arbitration, input buffer allocation) are pipelined. On-chip routers typically forward a flit (flow control digit) in 2 cycles; they also speculate on the route computation, which can cause stalls in the worst case.
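A back-of-the-envelope sketch of how these parameters interact is below; the numbers are hypothetical and do not describe any particular router.

```c
/* Rough sketch of the router design parameters discussed above
 * (radix, virtual channels, buffering, pipeline depth). The values
 * here are illustrative assumptions, not real hardware specs. */
#include <stdio.h>

struct router_params {
    int radix;              /* number of I/O ports */
    int vcs_per_port;       /* virtual channels (buffers) per input port */
    int flit_bytes;         /* flit (flow control digit) size */
    int buf_depth_flits;    /* depth of each VC buffer, in flits */
    int pipeline_stages;    /* e.g. route compute, arbitration, traversal */
};

int main(void)
{
    struct router_params r = { 64, 4, 16, 8, 2 };   /* assumed values */

    /* Crossbar size and total input buffering implied by these choices. */
    int total_buffer = r.radix * r.vcs_per_port * r.buf_depth_flits * r.flit_bytes;
    printf("crossbar: %d x %d ports\n", r.radix, r.radix);
    printf("input buffering: %d bytes\n", total_buffer);
    printf("per-hop router latency: %d cycles (plus link traversal)\n",
           r.pipeline_stages);
    return 0;
}
```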
Large-scale parallel processing usually involves message passing, since the big machines tend to be designed in the distributed-memory fashion. Most of this message passing is done through libraries, the most prominent of which is, without a doubt, MPI (the Message Passing Interface). MPI is just that, an interface; there are many implementations, e.g. Open MPI and MPICH2. Recently there has been a push for distributed shared memory, i.e., something like a shared-memory programming model with message passing going on behind the curtains (UPC, Chapel). Languages with this model cooked in are often called PGAS (Partitioned Global Address Space) languages.
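A minimal point-to-point MPI example in C is sketched below: rank 0 sends one integer to rank 1 over the standard MPI_Send/MPI_Recv calls. It assumes some MPI implementation (e.g. Open MPI or MPICH2) is installed; compile with mpicc and launch with mpirun.

```c
/* Minimal MPI point-to-point sketch: rank 0 sends an integer to rank 1.
 * Build with `mpicc` and run with `mpirun -np 2 ./a.out`. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        int payload = 42;
        /* Explicit message passing: ranks share no memory. */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```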
MPI