On the 50th Anniversary of Cray Research Inc.’s founding by Seymour Cray, I thought I’d summarise my brief but intensely interesting experience with the last crop of Cray Supercomputer systems.
This generation started with the introduction of the Aries interconnect. Deployed in a Dragonfly topology, this impressive network connected every node within a cabinet group to every other node in the same group using copper, and then connected every cabinet group to every other cabinet group via optical fibre. As a result, any node in the system could reach any other node by crossing at most three network segments: copper, fibre, then copper.
A network this interconnected required some serious networking hardware, which is why Cray created a dedicated ASIC attached to each blade to handle the networking!
Node#1 <-Copper-> Node#5
Node#1 <-Copper-> Cabinet#1 <-Fibre-> Cabinet#64 <-Copper-> Node#3601
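The two paths sketched above can be modelled in a few lines of Python. This is only a toy model of the copper/fibre split described here, not how Aries routing is actually implemented: the group size, the node numbering and the function names `group_of` and `route` are all made up for illustration.

```python
# Toy model of the copper/fibre split described above.
# NODES_PER_GROUP is a made-up value for illustration, not a real Aries figure.
NODES_PER_GROUP = 4


def group_of(node, nodes_per_group=NODES_PER_GROUP):
    """Return the 1-based cabinet group that a 1-based node number falls in."""
    return (node - 1) // nodes_per_group + 1


def route(src, dst, nodes_per_group=NODES_PER_GROUP):
    """Return the link types crossed between two nodes in this simplified model."""
    if src == dst:
        return []  # same node, nothing to cross
    if group_of(src, nodes_per_group) == group_of(dst, nodes_per_group):
        return ["copper"]  # same cabinet group: copper only
    # Different groups: copper out of the source group, fibre between
    # groups, then copper into the destination node.
    return ["copper", "fibre", "copper"]


if __name__ == "__main__":
    print(route(1, 3))  # both nodes are in group 1
    print(route(1, 9))  # group 1 to group 3
```

However far apart two nodes are numbered, the model never returns more than three segments, which is the property that makes the topology attractive.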
These systems can support thousands of nodes spread over hundreds of cabinets, but the administration is surprisingly centralised on a few tiered nodes: the SMW, the boot node and the SDB.
The SMW, or System Management Workstation, is the first point of contact for an administrator. Connecting over SSH, you land on a highly resilient node where you will find the logs for the entire supercomputer, as well as node configurations and several proprietary administration scripts for checking the system’s state.
The boot node handles, well, booting nodes. It sounds simple, but it boots thousands of nodes simultaneously, sending boot images and controlling the full lifecycle from hardware initialisation through to graceful, or otherwise, shutdown. This task is further complicated by the Aries interconnect, which requires special operations in order to add or remove a node from the network.
Lastly, the SDB (“Service Database”) node runs the various workload schedulers that handle the lifecycle of user workloads. The SDB always runs Cray’s ALPS scheduler, which is how the system internally places work, but this scheduler can be fronted by a better-known scheduler such as PBS or Slurm.
Cray systems aren’t only impressive collections of commodity and specialised hardware; huge effort is put into the software stack too. Cray spins its own Linux distro based on SUSE and controls the update cycle to ensure that every point release of the entire software stack can be supported properly, and that any regressions or bugs are well documented so HPC sites can make informed decisions on when to update. It’s not just yum update, nowhere near it!
Sites that opt for Cray systems are often running highly specialised code, and running it thousands or millions of times over the lifetime of the system, so they want to squeeze every millisecond of performance out of each run. Cray provides a proprietary set of compilers, performance libraries and profilers, bundled alongside Cray-optimised builds of open-source packages such as NumPy, Fast Fourier Transform libraries, BLAS, etc.
These days those proprietary tools are based on open-source projects like Clang/LLVM and simply layer the Cray optimisations on top. This lets them benefit from upstream performance improvements, removes the need to maintain a custom front-end, and makes it easier for new users to adapt, since support for Clang/LLVM is fairly widespread.
It’s difficult to separate the hardware from the software when it comes to storage appliances, and Cray ClusterStor is no exception. It deploys the open-source distributed filesystem Lustre across thousands of disks packed into highly resilient RAID arrays, connected via high-speed InfiniBand to dedicated I/O nodes distributed across the supercomputer, which share the load of mounting the several massive filesystems on every compute node!
Note: ClusterStor was previously named Sonexion, which I preferred for its totally confusing origin!