Abstract
This talk will provide an overview of an extreme high-end simulation computing site, focusing on the storage challenges of yesterday, today, and tomorrow.
With single simulations running on over a million cores, occupying over a petabyte of DRAM, updating 90% of that memory as often as every few milliseconds, and running continuously for over six months, the defensive I/O challenges have been tremendous, leading to innovations such as parallel log-structured file systems and burst buffer technology.
Keeping warm data online for multi-year weapons science campaigns and handling in excess of 400 disk failures per day has required building data lakes that scale to hundreds of gigabytes per second, providing POSIX-like namespaces to keep users from revolting, and producing multi-tier erasure coding solutions with RDMA data movement to survive those failures.
Metadata management appears to be the next extreme HPC storage problem area.
The limited ways to collect metadata and the multi-dimensional problem space in which weapons science lives give rise to the need to explore indexing using parallel key-value frameworks, triple stores, and Hexastore mechanisms that currently scale to tracking trillions of particles per time step and hundreds of thousands of time steps in a few weeks of calculation time.
Some historical perspective on the innovations and methods used to solve these extreme HPC storage problems, as well as current and future-facing research efforts in extreme HPC storage, will be covered.