Cray and Altair have both been leaders in HPC simulation and computation for decades. Now, with the latest results just published from a joint study, we are happy to announce a new benchmark for RADIOSS crash testing for models at tens of millions of elements on tens of thousands of cores. Read more about our project below, and join us for a live web discussion on October 27.
The technology: Altair RADIOSS and Cray® XC40™ system
Cray and Altair are well-established leaders in HPC for manufacturing. Cray offers several compute and storage solutions, including the Cray® XC40™ supercomputer for maximum scalability and the Cray® CS400™ supercomputer, a highly scalable and modular platform that excels at capacity- and data-intensive workloads. Altair offers a market-leading crash simulation suite with integrated products and tools engineered to optimize design performance, throughput and usability. The element of this suite that most benefits from Cray systems is RADIOSS, a highly nonlinear structural analysis solver established as an automotive crash and impact standard for over 25 years.
The challenge: crash simulation at scale
While crash testing is a very well-understood field in simulation, performance challenges will always persist as systems get more complex. Also, new elements and considerations arise, such as increasingly nonlinear problem sets, multiphysics and the use of new composite and high-strength materials, to name a few. To address this, the hardware architecture must allow a single job to be completed in time to support the design schedule, and also be able to handle enterprise-wide design processes.
These needs drive the requirement for scaling of individual jobs to thousands of compute cores and overall system capacity of tens of thousands of cores. And, of course, this means applications must be able to perform well at very high core counts.
The testing: RADIOSS runs 10 million element model on XC40 system
To test a solution for efficient crash testing at scale, Cray and Altair collaborated to study RADIOSS performance at high core counts. The model we used was derived from the public domain NCAC Ford Taurus model, adapted to RADIOSS with a refined mesh of nearly 10 million elements. For the purpose of this scalability study, we restricted simulation time to the first 2 milliseconds, about 1 percent of a full crash simulation. For crash, this is generally representative enough to extrapolate to the entire run, and it lets analysts run many tests while changing the number of cores, optimizing the setup, and analyzing and improving performance.
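The extrapolation described above amounts to a linear scaling of wall-clock time. Here is a minimal Python sketch, assuming the cost per simulated millisecond stays roughly constant over the run. The function name and the 90-second sample timing are hypothetical and are not results from this study.

```python
def extrapolate_wall_time(sample_wall_s, sample_sim_ms=2.0, full_sim_ms=200.0):
    """Linearly extrapolate wall-clock time from a short sample to a full run.

    The default full_sim_ms of 200.0 follows from the text: 2 ms is about
    1 percent of a full crash simulation. Assumes a constant cost per
    simulated millisecond, which is only an approximation.
    """
    if sample_sim_ms <= 0:
        raise ValueError("sample_sim_ms must be positive")
    return sample_wall_s * (full_sim_ms / sample_sim_ms)


# Hypothetical timing, for illustration only (not a measured result).
estimate_s = extrapolate_wall_time(sample_wall_s=90.0)
print(f"Estimated full-run wall time: {estimate_s / 3600:.2f} hours")
```

Because the sample is so short, the same estimate can be repeated cheaply for each core count or setup under test.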
For this benchmark, to get the best performance, we used RADIOSS 14.0, the new version that will be released soon with HyperWorks 14.0. RADIOSS 14.0 is built with Intel MPI 5.0, which supports the MPICH Application Binary Interface (ABI) Compatibility Initiative. A binary built this way can run natively under Cray MPT 7.0 at optimal speed. There is no longer any need to modify RADIOSS 14.0 or to use a wrapper to run on Cray systems – the same executable can run on the XC40 system or the CS400 system as on any other Linux® cluster. In addition, RADIOSS 14.0 offers several enhancements for running huge models at higher core counts, such as optimized I/O and memory management. The domain decomposition and the sorting algorithm have also been revisited to improve parallel efficiency.
For hardware, Cray provided remote access to an XC40 system with Intel® Xeon® E5-2698 v3 processors (16 cores, 2.3 GHz).
The results – and some interesting takeaways
This was the first time Altair had ever tested RADIOSS at over 16,000 cores, and we were very pleased with the results. One clear proof point from this project is support for Altair’s claim that RADIOSS is one of the most scalable crash codes available on the market. Thanks to the software’s hybrid parallelization model (MPI/OpenMP), it is well suited to running big jobs efficiently at higher core counts.
The test also revealed some insights about how to optimize RADIOSS hybrid MPI/OpenMP scaling on Cray:
- For systems with lower node counts (say, 6 to 128), a pure MPI setting with no OpenMP threading delivers the best speed.
- As the node count increases, combining MPI ranks with OpenMP threading on each node (up to 32 OpenMP threads and one MPI rank when using 512 nodes) is a better approach to maintain performance.
- In situations where scalability may degrade (for example, because of a smaller per-processor domain size), hybridization helps: 16 threads was optimal at 8,192 cores, but 32 threads was optimal at 16,384 cores.
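The hybrid layouts in the list above can be worked out with a little arithmetic. The Python sketch below assumes 32 cores per node (two 16-core sockets, consistent with 512 nodes giving roughly 16,000 cores) and that every core is used. The names `HybridLayout` and `hybrid_layout` are hypothetical helpers, not part of RADIOSS or any Cray tool.

```python
from dataclasses import dataclass

CORES_PER_NODE = 32  # assumption: two 16-core sockets per node


@dataclass
class HybridLayout:
    nodes: int
    threads_per_rank: int
    ranks_per_node: int
    total_ranks: int
    total_cores: int


def hybrid_layout(nodes, threads_per_rank, cores_per_node=CORES_PER_NODE):
    """Split each node's cores into MPI ranks of `threads_per_rank` OpenMP threads."""
    if threads_per_rank < 1 or cores_per_node % threads_per_rank != 0:
        raise ValueError("threads_per_rank must evenly divide cores_per_node")
    ranks_per_node = cores_per_node // threads_per_rank
    return HybridLayout(
        nodes=nodes,
        threads_per_rank=threads_per_rank,
        ranks_per_node=ranks_per_node,
        total_ranks=nodes * ranks_per_node,
        total_cores=nodes * cores_per_node,
    )


# Pure MPI at 128 nodes, then the two hybrid settings mentioned in the list.
for nodes, threads in ((128, 1), (256, 16), (512, 32)):
    print(hybrid_layout(nodes, threads))
```

Under these assumptions, both hybrid settings work out to 512 MPI ranks: doubling the core count while doubling the threads per rank keeps the number of MPI domains, and therefore the per-domain size, unchanged.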
Based on the results, we find that for optimum scaling it’s best to keep a ratio of around 4,000 elements computed per core. For very large models (say, 100 million elements), scaling will remain very efficient well beyond 512 nodes (that is, on more than 16,384 cores).
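As a rough sizing aid, the 4,000-elements-per-core guideline can be turned into a core and node count. The Python sketch below assumes 32 cores per node and treats the model sizes as round numbers. The function names are hypothetical, and the output is a back-of-the-envelope estimate rather than a tuning recommendation.

```python
import math


def cores_for_target(elements, elements_per_core=4000):
    """Cores needed to hold roughly `elements_per_core` elements on each core."""
    return math.ceil(elements / elements_per_core)


def nodes_for_cores(cores, cores_per_node=32):
    """Whole nodes needed for `cores` (assumes 32 cores per node)."""
    return math.ceil(cores / cores_per_node)


for elements in (10_000_000, 100_000_000):
    cores = cores_for_target(elements)
    nodes = nodes_for_cores(cores)
    print(f"{elements:>11,} elements -> about {cores:,} cores ({nodes} nodes)")
```

By this estimate a 100-million-element model reaches the 4,000-elements-per-core ratio at about 25,000 cores, or roughly 780 nodes, which fits the expectation that such models keep scaling well past 512 nodes.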
Cray has found that working closely with the leading application developers, such as the Altair RADIOSS team, adds significant value for Cray’s and Altair’s mutual customers. As shown, the XC40 system and other Cray systems deliver strong performance, but ensuring that a solution reaches its full potential takes a partnership between Cray, the application developers and the CAE users. Download the application brief here.
We’ll be discussing this project and additional details about how to tune RADIOSS performance on Cray systems in an upcoming live webinar. To learn more about optimizing high-fidelity crash and safety simulation performance, join our experts’ discussion on October 27.
- New Results in Crash Testing with Cray and Altair RADIOSS 14 - September 29, 2015