Adaptive Resource Management for HPC Systems

semanticscholar(2022)

Abstract
This report documents the program and the outcomes of Dagstuhl Seminar 21441 “Adaptive Resource Management for HPC Systems”. The seminar investigated the impact of adaptive resource management of malleable applications on the management of the HPC system, the programming of the applications, the tools for performance analysis and tuning, and the efficient usage of HPC systems. The discussions led to a joint summary presenting the state of the art, the techniques required at the various levels of HPC systems, and the foreseen advantages of adaptive resource management.

[...] coherent I/Os, as well as domain-specific accelerators with staggering (several hundreds of Watts) peak power requirements. The peak power exceeds the TDP, and the package cost constrains the maximum TDP and sustainable peak power. Motherboards’ form factor, layout, and cost constrain the power distribution design and demand effective and reliable on-chip thermal management. Power, temperature, and energy are critical aspects that must be controlled and optimized online through a low-latency feedback loop with the on-chip power management IPs and sensors, the Operating System, the Security Subsystem, the off-chip Board Management Controller (BMC), and the power converters. We propose ControlPULP, a fully digital and highly capable RISC-V based parallel microcontroller IP optimized for power management of complex HPC processors. Its design comprises a single manager core and peripherals paired with a cluster of 8 processors to accelerate the Power Control Firmware workload, a Direct Memory Access (DMA) engine for accessing on-chip sensors, and a uDMA engine for off-chip AVSBUS/PMBUS peripheral support and BMC-based communication through the Management Component Transport Protocol (MCTP). The controller implements a basic doorbell-based System Control and Management Interface (SCMI) protocol, hosting up to 144 external interrupt lines.
On the software side, it relies on an open-source Real-time Operating System (FreeRTOS) for agile scheduling of the underlying Control Policy.

This talk describes work on facilitating current and next-generation scientific workflows through the integration of cloud computing with Flux, a novel graph-based Resource and Job Management Software (RJMS) developed at LLNL. The integration aims to advance converged computing, an environment that offers the best features of HPC (performance, efficiency, sophisticated scheduling) and the cloud (resiliency, elasticity, portability, and automation) to next-generation high-performance workflows. The talk will also detail work to build industry collaborations to make lasting, sustainable contributions to the broader computing community.

This talk introduces the ESPRESO FEM library developed at IT4Innovations. The library is described from a computer science point of view, and we highlight its potential for dynamic resource management. The key component of ESPRESO that enables its elasticity is the I/O module, which can checkpoint a simulation and restart it on a different number of MPI ranks. Finally, we propose the changes needed in this module to fully support iMPI.

[...] Sessions API and its implications, and will then discuss options for extensions, which could provide a truly dynamic and malleable MPI.

This talk presents the MERIC tool for dynamic tuning of HPC hardware or runtime systems while a parallel application is running. The goal is to minimize the energy to solution with a user-defined impact on application performance. Additionally, we discuss the potential of tuning the hardware under a power cap, which opens opportunities not only for energy savings but also for performance improvements. As the tool is continuously used to evaluate the potential for energy savings for different applications under H2020 and EuroHPC projects, it is being extended with new features.
This includes support for new hardware, such as GPUs or new CPU architectures, as presented. As the approach we use can perform dynamic tuning at a relatively high rate, it is suitable for dynamic resource management.
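
The tuning goal stated for MERIC (minimum energy to solution with a user-defined impact on performance) can be illustrated with a minimal sketch. This is not MERIC's API or algorithm: the `Measurement` record, the `pick_config` helper, the core-frequency knob, and all numbers are hypothetical and serve only to show a constrained selection over measured configurations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One measured run of a code region (hypothetical record, not MERIC's)."""
    core_freq_ghz: float  # assumed tuning knob
    runtime_s: float
    energy_j: float


def pick_config(measurements, baseline, max_slowdown):
    """Return the lowest-energy measurement whose runtime stays within
    baseline.runtime_s * (1 + max_slowdown); fall back to the baseline."""
    limit = baseline.runtime_s * (1.0 + max_slowdown)
    allowed = [m for m in measurements if m.runtime_s <= limit]
    allowed.append(baseline)  # the baseline always satisfies the limit
    return min(allowed, key=lambda m: m.energy_j)


# Illustrative numbers only, not measured data.
baseline = Measurement(core_freq_ghz=3.0, runtime_s=10.0, energy_j=2000.0)
samples = [
    Measurement(core_freq_ghz=2.6, runtime_s=10.4, energy_j=1800.0),
    Measurement(core_freq_ghz=2.2, runtime_s=11.5, energy_j=1650.0),
    Measurement(core_freq_ghz=1.8, runtime_s=13.0, energy_j=1700.0),
]

tight = pick_config(samples, baseline, max_slowdown=0.05)
loose = pick_config(samples, baseline, max_slowdown=0.20)
print(tight.core_freq_ghz, loose.core_freq_ghz)
```

A tighter slowdown budget excludes the slower configurations, so the selection trades energy savings against the performance impact the user is willing to accept.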