KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers.

HPCA (2023)

Abstract
Machine learning (ML) inference workloads present significantly different challenges than ML training workloads. Inference workloads are typically shorter running and under-utilize GPU resources. To overcome this, co-locating multiple instances of a model has been proposed to improve GPU utilization. Co-located models share the GPU through spatial partitioning facilities such as Nvidia's MPS, MIG, or AMD's CU Masking API. Existing spatially partitioned inference servers create model-wise partitions by "right-sizing" based on a model's latency tolerance to restricted resources. We show that model-wise right-sizing still under-utilizes the GPU because individual kernels within an inference pass tolerate resource restriction to varying degrees. We propose Kernel-wise Right-sizing for Spatial Partitioned GPU Inference Servers (KRISP) to enable right-sizing of spatial partitions at the granularity of individual kernels. We demonstrate that KRISP can support a greater number of concurrently running inference models than existing spatially partitioned inference servers. KRISP improves overall throughput by 2x compared with isolated inference (1.22x vs. prior works) and reduces energy per inference by 33%.
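The core idea of kernel-wise right-sizing can be illustrated with a small sketch (this is not the paper's implementation): for each kernel, given profiled latencies at different partition sizes, pick the smallest partition whose slowdown versus the full GPU stays within a latency tolerance. The kernel names, partition sizes, and latency numbers below are hypothetical.

```python
def right_size_kernel(profile, full_size, tolerance=0.10):
    """Return the smallest partition size whose latency is within
    `tolerance` of the full-GPU baseline.

    profile: dict mapping partition size (e.g., number of enabled
    compute units) to measured kernel latency in ms.
    """
    baseline = profile[full_size]
    # Try the smallest partitions first; accept the first one that
    # keeps the slowdown within the tolerance budget.
    for size in sorted(profile):
        if profile[size] <= baseline * (1 + tolerance):
            return size
    return full_size

# Hypothetical per-kernel profiles (latency in ms at 15/30/60 CUs).
profiles = {
    "gemm":    {15: 2.40, 30: 1.30, 60: 1.20},  # compute-bound: needs many CUs
    "softmax": {15: 0.21, 30: 0.20, 60: 0.20},  # tolerant: few CUs suffice
}
sizes = {k: right_size_kernel(p, full_size=60) for k, p in profiles.items()}
# gemm keeps a large partition while softmax shrinks, freeing CUs
# for co-located models during the softmax kernel's execution.
```

Because tolerant kernels (like the softmax above) release most of the GPU, the freed compute units can be assigned to other co-located inference instances, which is where the concurrency and energy gains come from.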
Keywords
GPU Inference Server, Compute Unit Masking, GPU Spatial Partitioning