PPoPP 2024
Sat 2 - Wed 6 March 2024 Edinburgh, United Kingdom

Distributed large model inference still faces a dilemma in balancing cost and effect. Online scenarios demand intra-operator parallelism to achieve low job completion time (JCT), but its intensive communication makes it costly. Conversely, inter-operator parallelism can achieve high job processing capacity (JPC) with far less communication, but it fails to reduce execution time.

In this paper, we present Liger, a distributed large model inference runtime system capable of achieving low JCT at high JPC on multi-GPU architectures. The key idea is a novel interleaved parallelism, which interleaves computation and communication across requests. Liger enables this parallelism by carefully scheduling computation and communication kernels across requests onto multiple streams of multiple GPUs. It achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. To prevent scheduling failures caused by resource contention, Liger introduces a contention factor strategy to anticipate the penalty of contention. It also enables a higher degree of overlap by decomposing lengthy kernels into smaller, more manageable units at runtime.
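
The interleaving idea can be illustrated with a toy scheduling model. The sketch below is not Liger's implementation: the `Kernel` type, the two abstract streams, the greedy earliest-start policy, and the way `contention_factor` stretches an overlapped kernel are all assumptions made for illustration, and kernel decomposition is not modeled. It only shows how letting one request compute while another communicates can shorten the total time compared with running requests back to back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    kind: str        # "comp" (computation) or "comm" (communication)
    duration: float  # abstract time units, hypothetical values


def serial_makespan(requests):
    """Run every kernel of every request back to back, with no overlap."""
    return sum(k.duration for req in requests for k in req)


def interleaved_makespan(requests, contention_factor=1.0):
    """Greedy toy scheduler with one computation and one communication stream.

    Kernels of the same request keep their order; kernels of different
    requests may overlap when they use different streams. A kernel that
    starts while the other stream is still busy is stretched by
    contention_factor (a crude stand-in for a contention penalty).
    """
    stream_free = {"comp": 0.0, "comm": 0.0}  # time each stream becomes idle
    req_ready = [0.0] * len(requests)         # time each request may continue
    next_idx = [0] * len(requests)            # next kernel index per request
    remaining = sum(len(req) for req in requests)

    while remaining:
        # Pick the pending kernel that can start earliest (ties: lowest request id).
        best = None
        for r, req in enumerate(requests):
            if next_idx[r] == len(req):
                continue
            k = req[next_idx[r]]
            start = max(req_ready[r], stream_free[k.kind])
            if best is None or start < best[0]:
                best = (start, r, k)
        start, r, k = best

        other = "comm" if k.kind == "comp" else "comp"
        overlapped = stream_free[other] > start
        end = start + k.duration * (contention_factor if overlapped else 1.0)

        stream_free[k.kind] = end
        req_ready[r] = end
        next_idx[r] += 1
        remaining -= 1

    return max(req_ready)


if __name__ == "__main__":
    # Two identical requests, each alternating computation and communication.
    one_request = [Kernel("comp", 2.0), Kernel("comm", 1.0),
                   Kernel("comp", 2.0), Kernel("comm", 1.0)]
    requests = [one_request, one_request]
    print("serial:", serial_makespan(requests))
    print("interleaved:", interleaved_makespan(requests))
    print("interleaved with contention:", interleaved_makespan(requests, 1.2))
```

In this toy setting the serial schedule takes 12 time units and the interleaved one takes 9, because the communication of one request hides behind the computation of the other. Raising `contention_factor` above 1.0 shrinks that gain, which is the kind of penalty a scheduler would need to anticipate.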

Extensive evaluations show that Liger, in most cases, outperforms existing parallelism approaches across models and devices, presenting the best JCT and JPC results. In a 4-device case, Liger reduces the average JCT by 36.0% while maintaining the same JPC compared to the inter-operator approach. Meanwhile, it improves the JPC by 1.34× with improved average JCT compared to the intra-operator approach.

Mon 4 Mar

Displayed time zone: London

11:30 - 12:50
Compilers and Runtimes for Parallel Systems
Main Conference at Moorfoot
Chair(s): Mohamed Riyadh Baghdadi
11:30
20m
Talk
Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference
Main Conference
Jiangsu Du Sun Yat-sen University, Jinhui Wei Sun Yat-sen University, Jiazhi Jiang Sun Yat-sen University, Shenggan Cheng National University of Singapore, Zhiguang Chen Sun Yat-sen University, Dan Huang, Yutong Lu Sun Yat-sen University
11:50
20m
Talk
A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs
Main Conference
Jinchen Xu Information Engineering University, Guanghui Song Li Auto Inc., Bei Zhou Information Engineering University, Fei Li Information Engineering University, Jiangwei Hao Information Engineering University, Jie Zhao State Key Laboratory of Mathematical Engineering and Advanced Computing
12:10
20m
Talk
Language-Agnostic Static Deadlock Detection for Futures
Main Conference
Stefan K. Muller Illinois Institute of Technology
12:30
20m
Talk
Recurrence Analysis for Automatic Parallelization of Subscripted Subscripts
Main Conference
Akshay Bhosale University of Delaware, USA, Rudolf Eigenmann University of Delaware