Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning

Oct 27, 2024
Shenggan Cheng
,
Shengjie Lin
,
Lansong Diao
,
Hao Wu
,
Siyu Wang
,
Chang Si
,
Ziming Liu
,
Xuanlei Zhao
,
Jiangsu Du
,
Wei Lin
,
Yang You
Abstract
With the exponential growth of deep learning (DL), there is an escalating need for scalability. Despite significant advances in communication hardware, the time spent on communication remains a bottleneck during training. Existing optimizations are tightly coupled to specific parallel systems in order to implement particular computation-communication overlaps, which poses challenges for performance, programmability, and generality. In this paper, we introduce Concerto, a compiler framework that addresses these challenges by automatically optimizing and scheduling communication. We formulate the scheduling problem as a resource-constrained project scheduling problem and use an off-the-shelf solver to obtain near-optimal schedules, and we apply automatic decomposition to create overlap opportunities for critical (synchronous) communication. Our evaluation shows that Concerto can match or outperform state-of-the-art parallel frameworks, including Megatron-LM, DeepSpeed, and Alpa, all of which rely on extensive hand-crafted communication optimizations. Unlike previous work, Concerto decouples communication optimization from the parallelization approach and can therefore generalize to a wide variety of parallelisms without manual optimization.
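As a rough illustration of the scheduling formulation mentioned in the abstract, the sketch below models a handful of compute and communication operations as a small resource-constrained project scheduling problem and solves it with an off-the-shelf constraint solver (Google OR-Tools CP-SAT). The task names, durations, resource labels, and the choice of solver are illustrative assumptions, not details of Concerto's actual implementation.

```python
# Hypothetical sketch: schedule compute and communication ops as an RCPSP
# and minimize the makespan with OR-Tools CP-SAT. Not Concerto's code.
from ortools.sat.python import cp_model

# (name, duration, resource): compute ops occupy the "compute" stream,
# communication ops occupy the "network" stream; ops sharing a resource
# cannot overlap, while ops on different resources may run concurrently.
tasks = [
    ("fwd_layer0", 4, "compute"),
    ("allgather_w1", 3, "network"),     # prefetch weights for layer 1
    ("fwd_layer1", 4, "compute"),
    ("reducescatter_g0", 3, "network"),
]
# Precedence edges: (before, after).
deps = [("allgather_w1", "fwd_layer1"), ("fwd_layer0", "reducescatter_g0")]

model = cp_model.CpModel()
horizon = sum(d for _, d, _ in tasks)
start, end, interval = {}, {}, {}
for name, dur, _ in tasks:
    start[name] = model.NewIntVar(0, horizon, f"s_{name}")
    end[name] = model.NewIntVar(0, horizon, f"e_{name}")
    interval[name] = model.NewIntervalVar(start[name], dur, end[name], f"i_{name}")

# Precedence constraints between dependent ops.
for a, b in deps:
    model.Add(end[a] <= start[b])

# Each resource (compute stream / network stream) runs one op at a time.
for res in {r for _, _, r in tasks}:
    model.AddNoOverlap([interval[n] for n, _, r in tasks if r == res])

# Objective: minimize the time at which the last op finishes.
makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, [end[n] for n, _, _ in tasks])
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for name, _, _ in tasks:
        print(name, solver.Value(start[name]), "->", solver.Value(end[name]))
```

In this toy instance the solver overlaps the weight all-gather for layer 1 with the forward pass of layer 0, which is the kind of computation-communication overlap the paper's scheduler discovers automatically at much larger scale.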
Type
Publication
ASPLOS 2025