WallFacer: Guiding Transformer Model Training Out of the Long-Context Dark Forest with N-body Problem

Jun 30, 2024
Ziming Liu
,
Shaoyu Wang
,
Shenggan Cheng
,
Zhongkai Zhao
,
Xuanlei Zhao
,
James Demmel
,
Yang You
WallFacer Attention Block
Abstract
In recent years, Transformer-based Large Language Models (LLMs) have garnered significant attention due to their exceptional performance across a variety of tasks. However, training these models on long sequences presents a substantial challenge in terms of efficiency and scalability. Current methods are constrained either by the number of attention heads, which limits scalability, or by excessive communication overhead. In this paper, we present the insight that attention computation can be viewed as a special case of the n-body problem with direct interactions. Building on this concept, we introduce WallFacer, an efficient long-sequence training system with a novel multi-dimensional ring sequence parallelism that provides an efficient communication paradigm and additional tuning space for communication arrangement. Through comprehensive experiments across diverse environments and model settings, we demonstrate that WallFacer significantly surpasses the state-of-the-art method that supports near-infinite sequence length, achieving performance improvements of up to 77.12%.
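To make the n-body analogy concrete, the sketch below simulates a basic ring-style sequence-parallel attention pass on a single process: each "body" (a query shard owned by one worker) accumulates its direct interactions with every key/value shard as those shards circulate around a ring, combining partial results with an online softmax. This is only an illustrative sketch of the general ring pattern, not WallFacer's multi-dimensional scheme or the authors' implementation; the worker count, shard sizes, and function names are assumptions made for the example.

```python
# Illustrative single-process simulation of ring-style sequence parallelism
# (not the WallFacer implementation). Each query shard "interacts" with every
# key/value shard as shards would circulate around a ring of workers.
import numpy as np

def ring_attention_simulation(q, k, v, num_workers=4):
    """q, k, v: arrays of shape (seq_len, head_dim); seq_len divisible by num_workers."""
    seq_len, head_dim = q.shape
    shard = seq_len // num_workers
    q_shards = [q[i * shard:(i + 1) * shard] for i in range(num_workers)]
    kv_shards = [(k[i * shard:(i + 1) * shard], v[i * shard:(i + 1) * shard])
                 for i in range(num_workers)]

    outputs = []
    for rank in range(num_workers):
        qi = q_shards[rank]
        # Running statistics for this query shard's online (streaming) softmax.
        row_max = np.full((shard, 1), -np.inf)
        row_sum = np.zeros((shard, 1))
        acc = np.zeros((shard, head_dim))
        for step in range(num_workers):
            # In a real system this shard would arrive from the ring neighbor;
            # here we simply index the shard that would have been received.
            kj, vj = kv_shards[(rank + step) % num_workers]
            scores = qi @ kj.T / np.sqrt(head_dim)  # direct pairwise "n-body" interaction
            new_max = np.maximum(row_max, scores.max(axis=1, keepdims=True))
            scale = np.exp(row_max - new_max)
            p = np.exp(scores - new_max)
            row_sum = row_sum * scale + p.sum(axis=1, keepdims=True)
            acc = acc * scale + p @ vj
            row_max = new_max
        outputs.append(acc / row_sum)
    return np.concatenate(outputs, axis=0)

# Sanity check against dense attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
scores = q @ k.T / np.sqrt(8)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(ring_attention_simulation(q, k, v), ref, atol=1e-6)
```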
Type
Publication
arXiv Preprint