HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

May 13, 2024

Xuanlei Zhao

Bin Jia

Haotian Zhou

Ziming Liu

Shenggan Cheng

Yang You


Abstract

The emergence of Large Language Models (LLMs) has led to ever-larger model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from inefficiency caused by I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
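
As a rough illustration of the general idea of heterogeneous parallel computing mentioned in the abstract, the sketch below splits the weight matrix of one linear layer column-wise between two workers, runs both partial products concurrently, and concatenates the results. It is not the HeteGen implementation: the two threads only stand in for a CPU and a GPU, and the names `matmul`, `split_columns`, `hetero_linear`, and the split ratio `alpha` are hypothetical, chosen here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(x, w):
    """Multiply a row vector x (length k) by a k-by-n matrix w (list of rows)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def split_columns(w, alpha):
    """Split w column-wise. The first alpha share of the columns is assigned
    to the stand-in 'CPU' worker, the remainder to the stand-in 'GPU' worker."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    cut = int(round(alpha * len(w[0])))
    cpu_part = [row[:cut] for row in w]
    gpu_part = [row[cut:] for row in w]
    return cpu_part, gpu_part


def hetero_linear(x, w, alpha=0.5):
    """Compute x @ w by running the two column partitions concurrently.
    alpha is a hypothetical split ratio, not a value taken from the paper."""
    cpu_w, gpu_w = split_columns(w, alpha)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_future = pool.submit(matmul, x, cpu_w)  # stand-in for the CPU share
        gpu_future = pool.submit(matmul, x, gpu_w)  # stand-in for the GPU share
        # Output columns keep their original order: CPU share first, then GPU share.
        return cpu_future.result() + gpu_future.result()


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3], [4, 5, 6]]
    print(hetero_linear(x, w, alpha=0.5))
```

Because each output column depends on only one column of the weight matrix, the two partitions are independent and the combined result matches the unsplit product for any `alpha` in [0, 1]. In a real system the split ratio would presumably be tuned to the relative speed of the devices and the I/O bandwidth between them.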

Type

Conference paper

Publication

In MLSys 2024, Proceedings of Machine Learning and Systems

Last updated on May 13, 2024

Heterogeneous Parallel Inference, Large Language Models, Resource-Constrained Devices

Authors

Ziming Liu

Ph.D. Candidate
