Publications

(2025). Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving. arXiv preprint.
(2025). Region-Adaptive Sampling for Diffusion Transformers. arXiv preprint.
(2024). WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training. In PPoPP 2025, Proceedings of the ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. *Equal contribution.
(2024). Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning. To appear in ASPLOS 2025, International Conference on Architectural Support for Programming Languages and Operating Systems.
(2024). StarTrail: Concentric Ring Sequence Parallelism for Efficient Near-Infinite-Context Transformer Model Training. In NeurIPS 2025, Conference on Neural Information Processing Systems.
(2024). HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices. In MLSys 2024, Proceedings of Machine Learning and Systems.
(2024). AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference. In ICLR 2024, International Conference on Learning Representations.
(2024). DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers. arXiv preprint.
(2023). Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency. In SC '23, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. *Equal contribution.
(2023). ATP: Adaptive Tensor Parallelism for Foundation Models. arXiv preprint.