CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs Paper • 2409.12490 • Published Sep 19 • 2
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management Paper • 2406.19707 • Published Jun 28 • 2
Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding Paper • 2409.08561 • Published Sep 13 • 2
Diver: Large Language Model Decoding with Span-Level Mutual Information Verification Paper • 2406.02120 • Published Jun 4 • 2
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models Paper • 2405.07542 • Published May 13 • 2
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation Paper • 2407.11798 • Published Jul 16 • 2
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling Paper • 2408.08696 • Published Aug 16 • 2
Learning Harmonized Representations for Speculative Sampling Paper • 2408.15766 • Published Aug 28 • 2
KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning Paper • 2408.08146 • Published Aug 15 • 2
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration Paper • 2404.12022 • Published Apr 18 • 2
Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy Paper • 2404.06954 • Published Apr 10 • 2
MoDeGPT: Modular Decomposition for Large Language Model Compression Paper • 2408.09632 • Published Aug 19 • 2
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference Paper • 2409.04992 • Published Sep 8 • 2
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads Paper • 2407.17678 • Published Jul 25 • 2