CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs Paper • 2409.12490 • Published Sep 19 • 2
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management Paper • 2406.19707 • Published Jun 28 • 2
Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding Paper • 2409.08561 • Published Sep 13 • 2
Diver: Large Language Model Decoding with Span-Level Mutual Information Verification Paper • 2406.02120 • Published Jun 4 • 2
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models Paper • 2405.07542 • Published May 13 • 2
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation Paper • 2407.11798 • Published Jul 16 • 2
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling Paper • 2408.08696 • Published Aug 16 • 2
Learning Harmonized Representations for Speculative Sampling Paper • 2408.15766 • Published Aug 28 • 2
KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning Paper • 2408.08146 • Published Aug 15 • 2
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration Paper • 2404.12022 • Published Apr 18 • 2
Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy Paper • 2404.06954 • Published Apr 10 • 2
MoDeGPT: Modular Decomposition for Large Language Model Compression Paper • 2408.09632 • Published Aug 19 • 2
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference Paper • 2409.04992 • Published Sep 8 • 2
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads Paper • 2407.17678 • Published Jul 25 • 2