Collections
Collections including paper arxiv:2408.04093
- INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
  Paper • 2307.03712 • Published • 1
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
  Paper • 2408.04093 • Published • 4
- Arcee's MergeKit: A Toolkit for Merging Large Language Models
  Paper • 2403.13257 • Published • 20
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 51

- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
  Paper • 2403.09919 • Published • 20
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
  Paper • 2305.09781 • Published • 4
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
  Paper • 2408.04093 • Published • 4

- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 78
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5

- Efficient Memory Management for Large Language Model Serving with PagedAttention
  Paper • 2309.06180 • Published • 25
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
  Paper • 2308.16137 • Published • 39
- Scaling Transformer to 1M tokens and beyond with RMT
  Paper • 2304.11062 • Published • 2
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 17
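
A listing like this can also be retrieved programmatically. Below is a minimal sketch using the `huggingface_hub` client; the `papers/<arxiv-id>` item filter follows the Hub's collection-query convention, and the `limit` value and printed fields are illustrative choices, not part of this page.

```python
from huggingface_hub import get_collection, list_collections

# Sketch: list community collections that include the Tree Attention
# paper (arxiv:2408.04093). list_collections returns truncated item
# previews, so each collection is re-fetched for its full item list.
for summary in list_collections(item="papers/2408.04093", limit=10):
    collection = get_collection(summary.slug)
    print(collection.title)
    for item in collection.items:
        if item.item_type == "paper":
            print(f"  Paper • {item.item_id}")
```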