arxiv:2405.04434

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Published on May 7, 2024
Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

Community

Architecture:
Decoder-only Transformer with Multi-head Latent Attention (MLA) and a Mixture-of-Experts (MoE) feed-forward design (DeepSeekMoE). MLA reduces the key-value (KV) cache needed during inference by jointly compressing keys and values into a low-rank latent vector.
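A minimal, illustrative sketch of the low-rank KV compression idea (hypothetical dimensions; the real MLA additionally uses a decoupled RoPE key branch and a compressed query path):

```python
import torch
import torch.nn as nn

# Illustrative MLA-style low-rank KV compression (not the official implementation).
d_model, n_heads, d_head, d_latent = 1024, 8, 64, 128

W_DKV = nn.Linear(d_model, d_latent, bias=False)           # down-projection: h_t -> c_t (latent)
W_UK  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection to per-head keys
W_UV  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection to per-head values

h = torch.randn(2, 16, d_model)               # (batch, seq, d_model) hidden states

# Only the small latent c_kv has to be cached during generation:
c_kv = W_DKV(h)                               # (batch, seq, d_latent)  <- the KV cache
k = W_UK(c_kv).view(2, 16, n_heads, d_head)   # keys reconstructed from the latent
v = W_UV(c_kv).view(2, 16, n_heads, d_head)   # values reconstructed from the latent

# Cached floats per token: d_latent instead of 2 * n_heads * d_head.
print(d_latent, "vs", 2 * n_heads * d_head)
```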

Training:
📚 Pretrained on 8.1T tokens, mostly English and Chinese, with a sequence length of 4096.
🎓 Supervised fine-tuning (SFT) on 1.5M samples: 1.2M for helpfulness and 0.3M for safety.
🏆 Used Group Relative Policy Optimization (GRPO) to align the model's outputs with human preferences, with a particular focus on instruction following.

Learning Rate Schedule (warmup-and-step-decay strategy; a sketch follows this list):
⬆️ The learning rate increases linearly from 0 to its maximum value (2.4e-4) over the first 2K steps (warmup).
⬇️ After about 60% of the training tokens, the learning rate is multiplied by 0.316.
⬇️ After about 90% of the training tokens, it is multiplied by 0.316 again.

Other insights:
📈 Used batch size scheduling, gradually increasing the batch size from 2304 to 9216 over the first 225B tokens.
🌍 Used YaRN to extend the context window from 4K to 128K.
💰 42.5% lower training cost compared to DeepSeek 67B, thanks to sparse activation.
🏆 MMLU: 78.5; AlpacaEval 2.0: 38.9; MT-Bench: 8.97
🔧 Used Pipeline Parallelism, Expert Parallelism, and Data Parallelism for distributed training.
🎯 GRPO used a multi-reward framework combining a helpfulness reward model, a safety reward model, and a rule-based reward model (a minimal sketch of GRPO follows this list).
🔑 MLA significantly reduces the KV cache by compressing keys and values into a latent vector.
🚀 Used a hybrid engine with a vLLM inference backend for RLHF training.
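GRPO replaces PPO's learned value function with a group-relative baseline: several responses are sampled per prompt and each one's advantage is its reward normalized within the group. A minimal sketch of that advantage computation (rewards are hypothetical; the full objective also adds PPO-style clipping and a KL penalty against a reference model):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each response's reward by the
    mean/std of its group (one group = several responses to the same prompt)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Example: 2 prompts x 4 sampled responses, rewards from a (hypothetical) reward model.
rewards = torch.tensor([[0.1, 0.7, 0.4, 0.9],
                        [0.2, 0.2, 0.8, 0.5]])
print(grpo_advantages(rewards))  # these advantages weight the clipped policy-gradient loss
```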

Thanks for the insights! It's still unclear to me how one can use the compressed latent KV vector to obtain the keys/values without having to recompute them (by doing the up-projection from the compressed vector) every time attention is performed.

This relates to this passage: "In addition, during inference, since W^UK can be absorbed into W^Q, and W^UV can be absorbed into W^O, we even do not need to compute keys and values out for attention", which I kinda struggle to understand :/
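A hedged sketch of the algebra behind that absorption, under simplifying assumptions (single head, no decoupled RoPE branch, no 1/sqrt(d) scaling; all shapes hypothetical). The point is that the attention score q^T k_s = q^T (W_UK c_s) = (W_UK^T q)^T c_s, so W_UK can be applied once to the query and the dot products taken directly against the cached latents; likewise W_O W_UV can be precomputed so values never need to be materialized:

```python
import torch

d, d_latent = 64, 16
W_UK = torch.randn(d, d_latent)   # would up-project latent c_s into a key   k_s = W_UK @ c_s
W_UV = torch.randn(d, d_latent)   # would up-project latent c_s into a value v_s = W_UV @ c_s
W_O  = torch.randn(d, d)          # output projection

q = torch.randn(d)                # query for the current token
C = torch.randn(5, d_latent)      # cached latent vectors c_s for 5 past tokens

# Naive route: materialize keys/values from the cache, then attend.
K = C @ W_UK.T                                    # (5, d) keys
V = C @ W_UV.T                                    # (5, d) values
scores_naive = K @ q
out_naive = W_O @ (V.T @ torch.softmax(scores_naive, dim=0))

# "Absorbed" route: fold W_UK into the query and W_UV into W_O,
# so attention runs directly over the cached latents C.
q_absorbed = W_UK.T @ q                           # (d_latent,)
scores_absorbed = C @ q_absorbed                  # identical scores: q^T W_UK c_s
W_O_absorbed = W_O @ W_UV                         # precomputable (d, d_latent) matrix
out_absorbed = W_O_absorbed @ (C.T @ torch.softmax(scores_absorbed, dim=0))

print(torch.allclose(scores_naive, scores_absorbed, atol=1e-3))  # True
print(torch.allclose(out_naive, out_absorbed, atol=1e-3))        # True
```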
