Update README.md
README.md
CHANGED
@@ -55,8 +55,8 @@
 
 Last week, the release and buzz around DeepSeek-V2 have ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. And now DeepSeek-V2-Lite comes out:
 
-- 16B total params, 2.4B active params, 5.7T
-- Outperforms 7B dense and 16B MoE on many benchmarks
+- 16B total params, 2.4B active params, trained from scratch with 5.7T tokens
+- Outperforms 7B dense and 16B MoE models on many English & Chinese benchmarks
 - Deployable on a single 40G GPU, fine-tunable on 8x80G GPUs
 
 DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE: MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.
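For readers who want to try the "deployable on a single 40G GPU" bullet above, here is a minimal Hugging Face transformers sketch. It assumes the checkpoint is published under the `deepseek-ai/DeepSeek-V2-Lite` repo id and that the repo ships custom modeling code (hence `trust_remote_code=True`); adjust the id, dtype, and generation settings to your setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id for the Lite checkpoint; swap in the actual one if it differs.
model_id = "deepseek-ai/DeepSeek-V2-Lite"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# ~16B params in bfloat16 is roughly 32 GB of weights, which is why a single 40G card can host it.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,       # DeepSeek-V2 relies on custom modeling code shipped with the weights
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("Multi-head Latent Attention is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```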
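The closing paragraph above states MLA's key trick in one line: cache a small latent per token instead of full per-head keys and values. The toy PyTorch module below only illustrates that idea; it is not DeepSeek's implementation, and it omits the decoupled RoPE key path and query-side compression used in the real architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedLatentAttention(nn.Module):
    """Toy MLA-style attention: cache one small latent per token, not per-head K/V."""

    def __init__(self, d_model=512, n_heads=8, kv_latent_dim=64):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_dkv = nn.Linear(d_model, kv_latent_dim, bias=False)  # down-projection; its output is the cache
        self.w_uk = nn.Linear(kv_latent_dim, d_model, bias=False)   # latent -> keys
        self.w_uv = nn.Linear(kv_latent_dim, d_model, bias=False)   # latent -> values
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        c_kv = self.w_dkv(x)                          # (b, t, kv_latent_dim)
        if latent_cache is not None:                  # decode step: extend the latent cache
            c_kv = torch.cat([latent_cache, c_kv], dim=1)

        def heads(z):                                 # (b, s, d_model) -> (b, n_heads, s, head_dim)
            return z.view(b, z.shape[1], self.n_heads, self.head_dim).transpose(1, 2)

        q = heads(self.w_q(x))
        k, v = heads(self.w_uk(c_kv)), heads(self.w_uv(c_kv))
        # Causal mask for prefill; incremental decode assumes one new token at a time.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), c_kv                    # only c_kv needs to be kept between steps


attn = SimplifiedLatentAttention()
y, cache = attn(torch.randn(2, 16, 512))                           # prefill: cache is (2, 16, 64)
y_next, cache = attn(torch.randn(2, 1, 512), latent_cache=cache)   # one-token decode step
```

Per token, the cache holds `kv_latent_dim` numbers instead of `2 * n_heads * head_dim` (64 vs. 1024 in this toy configuration), which is the source of the KV-cache savings the paragraph refers to.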