TaoJi committed · commit 8d53d5a (verified) · 1 parent: c25a9db

Update README.md

Files changed (1): README.md (+59, -3)
README.md (updated):

  ---
license: apache-2.0
datasets:
- HuggingFaceTB/smollm-corpus
base_model:
- HuggingFaceTB/SmolLM-360M
pipeline_tag: text-generation
---

**Research Paper** ["Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs"](https://arxiv.org/abs/2502.14837)

## Inference

- Step 1: Download the [**monkey patch file**](https://github.com/JT-Ushio/MHA2MLA/blob/main/src/mha2mla/monkey_patch.py).
```shell
wget https://raw.githubusercontent.com/JT-Ushio/MHA2MLA/refs/heads/main/src/mha2mla/monkey_patch.py
```

- Step 2 (optional): For MHA2MLA models that use the Partial-RoPE 2-norm method, download the [**qk_2-norm file**](https://github.com/JT-Ushio/MHA2MLA/tree/main/utils).
Take `qk_tensor_360M.pth` as an example (a pure-Python alternative to these `wget` commands is sketched after this list):
```shell
wget https://github.com/JT-Ushio/MHA2MLA/raw/refs/heads/main/utils/qk_tensor_360M.pth
```

- Step 3: Download the [MHA2MLA models](https://huggingface.co/fnlp/SmolLM-360M-MLA-d_kv_32) and run inference.
Take `fnlp/SmolLM-360M-MLA-d_kv_32` as an example:

```python
import torch
from transformers import AutoConfig, AutoTokenizer, LlamaForCausalLM
from monkey_patch import infer_monkey_patch  # the file downloaded in Step 1

model_name = "fnlp/SmolLM-360M-MLA-d_kv_32"

# Monkey Patch: MHA -> MLA
config = AutoConfig.from_pretrained(model_name)
if hasattr(config, "RoPE"):  # MHA2MLA checkpoints store their partial-RoPE/MLA settings under config.RoPE
    config.RoPE["qk_tensor_path"] = "qk_tensor_360M.pth"  # model-specific qk_2-norm tensor from Step 2
    infer_monkey_patch(config.RoPE)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.bfloat16).cuda()

# Generate
text = "Which American-born Sinclair won the Nobel Prize for Literature in 1930?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
generation_kwargs = {"do_sample": False, "use_cache": True, "max_new_tokens": 128}
output = model.generate(**inputs, **generation_kwargs)

print(tokenizer.decode(output[0], skip_special_tokens=True))
# - Sinclair Lewis
```
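
Optionally, the same two auxiliary files from Steps 1 and 2 can be fetched without `wget`. The following is a minimal convenience sketch (not part of the original model card) that uses only the Python standard library; the URLs are exactly the ones from the shell commands above.

```python
import urllib.request

# Fetch the monkey patch file (Step 1) and the qk_2-norm tensor (Step 2)
# into the current working directory.
FILES = {
    "monkey_patch.py": "https://raw.githubusercontent.com/JT-Ushio/MHA2MLA/refs/heads/main/src/mha2mla/monkey_patch.py",
    "qk_tensor_360M.pth": "https://github.com/JT-Ushio/MHA2MLA/raw/refs/heads/main/utils/qk_tensor_360M.pth",
}

for filename, url in FILES.items():
    urllib.request.urlretrieve(url, filename)
    print(f"downloaded {filename}")
```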
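
Because the motivation for MHA2MLA is a smaller KV cache at inference time, a quick sanity check of peak GPU memory during generation can be informative. The sketch below is illustrative rather than an official benchmark: it assumes the `model` and `inputs` objects from the example above are still in scope, and any comparison against the unconverted `HuggingFaceTB/SmolLM-360M` baseline is left to the reader.

```python
import torch

# Rough check of peak GPU memory while generating; the MLA model's smaller
# KV cache shows up mainly for longer generations.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, do_sample=False, use_cache=True, max_new_tokens=512)

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during generation: {peak_gib:.2f} GiB")
```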

## Citation
```
@misc{ji2025economicalinferenceenablingdeepseeks,
      title={Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs},
      author={Tao Ji and Bin Guo and Yuanbin Wu and Qipeng Guo and Lixing Shen and Zhan Chen and Xipeng Qiu and Qi Zhang and Tao Gui},
      year={2025},
      eprint={2502.14837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14837},
}
```