---
license: apache-2.0
datasets:
- HuggingFaceTB/smollm-corpus
base_model:
- HuggingFaceTB/SmolLM-360M
pipeline_tag: text-generation
---

**Research Paper** ["Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs"](https://arxiv.org/abs/2502.14837)

## Inference

- Step 1: Download the [**monkey patch file**](https://github.com/JT-Ushio/MHA2MLA/blob/main/src/mha2mla/monkey_patch.py).

```shell
wget https://raw.githubusercontent.com/JT-Ushio/MHA2MLA/refs/heads/main/src/mha2mla/monkey_patch.py
```
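
Keep `monkey_patch.py` in the directory you run inference from: the script in Step 3 imports it as a local module (`from monkey_patch import infer_monkey_patch`).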

- Step 2 (optional): For MHA2MLA models that use the Partial-RoPE 2-norm method, download the [**qk_2-norm file**](https://github.com/JT-Ushio/MHA2MLA/tree/main/utils). Take `qk_tensor_360M.pth` as an example:

```shell
wget https://github.com/JT-Ushio/MHA2MLA/raw/refs/heads/main/utils/qk_tensor_360M.pth
```
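
To confirm the download, the file can be opened with `torch.load` like any other `.pth` artifact. A minimal sanity-check sketch (the exact layout of the file is an assumption; it holds the query/key 2-norm statistics the patch uses):

```python
# Hypothetical sanity check: load the qk 2-norm file and report its shape(s).
import torch

qk = torch.load("qk_tensor_360M.pth", map_location="cpu")
if isinstance(qk, torch.Tensor):
    print("tensor shape:", tuple(qk.shape))
else:
    # In case the file stores a dict of tensors rather than a single tensor.
    for name, t in qk.items():
        print(name, tuple(t.shape))
```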

- Step 3: Download the [MHA2MLA model](https://huggingface.co/fnlp/SmolLM-360M-MLA-d_kv_32) and run inference. Take `fnlp/SmolLM-360M-MLA-d_kv_32` as an example:

```python
import torch
from transformers import AutoConfig, AutoTokenizer, LlamaForCausalLM

from monkey_patch import infer_monkey_patch  # the file downloaded in Step 1

model_name = "fnlp/SmolLM-360M-MLA-d_kv_32"

# Monkey Patch: MHA -> MLA
config = AutoConfig.from_pretrained(model_name)
if hasattr(config, "RoPE"):  # `"RoPE" in config` fails: PretrainedConfig is not iterable
    config.RoPE["qk_tensor_path"] = "qk_tensor_360M.pth"  # configuration for 2-norm models (Step 2)
    infer_monkey_patch(config.RoPE)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.bfloat16).cuda()

# Generate
text = "Which American-born Sinclair won the Nobel Prize for Literature in 1930?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
generation_kwargs = {"do_sample": False, "use_cache": True, "max_new_tokens": 128}
output = model.generate(**inputs, **generation_kwargs)

print(tokenizer.decode(output[0], skip_special_tokens=True))
# - Sinclair Lewis
```
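
Since MLA's payoff is a smaller KV cache, it can be instructive to measure the bytes held in `past_key_values` after a cached forward pass, reusing `model` and `inputs` from the snippet above. A minimal sketch, assuming the patched model exposes the cache as iterable per-layer (key, value) tensor pairs (the legacy `transformers` layout; newer `Cache` objects still iterate this way):

```python
# Rough KV-cache footprint: one cached forward pass, then sum the bytes of
# every cached key/value tensor.
with torch.no_grad():
    out = model(**inputs, use_cache=True)

total_bytes = sum(
    t.numel() * t.element_size()
    for layer in out.past_key_values  # assumed: one (key, value) pair per layer
    for t in layer
)
print(f"KV cache: {total_bytes / 1024:.1f} KiB for {inputs['input_ids'].shape[1]} prompt tokens")
```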

## Citation

```
@misc{ji2025economicalinferenceenablingdeepseeks,
      title={Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs},
      author={Tao Ji and Bin Guo and Yuanbin Wu and Qipeng Guo and Lixing Shen and Zhan Chen and Xipeng Qiu and Qi Zhang and Tao Gui},
      year={2025},
      eprint={2502.14837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14837},
}
```