Unofficial LLaMAfied Version in HF format

#13

by JosephusCheung - opened Aug 4, 2023

Discussion

JosephusCheung

Aug 4, 2023

•

edited Aug 4, 2023

Recalibrated to fit the original LLaMA/LLaMA-2-like model structure.
JosephusCheung/Qwen-LLaMAfied-7B-Chat

If anyone knows why Qwen's MHA has bias terms for qkv, please let me know. I don't understand how this could improve the model's performance.

logicwong

Aug 5, 2023

•

edited Aug 5, 2023

You can refer to the blog of RoPE's author, Su Jianlin, at https://spaces.ac.cn/archives/9577. His experimental results indicate that adding qkv bias improves the model's ability to extrapolate to longer sequences.

JosephusCheung changed discussion status to closed Aug 7, 2023

JosephusCheung changed discussion status to open Aug 10, 2023

JosephusCheung

Aug 11, 2023

Updated model weights and MMLU/CEval scores. The benchmark scores are now almost on par with the original Qwen-7B-chat.

songkq

Qwen org Aug 11, 2023

@JosephusCheung Hi, could you please share how to merge the bias term of c_attn = nn.Linear(config.hidden_size, 3 * self.projection_size) into LLaMA?
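As background for this question, here is a minimal sketch of how a fused QKV projection could be split into three separate projections before any bias handling. It assumes the fused output is laid out as consecutive q, k, v blocks, which this thread does not state. It uses plain Python lists instead of tensors, and `split_fused_qkv` is a hypothetical helper name. Each part still carries its own bias, so removing the bias is a separate step.

```python
def split_fused_qkv(weight, bias, proj_size):
    """Split a fused (3 * proj_size)-row projection into q, k, v parts.

    ASSUMPTION: rows are ordered [q | k | v]; check the model code first.
    """
    if len(weight) != 3 * proj_size or len(bias) != 3 * proj_size:
        raise ValueError("expected 3 * proj_size rows in weight and bias")
    parts = {}
    for i, name in enumerate(("q", "k", "v")):
        rows = slice(i * proj_size, (i + 1) * proj_size)
        parts[name] = (weight[rows], bias[rows])
    return parts


# Toy example: hidden_size = 2, proj_size = 2, so the fused weight has 6 rows.
fused_weight = [[float(10 * r + c) for c in range(2)] for r in range(6)]
fused_bias = [float(r) for r in range(6)]

qkv = split_fused_qkv(fused_weight, fused_bias, proj_size=2)
k_weight, k_bias = qkv["k"]  # rows 2..3 of the fused projection
```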

JosephusCheung

Aug 11, 2023

> @JosephusCheung Hi, could you please share how to merge the bias item of c_attn = nn.Linear(config.hidden_size, 3 * self.projection_size) into LLama?

One possible solution:

$W_c = W + \frac{b\,x^T}{\|x\|_2^2} = W + \frac{b_{v}\,avg^T}{n^2}$
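For readers asking what the formula means, here is a minimal sketch of one possible reading, not necessarily the code used for the released weights. It treats the correction as a rank-one update $b\,x^T / \|x\|_2^2$ built from a single reference input $x$. Taking $avg$ to be a mean input vector and $n$ its L2 norm is an assumption, since the thread does not define them. `matvec` and `fold_bias` are hypothetical helper names.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def fold_bias(W, b, x_ref):
    """Return W + b x_ref^T / ||x_ref||^2, a rank-one update of W."""
    norm_sq = sum(v * v for v in x_ref)
    return [
        [w_ij + b_i * x_j / norm_sq for w_ij, x_j in zip(row, x_ref)]
        for row, b_i in zip(W, b)
    ]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 outputs, 2 inputs
b = [0.5, -1.0, 2.0]
x_ref = [1.0, 2.0]  # ASSUMPTION: stands in for an average input vector

W_c = fold_bias(W, b, x_ref)

# At the reference input the bias-free layer matches the biased one.
biased_at_ref = [y + b_i for y, b_i in zip(matvec(W, x_ref), b)]
folded_at_ref = matvec(W_c, x_ref)

# An input orthogonal to x_ref picks up none of the folded bias.
x_other = [2.0, -1.0]
folded_at_other = matvec(W_c, x_other)
plain_at_other = matvec(W, x_other)
```

Under this reading the folded weight reproduces the bias exactly only at the reference input. An input orthogonal to it receives no bias contribution at all, so the result is an approximation whose quality depends on how close real inputs stay to the reference direction.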

songkq

Qwen org Aug 12, 2023

@JosephusCheung Could you please explain the variables in the formula in detail, or share your code?

JosephusCheung changed discussion status to closed Aug 15, 2023

JosephusCheung changed discussion status to open Aug 31, 2023

JosephusCheung

Nov 30, 2023

> You can refer to the blog of Rope's author, Su Jianlin, at https://spaces.ac.cn/archives/9577. The experimental results indicate that incorporating qkv bias enhances the ability in extrapolating to longer sequences.

Su has since updated his blog post. It now reads: "Note: after repeated testing, the reproducibility of the length extrapolation results in this article was found to be rather unstable (possibly closely tied to the model structure, hyperparameters, etc.); please use it at your own discretion."
