Unofficial LLaMAfied Version in HF format

#13
by JosephusCheung - opened

Recalibrated to fit the original LLaMA/LLaMA-2-like model structure.
JosephusCheung/Qwen-LLaMAfied-7B-Chat


If anyone knows why Qwen's MHA has bias terms for qkv, please let me know. I don't understand how this could improve the model's performance.

You can refer to the blog of RoPE's author, Su Jianlin, at https://spaces.ac.cn/archives/9577. The experimental results there indicate that adding a qkv bias improves the model's ability to extrapolate to longer sequences.

Updated the model weights and MMLU/CEval scores. The benchmark scores are now almost on par with the original Qwen-7B-chat.

@JosephusCheung Hi, could you please share how you merged the bias term of c_attn = nn.Linear(config.hidden_size, 3 * self.projection_size) into the LLaMA structure?

One possible solution:

$$W_c = W + \frac{b}{\|x\|_2^2} xx^T = W + \frac{b_{v} \times avg^T}{n^2}$$
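The variables in the formula above are not defined in the thread, so the following is only one possible reading, not the poster's confirmed method. It assumes the intent is a rank-one update that absorbs the bias b into the weight W using a fixed reference input (for example an average activation, which may be what avg stands for), with the update taken as b x_ref^T / ||x_ref||_2^2. The names fold_bias, matvec, and x_ref, and all numbers, are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fold_bias(W, b, x_ref):
    """Return W_c = W + b x_ref^T / ||x_ref||_2^2 (assumed reading of the formula)."""
    norm_sq = sum(v * v for v in x_ref)
    if norm_sq == 0.0:
        raise ValueError("reference input must be non-zero")
    return [
        [w_ij + b_i * x_j / norm_sq for w_ij, x_j in zip(row, x_ref)]
        for row, b_i in zip(W, b)
    ]


# Toy layer with 3 outputs and 2 inputs (illustrative values only).
W = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
b = [0.1, -0.2, 0.3]
x_ref = [2.0, 1.0]      # hypothetical reference (average) input
x_other = [1.0, -2.0]   # an input orthogonal to x_ref

W_c = fold_bias(W, b, x_ref)

# At the reference input, the bias-free layer matches the biased one.
with_bias = [y + b_i for y, b_i in zip(matvec(W, x_ref), b)]
no_bias = matvec(W_c, x_ref)
matches_at_ref = all(
    math.isclose(p, q, rel_tol=1e-9, abs_tol=1e-12)
    for p, q in zip(with_bias, no_bias)
)

# For an input orthogonal to x_ref, the folded term contributes nothing,
# so the bias is lost entirely for that input.
lost_when_orthogonal = all(
    math.isclose(p, q, rel_tol=1e-9, abs_tol=1e-12)
    for p, q in zip(matvec(W_c, x_other), matvec(W, x_other))
)
```

Under this reading the merge is exact only along x_ref and is an approximation for every other input, which would be consistent with the merged model scoring almost, but not exactly, on par with the original.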

@JosephusCheung Could you please explain the variables in the formula in detail, or share your code?

Su has since updated his blog post at https://spaces.ac.cn/archives/9577 with the following note: "Note: After repeated testing, the length-extrapolation results in this article turned out to be unstable to reproduce (possibly depending closely on model structure, hyperparameters, and so on); please use them at your own discretion."
