TomPei committed on
Commit
98d5b1d
1 Parent(s): dc3b039

Update README.md

Files changed (1): README.md +33 -0
README.md CHANGED
@@ -76,6 +76,23 @@ We utilized OpenCSG's enterprise-grade large language model, csg-wukong-enterpri

 We recorded 100,000 data samples along with their scores, creating the dataset `fineweb_edu_classifier_chinese_data`. Using the scores from this dataset as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which can assign a score of 0-5 to each input text. We plan to further optimize this scoring model, and in the future, the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and collaboration. This dataset contains meticulously annotated and scored educational text data, providing high-quality training data for researchers and developers.
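
Once the scoring model is released, using it should look roughly like HuggingFace's English fineweb-edu classifier. Below is a minimal sketch, assuming a hypothetical checkpoint id (`opencsg/fineweb_edu_classifier_chinese`) and a single-logit regression head; the real id and head may differ.

```python
# Sketch only: the checkpoint id and regression-style head are assumptions;
# fineweb_edu_classifier_chinese itself has not been released yet.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "opencsg/fineweb_edu_classifier_chinese"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

text = "勾股定理指出,直角三角形两条直角边的平方和等于斜边的平方。"
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits.squeeze(-1)

# Clamp the single regression logit to the documented 0-5 score range.
score = float(logits.clamp(0, 5))
print(f"educational score: {score:.2f}")
```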

+ ## Ablation experiments
+ We designed ablation experiments to contrast the effect of the Chinese-fineweb-edu dataset with that of traditional Chinese pre-training corpora.
+ For this purpose, we randomly sampled data from five datasets (CCI2-Data, SkyPile-150B, TeleChat-PTD, IndustryCorpus, and MAP-CC) to match the size of the Chinese-fineweb-edu dataset, constructing a comparison dataset named chinese-random-select.
+ In our experiments, we used a model with 2.1 billion parameters and trained it for 65k steps on each of the two datasets.
+ Throughout training, we periodically saved model checkpoints and validated them on the Chinese evaluation benchmarks CEval and CMMLU.
+ The graph below displays the performance trends of the two models on these evaluation tasks.
+ The results clearly show that the model trained on Chinese-fineweb-edu significantly outperforms the model trained on chinese-random-select in both evaluation tasks, with a particularly large advantage in the later stages of training. This underscores the effectiveness and suitability of Chinese-fineweb-edu for Chinese language tasks, and it highlights the critical impact of dataset selection and construction on a model's final performance.
+ <p align="center">
+ <img width="900px" alt="experiment" src="./chinese-fineweb-benchmark.png">
+ </p>
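
For illustration only, assembling a comparison set like chinese-random-select could look like the sketch below; the even per-source split, target size, file names, and JSONL layout are all assumptions, since the actual sampling pipeline has not been published.

```python
# Illustrative sketch of building a chinese-random-select-style comparison set:
# sample from each source corpus until the total matches the size of
# chinese-fineweb-edu. Paths, the JSONL layout, and the even per-source split
# are assumptions, not the team's actual pipeline.
import json
import random

random.seed(0)

SOURCES = ["CCI2-Data", "SkyPile-150B", "TeleChat-PTD", "IndustryCorpus", "MAP-CC"]
TARGET_TOTAL = 1_000_000  # assumed sample count matching chinese-fineweb-edu

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

per_source = TARGET_TOTAL // len(SOURCES)
comparison = []
for name in SOURCES:
    rows = load_jsonl(f"{name}.jsonl")  # placeholder file per corpus
    comparison.extend(random.sample(rows, min(per_source, len(rows))))

random.shuffle(comparison)
with open("chinese-random-select.jsonl", "w", encoding="utf-8") as f:
    for row in comparison:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```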
+
+ The experimental results further reveal that in the later stages of training, as training enters the second epoch and the learning rate decreases rapidly, the accuracy of the model trained on chinese-fineweb-edu rises markedly,
+ whereas the model trained on the randomly sampled data remains at a lower level. This demonstrates that the high-quality data of chinese-fineweb-edu significantly improves training effectiveness.
+ With the same training duration, it raises model capability faster and saves training resources.
+ This outcome also closely mirrors HuggingFace's data ablation experiments on FineWeb-Edu.
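
As one way to reproduce the periodic validation described above, saved checkpoints can be scored on both benchmarks with EleutherAI's lm-evaluation-harness; the harness choice and checkpoint path below are assumptions, not the team's documented setup.

```python
# Sketch: score a saved checkpoint on CEval and CMMLU with
# lm-evaluation-harness (pip install lm-eval). The checkpoint path is a
# placeholder, and the choice of harness is our assumption.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./checkpoints/step_65000",  # placeholder path
    tasks=["ceval-valid", "cmmlu"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```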
+
+
  **We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**
 
  ## License Agreement
 
@@ -158,6 +175,22 @@ The original data of the Chinese Fineweb Edu dataset comes from a wide range of sources, covering multiple domestic

 We recorded 100k data samples along with their scores, forming the dataset `fineweb_edu_classifier_chinese_data`. Using the scores from this dataset as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which can assign each input text a score of 0-5. We will continue to optimize this scoring model, and in the future the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and exchange. This dataset contains carefully annotated and scored educational text data, providing high-quality training data for researchers and developers.
+
+ ## Ablation experiments
+
+ Through carefully designed ablation experiments, we aimed to compare the difference in effectiveness between the Chinese-fineweb-edu dataset and traditional Chinese pre-training corpora. To this end, we randomly sampled data from five datasets (CCI2-Data, SkyPile-150B, TeleChat-PTD, IndustryCorpus, and MAP-CC) to match the size of the Chinese-fineweb-edu data, constructing a comparison dataset named chinese-random-select.
+ In the experiments, we used a model with 2.1B parameters and trained it for 65k steps on each of the two datasets. During training, we periodically saved model checkpoints and validated them on the Chinese evaluation benchmarks CEval and CMMLU. The figure below shows how the two datasets performed on the evaluation tasks over the course of training.
+ The results clearly show that the model trained on Chinese-fineweb-edu significantly outperforms the model trained on chinese-random-select in both evaluation tasks, with an especially large advantage in the later stages of training, demonstrating the effectiveness and suitability of Chinese-fineweb-edu for Chinese language tasks. These results further indicate that the selection and construction of the dataset has a critical impact on the final performance of the model.
+ <p align="center">
+ <img width="900px" alt="experiment" src="./chinese-fineweb-benchmark.png">
+ </p>
+
+
+ The experimental results show that in the later stages of training, as training enters the second epoch and the learning rate drops rapidly, the accuracy of the model trained on chinese-fineweb-edu rises noticeably, while the model trained on the randomly sampled data remains at a lower level.
+ This demonstrates that the high-quality data of chinese-fineweb-edu provides a significant boost to training effectiveness: within the same training time, it improves model capability faster and saves training resources. This result also closely mirrors HuggingFace's data ablation experiments on FineWeb-Edu.
+
+
+
 **We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**

 ## License Agreement