BAAI/IndustryCorpus2_DataRater

MonteXiaofeng committed on Sep 20, 2024

Commit d4f3266 · verified · 1 Parent(s): 8f748de

Update README.md

Files changed (1)

README.md +2 -2

README.md CHANGED

@@ -4,7 +4,7 @@ base_model:
 - BAAI/bge-m3
 ---
-This model is the quality-assessment model for the dataset [BAAI/IndustryCorpus2](https://huggingface.co/datasets/BAAI/IndustryCorpus2); it scores the quality of pretraining corpora.
 ## Why filter out low-quality data
@@ -25,7 +25,7 @@ base_model:
   Data size: 20k scored samples, with a 1:1 Chinese-to-English ratio
-  Data-scoring prompt
   ```
   quality_prompt = """Below is an extract from a web page. Evaluate whether the page has a high natural language value and could be useful in an naturanl language task to train a good language model using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

 - BAAI/bge-m3
 ---
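+This model is the quality-assessment model for the dataset [BAAI/IndustryCorpus2](https://huggingface.co/datasets/BAAI/IndustryCorpus2). It scores the quality of pretraining corpora along dimensions such as semantic coherence, information density, and educational value.
 ## Why filter out low-quality data
   Data size: 20k scored samples, with a 1:1 Chinese-to-English ratio
+  **Data prompt**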
   ```
   quality_prompt = """Below is an extract from a web page. Evaluate whether the page has a high natural language value and could be useful in an naturanl language task to train a good language model using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion: