MonteXiaofeng
committed on
Update README.md
README.md
CHANGED
@@ -4,7 +4,7 @@ base_model:
 - BAAI/bge-m3
 ---
 
-This model is for the dataset [BAAI/IndustryCorpus2](https://huggingface.co/datasets/BAAI/IndustryCorpus2)
+This model is the quality assessment model for the dataset [BAAI/IndustryCorpus2](https://huggingface.co/datasets/BAAI/IndustryCorpus2). It scores pretraining corpora along dimensions such as semantic consistency, information density, and educational value.
 
 ## Why filter out low-quality data
 
@@ -25,7 +25,7 @@ base_model:
 
 Data scale: 20k scored samples, with a 1:1 ratio of Chinese to English
 
-
+**Data prompt**
 
 ```
 quality_prompt = """Below is an extract from a web page. Evaluate whether the page has a high natural language value and could be useful in an naturanl language task to train a good language model using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion: