Pclanglais
committed on
Update README.md
README.md
CHANGED
@@ -72,6 +72,24 @@ For best results we recommend the following setting:
* Deterministic generation (temp = 0) and no repetition penalty (which is unsurprisingly detrimental to the accuracy of citations).
* Standardized hashes of 16 characters. While the model has been trained on many other patterns (including full bibliographic entries), this has proven the most convenient for systematic citation parsing.

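The two settings above can be sketched in Python. This is a minimal illustration, not the reference pipeline: the keyword names `do_sample` and `repetition_penalty` are those of the Hugging Face `transformers` `generate` method, while `source_hash` (SHA-256 truncated to 16 hexadecimal characters) and `cited_sources` are hypothetical helpers introduced here only to show why fixed-length hashes are easy to parse.

```python
import hashlib
import re
from typing import Dict, List

# Greedy decoding (equivalent to temperature 0) with the repetition penalty
# disabled; 1.0 means "no penalty" in transformers' `generate`.
GENERATION_KWARGS = {"do_sample": False, "repetition_penalty": 1.0}

HASH_LENGTH = 16


def source_hash(text: str) -> str:
    """Hypothetical ID scheme: first 16 hex characters of a SHA-256 digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:HASH_LENGTH]


def cited_sources(answer: str, sources: Dict[str, str]) -> List[str]:
    """Return the known source hashes found in `answer`, in order of first use."""
    pattern = re.compile(r"\b[0-9a-f]{16}\b")
    found: List[str] = []
    for match in pattern.finditer(answer):
        candidate = match.group(0)
        # Ignore 16-character tokens that do not match a provided source.
        if candidate in sources and candidate not in found:
            found.append(candidate)
    return found
```

Because every identifier has the same length and alphabet, a single regular expression is enough to recover the citations, and any hash that does not match a provided source can be discarded as a hallucinated reference.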
### RAG Evaluation

We evaluate the Pico and Nano models on a RAG task. Because existing benchmarks are largely limited to English, we developed a custom multilingual RAG benchmark: we synthetically generated queries and small sets of documents, prompted each model with a query and its documents, and then ran a head-to-head Elo-based tournament with GPT-4o as the judge. We [release the prompts and generations for all models we compared](https://huggingface.co/datasets/PleIAs/Pleias-1.0-eval/tree/main/RAGarena). Our nano (1.2B) model outperforms Llama 3.2 1B and EuroLLM 1.7B, and also ranks above the 3B Llama 3.2 and Qwen2.5 instruct models. Our pico (350M) model outperforms other models in its weight class, such as SmolLM 360M and Qwen2.5 0.5B, as well as much larger models, such as Llama 3.2 1B and EuroLLM 1.7B.

| **Rank** | **Model**                | **ELO**    |
|----------|--------------------------|------------|
| 1        | Qwen2.5-Instruct-7B      | 1294.6     |
| 2        | Llama-3.2-Instruct-8B    | 1269.8     |
| 3        | **Pleias-nano-1.2B-RAG** | **1137.5** |
| 4        | Llama-3.2-Instruct-3B    | 1118.1     |
| 5        | Qwen2.5-Instruct-3B      | 1078.1     |
| 6        | **Pleias-pico-350M-RAG** | **1051.2** |
| 7        | Llama-3.2-1B-Instruct    | 872.3      |
| 8        | EuroLLM-1.7B-Instruct    | 860.0      |
| 9        | SmolLM-360M-Instruct     | 728.6      |
| 10       | Qwen2.5-0.5B-Instruct    | 722.2      |
| 11       | SmolLM-1.7B-Instruct     | 706.3      |

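For readers unfamiliar with Elo ratings, the sketch below shows how pairwise judge verdicts can be turned into a ranking like the one above. It is an illustration under stated assumptions, not the exact procedure used for this benchmark: the K-factor of 32, the initial rating of 1000, and the 400-point scale are conventional defaults chosen here, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard 400-point Elo scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(
    matches: Iterable[Tuple[str, str, float]],
    k: float = 32.0,         # assumed K-factor, not taken from the benchmark
    initial: float = 1000.0  # assumed starting rating for every model
) -> Dict[str, float]:
    """Update ratings from (model_a, model_b, score_a) judge verdicts.

    score_a is 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in matches:
        rating_a = ratings.setdefault(model_a, initial)
        rating_b = ratings.setdefault(model_b, initial)
        expected_a = expected_score(rating_a, rating_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = rating_a + k * (score_a - expected_a)
        ratings[model_b] = rating_b - k * (score_a - expected_a)
    return ratings
```

If the ratings in the table follow the same 400-point scale, the 86.3-point gap between the nano and pico models would correspond to an expected score of roughly 0.62 for nano in a direct comparison.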
## Ethical Considerations

The pleias-pico model, like all large language models, carries inherent ethical risks that require careful consideration. Our approach to mitigating these risks begins at the data level, where we exclusively use vetted sources and deliberately exclude CommonCrawl. The primary challenge comes from the public domain component of our dataset, which contains historical texts that may reflect outdated social norms and potentially harmful language, particularly regarding minoritized groups.