SILMA RAGQA V1.0: A Comprehensive Benchmark for Evaluating LLMs on RAG QA Use-Cases
Published December 18, 2024
SILMA RAGQA is a benchmark curated by silma.ai to assess the effectiveness of Arabic/English Language Models in Extractive Question Answering tasks, with a specific emphasis on RAG applications.
The benchmark includes 17 bilingual datasets in Arabic and English, spanning various domains.
What capabilities does the benchmark test?
- General Arabic and English QA capabilities
- Ability to handle short and long contexts
- Ability to provide short and long answers effectively
- Ability to answer complex numerical questions
- Ability to answer questions based on tabular data
- Multi-hop question answering: ability to answer one question using pieces of data from multiple paragraphs
- Negative Rejection: ability to recognize when the answer is not present in the given context and to decline to answer, replying with a precise statement such as "answer can't be found in the provided context" (see the sketch after this list)
- Multi-domain: ability to answer questions based on texts from different domains such as financial, medical, etc.
- Noise Robustness: ability to handle noisy and ambiguous contexts
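
To make the negative-rejection behavior concrete, the following is a minimal sketch of how one benchmark item could be turned into a RAG-style prompt. The field names (`context`, `question`, `answer`) and the exact rejection phrase are illustrative assumptions, not the benchmark's actual schema.

```python
# Sketch: assembling an extractive-QA prompt from one benchmark item.
# Field names and the rejection phrase are assumptions for illustration.

def build_prompt(item: dict) -> str:
    """Build a RAG-style prompt that instructs the model to answer
    only from the supplied context, or explicitly reject."""
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, reply: "
        "\"answer can't be found in the provided context\".\n\n"
        f"Context:\n{item['context']}\n\n"
        f"Question: {item['question']}\nAnswer:"
    )

# A negative-rejection item: the context does not contain the answer,
# so the expected output is the rejection phrase itself.
negative_item = {
    "context": "The company reported revenue of $4.2M in Q3 2024.",
    "question": "What was the company's net profit in Q3 2024?",
    "answer": "answer can't be found in the provided context",
}

print(build_prompt(negative_item))
```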
Data Sources
SLM Evaluations
SILMA Kashif is a new model that will be released in early January 2025.
| Model Name | Benchmark Score |
|---|---|
| SILMA-9B-Instruct-v1.0 | 0.268 |
| Gemma-2-2b-it | 0.281 |
| Qwen2.5-3B-Instruct | 0.300 |
| Phi-3.5-mini-instruct | 0.301 |
| Gemma-2-9b-it | 0.304 |
| Phi-3-mini-128k-instruct | 0.306 |
| Llama-3.2-3B-Instruct | 0.318 |
| Qwen2.5-7B-Instruct | 0.321 |
| Llama-3.1-8B-Instruct | 0.328 |
| c4ai-command-r7b-12-2024 | 0.330 |
| SILMA-Kashif-2B-v0.1 | 0.357 |
How to evaluate your model?
Follow the steps on the benchmark page.
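
For a rough idea of what the evaluation loop looks like, the sketch below loads the benchmark from the Hugging Face Hub and scores model outputs with ROUGE-L via the `evaluate` library. The dataset ID, split, column names, and metric choice are assumptions for illustration; the benchmark page defines the authoritative script and metric mix.

```python
# Hedged sketch of an evaluation loop; not the official benchmark script.
# Assumed: dataset ID, split, column names, and ROUGE-L as a stand-in metric.
from datasets import load_dataset
import evaluate

dataset = load_dataset("silma-ai/silma-rag-qa-benchmark-v1.0", split="test")
rouge = evaluate.load("rouge")

def generate_answer(prompt: str) -> str:
    """Placeholder: call your model here (e.g., a transformers pipeline)."""
    raise NotImplementedError

predictions, references = [], []
for item in dataset:
    prompt = (
        f"Context:\n{item['context']}\n\n"
        f"Question: {item['question']}\nAnswer:"
    )
    predictions.append(generate_answer(prompt))
    references.append(item["answer"])

# rouge.compute returns rouge1/rouge2/rougeL/rougeLsum F-measures.
scores = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-L: {scores['rougeL']:.3f}")
```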