Training Data | Params | Context length | GQA | Token count | Knowledge cutoff | |
Llama 3 | A new mix of publicly available online data. | 8B | 8k | Yes | 15T+ | March, 2023 |
70B | 8k | Yes | December, 2023 |
Time (GPU hours) | Power Consumption (W) | Carbon Emitted(tCO2eq) | |
Llama 3 8B | 1.3M | 700 | 390 |
Llama 3 70B | 6.4M | 700 | 1900 |
Total | 7.7M | 2290 |
Category | Benchmark | Llama 3 8B | Llama2 7B | Llama2 13B | Llama 3 70B | Llama2 70B |
General | MMLU (5-shot) | 66.6 | 45.7 | 53.8 | 79.5 | 69.7 |
AGIEval English (3-5 shot) | 45.9 | 28.8 | 38.7 | 63.0 | 54.8 | |
CommonSenseQA (7-shot) | 72.6 | 57.6 | 67.6 | 83.8 | 78.7 | |
Winogrande (5-shot) | 76.1 | 73.3 | 75.4 | 83.1 | 81.8 | |
BIG-Bench Hard (3-shot, CoT) | 61.1 | 38.1 | 47.0 | 81.3 | 65.7 | |
ARC-Challenge (25-shot) | 78.6 | 53.7 | 67.6 | 93.0 | 85.3 | |
Knowledge reasoning | TriviaQA-Wiki (5-shot) | 78.5 | 72.1 | 79.6 | 89.7 | 87.5 |
Reading comprehension | SQuAD (1-shot) | 76.4 | 72.2 | 72.1 | 85.6 | 82.6 |
QuAC (1-shot, F1) | 44.4 | 39.6 | 44.9 | 51.1 | 49.4 | |
BoolQ (0-shot) | 75.7 | 65.5 | 66.9 | 79.0 | 73.1 | |
DROP (3-shot, F1) | 58.4 | 37.9 | 49.8 | 79.7 | 70.2 |
Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |