---
license: apache-2.0
pipeline_tag: text-generation
language:
- fr
- en
- it
- de
- es
tags:
- pretrained
- llama-3
- openllm-france
datasets:
- OpenLLM-France/Lucie-Training-Dataset
widget:
- text: |-
    Quelle est la capitale de l'Espagne ? Madrid.
    Quelle est la capitale de la France ?
  example_title: Capital cities in French
  group: 1-shot Question Answering
training_progress:
  num_steps: 756291
  num_tokens: 3131736326144
  context_length: 32000
---

# Model Card for Lucie-7B

<!-- inspired from the following template:
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1
-->

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Example Code in Python](#example-code-in-python)
  * [Load the model](#load-the-model)
  * [Sentence completion](#sentence-completion)
  * [Load a checkpoint](#load-a-checkpoint)
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Training Procedure](#training-procedure)
    * [Neural Network Architecture](#neural-network-architecture)
    * [Training Hyperparameters](#training-hyperparameters)
      1. [Main Pre-training](#1-main-pre-training)
      2. [Context Extension](#2-context-extension)
      3. [Annealing](#3-annealing)
  * [Training Logs and Learning Curves](#training-logs-and-learning-curves)
<!-- * [Evaluation](#evaluation) -->
* [Disclaimer](#disclaimer)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B is a pretrained 7B parameter causal language model built by [LINAGORA](https://labs.linagora.com/) and [OpenLLM-France](https://github.com/OpenLLM-France),
available under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

Lucie-7B was trained on 3 trillion tokens of multilingual data, including
English (33.2%),
French (32.4%),
German (6.9%),
Spanish (6.6%),
Italian (3.8%),
and parallel data from those languages (2.5%),
as well as several programming languages (14.7%).

## Example Code in Python

### Load the model

Load the model (quantized version on GPU if possible, for efficient inference):
```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    device_map="auto",
    load_in_4bit=True  # For efficient inference, if quantization is supported by the GPU card
)
```
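
Note that recent versions of `transformers` deprecate passing `load_in_4bit` directly to `from_pretrained` in favor of an explicit quantization config. A minimal equivalent sketch, assuming the `bitsandbytes` package is installed:
```python
# Alternative 4-bit loading via an explicit quantization config
# (assumes a recent transformers version and the bitsandbytes package).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "OpenLLM-France/Lucie-7B",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```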

### Sentence completion

Wrap the model in a text generation pipeline, and specify some generation parameters:
```python
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_kwargs = dict(
    num_return_sequences=1,                # Number of variants to generate.
    return_full_text=False,                # Do not include the prompt in the generated text.
    do_sample=True,
    temperature=1.0, top_p=1, top_k=None,  # Sampling parameters.
    max_new_tokens=200,                    # Maximum length for the output text (in number of tokens).
)
```

Try 1-shot question answering:
```python
prompt = """\
Quelle est la capitale de l'Espagne ? Madrid\n\
Quelle est la capitale de la France ?\
"""
completions = pipeline(prompt, **generation_kwargs)
for completion in completions:
    print(prompt + " […]" + completion['generated_text'])
```
This will print something like:
```
Quelle est la capitale de l'Espagne ? Madrid
Quelle est la capitale de la France ? […] Paris
Quelle est la capitale de l'Italie? Rome
Quelle est la capitale de la Grande-Bretagne? Londres
Quelle est la capitale de la Suisse? Berne
Quelle est la capitale du Portugal? Lisbonne
Quelle est la capitale de l'Algérie? Alger
...
```

If running on GPU (`cuda` device), you will need at least 6GB of VRAM to run inference using 4-bit quantization (16GB of VRAM without 4-bit quantization).
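
As the sample output shows, the base model keeps generating further question/answer pairs after the answer. One way to keep only the first answer (not part of the original example; the `StopOnNewline` helper below is a hypothetical addition) is a custom stopping criterion that halts generation at the first newline produced after the prompt:
```python
# Hypothetical helper: stop generation at the first newline emitted after the prompt,
# so that only the answer to the last question is returned.
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnNewline(StoppingCriteria):
    def __init__(self, tokenizer, prompt_length):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return "\n" in generated

prompt_length = len(tokenizer(prompt)["input_ids"])
completions = pipeline(
    prompt,
    stopping_criteria=StoppingCriteriaList([StopOnNewline(tokenizer, prompt_length)]),
    **generation_kwargs,
)
print(completions[0]["generated_text"])
```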

### Load a checkpoint

Checkpoints from several training steps are available as revision tags:
every 5000 steps during the first 25000 steps, and then every 25000 steps.

Intermediate checkpoints can be loaded using the `revision` parameter:
```python
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    revision="step0753851",
    ...
)
```
where `revision` can be one of:
* "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps for the first pre-training steps (with a context length of 4096).
* "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
* "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
* "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).

## Training Details

### Training Data

The training dataset used for the pretraining of Lucie-7B is available
at [OpenLLM-France/Lucie-Training-Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset).
<!-- and described in ["The Lucie Training Dataset" (2024/12)](https://arxiv.org/abs/xxxx.xxxxx). -->
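
For a quick look at the data, the dataset can be streamed with the `datasets` library. This is only a sketch: it assumes the dataset's default configuration can be loaded directly; check the dataset card for the available configurations and subsets.
```python
# Stream a few examples from the training dataset without downloading the full corpus.
# Assumes the default configuration; see the dataset card for available configs/subsets.
from datasets import load_dataset

dataset = load_dataset(
    "OpenLLM-France/Lucie-Training-Dataset",
    split="train",
    streaming=True,
)
for i, example in enumerate(dataset):
    print(example)
    if i >= 2:
        break
```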

The initial composition of the training data is as follows:

![Initial Data Composition](figures/pie_dataset_composition.png)

Some of the data was upsampled to balance the training data distribution, yielding the following composition for training:

![Training Data Composition](figures/pie_dataset_composition_training.png)

### Training Procedure

Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predicting the next token).

It was pre-trained on 512 H100 80GB GPUs for about 550,000 GPU hours on the [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).

The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).

Optimizer checkpoints are available at [OpenLLM-France/Lucie-7B-optimizer-states](https://huggingface.co/OpenLLM-France/Lucie-7B-optimizer-states).
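
If needed, the optimizer states can be fetched locally with `huggingface_hub` (sketch only; this is a large download):
```python
# Download the optimizer-state repository locally (large download).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="OpenLLM-France/Lucie-7B-optimizer-states")
print(local_dir)
```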

#### Neural Network Architecture

Lucie-7B has the same neural network architecture as [Llama3.1](https://huggingface.co/meta-llama/Llama-3.1-8B).
It has exactly 6 706 958 336 free parameters,
with the following hyperparameters:
| **Hyperparameter** | **Value** |
|---------------------------|---------|
| Vocabulary size (\# tokens)| 65 024 |
| \# transformer blocks | 32 |
| \# attention heads | 32 |
| \# key-value heads | 8 |
| Hidden size | 4 096 |
| Feed-Forward hidden size | 12 288 |
| Activation | `silu` |
| RMS norm epsilon | 1e-5 |

The "theta" parameter of Rotary Positional Embedding (RoPE) was increased during the training process. Its values are indicated in the tables with training hyperparameters below.

#### Training Hyperparameters

The training consisted of three main phases:
1. Main pre-training on 3.1T tokens, with a context length of 4096,
2. Context extension on 5B tokens, with a context length of 32000,
3. Annealing on 5B tokens of high-quality data composed of a mixture of new data and data seen during training.
<!-- perhaps cite the dataset for annealing -->

The details of each phase are given below.

##### 1. Main Pre-training

Training hyperparameters in torch/Megatron-DeepSpeed were as follows:
| **Hyperparameter** | **Value** |
|------------------------|------------|
| Total \# samples| 762 144 586 (3.1T tokens) |
| Total \# steps | 753 851 |
| RoPE theta | 500 000 |
| Context length | 4 096 |
| Initial Batch size | 256 |
| Final Batch size | 1 024 |
| Batch size rampup | by steps of 64 over 10M samples |
| Learning rate schedule | warmup (2M samples) + cosine annealing |
| Maximum Learning rate | 3e-4 |
| Final Learning rate | 3e-5 |
| Weight decay | 0.1 |
| Dropout | _ |
| Gradient clipping | 1 |
| Initializer range | 0.009 |
| Optimizer | `AdamW` (β₁=0.9, β₂=0.95, ε=1e-5) |
| Precision | `bfloat16` |
| Tensor Parallelism (with 512 GPUs) | 4 |
| Pipeline Parallelism (with 512 GPUs) | 4 |
| Data Parallelism (with 512 GPUs) | 32 |
222
+ #### 2. Context Extension
223
+
224
+ Training hyperparameters are the same as above, with the following changes:
225
+ | **Hyperparameter** | **Value** |
226
+ |------------------------|------------|
227
+ | Total \# samples| 156 250 (5B tokens) |
228
+ | Total \# steps | 1 220 |
229
+ | RoPE theta | 20 000 000 |
230
+ | Context length | 32 000 |
231
+ | Batch size | 128 |
232
+ | Learning rate | 2e-5 |
233
+ | Learning rate schedule | constant |
234
+ | Tensor Parallelism (with 128 GPUs) | 4 |
235
+ | Pipeline Parallelism (with 128 GPUs) | 4 |
236
+ | Data Parallelism (with 128 GPUs) | 8 |
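
The jump in RoPE theta (from 500 000 to 20 000 000) lowers the rotary frequencies so that positional phases rotate more slowly over long sequences. A small illustration using the standard RoPE inverse-frequency formula, with the head dimension 4096 / 32 = 128 implied by the architecture table:
```python
import torch

# Standard RoPE inverse frequencies, 1 / theta^(2i/d), for the two theta values used.
def rope_inv_freq(theta, head_dim=128):
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

print(rope_inv_freq(500_000)[:4])     # main pre-training (context length 4096)
print(rope_inv_freq(20_000_000)[:4])  # context extension (context length 32000)
```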
237
+
238
+ #### 3. Annealing
239
+
240
+ Training hyperparameters are the same as for context extension, with the following changes:
241
+ | **Hyperparameter** | **Value** |
242
+ |------------------------|------------|
243
+ | Learning rate schedule | linear annealing |
244
+ | Maximum Learning rate | 3e-5 |
245
+ | Final Learning rate | 0 |
246
+
247
+ ### Training Logs and Learning Curves
248
+
249
+ #### Training loss
250
+
251
+ Training logs can be found in Tensorboard format in:
252
+ * [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
253
+ <br> ├── [`1_pretraining.zip`](metadata/training_logs/1_pretraining.zip) training logs for the first pre-training phases,
254
+ in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
255
+ <br> ├── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension) folder containing the training log <br> └── [`3_annealing/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/3_annealing) folder containing the training log for the annealing phase, which also took around 13H of training (parallelized over 128 GPUs).
256
+
257
+ The convergence curves of the three pre-training phases are the following:
258
+
259
+ ![figures/convergence-curve-pretraining.png](figures/convergence-curve-pretraining.png)
260
+
261
+ Data corresponding to these plots were extracted from tensorboard logs and are available in the following CSV files:
262
+ * [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
263
+ <br> ├── [`1_pretraining.csv`](metadata/training_logs/1_pretraining.csv)
264
+ <br> ├── [`2_extension.csv`](metadata/training_logs/2_extension.csv)
265
+ <br> └── [`3_annealing.csv`](metadata/training_logs/3_annealing.csv)
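
The CSV files can be inspected directly, for example with `pandas` (sketch only; the column layout is whatever was exported from the Tensorboard logs, so check the output of `head()` before plotting):
```python
# Load the extracted training-loss curve for the main pre-training phase.
import pandas as pd

df = pd.read_csv("metadata/training_logs/1_pretraining.csv")
print(df.head())  # inspect the exported columns before plotting
```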
266
+
267
+ #### Evaluations
268
+
269
+ Multiple evaluations were conducted during Lucie-7B's training to assess its performance on standard benchmarks,
270
+ primarily in French and English, as well as in Spanish, German, and Italian.
271
+
272
+ Evaluation results on benchmark datasets of checkpoints of Lucie-7B throughout the training process are available at
273
+ [metadata/evaluation_learning_curve_lucie.csv](metadata/evaluation_learning_curve_lucie.csv).
274
+ Evaluation results of baseline models on the same benchmark datasets are available at
275
+ [metadata/evaluation_baselines.csv](metadata/evaluation_baselines.csv).
276
+
277
+ Main results are summarized in the following figures:
278
+
279
+ ### French
280
+ ![figures/learning-curve-evaluation-french-bench.png](figures/learning-curve-evaluation-french-bench.png)
281
+
282
+ ### English
283
+ ![figures/learning-curve-evaluation-benchmarks-in-english.png](figures/learning-curve-evaluation-benchmarks-in-english.png)
284
+
285
+ ### other
286
+ ![figures/learning-curve-evaluation-multilingual-arc-benchmark.png](figures/learning-curve-evaluation-multilingual-arc-benchmark.png)
287
+
288
+ ### Needle in a Haystack
289
+
290
+ #### Pretraining
291
+ ![figures/needle-in-a-haystack/Lucie-7B-main.png](figures/needle-in-a-haystack/Lucie-7B-main.png)
292
+
293
+ #### Context Extension
294
+ ![figures/needle-in-a-haystack/Lucie-7B-extension.png](figures/needle-in-a-haystack/Lucie-7B-extension.png)
295
+
296
+ #### Annealing
297
+ ![figures/needle-in-a-haystack/Lucie-7B-annealing.png](figures/needle-in-a-haystack/Lucie-7B-annealing.png)
298
+
299
+
300
+ ## Disclaimer
301
+
302
+ Lucie-7B is a language model trained solely to predict the most probable next word in a sequence. Despite efforts to filter the [Lucie Training Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset), it is possible that Lucie-7B encountered strings containing toxic or offensive language during its training and as a result, it may generate strings of similar quality. To limit such behavior, it is advised to fine-tune Lucie-7B through instruction and/or preference tuning (DPO, RLHF, etc.).
303
+
304
+ ## Citation
305
+
306
+ TODO
307
+
308
+
309
+ ## Acknowledgements
310
+
311
+ This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).
312
+
313
+ Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and OpenLLM-France community, including in alphabetical order:
314
+ Christophe Cerisara (LORIA),
315
+ Evan Dufraisse (CEA),
316
+ Julie Hunter (LINAGORA),
317
+ Jean-Pierre Lorré (LINAGORA),
318
+ Jérôme Louradour (LINAGORA),
319
+ Michel-Marie Maudet (LINAGORA),
320
+ Olivier Gouvert (LINAGORA), and
321
+ Yaya Sy (LORIA).
322
+
323
+ We thank
324
+ Anastasia Stasenko (OpSci/Pleias),
325
+ Clément Bénesse (Opsci),
326
+ Guokan Shang (MBZUAI),
327
+ Ismaïl Harrando (LINAGORA),
328
+ Joël Gombin (Opsci),
329
+ Jordan Ricker (Opsci),
330
+ Olivier Ferret (CEA),
331
+ Pierre-Carl Langlais (OpSci/Pleias),
332
+ and
333
+ Rachel Bawden (INRIA),
334
+ for their helpful input.
335
+
336
+ ## Contact
337
+
338
+ contact@openllm-france.fr