OSainz committed on
Commit
c1dfd49
1 Parent(s): 5059354

Update README.md

  1. README.md +90 -122
README.md CHANGED
@@ -9,201 +9,169 @@ metrics:
9
  - accuracy
10
  - f1
11
  - perplexity
 
12
  ---
13
 
14
- # Model Card for Model ID
15
 
16
- <!-- Provide a quick summary of what the model is/does. -->
17
 
18
- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
19
 
20
- ## Model Details
21
 
22
- ### Model Description
23
 
24
- <!-- Provide a longer summary of what this model is. -->
25
 
 
26
 
 
27
 
28
- - **Developed by:** [More Information Needed]
29
- - **Funded by [optional]:** [More Information Needed]
30
- - **Shared by [optional]:** [More Information Needed]
31
- - **Model type:** [More Information Needed]
32
- - **Language(s) (NLP):** [More Information Needed]
33
- - **License:** [More Information Needed]
34
- - **Finetuned from model [optional]:** [More Information Needed]
35
 
36
- ### Model Sources [optional]
37
 
38
- <!-- Provide the basic links for the model. -->
 
 
 
 
 
 
39
 
40
- - **Repository:** [More Information Needed]
41
- - **Paper [optional]:** [More Information Needed]
42
- - **Demo [optional]:** [More Information Needed]
43
 
44
- ## Uses
45
-
46
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
47
-
48
- ### Direct Use
49
-
50
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
51
-
52
- [More Information Needed]
53
-
54
- ### Downstream Use [optional]
55
-
56
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
57
-
58
- [More Information Needed]
59
-
60
- ### Out-of-Scope Use
61
-
62
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
63
-
64
- [More Information Needed]
65
-
66
- ## Bias, Risks, and Limitations
67
-
68
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
69
-
70
- [More Information Needed]
71
-
72
- ### Recommendations
73
-
74
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
75
-
76
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
77
-
78
- ## How to Get Started with the Model
79
 
80
  Use the code below to get started with the model.
81
 
82
- [More Information Needed]
83
-
84
- ## Training Details
85
-
86
- ### Training Data
87
-
88
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
89
-
90
- [More Information Needed]
91
-
92
- ### Training Procedure
93
 
94
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
95
 
96
- #### Preprocessing [optional]
 
 
 
 
 
 
 
97
 
98
- [More Information Needed]
99
 
 
100
 
101
- #### Training Hyperparameters
102
 
103
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
104
 
105
- #### Speeds, Sizes, Times [optional]
106
 
107
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
108
 
109
- [More Information Needed]
110
 
111
- ## Evaluation
112
 
113
- <!-- This section describes the evaluation protocols and provides the results. -->
114
 
115
- ### Testing Data, Factors & Metrics
116
 
117
- #### Testing Data
118
 
119
- <!-- This should link to a Dataset Card if possible. -->
120
 
121
- [More Information Needed]
122
 
123
- #### Factors
124
 
125
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
126
 
127
- [More Information Needed]
128
 
129
- #### Metrics
130
 
131
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
132
 
133
- [More Information Needed]
134
 
135
- ### Results
136
 
137
- [More Information Needed]
138
 
139
- #### Summary
140
 
 
141
 
142
 
143
- ## Model Examination [optional]
 
 
 
 
144
 
145
- <!-- Relevant interpretability work for the model goes here -->
146
 
147
- [More Information Needed]
148
 
149
- ## Environmental Impact
150
 
151
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
152
 
153
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
154
 
155
- - **Hardware Type:** [More Information Needed]
156
- - **Hours used:** [More Information Needed]
157
- - **Cloud Provider:** [More Information Needed]
158
- - **Compute Region:** [More Information Needed]
159
- - **Carbon Emitted:** [More Information Needed]
160
 
161
- ## Technical Specifications [optional]
162
 
163
- ### Model Architecture and Objective
164
 
165
- [More Information Needed]
166
 
167
- ### Compute Infrastructure
168
 
169
- [More Information Needed]
170
 
171
- #### Hardware
 
 
 
 
 
 
 
 
 
 
172
 
173
- [More Information Needed]
174
 
175
- #### Software
176
 
177
- [More Information Needed]
178
 
179
- ## Citation [optional]
180
 
181
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 
182
 
183
- **BibTeX:**
184
 
185
- [More Information Needed]
186
 
187
- **APA:**
188
 
189
- [More Information Needed]
190
 
191
- ## Glossary [optional]
 
 
 
 
 
 
 
 
 
 
192
 
193
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
194
 
195
- [More Information Needed]
196
 
197
- ## More Information [optional]
198
 
199
- [More Information Needed]
200
 
201
- ## Model Card Authors [optional]
202
 
203
- [More Information Needed]
204
 
205
- ## Model Card Contact
 
 
 
 
206
 
207
- [More Information Needed]
208
 
 
209
 
 
 
9
  - accuracy
10
  - f1
11
  - perplexity
12
+ pipeline_tag: text-generation
13
  ---
14
 
15
+ # **Model Card for Basque Llama 7B**
16
 
17
+ Basque LLaMA is a collection of foundation models specifically tuned for Basque. Based on Meta’s LLaMA 2 model family, these models were further trained on EusCrawl ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)), a highly curated Basque corpus. Ranging from 7 billion to 70 billion parameters, these models are currently the biggest and best-performing LLMs built for Basque. This is the 7B repository; links to the other models can be found in the index at the bottom.
18
 
 
19
 
20
+ # **Model Details**
21
 
 
22
 
23
+ ## **Model Description**
24
 
25
+ Basque LLaMA is a family of Large Language Models (LLMs) based on Meta’s [LLaMA models](https://huggingface.co/meta-llama). Current LLMs perform very well on high-resource languages such as English, but for Basque and other low-resource languages their performance is close to that of a random guesser. These limitations widen the gap between high- and low-resource languages when it comes to digital development. We present Basque LLaMA to overcome these limitations and promote the development of LLM-based technology and research for the Basque language. Basque LLaMA models follow the same architecture as their original counterparts and were further trained on EusCrawl v1 ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)), a high-quality Basque corpus.
26
 
27
+ The models are released in three sizes: 7B, 13B and 70B.
28
 
 
 
 
 
 
 
 
29
 
 
30
 
31
+ * **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
32
+ * **Model type:** Language model
33
+ * **Language(s) (NLP):** en, eu
34
+ * **License:** llama2
35
+ * **Parent Model:** meta-llama/Llama-2-7B
36
+ * **Resources for more information:** [PAPER/BLOG/POST link]
37
+ * **Contact:** hitz@ehu.eus
38
 
 
 
 
39
 
40
+ ## **Getting started**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41
 
42
  Use the code below to get started with the model.
43
 
44
+ ```python
+ from transformers import pipeline
+
+ # Load the 7B model into a text-generation pipeline
+ pipe = pipeline("text-generation", model="HiTZ/basque-llama-2-7b-v1")
+ text = "Donosti da Euskal Herriko lekurik"
+
+ # Generate up to 40 new tokens that continue the Basque prompt
+ output = pipe(text, max_new_tokens=40)
+ print(output)
+ # Example output:
+ # [
+ #   {
+ #     'generated_text': 'Donosti da Euskal Herriko lekurik garestiena alokairuan bizitzeko,'
+ #                       ' eta Donostiako alokairuaren prezioa %11,3 igo da azken urtean'
+ #   }
+ # ]
+ ```
58
 
 
59
 
60
+ # **Uses**
61
 
62
+ Basque LLaMA models are intended to be used with Basque data; performance on any other language is not guaranteed. Like the original models, Basque LLaMA inherits the [LLaMA-2 License](https://ai.meta.com/llama/license/), which allows for commercial and research use.
63
 
 
64
 
65
+ ## **Direct Use**
66
 
67
+ Basque LLaMA family models are pre-trained LLMs without any task-specific or instruction fine-tuning. That is, the model can either be prompted to perform a specific task or further fine-tuned for specific use cases.
68
 
 
69
 
70
+ ## **Out-of-Scope Use**
71
 
72
+ The model was not fine-tuned to follow instructions or to work as a chat assistant; this kind of usage is therefore neither tested nor recommended.
73
 
 
74
 
75
+ # **Bias, Risks, and Limitations**
76
 
77
+ To reduce potentially disturbing or harmful content, Basque LLaMA has been trained on carefully selected and processed data, which comes mainly from local media, national/regional newspapers, encyclopedias and blogs (see EusCrawl below). Still, the model is based on LLaMA models and can potentially carry the same biases, risks and limitations.
78
 
79
+ Please see LLaMA’s _Ethical Considerations and Limitations_ for further information.
80
 
 
81
 
82
+ # **Training Details**
83
 
 
84
 
85
+ ## **Training Data**
86
 
87
+ The models were trained on EusCrawl v1, a high-quality corpus for Basque comprising 1.72M documents, 288M words, totalling 2.1GiB of uncompressed text. EusCrawl was built using ad-hoc scrapers to extract text from 33 Basque websites with high-quality content, resulting in cleaner text compared to general-purpose approaches.
88
 
89
+ See more details in the [EusCrawl](https://huggingface.co/datasets/HiTZ/euscrawl) dataset card.
90
 
91
+ Additionally, 100K documents of English data randomly selected from the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset were included to avoid catastrophic forgetting.
92
 
 
93
 
94
+ ## **Training Procedure**
95
 
96
+ The models were trained using the GPT-NeoX library on the CINECA HPC computing cluster. All models were trained with an effective batch size of approximately 2M tokens for 1000 to 2000 steps.
97
 
98
 
99
+ | Model | Steps | Sequence length | Effective Batch size | Total tokens | GPU hours |
100
+ | ---------------- | ----- | --------------- | -------------------- | ------------ | ---------- |
101
+ | Basque LLaMA 7B | 2000 | 4096 | 2M tokens/step | 4B | 359.2h |
102
+ | Basque LLaMA 13B | 1000 | 4096 | 2M tokens/step | 2B | 468.8h |
103
+ | Basque LLaMA 70B | 1680 | 4096 | 2M tokens/step | 3.4B | \*6475.52h |
104
 
 
105
 
106
+ "*" indicates the time for the entire training process (2000 steps); however, the released weights are those of step 1680, the best checkpoint according to validation loss.
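The "Total tokens" column in the table above can be reproduced as steps multiplied by the effective batch size. The sketch below is only a sanity check on the reported figures; it assumes that "2M tokens/step" means exactly 2,000,000 tokens (the card says the batch size is approximate), and the names `runs` and `total_tokens` are illustrative rather than part of any training code.

```python
# Figures copied from the training table. Treating "2M tokens/step" as exactly
# 2_000_000 is an assumption; the card describes the batch size as approximate.
TOKENS_PER_STEP = 2_000_000

runs = {
    "Basque LLaMA 7B": {"steps": 2000, "gpu_hours": 359.2},
    "Basque LLaMA 13B": {"steps": 1000, "gpu_hours": 468.8},
    # GPU hours for 70B cover the full 2000-step run, not only the 1680 released steps.
    "Basque LLaMA 70B": {"steps": 1680, "gpu_hours": 6475.52},
}


def total_tokens(steps, tokens_per_step=TOKENS_PER_STEP):
    """Tokens seen during training: optimizer steps times tokens per step."""
    return steps * tokens_per_step


for name, run in runs.items():
    billions = total_tokens(run["steps"]) / 1e9
    print(f"{name}: {billions:.2f}B tokens")

# Sum of the GPU hours reported for the three runs.
total_gpu_hours = sum(run["gpu_hours"] for run in runs.values())
print(f"Total GPU hours: {total_gpu_hours:.2f}")
```

Under this assumption the 70B checkpoint corresponds to 3.36B tokens, which the table rounds to 3.4B.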
107
 
 
108
 
109
+ # **Evaluation**
110
 
111
+ We evaluated the models in zero-shot and few-shot settings on generative, multiple-choice and classification tasks. We used the Basque partitions of each dataset.
112
 
 
 
 
 
 
113
 
114
+ ## **Testing Data, Factors & Metrics**
115
 
 
116
 
117
+ ### **Testing Data**
118
 
 
119
 
 
120
 
121
+ * **Belebele** ([Bandarkar et al.](https://arxiv.org/abs/2308.16884)): Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. We evaluated the model in a 5-shot fashion.
122
+ * Data card: [https://huggingface.co/datasets/facebook/belebele](https://huggingface.co/datasets/facebook/belebele)
123
+ * **X-StoryCloze**: XStoryCloze consists of professional translations of the English StoryCloze dataset into 10 non-English languages. StoryCloze is a commonsense reasoning dataset in which the task is to choose the correct ending to a four-sentence story. We evaluated the model in a 0-shot fashion.
124
+ * Data card: [https://huggingface.co/datasets/juletxara/xstory_cloze](https://huggingface.co/datasets/juletxara/xstory_cloze)
125
+ * **BasqueGLUE**: [https://huggingface.co/datasets/orai-nlp/basqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE). BasqueGLUE is an NLU benchmark for Basque. We evaluated the model in a 5-shot fashion on the following tasks:
126
+ * **BEC2016eu**: Sentiment analysis on tweets about the 2016 Basque elections campaign.
127
+ * **VaxxStance**: Stance detection on tweets around the anti-vaccine movement.
128
+ * **BHTCv2**: Topic classification of news extracts with 12 categories.
129
+ * **EpecKorrefBin**: Coreference detection task similar to WSC.
130
+ * **QNLIeu**: Q&A NLI built from the Basque Wikipedia.
131
+ * **WiCeu**: Basque Word-in-Context task.
132
 
 
133
 
134
+ ### **Metrics**
135
 
 
136
 
 
137
 
138
+ * Accuracy: Belebele, X-StoryCloze, EpecKorrefBin, QNLIeu and WiCeu
139
+ * Micro F1: BEC2016eu and BHTCv2
140
+ * Macro F1: VaxxStance (favor & against)
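For readers less familiar with these metrics, the following is a minimal pure-Python sketch of accuracy, micro F1 and macro F1. It is not the evaluation code used for the reported results. It reads "Macro F1 (favor & against)" as the mean of the per-class F1 scores of those two classes only, which is an interpretation of the line above; the toy labels (including the third label `"none"`) and all function names are illustrative assumptions.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def class_counts(gold, pred, label):
    """True positives, false positives and false negatives for one class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(gold, pred):
    """Pool the counts over all classes, then compute a single F1."""
    labels = set(gold) | set(pred)
    counts = [class_counts(gold, pred, label) for label in labels]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)


def macro_f1(gold, pred, labels):
    """Average the per-class F1 over the given subset of classes."""
    scores = [f1_from_counts(*class_counts(gold, pred, label)) for label in labels]
    return sum(scores) / len(scores)


# Toy stance data, purely illustrative.
gold = ["favor", "against", "none", "favor", "against", "none"]
pred = ["favor", "against", "none", "against", "against", "favor"]

print(f"accuracy: {accuracy(gold, pred):.4f}")
print(f"micro F1: {micro_f1(gold, pred):.4f}")
print(f"macro F1 (favor & against): {macro_f1(gold, pred, ['favor', 'against']):.4f}")
```

In single-label classification every error counts as one false positive and one false negative, so micro F1 over all classes equals accuracy; macro F1 instead weights each selected class equally regardless of how often it occurs.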
141
 
 
142
 
143
+ ## **Results**
144
 
145
+ The model was evaluated using the LM Evaluation Harness library from EleutherAI. To reproduce our results, please refer to our [fork](https://github.com/naiarapm/lm-evaluation-harness/tree/basqueglue), which includes the implementation for the mentioned datasets.
146
 
 
147
 
148
+ | Model | Belebele | X-StoryCloze | BEC | Vaxx | BHTC | coref | QNLI | WiC | Average |
149
+ | ---------------- | -------- | ------------ | ----- | ----- | ----- | ----- | ----- | ----- | ------- |
150
+ | Random | 25.00 | 50.00 | 33.33 | 33.33 | 8.33 | 50.00 | 50.00 | 50.00 | 37.50 |
151
+ | LLaMA 2 7B | 26.22 | 50.43 | 41.63 | 18.60 | 20.06 | 50.94 | 48.32 | 49.64 | 38.23 |
152
+ | LLaMA 2 13B | 32.00 | 50.63 | 41.09 | 18.25 | 27.35 | 49.23 | 48.74 | 49.21 | 39.56 |
153
+ | LLaMA 2 70B | 33.56 | 51.62 | 47.47 | 21.01 | 31.01 | 52.98 | 51.26 | 51.57 | 42.56 |
154
+ | BLOOM 7B | 27.00 | 57.18 | 37.94 | 20.72 | 39.10 | 48.21 | 47.48 | 47.57 | 40.65 |
155
+ | XGLM 7B | 23.88 | 57.71 | 39.94 | 21.58 | 36.73 | 50.94 | 50.42 | 49.21 | 41.30 |
156
+ | Basque LLaMA 7B | 35.67 | 63.13 | 55.61 | 45.93 | 44.44 | 50.43 | 55.04 | 50.14 | 50.05 |
157
+ | Basque LLaMA 13B | 53.56 | 65.85 | 53.23 | 48.66 | 53.61 | 62.52 | 57.14 | 54.21 | 56.10 |
158
+ | Basque LLaMA 70B | 71.78 | 67.57 | 63.52 | 48.95 | 49.51 | 79.90 | 58.82 | 55.50 | 61.94 |
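The card does not state how the "Average" column is computed. The check below suggests that it is consistent with the unweighted mean of the eight task scores. The rows are copied from the table above for a subset of models; `scores`, `reported_average` and `mean_score` are names introduced here for illustration.

```python
# Per-task scores in table order:
# Belebele, X-StoryCloze, BEC, Vaxx, BHTC, coref, QNLI, WiC.
scores = {
    "LLaMA 2 7B": [26.22, 50.43, 41.63, 18.60, 20.06, 50.94, 48.32, 49.64],
    "Basque LLaMA 7B": [35.67, 63.13, 55.61, 45.93, 44.44, 50.43, 55.04, 50.14],
    "Basque LLaMA 13B": [53.56, 65.85, 53.23, 48.66, 53.61, 62.52, 57.14, 54.21],
    "Basque LLaMA 70B": [71.78, 67.57, 63.52, 48.95, 49.51, 79.90, 58.82, 55.50],
}

# Values from the "Average" column of the same rows.
reported_average = {
    "LLaMA 2 7B": 38.23,
    "Basque LLaMA 7B": 50.05,
    "Basque LLaMA 13B": 56.10,
    "Basque LLaMA 70B": 61.94,
}


def mean_score(task_scores):
    """Unweighted mean over tasks (each task counts equally)."""
    return sum(task_scores) / len(task_scores)


for model, task_scores in scores.items():
    computed = mean_score(task_scores)
    print(f"{model}: computed {computed:.2f}, reported {reported_average[model]:.2f}")
```

Because the mean is unweighted, a task with a small test set influences the average as much as a large one, which is worth keeping in mind when comparing models on this column alone.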
159
 
 
160
 
 
161
 
162
+ # **Environmental Impact**
163
 
164
+ Carbon emissions are estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
165
 
 
166
 
 
167
 
168
+ * **Hardware Type:** HPC Cluster, 4x A100 64GB nodes
169
+ * **Hours used:** 359.2h + 468.8h + 6475.52h = 7303.52h
170
+ * **Compute cluster:** CINECA HPC
171
+ * **Compute Region:** Italy
172
+ * **Carbon Emitted:** 673.75kg CO<sub>2</sub> eq
173
 
 
174
 
175
+ # **Acknowledgements**
176
 
177
+ This work has been partially supported by the Basque Government (IKER-GAITU project). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2023E01-013.