vince62s committed on
Commit
7db83b3
1 Parent(s): 96591df

Upload README.md

Files changed (1)
  1. README.md +103 -72
README.md CHANGED
@@ -1,5 +1,54 @@
  ---
- language:
  - multilingual
  - af
  - am
@@ -61,7 +110,7 @@ language:
  - my
  - ne
  - nl
- - no
  - om
  - or
  - pa
@@ -94,96 +143,78 @@ language:
  - xh
  - yi
  - zh
- license: mit
  ---

- # XLM-RoBERTa-XL (xlarge-sized model)

- XLM-RoBERTa-XL model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).

- Disclaimer: The team releasing XLM-RoBERTa-XL did not write a model card for this model so this model card has been written by the Hugging Face team.

- ## Model description

- XLM-RoBERTa-XL is a extra large multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.

- RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.

- More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.

- This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa-XL model as inputs.

- ## Intended uses & limitations

- You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=xlm-roberta-xl) to look for fine-tuned versions on a task that interests you.

- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.

- ## Usage

- You can use this model directly with a pipeline for masked language modeling:

  ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='facebook/xlm-roberta-xl')
- >>> unmasker("Europe is a <mask> continent.")
-
- [{'score': 0.08562745153903961,
-   'token': 38043,
-   'token_str': 'living',
-   'sequence': 'Europe is a living continent.'},
-  {'score': 0.0799778401851654,
-   'token': 103494,
-   'token_str': 'dead',
-   'sequence': 'Europe is a dead continent.'},
-  {'score': 0.046154674142599106,
-   'token': 72856,
-   'token_str': 'lost',
-   'sequence': 'Europe is a lost continent.'},
-  {'score': 0.04358183592557907,
-   'token': 19336,
-   'token_str': 'small',
-   'sequence': 'Europe is a small continent.'},
-  {'score': 0.040570393204689026,
-   'token': 34923,
-   'token_str': 'beautiful',
-   'sequence': 'Europe is a beautiful continent.'}]
  ```

- Here is how to use this model to get the features of a given text in PyTorch:

- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM

- tokenizer = AutoTokenizer.from_pretrained('facebook/xlm-roberta-xl')
- model = AutoModelForMaskedLM.from_pretrained("facebook/xlm-roberta-xl")

- # prepare input
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='pt')

- # forward pass
- output = model(**encoded_input)
- ```

- ### BibTeX entry and citation info
-
- ```bibtex
- @article{DBLP:journals/corr/abs-2105-00572,
-   author     = {Naman Goyal and
-                 Jingfei Du and
-                 Myle Ott and
-                 Giri Anantharaman and
-                 Alexis Conneau},
-   title      = {Larger-Scale Transformers for Multilingual Masked Language Modeling},
-   journal    = {CoRR},
-   volume     = {abs/2105.00572},
-   year       = {2021},
-   url        = {https://arxiv.org/abs/2105.00572},
-   eprinttype = {arXiv},
-   eprint     = {2105.00572},
-   timestamp  = {Wed, 12 May 2021 15:54:31 +0200},
-   biburl     = {https://dblp.org/rec/journals/corr/abs-2105-00572.bib},
-   bibsource  = {dblp computer science bibliography, https://dblp.org}
- }
- ```
+
+ This is the converted model from Unbabel/wmt23-cometkiwi-da:
+
+ 1) Kept only the weight/bias keys of the state dict.
+ 2) Renamed the keys to match the original facebook/xlm-roberta-xl naming.
+ 3) Kept the layerwise_attention / estimator layers.
+
+ Because of a hack in HF's code, I had to rename the "layerwise_attention.gamma" key to "layerwise_attention.gam".
+
+ I changed the config.json key "layer_transformation" from sparsemax to softmax: because of a bug in COMET the flag is not passed, so the function actually used is the default, which is softmax.
+
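For reference, this is roughly what such a conversion looks like; a minimal sketch only, assuming the usual Lightning checkpoint layout and an illustrative "encoder.model." → "roberta." prefix mapping (file names and prefixes are assumptions, not the exact script that was used):

```python
import json
import torch

# Load the original Unbabel Lightning checkpoint (path is illustrative).
ckpt = torch.load("model.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]

new_state_dict = {}
for key, value in state_dict.items():
    # 1) keep only parameter tensors (weights / biases, plus the layerwise gamma)
    if not any(s in key for s in ("weight", "bias", "gamma")):
        continue
    # 2) rename encoder keys to the facebook/xlm-roberta-xl naming (prefix mapping is an assumption)
    new_key = key.replace("encoder.model.", "roberta.")
    # 3) keep the layerwise_attention / estimator keys, renaming "gamma" -> "gam"
    #    to dodge HF's automatic gamma -> weight key remapping
    new_key = new_key.replace("layerwise_attention.gamma", "layerwise_attention.gam")
    new_state_dict[new_key] = value

torch.save(new_state_dict, "pytorch_model.bin")

# sparsemax is never actually applied because of the COMET bug mentioned above,
# so point the config at the function really being used.
with open("config.json") as f:
    config = json.load(f)
config["layer_transformation"] = "softmax"
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```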
+ Usage:
+
+ ```python
+ from transformers import XLMRobertaTokenizerFast, AutoModel
+
+ tokenizer = XLMRobertaTokenizerFast.from_pretrained("vince62s/wmt23-cometkiwi-da-roberta-xl", trust_remote_code=True)
+ model = AutoModel.from_pretrained("vince62s/wmt23-cometkiwi-da-roberta-xl", trust_remote_code=True)
+
+ # the input is the machine translation followed by the source, joined by "</s></s>"
+ text = "Hello world!</s></s>Bonjour le monde"
+ encoded_text = tokenizer(text, return_tensors='pt')
+ print(encoded_text)
+ output = model(**encoded_text)
+ print(output[0])
+
+ # Expected output:
+ # {'input_ids': tensor([[ 0, 35378, 8999, 38, 2, 2, 84602, 95, 11146, 2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
+ # tensor([[0.8217]], grad_fn=<AddmmBackward0>)
+ ```
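The same model can score several segment pairs in one call; a minimal sketch, assuming the mt</s></s>src input format and the repo name used above:

```python
import torch
from transformers import XLMRobertaTokenizerFast, AutoModel

tokenizer = XLMRobertaTokenizerFast.from_pretrained("vince62s/wmt23-cometkiwi-da-roberta-xl", trust_remote_code=True)
model = AutoModel.from_pretrained("vince62s/wmt23-cometkiwi-da-roberta-xl", trust_remote_code=True)

pairs = [
    {"mt": "Hello world!", "src": "Bonjour le monde"},
    {"mt": "Hallo Welt!", "src": "Bonjour le monde"},
]

# Build one "mt</s></s>src" string per segment pair, as in the example above.
texts = [f'{p["mt"]}</s></s>{p["src"]}' for p in pairs]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    scores = model(**batch)[0].squeeze(-1)  # one quality score per pair
print(scores.tolist())
```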
+
+ Let's double-check with the original code from Unbabel COMET:
+
+ ```python
+ from comet import load_from_checkpoint
+
+ # this is the Unbabel checkpoint, downloaded locally
+ model = load_from_checkpoint("/home/vincent/Downloads/cometkiwi23/checkpoints/model.ckpt")
+ data = [{"mt": "Hello world!", "src": "Bonjour le monde"}]
+ output = model.predict(data, gpus=0)
+ print(output)
+
+ # Expected output:
+ # Prediction([('scores', [0.8216837048530579]), ('system_score', 0.8216837048530579)])
+ ```
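The converted model's 0.8217 matches the original checkpoint's 0.82168 up to rounding, so the conversion preserves the score on this example.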
+
  ---
+ extra_gated_heading: Acknowledge license to accept the repository
+ extra_gated_button_content: Acknowledge license
+ pipeline_tag: translation
+ language:
  - multilingual
  - af
  - am
 
  - my
  - ne
  - nl
+ - 'no'
  - om
  - or
  - pa
 
  - xh
  - yi
  - zh
+ license: cc-by-nc-sa-4.0
+ library_name: transformers
  ---

+ This is a [COMET](https://github.com/Unbabel/COMET) quality estimation model: it receives a source sentence and the respective translation and returns a score that reflects the quality of the translation.

+ # Paper

+ [CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task](https://aclanthology.org/2022.wmt-1.60) (Rei et al., WMT 2022)

+ # License:

+ cc-by-nc-sa-4.0

+ # Usage (unbabel-comet)

+ Using this model requires unbabel-comet to be installed:

+ ```bash
+ pip install --upgrade pip  # ensures that pip is current
+ pip install "unbabel-comet>=2.0.0"
+ ```

+ Make sure you acknowledge the license and log in to the Hugging Face Hub before using it:

+ ```bash
+ huggingface-cli login
+ # or using an environment variable
+ huggingface-cli login --token $HUGGINGFACE_TOKEN
+ ```
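If you prefer to authenticate from Python instead of the CLI, the huggingface_hub library provides an equivalent `login()` helper (the token value below is a placeholder):

```python
from huggingface_hub import login

# Same effect as `huggingface-cli login --token ...`; replace with your own token.
login(token="hf_xxx")
```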

+ Then you can use it through the COMET CLI:

+ ```bash
+ comet-score -s {source-input}.txt -t {translation-output}.txt --model Unbabel/wmt22-cometkiwi-da
+ ```

+ Or using Python:

  ```python
+ from comet import download_model, load_from_checkpoint
+
+ model_path = download_model("Unbabel/wmt22-cometkiwi-da")
+ model = load_from_checkpoint(model_path)
+ data = [
+     {
+         "src": "The output signal provides constant sync so the display never glitches.",
+         "mt": "Das Ausgangssignal bietet eine konstante Synchronisation, so dass die Anzeige nie stört."
+     },
+     {
+         "src": "Kroužek ilustrace je určen všem milovníkům umění ve věku od 10 do 15 let.",
+         "mt": "Кільце ілюстрації призначене для всіх любителів мистецтва у віці від 10 до 15 років."
+     },
+     {
+         "src": "Mandela then became South Africa's first black president after his African National Congress party won the 1994 election.",
+         "mt": "その後、1994年の選挙でアフリカ国民会議派が勝利し、南アフリカ初の黒人大統領となった。"
+     }
+ ]
+ model_output = model.predict(data, batch_size=8, gpus=1)
+ print(model_output)
  ```
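`predict` returns the `Prediction` object shown in the double-check above, with per-segment `scores` and a corpus-level `system_score`; continuing the example, the values can be read out like this:

```python
# Per-segment quality scores and the corpus-level average.
for item, score in zip(data, model_output.scores):
    print(f"{item['src'][:40]!r} -> {score:.4f}")
print("system_score:", model_output.system_score)
```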

+ # Intended uses

+ Our model is intended to be used for **reference-free MT evaluation**.

+ Given a source text and its translation, it outputs a single score between 0 and 1, where 1 represents a perfect translation.

+ # Languages Covered:

+ This model builds on top of InfoXLM, which covers the following languages:
+
+ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

+ Thus, results for language pairs containing uncovered languages are unreliable!