patrickvonplaten committed
Commit: e000016 · Parent: eeec07f · Update README.md

README.md CHANGED

---
language:
- en
datasets:
- c4
license: apache-2.0
---

# Introduction

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

![model image](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/ul2.png)
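
The Mixture-of-Denoisers idea can be pictured with a small toy sketch. This is purely illustrative and not the actual training code: the denoiser names follow the paper's R-, X- and S-denoisers, but the span lengths and corruption rates are rough approximations, and each example simply gets a mode token recording which denoiser produced it.

```python
import random

# Toy Mixture-of-Denoisers: every example is corrupted by one randomly chosen
# denoiser. Span lengths and corruption rates are illustrative approximations.
DENOISERS = {
    "[R]": {"mean_span": 3, "corruption_rate": 0.15},   # regular span corruption
    "[X]": {"mean_span": 32, "corruption_rate": 0.50},  # extreme corruption
    "[S]": {"prefix_lm": True},                         # sequential / prefix-LM
}

def corrupt(tokens, config):
    """Return (inputs, targets) for one toy denoising example."""
    if config.get("prefix_lm"):
        split = len(tokens) // 2                        # condition on the first half
        return tokens[:split] + ["<extra_id_0>"], ["<extra_id_0>"] + tokens[split:]
    n_corrupt = max(1, int(len(tokens) * config["corruption_rate"]))
    span = min(n_corrupt, config["mean_span"])
    start = random.randrange(0, len(tokens) - span + 1)
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
    targets = ["<extra_id_0>"] + tokens[start:start + span]
    return inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
mode, cfg = random.choice(list(DENOISERS.items()))
inputs, targets = corrupt(tokens, cfg)
print([mode] + inputs, "->", targets)                   # mode token marks the denoiser
```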

**Abstract**

Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.

For more information, please take a look at the original paper.

Paper: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1)

Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*

# Pretraining

The model is pretrained on the C4 corpus. A batch size of 1024 is used for pretraining. The model is trained on a total of 1 trillion tokens on C4 (2 million steps), with the sequence length set to 512/512 for inputs and targets. Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens. We use the same mixture of denoisers as described in the paper. The model has 32 encoder layers and 32 decoder layers, a `d_model` of 4096 and a `d_ff` of 16384. The dimension of each head is 256, for a total of 16 heads. The model uses a model parallelism of 8 and retains the same SentencePiece tokenizer as T5, with a 32k vocabulary size. Hence, UL20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs. Similar to earlier experiments, **UL20B** is trained with Jax and T5X infrastructure.
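
As a compact summary of these dimensions, here is a sketch of the corresponding 🤗 Transformers `T5Config`. The vocabulary size and the gated feed-forward activation are assumptions (the usual T5.1.1-style values), not numbers stated above; the released checkpoint's config may differ.

```python
from transformers import T5Config

# Architecture sketch matching the dimensions quoted above. vocab_size and
# feed_forward_proj are assumptions; check the actual checkpoint config.
config = T5Config(
    vocab_size=32128,          # T5's sentinel-padded ~32k vocabulary (assumed)
    d_model=4096,              # model dimension
    d_kv=256,                  # dimension of each attention head
    num_heads=16,              # 16 heads x 256 = 4096
    d_ff=16384,                # feed-forward dimension
    num_layers=32,             # encoder layers
    num_decoder_layers=32,     # decoder layers
    dropout_rate=0.0,          # dropout is 0 during pretraining
    feed_forward_proj="gated-gelu",
)
print(config)
```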

## Fine-tuning

The model was continuously fine-tuned after N pretraining steps, where N is typically between 50k and 100k. In other words, after every N steps of pretraining, we fine-tune on each downstream task and record its results. This is generally done in a manual fashion. While some tasks were fine-tuned on earlier checkpoints as the model was still pretraining, many were fine-tuned on checkpoints nearer to convergence that we release. As we continuously fine-tune, we stop fine-tuning on a task once it has reached SOTA to save compute.
In total, the model was trained for 2.65 million steps.
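
The checkpoint-by-checkpoint schedule described above can be summarised in a short sketch. All names, step counts and thresholds here are hypothetical placeholders; this only illustrates the "fine-tune every N steps, retire a task once it reaches SOTA" loop.

```python
# Hypothetical sketch of the continual fine-tuning schedule (illustration only).
N = 100_000                       # fine-tune roughly every 50k-100k pretraining steps
TOTAL_PRETRAIN_STEPS = 2_000_000  # pretraining runs for ~2M steps

def finetune_and_evaluate(checkpoint_step: int, task: str) -> float:
    """Hypothetical helper: fine-tune the given checkpoint on `task`, return its score."""
    return 0.0  # placeholder

sota = {"superglue": 90.0, "xsum": 50.0}  # task -> (made-up) SOTA threshold
remaining = set(sota)

for step in range(N, TOTAL_PRETRAIN_STEPS + 1, N):
    for task in list(remaining):
        score = finetune_and_evaluate(step, task)
        if score >= sota[task]:
            remaining.remove(task)    # stop fine-tuning this task once SOTA is reached
```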

**Important**: For more details, please see sections 5.2.1 and 5.2.2 of the paper.

## Contribution

This model was contributed by [Daniel Hesslow](https://huggingface.co/Seledorn).

## Examples

Note that the model has been fine-tuned (see the fine-tuning section above).

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
```
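
As a rough usage sketch (not part of the original card): assuming the checkpoint is published under the repository id `google/ul2` (an assumption), it loads like any other T5-style model. The `[S2S]` mode prefix reflects the mode-switching idea described above; the exact token strings are likewise an assumption and should be checked against the checkpoint's documentation.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# "google/ul2" is an assumed repository id for this 20B checkpoint; adjust as needed.
# Loading the full model requires a large amount of memory.
model = T5ForConditionalGeneration.from_pretrained("google/ul2")
tokenizer = AutoTokenizer.from_pretrained("google/ul2")

# Prepend an (assumed) mode token to switch the model into prefix-LM style generation.
prompt = "[S2S] A story about a dragon and a knight: <extra_id_0>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```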