Text2Text Generation
Transformers
PyTorch
5 languages
t5
flan-ul2
Inference Endpoints
text-generation-inference
ybelkada ArthurZ HF staff committed on
Commit
b5c5730
1 Parent(s): 45377e4

Update README.md (#2)


- Update README.md (789a1c780444af9a54cec6f3e3ac0e1e4cfb982d)


Co-authored-by: Arthur Zucker <ArthurZ@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +42 -13
README.md CHANGED
@@ -39,6 +39,10 @@ widget:
  It's not certain how many lessons you'll learn by your thirties. Does the
  premise entail the hypothesis?
  example_title: Premise and hypothesis
  tags:
  - text2text-generation
  datasets:
@@ -56,17 +60,21 @@ datasets:
  license: apache-2.0
  ---

- # TL;DR FLan-UL2 improvements over previous version
- The original UL2 model was only trained with receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
- This Flan-UL2 checkpoint uses a receptive field of 2048 which makes it more usable for few-shot in-context learning.
-
- The original UL2 model also had mode switch tokens that was rather mandatory to get good performance.
- However, they were a little cumbersome as this requires often some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.

- # Converting from T5x to huggingface
- You can use the [`convert_`]() and pass the argument `strict = False`. The final layer norm is missing from the original dictionnary, we used an identity layer.

- # Performance improvment

  The reported results are the following :
  | | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |
@@ -76,8 +84,26 @@ The reported results are the following :
  | FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
  | FLAN-UL2 20B | 55.7(+1.1%) | 45.9(+1.3%) | 52.2(+7.4%) | 42.7(+3.1%) | 49.1(+3.2%) |


- # Introduction

  UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), apre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
 
@@ -95,9 +121,12 @@ Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal
  # Training

- ## Flan UL2, a 20B Flan trained UL2 model
  The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan Prompting. This means that the original training corpus is `C4`,


  ## UL2 PreTraining
 
@@ -113,7 +142,7 @@ UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](http
  The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:

- ## Mixture of Denoisers

  To quote the paper:
  > We conjecture that a strong universal model has to be exposed to solving diverse set of problems
@@ -164,7 +193,7 @@ In total, the model was trained for 2.65 million steps.
  ## Contribution

- This model was contributed by [Younes Belkada](https://huggingface.co/Seledorn) & [Arthur Zucker]().

  ## Examples
 
 
  It's not certain how many lessons you'll learn by your thirties. Does the
  premise entail the hypothesis?
  example_title: Premise and hypothesis
+ - text: >-
+     Answer the following question by reasoning step by step.
+     The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?
+   example_title: Chain of thought
  tags:
  - text2text-generation
  datasets:
 
  license: apache-2.0
  ---

+ # TL;DR Flan-UL2
+ Flan-UL2 is an encoder-decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released earlier last year. It was fine-tuned using the "Flan" prompt tuning
+ and dataset collection.
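Since Flan-UL2 reuses the UL2 configuration, a quick way to check the architecture hyper-parameters is to load the config. This is a minimal editor's sketch (the printed fields are standard `T5Config` attributes, not something stated in the card):

```python
from transformers import AutoConfig

# Flan-UL2 shares its configuration with google/ul2 (a T5-style encoder-decoder).
config = AutoConfig.from_pretrained("google/flan-ul2")
print(config.model_type, config.d_model, config.num_layers, config.num_heads)
```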
 
 
 
+ According to the original [blog]() here are the notable improvements:
+ - The original UL2 model was only trained with a receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
+ - The Flan-UL2 checkpoint uses a receptive field of 2048, which makes it more usable for few-shot in-context learning (see the sketch below).
+ - The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome, as this often requires some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with a small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.
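As a rough illustration of what the larger receptive field enables, here is a minimal few-shot prompting sketch. The in-context examples and the explicit 2048-token budget check are the editor's illustrative assumptions, not part of the original card:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", device_map="auto")

# Pack several in-context examples into a single prompt; the 2048-token
# receptive field leaves far more room for shots than the original UL2's 512.
shots = [
    "Review: The movie was fantastic. Sentiment: positive",
    "Review: I wasted two hours of my life. Sentiment: negative",
]
prompt = "\n".join(shots) + "\nReview: The plot was predictable but the acting saved it. Sentiment:"

inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
assert inputs.shape[1] <= 2048, "prompt exceeds the model's receptive field"
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```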

+ ## Converting from T5x to huggingface
+ You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict = False`. The final layer norm is missing from the original dictionary, which is why we pass the `strict=False` argument.
+ ```bash
+ python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path ~/code/ul2/flan-ul220b-v3/ --config_file config.json --pytorch_dump_path ~/code/ul2/flan-ul2
+ ```
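To sanity-check the converted checkpoint, loading it back with `transformers` is usually enough; a small sketch, where the dump path is simply the example path used above:

```python
import os
from transformers import AutoModelForSeq2SeqLM

# Example path from the conversion command above; adjust it to your own setup.
dump_path = os.path.expanduser("~/code/ul2/flan-ul2")

model = AutoModelForSeq2SeqLM.from_pretrained(dump_path)
print(f"{model.num_parameters() / 1e9:.1f}B parameters")
```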
+ ## Performance improvement

  The reported results are the following (the numbers in parentheses are relative improvements over FLAN-T5-XXL 11B):
  | Model | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |
 
  | FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
  | FLAN-UL2 20B | 55.7(+1.1%) | 45.9(+1.3%) | 52.2(+7.4%) | 42.7(+3.1%) | 49.1(+3.2%) |
 
+ # Using the model
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ import torch
+ model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
+
+ input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"
+
+ inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
+ outputs = model.generate(inputs, max_length=200)
+
+ print(tokenizer.decode(outputs[0]))
+ # <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
+
+ ```
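Note that `load_in_8bit=True` relies on the `bitsandbytes` library and `device_map="auto"` on `accelerate`; if you have enough GPU memory, you can drop the 8-bit flag and load the weights in half precision with `torch_dtype=torch.bfloat16` instead.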

+
+ # Introduction to UL2

  UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
 
 
  # Training

+ ## Flan UL2
  The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan Prompting. This means that the original training corpus is `C4`.
 
+ In “Scaling Instruction-Finetuned Language Models” (Chung et al.), also sometimes referred to as the Flan2 paper, the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions, which enables generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction tuned with Flan.
+
+ The Flan datasets have also been open sourced in “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning” (Longpre et al.). See the Google AI Blogpost: “The Flan Collection: Advancing Open Source Methods for Instruction Tuning”.

  ## UL2 PreTraining
 
 
  The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:

+ ### Mixture of Denoisers

  To quote the paper:
  > We conjecture that a strong universal model has to be exposed to solving diverse set of problems
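For intuition, here is an editor's toy sketch of the span-corruption input/target format that T5-style denoisers build on. The span positions and lengths below are arbitrary; the actual R-, S- and X-denoisers in UL2 differ in corruption rate and span length:

```python
def span_corrupt(tokens, span_starts, span_len=2):
    """Replace the spans starting at `span_starts` with T5-style sentinel tokens.

    Returns (input_tokens, target_tokens): each corrupted span in the input is
    replaced by one sentinel, and the target lists each sentinel followed by
    the tokens it hides.
    """
    inputs, targets = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if i in span_starts:
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            targets.extend([sentinel] + tokens[i:i + span_len])
            sentinel_id += 1
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


words = "UL2 mixes several denoising objectives during pretraining".split()
inp, tgt = span_corrupt(words, span_starts={1, 4})
print(" ".join(inp))  # UL2 <extra_id_0> denoising <extra_id_1> pretraining
print(" ".join(tgt))  # <extra_id_0> mixes several <extra_id_1> objectives during
```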
 
  ## Contribution

+ This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

  ## Examples