diff --git a/.gitattributes b/.gitattributes deleted file mode 100644 index c7d9f3332a950355d5a77d85000f05e6f45435ea..0000000000000000000000000000000000000000 --- a/.gitattributes +++ /dev/null @@ -1,34 +0,0 @@ -*.7z filter=lfs diff=lfs merge=lfs -text -*.arrow filter=lfs diff=lfs merge=lfs -text -*.bin filter=lfs diff=lfs merge=lfs -text -*.bz2 filter=lfs diff=lfs merge=lfs -text -*.ckpt filter=lfs diff=lfs merge=lfs -text -*.ftz filter=lfs diff=lfs merge=lfs -text -*.gz filter=lfs diff=lfs merge=lfs -text -*.h5 filter=lfs diff=lfs merge=lfs -text -*.joblib filter=lfs diff=lfs merge=lfs -text -*.lfs.* filter=lfs diff=lfs merge=lfs -text -*.mlmodel filter=lfs diff=lfs merge=lfs -text -*.model filter=lfs diff=lfs merge=lfs -text -*.msgpack filter=lfs diff=lfs merge=lfs -text -*.npy filter=lfs diff=lfs merge=lfs -text -*.npz filter=lfs diff=lfs merge=lfs -text -*.onnx filter=lfs diff=lfs merge=lfs -text -*.ot filter=lfs diff=lfs merge=lfs -text -*.parquet filter=lfs diff=lfs merge=lfs -text -*.pb filter=lfs diff=lfs merge=lfs -text -*.pickle filter=lfs diff=lfs merge=lfs -text -*.pkl filter=lfs diff=lfs merge=lfs -text -*.pt filter=lfs diff=lfs merge=lfs -text -*.pth filter=lfs diff=lfs merge=lfs -text -*.rar filter=lfs diff=lfs merge=lfs -text -*.safetensors filter=lfs diff=lfs merge=lfs -text -saved_model/**/* filter=lfs diff=lfs merge=lfs -text -*.tar.* filter=lfs diff=lfs merge=lfs -text -*.tflite filter=lfs diff=lfs merge=lfs -text -*.tgz filter=lfs diff=lfs merge=lfs -text -*.wasm filter=lfs diff=lfs merge=lfs -text -*.xz filter=lfs diff=lfs merge=lfs -text -*.zip filter=lfs diff=lfs merge=lfs -text -*.zst filter=lfs diff=lfs merge=lfs -text -*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..b92e35f0e8b255de235344f58868ee17713ddf60 --- /dev/null +++ b/.gitignore @@ -0,0 +1,3 @@ +.idea/* +.DS_Store +.vscode \ No newline at end of file diff --git a/EasyOCRLite 
b/EasyOCRLite new file mode 160000 index 0000000000000000000000000000000000000000..2a52277dd2e833b08eb95a88b0e9182806070a75 --- /dev/null +++ b/EasyOCRLite @@ -0,0 +1 @@ +Subproject commit 2a52277dd2e833b08eb95a88b0e9182806070a75 diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..a7767b63a8d61b2622642ccc9012f06af5053e17 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. 
+ + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. 
+ + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+ + Copyright 1999-2022 Alibaba Group Holding Ltd. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/README.md b/README.md index 768fefb5a2eb585542c7332aa2aba2e2924235ff..9144a9e88f4839414b250e2f7ab7b0bf34b7ab36 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,600 @@ ---- -title: OFA OCR -emoji: 📉 -colorFrom: gray -colorTo: blue -sdk: gradio -sdk_version: 3.9.1 -app_file: app.py -pinned: false -license: apache-2.0 ---- - -Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference + + +
+ ModelScope  |  Checkpoints  |  Colab  |  Demo  |  Paper   |  Blog +
+ +
+# Online Demos
+We provide online demos via Hugging Face Spaces so that you can interact with our pretrained and finetuned models. Below are the links to the demos:
+* Image Captioning \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_image-caption_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Image_Caption)\]
+* Visual Grounding \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-grounding_refcoco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Grounding)\]
+* Visual Question Answering \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-question-answering_pretrain_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Question_Answering)\]
+* Text-to-Image Generation \[[ModelScope](https://modelscope.cn/#/models/damo/ofa_text-to-image-synthesis_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Text2Image_Generation)\]
+* Generic Interface \[[Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Generic_Interface)\]
+
+We also provide Colab notebooks that walk through the procedures step by step. Click [here](colab.md) to check them out!
+
+
+# Use in Hugging Face Transformers
+We support OFA inference in Hugging Face Transformers. Check the [README](transformers.md) and the [Colab Notebook](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing) for more information. The code is released in the branch https://github.com/OFA-Sys/OFA/tree/feature/add_transformers
+
+
+
+# News
+* 2022.8.16: Released the **Chinese** version of OFA. Using **OFA-CN** only requires switching to `bpe_dir=../../utils/BERT_CN_dict` and `bpe=bert` and loading the Chinese checkpoints provided in [checkpoints_cn.md](checkpoints_cn.md). For now, we only provide base-size and large-size pretrained checkpoints, plus finetuned checkpoints on [MUGE Caption](https://tianchi.aliyun.com/muge) and the Chinese version of RefCOCO(-/+/g) (to be released soon).
+* 2022.8.5: Released support of **prompt tuning** for OFA. Check our paper [here](https://arxiv.org/abs/2208.02532)! Please see the [prompt_tuning.md](prompt_tuning.md) for further details.
+* 2022.7.7: Updated support of OFA on **huggingface transformers** (fixed bugs in forward, added the sequence generator from Fairseq to ensure performance, etc.). Refer to the doc [transformers.md](transformers.md) and the branch `feature/add_transformers`.
+* 2022.6.17: Released the pretrained checkpoint of **OFA-Huge**. To use it, set `--arch=ofa_huge` in the script.
+* 2022.5.15: OFA was accepted by **ICML 2022**
+* 2022.4.28: Added support for inference on **huggingface transformers**. For usage, please refer to the doc [transformers.md](transformers.md) and our [huggingface models](https://huggingface.co/OFA-Sys).
+* 2022.4.16: Released lightweight pretrained models **OFA-Medium** (~93M params) and **OFA-Tiny** (~33M params) in [checkpoints.md](checkpoints.md). To use them, you just need to load the corresponding checkpoint and set `--arch=ofa_medium` or `--arch=ofa_tiny` in the scripts.
+
+More News
+
+
+ ofa_base.pt
and change --arch=ofa_large
to --arch=ofa_base
in the training scripts.
Model | Ckpt | Params | Backbone | Hidden size | Intermediate size | Num. of heads | Enc layers | Dec layers | +
---|---|---|---|---|---|---|---|---|
OFATiny | Download | 33M | ResNet50 | 256 | 1024 | 4 | 4 | 4 | +
OFAMedium | Download | 93M | ResNet101 | 512 | 2048 | 8 | 4 | 4 | +
OFABase | Download | 180M | ResNet101 | 768 | 3072 | 12 | 6 | 6 | +
OFALarge | Download | 470M | ResNet152 | 1024 | 4096 | 16 | 12 | 12 | +
OFAHuge | Download | 930M | ResNet152 | 1280 | 5120 | 16 | 24 | 12 | +
Task | Image Captioning | VQA | Visual Entailment | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | +
---|---|---|---|---|---|---|
Dataset | COCO | VQA v2 | SNLI-VE | RefCOCO | RefCOCO+ | RefCOCOg | +
Split | Karpathy test (CE/CIDEr) | test-dev/test-std | val/test | val/test-a/test-b | val/test-a/test-b | val-u/test-u | +
Metric | CIDEr | Acc. | Acc. | Acc. | Acc. | Acc. | +
OFATiny | 119.0 / 128.7 | 70.3 / 70.4 | 85.3 / 85.2 | 80.20 / 84.07 / 75.00 | 68.22 / 75.13 / 57.66 | 72.02 / 69.74 | +
OFAMedium | 130.4 / 140.3 | 75.4 / 75.5 | 86.6 / 87.0 | 85.34 / 87.68 / 77.92 | 76.09 / 83.04 / 66.25 | 78.76 / 78.58 | +
OFABase | 138.2 / 146.7 | 78.0 / 78.1 | 89.3 / 89.2 | 88.48 / 90.67 / 83.30 | 81.39 / 87.15 / 74.29 | 82.29 / 82.31 | +
OFALarge | 142.2 / 150.7 | 80.4 / 80.7 | 90.3 / 90.2 | 90.05 / 92.93 / 85.26 | 85.80 / 89.87 / 79.22 | 85.89 / 86.55 | +
OFAHuge | 145.3 / 154.9 | 82.0 / 82.0 | 91.0 / 91.2 | 92.04 / 94.03 / 88.44 | 87.86 / 91.70 / 80.71 | 88.07 / 88.78 | +
+ To pretrain OFA, first download the dataset we provide (pretrain_data_examples.zip, a small subset of the original pretraining data). For your own pretraining datasets, prepare the training samples in the same format. `pretrain_data_examples.zip` contains 4 TSV files: `vision_language_examples.tsv`, `text_examples.tsv`, `image_examples.tsv` and `detection_examples.tsv`. Details of these files are as follows:
+
+
all_captions.txt
, object.txt
and type2ans.json
. The data in these files are used as negative samples for the image-text matching (ITM) task.
+
+
+ By default, the pretraining script attempts to restore the released pretrained checkpoint of OFA-Base or OFA-Large and continues pretraining from it. We recommend this continued pretraining, which achieves much better results than pretraining from scratch. To use it, download the pretrained weights in advance (see checkpoints.md) and put them in the directory `OFA/checkpoints/`. Otherwise, pretraining starts from scratch.
+
+cd run_scripts/pretraining
+bash pretrain_ofa_large.sh  # Pretrain OFA-Large. For OFA-Base, use pretrain_ofa_base.sh
+
+ If the pretrained OFA checkpoint is restored successfully, you will see the following information in the log: +
++INFO: Loaded checkpoint ../../checkpoints/ofa_large.pt ++
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile `caption_data.zip` contains caption_stage1_train.tsv, caption_stage2_train.tsv, caption_val.tsv and caption_test.tsv. Each image corresponds to only 1 caption in `caption_stage1_train.tsv` and to multiple captions in the other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample in the following format: uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used), and image base64 string, separated by tabs.
+
+162365 12455 the sun sets over the trees beyond some docks. sky&&water&&dock&&pole /9j/4AAQSkZJ....UCP/2Q== ++
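The record layout above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the names `CaptionSample` and `parse_caption_line` are hypothetical, and the image payload in the demo is a synthetic stand-in rather than a real JPEG.

```python
import base64
from typing import List, NamedTuple


class CaptionSample(NamedTuple):
    uniq_id: str
    image_id: str
    caption: str
    object_labels: List[str]  # predicted labels; the README says they are not used
    image_bytes: bytes


def parse_caption_line(line: str) -> CaptionSample:
    """Split one tab-separated caption record into typed fields."""
    uniq_id, image_id, caption, labels, image_b64 = line.rstrip("\n").split("\t")
    return CaptionSample(
        uniq_id=uniq_id,
        image_id=image_id,
        caption=caption,
        object_labels=labels.split("&&") if labels else [],
        image_bytes=base64.b64decode(image_b64),
    )


# Synthetic record for illustration only: the payload is not a real image.
fake_image = base64.b64encode(b"not-a-real-jpeg").decode("ascii")
record = "\t".join([
    "162365",
    "12455",
    "the sun sets over the trees beyond some docks.",
    "sky&&water&&dock&&pole",
    fake_image,
])
sample = parse_caption_line(record)
print(sample.caption, sample.object_labels)
```

Keeping the image as raw bytes leaves the choice of image library to the caller.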
+ Following standard practice, we divide the finetuning process of image captioning into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 NVIDIA-V100 GPUs with 32GB memory (expected to obtain ~139.5 CIDEr on the validation set at this stage). In stage 2, we select the best checkpoint of stage 1 and train with CIDEr optimization on 8 NVIDIA-V100 GPUs. Note that CIDEr optimization is very unstable and requires careful hyperparameter tuning. If you encounter training errors in stage 2, you can increase the batch size or reduce the learning rate. If neither works, you can set `--freeze-resnet` to freeze the inner states of batch normalization.
+
+cd run_scripts/caption
+nohup sh train_caption_stage1.sh > train_stage1.out &  # stage 1, train with cross-entropy loss
+nohup sh train_caption_stage2.sh > train_stage2.out &  # stage 2, load the best ckpt of stage1 and train with CIDEr optimization
+
+ Run the following commands to get your results and evaluate your model. +
++cd run_scripts/caption ; sh evaluate_caption.sh # inference & evaluate ++
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile `coco_image_gen.zip` contains `coco_vqgan_train.tsv`, `coco_vqgan_dev.tsv` and `coco_vqgan_full_test.tsv`. Each line of the dataset represents a sample in the following format: uniq-id, image-code (produced by VQGAN, a list of integers separated by single spaces), and lowercased caption, separated by tabs.
+
+1 6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846 the people are posing for a group photo. ++
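As an illustration of the format above, the following sketch splits one record into its three fields and turns the image code into a list of integers. The function name `parse_image_gen_line` is hypothetical and the demo record is shortened; it is not taken from the dataset.

```python
def parse_image_gen_line(line):
    """Return (uniq_id, image_codes, caption) for one tab-separated record."""
    uniq_id, code_field, caption = line.rstrip("\n").split("\t")
    # The image code is a run of integers separated by single spaces.
    image_codes = [int(token) for token in code_field.split(" ")]
    return uniq_id, image_codes, caption


# Shortened, synthetic record: a real sample carries a much longer code list.
record = "1\t6674 4336 4532\tthe people are posing for a group photo."
uniq_id, image_codes, caption = parse_image_gen_line(record)
print(uniq_id, len(image_codes), caption)
```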
+ The checkpoint zipfile `image_gen_large_best.zip` contains `image_gen_large_best.pt`, `vqgan/last.ckpt`, `vqgan/model.yaml` and `clip/Vit-B-16.pt`.
+
+ (Optional, but achieves better results): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. +
+
+cd dataset/image_gen
+ln coco_vqgan_train.tsv coco_vqgan_train_1.tsv
+for idx in $(seq 1 9); do shuf coco_vqgan_train_${idx}.tsv > coco_vqgan_train_$((idx+1)).tsv; done  # each file is used for one epoch
+
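Where `shuf` is not available, the same per-epoch files can be produced in Python. This is a rough equivalent under stated assumptions, not part of the repository: `make_epoch_files` is a hypothetical helper, it writes copies instead of a hard link for file 1, and it loads the whole TSV into memory, so it only suits files that fit in RAM.

```python
import random
import tempfile
from pathlib import Path


def make_epoch_files(train_tsv, num_epochs, seed=0):
    """Write <stem>_1 ... <stem>_<num_epochs> next to train_tsv.

    File 1 keeps the original row order; every later file is a reshuffle
    of the previous one, mirroring the shell loop.
    """
    src = Path(train_tsv)
    rows = src.read_text(encoding="utf-8").splitlines()
    rng = random.Random(seed)  # fixed seed so the shuffles are reproducible
    outputs = []
    for epoch in range(1, num_epochs + 1):
        if epoch > 1:
            rng.shuffle(rows)
        out = src.with_name(f"{src.stem}_{epoch}{src.suffix}")
        out.write_text("\n".join(rows) + "\n", encoding="utf-8")
        outputs.append(out)
    return outputs


# Demo on a tiny synthetic file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "coco_vqgan_train.tsv"
    train.write_text("".join(f"{i}\trow {i}\n" for i in range(20)), encoding="utf-8")
    files = make_epoch_files(train, num_epochs=10)
    names = [f.name for f in files]
    original = sorted(train.read_text(encoding="utf-8").splitlines())
    same_rows = all(
        sorted(f.read_text(encoding="utf-8").splitlines()) == original for f in files
    )
print(names[0], names[-1], same_rows)
```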
+ Following previous practice, we divide the finetuning process of image generation into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 servers, each with 8 V100 32GB GPUs (expected to obtain ~32.5+ CLIP Score on the validation set at this stage). In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization on the same 4-server setup (expected to obtain ~34.0+ CLIP Score on the validation set at this stage). During validation, the generated images are dumped into `_GEN_IMAGE_PATH_`.
+
+# run on each worker after the distributed and data configs have been correctly set following the guide in train_image_gen_stage1_distributed.sh
+cd run_scripts/image_gen
+nohup sh train_image_gen_stage1_distributed.sh  # stage 1, train with cross-entropy loss
+nohup sh train_image_gen_stage2_distributed.sh  # stage 2, load the last ckpt of stage1 and train with CLIP Score optimization
+
+ Run the command below to generate your images. +
++cd run_scripts/image_gen ; sh evaluate_image_gen.sh # inference & evaluate (FID, IS and CLIP Score) ++
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile `vqa_data.zip` is around 100G, and the decompressed data takes around 135G of disk storage. It contains the training, validation and testing samples together with other necessary data resources. (Since `vqa_data.zip` is large, we also provide chunked parts of the dataset files for more convenient and stable downloading. Please refer to issue #68.) Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform each original VQA training question with multiple golden answers into multiple training samples. From the original VQA validation set, we keep around 10k samples for our validation and use the other samples for training. Each line of the dataset represents a VQA sample in the following format: question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL; they bring around +0.1 accuracy improvement), and image base64 string, separated by tabs.
+
+79459 79459 is this person wearing shorts? 0.6|!+no house&&short&&...&&sky /9j/4AAQS...tigZ/9k= ++
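The answer field in the sample above pairs a confidence with an answer string through the `|!+` delimiter. The sketch below reads such a record; the helper names are hypothetical, and treating `&&` as the separator between several answers in one field is an assumption borrowed from how the object labels are joined, not something this section states.

```python
def parse_answers(answer_field):
    """Map each answer string to its confidence, e.g. '0.6|!+no' -> {'no': 0.6}."""
    answers = {}
    for item in answer_field.split("&&"):  # assumption: '&&' joins multiple answers
        confidence, answer = item.split("|!+", 1)
        answers[answer] = float(confidence)
    return answers


def parse_vqa_line(line):
    """Split one tab-separated VQA record into a dict of fields."""
    question_id, image_id, question, answer_field, labels, image_b64 = (
        line.rstrip("\n").split("\t")
    )
    return {
        "question_id": question_id,
        "image_id": image_id,
        "question": question,
        "answers": parse_answers(answer_field),
        "object_labels": labels.split("&&") if labels else [],
        "image_b64": image_b64,  # left undecoded; decode only when the image is needed
    }


# Synthetic record: labels and image payload are shortened placeholders.
record = "\t".join([
    "79459", "79459", "is this person wearing shorts?",
    "0.6|!+no", "house&&short&&sky", "AAAA",
])
sample = parse_vqa_line(record)
print(sample["question"], sample["answers"])
```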
+ For fine-tuning on custom VQA-formulated tasks, please refer to issues #76, #105 and #73 for more information. +
++ (Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, shuffling brings around +0.3 improvement in VQA accuracy. +
+
+cd dataset/vqa_data
+ln vqa_train.tsv vqa_train_1.tsv
+for idx in $(seq 1 9); do shuf vqa_train_${idx}.tsv > vqa_train_$((idx+1)).tsv; done  # each file is used for one epoch
+
+ In our experiments, the VQA finetuning is performed on 4 servers, each with 8 A100 GPUs (with RDMA). We provide the finetuning script `train_vqa_distributed.sh`, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs according to your distributed environment. If you shuffled the training data in the previous step, specify the training data path following the guide in the script comments. The command should be run on each worker.
+
+# run on each worker after the distributed and data configs have been correctly set following the guide in train_vqa_distributed.sh
+cd run_scripts/vqa
+bash train_vqa_distributed.sh
+
+ In our experiments, the finetuning takes around 36 hours (for 12 epochs). After each epoch, an evaluation on the validation set is performed. The best validation accuracy during finetuning will be around 80.8. The log is saved in `${log_dir}`.
+
+ (Update on validation time-cost) As described in the 4. Inference section, we provide 2 types of inference: beam-search and all-candidate inference. By default, all-candidate inference is used for validation during fine-tuning, which achieves better accuracy but takes much longer. We have added an option to the training scripts, `--val-inference-type`, to switch the validation inference type during fine-tuning. If validation takes too long, refer to PR #79 to activate beam-search validation, which takes much less time at the cost of around 0.5-0.6 lower validation score compared with all-candidate validation.
+
+ We provide 2 types of inference, beam-search (much faster but gets sub-optimal accuracy) and all-candidate evaluation (slower but best accuracy).
+ For beam-search inference, use the script `evaluate_vqa_beam.sh`. Refer to the command below. Inference on the test set takes around 16 GPU hours. Afterwards, the result JSON file is dumped to the `${result_path}` defined in the shell script. You can submit the result file `test_predict.json` to EvalAI. Using our released finetuned checkpoint, beam-search inference gets 80.15 validation accuracy, 79.36 test-dev accuracy and 79.48 test-std accuracy (around 0.6 lower than all-candidate evaluation).
+
+cd run_scripts/vqa
+bash evaluate_vqa_beam.sh val  # specify 'val' or 'test'
+
+ For all-candidate evaluation, we recommend using the distributed script `evaluate_vqa_allcand_distributed.sh`. Please refer to the guide in the script to set the distributed configs before running. The result JSON file is dumped to the `${result_path}` defined in the shell script of the rank-0 server. All-candidate evaluation computes scores on all the candidate answers in the VQA dataset, which achieves 80.82 validation accuracy, 79.87 test-dev accuracy and 80.02 test-std accuracy, reproducing the results reported in our paper. However, inference on the test set takes around 1k GPU hours, which is much slower.
+
+# run on each worker after the distributed configs have been correctly set following the guide in evaluate_vqa_allcand_distributed.sh
+cd run_scripts/vqa
+bash evaluate_vqa_allcand_distributed.sh val  # specify 'val' or 'test'
+
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. We provide RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample in the following format: uniq-id, image-id, text, region-coord (separated by commas), and image base64 string, separated by tabs. +
++79_1 237367 A woman in a white blouse holding a glass of wine. 230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k= ++
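To make the region-coord field concrete, the sketch below parses it and computes the intersection-over-union (IoU) between two boxes. Two things here are assumptions rather than statements from this section: that the four numbers are corner coordinates `(x0, y0, x1, y1)`, and that a prediction counts as correct when its IoU with the reference box is at least 0.5, which is the conventional criterion for referring expression comprehension. The function names are hypothetical.

```python
def parse_region(region_coord):
    """Parse 'x0,y0,x1,y1' into a tuple of floats (corner format is assumed)."""
    x0, y0, x1, y1 = (float(value) for value in region_coord.split(","))
    return x0, y0, x1, y1


def box_iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(predicted, reference, threshold=0.5):
    """Assumed accuracy criterion: IoU at or above the threshold."""
    return box_iou(predicted, reference) >= threshold


reference = parse_region("230.79,121.75,423.66,463.06")
print(is_hit(reference, reference), box_iou((0, 0, 2, 2), (1, 0, 3, 2)))
```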
+ Unlike the original paper, we finetune OFA with a drop-path rate of 0.2, and find that training with this hyperparameter achieves better results. We will update the reported results of the paper later. +
+
+cd run_scripts/refcoco
+nohup sh train_refcoco.sh > train_refcoco.out &  # finetune for refcoco
+nohup sh train_refcocoplus.sh > train_refcocoplus.out &  # finetune for refcoco+
+nohup sh train_refcocog.sh > train_refcocog.out &  # finetune for refcocog
+
+ Run the following commands for the evaluation. +
++cd run_scripts/refcoco ; sh evaluate_refcoco.sh # inference & evaluate for refcoco/refcoco+/refcocog ++
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. Each line of the processed dataset represents a sample in the following format: uniq-id, image-id, image base64 string, hypothesis, caption (or text premise), and label, separated by tabs. +
++252244149.jpg#1r1n 252244149 /9j/4AAQ...MD/2Q== a man in pink and gold is chewing on a wooden toothpick. a man in pink is chewing a toothpick on the subway. neutral ++
+ In our experiments, the SNLI-VE finetuning is performed on 8 NVIDIA-V100 GPUs with 32GB memory. In this task, we experimented with only a few sets of hyperparameters. We believe that proper hyperparameter tuning can lead to further accuracy improvement. +
+
+cd run_scripts/snli_ve
+nohup sh train_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve
+
+ Run the following command to obtain the results. +
++cd run_scripts/snli_ve ; sh evaluate_snli_ve.sh dev # specify 'dev' or 'test' ++
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. We provide 7 language understanding datasets from the GLUE benchmark, including COLA, MNLI, MRPC, QNLI, QQP, RTE and SST2. More details about these datasets can be found in this link. +
+
+ For each task, we have tried multiple sets of hyperparameters (including learning rate, batch size, training epochs). The results under different sets of hyperparameters can be found in `${log_dir}`.
+
+cd run_scripts/glue
+nohup sh train_cola.sh > train_cola.out &  # finetune for cola
+nohup sh train_mnli.sh > train_mnli.out &  # finetune for mnli
+nohup sh train_mrpc.sh > train_mrpc.out &  # finetune for mrpc
+nohup sh train_qnli.sh > train_qnli.out &  # finetune for qnli
+nohup sh train_qqp.sh > train_qqp.out &  # finetune for qqp
+nohup sh train_rte.sh > train_rte.out &  # finetune for rte
+nohup sh train_sst2.sh > train_sst2.out &  # finetune for sst2
+
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. Our provided data is derived from the original ImageNet-1K (ILSVRC2012 train & validation) dataset and shares the same data split. To cast classification as a seq2seq task, we use the synset words provided by Caffe as the generation target for each image class. Each line of the processed dataset represents one sample. Its fields (image base64 string, classification label (1-indexed, following the order in synset_words.txt), and synset words of the label) are separated by tabs.
+
+_9j_4AAQS...fzX__Z 769 rugby ball ++
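Because the label column is 1-indexed, it has to be shifted by one before it can index a Python list or a tensor of class scores. The sketch below illustrates that step; `parse_imagenet_line` and `NUM_CLASSES` are hypothetical names, and the code assumes exactly three tab-separated fields per line and the 1000 classes of ImageNet-1K.

```python
from typing import Tuple

NUM_CLASSES = 1000  # ImageNet-1K; assumed upper bound for the label column


def parse_imagenet_line(line: str) -> Tuple[str, int, str]:
    """Return (image_b64, zero_based_class_index, synset_words) for one TSV line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    image_b64, label_text, synset_words = fields
    label = int(label_text)  # 1-indexed in the file
    if not 1 <= label <= NUM_CLASSES:
        raise ValueError(f"label {label} is outside 1..{NUM_CLASSES}")
    return image_b64, label - 1, synset_words


if __name__ == "__main__":
    # The image string is elided here exactly as in the sample line above.
    image_b64, class_index, words = parse_imagenet_line(
        "_9j_4AAQS...fzX__Z\t769\trugby ball"
    )
    print(class_index, words)
```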
+ (Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, this shuffling brings around +0.2 improvement in ImageNet-1K accuracy. +
++cd dataset/imagenet_1k_data +ln imagenet_1k_train.tsv imagenet_1k_train_1.tsv +for idx in `seq 1 9`;do shuf imagenet_1k_train_${idx}.tsv > imagenet_1k_train_$((idx+1)).tsv;done # each file is used for an epoch one by one ++
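The logic of the loop above (epoch 1 keeps the original order, and each later epoch is a shuffle of the previous epoch's file) can be sketched in Python as follows. This is only an in-memory illustration for small inputs: `shuffled_epochs` is a hypothetical helper, the fixed `seed` is an added assumption for reproducibility, and for the full training set the `shuf` command remains the practical route.

```python
import random
from typing import List, Sequence


def shuffled_epochs(lines: Sequence[str], num_epochs: int, seed: int = 0) -> List[List[str]]:
    """Return one ordering of `lines` per epoch.

    Epoch 1 is the original order; epoch k+1 is a shuffle of epoch k,
    mirroring how each shuffled file is derived from the previous one.
    """
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")
    rng = random.Random(seed)  # seeded generator so the result is reproducible
    epochs = [list(lines)]
    for _ in range(num_epochs - 1):
        next_epoch = list(epochs[-1])
        rng.shuffle(next_epoch)
        epochs.append(next_epoch)
    return epochs


if __name__ == "__main__":
    toy_lines = [f"sample_{i}" for i in range(5)]
    for epoch, order in enumerate(shuffled_epochs(toy_lines, num_epochs=3), start=1):
        print(epoch, order)
```

With `num_epochs=10` this yields ten orderings, matching the ten files `imagenet_1k_train_1.tsv` through `imagenet_1k_train_10.tsv` produced by the shell loop.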
+ In our experiments, the ImageNet-1K finetuning is performed on two 8-GPU A100 servers (with RDMA). We provide the finetuning script train_imagenet_distributed.sh, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs according to your distributed environment. If you have shuffled the training data in the previous step, specify the training data path following the guide in the script comments. The command should be run on each worker. For quick evaluation during finetuning, by default we sample 20% of the original validation split and report accuracy on this subset after each epoch. The accuracy on the validation subset is generally within ±0.1 of the accuracy on the whole validation split.
+
+# run on each worker after the distributed and data configs have been correctly set following the guide in train_imagenet_distributed.sh +cd run_scripts/image_classify +bash train_imagenet_distributed.sh ++
+ In our experiments, the finetuning takes around 80 hours (for 32 epochs). The best accuracy on the validation subset during finetuning will be around 85.0. The log is saved in ${log_dir}.
+
+ To get the validation accuracy on the whole ImageNet-1K validation set, run the following command. The evaluation takes around 10 GPU hours. The accuracy is reported on stdout (expected to be around 85.0). +
++cd run_scripts/image_classify ; sh evaluate_imagenet.sh # inference & evaluate for imagenet-1k ++
+ Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The original dataset is taken from UniLM, and we organized the data into TSV format. Each line of the processed dataset represents one sample, with the source and target texts separated by a tab. +
++factory orders for manufactured goods rose #.# percent in september... us september factory orders up #.# percent ++
+ Run the following command to train the model. +
++cd run_scripts/gigaword +nohup sh train_gigaword.sh > train_gigaword.out & # finetune for gigaword ++
+ Run the following command to obtain the results (~36.43 rougeL). +
++cd run_scripts/gigaword ; sh evaluate_gigaword.sh # inference & evaluate for gigaword ++
+ We provide two scripts for stage 1. +
++cd run_scripts/caption +nohup sh train_caption_stage1_el.sh > train_stage1_el.out & # stage 1, train with encouraging loss, expected cider 1.403 +nohup sh train_caption_stage1_el_db.sh > train_stage1_el_db.out & # stage 1, train with encouraging loss, and drop best examples, expected cider 1.404 ++
+cd run_scripts/refcoco +nohup sh train_refcoco_el.sh > train_refcoco_el.out & # finetune for refcoco +nohup sh train_refcocoplus_el.sh > train_refcocoplus_el.out & # finetune for refcoco+ +nohup sh train_refcocog_el.sh > train_refcocog_el.out & # finetune for refcocog ++
Model | #Params | Backbone | Hidden Size | Intermediate Size | #Heads | #Enc. Layers | #Dec. Layers |
---|---|---|---|---|---|---|---|
OFABase | 160M | ResNet101 | 768 | 3072 | 12 | 6 | 6 |
OFALarge | 443M | ResNet152 | 1024 | 4096 | 16 | 12 | 12 |

Model | BLEU@4 | ROUGE-L | CIDEr-D |
---|---|---|---|
Trm | 7.33 | 51.51 | 11.00 |
M6 | 16.19 | 55.06 | 30.75 |
OFABase | 26.23 | 58.95 | 50.70 |
OFALarge | 27.32 | 59.20 | 53.51 |

Model | RefCOCO(val/testA/testB) | RefCOCO+(val/testA/testB) | RefCOCOg(val/test-u) |
---|---|---|---|
OFABase(random-init) | 30.13/35.07/25.03 | 17.89/20.90/15.83 | 20.30/20.45 |
OFABase | 82.18/86.07/76.68 | 69.38/77.26/60.14 | 73.57/72.53 |
OFALarge | 82.84/86.54/76.50 | 71.30/78.56/61.85 | 71.96/71.30 |
+ +* **Convolutional Neural Networks (CNN)** + + [Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)](examples/language_model/conv_lm/README.md) + + [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](examples/conv_seq2seq/README.md) + + [Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)](https://github.com/pytorch/fairseq/tree/classic_seqlevel) + + [Hierarchical Neural Story Generation (Fan et al., 2018)](examples/stories/README.md) + + [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md) +* **LightConv and DynamicConv models** + + [Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)](examples/pay_less_attention_paper/README.md) +* **Long Short-Term Memory (LSTM) networks** + + Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015) +* **Transformer (self-attention) networks** + + Attention Is All You Need (Vaswani et al., 2017) + + [Scaling Neural Machine Translation (Ott et al., 2018)](examples/scaling_nmt/README.md) + + [Understanding Back-Translation at Scale (Edunov et al., 2018)](examples/backtranslation/README.md) + + [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/README.adaptive_inputs.md) + + [Lexically constrained decoding with dynamic beam allocation (Post & Vilar, 2018)](examples/constrained_decoding/README.md) + + [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)](examples/truncated_bptt/README.md) + + [Adaptive Attention Span in Transformers (Sukhbaatar et al., 2019)](examples/adaptive_span/README.md) + + [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md) + + [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md) + + 
[Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md) + + [Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)](examples/joint_alignment_translation/README.md) + + [Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)](examples/mbart/README.md) + + [Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)](examples/byte_level_bpe/README.md) + + [Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)](examples/unsupervised_quality_estimation/README.md) + + [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](examples/wav2vec/README.md) + + [Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models (Enarvi et al., 2020)](examples/pointer_generator/README.md) + + [Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)](examples/linformer/README.md) + + [Cross-lingual Retrieval for Iterative Self-Supervised Training (Tran et al., 2020)](examples/criss/README.md) + + [Deep Transformers with Latent Depth (Li et al., 2020)](examples/latent_depth/README.md) + + [Unsupervised Cross-lingual Representation Learning for Speech Recognition (Conneau et al., 2020)](https://arxiv.org/abs/2006.13979) + + [Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training (Hsu et al., 2021)](https://arxiv.org/abs/2104.01027) + + [Unsupervised Speech Recognition (Baevski et al., 2021)](https://arxiv.org/abs/2105.11084) +* **Non-autoregressive Transformers** + + Non-Autoregressive Neural Machine Translation (Gu et al., 2017) + + Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee et al., 2018) + + Insertion Transformer: Flexible Sequence Generation via Insertion Operations (Stern et al. 
2019) + + Mask-Predict: Parallel Decoding of Conditional Masked Language Models (Ghazvininejad et al., 2019) + + [Levenshtein Transformer (Gu et al., 2019)](examples/nonautoregressive_translation/README.md) +* **Finetuning** + + [Better Fine-Tuning by Reducing Representational Collapse (Aghajanyan et al. 2020)](examples/rxf/README.md) + +
+ +* September 2020: [Added Linformer code](examples/linformer/README.md) +* September 2020: [Added pointer-generator networks](examples/pointer_generator/README.md) +* August 2020: [Added lexically constrained decoding](examples/constrained_decoding/README.md) +* August 2020: [wav2vec2 models and code released](examples/wav2vec/README.md) +* July 2020: [Unsupervised Quality Estimation code released](examples/unsupervised_quality_estimation/README.md) +* May 2020: [Follow fairseq on Twitter](https://twitter.com/fairseq) +* April 2020: [Monotonic Multihead Attention code released](examples/simultaneous_translation/README.md) +* April 2020: [Quant-Noise code released](examples/quant_noise/README.md) +* April 2020: [Initial model parallel support and 11B parameters unidirectional LM released](examples/megatron_11b/README.md) +* March 2020: [Byte-level BPE code released](examples/byte_level_bpe/README.md) +* February 2020: [mBART model and code released](examples/mbart/README.md) +* February 2020: [Added tutorial for back-translation](https://github.com/pytorch/fairseq/tree/main/examples/backtranslation#training-your-own-model-wmt18-english-german) +* December 2019: [fairseq 0.9.0 released](https://github.com/pytorch/fairseq/releases/tag/v0.9.0) +* November 2019: [VizSeq released (a visual analysis toolkit for evaluating fairseq models)](https://facebookresearch.github.io/vizseq/docs/getting_started/fairseq_example) +* November 2019: [CamemBERT model and code released](examples/camembert/README.md) +* November 2019: [BART model and code released](examples/bart/README.md) +* November 2019: [XLM-R models and code released](examples/xlmr/README.md) +* September 2019: [Nonautoregressive translation code released](examples/nonautoregressive_translation/README.md) +* August 2019: [WMT'19 models released](examples/wmt19/README.md) +* July 2019: fairseq relicensed under MIT license +* July 2019: [RoBERTa models and code released](examples/roberta/README.md) +* June 2019: 
[wav2vec models and code released](examples/wav2vec/README.md) + +
+ +
+ +# Flores101: Large-Scale Multilingual Machine Translation + +## Introduction + +Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition. + +Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html + +Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/ + + + +## Pretrained models + +Model | Num layers | Embed dimension | FFN dimension| Vocab Size | #params | Download +---|---|---|---|---|---|--- +`flores101_mm100_615M` | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz +`flores101_mm100_175M` | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz + + +These models are trained similarly to [M2M-100](https://arxiv.org/abs/2010.11125) with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages can be found at the bottom. + + +## Example Generation code + +### Download model, sentencepiece vocab + +```bash +fairseq=/path/to/fairseq +cd $fairseq + +# Download 615M param model. 
+wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz + +# Extract +tar -xvzf flores101_mm100_615M.tar.gz +``` + +### Encode using our SentencePiece Model +Note: Install SentencePiece from [here](https://github.com/google/sentencepiece) + + +```bash +fairseq=/path/to/fairseq +cd $fairseq + +# Download example dataset from German to French +sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de +sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr + +for lang in de fr ; do + python scripts/spm_encode.py \ + --model flores101_mm100_615M/sentencepiece.bpe.model \ + --output_format=piece \ + --inputs=raw_input.de-fr.${lang} \ + --outputs=spm.de-fr.${lang} +done +``` + +### Binarization + +```bash +fairseq-preprocess \ + --source-lang de --target-lang fr \ + --testpref spm.de-fr \ + --thresholdsrc 0 --thresholdtgt 0 \ + --destdir data_bin \ + --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt +``` + +### Generation + + +```bash +fairseq-generate \ + data_bin \ + --batch-size 1 \ + --path flores101_mm100_615M/model.pt \ + --fixed-dictionary flores101_mm100_615M/dict.txt \ + -s de -t fr \ + --remove-bpe 'sentencepiece' \ + --beam 5 \ + --task translation_multi_simple_epoch \ + --lang-pairs flores101_mm100_615M/language_pairs.txt \ + --decoder-langtok --encoder-langtok src \ + --gen-subset test \ + --fp16 \ + --dataset-impl mmap \ + --distributed-world-size 1 --distributed-no-spawn +``` + +### Supported Languages and lang code + +Language | lang code +---|--- +Afrikaans | af +Amharic | am +Arabic | ar +Assamese | as +Asturian | ast +Aymara | ay +Azerbaijani | az +Bashkir | ba +Belarusian | be +Bulgarian | bg +Bengali | bn +Breton | br +Bosnian | bs +Catalan | ca +Cebuano | ceb +Chokwe | cjk +Czech | cs +Welsh | cy +Danish | da +German | de +Dyula| dyu +Greek | el +English | en +Spanish | es +Estonian | et +Persian | fa +Fulah | ff +Finnish | fi +French | fr +Western 
Frisian | fy +Irish | ga +Scottish Gaelic | gd +Galician | gl +Gujarati | gu +Hausa | ha +Hebrew | he +Hindi | hi +Croatian | hr +Haitian Creole | ht +Hungarian | hu +Armenian | hy +Indonesian | id +Igbo | ig +Iloko | ilo +Icelandic | is +Italian | it +Japanese | ja +Javanese | jv +Georgian | ka +Kachin | kac +Kamba | kam +Kabuverdianu | kea +Kongo | kg +Kazakh | kk +Central Khmer | km +Kimbundu | kmb +Northern Kurdish | kmr +Kannada | kn +Korean | ko +Kurdish | ku +Kyrgyz | ky +Luxembourgish | lb +Ganda | lg +Lingala | ln +Lao | lo +Lithuanian | lt +Luo | luo +Latvian | lv +Malagasy | mg +Maori | mi +Macedonian | mk +Malayalam | ml +Mongolian | mn +Marathi | mr +Malay | ms +Maltese | mt +Burmese | my +Nepali | ne +Dutch | nl +Norwegian | no +Northern Sotho | ns +Nyanja | ny +Occitan | oc +Oromo | om +Oriya | or +Punjabi | pa +Polish | pl +Pashto | ps +Portuguese | pt +Quechua | qu +Romanian | ro +Russian | ru +Sindhi | sd +Shan | shn +Sinhala | si +Slovak | sk +Slovenian | sl +Shona | sn +Somali | so +Albanian | sq +Serbian | sr +Swati | ss +Sundanese | su +Swedish | sv +Swahili | sw +Tamil | ta +Telugu | te +Tajik | tg +Thai | th +Tigrinya | ti +Tagalog | tl +Tswana | tn +Turkish | tr +Ukrainian | uk +Umbundu | umb +Urdu | ur +Uzbek | uz +Vietnamese | vi +Wolof | wo +Xhosa | xh +Yiddish | yi +Yoruba | yo +Chinese| zh +Zulu | zu diff --git a/fairseq/examples/flores101/flores_logo.png b/fairseq/examples/flores101/flores_logo.png new file mode 100644 index 0000000000000000000000000000000000000000..d4d1455c6eab608ff5317ce885183cd213564273 Binary files /dev/null and b/fairseq/examples/flores101/flores_logo.png differ diff --git a/fairseq/examples/fully_sharded_data_parallel/README.md b/fairseq/examples/fully_sharded_data_parallel/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b9e44fef48bee5faeee27b3d1d1b1eb96b6a477f --- /dev/null +++ b/fairseq/examples/fully_sharded_data_parallel/README.md @@ -0,0 +1,177 @@ +# Fully Sharded Data 
Parallel (FSDP) + +## Overview +Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and +[Google](https://arxiv.org/abs/2004.13336) has shown that data parallel +training can be made significantly more efficient by sharding the model +parameters and optimizer state across data parallel workers. These ideas are +encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper provided +by [fairscale](https://github.com/facebookresearch/fairscale/). + +Compared to PyTorch DDP: +* FSDP produces the same results as PyTorch DDP (it's still synchronous data parallel training) +* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs +* FSDP is faster than PyTorch DDP because the optimizer step is sharded, and the communication can be overlapped with the forward pass +* FSDP enables training 13B parameter models on 8 GPUs and 175B parameter models on 128 GPUs + +FSDP is fully supported in fairseq via the following new arguments: +* `--ddp-backend=fully_sharded`: enables full sharding via FSDP +* `--cpu-offload`: offloads the optimizer state and FP32 model copy to CPU (combine with `--optimizer=cpu_adam`) +* `--no-reshard-after-forward`: increases training speed for large models (1B+ params) and is similar to ZeRO stage 2 +* other popular options (`--fp16`, `--update-freq`, `--checkpoint-activations`, `--offload-activations`, etc.) continue to work as normal + ++ +FSDP currently has several limitations compared to fairseq's default DDP backend (PyTorch DDP): +* while FSDP is fully compatible with pointwise Optimizers (e.g., Adam, AdamW, Adadelta, Adamax, SGD, etc.), it is not currently compatible with non-pointwise Optimizers (e.g., Adagrad, Adafactor, LAMB, etc.) 
+* FSDP depends on flattening the parameters, so models that currently require `--fp16-no-flatten-grads` may not be supported + +See the [fairscale docs](https://fairscale.readthedocs.io/en/latest/api/nn/fsdp_tips.html) for a more detailed +explanation of these and other limitations. + +
+ + + +See the [fairscale docs](https://fairscale.readthedocs.io/en/latest/api/nn/fsdp_tips.html) for a more detailed +explanation of how FSDP works. + +
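To make the effect of sharding concrete, here is a rough back-of-the-envelope estimate of model-state memory per GPU. It is a sketch under stated assumptions, not a measurement: it uses the accounting from the ZeRO paper linked in the overview for mixed-precision Adam (about 16 bytes of model state per parameter: FP16 parameters and gradients plus FP32 parameters, momentum, and variance) and assumes this state is split evenly across GPUs. The function name is hypothetical, and the parameter count is the one reported in the training logs in this README. Activations, temporary buffers, and fragmentation are ignored, so real usage will be higher.

```python
GIB = 2 ** 30

# Assumption: 2 (FP16 params) + 2 (FP16 grads) + 12 (FP32 params, momentum, variance)
BYTES_PER_PARAM_MIXED_PRECISION_ADAM = 16


def model_state_gib_per_gpu(
    num_params: int,
    world_size: int,
    bytes_per_param: int = BYTES_PER_PARAM_MIXED_PRECISION_ADAM,
) -> float:
    """Estimate model-state memory per GPU when state is sharded evenly."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return num_params * bytes_per_param / world_size / GIB


if __name__ == "__main__":
    num_params = 13_110_865_920  # "num. model params" from the logs in this README
    for gpus in (1, 8):
        estimate = model_state_gib_per_gpu(num_params, gpus)
        print(f"{gpus} GPU(s): ~{estimate:.1f} GiB of model state per GPU")
```

Under these assumptions the unsharded model state is far larger than the memory of a single GPU, which is why options such as `--cpu-offload` or sharding across several GPUs are needed for a model of this size.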
+ +``` +(...) +2021-03-08 12:29:51 | INFO | fairseq_cli.train | num. model params: 13,110,865,920 (num. trained: 13,110,865,920) +(...) +2021-03-08 12:29:51 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs) +2021-03-08 12:29:51 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8 +(...) +Adam Optimizer #0 is created with AVX2 arithmetic capability. +Config: alpha=0.000100, betas=(0.900000, 0.980000), weight_decay=0.000000, adam_w=1 +(...) +2021-03-08 12:31:36 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "16.475", "ppl": "91120.8", "wps": "0", "ups": "0", "wpb": "16384", "bsz": "8", "num_updates": "1", "lr": "2e-05", "gnorm": "20.751", "loss_scale": "4", "train_wall": "99", "gb_free": "9.3", "wall": "105"} +2021-03-08 12:32:33 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "16.446", "ppl": "89281.6", "wps": "288.7", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "2", "lr": "4e-05", "gnorm": "19.777", "loss_scale": "4", "train_wall": "57", "gb_free": "9.3", "wall": "161"} +2021-03-08 12:33:12 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0 +2021-03-08 12:33:51 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0 +2021-03-08 12:34:45 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "25.22", "ppl": "3.90691e+07", "wps": "123.4", "ups": "0.01", "wpb": "16384", "bsz": "8", "num_updates": "3", "lr": "6e-05", "gnorm": "131.281", "loss_scale": "1", "train_wall": "133", "gb_free": "9.3", "wall": "294"} +2021-03-08 12:35:43 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "18.079", "ppl": "276809", "wps": "285.5", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "4", "lr": "8e-05", "gnorm": "13.776", "loss_scale": "1", "train_wall": "57", "gb_free": "9.3", "wall": "351"} +2021-03-08 12:36:35 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": 
"23.729", "ppl": "1.39088e+07", "wps": "316.7", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "72.774", "loss_scale": "1", "train_wall": "52", "gb_free": "9.3", "wall": "403"} +2021-03-08 12:37:28 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "20.429", "ppl": "1.41203e+06", "wps": "307.6", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "6", "lr": "8e-05", "gnorm": "60.846", "loss_scale": "1", "train_wall": "53", "gb_free": "9.3", "wall": "456"} +2021-03-08 12:38:27 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "18.965", "ppl": "511684", "wps": "279.4", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "7", "lr": "6e-05", "gnorm": "22.687", "loss_scale": "1", "train_wall": "59", "gb_free": "9.3", "wall": "515"} +2021-03-08 12:39:18 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "18.345", "ppl": "332887", "wps": "319.1", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "8", "lr": "4e-05", "gnorm": "8.451", "loss_scale": "1", "train_wall": "51", "gb_free": "9.3", "wall": "566"} +2021-03-08 12:40:11 | INFO | train_inner | {"epoch": 1, "update": 0.002, "loss": "18.262", "ppl": "314336", "wps": "305.9", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "9", "lr": "2e-05", "gnorm": "6.457", "loss_scale": "1", "train_wall": "54", "gb_free": "9.3", "wall": "620"} +2021-03-08 12:41:04 | INFO | train_inner | {"epoch": 1, "update": 0.002, "loss": "17.556", "ppl": "192686", "wps": "311.8", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "10", "lr": "0", "gnorm": "5.796", "loss_scale": "1", "train_wall": "53", "gb_free": "9.3", "wall": "673"} +2021-03-08 12:41:04 | INFO | fairseq_cli.train | Stopping training due to num_updates: 10 >= max_update: 10 +2021-03-08 12:41:04 | INFO | fairseq_cli.train | begin validation on "valid" subset +2021-03-08 12:43:15 | INFO | valid | {"epoch": 1, "valid_loss": "17.953", "valid_ppl": "253807", "valid_wps": 
"1868.4", "valid_wpb": "15400.2", "valid_bsz": "7.6", "valid_num_updates": "10"} +2021-03-08 12:43:15 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below) +2021-03-08 12:43:15 | INFO | train | {"epoch": 1, "train_loss": "19.351", "train_ppl": "668509", "train_wps": "210.9", "train_ups": "0.01", "train_wpb": "16384", "train_bsz": "8", "train_num_updates": "10", "train_lr": "0", "train_gnorm": "36.26", "train_loss_scale": "1", "train_train_wall": "667", "train_gb_free": "9.3", "train_wall": "804"} +2021-03-08 12:43:15 | INFO | fairseq_cli.train | done training in 798.6 seconds +``` + +
+ +``` +(...) +2021-03-08 18:04:09 | INFO | fairseq_cli.train | num. model params: 13,110,865,920 (num. trained: 13,110,865,920) +(...) +2021-03-08 18:04:09 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs) +2021-03-08 18:04:09 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8 +(...) +Adam Optimizer #0 is created with AVX2 arithmetic capability. +Config: alpha=0.000100, betas=(0.900000, 0.980000), weight_decay=0.000000, adam_w=1 +(...) +2021-03-08 18:05:06 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "16.408", "ppl": "86945.6", "wps": "0", "ups": "0", "wpb": "131072", "bsz": "64", "num_updates": "1", "lr": "2e-05", "gnorm": "18.27", "loss_scale": "4", "train_wall": "47", "gb_free": "9.3", "wall": "56"} +2021-03-08 18:05:45 | INFO | train_inner | {"epoch": 1, "update": 0.002, "loss": "16.352", "ppl": "83644.3", "wps": "3283.4", "ups": "0.03", "wpb": "131072", "bsz": "64", "num_updates": "2", "lr": "4e-05", "gnorm": "18.411", "loss_scale": "4", "train_wall": "40", "gb_free": "9.3", "wall": "96"} +2021-03-08 18:06:21 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0 +2021-03-08 18:06:56 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0 +2021-03-08 18:07:37 | INFO | train_inner | {"epoch": 1, "update": 0.006, "loss": "23.682", "ppl": "1.34537e+07", "wps": "1176.6", "ups": "0.01", "wpb": "131072", "bsz": "64", "num_updates": "3", "lr": "6e-05", "gnorm": "119.682", "loss_scale": "1", "train_wall": "111", "gb_free": "9.3", "wall": "208"} +2021-03-08 18:08:18 | INFO | train_inner | {"epoch": 1, "update": 0.007, "loss": "18.988", "ppl": "519921", "wps": "3189.1", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "4", "lr": "8e-05", "gnorm": "14.934", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "249"} +2021-03-08 18:08:59 | INFO | train_inner | {"epoch": 1, "update": 
0.008, "loss": "20.08", "ppl": "1.10798e+06", "wps": "3223.1", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "5", "lr": "0.0001", "gnorm": "59.92", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "289"} +2021-03-08 18:09:39 | INFO | train_inner | {"epoch": 1, "update": 0.009, "loss": "18.323", "ppl": "327980", "wps": "3256.6", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "6", "lr": "8e-05", "gnorm": "37.425", "loss_scale": "1", "train_wall": "40", "gb_free": "9.3", "wall": "330"} +2021-03-08 18:10:20 | INFO | train_inner | {"epoch": 1, "update": 0.01, "loss": "17.264", "ppl": "157354", "wps": "3188.7", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "7", "lr": "6e-05", "gnorm": "10.824", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "371"} +2021-03-08 18:11:01 | INFO | train_inner | {"epoch": 1, "update": 0.011, "loss": "16.794", "ppl": "113647", "wps": "3230", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "8", "lr": "4e-05", "gnorm": "5.616", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "411"} +2021-03-08 18:11:39 | INFO | train_inner | {"epoch": 1, "update": 0.012, "loss": "16.706", "ppl": "106938", "wps": "3384", "ups": "0.03", "wpb": "131072", "bsz": "64", "num_updates": "9", "lr": "2e-05", "gnorm": "5.318", "loss_scale": "1", "train_wall": "39", "gb_free": "9.3", "wall": "450"} +2021-03-08 18:12:19 | INFO | train_inner | {"epoch": 1, "update": 0.013, "loss": "16.548", "ppl": "95796.2", "wps": "3274.4", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "10", "lr": "0", "gnorm": "5.22", "loss_scale": "1", "train_wall": "40", "gb_free": "9.3", "wall": "490"} +2021-03-08 18:12:19 | INFO | fairseq_cli.train | Stopping training due to num_updates: 10 >= max_update: 10 +2021-03-08 18:12:19 | INFO | fairseq_cli.train | begin validation on "valid" subset +2021-03-08 18:12:45 | INFO | valid | {"epoch": 1, "valid_loss": "16.624", "valid_ppl": "101000", 
"valid_wps": "10855.9", "valid_wpb": "123202", "valid_bsz": "60.5", "valid_num_updates": "10"} +2021-03-08 18:12:45 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below) +2021-03-08 18:12:45 | INFO | train | {"epoch": 1, "train_loss": "18.114", "train_ppl": "283776", "train_wps": "2567.8", "train_ups": "0.02", "train_wpb": "131072", "train_bsz": "64", "train_num_updates": "10", "train_lr": "0", "train_gnorm": "29.562", "train_loss_scale": "1", "train_train_wall": "480", "train_gb_free": "9.3", "train_wall": "516"} +2021-03-08 18:12:45 | INFO | fairseq_cli.train | done training in 509.9 seconds +``` + +