diff --git a/LICENSE b/LICENSE
deleted file mode 100644
index a7767b63a8d61b2622642ccc9012f06af5053e17..0000000000000000000000000000000000000000
--- a/LICENSE
+++ /dev/null
@@ -1,201 +0,0 @@
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "[]"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright 1999-2022 Alibaba Group Holding Ltd.
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
diff --git a/README.md b/README.md
deleted file mode 100644
index e14aa49f34f6751103034ffc4dd6d30deeee500c..0000000000000000000000000000000000000000
--- a/README.md
+++ /dev/null
@@ -1,27 +0,0 @@
----
-title: Chinese OCR
-emoji: 📖
-colorFrom: red
-colorTo: indigo
-sdk: gradio
-sdk_version: 3.9.1
-app_file: app.py
-pinned: false
----
-# Configuration
-`title`: _string_
-OFA Image Caption
-`emoji`: _string_
-🖼
-`colorFrom`: _string_
-red
-`colorTo`: _string_
-indigo
-`sdk`: _string_
-gradio
-`app_file`: _string_
-app.py
-
-
-`pinned`: _boolean_
-false
\ No newline at end of file
diff --git a/README_EncouragingLoss.md b/README_EncouragingLoss.md
deleted file mode 100644
index 430b45ee0720084347539901909e8222152ab99d..0000000000000000000000000000000000000000
--- a/README_EncouragingLoss.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Finetuning with Encouraging Loss (EL)
-Below we provide methods for finetuning with label smoothed encouraging loss proposed in [_Well-classified Examples are Underestimated in Classification with Deep Neural Networks_](https://arxiv.org/pdf/2110.06537.pdf) on different downstream tasks.
-The implementation is in [label_smoothed_encouraging_loss.py](criterions/label_smoothed_encouraging_loss.py).
-You can set the `--criterion` to `adjust_label_smoothed_encouraging_loss` to use it. This criterion has a hyper-parameter `--log-end`.
-`--log-end < 1` results in an approximate, conservative version of the full encouraging loss.
-A higher `--log-end` more strongly mitigates gradient vanishing, improves the modeling of the data, and increases the growth rate of the margin, but it also produces a larger gradient norm, which can be challenging for the existing optimization setup.
-We recommend a higher `--log-end` for cases with higher performance, and 0.75 or 0.5 as a first try.
-## Image Captioning
-We provide procedures for image captioning with EL below. The preprocessing is identical to the default setting.
-
-
-Finetuning
-We provide two scripts for stage 1.
-
-cd run_scripts/caption
-nohup sh train_caption_stage1_el.sh > train_stage1_el.out & # stage 1, train with encouraging loss, expected cider 1.403
-nohup sh train_caption_stage1_el_db.sh > train_stage1_el_db.out & # stage 1, train with encouraging loss, and drop best examples, expected cider 1.404
-
-Finetuning
-
-cd run_scripts/refcoco
-nohup sh train_refcoco_el.sh > train_refcoco_el.out & # finetune for refcoco
-nohup sh train_refcocoplus_el.sh > train_refcocoplus_el.out & # finetune for refcoco+
-nohup sh train_refcocog_el.sh > train_refcocog_el.out & # finetune for refcocog
-
-
-Model | #Params | Backbone | Hidden Size | Intermediate Size | #Heads | #Enc. Layers | #Dec. Layers
----|---|---|---|---|---|---|---
-OFABase | 160M | ResNet101 | 768 | 3072 | 12 | 6 | 6
-OFALarge | 443M | ResNet152 | 1024 | 4096 | 16 | 12 | 12
-
-Model | BLEU@4 | ROUGE-L | CIDEr-D
----|---|---|---
-Trm | 7.33 | 51.51 | 11.00
-M6 | 16.19 | 55.06 | 30.75
-OFABase | 26.23 | 58.95 | 50.70
-OFALarge | 27.32 | 59.20 | 53.51
-
-Model | RefCOCO(val/testA/testB) | RefCOCO+(val/testA/testB) | RefCOCOg(val/test-u)
----|---|---|---
-OFABase(random-init) | 30.13/35.07/25.03 | 17.89/20.90/15.83 | 20.30/20.45
-OFABase | 82.18/86.07/76.68 | 69.38/77.26/60.14 | 73.57/72.53
-OFALarge | 82.84/86.54/76.50 | 71.30/78.56/61.85 | 71.96/71.30
-
-* **Convolutional Neural Networks (CNN)**
-  + [Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)](examples/language_model/conv_lm/README.md)
-  + [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](examples/conv_seq2seq/README.md)
-  + [Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)](https://github.com/pytorch/fairseq/tree/classic_seqlevel)
-  + [Hierarchical Neural Story Generation (Fan et al., 2018)](examples/stories/README.md)
-  + [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
-* **LightConv and DynamicConv models**
-  + [Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)](examples/pay_less_attention_paper/README.md)
-* **Long Short-Term Memory (LSTM) networks**
-  + Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
-* **Transformer (self-attention) networks**
-  + Attention Is All You Need (Vaswani et al., 2017)
-  + [Scaling Neural Machine Translation (Ott et al., 2018)](examples/scaling_nmt/README.md)
-  + [Understanding Back-Translation at Scale (Edunov et al., 2018)](examples/backtranslation/README.md)
-  + [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/README.adaptive_inputs.md)
-  + [Lexically constrained decoding with dynamic beam allocation (Post & Vilar, 2018)](examples/constrained_decoding/README.md)
-  + [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)](examples/truncated_bptt/README.md)
-  + [Adaptive Attention Span in Transformers (Sukhbaatar et al., 2019)](examples/adaptive_span/README.md)
-  + [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
-  + [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
-  + [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
-  + [Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)](examples/joint_alignment_translation/README.md)
-  + [Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)](examples/mbart/README.md)
-  + [Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)](examples/byte_level_bpe/README.md)
-  + [Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)](examples/unsupervised_quality_estimation/README.md)
-  + [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](examples/wav2vec/README.md)
-  + [Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models (Enarvi et al., 2020)](examples/pointer_generator/README.md)
-  + [Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)](examples/linformer/README.md)
-  + [Cross-lingual Retrieval for Iterative Self-Supervised Training (Tran et al., 2020)](examples/criss/README.md)
-  + [Deep Transformers with Latent Depth (Li et al., 2020)](examples/latent_depth/README.md)
-  + [Unsupervised Cross-lingual Representation Learning for Speech Recognition (Conneau et al., 2020)](https://arxiv.org/abs/2006.13979)
-  + [Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training (Hsu et al., 2021)](https://arxiv.org/abs/2104.01027)
-  + [Unsupervised Speech Recognition (Baevski et al., 2021)](https://arxiv.org/abs/2105.11084)
-* **Non-autoregressive Transformers**
-  + Non-Autoregressive Neural Machine Translation (Gu et al., 2017)
-  + Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee et al., 2018)
-  + Insertion Transformer: Flexible Sequence Generation via Insertion Operations (Stern et al., 2019)
-  + Mask-Predict: Parallel Decoding of Conditional Masked Language Models (Ghazvininejad et al., 2019)
-  + [Levenshtein Transformer (Gu et al., 2019)](examples/nonautoregressive_translation/README.md)
-* **Finetuning**
-  + [Better Fine-Tuning by Reducing Representational Collapse (Aghajanyan et al., 2020)](examples/rxf/README.md)
-
-
-
-* September 2020: [Added Linformer code](examples/linformer/README.md)
-* September 2020: [Added pointer-generator networks](examples/pointer_generator/README.md)
-* August 2020: [Added lexically constrained decoding](examples/constrained_decoding/README.md)
-* August 2020: [wav2vec2 models and code released](examples/wav2vec/README.md)
-* July 2020: [Unsupervised Quality Estimation code released](examples/unsupervised_quality_estimation/README.md)
-* May 2020: [Follow fairseq on Twitter](https://twitter.com/fairseq)
-* April 2020: [Monotonic Multihead Attention code released](examples/simultaneous_translation/README.md)
-* April 2020: [Quant-Noise code released](examples/quant_noise/README.md)
-* April 2020: [Initial model parallel support and 11B parameters unidirectional LM released](examples/megatron_11b/README.md)
-* March 2020: [Byte-level BPE code released](examples/byte_level_bpe/README.md)
-* February 2020: [mBART model and code released](examples/mbart/README.md)
-* February 2020: [Added tutorial for back-translation](https://github.com/pytorch/fairseq/tree/main/examples/backtranslation#training-your-own-model-wmt18-english-german)
-* December 2019: [fairseq 0.9.0 released](https://github.com/pytorch/fairseq/releases/tag/v0.9.0)
-* November 2019: [VizSeq released (a visual analysis toolkit for evaluating fairseq models)](https://facebookresearch.github.io/vizseq/docs/getting_started/fairseq_example)
-* November 2019: [CamemBERT model and code released](examples/camembert/README.md)
-* November 2019: [BART model and code released](examples/bart/README.md)
-* November 2019: [XLM-R models and code released](examples/xlmr/README.md)
-* September 2019: [Nonautoregressive translation code released](examples/nonautoregressive_translation/README.md)
-* August 2019: [WMT'19 models released](examples/wmt19/README.md)
-* July 2019: fairseq relicensed under MIT license
-* July 2019: [RoBERTa models and code released](examples/roberta/README.md)
-* June 2019: [wav2vec models and code released](examples/wav2vec/README.md)
-
-
-
-
-
-# Flores101: Large-Scale Multilingual Machine Translation
-
-## Introduction
-
-Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.
-
-Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
-
-Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/
-
-
-
-## Pretrained models
-
-Model | Num layers | Embed dimension | FFN dimension| Vocab Size | #params | Download
----|---|---|---|---|---|---
-`flores101_mm100_615M` | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
-`flores101_mm100_175M` | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz
-
-
-These models are trained similarly to [M2M-100](https://arxiv.org/abs/2010.11125) with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages can be found at the bottom.
-
-
-## Example Generation code
-
-### Download model, sentencepiece vocab
-
-```bash
-fairseq=/path/to/fairseq
-cd $fairseq
-
-# Download 615M param model.
-wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
-
-# Extract
-tar -xvzf flores101_mm100_615M.tar.gz
-```
-
-### Encode using our SentencePiece Model
-Note: Install SentencePiece from [here](https://github.com/google/sentencepiece)
-
-
-```bash
-fairseq=/path/to/fairseq
-cd $fairseq
-
-# Download example dataset from German to French
-sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
-sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr
-
-for lang in de fr ; do
-    python scripts/spm_encode.py \
-        --model flores101_mm100_615M/sentencepiece.bpe.model \
-        --output_format=piece \
-        --inputs=raw_input.de-fr.${lang} \
-        --outputs=spm.de-fr.${lang}
-done
-```
-
-### Binarization
-
-```bash
-fairseq-preprocess \
-    --source-lang de --target-lang fr \
-    --testpref spm.de-fr \
-    --thresholdsrc 0 --thresholdtgt 0 \
-    --destdir data_bin \
-    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
-```
-
-### Generation
-
-
-```bash
-fairseq-generate \
-    data_bin \
-    --batch-size 1 \
-    --path flores101_mm100_615M/model.pt \
-    --fixed-dictionary flores101_mm100_615M/dict.txt \
-    -s de -t fr \
-    --remove-bpe 'sentencepiece' \
-    --beam 5 \
-    --task translation_multi_simple_epoch \
-    --lang-pairs flores101_mm100_615M/language_pairs.txt \
-    --decoder-langtok --encoder-langtok src \
-    --gen-subset test \
-    --fp16 \
-    --dataset-impl mmap \
-    --distributed-world-size 1 --distributed-no-spawn
-```
-
-### Supported Languages and lang code
-
-Language | lang code
----|---
-Afrikaans | af
-Amharic | am
-Arabic | ar
-Assamese | as
-Asturian | ast
-Aymara | ay
-Azerbaijani | az
-Bashkir | ba
-Belarusian | be
-Bulgarian | bg
-Bengali | bn
-Breton | br
-Bosnian | bs
-Catalan | ca
-Cebuano | ceb
-Chokwe | cjk
-Czech | cs
-Welsh | cy
-Danish | da
-German | de
-Dyula | dyu
-Greek | el
-English | en
-Spanish | es
-Estonian | et
-Persian | fa
-Fulah | ff
-Finnish | fi
-French | fr
-Western Frisian | fy
-Irish | ga
-Scottish Gaelic | gd
-Galician | gl
-Gujarati | gu
-Hausa | ha
-Hebrew | he
-Hindi | hi
-Croatian | hr
-Haitian Creole | ht
-Hungarian | hu
-Armenian | hy
-Indonesian | id
-Igbo | ig
-Iloko | ilo
-Icelandic | is
-Italian | it
-Japanese | ja
-Javanese | jv
-Georgian | ka
-Kachin | kac
-Kamba | kam
-Kabuverdianu | kea
-Kongo | kg
-Kazakh | kk
-Central Khmer | km
-Kimbundu | kmb
-Northern Kurdish | kmr
-Kannada | kn
-Korean | ko
-Kurdish | ku
-Kyrgyz | ky
-Luxembourgish | lb
-Ganda | lg
-Lingala | ln
-Lao | lo
-Lithuanian | lt
-Luo | luo
-Latvian | lv
-Malagasy | mg
-Maori | mi
-Macedonian | mk
-Malayalam | ml
-Mongolian | mn
-Marathi | mr
-Malay | ms
-Maltese | mt
-Burmese | my
-Nepali | ne
-Dutch | nl
-Norwegian | no
-Northern Sotho | ns
-Nyanja | ny
-Occitan | oc
-Oromo | om
-Oriya | or
-Punjabi | pa
-Polish | pl
-Pashto | ps
-Portuguese | pt
-Quechua | qu
-Romanian | ro
-Russian | ru
-Sindhi | sd
-Shan | shn
-Sinhala | si
-Slovak | sk
-Slovenian | sl
-Shona | sn
-Somali | so
-Albanian | sq
-Serbian | sr
-Swati | ss
-Sundanese | su
-Swedish | sv
-Swahili | sw
-Tamil | ta
-Telugu | te
-Tajik | tg
-Thai | th
-Tigrinya | ti
-Tagalog | tl
-Tswana | tn
-Turkish | tr
-Ukrainian | uk
-Umbundu | umb
-Urdu | ur
-Uzbek | uz
-Vietnamese | vi
-Wolof | wo
-Xhosa | xh
-Yiddish | yi
-Yoruba | yo
-Chinese | zh
-Zulu | zu
diff --git a/fairseq/examples/flores101/flores_logo.png b/fairseq/examples/flores101/flores_logo.png
deleted file mode 100644
index d4d1455c6eab608ff5317ce885183cd213564273..0000000000000000000000000000000000000000
Binary files a/fairseq/examples/flores101/flores_logo.png and /dev/null differ
diff --git a/fairseq/examples/fully_sharded_data_parallel/README.md b/fairseq/examples/fully_sharded_data_parallel/README.md
deleted file mode 100644
index b9e44fef48bee5faeee27b3d1d1b1eb96b6a477f..0000000000000000000000000000000000000000
--- a/fairseq/examples/fully_sharded_data_parallel/README.md
+++ /dev/null
@@ -1,177 +0,0 @@
-# Fully Sharded Data Parallel (FSDP)
-
-## Overview
-Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and
-[Google](https://arxiv.org/abs/2004.13336) has shown that data parallel
-training can be made significantly more efficient by sharding the model
-parameters and optimizer state across data parallel workers. These ideas are
-encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper provided
-by [fairscale](https://github.com/facebookresearch/fairscale/).
-
-Compared to PyTorch DDP:
-* FSDP produces the same results as PyTorch DDP (it's still synchronous data parallel training)
-* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
-* FSDP is faster than PyTorch DDP because the optimizer step is sharded, and the communication can be overlapped with the forward pass
-* FSDP enables training 13B parameter models on 8 GPUs and 175B parameter models on 128 GPUs
-
-FSDP is fully supported in fairseq via the following new arguments:
-* `--ddp-backend=fully_sharded`: enables full sharding via FSDP
-* `--cpu-offload`: offloads the optimizer state and FP32 model copy to CPU (combine with `--optimizer=cpu_adam`)
-* `--no-reshard-after-forward`: increases training speed for large models (1B+ params) and is similar to ZeRO stage 2
-* other popular options (`--fp16`, `--update-freq`, `--checkpoint-activations`, `--offload-activations`, etc.) continue to work as normal
-
-
-
-FSDP currently has several limitations compared to fairseq's default DDP backend (PyTorch DDP):
-* while FSDP is fully compatible with pointwise Optimizers (e.g., Adam, AdamW, Adadelta, Adamax, SGD, etc.), it is not currently compatible with non-pointwise Optimizers (e.g., Adagrad, Adafactor, LAMB, etc.)
-* FSDP depends on flattening the parameters, so models that currently require `--fp16-no-flatten-grads` may not be supported
-
-See the [fairscale docs](https://fairscale.readthedocs.io/en/latest/api/nn/fsdp_tips.html) for a more detailed
-explanation of these and other limitations.
-
-
-
-
-
-See the [fairscale docs](https://fairscale.readthedocs.io/en/latest/api/nn/fsdp_tips.html) for a more detailed
-explanation of how FSDP works.
-
-
-
-```
-(...)
-2021-03-08 12:29:51 | INFO | fairseq_cli.train | num. model params: 13,110,865,920 (num. trained: 13,110,865,920)
-(...)
-2021-03-08 12:29:51 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
-2021-03-08 12:29:51 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
-(...)
-Adam Optimizer #0 is created with AVX2 arithmetic capability.
-Config: alpha=0.000100, betas=(0.900000, 0.980000), weight_decay=0.000000, adam_w=1
-(...)
-2021-03-08 12:31:36 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "16.475", "ppl": "91120.8", "wps": "0", "ups": "0", "wpb": "16384", "bsz": "8", "num_updates": "1", "lr": "2e-05", "gnorm": "20.751", "loss_scale": "4", "train_wall": "99", "gb_free": "9.3", "wall": "105"}
-2021-03-08 12:32:33 | INFO | train_inner | {"epoch": 1, "update": 0.0, "loss": "16.446", "ppl": "89281.6", "wps": "288.7", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "2", "lr": "4e-05", "gnorm": "19.777", "loss_scale": "4", "train_wall": "57", "gb_free": "9.3", "wall": "161"}
-2021-03-08 12:33:12 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
-2021-03-08 12:33:51 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
-2021-03-08 12:34:45 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "25.22", "ppl": "3.90691e+07", "wps": "123.4", "ups": "0.01", "wpb": "16384", "bsz": "8", "num_updates": "3", "lr": "6e-05", "gnorm": "131.281", "loss_scale": "1", "train_wall": "133", "gb_free": "9.3", "wall": "294"}
-2021-03-08 12:35:43 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "18.079", "ppl": "276809", "wps": "285.5", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "4", "lr": "8e-05", "gnorm": "13.776", "loss_scale": "1", "train_wall": "57", "gb_free": "9.3", "wall": "351"}
-2021-03-08 12:36:35 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "23.729", "ppl": "1.39088e+07", "wps": "316.7", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "5", "lr": "0.0001", "gnorm": "72.774", "loss_scale": "1", "train_wall": "52", "gb_free": "9.3", "wall": "403"}
-2021-03-08 12:37:28 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "20.429", "ppl": "1.41203e+06", "wps": "307.6", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "6", "lr": "8e-05", "gnorm": "60.846", "loss_scale": "1", "train_wall": "53", "gb_free": "9.3", "wall": "456"}
-2021-03-08 12:38:27 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "18.965", "ppl": "511684", "wps": "279.4", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "7", "lr": "6e-05", "gnorm": "22.687", "loss_scale": "1", "train_wall": "59", "gb_free": "9.3", "wall": "515"}
-2021-03-08 12:39:18 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "18.345", "ppl": "332887", "wps": "319.1", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "8", "lr": "4e-05", "gnorm": "8.451", "loss_scale": "1", "train_wall": "51", "gb_free": "9.3", "wall": "566"}
-2021-03-08 12:40:11 | INFO | train_inner | {"epoch": 1, "update": 0.002, "loss": "18.262", "ppl": "314336", "wps": "305.9", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "9", "lr": "2e-05", "gnorm": "6.457", "loss_scale": "1", "train_wall": "54", "gb_free": "9.3", "wall": "620"}
-2021-03-08 12:41:04 | INFO | train_inner | {"epoch": 1, "update": 0.002, "loss": "17.556", "ppl": "192686", "wps": "311.8", "ups": "0.02", "wpb": "16384", "bsz": "8", "num_updates": "10", "lr": "0", "gnorm": "5.796", "loss_scale": "1", "train_wall": "53", "gb_free": "9.3", "wall": "673"}
-2021-03-08 12:41:04 | INFO | fairseq_cli.train | Stopping training due to num_updates: 10 >= max_update: 10
-2021-03-08 12:41:04 | INFO | fairseq_cli.train | begin validation on "valid" subset
-2021-03-08 12:43:15 | INFO | valid | {"epoch": 1, "valid_loss": "17.953", "valid_ppl": "253807", "valid_wps": "1868.4", "valid_wpb": "15400.2", "valid_bsz": "7.6", "valid_num_updates": "10"}
-2021-03-08 12:43:15 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
-2021-03-08 12:43:15 | INFO | train | {"epoch": 1, "train_loss": "19.351", "train_ppl": "668509", "train_wps": "210.9", "train_ups": "0.01", "train_wpb": "16384", "train_bsz": "8", "train_num_updates": "10", "train_lr": "0", "train_gnorm": "36.26", "train_loss_scale": "1", "train_train_wall": "667", "train_gb_free": "9.3", "train_wall": "804"}
-2021-03-08 12:43:15 | INFO | fairseq_cli.train | done training in 798.6 seconds
-```
-
-
-
-```
-(...)
-2021-03-08 18:04:09 | INFO | fairseq_cli.train | num. model params: 13,110,865,920 (num. trained: 13,110,865,920)
-(...)
-2021-03-08 18:04:09 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
-2021-03-08 18:04:09 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
-(...)
-Adam Optimizer #0 is created with AVX2 arithmetic capability.
-Config: alpha=0.000100, betas=(0.900000, 0.980000), weight_decay=0.000000, adam_w=1
-(...)
-2021-03-08 18:05:06 | INFO | train_inner | {"epoch": 1, "update": 0.001, "loss": "16.408", "ppl": "86945.6", "wps": "0", "ups": "0", "wpb": "131072", "bsz": "64", "num_updates": "1", "lr": "2e-05", "gnorm": "18.27", "loss_scale": "4", "train_wall": "47", "gb_free": "9.3", "wall": "56"}
-2021-03-08 18:05:45 | INFO | train_inner | {"epoch": 1, "update": 0.002, "loss": "16.352", "ppl": "83644.3", "wps": "3283.4", "ups": "0.03", "wpb": "131072", "bsz": "64", "num_updates": "2", "lr": "4e-05", "gnorm": "18.411", "loss_scale": "4", "train_wall": "40", "gb_free": "9.3", "wall": "96"}
-2021-03-08 18:06:21 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
-2021-03-08 18:06:56 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
-2021-03-08 18:07:37 | INFO | train_inner | {"epoch": 1, "update": 0.006, "loss": "23.682", "ppl": "1.34537e+07", "wps": "1176.6", "ups": "0.01", "wpb": "131072", "bsz": "64", "num_updates": "3", "lr": "6e-05", "gnorm": "119.682", "loss_scale": "1", "train_wall": "111", "gb_free": "9.3", "wall": "208"}
-2021-03-08 18:08:18 | INFO | train_inner | {"epoch": 1, "update": 0.007, "loss": "18.988", "ppl": "519921", "wps": "3189.1", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "4", "lr": "8e-05", "gnorm": "14.934", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "249"}
-2021-03-08 18:08:59 | INFO | train_inner | {"epoch": 1, "update": 0.008, "loss": "20.08", "ppl": "1.10798e+06", "wps": "3223.1", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "5", "lr": "0.0001", "gnorm": "59.92", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "289"}
-2021-03-08 18:09:39 | INFO | train_inner | {"epoch": 1, "update": 0.009, "loss": "18.323", "ppl": "327980", "wps": "3256.6", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "6", "lr": "8e-05", "gnorm": "37.425", "loss_scale": "1", "train_wall": "40", "gb_free": "9.3", "wall": "330"}
-2021-03-08 18:10:20 | INFO | train_inner | {"epoch": 1, "update": 0.01, "loss": "17.264", "ppl": "157354", "wps": "3188.7", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "7", "lr": "6e-05", "gnorm": "10.824", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "371"}
-2021-03-08 18:11:01 | INFO | train_inner | {"epoch": 1, "update": 0.011, "loss": "16.794", "ppl": "113647", "wps": "3230", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "8", "lr": "4e-05", "gnorm": "5.616", "loss_scale": "1", "train_wall": "41", "gb_free": "9.3", "wall": "411"}
-2021-03-08 18:11:39 | INFO | train_inner | {"epoch": 1, "update": 0.012, "loss": "16.706", "ppl": "106938", "wps": "3384", "ups": "0.03", "wpb": "131072", "bsz": "64", "num_updates": "9", "lr": "2e-05", "gnorm": "5.318", "loss_scale": "1", "train_wall": "39", "gb_free": "9.3", "wall": "450"}
-2021-03-08 18:12:19 | INFO | train_inner | {"epoch": 1, "update": 0.013, "loss": "16.548", "ppl": "95796.2", "wps": "3274.4", "ups": "0.02", "wpb": "131072", "bsz": "64", "num_updates": "10", "lr": "0", "gnorm": "5.22", "loss_scale": "1", "train_wall": "40", "gb_free": "9.3", "wall": "490"}
-2021-03-08 18:12:19 | INFO | fairseq_cli.train | Stopping training due to num_updates: 10 >= max_update: 10
-2021-03-08 18:12:19 | INFO | fairseq_cli.train | begin validation on "valid" subset
-2021-03-08 18:12:45 | INFO | valid | {"epoch": 1, "valid_loss": "16.624", "valid_ppl": "101000", "valid_wps": "10855.9", "valid_wpb": "123202", "valid_bsz": "60.5", "valid_num_updates": "10"}
-2021-03-08 18:12:45 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
-2021-03-08 18:12:45 | INFO | train | {"epoch": 1, "train_loss": "18.114", "train_ppl": "283776", "train_wps": "2567.8", "train_ups": "0.02", "train_wpb": "131072", "train_bsz": "64", "train_num_updates": "10", "train_lr": "0", "train_gnorm": "29.562", "train_loss_scale": "1", "train_train_wall": "480", "train_gb_free": "9.3", "train_wall": "516"}
-2021-03-08 18:12:45 | INFO | fairseq_cli.train | done training in 509.9 seconds
-```
-
-