patrickvonplaten committed
Commit c2fd4af
1 Parent(s): 39682b1

Update README.md

Files changed (1)
  1. README.md +8 -9
README.md CHANGED
@@ -141,21 +141,20 @@ The details of the masking procedure for each sentence are the following:
 
  ### Pretraining
 
- The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
+ FNet-base was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
  of 256. The sequence length was limited to 512 tokens. The optimizer
  used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
  learning rate warmup for 10,000 steps and linear decay of the learning rate after.
 
  ## Evaluation results
 
- According to [the official paper](https://arxiv.org/abs/2105.03824) (*cf.* with Table 1 on page 7), this model achieves the following performance on the GLUE test data:
+ FNet-base was fine-tuned and evaluated on the validation data of the [GLUE benchmark](https://huggingface.co/datasets/glue). The results of the official model (written in Flax) can be seen in Table 1 on page 7 of [the official paper](https://arxiv.org/abs/2105.03824).
 
- | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
- |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
- | | 72/73 | 83 | 80 | 95 | 69 | 79 | 76 | 63| 76.7 |
+ For comparison, this model (ported to PyTorch) was fine-tuned and evaluated alongside [bert-base-cased](https://hf.co/models/bert-base-cased) using the [official Hugging Face GLUE evaluation scripts](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification#glue-tasks).
+ The training was done on a single 16GB NVIDIA Tesla V100 GPU. For MRPC/WNLI, the models were trained for 5 epochs, while for the other tasks, the models were trained for 3 epochs. A sequence length of 512 was used with a batch size of 16 and a learning rate of 2e-5.
 
-
- The following table contains test results on the HuggingFace model in comparison with [bert-base-cased](https://hf.co/models/bert-base-cased). The training was done on a single 16GB NVIDIA Tesla V100 GPU. For MRPC/WNLI, the models were trained for 5 epochs, while for other tasks, the models were trained for 3 epochs. Please refer to the checkpoints linked with the scores. The sequence length used for 512 with batch size 16 and learning rate 2e-5.
+ The following table summarizes the results for [fnet-base](https://huggingface.co/google/fnet-base) (called *FNet (PyTorch) - Reproduced*) and [bert-base-cased](https://hf.co/models/bert-base-cased) (called *Bert (PyTorch) - Reproduced*), both in terms of performance and training times, and compares them to the reported performance of the official FNet-base model (called *FNet (Flax) - Official*).
+ For more details, please refer to the checkpoints linked with the scores.
 
  | Task | Metric | Bert (PyTorch) - Reproduced | FNet (PyTorch) - Reproduced | FNet (Flax) - Official | Training time (Bert) | Training time (FNet) |
  | ----- | ------ | --------------------------- | --------------------------- | ---------------------- | -------------------- | -------------------- |
@@ -165,12 +164,12 @@ The following table contains test results on the HuggingFace model in comparison
  | QNLI | Accuracy | [90.99](https://huggingface.co/gchhablani/bert-base-cased-finetuned-qnli) | [84.39](https://huggingface.co/gchhablani/fnet-base-finetuned-qnli) | 80 | 02:40:22 | 01:48:22 |
  | SST-2 | Accuracy | [92.32](https://huggingface.co/gchhablani/bert-base-cased-finetuned-sst2) | [89.45](https://huggingface.co/gchhablani/fnet-base-finetuned-sst2) | 95 | 01:42:17 | 01:09:27 |
  | CoLA | Matthews corr or Accuracy | [59.57](https://huggingface.co/gchhablani/bert-base-cased-finetuned-cola) (Matthews corr) | [35.94](https://huggingface.co/gchhablani/fnet-base-finetuned-cola) (Matthews corr) | 69 (Accuracy) | 14:20 | 09:47 |
- | STS-B | Spearman corr. | [89.26/88.98](https://huggingface.co/gchhablani/bert-base-cased-finetuned-stsb) | [82.56/82.19](https://huggingface.co/gchhablani/fnet-base-finetuned-stsb) | 79 | 10:24 | 07:09 |
+ | STS-B | Spearman corr. | [88.98](https://huggingface.co/gchhablani/bert-base-cased-finetuned-stsb) | [82.19](https://huggingface.co/gchhablani/fnet-base-finetuned-stsb) | 79 | 10:24 | 07:09 |
  | MRPC | mean(F1/Accuracy) | [88.15](https://huggingface.co/gchhablani/bert-base-cased-finetuned-mrpc) | [81.15](https://huggingface.co/gchhablani/fnet-base-finetuned-mrpc) | 76 | 11:12 | 07:48 |
  | RTE | Accuracy | [67.15](https://huggingface.co/gchhablani/bert-base-cased-finetuned-qnli) | [62.82](https://huggingface.co/gchhablani/fnet-base-finetuned-qnli) | 63 | 04:51 | 03:24 |
  | WNLI | Accuracy | [46.48](https://huggingface.co/gchhablani/bert-base-cased-finetuned-wnli) | [54.93](https://huggingface.co/gchhablani/fnet-base-finetuned-wnli) | - | 03:23 | 02:37 |
 
- We can see that the FNet model achieves around ~93% of BERT's performance on average while it requires on average ~30% less time to fine-tune on the downstream tasks.
+ We can see that FNet-base achieves around 93% of BERT-base's performance on average, while requiring *ca.* 30% less time to fine-tune on the downstream tasks.
 
  ### BibTeX entry and citation info
 
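The updated card points to the official Hugging Face GLUE example scripts for the fine-tuning runs above. As a rough illustration of that setup (not the exact script behind the reported numbers), the sketch below fine-tunes `google/fnet-base` on a single GLUE task (SST-2) with the `Trainer` API, using the hyperparameters stated in the card; the output directory name and the accuracy helper are placeholders introduced here.

```python
# Minimal sketch of the fine-tuning setup described in the card, assuming the
# `transformers` and `datasets` libraries. SST-2 is shown; the linked run_glue.py
# examples were used for the actual reported results.
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "google/fnet-base"  # swap in "bert-base-cased" for the BERT baseline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# SST-2 is a single-sentence task; pair tasks such as MRPC pass two text columns.
raw_datasets = load_dataset("glue", "sst2")

def tokenize(batch):
    # Sequence length of 512, as in the setup described above.
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=512)

encoded = raw_datasets.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Placeholder accuracy metric; the example scripts use the task-specific GLUE metrics.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# Batch size 16, learning rate 2e-5, 3 epochs (5 for MRPC/WNLI), as reported above.
training_args = TrainingArguments(
    output_dir="fnet-base-finetuned-sst2",  # placeholder output directory
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
```

Pair-sentence tasks (MRPC, STS-B, QNLI, etc.) tokenize two text columns instead of one, and MRPC/WNLI would use `num_train_epochs=5` as noted in the card.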