stefan-it committed on
Commit
87b1360
1 Parent(s): 37162e4

readme: update

Files changed (4)
  1. README.md +69 -103
  2. best-lm.pt +0 -3
  3. loss.txt +0 -227
  4. training.log +0 -0
README.md CHANGED
@@ -2,118 +2,84 @@
2
  license: cc-by-sa-3.0
3
  language:
4
  - de
5
- library_name: flair
6
  ---
7
 
8
- # Flair xLSTM Embeddings (German Wikipedia, Forward)
9
 
10
- Research & development of Flair xLSTM Embeddings (Forward) trained on [German Wikipedia dump](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus).
11
 
12
  The Flair team is currently working on the integration of xLSTM (both LM training and fine-tuning models for downstream tasks).
13
- Check out the `xlstm` [branch in the Flair repository](https://github.com/flairNLP/flair/tree/xlstm) - many thanks to [Patrick Haller](https://huggingface.co/PatrickHaller) for the work on it.
14
 
15
- # Training
16
 
17
- The current model was trained with commit `18ef331` from the [`xlstm` branch](https://github.com/flairNLP/flair/tree/xlstm). The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually; additionally, make sure that Ninja is installed (`pip3 install Ninja`).
18
-
19
- The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used, including sharding the corpus into a Flair-compatible format:
20
-
21
- * `valid.txt` -> Validation corpus
22
- * `test.txt` -> Test corpus
23
- * `train` -> Folder with text files as training corpus
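This layout can be produced with a small helper. The sketch below is an illustration only (the `shard_corpus` name, shard file naming, and split ratios are assumptions, not the script actually used for this model):

```python
import os

def shard_corpus(lines, out_dir, num_shards=4, valid_ratio=0.05, test_ratio=0.05):
    """Split raw text lines into the Flair TextCorpus layout:
    valid.txt, test.txt and a train/ folder with shard files."""
    n = len(lines)
    n_valid = max(1, int(n * valid_ratio))
    n_test = max(1, int(n * test_ratio))
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]

    os.makedirs(os.path.join(out_dir, "train"), exist_ok=True)
    with open(os.path.join(out_dir, "valid.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(valid))
    with open(os.path.join(out_dir, "test.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(test))
    # Distribute the remaining lines round-robin over the training shards
    for shard_id in range(num_shards):
        shard = train[shard_id::num_shards]
        shard_path = os.path.join(out_dir, "train", f"train_split_{shard_id}")
        with open(shard_path, "w", encoding="utf-8") as f:
            f.write("\n".join(shard))
```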
24
-
25
- The model was trained with the following parameters for 2 epochs:
26
-
27
- ```python3
28
- import flair
29
- import torch
30
-
31
- from flair.data import SubTokenDictionary
32
- from flair.models import xLSTMLanguageModel
33
- from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
34
-
35
- from transformers import AutoTokenizer
36
-
37
- flair.device = torch.device('cuda:0')
38
-
39
- is_forward_lm = True
40
-
41
- dictionary = SubTokenDictionary.load("gwlms/bert-base-dewiki-v1")
42
-
43
- corpus = TextCorpus("/home/ubuntu/splitted_corpus",
44
-                     dictionary,
45
-                     is_forward_lm,
46
-                     character_level=False,
47
-                     random_case_flip=True,
48
-                     )
49
-
50
- xlstm_ablation_1 = """
51
- mlstm_block:
52
-   mlstm:
53
-     conv1d_kernel_size: 2
54
-     qkv_proj_blocksize: 2
55
-     num_heads: 2
56
- slstm_block:
57
-   slstm:
58
-     backend: cuda
59
-     num_heads: 2
60
-     conv1d_kernel_size: 2
61
-     bias_init: powerlaw_blockdependent
62
-   feedforward:
63
-     proj_factor: 1.3
64
-     act_fn: gelu
65
- context_length: 256
66
- num_blocks: 7
67
- embedding_dim: 128
68
- slstm_at: [1]
69
- """
70
-
71
- language_model = xLSTMLanguageModel(dictionary, xlstm_cfg=xlstm_ablation_1,
72
-                                     is_forward_lm=True)
73
- print(language_model)
74
-
75
- trainer = LanguageModelTrainer(language_model, corpus)
76
-
77
- trainer.train("xflair-german-wikipedia-xlstm_ablation_1-bs64-lr5-e2",
78
-               sequence_length=256,
79
-               mini_batch_size=64,
80
-               learning_rate=5,
81
-               patience=50,
82
-               max_epochs=2,
83
-               checkpoint=False,
84
-               num_workers=4,
85
-               )
86
- ```
87
 
88
- Output of last lines of training log:
89
-
90
- ```bash
91
- 2024-06-10 22:06:54,411 Split 113 - (22:06:54)
92
- 2024-06-10 22:07:23,726 | split 113/113 | 100/ 773 batches | ms/batch 293.11 | loss 4.4117 | ppl 82.4075
93
- 2024-06-10 22:07:52,762 | split 113/113 | 200/ 773 batches | ms/batch 290.36 | loss 4.3306 | ppl 75.9880
94
- 2024-06-10 22:08:21,813 | split 113/113 | 300/ 773 batches | ms/batch 290.51 | loss 4.3406 | ppl 76.7523
95
- 2024-06-10 22:08:50,869 | split 113/113 | 400/ 773 batches | ms/batch 290.56 | loss 4.3063 | ppl 74.1655
96
- 2024-06-10 22:09:19,923 | split 113/113 | 500/ 773 batches | ms/batch 290.54 | loss 4.3354 | ppl 76.3573
97
- 2024-06-10 22:09:48,965 | split 113/113 | 600/ 773 batches | ms/batch 290.41 | loss 4.3417 | ppl 76.8392
98
- 2024-06-10 22:10:18,014 | split 113/113 | 700/ 773 batches | ms/batch 290.50 | loss 4.3299 | ppl 75.9367
99
- 2024-06-10 22:10:45,001 best loss so far 7.03638310
100
- 2024-06-10 22:10:46,537 ['ist ein Wildschlafen, der vom Schmelzwärmezug verbindet. Zimmer und Kondonien.
101
- Der nächste Ausbau der Geländeulanzhaube in dem zweitältesten Zentrum liegt an seinem 2003 gegründete Mooshalle.
102
- Das fertige große Jagdwasserkraftwerk befindet sich damit im benachbarten Astasper Ortsteil Zechbach nahe der Lenzeifel.
103
- Er bildet ab dem 11. Juni 1999 eine Ortschaft ( bis 2009 Stollladen - Laufen ) in der Landschaft, liegt nur noch in Augen und Hetz.
104
- Verkehr. Die Bahn', 'Kleinsecker. Verwandter. Löwenmann ( auch * Hans ), einer Person von Gottfried Meyer, unter.
105
- Die Herkunft der 1810 verlorenen Familie, Ziegelei, Börsenbuch, Personen, Schriften, Jugendeinheit und die Öffentlichkeitsarbeit dienen dem Pfarrer in Knechtenmann dort.
106
- Zur Genetion sind die übrigen Menschen weit verbreitet, in denen sich das Leben der " Admiralism " widmen.
107
- Ein besonderes Verbreitungsgebiet erstreckt sich in grober Form : " Anthogrammam ist eine schlanker, gepflanzt etwa']
108
- 2024-06-10 22:10:46,537 -----------------------------------------------------------------------------------------
109
- 2024-06-10 22:10:46,538 | end of split 113 /113 | epoch 2 | time: 232.14s | valid loss 7.0906 | valid ppl 1200.6055 | learning rate 0.0781
110
- 2024-06-10 22:10:46,538 -----------------------------------------------------------------------------------------
111
- 2024-06-10 22:10:46,538 232 seconds for train split 113
112
- 2024-06-10 22:10:46,846 Epoch time: 26260.23
113
- 2024-06-10 22:10:52,959 TEST: valid loss 7.0908 | valid ppl 1200.8965
114
- 2024-06-10 22:10:52,959 -----------------------------------------------------------------------------------------
115
  ```
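The `ppl` values in the training log are simply the exponential of the (cross-entropy) loss, which can be checked against the logged numbers:

```python
import math

def perplexity(loss: float) -> float:
    # Perplexity is the exponentiated cross-entropy loss
    return math.exp(loss)

# Final TEST line of the log: valid loss 7.0908 | valid ppl 1200.8965
# (exp(7.0908) matches the logged perplexity up to rounding of the loss)
print(perplexity(7.0908))
```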
116
 
117
  # Caveats
118
 
119
- Notice: this model integration is under heavy development, and we are still in the process of finding good hyper-parameters. Downstream experiments are coming soon.
2
  license: cc-by-sa-3.0
3
  language:
4
  - de
 
5
  ---
6
 
7
+ # xLSTM Model trained on German Wikipedia
8
 
9
+ Research & development of an xLSTM model trained on German Wikipedia.
10
 
11
  The Flair team is currently working on the integration of xLSTM (both LM training and fine-tuning models for downstream tasks).
 
12
 
13
+ For pretraining this xLSTM model, we use this [fork](https://github.com/HallerPatrick/helibrunna) (from [Patrick Haller](https://huggingface.co/PatrickHaller)) of the awesome [Helibrunna](https://github.com/AI-Guru/helibrunna) library.
14
 
15
+ Initially, we integrated xLSTM model training into Flair - for more information about this, please refer to the archived [flair-old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch of this repository.
16
+
17
+ # Changelog
18
+
19
+ - 28.08.2024: Model training is now done with [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
20
+ - 10.06.2024: Initial version. xLSTM was trained with Flair library, see this [old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
 
21
 
22
+ # Training
23
+
24
+ The current model was trained with commit `f66cc55` from the [`main` branch](https://github.com/HallerPatrick/helibrunna) of the forked Helibrunna repo.
25
+
26
+ The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually; additionally, make sure that Ninja is installed (`pip3 install Ninja`).
27
+
28
+ The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used.
29
+
30
+ The following training configuration is used:
31
+
32
+ ```yaml
33
+ description: "Train a wikipedia xLSTM"
34
+
35
+ training:
36
+   model_name: "german_wikipedia"
37
+   batch_size: 10
38
+   lr: 6e-4
39
+   lr_warmup_steps: 4584
40
+   lr_decay_until_steps: "auto"
41
+   lr_decay_factor: 0.001
42
+   weight_decay: 0.1
43
+   amp_precision: bfloat16
44
+   weight_precision: float32
45
+   enable_mixed_precision: true
46
+   num_epochs: 1
47
+   output_dir: "./output"
48
+   save_every_step: 2000
49
+   log_every_step: 10
50
+   generate_every_step: 5000
51
+   wandb_project: "xlstm"
52
+   gradient_clipping: "auto"
53
+   # wandb_project: "lovecraftxlstm"
54
+
55
+ model:
56
+   num_blocks: 24
57
+   embedding_dim: 768
58
+   mlstm_block:
59
+     mlstm:
60
+       num_heads: 4
61
+   slstm_block: {}
62
+   slstm_at: []
63
+   context_length: 512
64
+
65
+ dataset:
66
+   output_path: "./output/german-wikipedia-dataset"
67
+   hugging_face_id: ["stefan-it/dewiki-20230701"]
68
+   split: "train" # Also subsetting is possible: "train[:100000]"
69
+   shuffle: False
70
+   seed: 42
71
+
72
+ tokenizer:
73
+   type: "pretrained"
74
+   pretrained_class: "LlamaTokenizer"
75
+   pretrained_id: "meta-llama/Llama-2-7b-hf"
76
  ```
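With this configuration, each optimizer step processes `batch_size × context_length` tokens. A quick sanity check of what the warmup phase covers (assuming no gradient accumulation, which the config does not mention):

```python
batch_size = 10        # training.batch_size
context_length = 512   # model.context_length
warmup_steps = 4584    # training.lr_warmup_steps

tokens_per_step = batch_size * context_length
tokens_seen_during_warmup = tokens_per_step * warmup_steps
print(tokens_per_step)            # 5120
print(tokens_seen_during_warmup)  # 23470080, i.e. ~23.5M tokens during warmup
```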
77
 
78
  # Caveats
79
 
80
+ Notice: this model integration is under heavy development, and we are still in the process of finding good hyper-parameters.
81
+ Downstream experiments are coming soon.
82
+
83
+ Unfortunately, NaNs are occurring during training:
84
+
85
+ ![Training Loss](training-loss.png)
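A common mitigation while debugging such runs is to skip the optimizer update whenever the loss is non-finite. This is a generic sketch of such a guard (not part of Helibrunna; the `guarded_update` helper is hypothetical):

```python
import math

def guarded_update(loss_value: float, apply_update) -> bool:
    """Apply the optimizer update only when the loss is finite.

    Returns True when the update was applied, False when it was skipped
    because the loss was NaN or infinite."""
    if not math.isfinite(loss_value):
        return False
    apply_update()
    return True
```

In practice one would typically also lower the learning rate or tighten `gradient_clipping` once NaNs start to appear.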
best-lm.pt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b5e754202c41fb92228df2651c4e24d497f8446493802f45f43f0ea8d47a7ec8
3
- size 36371434
loss.txt DELETED
@@ -1,227 +0,0 @@
1
- | end of split 1 /113 | epoch 1 | time: 224.45s | valid loss 7.6183 | valid ppl 2035.0861 | learning rate 5.0000
2
- | end of split 2 /113 | epoch 1 | time: 229.45s | valid loss 7.3864 | valid ppl 1613.9065 | learning rate 5.0000
3
- | end of split 3 /113 | epoch 1 | time: 239.40s | valid loss 7.3424 | valid ppl 1544.3504 | learning rate 5.0000
4
- | end of split 4 /113 | epoch 1 | time: 233.67s | valid loss 7.2568 | valid ppl 1417.6838 | learning rate 5.0000
5
- | end of split 5 /113 | epoch 1 | time: 227.57s | valid loss 7.2848 | valid ppl 1458.0133 | learning rate 5.0000
6
- | end of split 6 /113 | epoch 1 | time: 235.49s | valid loss 7.2458 | valid ppl 1402.2080 | learning rate 5.0000
7
- | end of split 7 /113 | epoch 1 | time: 235.14s | valid loss 7.2137 | valid ppl 1357.8841 | learning rate 5.0000
8
- | end of split 8 /113 | epoch 1 | time: 238.90s | valid loss 7.1989 | valid ppl 1337.9002 | learning rate 5.0000
9
- | end of split 9 /113 | epoch 1 | time: 228.81s | valid loss 7.1782 | valid ppl 1310.5202 | learning rate 5.0000
10
- | end of split 10 /113 | epoch 1 | time: 230.95s | valid loss 7.1692 | valid ppl 1298.8697 | learning rate 5.0000
11
- | end of split 11 /113 | epoch 1 | time: 231.70s | valid loss 7.1442 | valid ppl 1266.7305 | learning rate 5.0000
12
- | end of split 12 /113 | epoch 1 | time: 240.42s | valid loss 7.1839 | valid ppl 1317.9954 | learning rate 5.0000
13
- | end of split 13 /113 | epoch 1 | time: 235.25s | valid loss 7.2127 | valid ppl 1356.5282 | learning rate 5.0000
14
- | end of split 14 /113 | epoch 1 | time: 232.67s | valid loss 7.2704 | valid ppl 1437.1488 | learning rate 5.0000
15
- | end of split 15 /113 | epoch 1 | time: 229.99s | valid loss 7.1410 | valid ppl 1262.7434 | learning rate 5.0000
16
- | end of split 16 /113 | epoch 1 | time: 230.24s | valid loss 7.2028 | valid ppl 1343.1933 | learning rate 5.0000
17
- | end of split 17 /113 | epoch 1 | time: 48.80s | valid loss 7.1864 | valid ppl 1321.2975 | learning rate 5.0000
18
- | end of split 18 /113 | epoch 1 | time: 238.71s | valid loss 7.1344 | valid ppl 1254.4124 | learning rate 5.0000
19
- | end of split 19 /113 | epoch 1 | time: 238.74s | valid loss 7.1402 | valid ppl 1261.6803 | learning rate 5.0000
20
- | end of split 20 /113 | epoch 1 | time: 230.88s | valid loss 7.2222 | valid ppl 1369.5573 | learning rate 5.0000
21
- | end of split 21 /113 | epoch 1 | time: 235.01s | valid loss 7.1024 | valid ppl 1214.8458 | learning rate 5.0000
22
- | end of split 22 /113 | epoch 1 | time: 233.22s | valid loss 7.1523 | valid ppl 1277.0068 | learning rate 5.0000
23
- | end of split 23 /113 | epoch 1 | time: 234.10s | valid loss 7.1516 | valid ppl 1276.1012 | learning rate 5.0000
24
- | end of split 24 /113 | epoch 1 | time: 234.94s | valid loss 7.1347 | valid ppl 1254.7220 | learning rate 5.0000
25
- | end of split 25 /113 | epoch 1 | time: 232.93s | valid loss 7.1199 | valid ppl 1236.2833 | learning rate 5.0000
26
- | end of split 26 /113 | epoch 1 | time: 234.40s | valid loss 7.1184 | valid ppl 1234.5018 | learning rate 5.0000
27
- | end of split 27 /113 | epoch 1 | time: 237.28s | valid loss 7.1083 | valid ppl 1222.0958 | learning rate 5.0000
28
- | end of split 28 /113 | epoch 1 | time: 231.57s | valid loss 7.1589 | valid ppl 1285.4715 | learning rate 5.0000
29
- | end of split 29 /113 | epoch 1 | time: 232.64s | valid loss 7.1232 | valid ppl 1240.4354 | learning rate 5.0000
30
- | end of split 30 /113 | epoch 1 | time: 238.52s | valid loss 7.0960 | valid ppl 1207.1889 | learning rate 5.0000
31
- | end of split 31 /113 | epoch 1 | time: 235.86s | valid loss 7.1294 | valid ppl 1248.0873 | learning rate 5.0000
32
- | end of split 32 /113 | epoch 1 | time: 234.67s | valid loss 7.1366 | valid ppl 1257.1105 | learning rate 5.0000
33
- | end of split 33 /113 | epoch 1 | time: 236.46s | valid loss 7.0806 | valid ppl 1188.6487 | learning rate 5.0000
34
- | end of split 34 /113 | epoch 1 | time: 231.14s | valid loss 7.1160 | valid ppl 1231.4851 | learning rate 5.0000
35
- | end of split 35 /113 | epoch 1 | time: 236.11s | valid loss 7.1426 | valid ppl 1264.6883 | learning rate 5.0000
36
- | end of split 36 /113 | epoch 1 | time: 232.98s | valid loss 7.1442 | valid ppl 1266.7118 | learning rate 5.0000
37
- | end of split 37 /113 | epoch 1 | time: 235.77s | valid loss 7.1382 | valid ppl 1259.1016 | learning rate 5.0000
38
- | end of split 38 /113 | epoch 1 | time: 235.38s | valid loss 7.0742 | valid ppl 1181.0755 | learning rate 5.0000
39
- | end of split 39 /113 | epoch 1 | time: 230.26s | valid loss 7.1081 | valid ppl 1221.7934 | learning rate 5.0000
40
- | end of split 40 /113 | epoch 1 | time: 233.25s | valid loss 7.0893 | valid ppl 1199.0533 | learning rate 5.0000
41
- | end of split 41 /113 | epoch 1 | time: 232.96s | valid loss 7.0886 | valid ppl 1198.2460 | learning rate 5.0000
42
- | end of split 42 /113 | epoch 1 | time: 233.86s | valid loss 7.1457 | valid ppl 1268.6031 | learning rate 5.0000
43
- | end of split 43 /113 | epoch 1 | time: 234.62s | valid loss 7.1386 | valid ppl 1259.6532 | learning rate 5.0000
44
- | end of split 44 /113 | epoch 1 | time: 232.69s | valid loss 7.0900 | valid ppl 1199.9118 | learning rate 5.0000
45
- | end of split 45 /113 | epoch 1 | time: 230.84s | valid loss 7.1523 | valid ppl 1276.9780 | learning rate 5.0000
46
- | end of split 46 /113 | epoch 1 | time: 231.71s | valid loss 7.1219 | valid ppl 1238.7760 | learning rate 5.0000
47
- | end of split 47 /113 | epoch 1 | time: 230.86s | valid loss 7.0811 | valid ppl 1189.2806 | learning rate 5.0000
48
- | end of split 48 /113 | epoch 1 | time: 232.63s | valid loss 7.1543 | valid ppl 1279.6527 | learning rate 5.0000
49
- | end of split 49 /113 | epoch 1 | time: 233.86s | valid loss 7.0683 | valid ppl 1174.0986 | learning rate 5.0000
50
- | end of split 50 /113 | epoch 1 | time: 229.15s | valid loss 7.0550 | valid ppl 1158.6403 | learning rate 5.0000
51
- | end of split 51 /113 | epoch 1 | time: 236.63s | valid loss 7.1117 | valid ppl 1226.2546 | learning rate 5.0000
52
- | end of split 52 /113 | epoch 1 | time: 238.10s | valid loss 7.1026 | valid ppl 1215.1584 | learning rate 5.0000
53
- | end of split 53 /113 | epoch 1 | time: 232.74s | valid loss 7.0969 | valid ppl 1208.2648 | learning rate 5.0000
54
- | end of split 54 /113 | epoch 1 | time: 238.09s | valid loss 7.0846 | valid ppl 1193.4612 | learning rate 5.0000
55
- | end of split 55 /113 | epoch 1 | time: 233.70s | valid loss 7.1157 | valid ppl 1231.1284 | learning rate 5.0000
56
- | end of split 56 /113 | epoch 1 | time: 230.09s | valid loss 7.0540 | valid ppl 1157.4801 | learning rate 5.0000
57
- | end of split 57 /113 | epoch 1 | time: 235.27s | valid loss 7.0783 | valid ppl 1185.9658 | learning rate 5.0000
58
- | end of split 58 /113 | epoch 1 | time: 233.74s | valid loss 7.1189 | valid ppl 1235.0774 | learning rate 5.0000
59
- | end of split 59 /113 | epoch 1 | time: 229.77s | valid loss 7.0364 | valid ppl 1137.2668 | learning rate 5.0000
60
- | end of split 60 /113 | epoch 1 | time: 233.24s | valid loss 7.0514 | valid ppl 1154.5030 | learning rate 5.0000
61
- | end of split 61 /113 | epoch 1 | time: 236.63s | valid loss 7.1055 | valid ppl 1218.6020 | learning rate 5.0000
62
- | end of split 62 /113 | epoch 1 | time: 233.17s | valid loss 7.1210 | valid ppl 1237.6443 | learning rate 5.0000
63
- | end of split 63 /113 | epoch 1 | time: 234.66s | valid loss 7.0762 | valid ppl 1183.4137 | learning rate 5.0000
64
- | end of split 64 /113 | epoch 1 | time: 232.58s | valid loss 7.1240 | valid ppl 1241.4370 | learning rate 5.0000
65
- | end of split 65 /113 | epoch 1 | time: 231.51s | valid loss 7.0930 | valid ppl 1203.5000 | learning rate 5.0000
66
- | end of split 66 /113 | epoch 1 | time: 232.26s | valid loss 7.1001 | valid ppl 1212.0637 | learning rate 5.0000
67
- | end of split 67 /113 | epoch 1 | time: 228.92s | valid loss 7.0738 | valid ppl 1180.6015 | learning rate 5.0000
68
- | end of split 68 /113 | epoch 1 | time: 230.60s | valid loss 7.1206 | valid ppl 1237.2528 | learning rate 5.0000
69
- | end of split 69 /113 | epoch 1 | time: 232.29s | valid loss 7.1268 | valid ppl 1244.8903 | learning rate 5.0000
70
- | end of split 70 /113 | epoch 1 | time: 234.60s | valid loss 7.1138 | valid ppl 1228.8092 | learning rate 5.0000
71
- | end of split 71 /113 | epoch 1 | time: 231.33s | valid loss 7.0736 | valid ppl 1180.4231 | learning rate 5.0000
72
- | end of split 72 /113 | epoch 1 | time: 235.50s | valid loss 7.0407 | valid ppl 1142.1916 | learning rate 5.0000
73
- | end of split 73 /113 | epoch 1 | time: 230.23s | valid loss 7.0512 | valid ppl 1154.2604 | learning rate 5.0000
74
- | end of split 74 /113 | epoch 1 | time: 239.00s | valid loss 7.1215 | valid ppl 1238.2501 | learning rate 5.0000
75
- | end of split 75 /113 | epoch 1 | time: 234.03s | valid loss 7.1852 | valid ppl 1319.7906 | learning rate 5.0000
76
- | end of split 76 /113 | epoch 1 | time: 234.28s | valid loss 7.0916 | valid ppl 1201.8453 | learning rate 5.0000
77
- | end of split 77 /113 | epoch 1 | time: 235.71s | valid loss 7.0874 | valid ppl 1196.7356 | learning rate 5.0000
78
- | end of split 78 /113 | epoch 1 | time: 237.06s | valid loss 7.1335 | valid ppl 1253.2911 | learning rate 5.0000
79
- | end of split 79 /113 | epoch 1 | time: 233.74s | valid loss 7.1122 | valid ppl 1226.8927 | learning rate 5.0000
80
- | end of split 80 /113 | epoch 1 | time: 233.17s | valid loss 7.1309 | valid ppl 1250.0614 | learning rate 5.0000
81
- | end of split 81 /113 | epoch 1 | time: 232.30s | valid loss 7.0873 | valid ppl 1196.7297 | learning rate 5.0000
82
- | end of split 82 /113 | epoch 1 | time: 231.22s | valid loss 7.1370 | valid ppl 1257.6055 | learning rate 5.0000
83
- | end of split 83 /113 | epoch 1 | time: 231.43s | valid loss 7.0576 | valid ppl 1161.6918 | learning rate 5.0000
84
- | end of split 84 /113 | epoch 1 | time: 235.02s | valid loss 7.0657 | valid ppl 1171.0550 | learning rate 5.0000
85
- | end of split 85 /113 | epoch 1 | time: 234.79s | valid loss 7.1117 | valid ppl 1226.2184 | learning rate 5.0000
86
- | end of split 86 /113 | epoch 1 | time: 239.30s | valid loss 7.0911 | valid ppl 1201.2320 | learning rate 5.0000
87
- | end of split 87 /113 | epoch 1 | time: 230.62s | valid loss 7.0994 | valid ppl 1211.2212 | learning rate 5.0000
88
- | end of split 88 /113 | epoch 1 | time: 231.93s | valid loss 7.1275 | valid ppl 1245.7974 | learning rate 5.0000
89
- | end of split 89 /113 | epoch 1 | time: 231.13s | valid loss 7.0923 | valid ppl 1202.6127 | learning rate 5.0000
90
- | end of split 90 /113 | epoch 1 | time: 236.74s | valid loss 7.1520 | valid ppl 1276.6935 | learning rate 5.0000
91
- | end of split 91 /113 | epoch 1 | time: 232.98s | valid loss 7.1159 | valid ppl 1231.3526 | learning rate 5.0000
92
- | end of split 92 /113 | epoch 1 | time: 236.25s | valid loss 7.1405 | valid ppl 1262.0972 | learning rate 5.0000
93
- | end of split 93 /113 | epoch 1 | time: 234.62s | valid loss 7.0885 | valid ppl 1198.1424 | learning rate 5.0000
94
- | end of split 94 /113 | epoch 1 | time: 233.59s | valid loss 7.1003 | valid ppl 1212.3560 | learning rate 5.0000
95
- | end of split 95 /113 | epoch 1 | time: 233.27s | valid loss 7.1059 | valid ppl 1219.0888 | learning rate 5.0000
96
- | end of split 96 /113 | epoch 1 | time: 231.78s | valid loss 7.1232 | valid ppl 1240.4668 | learning rate 5.0000
97
- | end of split 97 /113 | epoch 1 | time: 235.60s | valid loss 7.1186 | valid ppl 1234.7345 | learning rate 5.0000
98
- | end of split 98 /113 | epoch 1 | time: 233.88s | valid loss 7.1161 | valid ppl 1231.6487 | learning rate 5.0000
99
- | end of split 99 /113 | epoch 1 | time: 236.68s | valid loss 7.1076 | valid ppl 1221.1639 | learning rate 5.0000
100
- | end of split 100 /113 | epoch 1 | time: 232.62s | valid loss 7.0984 | valid ppl 1210.0832 | learning rate 5.0000
101
- | end of split 101 /113 | epoch 1 | time: 233.49s | valid loss 7.1288 | valid ppl 1247.4030 | learning rate 5.0000
102
- | end of split 102 /113 | epoch 1 | time: 232.34s | valid loss 7.0934 | valid ppl 1204.0527 | learning rate 5.0000
103
- | end of split 103 /113 | epoch 1 | time: 230.64s | valid loss 7.1062 | valid ppl 1219.4642 | learning rate 5.0000
104
- | end of split 104 /113 | epoch 1 | time: 235.83s | valid loss 7.1531 | valid ppl 1278.0091 | learning rate 5.0000
105
- | end of split 105 /113 | epoch 1 | time: 230.35s | valid loss 7.1200 | valid ppl 1236.4884 | learning rate 5.0000
106
- | end of split 106 /113 | epoch 1 | time: 231.68s | valid loss 7.1236 | valid ppl 1240.9623 | learning rate 5.0000
107
- | end of split 107 /113 | epoch 1 | time: 236.04s | valid loss 7.0998 | valid ppl 1211.7024 | learning rate 5.0000
108
- | end of split 108 /113 | epoch 1 | time: 231.16s | valid loss 7.1267 | valid ppl 1244.7170 | learning rate 5.0000
109
- | end of split 109 /113 | epoch 1 | time: 235.80s | valid loss 7.1114 | valid ppl 1225.8615 | learning rate 5.0000
110
- | end of split 110 /113 | epoch 1 | time: 229.11s | valid loss 7.0848 | valid ppl 1193.6844 | learning rate 5.0000
111
- | end of split 111 /113 | epoch 1 | time: 232.32s | valid loss 7.0782 | valid ppl 1185.7957 | learning rate 1.2500
112
- | end of split 112 /113 | epoch 1 | time: 232.60s | valid loss 7.0965 | valid ppl 1207.7586 | learning rate 1.2500
113
- | end of split 113 /113 | epoch 1 | time: 237.25s | valid loss 7.1007 | valid ppl 1212.7755 | learning rate 1.2500
114
- | end of split 1 /113 | epoch 2 | time: 229.76s | valid loss 7.0779 | valid ppl 1185.4298 | learning rate 1.2500
115
- | end of split 2 /113 | epoch 2 | time: 232.20s | valid loss 7.0994 | valid ppl 1211.1846 | learning rate 1.2500
116
- | end of split 3 /113 | epoch 2 | time: 230.39s | valid loss 7.0802 | valid ppl 1188.2092 | learning rate 1.2500
117
- | end of split 4 /113 | epoch 2 | time: 232.46s | valid loss 7.0951 | valid ppl 1205.9962 | learning rate 1.2500
118
- | end of split 5 /113 | epoch 2 | time: 232.66s | valid loss 7.1047 | valid ppl 1217.6557 | learning rate 1.2500
119
- | end of split 6 /113 | epoch 2 | time: 231.54s | valid loss 7.0950 | valid ppl 1205.9267 | learning rate 1.2500
120
- | end of split 7 /113 | epoch 2 | time: 234.75s | valid loss 7.1142 | valid ppl 1229.3492 | learning rate 1.2500
121
- | end of split 8 /113 | epoch 2 | time: 235.30s | valid loss 7.0901 | valid ppl 1200.0375 | learning rate 1.2500
122
- | end of split 9 /113 | epoch 2 | time: 235.81s | valid loss 7.0971 | valid ppl 1208.4907 | learning rate 1.2500
123
- | end of split 10 /113 | epoch 2 | time: 230.40s | valid loss 7.0927 | valid ppl 1203.1642 | learning rate 1.2500
124
- | end of split 11 /113 | epoch 2 | time: 235.86s | valid loss 7.1028 | valid ppl 1215.3789 | learning rate 1.2500
125
- | end of split 12 /113 | epoch 2 | time: 230.91s | valid loss 7.0949 | valid ppl 1205.7953 | learning rate 1.2500
126
- | end of split 13 /113 | epoch 2 | time: 233.88s | valid loss 7.0789 | valid ppl 1186.6439 | learning rate 1.2500
127
- | end of split 14 /113 | epoch 2 | time: 232.71s | valid loss 7.0946 | valid ppl 1205.4994 | learning rate 1.2500
128
- | end of split 15 /113 | epoch 2 | time: 230.99s | valid loss 7.0850 | valid ppl 1193.9639 | learning rate 1.2500
129
- | end of split 16 /113 | epoch 2 | time: 227.77s | valid loss 7.1121 | valid ppl 1226.6969 | learning rate 1.2500
130
- | end of split 17 /113 | epoch 2 | time: 235.85s | valid loss 7.0980 | valid ppl 1209.5941 | learning rate 1.2500
131
- | end of split 18 /113 | epoch 2 | time: 235.06s | valid loss 7.0815 | valid ppl 1189.7783 | learning rate 1.2500
132
- | end of split 19 /113 | epoch 2 | time: 237.29s | valid loss 7.1028 | valid ppl 1215.3490 | learning rate 1.2500
133
- | end of split 20 /113 | epoch 2 | time: 235.29s | valid loss 7.0942 | valid ppl 1204.9817 | learning rate 1.2500
134
- | end of split 21 /113 | epoch 2 | time: 231.22s | valid loss 7.0837 | valid ppl 1192.3273 | learning rate 1.2500
135
- | end of split 22 /113 | epoch 2 | time: 235.58s | valid loss 7.0989 | valid ppl 1210.6321 | learning rate 1.2500
136
- | end of split 23 /113 | epoch 2 | time: 232.62s | valid loss 7.0947 | valid ppl 1205.5749 | learning rate 1.2500
137
- | end of split 24 /113 | epoch 2 | time: 238.49s | valid loss 7.1007 | valid ppl 1212.8266 | learning rate 1.2500
138
- | end of split 25 /113 | epoch 2 | time: 228.89s | valid loss 7.0794 | valid ppl 1187.2814 | learning rate 1.2500
139
- | end of split 26 /113 | epoch 2 | time: 231.21s | valid loss 7.0910 | valid ppl 1201.0850 | learning rate 1.2500
140
- | end of split 27 /113 | epoch 2 | time: 236.23s | valid loss 7.0950 | valid ppl 1205.9267 | learning rate 1.2500
141
- | end of split 28 /113 | epoch 2 | time: 234.70s | valid loss 7.0858 | valid ppl 1194.8918 | learning rate 1.2500
142
- | end of split 29 /113 | epoch 2 | time: 229.67s | valid loss 7.0637 | valid ppl 1168.7198 | learning rate 1.2500
143
- | end of split 30 /113 | epoch 2 | time: 230.59s | valid loss 7.1101 | valid ppl 1224.2250 | learning rate 1.2500
144
- | end of split 31 /113 | epoch 2 | time: 232.68s | valid loss 7.0836 | valid ppl 1192.2460 | learning rate 1.2500
145
- | end of split 32 /113 | epoch 2 | time: 231.80s | valid loss 7.1094 | valid ppl 1223.3879 | learning rate 1.2500
146
- | end of split 33 /113 | epoch 2 | time: 234.73s | valid loss 7.1026 | valid ppl 1215.0679 | learning rate 1.2500
147
- | end of split 34 /113 | epoch 2 | time: 232.94s | valid loss 7.0845 | valid ppl 1193.3580 | learning rate 1.2500
148
- | end of split 35 /113 | epoch 2 | time: 232.85s | valid loss 7.1046 | valid ppl 1217.5067 | learning rate 1.2500
149
- | end of split 36 /113 | epoch 2 | time: 236.10s | valid loss 7.1064 | valid ppl 1219.7146 | learning rate 1.2500
150
- | end of split 37 /113 | epoch 2 | time: 234.89s | valid loss 7.0999 | valid ppl 1211.8541 | learning rate 1.2500
151
- | end of split 38 /113 | epoch 2 | time: 239.33s | valid loss 7.0895 | valid ppl 1199.2961 | learning rate 1.2500
152
- | end of split 39 /113 | epoch 2 | time: 239.01s | valid loss 7.1112 | valid ppl 1225.6211 | learning rate 1.2500
153
- | end of split 40 /113 | epoch 2 | time: 233.50s | valid loss 7.0895 | valid ppl 1199.3484 | learning rate 1.2500
154
- | end of split 41 /113 | epoch 2 | time: 237.27s | valid loss 7.0723 | valid ppl 1178.8008 | learning rate 1.2500
155
- | end of split 42 /113 | epoch 2 | time: 231.15s | valid loss 7.0958 | valid ppl 1206.8495 | learning rate 1.2500
156
- | end of split 43 /113 | epoch 2 | time: 231.39s | valid loss 7.0922 | valid ppl 1202.5908 | learning rate 1.2500
157
- | end of split 44 /113 | epoch 2 | time: 229.96s | valid loss 7.1024 | valid ppl 1214.8449 | learning rate 1.2500
158
- | end of split 45 /113 | epoch 2 | time: 237.25s | valid loss 7.1115 | valid ppl 1226.0123 | learning rate 1.2500
159
- | end of split 46 /113 | epoch 2 | time: 233.19s | valid loss 7.0828 | valid ppl 1191.2430 | learning rate 1.2500
160
- | end of split 47 /113 | epoch 2 | time: 232.26s | valid loss 7.0917 | valid ppl 1201.9762 | learning rate 1.2500
161
- | end of split 48 /113 | epoch 2 | time: 227.95s | valid loss 7.0983 | valid ppl 1209.8765 | learning rate 1.2500
162
- | end of split 49 /113 | epoch 2 | time: 232.30s | valid loss 7.0888 | valid ppl 1198.4128 | learning rate 0.3125
163
- | end of split 50 /113 | epoch 2 | time: 238.16s | valid loss 7.0910 | valid ppl 1201.0504 | learning rate 0.3125
164
- | end of split 51 /113 | epoch 2 | time: 233.23s | valid loss 7.0949 | valid ppl 1205.7495 | learning rate 0.3125
165
- | end of split 52 /113 | epoch 2 | time: 232.61s | valid loss 7.0807 | valid ppl 1188.8117 | learning rate 0.3125
166
- | end of split 53 /113 | epoch 2 | time: 233.73s | valid loss 7.0902 | valid ppl 1200.1734 | learning rate 0.3125
167
- | end of split 54 /113 | epoch 2 | time: 230.67s | valid loss 7.0855 | valid ppl 1194.5399 | learning rate 0.3125
168
- | end of split 55 /113 | epoch 2 | time: 235.17s | valid loss 7.0903 | valid ppl 1200.2645 | learning rate 0.3125
169
- | end of split 56 /113 | epoch 2 | time: 230.04s | valid loss 7.0905 | valid ppl 1200.5506 | learning rate 0.3125
170
- | end of split 57 /113 | epoch 2 | time: 235.80s | valid loss 7.0972 | valid ppl 1208.5664 | learning rate 0.3125
171
- | end of split 58 /113 | epoch 2 | time: 233.83s | valid loss 7.0926 | valid ppl 1203.0872 | learning rate 0.3125
172
- | end of split 59 /113 | epoch 2 | time: 234.66s | valid loss 7.0922 | valid ppl 1202.5223 | learning rate 0.3125
173
- | end of split 60 /113 | epoch 2 | time: 231.74s | valid loss 7.0899 | valid ppl 1199.8190 | learning rate 0.3125
- | end of split 61 /113 | epoch 2 | time: 228.91s | valid loss 7.0938 | valid ppl 1204.4743 | learning rate 0.3125
- | end of split 62 /113 | epoch 2 | time: 235.87s | valid loss 7.0887 | valid ppl 1198.3909 | learning rate 0.3125
- | end of split 63 /113 | epoch 2 | time: 234.42s | valid loss 7.0820 | valid ppl 1190.2886 | learning rate 0.3125
- | end of split 64 /113 | epoch 2 | time: 233.77s | valid loss 7.0910 | valid ppl 1201.1087 | learning rate 0.3125
- | end of split 65 /113 | epoch 2 | time: 235.55s | valid loss 7.0922 | valid ppl 1202.4961 | learning rate 0.3125
- | end of split 66 /113 | epoch 2 | time: 231.77s | valid loss 7.0890 | valid ppl 1198.6597 | learning rate 0.3125
- | end of split 67 /113 | epoch 2 | time: 239.03s | valid loss 7.0907 | valid ppl 1200.6899 | learning rate 0.3125
- | end of split 68 /113 | epoch 2 | time: 233.79s | valid loss 7.0929 | valid ppl 1203.3503 | learning rate 0.3125
- | end of split 69 /113 | epoch 2 | time: 230.34s | valid loss 7.0980 | valid ppl 1209.6052 | learning rate 0.3125
- | end of split 70 /113 | epoch 2 | time: 236.49s | valid loss 7.0882 | valid ppl 1197.7819 | learning rate 0.3125
- | end of split 71 /113 | epoch 2 | time: 234.44s | valid loss 7.1003 | valid ppl 1212.3714 | learning rate 0.3125
- | end of split 72 /113 | epoch 2 | time: 233.01s | valid loss 7.0828 | valid ppl 1191.3159 | learning rate 0.3125
- | end of split 73 /113 | epoch 2 | time: 238.78s | valid loss 7.0959 | valid ppl 1207.0328 | learning rate 0.3125
- | end of split 74 /113 | epoch 2 | time: 239.67s | valid loss 7.0914 | valid ppl 1201.5850 | learning rate 0.3125
- | end of split 75 /113 | epoch 2 | time: 230.83s | valid loss 7.1005 | valid ppl 1212.5495 | learning rate 0.3125
- | end of split 76 /113 | epoch 2 | time: 235.05s | valid loss 7.0889 | valid ppl 1198.6319 | learning rate 0.3125
- | end of split 77 /113 | epoch 2 | time: 230.27s | valid loss 7.0923 | valid ppl 1202.6914 | learning rate 0.3125
- | end of split 78 /113 | epoch 2 | time: 231.51s | valid loss 7.0787 | valid ppl 1186.4144 | learning rate 0.3125
- | end of split 79 /113 | epoch 2 | time: 232.70s | valid loss 7.0995 | valid ppl 1211.3830 | learning rate 0.3125
- | end of split 80 /113 | epoch 2 | time: 233.21s | valid loss 7.0929 | valid ppl 1203.3740 | learning rate 0.3125
- | end of split 81 /113 | epoch 2 | time: 230.05s | valid loss 7.0802 | valid ppl 1188.1591 | learning rate 0.3125
- | end of split 82 /113 | epoch 2 | time: 235.62s | valid loss 7.0860 | valid ppl 1195.0842 | learning rate 0.3125
- | end of split 83 /113 | epoch 2 | time: 236.11s | valid loss 7.0906 | valid ppl 1200.6764 | learning rate 0.3125
- | end of split 84 /113 | epoch 2 | time: 230.87s | valid loss 7.0850 | valid ppl 1193.9009 | learning rate 0.3125
- | end of split 85 /113 | epoch 2 | time: 232.62s | valid loss 7.0939 | valid ppl 1204.6437 | learning rate 0.3125
- | end of split 86 /113 | epoch 2 | time: 238.23s | valid loss 7.0856 | valid ppl 1194.6482 | learning rate 0.3125
- | end of split 87 /113 | epoch 2 | time: 233.77s | valid loss 7.0942 | valid ppl 1205.0113 | learning rate 0.3125
- | end of split 88 /113 | epoch 2 | time: 230.52s | valid loss 7.0954 | valid ppl 1206.3736 | learning rate 0.3125
- | end of split 89 /113 | epoch 2 | time: 235.21s | valid loss 7.0953 | valid ppl 1206.2616 | learning rate 0.3125
- | end of split 90 /113 | epoch 2 | time: 236.74s | valid loss 7.0902 | valid ppl 1200.1371 | learning rate 0.3125
- | end of split 91 /113 | epoch 2 | time: 234.19s | valid loss 7.0940 | valid ppl 1204.7284 | learning rate 0.3125
- | end of split 92 /113 | epoch 2 | time: 229.17s | valid loss 7.0667 | valid ppl 1172.2181 | learning rate 0.3125
- | end of split 93 /113 | epoch 2 | time: 233.18s | valid loss 7.0851 | valid ppl 1193.9966 | learning rate 0.3125
- | end of split 94 /113 | epoch 2 | time: 233.54s | valid loss 7.0983 | valid ppl 1209.8629 | learning rate 0.3125
- | end of split 95 /113 | epoch 2 | time: 240.46s | valid loss 7.0915 | valid ppl 1201.7565 | learning rate 0.3125
- | end of split 96 /113 | epoch 2 | time: 232.63s | valid loss 7.0925 | valid ppl 1202.8766 | learning rate 0.3125
- | end of split 97 /113 | epoch 2 | time: 236.79s | valid loss 7.0868 | valid ppl 1196.0248 | learning rate 0.3125
- | end of split 98 /113 | epoch 2 | time: 234.71s | valid loss 7.0826 | valid ppl 1191.0655 | learning rate 0.3125
- | end of split 99 /113 | epoch 2 | time: 233.29s | valid loss 7.0957 | valid ppl 1206.8113 | learning rate 0.3125
- | end of split 100 /113 | epoch 2 | time: 236.83s | valid loss 7.0924 | valid ppl 1202.8005 | learning rate 0.0781
- | end of split 101 /113 | epoch 2 | time: 48.85s | valid loss 7.0897 | valid ppl 1199.5980 | learning rate 0.0781
- | end of split 102 /113 | epoch 2 | time: 236.70s | valid loss 7.0890 | valid ppl 1198.7280 | learning rate 0.0781
- | end of split 103 /113 | epoch 2 | time: 238.79s | valid loss 7.0864 | valid ppl 1195.5683 | learning rate 0.0781
- | end of split 104 /113 | epoch 2 | time: 232.38s | valid loss 7.0929 | valid ppl 1203.4357 | learning rate 0.0781
- | end of split 105 /113 | epoch 2 | time: 229.19s | valid loss 7.0942 | valid ppl 1204.8987 | learning rate 0.0781
- | end of split 106 /113 | epoch 2 | time: 231.16s | valid loss 7.0949 | valid ppl 1205.8207 | learning rate 0.0781
- | end of split 107 /113 | epoch 2 | time: 232.93s | valid loss 7.0896 | valid ppl 1199.3762 | learning rate 0.0781
- | end of split 108 /113 | epoch 2 | time: 234.06s | valid loss 7.0961 | valid ppl 1207.2101 | learning rate 0.0781
- | end of split 109 /113 | epoch 2 | time: 233.27s | valid loss 7.0883 | valid ppl 1197.8653 | learning rate 0.0781
- | end of split 110 /113 | epoch 2 | time: 234.69s | valid loss 7.0930 | valid ppl 1203.4772 | learning rate 0.0781
- | end of split 111 /113 | epoch 2 | time: 231.50s | valid loss 7.0946 | valid ppl 1205.4435 | learning rate 0.0781
- | end of split 112 /113 | epoch 2 | time: 233.79s | valid loss 7.0864 | valid ppl 1195.5549 | learning rate 0.0781
- | end of split 113 /113 | epoch 2 | time: 232.14s | valid loss 7.0906 | valid ppl 1200.6055 | learning rate 0.0781
- TEST: valid loss 7.0908 | valid ppl 1200.8965
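The perplexity column in the deleted `loss.txt` is presumably just the exponential of the validation loss, `ppl = exp(loss)`. A minimal sanity check against the final TEST line (the small residual comes from the loss being rounded to four decimals in the log):

```python
import math

# Final TEST line of loss.txt: valid loss 7.0908 | valid ppl 1200.8965
valid_loss = 7.0908

# Perplexity is the exponentiated cross-entropy loss
valid_ppl = math.exp(valid_loss)

print(f"valid loss {valid_loss:.4f} | valid ppl {valid_ppl:.4f}")
```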
training.log DELETED
The diff for this file is too large to render. See raw diff