kejian committed on
Commit 28cfbf1
1 Parent(s): 4d5afe3

Training in progress, step 21362

checkpoint-21362/config.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "_name_or_path": "gpt2",
+ "activation_function": "gelu_new",
+ "architectures": [
+ "GPT2LMAndValueHeadModel"
+ ],
+ "attn_pdrop": 0.1,
+ "bos_token_id": 50256,
+ "embd_pdrop": 0.1,
+ "eos_token_id": 50256,
+ "initializer_range": 0.02,
+ "layer_norm_epsilon": 1e-05,
+ "model_type": "gpt2",
+ "n_ctx": 1024,
+ "n_embd": 768,
+ "n_head": 12,
+ "n_inner": null,
+ "n_layer": 12,
+ "n_positions": 1024,
+ "reorder_and_upcast_attn": true,
+ "resid_pdrop": 0.1,
+ "scale_attn_by_inverse_layer_idx": false,
+ "scale_attn_weights": true,
+ "summary_activation": null,
+ "summary_first_dropout": 0.1,
+ "summary_proj_to_labels": true,
+ "summary_type": "cls_index",
+ "summary_use_proj": true,
+ "task_specific_params": {
+ "text-generation": {
+ "do_sample": true,
+ "max_length": 50
+ }
+ },
+ "torch_dtype": "float32",
+ "transformers_version": "4.23.0",
+ "use_cache": true,
+ "vocab_size": 50257
+ }
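
The config above is the stock 124M-parameter GPT-2 configuration, except that the architecture is declared as `GPT2LMAndValueHeadModel`, a value-head wrapper that is not part of stock transformers. Below is a minimal sketch of inspecting this config and loading the plain language-model weights, assuming the checkpoint directory is available locally as `checkpoint-21362`; whether the LM weights load cleanly into `GPT2LMHeadModel` depends on how the value-head class names its parameters.

```python
# Minimal sketch (assumes ./checkpoint-21362 exists locally): read the config
# shown above and load the plain GPT-2 LM weights with stock transformers.
# The value-head class named in "architectures" is not shipped with
# transformers; extra value-head tensors in pytorch_model.bin, if any, are
# reported as unused weights when loading GPT2LMHeadModel.
from transformers import GPT2Config, GPT2LMHeadModel

checkpoint_dir = "checkpoint-21362"

config = GPT2Config.from_pretrained(checkpoint_dir)
print(config.n_layer, config.n_embd, config.n_ctx, config.vocab_size)  # 12 768 1024 50257

model = GPT2LMHeadModel.from_pretrained(checkpoint_dir)
model.eval()
```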
checkpoint-21362/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint-21362/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a4e02415e8f5c9472a525d24073decf821dba16aa36e9c5d28b0813203933a09
+ size 995605189
checkpoint-21362/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9404903bc78cf2c713583822be36bec6051c5830b03e060ec00c9dc050e53e30
+ size 510398013
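
The three-line entries for optimizer.pt and pytorch_model.bin above are Git LFS pointers (spec version, sha256 oid, byte size), not the binaries themselves. A sketch of fetching the real weights file at this commit and checking it against the pointer follows; the repo_id is a placeholder, since the repository name is not shown on this page.

```python
# Minimal sketch: resolve the LFS pointer above into the real file and verify
# it against the recorded sha256/size. REPO_ID is a hypothetical placeholder;
# substitute the actual Hub repository this commit belongs to.
import hashlib
import os
from huggingface_hub import hf_hub_download

REPO_ID = "kejian/<model-repo>"  # placeholder, not shown in this diff

path = hf_hub_download(
    repo_id=REPO_ID,
    filename="checkpoint-21362/pytorch_model.bin",
    revision="28cfbf1",  # this commit (short hash)
)

h = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == "9404903bc78cf2c713583822be36bec6051c5830b03e060ec00c9dc050e53e30"
assert os.path.getsize(path) == 510398013  # size field from the pointer
```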
checkpoint-21362/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e46d0d51c9d952e32c0fbef91aa7ae68815b1d0195d8150cc3570dadae1c908
+ size 15661
checkpoint-21362/scaler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c59a48219652a5befea65c6c08c8855102169fb828d2e7796f6fd9dbcfcaefb
+ size 557
checkpoint-21362/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8704a361bf2a57302e27561b21f7c8c3061146b9af822e7098f333f4227c6001
+ size 627
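
optimizer.pt, scheduler.pt, rng_state.pth and scaler.pt are the resumable training state that transformers' Trainer writes next to the model weights (optimizer moments, LR-scheduler counters, RNG state, AMP grad-scaler state). Below is a small sketch of peeking at two of them once the real LFS files have been fetched; the exact keys beyond the standard ones depend on which optimizer and scheduler classes the original run used.

```python
# Minimal sketch (assumes the real files have been pulled from LFS into
# ./checkpoint-21362): inspect the saved optimizer and scheduler state.
# "state"/"param_groups" and "last_epoch" are standard; other keys vary
# with the optimizer/scheduler classes used in the original run.
import torch

opt = torch.load("checkpoint-21362/optimizer.pt", map_location="cpu")
sched = torch.load("checkpoint-21362/scheduler.pt", map_location="cpu")

print(len(opt["state"]))              # one state entry (Adam-style moments) per parameter tensor
print(opt["param_groups"][0]["lr"])   # learning rate recorded in the first param group
print(sched["last_epoch"])            # scheduler step counter, normally equal to global_step
print(sorted(sched.keys()))           # remaining fields depend on the scheduler class
```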
checkpoint-21362/special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "bos_token": "<|endoftext|>",
+ "eos_token": "<|endoftext|>",
+ "pad_token": "<|endoftext|>",
+ "unk_token": "<|endoftext|>"
+ }
checkpoint-21362/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint-21362/tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "add_prefix_space": false,
+ "bos_token": "<|endoftext|>",
+ "eos_token": "<|endoftext|>",
+ "model_max_length": 1024,
+ "name_or_path": "gpt2",
+ "special_tokens_map_file": null,
+ "tokenizer_class": "GPT2Tokenizer",
+ "unk_token": "<|endoftext|>"
+ }
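
A short sketch of loading the tokenizer files added above and checking the special-token mapping from special_tokens_map.json (GPT-2 reuses `<|endoftext|>` as bos/eos/pad/unk); it assumes the checkpoint directory is available locally.

```python
# Minimal sketch (assumes ./checkpoint-21362 exists locally): load the
# tokenizer saved with this checkpoint and confirm the special-token setup
# from special_tokens_map.json / tokenizer_config.json above.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("checkpoint-21362")

assert tokenizer.bos_token == tokenizer.eos_token == tokenizer.pad_token == "<|endoftext|>"
assert tokenizer.model_max_length == 1024

ids = tokenizer("Training in progress, step 21362").input_ids
print(len(ids), tokenizer.decode(ids))
```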
checkpoint-21362/trainer_state.json ADDED
@@ -0,0 +1,3152 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.5,
5
+ "global_step": 21362,
6
+ "is_hyper_param_search": false,
7
+ "is_local_process_zero": true,
8
+ "is_world_process_zero": true,
9
+ "log_history": [
10
+ {
11
+ "epoch": 0.0,
12
+ "learning_rate": 1.6355140186915887e-06,
13
+ "loss": 10.8008,
14
+ "theoretical_loss": 20.81281780154715,
15
+ "tokens_seen": 65536
16
+ },
17
+ {
18
+ "epoch": 0.0,
19
+ "learning_rate": 8.177570093457944e-05,
20
+ "loss": 8.7055,
21
+ "theoretical_loss": 8.563482664611069,
22
+ "tokens_seen": 3276800
23
+ },
24
+ {
25
+ "epoch": 0.0,
26
+ "learning_rate": 0.0001635514018691589,
27
+ "loss": 6.6732,
28
+ "theoretical_loss": 7.4777587180480305,
29
+ "tokens_seen": 6553600
30
+ },
31
+ {
32
+ "epoch": 0.0,
33
+ "learning_rate": 0.0002453271028037383,
34
+ "loss": 5.9716,
35
+ "theoretical_loss": 6.9337544888949,
36
+ "tokens_seen": 9830400
37
+ },
38
+ {
39
+ "epoch": 0.0,
40
+ "learning_rate": 0.0003271028037383178,
41
+ "loss": 5.5864,
42
+ "theoretical_loss": 6.583566228426414,
43
+ "tokens_seen": 13107200
44
+ },
45
+ {
46
+ "epoch": 0.01,
47
+ "learning_rate": 0.0004088785046728972,
48
+ "loss": 5.3924,
49
+ "theoretical_loss": 6.330713565116083,
50
+ "tokens_seen": 16384000
51
+ },
52
+ {
53
+ "epoch": 0.01,
54
+ "learning_rate": 0.0004906542056074766,
55
+ "loss": 5.2081,
56
+ "theoretical_loss": 6.135529231940326,
57
+ "tokens_seen": 19660800
58
+ },
59
+ {
60
+ "epoch": 0.01,
61
+ "learning_rate": 0.0005724299065420561,
62
+ "loss": 5.0469,
63
+ "theoretical_loss": 5.978101583869607,
64
+ "tokens_seen": 22937600
65
+ },
66
+ {
67
+ "epoch": 0.01,
68
+ "learning_rate": 0.0006542056074766356,
69
+ "loss": 4.9522,
70
+ "theoretical_loss": 5.8471173262659235,
71
+ "tokens_seen": 26214400
72
+ },
73
+ {
74
+ "epoch": 0.01,
75
+ "learning_rate": 0.0006996358993758275,
76
+ "loss": 4.7919,
77
+ "theoretical_loss": 5.7355768158821245,
78
+ "tokens_seen": 29491200
79
+ },
80
+ {
81
+ "epoch": 0.01,
82
+ "learning_rate": 0.0006988083979572536,
83
+ "loss": 4.7314,
84
+ "theoretical_loss": 5.638870144071353,
85
+ "tokens_seen": 32768000
86
+ },
87
+ {
88
+ "epoch": 0.01,
89
+ "learning_rate": 0.0006979808965386797,
90
+ "loss": 4.6075,
91
+ "theoretical_loss": 5.553812381844907,
92
+ "tokens_seen": 36044800
93
+ },
94
+ {
95
+ "epoch": 0.01,
96
+ "learning_rate": 0.000697153395120106,
97
+ "loss": 4.5189,
98
+ "theoretical_loss": 5.478118080556438,
99
+ "tokens_seen": 39321600
100
+ },
101
+ {
102
+ "epoch": 0.02,
103
+ "learning_rate": 0.0006963258937015321,
104
+ "loss": 4.4329,
105
+ "theoretical_loss": 5.410095959579362,
106
+ "tokens_seen": 42598400
107
+ },
108
+ {
109
+ "epoch": 0.02,
110
+ "learning_rate": 0.0006954983922829582,
111
+ "loss": 4.4025,
112
+ "theoretical_loss": 5.348462083735834,
113
+ "tokens_seen": 45875200
114
+ },
115
+ {
116
+ "epoch": 0.02,
117
+ "learning_rate": 0.0006946708908643843,
118
+ "loss": 4.25,
119
+ "theoretical_loss": 5.292220566937567,
120
+ "tokens_seen": 49152000
121
+ },
122
+ {
123
+ "epoch": 0.02,
124
+ "learning_rate": 0.0006938433894458105,
125
+ "loss": 4.2174,
126
+ "theoretical_loss": 5.240584625769978,
127
+ "tokens_seen": 52428800
128
+ },
129
+ {
130
+ "epoch": 0.02,
131
+ "learning_rate": 0.0006930158880272367,
132
+ "loss": 4.1421,
133
+ "theoretical_loss": 5.192922724525789,
134
+ "tokens_seen": 55705600
135
+ },
136
+ {
137
+ "epoch": 0.02,
138
+ "learning_rate": 0.0006921883866086628,
139
+ "loss": 4.0643,
140
+ "theoretical_loss": 5.1487208633564405,
141
+ "tokens_seen": 58982400
142
+ },
143
+ {
144
+ "epoch": 0.02,
145
+ "learning_rate": 0.0006913608851900889,
146
+ "loss": 3.9375,
147
+ "theoretical_loss": 5.107555562405102,
148
+ "tokens_seen": 62259200
149
+ },
150
+ {
151
+ "epoch": 0.02,
152
+ "learning_rate": 0.000690533383771515,
153
+ "loss": 3.8331,
154
+ "theoretical_loss": 5.069074117143246,
155
+ "tokens_seen": 65536000
156
+ },
157
+ {
158
+ "epoch": 0.02,
159
+ "learning_rate": 0.0006897058823529412,
160
+ "loss": 3.7941,
161
+ "theoretical_loss": 5.032979909838007,
162
+ "tokens_seen": 68812800
163
+ },
164
+ {
165
+ "epoch": 0.03,
166
+ "learning_rate": 0.0006888783809343674,
167
+ "loss": 3.7297,
168
+ "theoretical_loss": 4.999021308224664,
169
+ "tokens_seen": 72089600
170
+ },
171
+ {
172
+ "epoch": 0.03,
173
+ "learning_rate": 0.0006880508795157935,
174
+ "loss": 3.6773,
175
+ "theoretical_loss": 4.966983155351962,
176
+ "tokens_seen": 75366400
177
+ },
178
+ {
179
+ "epoch": 0.03,
180
+ "learning_rate": 0.0006872233780972196,
181
+ "loss": 3.5973,
182
+ "theoretical_loss": 4.9366801616251355,
183
+ "tokens_seen": 78643200
184
+ },
185
+ {
186
+ "epoch": 0.03,
187
+ "learning_rate": 0.0006863958766786457,
188
+ "loss": 3.5893,
189
+ "theoretical_loss": 4.907951713830082,
190
+ "tokens_seen": 81920000
191
+ },
192
+ {
193
+ "epoch": 0.03,
194
+ "learning_rate": 0.0006855683752600718,
195
+ "loss": 3.5262,
196
+ "theoretical_loss": 4.880657753812926,
197
+ "tokens_seen": 85196800
198
+ },
199
+ {
200
+ "epoch": 0.03,
201
+ "learning_rate": 0.000684740873841498,
202
+ "loss": 3.5248,
203
+ "theoretical_loss": 4.854675474481779,
204
+ "tokens_seen": 88473600
205
+ },
206
+ {
207
+ "epoch": 0.03,
208
+ "learning_rate": 0.0006839133724229242,
209
+ "loss": 3.4372,
210
+ "theoretical_loss": 4.8298966473088125,
211
+ "tokens_seen": 91750400
212
+ },
213
+ {
214
+ "epoch": 0.03,
215
+ "learning_rate": 0.0006830858710043503,
216
+ "loss": 3.4666,
217
+ "theoretical_loss": 4.8062254427779205,
218
+ "tokens_seen": 95027200
219
+ },
220
+ {
221
+ "epoch": 0.04,
222
+ "learning_rate": 0.0006822583695857764,
223
+ "loss": 3.4167,
224
+ "theoretical_loss": 4.783576639276257,
225
+ "tokens_seen": 98304000
226
+ },
227
+ {
228
+ "epoch": 0.04,
229
+ "learning_rate": 0.0006814308681672025,
230
+ "loss": 3.4719,
231
+ "theoretical_loss": 4.761874140772408,
232
+ "tokens_seen": 101580800
233
+ },
234
+ {
235
+ "epoch": 0.04,
236
+ "learning_rate": 0.0006806033667486286,
237
+ "loss": 3.4687,
238
+ "theoretical_loss": 4.741049741962473,
239
+ "tokens_seen": 104857600
240
+ },
241
+ {
242
+ "epoch": 0.04,
243
+ "learning_rate": 0.0006797758653300548,
244
+ "loss": 3.4353,
245
+ "theoretical_loss": 4.721042093249051,
246
+ "tokens_seen": 108134400
247
+ },
248
+ {
249
+ "epoch": 0.04,
250
+ "learning_rate": 0.000678948363911481,
251
+ "loss": 3.4114,
252
+ "theoretical_loss": 4.701795828231866,
253
+ "tokens_seen": 111411200
254
+ },
255
+ {
256
+ "epoch": 0.04,
257
+ "learning_rate": 0.0006781208624929071,
258
+ "loss": 3.41,
259
+ "theoretical_loss": 4.68326082423593,
260
+ "tokens_seen": 114688000
261
+ },
262
+ {
263
+ "epoch": 0.04,
264
+ "learning_rate": 0.0006772933610743332,
265
+ "loss": 3.3813,
266
+ "theoretical_loss": 4.665391572426282,
267
+ "tokens_seen": 117964800
268
+ },
269
+ {
270
+ "epoch": 0.04,
271
+ "learning_rate": 0.0006764658596557593,
272
+ "loss": 3.3451,
273
+ "theoretical_loss": 4.648146638719739,
274
+ "tokens_seen": 121241600
275
+ },
276
+ {
277
+ "epoch": 0.04,
278
+ "learning_rate": 0.0006756383582371856,
279
+ "loss": 3.3597,
280
+ "theoretical_loss": 4.631488200339643,
281
+ "tokens_seen": 124518400
282
+ },
283
+ {
284
+ "epoch": 0.05,
285
+ "learning_rate": 0.0006748108568186117,
286
+ "loss": 3.3776,
287
+ "theoretical_loss": 4.615381645715717,
288
+ "tokens_seen": 127795200
289
+ },
290
+ {
291
+ "epoch": 0.05,
292
+ "learning_rate": 0.0006739833554000378,
293
+ "loss": 3.3091,
294
+ "theoretical_loss": 4.599795227690505,
295
+ "tokens_seen": 131072000
296
+ },
297
+ {
298
+ "epoch": 0.05,
299
+ "learning_rate": 0.000673155853981464,
300
+ "loss": 3.3312,
301
+ "theoretical_loss": 4.584699761792674,
302
+ "tokens_seen": 134348800
303
+ },
304
+ {
305
+ "epoch": 0.05,
306
+ "learning_rate": 0.0006723283525628902,
307
+ "loss": 3.2886,
308
+ "theoretical_loss": 4.570068362778516,
309
+ "tokens_seen": 137625600
310
+ },
311
+ {
312
+ "epoch": 0.05,
313
+ "learning_rate": 0.0006715008511443163,
314
+ "loss": 3.3051,
315
+ "theoretical_loss": 4.555876213804037,
316
+ "tokens_seen": 140902400
317
+ },
318
+ {
319
+ "epoch": 0.05,
320
+ "learning_rate": 0.0006706733497257424,
321
+ "loss": 3.2901,
322
+ "theoretical_loss": 4.542100363530799,
323
+ "tokens_seen": 144179200
324
+ },
325
+ {
326
+ "epoch": 0.05,
327
+ "learning_rate": 0.0006698458483071685,
328
+ "loss": 3.2702,
329
+ "theoretical_loss": 4.528719547234816,
330
+ "tokens_seen": 147456000
331
+ },
332
+ {
333
+ "epoch": 0.05,
334
+ "learning_rate": 0.0006690183468885946,
335
+ "loss": 3.2275,
336
+ "theoretical_loss": 4.515714028614996,
337
+ "tokens_seen": 150732800
338
+ },
339
+ {
340
+ "epoch": 0.06,
341
+ "learning_rate": 0.0006681908454700209,
342
+ "loss": 3.2427,
343
+ "theoretical_loss": 4.503065459513339,
344
+ "tokens_seen": 154009600
345
+ },
346
+ {
347
+ "epoch": 0.06,
348
+ "learning_rate": 0.000667363344051447,
349
+ "loss": 3.2513,
350
+ "theoretical_loss": 4.4907567551852665,
351
+ "tokens_seen": 157286400
352
+ },
353
+ {
354
+ "epoch": 0.06,
355
+ "learning_rate": 0.0006665358426328731,
356
+ "loss": 3.2329,
357
+ "theoretical_loss": 4.478771983111967,
358
+ "tokens_seen": 160563200
359
+ },
360
+ {
361
+ "epoch": 0.06,
362
+ "objective/train/avg_token_score": 0.027871694415807724,
363
+ "objective/train/avg_weight": 0.977715790271759,
364
+ "objective/train/docs_used": 104000,
365
+ "objective/train/instantaneous_batch_size": 32,
366
+ "objective/train/instantaneous_microbatch_size": 32768,
367
+ "objective/train/original_loss": 3.3409407138824463,
368
+ "objective/train/std_weight": 0.06310887634754181,
369
+ "objective/train/theoretical_loss": 4.467096263641219,
370
+ "objective/train/tokens_used": 184300000,
371
+ "theoretical_loss": 4.467096263641219,
372
+ "tokens_seen": 163840000
373
+ },
374
+ {
375
+ "epoch": 0.06,
376
+ "learning_rate": 0.0006657083412142992,
377
+ "loss": 3.2359,
378
+ "theoretical_loss": 4.467096263641219,
379
+ "tokens_seen": 163840000
380
+ },
381
+ {
382
+ "epoch": 0.06,
383
+ "learning_rate": 0.0006648808397957253,
384
+ "loss": 3.3137,
385
+ "theoretical_loss": 4.455715680989545,
386
+ "tokens_seen": 167116800
387
+ },
388
+ {
389
+ "epoch": 0.06,
390
+ "learning_rate": 0.0006640533383771514,
391
+ "loss": 3.2122,
392
+ "theoretical_loss": 4.44461720334543,
393
+ "tokens_seen": 170393600
394
+ },
395
+ {
396
+ "epoch": 0.06,
397
+ "learning_rate": 0.0006632258369585777,
398
+ "loss": 3.1455,
399
+ "theoretical_loss": 4.433788610987646,
400
+ "tokens_seen": 173670400
401
+ },
402
+ {
403
+ "epoch": 0.06,
404
+ "learning_rate": 0.0006623983355400038,
405
+ "loss": 3.198,
406
+ "theoretical_loss": 4.42321843148016,
407
+ "tokens_seen": 176947200
408
+ },
409
+ {
410
+ "epoch": 0.06,
411
+ "learning_rate": 0.0006615708341214299,
412
+ "loss": 3.1453,
413
+ "theoretical_loss": 4.412895881130142,
414
+ "tokens_seen": 180224000
415
+ },
416
+ {
417
+ "epoch": 0.07,
418
+ "learning_rate": 0.000660743332702856,
419
+ "loss": 3.1834,
420
+ "theoretical_loss": 4.4028108120020795,
421
+ "tokens_seen": 183500800
422
+ },
423
+ {
424
+ "epoch": 0.07,
425
+ "learning_rate": 0.0006599158312842821,
426
+ "loss": 3.1772,
427
+ "theoretical_loss": 4.392953663871862,
428
+ "tokens_seen": 186777600
429
+ },
430
+ {
431
+ "epoch": 0.07,
432
+ "learning_rate": 0.0006590883298657083,
433
+ "loss": 3.1628,
434
+ "theoretical_loss": 4.383315420582533,
435
+ "tokens_seen": 190054400
436
+ },
437
+ {
438
+ "epoch": 0.07,
439
+ "learning_rate": 0.0006582608284471345,
440
+ "loss": 3.1954,
441
+ "theoretical_loss": 4.373887570330275,
442
+ "tokens_seen": 193331200
443
+ },
444
+ {
445
+ "epoch": 0.07,
446
+ "learning_rate": 0.0006574333270285606,
447
+ "loss": 3.1504,
448
+ "theoretical_loss": 4.364662069466704,
449
+ "tokens_seen": 196608000
450
+ },
451
+ {
452
+ "epoch": 0.07,
453
+ "learning_rate": 0.0006566058256099867,
454
+ "loss": 3.1496,
455
+ "theoretical_loss": 4.355631309453283,
456
+ "tokens_seen": 199884800
457
+ },
458
+ {
459
+ "epoch": 0.07,
460
+ "learning_rate": 0.0006557783241914128,
461
+ "loss": 3.1045,
462
+ "theoretical_loss": 4.346788086646671,
463
+ "tokens_seen": 203161600
464
+ },
465
+ {
466
+ "epoch": 0.07,
467
+ "learning_rate": 0.0006549508227728391,
468
+ "loss": 3.1352,
469
+ "theoretical_loss": 4.33812557463116,
470
+ "tokens_seen": 206438400
471
+ },
472
+ {
473
+ "epoch": 0.07,
474
+ "learning_rate": 0.0006541233213542652,
475
+ "loss": 3.1432,
476
+ "theoretical_loss": 4.329637298846812,
477
+ "tokens_seen": 209715200
478
+ },
479
+ {
480
+ "epoch": 0.08,
481
+ "learning_rate": 0.0006532958199356913,
482
+ "loss": 3.106,
483
+ "theoretical_loss": 4.321317113290252,
484
+ "tokens_seen": 212992000
485
+ },
486
+ {
487
+ "epoch": 0.08,
488
+ "learning_rate": 0.0006524683185171174,
489
+ "loss": 3.0747,
490
+ "theoretical_loss": 4.3131591790897925,
491
+ "tokens_seen": 216268800
492
+ },
493
+ {
494
+ "epoch": 0.08,
495
+ "learning_rate": 0.0006516408170985437,
496
+ "loss": 3.1173,
497
+ "theoretical_loss": 4.305157944778228,
498
+ "tokens_seen": 219545600
499
+ },
500
+ {
501
+ "epoch": 0.08,
502
+ "learning_rate": 0.0006508133156799698,
503
+ "loss": 3.1451,
504
+ "theoretical_loss": 4.297308128105687,
505
+ "tokens_seen": 222822400
506
+ },
507
+ {
508
+ "epoch": 0.08,
509
+ "learning_rate": 0.0006499858142613959,
510
+ "loss": 3.1329,
511
+ "theoretical_loss": 4.2896046992515995,
512
+ "tokens_seen": 226099200
513
+ },
514
+ {
515
+ "epoch": 0.08,
516
+ "learning_rate": 0.000649158312842822,
517
+ "loss": 3.1677,
518
+ "theoretical_loss": 4.282042865309616,
519
+ "tokens_seen": 229376000
520
+ },
521
+ {
522
+ "epoch": 0.08,
523
+ "learning_rate": 0.0006483308114242481,
524
+ "loss": 3.1376,
525
+ "theoretical_loss": 4.274618055932298,
526
+ "tokens_seen": 232652800
527
+ },
528
+ {
529
+ "epoch": 0.08,
530
+ "learning_rate": 0.0006475033100056744,
531
+ "loss": 3.1095,
532
+ "theoretical_loss": 4.267325910033897,
533
+ "tokens_seen": 235929600
534
+ },
535
+ {
536
+ "epoch": 0.09,
537
+ "learning_rate": 0.0006466758085871005,
538
+ "loss": 3.0555,
539
+ "theoretical_loss": 4.260162263459744,
540
+ "tokens_seen": 239206400
541
+ },
542
+ {
543
+ "epoch": 0.09,
544
+ "learning_rate": 0.0006458483071685266,
545
+ "loss": 3.083,
546
+ "theoretical_loss": 4.253123137539814,
547
+ "tokens_seen": 242483200
548
+ },
549
+ {
550
+ "epoch": 0.09,
551
+ "learning_rate": 0.0006450208057499527,
552
+ "loss": 3.1657,
553
+ "theoretical_loss": 4.246204728452055,
554
+ "tokens_seen": 245760000
555
+ },
556
+ {
557
+ "epoch": 0.09,
558
+ "learning_rate": 0.0006441933043313788,
559
+ "loss": 3.1313,
560
+ "theoretical_loss": 4.239403397328261,
561
+ "tokens_seen": 249036800
562
+ },
563
+ {
564
+ "epoch": 0.09,
565
+ "learning_rate": 0.0006433658029128049,
566
+ "loss": 3.0927,
567
+ "theoretical_loss": 4.232715661041632,
568
+ "tokens_seen": 252313600
569
+ },
570
+ {
571
+ "epoch": 0.09,
572
+ "learning_rate": 0.0006425383014942312,
573
+ "loss": 3.1078,
574
+ "theoretical_loss": 4.226138183620867,
575
+ "tokens_seen": 255590400
576
+ },
577
+ {
578
+ "epoch": 0.09,
579
+ "learning_rate": 0.0006417108000756573,
580
+ "loss": 3.0412,
581
+ "theoretical_loss": 4.219667768240775,
582
+ "tokens_seen": 258867200
583
+ },
584
+ {
585
+ "epoch": 0.09,
586
+ "learning_rate": 0.0006408832986570834,
587
+ "loss": 3.045,
588
+ "theoretical_loss": 4.213301349743924,
589
+ "tokens_seen": 262144000
590
+ },
591
+ {
592
+ "epoch": 0.09,
593
+ "learning_rate": 0.0006400557972385095,
594
+ "loss": 2.9994,
595
+ "theoretical_loss": 4.20703598765197,
596
+ "tokens_seen": 265420800
597
+ },
598
+ {
599
+ "epoch": 0.1,
600
+ "learning_rate": 0.0006392282958199356,
601
+ "loss": 2.9626,
602
+ "theoretical_loss": 4.2008688596290025,
603
+ "tokens_seen": 268697600
604
+ },
605
+ {
606
+ "epoch": 0.1,
607
+ "learning_rate": 0.0006384007944013618,
608
+ "loss": 2.9181,
609
+ "theoretical_loss": 4.194797255362549,
610
+ "tokens_seen": 271974400
611
+ },
612
+ {
613
+ "epoch": 0.1,
614
+ "learning_rate": 0.000637573292982788,
615
+ "loss": 3.0047,
616
+ "theoretical_loss": 4.188818570830883,
617
+ "tokens_seen": 275251200
618
+ },
619
+ {
620
+ "epoch": 0.1,
621
+ "learning_rate": 0.0006367457915642141,
622
+ "loss": 3.0216,
623
+ "theoretical_loss": 4.182930302927963,
624
+ "tokens_seen": 278528000
625
+ },
626
+ {
627
+ "epoch": 0.1,
628
+ "learning_rate": 0.0006359182901456402,
629
+ "loss": 2.9805,
630
+ "theoretical_loss": 4.17713004441978,
631
+ "tokens_seen": 281804800
632
+ },
633
+ {
634
+ "epoch": 0.1,
635
+ "learning_rate": 0.0006350907887270663,
636
+ "loss": 3.0196,
637
+ "theoretical_loss": 4.1714154792080915,
638
+ "tokens_seen": 285081600
639
+ },
640
+ {
641
+ "epoch": 0.1,
642
+ "learning_rate": 0.0006342632873084925,
643
+ "loss": 3.0324,
644
+ "theoretical_loss": 4.165784377879517,
645
+ "tokens_seen": 288358400
646
+ },
647
+ {
648
+ "epoch": 0.1,
649
+ "learning_rate": 0.0006334357858899187,
650
+ "loss": 2.9929,
651
+ "theoretical_loss": 4.160234593519768,
652
+ "tokens_seen": 291635200
653
+ },
654
+ {
655
+ "epoch": 0.11,
656
+ "learning_rate": 0.0006326082844713448,
657
+ "loss": 3.0177,
658
+ "theoretical_loss": 4.15476405777444,
659
+ "tokens_seen": 294912000
660
+ },
661
+ {
662
+ "epoch": 0.11,
663
+ "learning_rate": 0.0006317807830527709,
664
+ "loss": 2.9921,
665
+ "theoretical_loss": 4.149370777139286,
666
+ "tokens_seen": 298188800
667
+ },
668
+ {
669
+ "epoch": 0.11,
670
+ "learning_rate": 0.0006309532816341972,
671
+ "loss": 2.9847,
672
+ "theoretical_loss": 4.144052829464249,
673
+ "tokens_seen": 301465600
674
+ },
675
+ {
676
+ "epoch": 0.11,
677
+ "learning_rate": 0.0006301257802156233,
678
+ "loss": 2.9369,
679
+ "theoretical_loss": 4.138808360656742,
680
+ "tokens_seen": 304742400
681
+ },
682
+ {
683
+ "epoch": 0.11,
684
+ "learning_rate": 0.0006292982787970494,
685
+ "loss": 2.9483,
686
+ "theoretical_loss": 4.133635581570836,
687
+ "tokens_seen": 308019200
688
+ },
689
+ {
690
+ "epoch": 0.11,
691
+ "learning_rate": 0.0006284707773784755,
692
+ "loss": 2.9245,
693
+ "theoretical_loss": 4.128532765070004,
694
+ "tokens_seen": 311296000
695
+ },
696
+ {
697
+ "epoch": 0.11,
698
+ "learning_rate": 0.0006276432759599016,
699
+ "loss": 2.9521,
700
+ "theoretical_loss": 4.123498243252032,
701
+ "tokens_seen": 314572800
702
+ },
703
+ {
704
+ "epoch": 0.11,
705
+ "learning_rate": 0.0006268157745413279,
706
+ "loss": 2.9375,
707
+ "theoretical_loss": 4.118530404825556,
708
+ "tokens_seen": 317849600
709
+ },
710
+ {
711
+ "epoch": 0.11,
712
+ "learning_rate": 0.000625988273122754,
713
+ "loss": 2.9599,
714
+ "theoretical_loss": 4.113627692628464,
715
+ "tokens_seen": 321126400
716
+ },
717
+ {
718
+ "epoch": 0.12,
719
+ "learning_rate": 0.0006251607717041801,
720
+ "loss": 2.9643,
721
+ "theoretical_loss": 4.108788601279149,
722
+ "tokens_seen": 324403200
723
+ },
724
+ {
725
+ "debugging/Self-BLEU-5": 0.5365128506817183,
726
+ "debugging/distinct-1-grams": 0.7612814402327299,
727
+ "debugging/distinct-2-grams": 0.9694583753853511,
728
+ "debugging/entropy-1-grams": 6.003629944255698,
729
+ "debugging/entropy-2-grams": 7.054987089269872,
730
+ "debugging/length": 495.25,
731
+ "debugging/num_segments": 16,
732
+ "epoch": 0.12,
733
+ "objective/train/avg_token_score": 0.04385810345411301,
734
+ "objective/train/avg_weight": 0.9649326205253601,
735
+ "objective/train/docs_used": 197327,
736
+ "objective/train/instantaneous_batch_size": 32,
737
+ "objective/train/instantaneous_microbatch_size": 32768,
738
+ "objective/train/original_loss": 2.908684015274048,
739
+ "objective/train/std_weight": 0.12546230852603912,
740
+ "objective/train/theoretical_loss": 4.10401167495222,
741
+ "objective/train/tokens_used": 348140000,
742
+ "theoretical_loss": 4.10401167495222,
743
+ "tokens_seen": 327680000
744
+ },
745
+ {
746
+ "epoch": 0.12,
747
+ "learning_rate": 0.0006243332702856062,
748
+ "loss": 2.9932,
749
+ "theoretical_loss": 4.10401167495222,
750
+ "tokens_seen": 327680000
751
+ },
752
+ {
753
+ "epoch": 0.12,
754
+ "learning_rate": 0.0006235057688670323,
755
+ "loss": 2.9925,
756
+ "theoretical_loss": 4.099295505270921,
757
+ "tokens_seen": 330956800
758
+ },
759
+ {
760
+ "epoch": 0.12,
761
+ "learning_rate": 0.0006226782674484584,
762
+ "loss": 2.9372,
763
+ "theoretical_loss": 4.094638729309031,
764
+ "tokens_seen": 334233600
765
+ },
766
+ {
767
+ "epoch": 0.12,
768
+ "learning_rate": 0.0006218507660298847,
769
+ "loss": 2.9488,
770
+ "theoretical_loss": 4.090040027695556,
771
+ "tokens_seen": 337510400
772
+ },
773
+ {
774
+ "epoch": 0.12,
775
+ "learning_rate": 0.0006210232646113108,
776
+ "loss": 2.906,
777
+ "theoretical_loss": 4.085498122815992,
778
+ "tokens_seen": 340787200
779
+ },
780
+ {
781
+ "epoch": 0.12,
782
+ "learning_rate": 0.0006201957631927369,
783
+ "loss": 2.9313,
784
+ "theoretical_loss": 4.081011777104333,
785
+ "tokens_seen": 344064000
786
+ },
787
+ {
788
+ "epoch": 0.12,
789
+ "learning_rate": 0.000619368261774163,
790
+ "loss": 2.9368,
791
+ "theoretical_loss": 4.076579791420469,
792
+ "tokens_seen": 347340800
793
+ },
794
+ {
795
+ "epoch": 0.13,
796
+ "learning_rate": 0.0006185407603555891,
797
+ "loss": 2.9504,
798
+ "theoretical_loss": 4.0722010035079155,
799
+ "tokens_seen": 350617600
800
+ },
801
+ {
802
+ "epoch": 0.13,
803
+ "learning_rate": 0.0006177132589370153,
804
+ "loss": 2.9416,
805
+ "theoretical_loss": 4.067874286527197,
806
+ "tokens_seen": 353894400
807
+ },
808
+ {
809
+ "epoch": 0.13,
810
+ "learning_rate": 0.0006168857575184414,
811
+ "loss": 2.9402,
812
+ "theoretical_loss": 4.063598547660519,
813
+ "tokens_seen": 357171200
814
+ },
815
+ {
816
+ "epoch": 0.13,
817
+ "learning_rate": 0.0006160582560998676,
818
+ "loss": 2.9692,
819
+ "theoretical_loss": 4.05937272678363,
820
+ "tokens_seen": 360448000
821
+ },
822
+ {
823
+ "epoch": 0.13,
824
+ "learning_rate": 0.0006152307546812937,
825
+ "loss": 2.957,
826
+ "theoretical_loss": 4.055195795201069,
827
+ "tokens_seen": 363724800
828
+ },
829
+ {
830
+ "epoch": 0.13,
831
+ "learning_rate": 0.0006144032532627198,
832
+ "loss": 2.9066,
833
+ "theoretical_loss": 4.051066754441235,
834
+ "tokens_seen": 367001600
835
+ },
836
+ {
837
+ "epoch": 0.13,
838
+ "learning_rate": 0.000613575751844146,
839
+ "loss": 2.9128,
840
+ "theoretical_loss": 4.04698463510794,
841
+ "tokens_seen": 370278400
842
+ },
843
+ {
844
+ "epoch": 0.13,
845
+ "learning_rate": 0.0006127482504255721,
846
+ "loss": 2.9181,
847
+ "theoretical_loss": 4.042948495785312,
848
+ "tokens_seen": 373555200
849
+ },
850
+ {
851
+ "epoch": 0.13,
852
+ "learning_rate": 0.0006119207490069983,
853
+ "loss": 2.8869,
854
+ "theoretical_loss": 4.038957421993153,
855
+ "tokens_seen": 376832000
856
+ },
857
+ {
858
+ "epoch": 0.14,
859
+ "learning_rate": 0.0006110932475884244,
860
+ "loss": 2.9055,
861
+ "theoretical_loss": 4.035010525189982,
862
+ "tokens_seen": 380108800
863
+ },
864
+ {
865
+ "epoch": 0.14,
866
+ "learning_rate": 0.0006102657461698505,
867
+ "loss": 2.953,
868
+ "theoretical_loss": 4.031106941821218,
869
+ "tokens_seen": 383385600
870
+ },
871
+ {
872
+ "epoch": 0.14,
873
+ "learning_rate": 0.0006094382447512768,
874
+ "loss": 2.9048,
875
+ "theoretical_loss": 4.027245832410079,
876
+ "tokens_seen": 386662400
877
+ },
878
+ {
879
+ "epoch": 0.14,
880
+ "learning_rate": 0.0006086107433327029,
881
+ "loss": 2.8699,
882
+ "theoretical_loss": 4.023426380688943,
883
+ "tokens_seen": 389939200
884
+ },
885
+ {
886
+ "epoch": 0.14,
887
+ "learning_rate": 0.000607783241914129,
888
+ "loss": 2.9089,
889
+ "theoretical_loss": 4.019647792769048,
890
+ "tokens_seen": 393216000
891
+ },
892
+ {
893
+ "epoch": 0.14,
894
+ "learning_rate": 0.0006069722905239265,
895
+ "loss": 2.9903,
896
+ "theoretical_loss": 4.015909296346521,
897
+ "tokens_seen": 396492800
898
+ },
899
+ {
900
+ "epoch": 0.14,
901
+ "learning_rate": 0.0006061447891053528,
902
+ "loss": 2.9199,
903
+ "theoretical_loss": 4.012210139942894,
904
+ "tokens_seen": 399769600
905
+ },
906
+ {
907
+ "epoch": 0.14,
908
+ "learning_rate": 0.0006053172876867789,
909
+ "loss": 2.9731,
910
+ "theoretical_loss": 4.008549592178291,
911
+ "tokens_seen": 403046400
912
+ },
913
+ {
914
+ "epoch": 0.15,
915
+ "learning_rate": 0.000604489786268205,
916
+ "loss": 2.9404,
917
+ "theoretical_loss": 4.004926941075674,
918
+ "tokens_seen": 406323200
919
+ },
920
+ {
921
+ "epoch": 0.15,
922
+ "learning_rate": 0.0006036622848496311,
923
+ "loss": 2.9768,
924
+ "theoretical_loss": 4.001341493394558,
925
+ "tokens_seen": 409600000
926
+ },
927
+ {
928
+ "epoch": 0.15,
929
+ "learning_rate": 0.0006028347834310573,
930
+ "loss": 2.9789,
931
+ "theoretical_loss": 3.997792573992726,
932
+ "tokens_seen": 412876800
933
+ },
934
+ {
935
+ "epoch": 0.15,
936
+ "learning_rate": 0.0006020072820124835,
937
+ "loss": 2.9275,
938
+ "theoretical_loss": 3.994279525214554,
939
+ "tokens_seen": 416153600
940
+ },
941
+ {
942
+ "epoch": 0.15,
943
+ "learning_rate": 0.0006011797805939096,
944
+ "loss": 2.9751,
945
+ "theoretical_loss": 3.990801706304647,
946
+ "tokens_seen": 419430400
947
+ },
948
+ {
949
+ "epoch": 0.15,
950
+ "learning_rate": 0.0006003522791753358,
951
+ "loss": 2.9311,
952
+ "theoretical_loss": 3.987358492845532,
953
+ "tokens_seen": 422707200
954
+ },
955
+ {
956
+ "epoch": 0.15,
957
+ "learning_rate": 0.0005995247777567619,
958
+ "loss": 2.9599,
959
+ "theoretical_loss": 3.9839492762182647,
960
+ "tokens_seen": 425984000
961
+ },
962
+ {
963
+ "epoch": 0.15,
964
+ "learning_rate": 0.000598697276338188,
965
+ "loss": 2.9319,
966
+ "theoretical_loss": 3.9805734630848306,
967
+ "tokens_seen": 429260800
968
+ },
969
+ {
970
+ "epoch": 0.15,
971
+ "learning_rate": 0.0005978697749196142,
972
+ "loss": 2.9141,
973
+ "theoretical_loss": 3.9772304748913054,
974
+ "tokens_seen": 432537600
975
+ },
976
+ {
977
+ "epoch": 0.16,
978
+ "learning_rate": 0.0005970422735010403,
979
+ "loss": 2.9076,
980
+ "theoretical_loss": 3.973919747390801,
981
+ "tokens_seen": 435814400
982
+ },
983
+ {
984
+ "epoch": 0.16,
985
+ "learning_rate": 0.0005962147720824664,
986
+ "loss": 2.918,
987
+ "theoretical_loss": 3.9706407301852487,
988
+ "tokens_seen": 439091200
989
+ },
990
+ {
991
+ "epoch": 0.16,
992
+ "learning_rate": 0.0005953872706638926,
993
+ "loss": 2.8962,
994
+ "theoretical_loss": 3.9673928862851655,
995
+ "tokens_seen": 442368000
996
+ },
997
+ {
998
+ "epoch": 0.16,
999
+ "learning_rate": 0.0005945597692453187,
1000
+ "loss": 2.887,
1001
+ "theoretical_loss": 3.9641756916865463,
1002
+ "tokens_seen": 445644800
1003
+ },
1004
+ {
1005
+ "epoch": 0.16,
1006
+ "learning_rate": 0.0005937322678267449,
1007
+ "loss": 2.8944,
1008
+ "theoretical_loss": 3.960988634964113,
1009
+ "tokens_seen": 448921600
1010
+ },
1011
+ {
1012
+ "epoch": 0.16,
1013
+ "learning_rate": 0.000592904766408171,
1014
+ "loss": 2.8689,
1015
+ "theoretical_loss": 3.9578312168801597,
1016
+ "tokens_seen": 452198400
1017
+ },
1018
+ {
1019
+ "epoch": 0.16,
1020
+ "learning_rate": 0.0005920772649895971,
1021
+ "loss": 2.8671,
1022
+ "theoretical_loss": 3.954702950008308,
1023
+ "tokens_seen": 455475200
1024
+ },
1025
+ {
1026
+ "epoch": 0.16,
1027
+ "learning_rate": 0.0005912497635710232,
1028
+ "loss": 2.8527,
1029
+ "theoretical_loss": 3.9516033583714734,
1030
+ "tokens_seen": 458752000
1031
+ },
1032
+ {
1033
+ "epoch": 0.17,
1034
+ "learning_rate": 0.0005904222621524494,
1035
+ "loss": 2.8787,
1036
+ "theoretical_loss": 3.9485319770934355,
1037
+ "tokens_seen": 462028800
1038
+ },
1039
+ {
1040
+ "epoch": 0.17,
1041
+ "learning_rate": 0.0005895947607338756,
1042
+ "loss": 2.8612,
1043
+ "theoretical_loss": 3.945488352063391,
1044
+ "tokens_seen": 465305600
1045
+ },
1046
+ {
1047
+ "epoch": 0.17,
1048
+ "learning_rate": 0.0005887672593153017,
1049
+ "loss": 2.8394,
1050
+ "theoretical_loss": 3.942472039612926,
1051
+ "tokens_seen": 468582400
1052
+ },
1053
+ {
1054
+ "epoch": 0.17,
1055
+ "learning_rate": 0.0005879397578967278,
1056
+ "loss": 2.8409,
1057
+ "theoretical_loss": 3.939482606204863,
1058
+ "tokens_seen": 471859200
1059
+ },
1060
+ {
1061
+ "epoch": 0.17,
1062
+ "learning_rate": 0.0005871122564781539,
1063
+ "loss": 2.8556,
1064
+ "theoretical_loss": 3.936519628133466,
1065
+ "tokens_seen": 475136000
1066
+ },
1067
+ {
1068
+ "epoch": 0.17,
1069
+ "learning_rate": 0.00058628475505958,
1070
+ "loss": 2.8919,
1071
+ "theoretical_loss": 3.9335826912355114,
1072
+ "tokens_seen": 478412800
1073
+ },
1074
+ {
1075
+ "epoch": 0.17,
1076
+ "learning_rate": 0.0005854572536410061,
1077
+ "loss": 2.9328,
1078
+ "theoretical_loss": 3.93067139061177,
1079
+ "tokens_seen": 481689600
1080
+ },
1081
+ {
1082
+ "epoch": 0.17,
1083
+ "learning_rate": 0.0005846297522224324,
1084
+ "loss": 2.859,
1085
+ "theoretical_loss": 3.927785330358441,
1086
+ "tokens_seen": 484966400
1087
+ },
1088
+ {
1089
+ "epoch": 0.17,
1090
+ "learning_rate": 0.0005838022508038585,
1091
+ "loss": 2.8503,
1092
+ "theoretical_loss": 3.9249241233081333,
1093
+ "tokens_seen": 488243200
1094
+ },
1095
+ {
1096
+ "epoch": 0.18,
1097
+ "objective/train/avg_token_score": 0.004644907079637051,
1098
+ "objective/train/avg_weight": 0.9962868094444275,
1099
+ "objective/train/docs_used": 287192,
1100
+ "objective/train/instantaneous_batch_size": 32,
1101
+ "objective/train/instantaneous_microbatch_size": 32768,
1102
+ "objective/train/original_loss": 3.003862142562866,
1103
+ "objective/train/std_weight": 0.018475370481610298,
1104
+ "objective/train/theoretical_loss": 3.92208739077998,
1105
+ "objective/train/tokens_used": 511980000,
1106
+ "theoretical_loss": 3.92208739077998,
1107
+ "tokens_seen": 491520000
1108
+ },
1109
+ {
1110
+ "epoch": 0.18,
1111
+ "learning_rate": 0.0005829747493852846,
1112
+ "loss": 2.8251,
1113
+ "theoretical_loss": 3.92208739077998,
1114
+ "tokens_seen": 491520000
1115
+ },
1116
+ {
1117
+ "epoch": 0.18,
1118
+ "learning_rate": 0.0005821472479667107,
1119
+ "loss": 2.8393,
1120
+ "theoretical_loss": 3.919274762338519,
1121
+ "tokens_seen": 494796800
1122
+ },
1123
+ {
1124
+ "epoch": 0.18,
1125
+ "learning_rate": 0.000581319746548137,
1126
+ "loss": 2.8629,
1127
+ "theoretical_loss": 3.9164858755609613,
1128
+ "tokens_seen": 498073600
1129
+ },
1130
+ {
1131
+ "epoch": 0.18,
1132
+ "learning_rate": 0.0005804922451295631,
1133
+ "loss": 2.8465,
1134
+ "theoretical_loss": 3.9137203758125176,
1135
+ "tokens_seen": 501350400
1136
+ },
1137
+ {
1138
+ "epoch": 0.18,
1139
+ "learning_rate": 0.0005796647437109892,
1140
+ "loss": 2.8088,
1141
+ "theoretical_loss": 3.910977916029439,
1142
+ "tokens_seen": 504627200
1143
+ },
1144
+ {
1145
+ "epoch": 0.18,
1146
+ "learning_rate": 0.0005788372422924154,
1147
+ "loss": 2.8193,
1148
+ "theoretical_loss": 3.908258156509472,
1149
+ "tokens_seen": 507904000
1150
+ },
1151
+ {
1152
+ "epoch": 0.18,
1153
+ "learning_rate": 0.0005780097408738415,
1154
+ "loss": 2.8096,
1155
+ "theoretical_loss": 3.905560764709417,
1156
+ "tokens_seen": 511180800
1157
+ },
1158
+ {
1159
+ "epoch": 0.18,
1160
+ "learning_rate": 0.0005771822394552677,
1161
+ "loss": 2.7896,
1162
+ "theoretical_loss": 3.9028854150495143,
1163
+ "tokens_seen": 514457600
1164
+ },
1165
+ {
1166
+ "epoch": 0.18,
1167
+ "learning_rate": 0.0005763547380366938,
1168
+ "loss": 2.8133,
1169
+ "theoretical_loss": 3.9002317887243834,
1170
+ "tokens_seen": 517734400
1171
+ },
1172
+ {
1173
+ "epoch": 0.19,
1174
+ "learning_rate": 0.0005755272366181199,
1175
+ "loss": 2.8119,
1176
+ "theoretical_loss": 3.897599573520247,
1177
+ "tokens_seen": 521011200
1178
+ },
1179
+ {
1180
+ "epoch": 0.19,
1181
+ "learning_rate": 0.0005746997351995461,
1182
+ "loss": 2.8456,
1183
+ "theoretical_loss": 3.8949884636382106,
1184
+ "tokens_seen": 524288000
1185
+ },
1186
+ {
1187
+ "epoch": 0.19,
1188
+ "learning_rate": 0.0005738722337809722,
1189
+ "loss": 2.8278,
1190
+ "theoretical_loss": 3.892398159523345,
1191
+ "tokens_seen": 527564800
1192
+ },
1193
+ {
1194
+ "epoch": 0.19,
1195
+ "learning_rate": 0.0005730447323623984,
1196
+ "loss": 2.8807,
1197
+ "theoretical_loss": 3.889828367699349,
1198
+ "tokens_seen": 530841600
1199
+ },
1200
+ {
1201
+ "epoch": 0.19,
1202
+ "learning_rate": 0.0005722172309438245,
1203
+ "loss": 2.8198,
1204
+ "theoretical_loss": 3.8872788006085894,
1205
+ "tokens_seen": 534118400
1206
+ },
1207
+ {
1208
+ "epoch": 0.19,
1209
+ "learning_rate": 0.0005713897295252506,
1210
+ "loss": 2.8464,
1211
+ "theoretical_loss": 3.8847491764572926,
1212
+ "tokens_seen": 537395200
1213
+ },
1214
+ {
1215
+ "epoch": 0.19,
1216
+ "learning_rate": 0.0005705622281066767,
1217
+ "loss": 2.8284,
1218
+ "theoretical_loss": 3.882239219065708,
1219
+ "tokens_seen": 540672000
1220
+ },
1221
+ {
1222
+ "epoch": 0.19,
1223
+ "learning_rate": 0.0005697347266881029,
1224
+ "loss": 2.8548,
1225
+ "theoretical_loss": 3.879748657723039,
1226
+ "tokens_seen": 543948800
1227
+ },
1228
+ {
1229
+ "epoch": 0.2,
1230
+ "learning_rate": 0.0005689072252695291,
1231
+ "loss": 2.8829,
1232
+ "theoretical_loss": 3.8772772270469824,
1233
+ "tokens_seen": 547225600
1234
+ },
1235
+ {
1236
+ "epoch": 0.2,
1237
+ "learning_rate": 0.0005680797238509552,
1238
+ "loss": 2.8451,
1239
+ "theoretical_loss": 3.8748246668476827,
1240
+ "tokens_seen": 550502400
1241
+ },
1242
+ {
1243
+ "epoch": 0.2,
1244
+ "learning_rate": 0.0005672522224323813,
1245
+ "loss": 2.8627,
1246
+ "theoretical_loss": 3.8723907219959486,
1247
+ "tokens_seen": 553779200
1248
+ },
1249
+ {
1250
+ "epoch": 0.2,
1251
+ "learning_rate": 0.0005664247210138074,
1252
+ "loss": 2.8789,
1253
+ "theoretical_loss": 3.869975142295573,
1254
+ "tokens_seen": 557056000
1255
+ },
1256
+ {
1257
+ "epoch": 0.2,
1258
+ "learning_rate": 0.0005655972195952335,
1259
+ "loss": 2.9283,
1260
+ "theoretical_loss": 3.8675776823595998,
1261
+ "tokens_seen": 560332800
1262
+ },
1263
+ {
1264
+ "epoch": 0.2,
1265
+ "learning_rate": 0.0005647697181766596,
1266
+ "loss": 2.8956,
1267
+ "theoretical_loss": 3.8651981014904027,
1268
+ "tokens_seen": 563609600
1269
+ },
1270
+ {
1271
+ "epoch": 0.2,
1272
+ "learning_rate": 0.0005639422167580859,
1273
+ "loss": 2.8693,
1274
+ "theoretical_loss": 3.8628361635634265,
1275
+ "tokens_seen": 566886400
1276
+ },
1277
+ {
1278
+ "epoch": 0.2,
1279
+ "learning_rate": 0.000563114715339512,
1280
+ "loss": 2.8919,
1281
+ "theoretical_loss": 3.8604916369144666,
1282
+ "tokens_seen": 570163200
1283
+ },
1284
+ {
1285
+ "epoch": 0.2,
1286
+ "learning_rate": 0.0005622872139209381,
1287
+ "loss": 2.8761,
1288
+ "theoretical_loss": 3.858164294230354,
1289
+ "tokens_seen": 573440000
1290
+ },
1291
+ {
1292
+ "epoch": 0.21,
1293
+ "learning_rate": 0.0005614597125023642,
1294
+ "loss": 2.8493,
1295
+ "theoretical_loss": 3.85585391244293,
1296
+ "tokens_seen": 576716800
1297
+ },
1298
+ {
1299
+ "epoch": 0.21,
1300
+ "learning_rate": 0.0005606322110837905,
1301
+ "loss": 2.8528,
1302
+ "theoretical_loss": 3.8535602726261864,
1303
+ "tokens_seen": 579993600
1304
+ },
1305
+ {
1306
+ "epoch": 0.21,
1307
+ "learning_rate": 0.0005598047096652166,
1308
+ "loss": 2.8398,
1309
+ "theoretical_loss": 3.851283159896468,
1310
+ "tokens_seen": 583270400
1311
+ },
1312
+ {
1313
+ "epoch": 0.21,
1314
+ "learning_rate": 0.0005589772082466427,
1315
+ "loss": 2.8225,
1316
+ "theoretical_loss": 3.8490223633156173,
1317
+ "tokens_seen": 586547200
1318
+ },
1319
+ {
1320
+ "epoch": 0.21,
1321
+ "learning_rate": 0.0005581497068280688,
1322
+ "loss": 2.8074,
1323
+ "theoretical_loss": 3.846777675796974,
1324
+ "tokens_seen": 589824000
1325
+ },
1326
+ {
1327
+ "epoch": 0.21,
1328
+ "learning_rate": 0.000557322205409495,
1329
+ "loss": 2.8092,
1330
+ "theoretical_loss": 3.844548894014116,
1331
+ "tokens_seen": 593100800
1332
+ },
1333
+ {
1334
+ "epoch": 0.21,
1335
+ "learning_rate": 0.0005564947039909212,
1336
+ "loss": 2.8185,
1337
+ "theoretical_loss": 3.8423358183122582,
1338
+ "tokens_seen": 596377600
1339
+ },
1340
+ {
1341
+ "epoch": 0.21,
1342
+ "learning_rate": 0.0005556672025723473,
1343
+ "loss": 2.7918,
1344
+ "theoretical_loss": 3.840138252622208,
1345
+ "tokens_seen": 599654400
1346
+ },
1347
+ {
1348
+ "epoch": 0.22,
1349
+ "learning_rate": 0.0005548397011537734,
1350
+ "loss": 2.8229,
1351
+ "theoretical_loss": 3.837956004376799,
1352
+ "tokens_seen": 602931200
1353
+ },
1354
+ {
1355
+ "epoch": 0.22,
1356
+ "learning_rate": 0.0005540121997351996,
1357
+ "loss": 2.7877,
1358
+ "theoretical_loss": 3.8357888844297094,
1359
+ "tokens_seen": 606208000
1360
+ },
1361
+ {
1362
+ "epoch": 0.22,
1363
+ "learning_rate": 0.0005531846983166257,
1364
+ "loss": 2.8112,
1365
+ "theoretical_loss": 3.8336367069765958,
1366
+ "tokens_seen": 609484800
1367
+ },
1368
+ {
1369
+ "epoch": 0.22,
1370
+ "learning_rate": 0.0005523571968980519,
1371
+ "loss": 2.7626,
1372
+ "theoretical_loss": 3.8314992894784536,
1373
+ "tokens_seen": 612761600
1374
+ },
1375
+ {
1376
+ "epoch": 0.22,
1377
+ "learning_rate": 0.000551529695479478,
1378
+ "loss": 2.8083,
1379
+ "theoretical_loss": 3.829376452587134,
1380
+ "tokens_seen": 616038400
1381
+ },
1382
+ {
1383
+ "epoch": 0.22,
1384
+ "learning_rate": 0.0005507021940609041,
1385
+ "loss": 2.8228,
1386
+ "theoretical_loss": 3.827268020072948,
1387
+ "tokens_seen": 619315200
1388
+ },
1389
+ {
1390
+ "epoch": 0.22,
1391
+ "learning_rate": 0.0005498746926423302,
1392
+ "loss": 2.8554,
1393
+ "theoretical_loss": 3.8251738187542843,
1394
+ "tokens_seen": 622592000
1395
+ },
1396
+ {
1397
+ "epoch": 0.22,
1398
+ "learning_rate": 0.0005490471912237563,
1399
+ "loss": 2.8388,
1400
+ "theoretical_loss": 3.8230936784291787,
1401
+ "tokens_seen": 625868800
1402
+ },
1403
+ {
1404
+ "epoch": 0.22,
1405
+ "learning_rate": 0.0005482196898051826,
1406
+ "loss": 2.8705,
1407
+ "theoretical_loss": 3.8210274318087656,
1408
+ "tokens_seen": 629145600
1409
+ },
1410
+ {
1411
+ "epoch": 0.23,
1412
+ "learning_rate": 0.0005473921883866087,
1413
+ "loss": 2.8195,
1414
+ "theoretical_loss": 3.818974914452557,
1415
+ "tokens_seen": 632422400
1416
+ },
1417
+ {
1418
+ "epoch": 0.23,
1419
+ "learning_rate": 0.0005465646869680348,
1420
+ "loss": 2.8152,
1421
+ "theoretical_loss": 3.8169359647054835,
1422
+ "tokens_seen": 635699200
1423
+ },
1424
+ {
1425
+ "epoch": 0.23,
1426
+ "learning_rate": 0.0005457371855494609,
1427
+ "loss": 2.7996,
1428
+ "theoretical_loss": 3.8149104236366433,
1429
+ "tokens_seen": 638976000
1430
+ },
1431
+ {
1432
+ "epoch": 0.23,
1433
+ "learning_rate": 0.000544909684130887,
1434
+ "loss": 2.7787,
1435
+ "theoretical_loss": 3.8128981349797098,
1436
+ "tokens_seen": 642252800
1437
+ },
1438
+ {
1439
+ "epoch": 0.23,
1440
+ "learning_rate": 0.0005440821827123131,
1441
+ "loss": 2.8041,
1442
+ "theoretical_loss": 3.8108989450749293,
1443
+ "tokens_seen": 645529600
1444
+ },
1445
+ {
1446
+ "epoch": 0.23,
1447
+ "learning_rate": 0.0005432546812937394,
1448
+ "loss": 2.7924,
1449
+ "theoretical_loss": 3.8089127028126764,
1450
+ "tokens_seen": 648806400
1451
+ },
1452
+ {
1453
+ "epoch": 0.23,
1454
+ "learning_rate": 0.0005424271798751655,
1455
+ "loss": 2.8324,
1456
+ "theoretical_loss": 3.8069392595785083,
1457
+ "tokens_seen": 652083200
1458
+ },
1459
+ {
1460
+ "debugging/Self-BLEU-5": 0.5265375629586004,
1461
+ "debugging/distinct-1-grams": 0.7435820408094715,
1462
+ "debugging/distinct-2-grams": 0.9558103821233092,
1463
+ "debugging/entropy-1-grams": 5.931434510687563,
1464
+ "debugging/entropy-2-grams": 6.886416755326388,
1465
+ "debugging/length": 521.9230769230769,
1466
+ "debugging/num_segments": 13,
1467
+ "epoch": 0.23,
1468
+ "objective/train/avg_token_score": 0.022742915898561478,
1469
+ "objective/train/avg_weight": 0.9818012714385986,
1470
+ "objective/train/docs_used": 379091,
1471
+ "objective/train/instantaneous_batch_size": 32,
1472
+ "objective/train/instantaneous_microbatch_size": 32768,
1473
+ "objective/train/original_loss": 3.068922758102417,
1474
+ "objective/train/std_weight": 0.06274868547916412,
1475
+ "objective/train/theoretical_loss": 3.804978469199669,
1476
+ "objective/train/tokens_used": 675820000,
1477
+ "theoretical_loss": 3.804978469199669,
1478
+ "tokens_seen": 655360000
1479
+ },
1480
+ {
1481
+ "epoch": 0.23,
1482
+ "learning_rate": 0.0005415996784565916,
1483
+ "loss": 2.8223,
1484
+ "theoretical_loss": 3.804978469199669,
1485
+ "tokens_seen": 655360000
1486
+ },
1487
+ {
1488
+ "epoch": 0.24,
1489
+ "learning_rate": 0.0005407721770380177,
1490
+ "loss": 2.837,
1491
+ "theoretical_loss": 3.803030187893005,
1492
+ "tokens_seen": 658636800
1493
+ },
1494
+ {
1495
+ "epoch": 0.24,
1496
+ "learning_rate": 0.0005399612256478154,
1497
+ "loss": 2.8345,
1498
+ "theoretical_loss": 3.8010942742142415,
1499
+ "tokens_seen": 661913600
1500
+ },
1501
+ {
1502
+ "epoch": 0.24,
1503
+ "learning_rate": 0.0005391337242292415,
1504
+ "loss": 2.8643,
1505
+ "theoretical_loss": 3.799170589008585,
1506
+ "tokens_seen": 665190400
1507
+ },
1508
+ {
1509
+ "epoch": 0.24,
1510
+ "learning_rate": 0.0005383062228106677,
1511
+ "loss": 2.8481,
1512
+ "theoretical_loss": 3.7972589953626006,
1513
+ "tokens_seen": 668467200
1514
+ },
1515
+ {
1516
+ "epoch": 0.24,
1517
+ "learning_rate": 0.0005374787213920938,
1518
+ "loss": 2.8602,
1519
+ "theoretical_loss": 3.795359358557337,
1520
+ "tokens_seen": 671744000
1521
+ },
1522
+ {
1523
+ "epoch": 0.24,
1524
+ "learning_rate": 0.0005366512199735199,
1525
+ "loss": 2.8011,
1526
+ "theoretical_loss": 3.79347154602265,
1527
+ "tokens_seen": 675020800
1528
+ },
1529
+ {
1530
+ "epoch": 0.24,
1531
+ "learning_rate": 0.0005358237185549461,
1532
+ "loss": 2.8665,
1533
+ "theoretical_loss": 3.7915954272926955,
1534
+ "tokens_seen": 678297600
1535
+ },
1536
+ {
1537
+ "epoch": 0.24,
1538
+ "learning_rate": 0.0005350127671647437,
1539
+ "loss": 2.8685,
1540
+ "theoretical_loss": 3.789730873962557,
1541
+ "tokens_seen": 681574400
1542
+ },
1543
+ {
1544
+ "epoch": 0.24,
1545
+ "learning_rate": 0.0005341852657461698,
1546
+ "loss": 2.7516,
1547
+ "theoretical_loss": 3.787877759645963,
1548
+ "tokens_seen": 684851200
1549
+ },
1550
+ {
1551
+ "epoch": 0.25,
1552
+ "learning_rate": 0.000533357764327596,
1553
+ "loss": 2.7768,
1554
+ "theoretical_loss": 3.7860359599340776,
1555
+ "tokens_seen": 688128000
1556
+ },
1557
+ {
1558
+ "epoch": 0.25,
1559
+ "learning_rate": 0.0005325302629090222,
1560
+ "loss": 2.8109,
1561
+ "theoretical_loss": 3.784205352355321,
1562
+ "tokens_seen": 691404800
1563
+ },
1564
+ {
1565
+ "epoch": 0.25,
1566
+ "learning_rate": 0.0005317027614904483,
1567
+ "loss": 2.8397,
1568
+ "theoretical_loss": 3.782385816336189,
1569
+ "tokens_seen": 694681600
1570
+ },
1571
+ {
1572
+ "epoch": 0.25,
1573
+ "learning_rate": 0.0005308752600718744,
1574
+ "loss": 2.7955,
1575
+ "theoretical_loss": 3.7805772331630516,
1576
+ "tokens_seen": 697958400
1577
+ },
1578
+ {
1579
+ "epoch": 0.25,
1580
+ "learning_rate": 0.0005300477586533005,
1581
+ "loss": 2.7725,
1582
+ "theoretical_loss": 3.7787794859448898,
1583
+ "tokens_seen": 701235200
1584
+ },
1585
+ {
1586
+ "epoch": 0.25,
1587
+ "learning_rate": 0.0005292202572347266,
1588
+ "loss": 2.754,
1589
+ "theoretical_loss": 3.7769924595769546,
1590
+ "tokens_seen": 704512000
1591
+ },
1592
+ {
1593
+ "epoch": 0.25,
1594
+ "learning_rate": 0.0005283927558161528,
1595
+ "loss": 2.7445,
1596
+ "theoretical_loss": 3.7752160407053115,
1597
+ "tokens_seen": 707788800
1598
+ },
1599
+ {
1600
+ "epoch": 0.25,
1601
+ "learning_rate": 0.000527565254397579,
1602
+ "loss": 2.7262,
1603
+ "theoretical_loss": 3.7734501176922493,
1604
+ "tokens_seen": 711065600
1605
+ },
1606
+ {
1607
+ "epoch": 0.26,
1608
+ "learning_rate": 0.0005267377529790051,
1609
+ "loss": 2.7704,
1610
+ "theoretical_loss": 3.7716945805825337,
1611
+ "tokens_seen": 714342400
1612
+ },
1613
+ {
1614
+ "epoch": 0.26,
1615
+ "learning_rate": 0.0005259102515604312,
1616
+ "loss": 2.8113,
1617
+ "theoretical_loss": 3.7699493210704667,
1618
+ "tokens_seen": 717619200
1619
+ },
1620
+ {
1621
+ "epoch": 0.26,
1622
+ "learning_rate": 0.0005250827501418573,
1623
+ "loss": 2.7852,
1624
+ "theoretical_loss": 3.7682142324677455,
1625
+ "tokens_seen": 720896000
1626
+ },
1627
+ {
1628
+ "epoch": 0.26,
1629
+ "learning_rate": 0.0005242552487232835,
1630
+ "loss": 2.8514,
1631
+ "theoretical_loss": 3.7664892096720886,
1632
+ "tokens_seen": 724172800
1633
+ },
1634
+ {
1635
+ "epoch": 0.26,
1636
+ "learning_rate": 0.0005234277473047097,
1637
+ "loss": 2.7829,
1638
+ "theoretical_loss": 3.7647741491366067,
1639
+ "tokens_seen": 727449600
1640
+ },
1641
+ {
1642
+ "epoch": 0.26,
1643
+ "learning_rate": 0.0005226002458861358,
1644
+ "loss": 2.7945,
1645
+ "theoretical_loss": 3.7630689488399027,
1646
+ "tokens_seen": 730726400
1647
+ },
1648
+ {
1649
+ "epoch": 0.26,
1650
+ "learning_rate": 0.0005217727444675619,
1651
+ "loss": 2.7599,
1652
+ "theoretical_loss": 3.7613735082568764,
1653
+ "tokens_seen": 734003200
1654
+ },
1655
+ {
1656
+ "epoch": 0.26,
1657
+ "learning_rate": 0.0005209452430489881,
1658
+ "loss": 2.7937,
1659
+ "theoretical_loss": 3.759687728330217,
1660
+ "tokens_seen": 737280000
1661
+ },
1662
+ {
1663
+ "epoch": 0.26,
1664
+ "learning_rate": 0.0005201177416304143,
1665
+ "loss": 2.7877,
1666
+ "theoretical_loss": 3.75801151144256,
1667
+ "tokens_seen": 740556800
1668
+ },
1669
+ {
1670
+ "epoch": 0.27,
1671
+ "learning_rate": 0.0005192902402118404,
1672
+ "loss": 2.7713,
1673
+ "theoretical_loss": 3.756344761389295,
1674
+ "tokens_seen": 743833600
1675
+ },
1676
+ {
1677
+ "epoch": 0.27,
1678
+ "learning_rate": 0.0005184627387932665,
1679
+ "loss": 2.7429,
1680
+ "theoretical_loss": 3.754687383352003,
1681
+ "tokens_seen": 747110400
1682
+ },
1683
+ {
1684
+ "epoch": 0.27,
1685
+ "learning_rate": 0.0005176352373746927,
1686
+ "loss": 2.747,
1687
+ "theoretical_loss": 3.7530392838725097,
1688
+ "tokens_seen": 750387200
1689
+ },
1690
+ {
1691
+ "epoch": 0.27,
1692
+ "learning_rate": 0.0005168077359561188,
1693
+ "loss": 2.7654,
1694
+ "theoretical_loss": 3.751400370827529,
1695
+ "tokens_seen": 753664000
1696
+ },
1697
+ {
1698
+ "epoch": 0.27,
1699
+ "learning_rate": 0.000515980234537545,
1700
+ "loss": 2.7878,
1701
+ "theoretical_loss": 3.749770553403895,
1702
+ "tokens_seen": 756940800
1703
+ },
1704
+ {
1705
+ "epoch": 0.27,
1706
+ "learning_rate": 0.0005151527331189711,
1707
+ "loss": 2.7522,
1708
+ "theoretical_loss": 3.748149742074355,
1709
+ "tokens_seen": 760217600
1710
+ },
1711
+ {
1712
+ "epoch": 0.27,
1713
+ "learning_rate": 0.0005143252317003972,
1714
+ "loss": 2.782,
1715
+ "theoretical_loss": 3.746537848573908,
1716
+ "tokens_seen": 763494400
1717
+ },
1718
+ {
1719
+ "epoch": 0.27,
1720
+ "learning_rate": 0.0005134977302818233,
1721
+ "loss": 2.7967,
1722
+ "theoretical_loss": 3.744934785876686,
1723
+ "tokens_seen": 766771200
1724
+ },
1725
+ {
1726
+ "epoch": 0.28,
1727
+ "learning_rate": 0.0005126702288632494,
1728
+ "loss": 2.802,
1729
+ "theoretical_loss": 3.7433404681733475,
1730
+ "tokens_seen": 770048000
1731
+ },
1732
+ {
1733
+ "epoch": 0.28,
1734
+ "learning_rate": 0.0005118427274446757,
1735
+ "loss": 2.8081,
1736
+ "theoretical_loss": 3.7417548108489846,
1737
+ "tokens_seen": 773324800
1738
+ },
1739
+ {
1740
+ "epoch": 0.28,
1741
+ "learning_rate": 0.0005110152260261018,
1742
+ "loss": 2.781,
1743
+ "theoretical_loss": 3.740177730461517,
1744
+ "tokens_seen": 776601600
1745
+ },
1746
+ {
1747
+ "epoch": 0.28,
1748
+ "learning_rate": 0.0005101877246075279,
1749
+ "loss": 2.8304,
1750
+ "theoretical_loss": 3.73860914472057,
1751
+ "tokens_seen": 779878400
1752
+ },
1753
+ {
1754
+ "epoch": 0.28,
1755
+ "learning_rate": 0.000509360223188954,
1756
+ "loss": 2.7936,
1757
+ "theoretical_loss": 3.7370489724668197,
1758
+ "tokens_seen": 783155200
1759
+ },
1760
+ {
1761
+ "epoch": 0.28,
1762
+ "learning_rate": 0.0005085327217703801,
1763
+ "loss": 2.8074,
1764
+ "theoretical_loss": 3.735497133651788,
1765
+ "tokens_seen": 786432000
1766
+ },
1767
+ {
1768
+ "epoch": 0.28,
1769
+ "learning_rate": 0.0005077052203518063,
1770
+ "loss": 2.7948,
1771
+ "theoretical_loss": 3.733953549318091,
1772
+ "tokens_seen": 789708800
1773
+ },
1774
+ {
1775
+ "epoch": 0.28,
1776
+ "learning_rate": 0.0005068777189332325,
1777
+ "loss": 2.8081,
1778
+ "theoretical_loss": 3.7324181415801094,
1779
+ "tokens_seen": 792985600
1780
+ },
1781
+ {
1782
+ "epoch": 0.28,
1783
+ "learning_rate": 0.0005060502175146586,
1784
+ "loss": 2.8048,
1785
+ "theoretical_loss": 3.7308908336050814,
1786
+ "tokens_seen": 796262400
1787
+ },
1788
+ {
1789
+ "epoch": 0.29,
1790
+ "learning_rate": 0.0005052227160960847,
1791
+ "loss": 2.7886,
1792
+ "theoretical_loss": 3.729371549594614,
1793
+ "tokens_seen": 799539200
1794
+ },
1795
+ {
1796
+ "epoch": 0.29,
1797
+ "learning_rate": 0.0005043952146775108,
1798
+ "loss": 2.8075,
1799
+ "theoretical_loss": 3.7278602147665776,
1800
+ "tokens_seen": 802816000
1801
+ },
1802
+ {
1803
+ "epoch": 0.29,
1804
+ "learning_rate": 0.000503567713258937,
1805
+ "loss": 2.7927,
1806
+ "theoretical_loss": 3.726356755337407,
1807
+ "tokens_seen": 806092800
1808
+ },
1809
+ {
1810
+ "epoch": 0.29,
1811
+ "learning_rate": 0.0005027402118403631,
1812
+ "loss": 2.7687,
1813
+ "theoretical_loss": 3.724861098504767,
1814
+ "tokens_seen": 809369600
1815
+ },
1816
+ {
1817
+ "epoch": 0.29,
1818
+ "learning_rate": 0.0005019127104217893,
1819
+ "loss": 2.748,
1820
+ "theoretical_loss": 3.7233731724305974,
1821
+ "tokens_seen": 812646400
1822
+ },
1823
+ {
1824
+ "epoch": 0.29,
1825
+ "learning_rate": 0.0005010852090032154,
1826
+ "loss": 2.7755,
1827
+ "theoretical_loss": 3.7218929062245105,
1828
+ "tokens_seen": 815923200
1829
+ },
1830
+ {
1831
+ "epoch": 0.29,
1832
+ "objective/train/avg_token_score": 0.009068925864994526,
1833
+ "objective/train/avg_weight": 0.9927405714988708,
1834
+ "objective/train/docs_used": 471128,
1835
+ "objective/train/instantaneous_batch_size": 32,
1836
+ "objective/train/instantaneous_microbatch_size": 32768,
1837
+ "objective/train/original_loss": 2.671570301055908,
1838
+ "objective/train/std_weight": 0.04655551165342331,
1839
+ "objective/train/theoretical_loss": 3.7204202299275475,
1840
+ "objective/train/tokens_used": 839660000,
1841
+ "theoretical_loss": 3.7204202299275475,
1842
+ "tokens_seen": 819200000
1843
+ },
1844
+ {
1845
+ "epoch": 0.29,
1846
+ "learning_rate": 0.0005002577075846415,
1847
+ "loss": 2.7207,
1848
+ "theoretical_loss": 3.7204202299275475,
1849
+ "tokens_seen": 819200000
1850
+ },
1851
+ {
1852
+ "epoch": 0.29,
1853
+ "learning_rate": 0.0004994302061660678,
1854
+ "loss": 2.744,
1855
+ "theoretical_loss": 3.7189550744962707,
1856
+ "tokens_seen": 822476800
1857
+ },
1858
+ {
1859
+ "epoch": 0.29,
1860
+ "learning_rate": 0.0004986027047474939,
1861
+ "loss": 2.7245,
1862
+ "theoretical_loss": 3.717497371787192,
1863
+ "tokens_seen": 825753600
1864
+ },
1865
+ {
1866
+ "epoch": 0.3,
1867
+ "learning_rate": 0.00049777520332892,
1868
+ "loss": 2.7221,
1869
+ "theoretical_loss": 3.7160470545415274,
1870
+ "tokens_seen": 829030400
1871
+ },
1872
+ {
1873
+ "epoch": 0.3,
1874
+ "learning_rate": 0.0004969477019103461,
1875
+ "loss": 2.721,
1876
+ "theoretical_loss": 3.714604056370267,
1877
+ "tokens_seen": 832307200
1878
+ },
1879
+ {
1880
+ "epoch": 0.3,
1881
+ "learning_rate": 0.0004961202004917723,
1882
+ "loss": 2.7335,
1883
+ "theoretical_loss": 3.713168311739558,
1884
+ "tokens_seen": 835584000
1885
+ },
1886
+ {
1887
+ "epoch": 0.3,
1888
+ "learning_rate": 0.0004952926990731985,
1889
+ "loss": 2.7169,
1890
+ "theoretical_loss": 3.7117397559563843,
1891
+ "tokens_seen": 838860800
1892
+ },
1893
+ {
1894
+ "epoch": 0.3,
1895
+ "learning_rate": 0.0004944651976546246,
1896
+ "loss": 2.7024,
1897
+ "theoretical_loss": 3.710318325154545,
1898
+ "tokens_seen": 842137600
1899
+ },
1900
+ {
1901
+ "epoch": 0.3,
1902
+ "learning_rate": 0.0004936376962360507,
1903
+ "loss": 2.7488,
1904
+ "theoretical_loss": 3.7089039562809223,
1905
+ "tokens_seen": 845414400
1906
+ },
1907
+ {
1908
+ "epoch": 0.3,
1909
+ "learning_rate": 0.0004928101948174768,
1910
+ "loss": 2.7216,
1911
+ "theoretical_loss": 3.7074965870820193,
1912
+ "tokens_seen": 848691200
1913
+ },
1914
+ {
1915
+ "epoch": 0.3,
1916
+ "learning_rate": 0.0004919826933989029,
1917
+ "loss": 2.7227,
1918
+ "theoretical_loss": 3.7060961560907857,
1919
+ "tokens_seen": 851968000
1920
+ },
1921
+ {
1922
+ "epoch": 0.31,
1923
+ "learning_rate": 0.0004911551919803292,
1924
+ "loss": 2.78,
1925
+ "theoretical_loss": 3.7047026026137,
1926
+ "tokens_seen": 855244800
1927
+ },
1928
+ {
1929
+ "epoch": 0.31,
1930
+ "learning_rate": 0.0004903276905617553,
1931
+ "loss": 2.7405,
1932
+ "theoretical_loss": 3.7033158667181154,
1933
+ "tokens_seen": 858521600
1934
+ },
1935
+ {
1936
+ "epoch": 0.31,
1937
+ "learning_rate": 0.0004895001891431814,
1938
+ "loss": 2.7717,
1939
+ "theoretical_loss": 3.701935889219863,
1940
+ "tokens_seen": 861798400
1941
+ },
1942
+ {
1943
+ "epoch": 0.31,
1944
+ "learning_rate": 0.0004886726877246075,
1945
+ "loss": 2.7393,
1946
+ "theoretical_loss": 3.7005626116710966,
1947
+ "tokens_seen": 865075200
1948
+ },
1949
+ {
1950
+ "epoch": 0.31,
1951
+ "learning_rate": 0.00048784518630603363,
1952
+ "loss": 2.7463,
1953
+ "theoretical_loss": 3.69919597634839,
1954
+ "tokens_seen": 868352000
1955
+ },
1956
+ {
1957
+ "epoch": 0.31,
1958
+ "learning_rate": 0.00048701768488745975,
1959
+ "loss": 2.7256,
1960
+ "theoretical_loss": 3.6978359262410603,
1961
+ "tokens_seen": 871628800
1962
+ },
1963
+ {
1964
+ "epoch": 0.31,
1965
+ "learning_rate": 0.000486190183468886,
1966
+ "loss": 2.7512,
1967
+ "theoretical_loss": 3.6964824050397276,
1968
+ "tokens_seen": 874905600
1969
+ },
1970
+ {
1971
+ "epoch": 0.31,
1972
+ "learning_rate": 0.0004853626820503121,
1973
+ "loss": 2.6954,
1974
+ "theoretical_loss": 3.6951353571251015,
1975
+ "tokens_seen": 878182400
1976
+ },
1977
+ {
1978
+ "epoch": 0.31,
1979
+ "learning_rate": 0.0004845351806317382,
1980
+ "loss": 2.695,
1981
+ "theoretical_loss": 3.693794727556988,
1982
+ "tokens_seen": 881459200
1983
+ },
1984
+ {
1985
+ "epoch": 0.32,
1986
+ "learning_rate": 0.0004837076792131644,
1987
+ "loss": 2.699,
1988
+ "theoretical_loss": 3.692460462063506,
1989
+ "tokens_seen": 884736000
1990
+ },
1991
+ {
1992
+ "epoch": 0.32,
1993
+ "learning_rate": 0.0004828801777945905,
1994
+ "loss": 2.7523,
1995
+ "theoretical_loss": 3.691132507030521,
1996
+ "tokens_seen": 888012800
1997
+ },
1998
+ {
1999
+ "epoch": 0.32,
2000
+ "learning_rate": 0.0004820526763760166,
2001
+ "loss": 2.7663,
2002
+ "theoretical_loss": 3.6898108094912816,
2003
+ "tokens_seen": 891289600
2004
+ },
2005
+ {
2006
+ "epoch": 0.32,
2007
+ "learning_rate": 0.00048122517495744274,
2008
+ "loss": 2.7245,
2009
+ "theoretical_loss": 3.6884953171162556,
2010
+ "tokens_seen": 894566400
2011
+ },
2012
+ {
2013
+ "epoch": 0.32,
2014
+ "learning_rate": 0.00048039767353886897,
2015
+ "loss": 2.7305,
2016
+ "theoretical_loss": 3.6871859782031624,
2017
+ "tokens_seen": 897843200
2018
+ },
2019
+ {
2020
+ "epoch": 0.32,
2021
+ "learning_rate": 0.0004795701721202951,
2022
+ "loss": 2.7417,
2023
+ "theoretical_loss": 3.685882741667202,
2024
+ "tokens_seen": 901120000
2025
+ },
2026
+ {
2027
+ "epoch": 0.32,
2028
+ "learning_rate": 0.0004787426707017212,
2029
+ "loss": 2.744,
2030
+ "theoretical_loss": 3.684585557031461,
2031
+ "tokens_seen": 904396800
2032
+ },
2033
+ {
2034
+ "epoch": 0.32,
2035
+ "learning_rate": 0.0004779151692831473,
2036
+ "loss": 2.7572,
2037
+ "theoretical_loss": 3.6832943744175126,
2038
+ "tokens_seen": 907673600
2039
+ },
2040
+ {
2041
+ "epoch": 0.33,
2042
+ "learning_rate": 0.00047708766786457344,
2043
+ "loss": 2.7303,
2044
+ "theoretical_loss": 3.682009144536188,
2045
+ "tokens_seen": 910950400
2046
+ },
2047
+ {
2048
+ "epoch": 0.33,
2049
+ "learning_rate": 0.00047626016644599956,
2050
+ "loss": 2.7572,
2051
+ "theoretical_loss": 3.680729818678526,
2052
+ "tokens_seen": 914227200
2053
+ },
2054
+ {
2055
+ "epoch": 0.33,
2056
+ "learning_rate": 0.0004754326650274258,
2057
+ "loss": 2.7362,
2058
+ "theoretical_loss": 3.6794563487068936,
2059
+ "tokens_seen": 917504000
2060
+ },
2061
+ {
2062
+ "epoch": 0.33,
2063
+ "learning_rate": 0.0004746051636088519,
2064
+ "loss": 2.7454,
2065
+ "theoretical_loss": 3.6781886870462692,
2066
+ "tokens_seen": 920780800
2067
+ },
2068
+ {
2069
+ "epoch": 0.33,
2070
+ "learning_rate": 0.000473777662190278,
2071
+ "loss": 2.7626,
2072
+ "theoretical_loss": 3.676926786675698,
2073
+ "tokens_seen": 924057600
2074
+ },
2075
+ {
2076
+ "epoch": 0.33,
2077
+ "learning_rate": 0.0004729501607717042,
2078
+ "loss": 2.7011,
2079
+ "theoretical_loss": 3.6756706011198963,
2080
+ "tokens_seen": 927334400
2081
+ },
2082
+ {
2083
+ "epoch": 0.33,
2084
+ "learning_rate": 0.0004721226593531303,
2085
+ "loss": 2.6682,
2086
+ "theoretical_loss": 3.6744200844410217,
2087
+ "tokens_seen": 930611200
2088
+ },
2089
+ {
2090
+ "epoch": 0.33,
2091
+ "learning_rate": 0.00047129515793455643,
2092
+ "loss": 2.6695,
2093
+ "theoretical_loss": 3.6731751912305914,
2094
+ "tokens_seen": 933888000
2095
+ },
2096
+ {
2097
+ "epoch": 0.33,
2098
+ "learning_rate": 0.00047046765651598266,
2099
+ "loss": 2.7057,
2100
+ "theoretical_loss": 3.671935876601547,
2101
+ "tokens_seen": 937164800
2102
+ },
2103
+ {
2104
+ "epoch": 0.34,
2105
+ "learning_rate": 0.0004696401550974088,
2106
+ "loss": 2.6439,
2107
+ "theoretical_loss": 3.6707020961804715,
2108
+ "tokens_seen": 940441600
2109
+ },
2110
+ {
2111
+ "epoch": 0.34,
2112
+ "learning_rate": 0.0004688126536788349,
2113
+ "loss": 2.6978,
2114
+ "theoretical_loss": 3.6694738060999468,
2115
+ "tokens_seen": 943718400
2116
+ },
2117
+ {
2118
+ "epoch": 0.34,
2119
+ "learning_rate": 0.000467985152260261,
2120
+ "loss": 2.7145,
2121
+ "theoretical_loss": 3.668250962991049,
2122
+ "tokens_seen": 946995200
2123
+ },
2124
+ {
2125
+ "epoch": 0.34,
2126
+ "learning_rate": 0.00046715765084168713,
2127
+ "loss": 2.7161,
2128
+ "theoretical_loss": 3.667033523975983,
2129
+ "tokens_seen": 950272000
2130
+ },
2131
+ {
2132
+ "epoch": 0.34,
2133
+ "learning_rate": 0.00046633014942311325,
2134
+ "loss": 2.6986,
2135
+ "theoretical_loss": 3.66582144666085,
2136
+ "tokens_seen": 953548800
2137
+ },
2138
+ {
2139
+ "epoch": 0.34,
2140
+ "learning_rate": 0.0004655026480045394,
2141
+ "loss": 2.7021,
2142
+ "theoretical_loss": 3.664614689128546,
2143
+ "tokens_seen": 956825600
2144
+ },
2145
+ {
2146
+ "epoch": 0.34,
2147
+ "learning_rate": 0.0004646751465859656,
2148
+ "loss": 2.6985,
2149
+ "theoretical_loss": 3.6634132099317886,
2150
+ "tokens_seen": 960102400
2151
+ },
2152
+ {
2153
+ "epoch": 0.34,
2154
+ "learning_rate": 0.0004638476451673917,
2155
+ "loss": 2.7076,
2156
+ "theoretical_loss": 3.662216968086267,
2157
+ "tokens_seen": 963379200
2158
+ },
2159
+ {
2160
+ "epoch": 0.35,
2161
+ "learning_rate": 0.00046302014374881783,
2162
+ "loss": 2.7369,
2163
+ "theoretical_loss": 3.6610259230639217,
2164
+ "tokens_seen": 966656000
2165
+ },
2166
+ {
2167
+ "epoch": 0.35,
2168
+ "learning_rate": 0.000462192642330244,
2169
+ "loss": 2.7711,
2170
+ "theoretical_loss": 3.659840034786333,
2171
+ "tokens_seen": 969932800
2172
+ },
2173
+ {
2174
+ "epoch": 0.35,
2175
+ "learning_rate": 0.0004613651409116701,
2176
+ "loss": 2.8314,
2177
+ "theoretical_loss": 3.6586592636182376,
2178
+ "tokens_seen": 973209600
2179
+ },
2180
+ {
2181
+ "epoch": 0.35,
2182
+ "learning_rate": 0.0004605541895214677,
2183
+ "loss": 2.8239,
2184
+ "theoretical_loss": 3.6574835703611566,
2185
+ "tokens_seen": 976486400
2186
+ },
2187
+ {
2188
+ "epoch": 0.35,
2189
+ "learning_rate": 0.00045972668810289393,
2190
+ "loss": 2.8097,
2191
+ "theoretical_loss": 3.6563129162471313,
2192
+ "tokens_seen": 979763200
2193
+ },
2194
+ {
2195
+ "debugging/Self-BLEU-5": 0.4286046663919377,
2196
+ "debugging/distinct-1-grams": 0.8147567798871364,
2197
+ "debugging/distinct-2-grams": 0.9823269374342457,
2198
+ "debugging/entropy-1-grams": 6.1671920556004824,
2199
+ "debugging/entropy-2-grams": 6.947028138756313,
2200
+ "debugging/length": 477.53333333333336,
2201
+ "debugging/num_segments": 15,
2202
+ "epoch": 0.35,
2203
+ "objective/train/avg_token_score": 0.020611366257071495,
2204
+ "objective/train/avg_weight": 0.9834998250007629,
2205
+ "objective/train/docs_used": 560408,
2206
+ "objective/train/instantaneous_batch_size": 32,
2207
+ "objective/train/instantaneous_microbatch_size": 32768,
2208
+ "objective/train/original_loss": 2.9286718368530273,
2209
+ "objective/train/std_weight": 0.0680035874247551,
2210
+ "objective/train/theoretical_loss": 3.6551472629325787,
2211
+ "objective/train/tokens_used": 1003500000,
2212
+ "theoretical_loss": 3.6551472629325787,
2213
+ "tokens_seen": 983040000
2214
+ },
2215
+ {
2216
+ "epoch": 0.35,
2217
+ "learning_rate": 0.00045889918668432005,
2218
+ "loss": 2.8094,
2219
+ "theoretical_loss": 3.6551472629325787,
2220
+ "tokens_seen": 983040000
2221
+ },
2222
+ {
2223
+ "epoch": 0.35,
2224
+ "learning_rate": 0.00045807168526574616,
2225
+ "loss": 2.8104,
2226
+ "theoretical_loss": 3.653986572492247,
2227
+ "tokens_seen": 986316800
2228
+ },
2229
+ {
2230
+ "epoch": 0.35,
2231
+ "learning_rate": 0.0004572441838471723,
2232
+ "loss": 2.7787,
2233
+ "theoretical_loss": 3.65283080741328,
2234
+ "tokens_seen": 989593600
2235
+ },
2236
+ {
2237
+ "epoch": 0.35,
2238
+ "learning_rate": 0.0004564166824285984,
2239
+ "loss": 2.7906,
2240
+ "theoretical_loss": 3.6516799305893866,
2241
+ "tokens_seen": 992870400
2242
+ },
2243
+ {
2244
+ "epoch": 0.36,
2245
+ "learning_rate": 0.0004555891810100246,
2246
+ "loss": 2.7836,
2247
+ "theoretical_loss": 3.6505339053151076,
2248
+ "tokens_seen": 996147200
2249
+ },
2250
+ {
2251
+ "epoch": 0.36,
2252
+ "learning_rate": 0.00045476167959145075,
2253
+ "loss": 2.7921,
2254
+ "theoretical_loss": 3.649392695280186,
2255
+ "tokens_seen": 999424000
2256
+ },
2257
+ {
2258
+ "epoch": 0.36,
2259
+ "learning_rate": 0.00045393417817287686,
2260
+ "loss": 2.7558,
2261
+ "theoretical_loss": 3.6482562645640337,
2262
+ "tokens_seen": 1002700800
2263
+ },
2264
+ {
2265
+ "epoch": 0.36,
2266
+ "learning_rate": 0.00045310667675430304,
2267
+ "loss": 2.761,
2268
+ "theoretical_loss": 3.6471245776302883,
2269
+ "tokens_seen": 1005977600
2270
+ },
2271
+ {
2272
+ "epoch": 0.36,
2273
+ "learning_rate": 0.00045227917533572916,
2274
+ "loss": 2.7962,
2275
+ "theoretical_loss": 3.6459975993214724,
2276
+ "tokens_seen": 1009254400
2277
+ },
2278
+ {
2279
+ "epoch": 0.36,
2280
+ "learning_rate": 0.0004514516739171553,
2281
+ "loss": 2.7592,
2282
+ "theoretical_loss": 3.6448752948537377,
2283
+ "tokens_seen": 1012531200
2284
+ },
2285
+ {
2286
+ "epoch": 0.36,
2287
+ "learning_rate": 0.0004506241724985814,
2288
+ "loss": 2.7952,
2289
+ "theoretical_loss": 3.6437576298116996,
2290
+ "tokens_seen": 1015808000
2291
+ },
2292
+ {
2293
+ "epoch": 0.36,
2294
+ "learning_rate": 0.0004497966710800076,
2295
+ "loss": 2.7877,
2296
+ "theoretical_loss": 3.6426445701433607,
2297
+ "tokens_seen": 1019084800
2298
+ },
2299
+ {
2300
+ "epoch": 0.37,
2301
+ "learning_rate": 0.00044896916966143374,
2302
+ "loss": 2.8108,
2303
+ "theoretical_loss": 3.6415360821551226,
2304
+ "tokens_seen": 1022361600
2305
+ },
2306
+ {
2307
+ "epoch": 0.37,
2308
+ "learning_rate": 0.00044814166824285986,
2309
+ "loss": 2.7629,
2310
+ "theoretical_loss": 3.6404321325068754,
2311
+ "tokens_seen": 1025638400
2312
+ },
2313
+ {
2314
+ "epoch": 0.37,
2315
+ "learning_rate": 0.000447314166824286,
2316
+ "loss": 2.7842,
2317
+ "theoretical_loss": 3.639332688207178,
2318
+ "tokens_seen": 1028915200
2319
+ },
2320
+ {
2321
+ "epoch": 0.37,
2322
+ "learning_rate": 0.0004464866654057121,
2323
+ "loss": 2.7649,
2324
+ "theoretical_loss": 3.6382377166085096,
2325
+ "tokens_seen": 1032192000
2326
+ },
2327
+ {
2328
+ "epoch": 0.37,
2329
+ "learning_rate": 0.0004456591639871382,
2330
+ "loss": 2.7244,
2331
+ "theoretical_loss": 3.6371471854026147,
2332
+ "tokens_seen": 1035468800
2333
+ },
2334
+ {
2335
+ "epoch": 0.37,
2336
+ "learning_rate": 0.0004448316625685644,
2337
+ "loss": 2.7313,
2338
+ "theoretical_loss": 3.6360610626159087,
2339
+ "tokens_seen": 1038745600
2340
+ },
2341
+ {
2342
+ "epoch": 0.37,
2343
+ "learning_rate": 0.00044400416114999055,
2344
+ "loss": 2.6993,
2345
+ "theoretical_loss": 3.634979316604973,
2346
+ "tokens_seen": 1042022400
2347
+ },
2348
+ {
2349
+ "epoch": 0.37,
2350
+ "learning_rate": 0.0004431766597314167,
2351
+ "loss": 2.714,
2352
+ "theoretical_loss": 3.6339019160521198,
2353
+ "tokens_seen": 1045299200
2354
+ },
2355
+ {
2356
+ "epoch": 0.37,
2357
+ "learning_rate": 0.00044234915831284285,
2358
+ "loss": 2.7069,
2359
+ "theoretical_loss": 3.632828829961029,
2360
+ "tokens_seen": 1048576000
2361
+ },
2362
+ {
2363
+ "epoch": 0.38,
2364
+ "learning_rate": 0.00044152165689426896,
2365
+ "loss": 2.7232,
2366
+ "theoretical_loss": 3.631760027652461,
2367
+ "tokens_seen": 1051852800
2368
+ },
2369
+ {
2370
+ "epoch": 0.38,
2371
+ "learning_rate": 0.0004406941554756951,
2372
+ "loss": 2.7506,
2373
+ "theoretical_loss": 3.630695478760034,
2374
+ "tokens_seen": 1055129600
2375
+ },
2376
+ {
2377
+ "epoch": 0.38,
2378
+ "learning_rate": 0.0004398666540571212,
2379
+ "loss": 2.7249,
2380
+ "theoretical_loss": 3.6296351532260767,
2381
+ "tokens_seen": 1058406400
2382
+ },
2383
+ {
2384
+ "epoch": 0.38,
2385
+ "learning_rate": 0.0004390391526385474,
2386
+ "loss": 2.7194,
2387
+ "theoretical_loss": 3.6285790212975435,
2388
+ "tokens_seen": 1061683200
2389
+ },
2390
+ {
2391
+ "epoch": 0.38,
2392
+ "learning_rate": 0.00043821165121997355,
2393
+ "loss": 2.6962,
2394
+ "theoretical_loss": 3.6275270535220008,
2395
+ "tokens_seen": 1064960000
2396
+ },
2397
+ {
2398
+ "epoch": 0.38,
2399
+ "learning_rate": 0.00043738414980139966,
2400
+ "loss": 2.7203,
2401
+ "theoretical_loss": 3.626479220743673,
2402
+ "tokens_seen": 1068236800
2403
+ },
2404
+ {
2405
+ "epoch": 0.38,
2406
+ "learning_rate": 0.0004365566483828258,
2407
+ "loss": 2.7303,
2408
+ "theoretical_loss": 3.6254354940995586,
2409
+ "tokens_seen": 1071513600
2410
+ },
2411
+ {
2412
+ "epoch": 0.38,
2413
+ "learning_rate": 0.0004357291469642519,
2414
+ "loss": 2.7215,
2415
+ "theoretical_loss": 3.624395845015602,
2416
+ "tokens_seen": 1074790400
2417
+ },
2418
+ {
2419
+ "epoch": 0.39,
2420
+ "learning_rate": 0.00043490164554567807,
2421
+ "loss": 2.6763,
2422
+ "theoretical_loss": 3.6233602452029348,
2423
+ "tokens_seen": 1078067200
2424
+ },
2425
+ {
2426
+ "epoch": 0.39,
2427
+ "learning_rate": 0.00043407414412710424,
2428
+ "loss": 2.6939,
2429
+ "theoretical_loss": 3.6223286666541683,
2430
+ "tokens_seen": 1081344000
2431
+ },
2432
+ {
2433
+ "epoch": 0.39,
2434
+ "learning_rate": 0.00043324664270853036,
2435
+ "loss": 2.7216,
2436
+ "theoretical_loss": 3.621301081639753,
2437
+ "tokens_seen": 1084620800
2438
+ },
2439
+ {
2440
+ "epoch": 0.39,
2441
+ "learning_rate": 0.0004324191412899565,
2442
+ "loss": 2.7593,
2443
+ "theoretical_loss": 3.6202774627043923,
2444
+ "tokens_seen": 1087897600
2445
+ },
2446
+ {
2447
+ "epoch": 0.39,
2448
+ "learning_rate": 0.00043159163987138265,
2449
+ "loss": 2.7155,
2450
+ "theoretical_loss": 3.619257782663513,
2451
+ "tokens_seen": 1091174400
2452
+ },
2453
+ {
2454
+ "epoch": 0.39,
2455
+ "learning_rate": 0.00043078068848118023,
2456
+ "loss": 2.7244,
2457
+ "theoretical_loss": 3.618242014599793,
2458
+ "tokens_seen": 1094451200
2459
+ },
2460
+ {
2461
+ "epoch": 0.39,
2462
+ "learning_rate": 0.00042995318706260635,
2463
+ "loss": 2.7124,
2464
+ "theoretical_loss": 3.617230131859743,
2465
+ "tokens_seen": 1097728000
2466
+ },
2467
+ {
2468
+ "epoch": 0.39,
2469
+ "learning_rate": 0.00042912568564403247,
2470
+ "loss": 2.6959,
2471
+ "theoretical_loss": 3.6162221080503416,
2472
+ "tokens_seen": 1101004800
2473
+ },
2474
+ {
2475
+ "epoch": 0.39,
2476
+ "learning_rate": 0.0004282981842254587,
2477
+ "loss": 2.6816,
2478
+ "theoretical_loss": 3.615217917035726,
2479
+ "tokens_seen": 1104281600
2480
+ },
2481
+ {
2482
+ "epoch": 0.4,
2483
+ "learning_rate": 0.0004274706828068848,
2484
+ "loss": 2.7135,
2485
+ "theoretical_loss": 3.614217532933929,
2486
+ "tokens_seen": 1107558400
2487
+ },
2488
+ {
2489
+ "epoch": 0.4,
2490
+ "learning_rate": 0.00042664318138831093,
2491
+ "loss": 2.7048,
2492
+ "theoretical_loss": 3.6132209301136715,
2493
+ "tokens_seen": 1110835200
2494
+ },
2495
+ {
2496
+ "epoch": 0.4,
2497
+ "learning_rate": 0.00042581567996973705,
2498
+ "loss": 2.6877,
2499
+ "theoretical_loss": 3.612228083191205,
2500
+ "tokens_seen": 1114112000
2501
+ },
2502
+ {
2503
+ "epoch": 0.4,
2504
+ "learning_rate": 0.0004249881785511632,
2505
+ "loss": 2.7443,
2506
+ "theoretical_loss": 3.611238967027199,
2507
+ "tokens_seen": 1117388800
2508
+ },
2509
+ {
2510
+ "epoch": 0.4,
2511
+ "learning_rate": 0.00042416067713258934,
2512
+ "loss": 2.7844,
2513
+ "theoretical_loss": 3.610253556723679,
2514
+ "tokens_seen": 1120665600
2515
+ },
2516
+ {
2517
+ "epoch": 0.4,
2518
+ "learning_rate": 0.0004233331757140155,
2519
+ "loss": 2.7489,
2520
+ "theoretical_loss": 3.609271827621014,
2521
+ "tokens_seen": 1123942400
2522
+ },
2523
+ {
2524
+ "epoch": 0.4,
2525
+ "learning_rate": 0.0004225056742954417,
2526
+ "loss": 2.7523,
2527
+ "theoretical_loss": 3.6082937552949463,
2528
+ "tokens_seen": 1127219200
2529
+ },
2530
+ {
2531
+ "epoch": 0.4,
2532
+ "learning_rate": 0.0004216781728768678,
2533
+ "loss": 2.7191,
2534
+ "theoretical_loss": 3.607319315553669,
2535
+ "tokens_seen": 1130496000
2536
+ },
2537
+ {
2538
+ "epoch": 0.4,
2539
+ "learning_rate": 0.0004208506714582939,
2540
+ "loss": 2.6832,
2541
+ "theoretical_loss": 3.6063484844349456,
2542
+ "tokens_seen": 1133772800
2543
+ },
2544
+ {
2545
+ "epoch": 0.41,
2546
+ "learning_rate": 0.00042002317003972004,
2547
+ "loss": 2.7085,
2548
+ "theoretical_loss": 3.605381238203279,
2549
+ "tokens_seen": 1137049600
2550
+ },
2551
+ {
2552
+ "epoch": 0.41,
2553
+ "learning_rate": 0.00041919566862114616,
2554
+ "loss": 2.7165,
2555
+ "theoretical_loss": 3.604417553347117,
2556
+ "tokens_seen": 1140326400
2557
+ },
2558
+ {
2559
+ "epoch": 0.41,
2560
+ "learning_rate": 0.0004183681672025724,
2561
+ "loss": 2.7073,
2562
+ "theoretical_loss": 3.603457406576106,
2563
+ "tokens_seen": 1143603200
2564
+ },
2565
+ {
2566
+ "epoch": 0.41,
2567
+ "objective/train/avg_token_score": 0.022877871990203857,
2568
+ "objective/train/avg_weight": 0.9816967248916626,
2569
+ "objective/train/docs_used": 649861,
2570
+ "objective/train/instantaneous_batch_size": 32,
2571
+ "objective/train/instantaneous_microbatch_size": 32768,
2572
+ "objective/train/original_loss": 2.648519992828369,
2573
+ "objective/train/std_weight": 0.07637551426887512,
2574
+ "objective/train/theoretical_loss": 3.602500774818379,
2575
+ "objective/train/tokens_used": 1167340000,
2576
+ "theoretical_loss": 3.602500774818379,
2577
+ "tokens_seen": 1146880000
2578
+ },
2579
+ {
2580
+ "epoch": 0.41,
2581
+ "learning_rate": 0.0004175406657839985,
2582
+ "loss": 2.6952,
2583
+ "theoretical_loss": 3.602500774818379,
2584
+ "tokens_seen": 1146880000
2585
+ },
2586
+ {
2587
+ "epoch": 0.41,
2588
+ "learning_rate": 0.0004167131643654246,
2589
+ "loss": 2.7354,
2590
+ "theoretical_loss": 3.601547635217892,
2591
+ "tokens_seen": 1150156800
2592
+ },
2593
+ {
2594
+ "epoch": 0.41,
2595
+ "learning_rate": 0.00041588566294685074,
2596
+ "loss": 2.7348,
2597
+ "theoretical_loss": 3.6005979651317976,
2598
+ "tokens_seen": 1153433600
2599
+ },
2600
+ {
2601
+ "epoch": 0.41,
2602
+ "learning_rate": 0.00041505816152827686,
2603
+ "loss": 2.7431,
2604
+ "theoretical_loss": 3.599651742127855,
2605
+ "tokens_seen": 1156710400
2606
+ },
2607
+ {
2608
+ "epoch": 0.41,
2609
+ "learning_rate": 0.00041423066010970303,
2610
+ "loss": 2.6952,
2611
+ "theoretical_loss": 3.5987089439818805,
2612
+ "tokens_seen": 1159987200
2613
+ },
2614
+ {
2615
+ "epoch": 0.42,
2616
+ "learning_rate": 0.00041340315869112915,
2617
+ "loss": 2.6743,
2618
+ "theoretical_loss": 3.5977695486752426,
2619
+ "tokens_seen": 1163264000
2620
+ },
2621
+ {
2622
+ "epoch": 0.42,
2623
+ "learning_rate": 0.0004125756572725553,
2624
+ "loss": 2.709,
2625
+ "theoretical_loss": 3.596833534392379,
2626
+ "tokens_seen": 1166540800
2627
+ },
2628
+ {
2629
+ "epoch": 0.42,
2630
+ "learning_rate": 0.0004117481558539815,
2631
+ "loss": 2.7274,
2632
+ "theoretical_loss": 3.595900879518368,
2633
+ "tokens_seen": 1169817600
2634
+ },
2635
+ {
2636
+ "epoch": 0.42,
2637
+ "learning_rate": 0.0004109206544354076,
2638
+ "loss": 2.7434,
2639
+ "theoretical_loss": 3.594971562636521,
2640
+ "tokens_seen": 1173094400
2641
+ },
2642
+ {
2643
+ "epoch": 0.42,
2644
+ "learning_rate": 0.00041009315301683373,
2645
+ "loss": 2.6976,
2646
+ "theoretical_loss": 3.5940455625260226,
2647
+ "tokens_seen": 1176371200
2648
+ },
2649
+ {
2650
+ "epoch": 0.42,
2651
+ "learning_rate": 0.00040926565159825985,
2652
+ "loss": 2.7078,
2653
+ "theoretical_loss": 3.5931228581595938,
2654
+ "tokens_seen": 1179648000
2655
+ },
2656
+ {
2657
+ "epoch": 0.42,
2658
+ "learning_rate": 0.00040843815017968597,
2659
+ "loss": 2.7067,
2660
+ "theoretical_loss": 3.5922034287011995,
2661
+ "tokens_seen": 1182924800
2662
+ },
2663
+ {
2664
+ "epoch": 0.42,
2665
+ "learning_rate": 0.0004076106487611122,
2666
+ "loss": 2.6717,
2667
+ "theoretical_loss": 3.5912872535037828,
2668
+ "tokens_seen": 1186201600
2669
+ },
2670
+ {
2671
+ "epoch": 0.42,
2672
+ "learning_rate": 0.0004067831473425383,
2673
+ "loss": 2.7545,
2674
+ "theoretical_loss": 3.590374312107035,
2675
+ "tokens_seen": 1189478400
2676
+ },
2677
+ {
2678
+ "epoch": 0.43,
2679
+ "learning_rate": 0.00040595564592396443,
2680
+ "loss": 2.7398,
2681
+ "theoretical_loss": 3.5894645842351993,
2682
+ "tokens_seen": 1192755200
2683
+ },
2684
+ {
2685
+ "epoch": 0.43,
2686
+ "learning_rate": 0.00040512814450539055,
2687
+ "loss": 2.7241,
2688
+ "theoretical_loss": 3.588558049794902,
2689
+ "tokens_seen": 1196032000
2690
+ },
2691
+ {
2692
+ "epoch": 0.43,
2693
+ "learning_rate": 0.0004043006430868167,
2694
+ "loss": 2.7036,
2695
+ "theoretical_loss": 3.5876546888730187,
2696
+ "tokens_seen": 1199308800
2697
+ },
2698
+ {
2699
+ "epoch": 0.43,
2700
+ "learning_rate": 0.00040347314166824284,
2701
+ "loss": 2.7239,
2702
+ "theoretical_loss": 3.5867544817345713,
2703
+ "tokens_seen": 1202585600
2704
+ },
2705
+ {
2706
+ "epoch": 0.43,
2707
+ "learning_rate": 0.000402645640249669,
2708
+ "loss": 2.7503,
2709
+ "theoretical_loss": 3.585857408820652,
2710
+ "tokens_seen": 1205862400
2711
+ },
2712
+ {
2713
+ "epoch": 0.43,
2714
+ "learning_rate": 0.00040181813883109513,
2715
+ "loss": 2.7573,
2716
+ "theoretical_loss": 3.58496345074638,
2717
+ "tokens_seen": 1209139200
2718
+ },
2719
+ {
2720
+ "epoch": 0.43,
2721
+ "learning_rate": 0.0004009906374125213,
2722
+ "loss": 2.7716,
2723
+ "theoretical_loss": 3.5840725882988873,
2724
+ "tokens_seen": 1212416000
2725
+ },
2726
+ {
2727
+ "epoch": 0.43,
2728
+ "learning_rate": 0.0004001631359939474,
2729
+ "loss": 2.7395,
2730
+ "theoretical_loss": 3.5831848024353317,
2731
+ "tokens_seen": 1215692800
2732
+ },
2733
+ {
2734
+ "epoch": 0.44,
2735
+ "learning_rate": 0.00039933563457537354,
2736
+ "loss": 2.7634,
2737
+ "theoretical_loss": 3.5823000742809374,
2738
+ "tokens_seen": 1218969600
2739
+ },
2740
+ {
2741
+ "epoch": 0.44,
2742
+ "learning_rate": 0.00039850813315679966,
2743
+ "loss": 2.7348,
2744
+ "theoretical_loss": 3.5814183851270673,
2745
+ "tokens_seen": 1222246400
2746
+ },
2747
+ {
2748
+ "epoch": 0.44,
2749
+ "learning_rate": 0.0003976806317382258,
2750
+ "loss": 2.7239,
2751
+ "theoretical_loss": 3.5805397164293167,
2752
+ "tokens_seen": 1225523200
2753
+ },
2754
+ {
2755
+ "epoch": 0.44,
2756
+ "learning_rate": 0.000396853130319652,
2757
+ "loss": 2.7059,
2758
+ "theoretical_loss": 3.5796640498056407,
2759
+ "tokens_seen": 1228800000
2760
+ },
2761
+ {
2762
+ "epoch": 0.44,
2763
+ "learning_rate": 0.0003960256289010781,
2764
+ "loss": 2.7119,
2765
+ "theoretical_loss": 3.5787913670345013,
2766
+ "tokens_seen": 1232076800
2767
+ },
2768
+ {
2769
+ "epoch": 0.44,
2770
+ "learning_rate": 0.00039519812748250424,
2771
+ "loss": 2.6905,
2772
+ "theoretical_loss": 3.577921650053045,
2773
+ "tokens_seen": 1235353600
2774
+ },
2775
+ {
2776
+ "epoch": 0.44,
2777
+ "learning_rate": 0.00039437062606393036,
2778
+ "loss": 2.687,
2779
+ "theoretical_loss": 3.577054880955303,
2780
+ "tokens_seen": 1238630400
2781
+ },
2782
+ {
2783
+ "epoch": 0.44,
2784
+ "learning_rate": 0.00039354312464535653,
2785
+ "loss": 2.6705,
2786
+ "theoretical_loss": 3.5761910419904193,
2787
+ "tokens_seen": 1241907200
2788
+ },
2789
+ {
2790
+ "epoch": 0.44,
2791
+ "learning_rate": 0.00039271562322678265,
2792
+ "loss": 2.7329,
2793
+ "theoretical_loss": 3.5753301155609014,
2794
+ "tokens_seen": 1245184000
2795
+ },
2796
+ {
2797
+ "epoch": 0.45,
2798
+ "learning_rate": 0.0003918881218082088,
2799
+ "loss": 2.6739,
2800
+ "theoretical_loss": 3.574472084220896,
2801
+ "tokens_seen": 1248460800
2802
+ },
2803
+ {
2804
+ "epoch": 0.45,
2805
+ "learning_rate": 0.00039106062038963494,
2806
+ "loss": 2.6639,
2807
+ "theoretical_loss": 3.5736169306744885,
2808
+ "tokens_seen": 1251737600
2809
+ },
2810
+ {
2811
+ "epoch": 0.45,
2812
+ "learning_rate": 0.0003902331189710611,
2813
+ "loss": 2.6337,
2814
+ "theoretical_loss": 3.572764637774024,
2815
+ "tokens_seen": 1255014400
2816
+ },
2817
+ {
2818
+ "epoch": 0.45,
2819
+ "learning_rate": 0.00038940561755248723,
2820
+ "loss": 2.6435,
2821
+ "theoretical_loss": 3.571915188518457,
2822
+ "tokens_seen": 1258291200
2823
+ },
2824
+ {
2825
+ "epoch": 0.45,
2826
+ "learning_rate": 0.00038857811613391335,
2827
+ "loss": 2.662,
2828
+ "theoretical_loss": 3.571068566051716,
2829
+ "tokens_seen": 1261568000
2830
+ },
2831
+ {
2832
+ "epoch": 0.45,
2833
+ "learning_rate": 0.00038775061471533947,
2834
+ "loss": 2.704,
2835
+ "theoretical_loss": 3.5702247536610976,
2836
+ "tokens_seen": 1264844800
2837
+ },
2838
+ {
2839
+ "epoch": 0.45,
2840
+ "learning_rate": 0.0003869231132967657,
2841
+ "loss": 2.6831,
2842
+ "theoretical_loss": 3.5693837347756783,
2843
+ "tokens_seen": 1268121600
2844
+ },
2845
+ {
2846
+ "epoch": 0.45,
2847
+ "learning_rate": 0.0003860956118781918,
2848
+ "loss": 2.6552,
2849
+ "theoretical_loss": 3.5685454929647475,
2850
+ "tokens_seen": 1271398400
2851
+ },
2852
+ {
2853
+ "epoch": 0.46,
2854
+ "learning_rate": 0.00038526811045961793,
2855
+ "loss": 2.6671,
2856
+ "theoretical_loss": 3.5677100119362675,
2857
+ "tokens_seen": 1274675200
2858
+ },
2859
+ {
2860
+ "epoch": 0.46,
2861
+ "learning_rate": 0.00038444060904104405,
2862
+ "loss": 2.6461,
2863
+ "theoretical_loss": 3.566877275535345,
2864
+ "tokens_seen": 1277952000
2865
+ },
2866
+ {
2867
+ "epoch": 0.46,
2868
+ "learning_rate": 0.00038361310762247017,
2869
+ "loss": 2.6451,
2870
+ "theoretical_loss": 3.566047267742733,
2871
+ "tokens_seen": 1281228800
2872
+ },
2873
+ {
2874
+ "epoch": 0.46,
2875
+ "learning_rate": 0.00038278560620389634,
2876
+ "loss": 2.6221,
2877
+ "theoretical_loss": 3.5652199726733453,
2878
+ "tokens_seen": 1284505600
2879
+ },
2880
+ {
2881
+ "epoch": 0.46,
2882
+ "learning_rate": 0.0003819581047853225,
2883
+ "loss": 2.6172,
2884
+ "theoretical_loss": 3.564395374574796,
2885
+ "tokens_seen": 1287782400
2886
+ },
2887
+ {
2888
+ "epoch": 0.46,
2889
+ "learning_rate": 0.00038113060336674863,
2890
+ "loss": 2.6767,
2891
+ "theoretical_loss": 3.5635734578259557,
2892
+ "tokens_seen": 1291059200
2893
+ },
2894
+ {
2895
+ "epoch": 0.46,
2896
+ "learning_rate": 0.0003803031019481748,
2897
+ "loss": 2.6908,
2898
+ "theoretical_loss": 3.5627542069355282,
2899
+ "tokens_seen": 1294336000
2900
+ },
2901
+ {
2902
+ "epoch": 0.46,
2903
+ "learning_rate": 0.0003794756005296009,
2904
+ "loss": 2.6578,
2905
+ "theoretical_loss": 3.5619376065406474,
2906
+ "tokens_seen": 1297612800
2907
+ },
2908
+ {
2909
+ "epoch": 0.46,
2910
+ "learning_rate": 0.00037864809911102704,
2911
+ "loss": 2.6722,
2912
+ "theoretical_loss": 3.5611236414054868,
2913
+ "tokens_seen": 1300889600
2914
+ },
2915
+ {
2916
+ "epoch": 0.47,
2917
+ "learning_rate": 0.00037782059769245316,
2918
+ "loss": 2.6776,
2919
+ "theoretical_loss": 3.560312296419899,
2920
+ "tokens_seen": 1304166400
2921
+ },
2922
+ {
2923
+ "epoch": 0.47,
2924
+ "learning_rate": 0.0003769930962738793,
2925
+ "loss": 2.6728,
2926
+ "theoretical_loss": 3.55950355659806,
2927
+ "tokens_seen": 1307443200
2928
+ },
2929
+ {
2930
+ "debugging/Self-BLEU-5": 0.49020908264157476,
2931
+ "debugging/distinct-1-grams": 0.768901113497886,
2932
+ "debugging/distinct-2-grams": 0.9428782333551957,
2933
+ "debugging/entropy-1-grams": 6.085999550681761,
2934
+ "debugging/entropy-2-grams": 7.0033060167714964,
2935
+ "debugging/length": 490.2352941176471,
2936
+ "debugging/num_segments": 17,
2937
+ "epoch": 0.47,
2938
+ "objective/train/avg_token_score": 0.02056093141436577,
2939
+ "objective/train/avg_weight": 0.983538031578064,
2940
+ "objective/train/docs_used": 741674,
2941
+ "objective/train/instantaneous_batch_size": 32,
2942
+ "objective/train/instantaneous_microbatch_size": 32768,
2943
+ "objective/train/original_loss": 2.6528468132019043,
2944
+ "objective/train/std_weight": 0.0878894254565239,
2945
+ "objective/train/theoretical_loss": 3.558697407077142,
2946
+ "objective/train/tokens_used": 1331180000,
2947
+ "theoretical_loss": 3.558697407077142,
2948
+ "tokens_seen": 1310720000
2949
+ },
2950
+ {
2951
+ "epoch": 0.47,
2952
+ "learning_rate": 0.0003761655948553055,
2953
+ "loss": 2.6652,
2954
+ "theoretical_loss": 3.558697407077142,
2955
+ "tokens_seen": 1310720000
2956
+ },
2957
+ {
2958
+ "epoch": 0.47,
2959
+ "learning_rate": 0.0003753380934367316,
2960
+ "loss": 2.6785,
2961
+ "theoretical_loss": 3.5578938331159975,
2962
+ "tokens_seen": 1313996800
2963
+ },
2964
+ {
2965
+ "epoch": 0.47,
2966
+ "learning_rate": 0.00037451059201815774,
2967
+ "loss": 2.6523,
2968
+ "theoretical_loss": 3.557092820093863,
2969
+ "tokens_seen": 1317273600
2970
+ },
2971
+ {
2972
+ "epoch": 0.47,
2973
+ "learning_rate": 0.00037368309059958386,
2974
+ "loss": 2.6731,
2975
+ "theoretical_loss": 3.556294353509079,
2976
+ "tokens_seen": 1320550400
2977
+ },
2978
+ {
2979
+ "epoch": 0.47,
2980
+ "learning_rate": 0.00037285558918101,
2981
+ "loss": 2.6639,
2982
+ "theoretical_loss": 3.555498418977828,
2983
+ "tokens_seen": 1323827200
2984
+ },
2985
+ {
2986
+ "epoch": 0.47,
2987
+ "learning_rate": 0.00037202808776243615,
2988
+ "loss": 2.6579,
2989
+ "theoretical_loss": 3.5547050022328874,
2990
+ "tokens_seen": 1327104000
2991
+ },
2992
+ {
2993
+ "epoch": 0.48,
2994
+ "learning_rate": 0.0003712005863438623,
2995
+ "loss": 2.6554,
2996
+ "theoretical_loss": 3.553914089122399,
2997
+ "tokens_seen": 1330380800
2998
+ },
2999
+ {
3000
+ "epoch": 0.48,
3001
+ "learning_rate": 0.00037037308492528844,
3002
+ "loss": 2.7062,
3003
+ "theoretical_loss": 3.553125665608655,
3004
+ "tokens_seen": 1333657600
3005
+ },
3006
+ {
3007
+ "epoch": 0.48,
3008
+ "learning_rate": 0.0003695455835067146,
3009
+ "loss": 2.6814,
3010
+ "theoretical_loss": 3.5523397177669005,
3011
+ "tokens_seen": 1336934400
3012
+ },
3013
+ {
3014
+ "epoch": 0.48,
3015
+ "learning_rate": 0.00036871808208814073,
3016
+ "loss": 2.691,
3017
+ "theoretical_loss": 3.551556231784149,
3018
+ "tokens_seen": 1340211200
3019
+ },
3020
+ {
3021
+ "epoch": 0.48,
3022
+ "learning_rate": 0.00036789058066956685,
3023
+ "loss": 2.6452,
3024
+ "theoretical_loss": 3.5507751939580148,
3025
+ "tokens_seen": 1343488000
3026
+ },
3027
+ {
3028
+ "epoch": 0.48,
3029
+ "learning_rate": 0.00036706307925099297,
3030
+ "loss": 2.646,
3031
+ "theoretical_loss": 3.5499965906955606,
3032
+ "tokens_seen": 1346764800
3033
+ },
3034
+ {
3035
+ "epoch": 0.48,
3036
+ "learning_rate": 0.0003662355778324192,
3037
+ "loss": 2.6952,
3038
+ "theoretical_loss": 3.549220408512161,
3039
+ "tokens_seen": 1350041600
3040
+ },
3041
+ {
3042
+ "epoch": 0.48,
3043
+ "learning_rate": 0.0003654080764138453,
3044
+ "loss": 2.6716,
3045
+ "theoretical_loss": 3.5484466340303755,
3046
+ "tokens_seen": 1353318400
3047
+ },
3048
+ {
3049
+ "epoch": 0.48,
3050
+ "learning_rate": 0.00036458057499527143,
3051
+ "loss": 2.6566,
3052
+ "theoretical_loss": 3.547675253978843,
3053
+ "tokens_seen": 1356595200
3054
+ },
3055
+ {
3056
+ "epoch": 0.49,
3057
+ "learning_rate": 0.00036375307357669755,
3058
+ "loss": 2.6947,
3059
+ "theoretical_loss": 3.5469062551911854,
3060
+ "tokens_seen": 1359872000
3061
+ },
3062
+ {
3063
+ "epoch": 0.49,
3064
+ "learning_rate": 0.00036292557215812367,
3065
+ "loss": 2.7012,
3066
+ "theoretical_loss": 3.5461396246049244,
3067
+ "tokens_seen": 1363148800
3068
+ },
3069
+ {
3070
+ "epoch": 0.49,
3071
+ "learning_rate": 0.0003620980707395498,
3072
+ "loss": 2.6675,
3073
+ "theoretical_loss": 3.545375349260419,
3074
+ "tokens_seen": 1366425600
3075
+ },
3076
+ {
3077
+ "epoch": 0.49,
3078
+ "learning_rate": 0.00036127056932097596,
3079
+ "loss": 2.7088,
3080
+ "theoretical_loss": 3.544613416299808,
3081
+ "tokens_seen": 1369702400
3082
+ },
3083
+ {
3084
+ "epoch": 0.49,
3085
+ "learning_rate": 0.00036044306790240213,
3086
+ "loss": 2.6847,
3087
+ "theoretical_loss": 3.5438538129659687,
3088
+ "tokens_seen": 1372979200
3089
+ },
3090
+ {
3091
+ "epoch": 0.49,
3092
+ "learning_rate": 0.00035961556648382825,
3093
+ "loss": 2.7015,
3094
+ "theoretical_loss": 3.5430965266014933,
3095
+ "tokens_seen": 1376256000
3096
+ },
3097
+ {
3098
+ "epoch": 0.49,
3099
+ "learning_rate": 0.0003587880650652544,
3100
+ "loss": 2.7051,
3101
+ "theoretical_loss": 3.5423415446476705,
3102
+ "tokens_seen": 1379532800
3103
+ },
3104
+ {
3105
+ "epoch": 0.49,
3106
+ "learning_rate": 0.00035796056364668054,
3107
+ "loss": 2.6407,
3108
+ "theoretical_loss": 3.541588854643487,
3109
+ "tokens_seen": 1382809600
3110
+ },
3111
+ {
3112
+ "epoch": 0.5,
3113
+ "learning_rate": 0.00035713306222810666,
3114
+ "loss": 2.6672,
3115
+ "theoretical_loss": 3.5408384442246343,
3116
+ "tokens_seen": 1386086400
3117
+ },
3118
+ {
3119
+ "epoch": 0.5,
3120
+ "learning_rate": 0.0003563055608095328,
3121
+ "loss": 2.623,
3122
+ "theoretical_loss": 3.540090301122535,
3123
+ "tokens_seen": 1389363200
3124
+ },
3125
+ {
3126
+ "epoch": 0.5,
3127
+ "learning_rate": 0.000355478059390959,
3128
+ "loss": 2.6465,
3129
+ "theoretical_loss": 3.5393444131633762,
3130
+ "tokens_seen": 1392640000
3131
+ },
3132
+ {
3133
+ "epoch": 0.5,
3134
+ "learning_rate": 0.0003546505579723851,
3135
+ "loss": 2.696,
3136
+ "theoretical_loss": 3.5386007682671576,
3137
+ "tokens_seen": 1395916800
3138
+ },
3139
+ {
3140
+ "epoch": 0.5,
3141
+ "learning_rate": 0.00035382305655381124,
3142
+ "loss": 2.6751,
3143
+ "theoretical_loss": 3.5378593544467494,
3144
+ "tokens_seen": 1399193600
3145
+ }
3146
+ ],
3147
+ "max_steps": 42724,
3148
+ "num_train_epochs": 9223372036854775807,
3149
+ "total_flos": 7.14460209610752e+17,
3150
+ "trial_name": null,
3151
+ "trial_params": null
3152
+ }
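The trainer_state.json log above interleaves the observed training loss with a theoretical_loss column and a tokens_seen counter at each logged step. A minimal sketch for pulling those series back out of the checkpoint and comparing them; the file path follows the layout shown in this commit, and the log_history key is the standard Trainer field that holds these entries (an assumption about the part of the file not shown here):

import json

# Parse the trainer state written by the Hugging Face Trainer.
with open("checkpoint-21362/trainer_state.json") as f:
    state = json.load(f)

# Keep only the periodic training logs that carry all three fields
# (evaluation and debugging entries are skipped).
rows = [r for r in state["log_history"]
        if "loss" in r and "theoretical_loss" in r and "tokens_seen" in r]

tokens = [r["tokens_seen"] for r in rows]
train_loss = [r["loss"] for r in rows]
theory_loss = [r["theoretical_loss"] for r in rows]

# Print the gap between observed and theoretical loss at each logged step.
for t, l, th in zip(tokens, train_loss, theory_loss):
    print(f"{t:>13,d} tokens  loss={l:.4f}  theoretical={th:.4f}  gap={l - th:+.4f}")

The same lists can be fed to any plotting library to visualise how far the run sits below or above the theoretical curve as tokens_seen grows.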
checkpoint-21362/training_args.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a32ea2ef12beede46f0a0e389a80e6baabee1d453e1a781f52cc6fd941c1b56
3
+ size 3451
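The three-line .bin entries in this diff are Git LFS pointer files: the repository stores only a version line, a sha256 oid, and a byte size, while the actual payload lives in LFS storage and is substituted in at checkout. A small sketch, assuming a checkout where the pointer text is still in place and the real blob has been downloaded separately (both paths below are hypothetical):

import hashlib

def parse_lfs_pointer(path):
    # A pointer file is three "key value" lines: version, oid sha256:<hex>, size.
    fields = dict(line.split(" ", 1) for line in open(path).read().splitlines() if line)
    return fields["oid"].split(":", 1)[1].strip(), int(fields["size"])

def verify_blob(pointer_path, blob_path):
    # Recompute the sha256 of the downloaded blob and compare it to the pointer's oid.
    expected_oid, expected_size = parse_lfs_pointer(pointer_path)
    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_oid

# Hypothetical usage: pointer text kept aside, blob fetched via `git lfs pull` or the Hub.
# verify_blob("training_args.bin.pointer", "checkpoint-21362/training_args.bin")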
checkpoint-21362/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
config.json ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "gpt2",
3
+ "activation_function": "gelu_new",
4
+ "architectures": [
5
+ "GPT2LMAndValueHeadModel"
6
+ ],
7
+ "attn_pdrop": 0.1,
8
+ "bos_token_id": 50256,
9
+ "embd_pdrop": 0.1,
10
+ "eos_token_id": 50256,
11
+ "initializer_range": 0.02,
12
+ "layer_norm_epsilon": 1e-05,
13
+ "model_type": "gpt2",
14
+ "n_ctx": 1024,
15
+ "n_embd": 768,
16
+ "n_head": 12,
17
+ "n_inner": null,
18
+ "n_layer": 12,
19
+ "n_positions": 1024,
20
+ "reorder_and_upcast_attn": true,
21
+ "resid_pdrop": 0.1,
22
+ "scale_attn_by_inverse_layer_idx": false,
23
+ "scale_attn_weights": true,
24
+ "summary_activation": null,
25
+ "summary_first_dropout": 0.1,
26
+ "summary_proj_to_labels": true,
27
+ "summary_type": "cls_index",
28
+ "summary_use_proj": true,
29
+ "task_specific_params": {
30
+ "text-generation": {
31
+ "do_sample": true,
32
+ "max_length": 50
33
+ }
34
+ },
35
+ "torch_dtype": "float32",
36
+ "transformers_version": "4.23.0",
37
+ "use_cache": true,
38
+ "vocab_size": 50257
39
+ }
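The root config.json repeats the checkpoint config: a standard 124M-parameter GPT-2 body ("n_layer": 12, "n_embd": 768, "n_head": 12) whose architectures field names GPT2LMAndValueHeadModel, a value-head wrapper that belongs to the training code rather than to stock transformers. A minimal sketch that loads this config and instantiates the plain language-model body; the local path is an assumption, and the value head itself would require the original wrapper class:

from transformers import GPT2Config, GPT2LMHeadModel

# Read the config shown above; only the standard GPT-2 hyperparameters are used here,
# the custom architectures entry is carried along as metadata.
config = GPT2Config.from_json_file("config.json")

# Instantiate the 12-layer / 768-dim / 12-head GPT-2 body with fresh weights;
# the value head from GPT2LMAndValueHeadModel is not part of this class.
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 124M parameters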
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9404903bc78cf2c713583822be36bec6051c5830b03e060ec00c9dc050e53e30
3
+ size 510398013
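The ~510 MB size of pytorch_model.bin roughly matches float32 GPT-2-small weights (plus attention-mask buffers and a small value head). Once the LFS blob has been fetched, the raw state dict can be inspected without instantiating a model; the path and the assumption that the file is an ordinary torch-saved state dict are hypotheses based on how save_pretrained usually writes checkpoints:

import torch

# Load the raw tensor dictionary on CPU; assumes the LFS blob has been downloaded
# and that the file is a plain torch-saved state dict rather than a full pickle.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

n_params = sum(t.numel() for t in state_dict.values() if t.dtype.is_floating_point)
print(f"{len(state_dict)} tensors, {n_params / 1e6:.1f}M float parameters")

# Peek at a few keys to see which tensors belong to the GPT-2 body vs. the value head.
for key in list(state_dict)[:5]:
    print(key, tuple(state_dict[key].shape))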
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "pad_token": "<|endoftext|>",
5
+ "unk_token": "<|endoftext|>"
6
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": "<|endoftext|>",
4
+ "eos_token": "<|endoftext|>",
5
+ "model_max_length": 1024,
6
+ "name_or_path": "gpt2",
7
+ "special_tokens_map_file": null,
8
+ "tokenizer_class": "GPT2Tokenizer",
9
+ "unk_token": "<|endoftext|>"
10
+ }
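tokenizer_config.json and special_tokens_map.json reuse GPT-2's single <|endoftext|> token as BOS, EOS, UNK, and PAD, with a 1024-token model_max_length. A short sketch loading the tokenizer from these files; the local directory "." is an assumption, and any directory containing vocab.json, merges.txt, and the two JSON files above would work the same way:

from transformers import GPT2Tokenizer

# Load from the local files in this commit rather than from the Hub.
tokenizer = GPT2Tokenizer.from_pretrained(".")

print(tokenizer.eos_token, tokenizer.pad_token)  # both <|endoftext|>
print(tokenizer.model_max_length)                # 1024

enc = tokenizer("Training in progress", return_tensors="pt")
print(enc["input_ids"].shape)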
training_args.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a32ea2ef12beede46f0a0e389a80e6baabee1d453e1a781f52cc6fd941c1b56
3
+ size 3451
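training_args.bin is not a weights file: the Trainer serialises its TrainingArguments object with torch.save, which is why the blob is only 3451 bytes. A one-line sketch for reading it back, with the caveat that unpickling needs a transformers version compatible with the one that wrote it (4.23.0 per the config above); the path is an assumption, and newer torch releases may additionally require weights_only=False:

import torch

# Unpickles the TrainingArguments object saved by the Trainer; a compatible
# transformers installation must be importable at load time.
args = torch.load("training_args.bin")
print(args.learning_rate, args.max_steps, args.per_device_train_batch_size)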
vocab.json ADDED
The diff for this file is too large to render. See raw diff