Qi Wang
committed on
Commit • 3f8acd4
Parent(s): 9110136
Upload 6 files
- tokenizer.bin +3 -0
- tokenizer.model +2 -2
- tokenizer_config.json +1 -1
- tokenizer_word.json +0 -0
- training_log.txt +21 -0
tokenizer.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7d84b01069686fecae45d66b8d1468a2bdaa1b1b7221502e85b8b17bfacbec40
+size 466508
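For reference, tokenizer.bin is stored as a Git LFS pointer in the form shown above (version / oid sha256:<hex> / size <bytes>). Below is a minimal sketch, using only the Python standard library, of how one might check that a locally downloaded blob matches such a pointer; the file paths in the example are hypothetical and not part of this commit.

import hashlib
import os

def verify_lfs_pointer(pointer_path, blob_path):
    """Check a local blob against an LFS pointer's sha256 oid and size."""
    fields = {}
    with open(pointer_path, "r", encoding="utf-8") as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            fields[key] = value
    expected_oid = fields["oid"].split(":", 1)[1]   # "sha256:<hex>" -> "<hex>"
    expected_size = int(fields["size"])

    digest = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return (digest.hexdigest() == expected_oid
            and os.path.getsize(blob_path) == expected_size)

# Example (hypothetical paths):
# verify_lfs_pointer("tokenizer.bin.pointer", "tokenizer.bin")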
tokenizer.model
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:38e1f4e133816d9bf7acb1e63a55aff18c3a0f987bf7624552c4be3e5a8f08b6
+size 499194
tokenizer_config.json
CHANGED
@@ -19,7 +19,7 @@
 "single_word": false
 },
 "legacy": false,
-"
+"max_length": 4096,
 "model_max_length": 4096,
 "pad_token": null,
 "sp_model_kwargs": {},
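The added "max_length": 4096 sits next to the existing "model_max_length": 4096. A minimal sketch, assuming the saved files load with transformers' AutoTokenizer (this commit does not show any loading code), of how model_max_length governs truncation at encode time; the local path is illustrative.

from transformers import AutoTokenizer

# Hypothetical local path; adjust to wherever the tokenizer files live.
tok = AutoTokenizer.from_pretrained("../models/baby-chinese-llama2")

print(tok.model_max_length)  # expected to read 4096 from tokenizer_config.json

# With truncation=True and no explicit max_length, the tokenizer falls back to
# model_max_length, so encodings are capped at 4096 tokens.
ids = tok("这是一段中文测试文本。" * 2000, truncation=True)["input_ids"]
print(len(ids) <= 4096)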
tokenizer_word.json
ADDED
The diff for this file is too large to render.
training_log.txt
ADDED
@@ -0,0 +1,21 @@
+parameters: Namespace(corpus_dir='../datasets/online_novel//data.txt', output_dir='../models/baby-chinese-llama2', model_type='bpe', max_sentence_length=4096, vocab_size=32000, max_lines=1000000, shuffle_lines=True, pad_id=3, normalization_rule_name='identity', character_coverage=0.9995, action='export')
+
+trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
+trainer_interface.cc(537) LOG(INFO) all chars count=796343461
+trainer_interface.cc(548) LOG(INFO) Done: 99.95% characters are covered.
+trainer_interface.cc(558) LOG(INFO) Alphabet size=5013
+trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9995
+trainer_interface.cc(591) LOG(INFO) Done! preprocessed 800000 sentences.
+trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 800000
+trainer_interface.cc(608) LOG(INFO) Done! 1021909
+
+Raw corpus
+Total lines: 2995508
+Total tokens: 1827.13MB
+Mean: 610, Median: 606.0,
+5th percentile: 546.0,
+25th percentile: 580.0,
+75th percentile: 636.0,
+95th percentile: 686.0,
+99th percentile: 722.0,
+max: 2657
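The parameters line in training_log.txt reads like the argparse Namespace of a SentencePiece training script. Below is a minimal sketch of the corresponding sentencepiece call, assuming the logged values map one-to-one onto the standard trainer options; in particular, treating max_lines as input_sentence_size and the model_prefix path are assumptions, not taken from the repository's actual script.

import sentencepiece as spm

# Hypothetical reconstruction of the training call implied by the logged Namespace;
# paths and the max_lines -> input_sentence_size mapping are assumptions.
spm.SentencePieceTrainer.train(
    input='../datasets/online_novel/data.txt',
    model_prefix='../models/baby-chinese-llama2/tokenizer',
    model_type='bpe',
    vocab_size=32000,
    max_sentence_length=4096,
    character_coverage=0.9995,
    normalization_rule_name='identity',
    pad_id=3,
    input_sentence_size=1000000,       # max_lines in the log (assumed mapping)
    shuffle_input_sentence=True,
)

The logged character_coverage=0.9995 and Alphabet size=5013 are consistent with a Chinese corpus, where covering 99.95% of characters already requires several thousand symbols before BPE merges are learned.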