trainer_interface.cc(537) LOG(INFO) all chars count=796343461
trainer_interface.cc(548) LOG(INFO) Done: 99.95% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=5013
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 800000 sentences.
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 800000
trainer_interface.cc(608) LOG(INFO) Done! 1021909
corpus_dir='../datasets/online_novel//data.txt', output_dir='../models/baby-chinese-llama2/64k', model_type='bpe', max_sentence_length=4096, vocab_size=64000, max_lines=1000000, shuffle_lines=True, pad_id=3, normalization_rule_name='identity', character_coverage=0.9995, action='export')
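The argument dump above lines up with SentencePiece trainer options. The following is a minimal sketch of that mapping, assuming the script forwards its arguments to `sentencepiece.SentencePieceTrainer.train`. The helper names `build_trainer_kwargs` and `train_tokenizer`, the `tokenizer` model prefix, and the mapping of `max_lines` and `shuffle_lines` onto `input_sentence_size` and `shuffle_input_sentence` are assumptions for illustration, not taken from the original script.

```python
import os


def build_trainer_kwargs(corpus_dir, output_dir, model_type="bpe",
                         max_sentence_length=4096, vocab_size=64000,
                         max_lines=1000000, shuffle_lines=True, pad_id=3,
                         normalization_rule_name="identity",
                         character_coverage=0.9995):
    """Translate script-level arguments into SentencePiece trainer options."""
    return {
        "input": corpus_dir,
        # Hypothetical prefix: the real output file name is not shown in the log.
        "model_prefix": os.path.join(output_dir, "tokenizer"),
        "model_type": model_type,
        "vocab_size": vocab_size,
        "character_coverage": character_coverage,
        "max_sentence_length": max_sentence_length,
        # Assumed mapping: cap on sampled lines and whether to shuffle them.
        "input_sentence_size": max_lines,
        "shuffle_input_sentence": shuffle_lines,
        "pad_id": pad_id,
        "normalization_rule_name": normalization_rule_name,
    }


def train_tokenizer(corpus_dir, output_dir, **options):
    """Train a tokenizer; requires the `sentencepiece` package."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    os.makedirs(output_dir, exist_ok=True)
    spm.SentencePieceTrainer.train(
        **build_trainer_kwargs(corpus_dir, output_dir, **options)
    )
```

Calling `train_tokenizer('../datasets/online_novel//data.txt', '../models/baby-chinese-llama2/64k')` would then start a run with the defaults shown in the dump.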
Total tokens: 1686.54MB
Mean: 563.0230508481366, Median: 559.0,
5th percentile: 504.0,
25th percentile: 535.0,
75th percentile: 587.0,
95th percentile: 634.0,
99th percentile: 669.0,
max: 2657
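A summary like the one above can be reproduced from a list of per-sample token counts. The sketch below assumes that is what the figures describe. It uses Python's standard `statistics` module, which may differ from what the original script used, and the helper name `summarize_lengths` is hypothetical.

```python
import statistics


def summarize_lengths(lengths):
    """Return mean, median, selected percentiles, and max of token counts."""
    if len(lengths) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns 99 cut points; cut point p-1 is the p-th percentile.
    # method="inclusive" interpolates linearly between the observed min and max.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    summary = {
        "mean": statistics.fmean(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }
    for p in (5, 25, 75, 95, 99):
        summary[f"p{p}"] = cuts[p - 1]
    return summary


if __name__ == "__main__":
    # Toy token counts for illustration only; not data from the run above.
    toy_lengths = [10, 20, 30, 40, 50]
    for name, value in summarize_lengths(toy_lengths).items():
        print(f"{name}: {value}")
```

A maximum far above the 99th percentile, as in the figures above (2657 versus 669.0), points to a small number of unusually long samples rather than a broad tail.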