Alyosha11 committed on
Commit e43cafd · verified · 1 Parent(s): 6bfd4dd

Upload token_sharing_trainer.py with huggingface_hub

Files changed (1)
  1. token_sharing_trainer.py +59 -0
token_sharing_trainer.py ADDED
@@ -0,0 +1,59 @@
+ from datasets import load_dataset
+ from models.bpe_trainer import BpeTrainer
+ from tqdm import tqdm
+
+ # Load the native-script and phonemized CulturaX bn-hi corpora (500K samples x 2).
+ raw_ds = load_dataset("parquet", data_files={'train': 'data/culturaX_bnhi_500Kx2.parquet'})
+ raw_ds = raw_ds['train']
+
+ phn_ds = load_dataset("parquet", data_files={'train': 'data/culturaX_bnhi_500Kx2_phonemized.parquet'})
+ phn_ds = phn_ds['train']
+
+ # Full sweep: vocab_sizes = [size for size in range(2000, 34000, 2000)]
+ vocab_sizes = [16000]
+
+ # Train one BPE tokenizer per vocab size on each corpus (native script and phonemized).
+ for vocab_size in tqdm(vocab_sizes):
+     BpeTrainer(dataset=raw_ds, vocab_size=vocab_size, batch_size=50000,
+                output_dir=f"trained_tokenizers/multi/multi_raw_bnhi_bpetokenizer_{vocab_size//1000}K")
+     BpeTrainer(dataset=phn_ds, vocab_size=vocab_size, batch_size=50000,
+                output_dir=f"trained_tokenizers/multi/multi_phn_bnhi_bpetokenizer_{vocab_size//1000}K")
+
+ # Vocab-size reasoning:
+ # 8K for one language with a native-script tokenizer
+ # < 8K for one language with a phonemized tokenizer
+ # 16K for 2 languages (mutually exclusive: the scripts use different characters)
+ # How much lower than 16K can we go? Lower limit: 8K.
+ # Somewhere between 8K and 16K (e.g. 12K), the phonemized tokenizer had the same FS as the 16K one.
+
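+ # --- Illustrative sketch (added for clarity; not part of the original script) ---
+ # One way the FS comparison could be measured, assuming FS means fertility
+ # (tokens per whitespace-separated word) and that each BpeTrainer output dir
+ # contains a Hugging Face `tokenizer.json`. Both are assumptions, so adjust
+ # the loading step and the text column name to the actual setup.
+ from tokenizers import Tokenizer
+
+ def fertility(tokenizer_path, texts):
+     # Average number of tokens produced per whitespace word over `texts`.
+     tok = Tokenizer.from_file(tokenizer_path)
+     n_tokens = sum(len(tok.encode(t).ids) for t in texts)
+     n_words = sum(len(t.split()) for t in texts)
+     return n_tokens / max(n_words, 1)
+
+ # Example usage (hypothetical path and column name):
+ # sample = raw_ds['text'][:1000]
+ # print(fertility("trained_tokenizers/multi/multi_raw_bnhi_bpetokenizer_16K/tokenizer.json", sample))
+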
+ '''
+ Benchmarking how much time phonemization takes:
+ NUM_SAMPLES = 50,000
+ Convert to text
+ Phonemization script
+
+ time command_for_script
+
+ time/500000
+
+ ------------------------------------------------
+
+ Prep data:
+ Native script = taken directly from Sangraha
+ Phonemization
+
+ HF dataset --> convert to text files and store in a dir --> phonemization script -->
+ phonemized text files --> convert back to an HF dataset (parquet format)
+ (a rough sketch of this conversion is appended below)
+
+ ------------------------------------------------
+
+ 1st exp:
+
+ Hi, Phn_Hi --> Plot FS from vocab size 4K to 16K. Train 12 tokenizers.
+ Ur, Phn_Ur --> Plot FS from vocab size 4K to 16K.
+
+ 2nd exp:
+
+ HiUr, Phn_HiUr --> Plot FS from vocab size 8K to 16K. 8 tokenizers in total.
+ '''
+
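+ # --- Illustrative sketch of the prep-data pipeline described in the notes above ---
+ # HF dataset -> text files in a dir -> (external phonemization script) ->
+ # phonemized text files -> back to an HF dataset stored as parquet.
+ # The directory names, the 'text' column, and the phonemization command are
+ # assumptions for illustration, not the project's actual setup.
+ import os
+ from datasets import Dataset
+
+ def dataset_to_text_files(ds, out_dir, column="text", shard_size=50_000):
+     # Dump the dataset column into plain-text shard files, one sample per line.
+     os.makedirs(out_dir, exist_ok=True)
+     for i in range(0, len(ds), shard_size):
+         shard = ds[i:i + shard_size][column]
+         with open(os.path.join(out_dir, f"shard_{i // shard_size:05d}.txt"), "w", encoding="utf-8") as f:
+             f.write("\n".join(shard) + "\n")
+
+ def text_files_to_parquet(in_dir, parquet_path, column="text"):
+     # Read the (phonemized) text files back and store them as a parquet HF dataset.
+     lines = []
+     for name in sorted(os.listdir(in_dir)):
+         if name.endswith(".txt"):
+             with open(os.path.join(in_dir, name), encoding="utf-8") as f:
+                 lines.extend(line.rstrip("\n") for line in f)
+     Dataset.from_dict({column: lines}).to_parquet(parquet_path)
+
+ # Example usage (hypothetical paths; the phonemization step itself is external):
+ # dataset_to_text_files(raw_ds, "data/raw_txt")
+ # ... run the phonemization script over data/raw_txt, writing data/phn_txt ...
+ # text_files_to_parquet("data/phn_txt", "data/culturaX_bnhi_500Kx2_phonemized_rebuilt.parquet")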