ldilov committed
Commit
76b7208
1 Parent(s): 59b2a92

Tokenizer v1.1

Files changed (3)
  1. README.md +3 -4
  2. tokenizer.json +0 -0
  3. tokenizer_config.json +47 -8
README.md CHANGED
@@ -127,7 +127,6 @@ The tokenizer training approach showcases a sophisticated and advanced methodolo
 
 - **Dynamic Dropout**: The tokenizer training process doesn't use a predefined `dropout`; instead it calculates a specifically tailored value on the fly, based on the current training dataset. This ensures that the tokenizer model can generalize better by putting more weight on context rather than specifics, which is beneficial at a later stage when fine-tuning an LLM with this tokenizer.
 - **Dynamic Adaptation**: The ability to dynamically adjust tokenization parameters (like `min_frequency`) based on dataset analysis ensures that the tokenizer remains effective across different text domains.
- - **Dynamic Tokens**: The dataset is divided into chunks, and each chunk is analyzed to count the occurrences of each token. This is done across all chunks in parallel, and the results are aggregated. A threshold (e.g., `0.0005`) is applied to identify tokens that constitute a small fraction of the total words in the dataset. Tokens below this threshold are considered rare or dynamic. From these dynamic tokens, the top `k` tokens with the highest counts (but still under the threshold) are selected and added manually to the tokenizer's vocabulary, so that the tokenizer can focus its attention on the most relevant rare tokens. Dynamic tokens often include terminology, names, or concepts specific to a dataset's domain. Their inclusion in the tokenizer's vocabulary allows the LLM to capture and understand these unique elements more effectively, leading to improved performance in tasks requiring deep domain knowledge or contextual nuance.
 - **Sophisticated Evaluation**: The inclusion of a detailed evaluation mechanism enables continuous assessment and improvement of the tokenizer's performance, ensuring high accuracy and reliability.
 - **Number Bucketing**: Numbers in the text are categorized into predefined "buckets" based on their value. The bucketing process involves dividing the number space into several ranges (or buckets) and assigning each number to a specific bucket. Each bucket is represented by its own token that follows a specific convention. Common years (e.g., 1900-2025) and ages (e.g., 1-100) are exceptions to this rule; they are represented the way they are written. This reduces sparsity and improves generalization without overfitting to specific values.
 - **URL Replacement**: URLs in the text are identified using a regular expression for common URL patterns and replaced with a special token `<url>`. Replacing varied URLs with a single token prevents the model from overfitting to specific web addresses, which are usually not relevant to understanding the text's general context. URLs can introduce a vast number of unique tokens into the vocabulary; replacing them with a single token significantly simplifies the model's vocabulary. By abstracting away the specifics of URLs, models can focus more on the actual textual content.
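
The **Number Bucketing** and **URL Replacement** items in the hunk above can be pictured with a short sketch. This is not the repository's actual preprocessing code: the regex, the digit-count bucket convention, and the bucket token naming are assumptions for illustration; only the `<url>` token and the year/age exceptions come from the README text.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # assumed pattern for "common URL patterns"

def bucket_number(n: int) -> str:
    """Map a number to a bucket token; common years and ages stay as written."""
    if 1900 <= n <= 2025 or 1 <= n <= 100:       # exceptions named in the README
        return str(n)
    magnitude = len(str(abs(n)))                  # assumed convention: bucket by digit count
    return f"<num_{magnitude}d>"

def preprocess(text: str) -> str:
    text = URL_RE.sub("<url>", text)              # collapse every URL into the single <url> token
    return re.sub(r"\d+", lambda m: bucket_number(int(m.group())), text)

print(preprocess("Visit https://example.com for details: 42 people paid 125000 in 2021."))
# -> Visit <url> for details: 42 people paid <num_6d> in 2021.
```

Whatever the real bucket boundaries are, the effect is the same: many distinct literal numbers and web addresses collapse into a small, stable set of tokens.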
@@ -152,7 +151,7 @@ The evaluation results after training and testing the tokenizer with 5,000 rando
 
 | Version | Vocab Size | Loss                | Training Time (seconds) |
 |---------|------------|---------------------|--------------------------|
-| v1      | 32,000     | 0.08300010458445718 | 9188.8694                |
+| v1.1    | 32,000     | 0.00791045752872809 | 9188.8694                |
 
 ### **Interpreting the Evaluation Score**
 
@@ -162,9 +161,9 @@ The evaluation results after training and testing the tokenizer with 5,000 rando
 
 ### **Mathematical Significance of the Evaluation Score**
 
-From the definition of the `Levenshtein distance` => **On average, the edits necessary to recover the original text from the detokenized output account for `8.3%` of the length of the original texts.**
+From the definition of the `Levenshtein distance` => **On average, the edits necessary to recover the original text from the detokenized output account for `0.79%` of the length of the original texts.**
 
-The loss value of `0.08300010458445718` suggests that the tokenizer performs well in maintaining the integrity of the text through the tokenization process and sustains a high level of fidelity. Mathematically, this low loss score signifies a high degree of similarity between the original and detokenized texts, demonstrating the tokenizer's effectiveness. The process of detokenization (converting tokenized representations back into their original text form) does not always guarantee a 1:1 exact match to the original text. While the goal of detokenization is to reconstruct the original text as closely as possible, minor differences can occur. These variances are generally acceptable and sometimes inevitable.
+The loss value of `0.00791045752872809` suggests that the tokenizer performs well in maintaining the integrity of the text through the tokenization process and sustains a high level of fidelity. Mathematically, this low loss score signifies a high degree of similarity between the original and detokenized texts, demonstrating the tokenizer's effectiveness. The process of detokenization (converting tokenized representations back into their original text form) does not always guarantee a 1:1 exact match to the original text. While the goal of detokenization is to reconstruct the original text as closely as possible, minor differences can occur. These variances are generally acceptable and sometimes inevitable.
 Most NLP models and applications can tolerate some level of discrepancy between original and processed texts.
 
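
The wording above suggests the reported loss is a length-normalized Levenshtein distance averaged over the evaluation texts. A minimal sketch of that metric, assuming a `tokenizers`-style `encode(...).ids` / `decode(...)` round trip and the `python-Levenshtein` package (both are assumptions, not the repository's actual evaluation code):

```python
import Levenshtein  # pip install python-Levenshtein (any edit-distance implementation works)

def tokenizer_loss(tokenizer, texts):
    """Average edit distance between each text and its detokenized round trip,
    normalized by the original text length."""
    total = 0.0
    for text in texts:
        decoded = tokenizer.decode(tokenizer.encode(text).ids)   # tokenize, then detokenize
        total += Levenshtein.distance(text, decoded) / max(len(text), 1)
    return total / len(texts)

# A score of ~0.0079 means the round trip changes roughly 0.79% of the characters on average.
```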
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,9 +1,48 @@
 {
-  "bos_token": "<s>",
-  "clean_up_tokenization_spaces": false,
-  "eos_token": "</s>",
-  "model_max_length": 1000000000000000019884624838656,
-  "pad_token": "<s>",
-  "tokenizer_class": "PreTrainedTokenizerFast",
-  "unk_token": "<unk>"
-}
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<unk>",
+    "<s>",
+    "</s>"
+  ],
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "legacy": true,
+  "loss_score": 0.00791045752872809,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<s>",
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false,
+  "use_fast": true,
+  "version": 1.1
+}
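
For context on the config change above: the new `tokenizer_config.json` switches `tokenizer_class` to `LlamaTokenizer`, registers `<unk>`/`<s>`/`</s>` as special tokens, and enables `add_bos_token`, so the files load through the standard `transformers` API. A minimal usage sketch; the repo id below is a placeholder, not something stated in this commit:

```python
from transformers import AutoTokenizer

# Placeholder path; substitute the actual model repository or a local directory
# containing tokenizer.json and tokenizer_config.json.
tok = AutoTokenizer.from_pretrained("ldilov/tokenizer-v1.1")

ids = tok("Numbers like 2021 and links like <url> are handled upstream.").input_ids
print(tok.convert_ids_to_tokens(ids))           # with add_bos_token=true, the sequence starts with <s>
print(tok.decode(ids, skip_special_tokens=True))
```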