nreimers committed on
Commit
01caffa
·
1 Parent(s): f19f09a
Files changed (1)
  1. README.md +26 -1
README.md CHANGED
@@ -23,4 +23,29 @@ print(tokens)
23
 
24
  number_of_tokens = len(enc['input_ids'])
25
  print("Number of tokens:", number_of_tokens)
26
- ```
+ ```
27
+
28
+ ## Computing number of tokens
29
+
30
+ The following values can be used to approximate the number of tokens from the number of input characters:
31
+ ```
32
+ approx_number_of_tokens = len(input_text) / ratio
33
+ ```
34
+
35
+ E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.
36
+
37
+ | Language | Avg. characters per token |
38
+ | --- | :---: |
39
+ | ar | 3.6 |
40
+ | de | 4.6 |
41
+ | en | 4.8 |
42
+ | es | 4.6 |
43
+ | fr | 4.4 |
44
+ | hi | 3.8 |
45
+ | it | 4.5 |
46
+ | ja | 1.3 |
47
+ | ko | 2.0 |
48
+ | zh | 1.1 |
49
+
50
+
51
+ These values were computed on the first 10,000 paragraphs from [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values might differ.
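
Not part of the commit: the approximation described in the added README section could be sketched in Python roughly as follows. The ratios are copied from the table in the diff; the names `CHARS_PER_TOKEN` and `approx_number_of_tokens`, the `ValueError` for unlisted languages, and the choice to round up are assumptions made for illustration only.

```python
import math

# Average characters per token, copied from the README table.
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}


def approx_number_of_tokens(input_text: str, lang: str = "en") -> int:
    """Estimate the token count of input_text from its character length."""
    if lang not in CHARS_PER_TOKEN:
        # Assumption: fail loudly instead of guessing a ratio for unlisted languages.
        raise ValueError(f"No ratio available for language {lang!r}")
    ratio = CHARS_PER_TOKEN[lang]
    # Assumption: round up so the estimate errs on the high side.
    # The README formula itself leaves the result as a float.
    return math.ceil(len(input_text) / ratio)


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    print("Approximate number of tokens:", approx_number_of_tokens(sample, "en"))
```

Because the ratios are averages over Wikipedia paragraphs, the result is only a rough estimate. When an exact count matters, tokenizing the text and taking `len(enc['input_ids'])`, as in the README snippet above the new section, remains the reliable route.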