add stats
README.md
CHANGED
@@ -23,4 +23,29 @@ print(tokens)
 
 number_of_tokens = len(enc['input_ids'])
 print("Number of tokens:", number_of_tokens)
-```
+```
+
+## Computing number of tokens
+
+The following values can be used to approximate the number of tokens given the number of input characters:
+```
+approx_number_of_tokens = len(input_text) / ratio
+```
+
+E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.
+
+| Language | Avg. characters per token |
+| --- | :---: |
+| ar | 3.6 |
+| de | 4.6 |
+| en | 4.8 |
+| es | 4.6 |
+| fr | 4.4 |
+| hi | 3.8 |
+| it | 4.5 |
+| ja | 1.3 |
+| ko | 2.0 |
+| zh | 1.1 |
+
+
+These values have been computed on the first 10,000 paragraphs from [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values might change.
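As a minimal sketch of how the ratios added in this change could be applied, the snippet below wraps the formula `len(input_text) / ratio` in a small helper. The names `AVG_CHARS_PER_TOKEN` and `approx_number_of_tokens` (as a function) are hypothetical and not part of the README; the ratios are copied from the table, and the result is only an estimate that may drift on text unlike the Wikipedia sample.

```python
# Average characters per token, copied from the README table.
# The dictionary name and the helper below are illustrative assumptions.
AVG_CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}


def approx_number_of_tokens(input_text: str, language: str = "en") -> float:
    """Estimate the token count as len(input_text) / ratio for the given language."""
    try:
        ratio = AVG_CHARS_PER_TOKEN[language]
    except KeyError:
        raise ValueError(f"No ratio available for language {language!r}") from None
    return len(input_text) / ratio


if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog."
    print("Approx. number of tokens:", approx_number_of_tokens(text, "en"))
```

The helper returns a float rather than rounding, so callers can decide whether to round up (for example, when budgeting against a context limit) or to the nearest integer.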