Commit e77e072 (verified) by rishiraj · 1 Parent(s): 669d0ea

Update README.md

Files changed (1): README.md (+41 -4)
README.md CHANGED
@@ -10,8 +10,45 @@ tags:
 - gemma
 ---
 
-Number of tokens in google/gemma-2-9b: 256000
-Number of tokens in rishiraj/gemma-2-9b-bn: 392402
 
-Why fewer tokens than English?
-While Bengali is very expressive and flexible, it hasn't undergone as much global influence as English in terms of absorbing new words from many different languages.
+This repository extends the `google/gemma-2-9b` tokenizer by training it on Bengali text.
+
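+The training procedure itself is not documented here, so the following is only a plausible sketch of how such an extension could be produced with the `transformers` fast-tokenizer API. The corpus file `bengali_corpus.txt` and the target size of 150,000 Bengali subwords are assumed placeholders:
+
+```python
+from transformers import AutoTokenizer
+
+# Sketch only: the corpus path and vocab_size below are illustrative assumptions.
+base = AutoTokenizer.from_pretrained("google/gemma-2-9b")
+
+def corpus_lines(path="bengali_corpus.txt"):
+    with open(path, encoding="utf-8") as f:
+        for line in f:
+            yield line.strip()
+
+# Learn a tokenizer of the same type from the Bengali corpus
+bn = base.train_new_from_iterator(corpus_lines(), vocab_size=150_000)
+
+# Merge in only the tokens the base vocabulary lacks
+base_vocab = set(base.get_vocab())
+new_tokens = [tok for tok in bn.get_vocab() if tok not in base_vocab]
+base.add_tokens(new_tokens)
+base.save_pretrained("gemma-2-9b-bn")
+```
+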
+## Token Information
+
+| Tokenizer                | Vocabulary Size |
+|--------------------------|-----------------|
+| `google/gemma-2-9b`      | 256,000         |
+| `rishiraj/gemma-2-9b-bn` | 392,402         |
+
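+These sizes can be verified directly; note that `len()` counts the full vocabulary, including added tokens:
+
+```python
+from transformers import AutoTokenizer
+
+# Print the vocabulary size of each tokenizer in the table above
+for repo in ["google/gemma-2-9b", "rishiraj/gemma-2-9b-bn"]:
+    tokenizer = AutoTokenizer.from_pretrained(repo)
+    print(repo, len(tokenizer))
+```
+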
+### Why Fewer New Tokens for Bengali?
+Growing the vocabulary from 256,000 to 392,402 entries adds 136,402 Bengali tokens, fewer than the original vocabulary contains. While Bengali is very expressive and flexible, it has not absorbed words from as many other languages as English has, so a comparatively small vocabulary covers it well.
+
+## Tokenizer Comparison
+
+**Text** ("I am a good boy and I like to play football."):
+
+```text
+আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি
+```
+
+| Tokenizer         | Token Count | Output |
+|-------------------|-------------|--------|
+| `gemma_tokenizer` | 26          | ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি'] |
+| `our_tokenizer`   | 10          | ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি'] |
+
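+The comparison can be reproduced in a few lines; the `gemma_tokenizer` and `our_tokenizer` names mirror the rows above:
+
+```python
+from transformers import AutoTokenizer
+
+text = "আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি"
+
+gemma_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
+our_tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
+
+# The base tokenizer falls back to characters and short fragments;
+# the extended one keeps whole Bengali words intact.
+print(len(gemma_tokenizer.tokenize(text)), gemma_tokenizer.tokenize(text))
+print(len(our_tokenizer.tokenize(text)), our_tokenizer.tokenize(text))
+```
+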
+## Usage
+
+1. Install dependencies:
+
+```bash
+pip install transformers
+```
+
+2. Load and use the tokenizer:
+
+```python
+from transformers import AutoTokenizer
+
+# Download the extended tokenizer from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
+
+# Split a Bengali sentence into tokens
+tokens = tokenizer.tokenize("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি")
+print(tokens)
+```
+
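+Since the added tokens are not in the base model's embedding matrix, pairing this tokenizer with `google/gemma-2-9b` requires resizing the embeddings first (a sketch; the new rows are randomly initialized until fine-tuned):
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
+model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
+
+# Grow the input/output embeddings to match the extended vocabulary
+model.resize_token_embeddings(len(tokenizer))
+```
+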
+## Conclusion
+
+The original `gemma_tokenizer` splits many Bengali words into small subword pieces (26 tokens for the example sentence above), which inflates sequence length and fragments meaning. Our extended tokenizer preserves word boundaries (10 tokens for the same sentence), yielding shorter and more meaningful representations of Bengali text.