Update README.md
README.md
CHANGED
@@ -23,16 +23,6 @@ Unveiling the debut of Gujju-Llama 7B Base model, offering researchers and devel
 - **Training Precision** float16
 - **License:** GNU General Public License v3.0
 
-### Gujarati Tokenization
-
-- Prior to pre-training, the base Llama-2 model lacked the ability to recognize Gujarati characters. As illustrated in the tokenization process below:
-- Sample English sentence:- ISRO created history.
-- Sample Gujarati sentence:- ઈસરોએ ઈતિહાસ રચ્યો.
-
-**Base Llama-2 Tokenization:** ['▁', '<0xE0>', '<0xAA>', '<0x88>', '<0xE0>', '<0xAA>', '<0xB8>', '<0xE0>', '<0xAA>', '<0xB0>', '<0xE0>', '<0xAB>', '<0x8B>', '<0xE0>', '<0xAA>', '<0x8F>', '▁', '<0xE0>', '<0xAA>', '<0x88>', '<0xE0>', '<0xAA>', '<0xA4>', '<0xE0>', '<0xAA>', '<0xBF>', '<0xE0>', '<0xAA>', '<0xB9>', 'ા', '<0xE0>', '<0xAA>', '<0xB8>', '▁', '<0xE0>', '<0xAA>', '<0xB0>', '<0xE0>', '<0xAA>', '<0x9A>', '<0xE0>', '<0xAB>', '<0x8D>', '<0xE0>', '<0xAA>', '<0xAF>', '<0xE0>', '<0xAB>', '<0x8B>', '.']
-
-**Gujju-Llama Tokenization:** ['ઈસ', 'રો', 'એ', '▁ઈ', 'તિ', 'હા', 'સ', '▁રચ્યો', '.']
-
 ## Usage Note
 
 These models possess impressive linguistic skills, but it's important to remember they haven't been specifically optimized to avoid potentially harmful or offensive content. To mitigate this risk, we advise users to:
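
Note on the removed tokenization example: the `<0xNN>` entries in the Base Llama-2 output look like byte-fallback tokens, one per UTF-8 byte of a character missing from the vocabulary. The sketch below is not the Llama-2 tokenizer; it only reproduces that byte expansion with the standard library, and `byte_fallback_tokens` is a hypothetical helper name. It assumes every character falls back to bytes, whereas the real output above keeps at least one Gujarati sign ('ા') as a single token.

```python
from typing import List


def byte_fallback_tokens(text: str) -> List[str]:
    """Render each UTF-8 byte of `text` in the `<0xNN>` style seen in the diff.

    Hypothetical helper for illustration only: it assumes no character of
    `text` is present in the tokenizer vocabulary.
    """
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


word = "ઈસરોએ"  # first word of the sample Gujarati sentence
tokens = byte_fallback_tokens(word)

# Each Gujarati code point in this word occupies three UTF-8 bytes,
# so 5 code points expand to 15 byte-level tokens.
print(len(word), len(tokens))
print(tokens[:3])
```

The 15 tokens produced for this word match the entries that follow the first '▁' in the removed Base Llama-2 line, which is why the base tokenizer needs about three tokens per Gujarati character while the extended Gujju-Llama vocabulary covers the same word in three tokens.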