# Cohere multilingual-22-12 tokenizer
This is the tokenizer for the Cohere multilingual-22-12 embedding model (Cohere Multilingual Embeddings).

You can load it with the `transformers` library like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")

text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Invert the vocabulary to map token IDs back to token strings
inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
tokens = [inv_vocab[token_id] for token_id in enc['input_ids']]
print("Tokens:")
print(tokens)

number_of_tokens = len(enc['input_ids'])
print("Number of tokens:", number_of_tokens)
```
## Computing the number of tokens

The following values can be used to approximate the number of tokens for a given number of input characters:

`approx_number_of_tokens = len(input_text) / ratio`

E.g. for English: `approx_number_of_tokens = len(input_text) / 4.8`
| Language | Avg. characters per token |
|---|---|
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |
These values were computed on the first 10,000 paragraphs from Wikipedia. For other datasets, these values may differ.
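The ratios above can be wrapped in a small helper to estimate token counts without loading the tokenizer at all. A minimal sketch, assuming the table's values; the `estimate_tokens` function and its fallback ratio for unlisted languages are illustrative, not part of the model card:

```python
# Average characters per token, taken from the table above
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def estimate_tokens(text: str, lang: str, default_ratio: float = 4.0) -> int:
    """Approximate the token count from the character count.

    Uses the table's ratio for `lang`; falls back to `default_ratio`
    (an arbitrary middle-ground value) for languages not in the table.
    """
    ratio = CHARS_PER_TOKEN.get(lang, default_ratio)
    return round(len(text) / ratio)

print(estimate_tokens("Hello World, this is my input string!", "en"))  # → 8
```

This is only a heuristic for capacity planning (e.g. checking input-length limits); for an exact count, tokenize the text as shown above.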