This is a tokenizer only, with the following modifications:
- Replaced `[unused0]`, `[unused1]`, `[unused2]` with `[ES]`, `[DE]`, `[FR]` respectively in the vocabulary.
- Added `[ES]`, `[DE]`, `[FR]` as special tokens, so they won't be lowercased or split (see the sketch below).
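A minimal sketch of how such a modification could be applied, assuming a BERT-style WordPiece tokenizer from 🤗 Transformers; the base checkpoint name `bert-base-uncased` is an assumption for illustration, not taken from this card:

```python
from transformers import BertTokenizer

# Hypothetical base checkpoint: the card does not name the original
# tokenizer, so bert-base-uncased is an assumption.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rename the first three [unusedN] slots to the language markers,
# keeping their original vocabulary ids.
replacements = {"[unused0]": "[ES]", "[unused1]": "[DE]", "[unused2]": "[FR]"}
for old, new in replacements.items():
    tokenizer.vocab[new] = tokenizer.vocab.pop(old)
tokenizer.ids_to_tokens = {idx: tok for tok, idx in tokenizer.vocab.items()}

# Register the markers as special tokens so the tokenizer neither
# lowercases nor WordPiece-splits them.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[ES]", "[DE]", "[FR]"]}
)

print(tokenizer.tokenize("[ES] Hola mundo"))
# e.g. ['[ES]', 'hola', 'mundo'] -- [ES] survives intact, un-lowercased
```

Because the markers reuse existing `[unusedN]` vocabulary slots, the vocabulary size and any downstream embedding matrix stay unchanged.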