Weird token in the tokenizer?
I'm looking at the tokenizer.json and saw a strange thing at 922:
"',": 920,
"▁out": 921,
"▁ا": 922,
"block": 923,
"ies": 924,
"lay": 925,
"▁his": 926,
Is this an error that's possibly causing other issues?
Also here:
"ien": 1145,
"IC": 1146,
"▁ال": 1147,
"▁/": 1148,
"str": 1149,
"▁mu": 1150,
(Manually looking through, not programmatically, so these are just examples)
Those are Arabic letters, so it could just be the formatting that made them look odd to you.
Here, I'll try adding Arabic letters after this and see if we can replicate it: كما هكد_1147 like so. Try copying it and see.
Not sure if it's replicating, but I think I see what's going on. Mixing right to left and left to right makes it look like the : is in the quotes along with the number, but if I actually highlight it, the structure is not what it looks like.
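In case it helps anyone else who lands here, below is a rough sketch of checking this programmatically instead of by eye. It parses a small excerpt with Python's `json` module and flags tokens containing right-to-left characters via `unicodedata.bidirectional`. The `snippet` string and the `is_rtl` helper are my own illustration (assumptions, not part of the tokenizer or its tooling); for a real check you would load the full tokenizer.json instead.

```python
import json
import unicodedata

# Hypothetical excerpt mirroring the entries quoted above; a real check would
# load the full tokenizer.json rather than this hand-copied string.
snippet = '{"▁out": 921, "▁ا": 922, "block": 923, "▁ال": 1147, "▁/": 1148}'
vocab = json.loads(snippet)


def is_rtl(token):
    """Return True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


# Keep only the tokens that contain right-to-left characters.
rtl_tokens = {tok: idx for tok, idx in vocab.items() if is_rtl(tok)}

# Print ids and code points, which are unaffected by how a viewer renders bidi text.
for tok, idx in rtl_tokens.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in tok)
    print(idx, codepoints)
```

If the excerpt parses and the ids come back as plain integers mapped to Arabic code points, the structure is fine and only the on-screen rendering is misleading.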
Wow, TIL 🐣 did not know this could show up like this in the vocab!
thanks for sharing the finding! :)
Hi @Lambent and @Lyte, could you please confirm whether this issue is resolved by the comments above? If so, we can close it; otherwise, let us know if you have any concerns. Thank you.
yep, it was just an issue with the formatting of the Arabic characters that often appear strangely in text format due to the right-to-left writing style, but it's nothing serious, so please feel free to close it.