Training dataset

#2
by HappyLemon - opened

Do I understand correctly that SONAR was trained on short sentences? The paper says it used the same training data as NLLB (https://arxiv.org/pdf/2207.04672), which is FLORES-200, right? And does that dataset consist of 3001 sentences translated into 200 languages, with an average length of 21 words?
