About context size and difference in quality

#1
by droussis - opened

Hi,

I wanted to ask 2 things:
(1) To verify whether the model's context length is the same as XLM-R's (512, I think), and to ask whether you have any experience using it to score large parallel segments, e.g., with >300 words, by splitting them into smaller chunks/sentences.
(2) Whether this "full-large" version differs significantly from the "full" versions for each language pair. For example, since "full-en-el" is already available, should this version be much better given its higher computational requirements? I am asking for your opinion in case you've performed internal evaluations.

Thanks for the great work and your time!

droussis changed discussion title from About context size to About context size and difference in quality
Bitextor Team org
  1. Unfortunately, the current batching implementation in bicleaner-ai is line-based, not token-based. That means we have to limit the context length to 150/200 tokens to avoid possible OOM errors, because all the samples in a batch are padded to the maximum length in the batch. However, you can experiment with a larger context length if you adjust the max-len parameter in the config and set a batch size that lets every batch fit into memory. I haven't tried it myself, but it is one of the TODOs for the short/mid term.
  2. It is not 100% clear yet, but our first experiments show that even the full-en-xx models may be comparable to, or slightly better than, each of the bilingual models. The large version is probably better still.
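The padding issue described in point 1 can be worked around by capping the *padded* size of each batch rather than the number of lines. This is only an illustrative sketch of that idea (none of these names are bicleaner-ai internals), assuming a crude whitespace token count:

```python
# Sketch of token-based batching: cap the total padded size
# (lines in batch * longest line in batch) instead of the line count,
# so memory stays roughly constant even with a longer context length.
# Illustrative only; not part of the bicleaner-ai codebase.

def token_batches(lines, max_padded_tokens=1000):
    """Group lines into batches whose padded size never exceeds the cap."""
    batch = []
    batch_max = 0  # longest line seen in the current batch
    for line in lines:
        n = len(line.split())  # crude token count for the sketch
        new_max = max(batch_max, n)
        if batch and new_max * (len(batch) + 1) > max_padded_tokens:
            yield batch
            batch, batch_max = [], 0
            new_max = n
        batch.append(line)
        batch_max = new_max
    if batch:
        yield batch
```

With this scheme, a single very long segment simply ends up in a small (even singleton) batch instead of blowing up the padded size of a large one.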

Thank you very much for your quick reply!

So far, I've tried splitting the parallel documents into sentences, mining sentence pairs with LASER and determining a mean LASER score.
BicleanerAI could be used in the same way I guess, but I'd have to modify it to also work with JSONL files or HF datasets.
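The JSONL-to-bicleaner-ai adaptation could be sketched roughly like this: flatten each document's pre-aligned sentence pairs into the tab-separated rows bicleaner-ai scores, then average the per-sentence scores back per document. The JSONL field names ("doc_id", "src", "tgt") are assumptions for illustration, not a fixed schema:

```python
import json
import statistics

def jsonl_to_tsv(jsonl_lines):
    """Yield (doc_id, "src\ttgt") rows ready for sentence-level scoring.

    Assumes each JSONL record holds parallel lists of pre-aligned
    sentences; the field names are illustrative, not a real schema.
    """
    for line in jsonl_lines:
        rec = json.loads(line)
        for src, tgt in zip(rec["src"], rec["tgt"]):
            yield rec["doc_id"], f"{src}\t{tgt}"

def mean_doc_scores(doc_ids, scores):
    """Average per-sentence classifier scores back into one score per document."""
    by_doc = {}
    for doc_id, score in zip(doc_ids, scores):
        by_doc.setdefault(doc_id, []).append(score)
    return {d: statistics.mean(s) for d, s in by_doc.items()}
```

This mirrors the mean-LASER-score approach: score sentence pairs individually, then aggregate per document.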

I'll let you know if/when I experiment with what you've proposed. Closing for now.

droussis changed discussion status to closed
Bitextor Team org

If you are extracting bitext with LASER, we already have a pipeline where bicleaner-ai is integrated. Bitextor can also use embeddings (LaBSE or any other model you might want to use) for document and sentence alignment: https://github.com/bitextor/bitextor/blob/master/docs/CONFIG.md#document-alignment

But regardless of whether you use Bitextor or not, after alignment with LASER you will still need bicleaner-ai. An embedding-distance threshold is not enough to obtain clean corpora, unless you start from a very clean source of documents.
