How to use an LM like n-gram LM with w2v-bert-2.0?


I have fine-tuned w2v-bert-2.0 for a low-resource language. The results are good, but they could be better with a language model such as an n-gram LM. Is that possible? If not, what are the alternatives?

Yes, it's definitely possible; take a look at this guide: https://huggingface.co/blog/wav2vec2-with-ngram
Note that you will need to adapt the code for this particular model. In my experience, it gives a nice boost.
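
For reference, the core decoder-building step from that guide looks roughly like this. It's only a sketch: it assumes you have already trained a KenLM model as described in the post, and the checkpoint ID and the 5gram_correct.arpa path are placeholders.

from transformers import AutoProcessor
from pyctcdecode import build_ctcdecoder

# load the processor of the fine-tuned checkpoint (placeholder ID)
processor = AutoProcessor.from_pretrained("your-username/your-w2v-bert-2.0-checkpoint")

# pyctcdecode expects the CTC labels sorted by token id
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = sorted(vocab_dict.items(), key=lambda item: item[1])
labels = [token for token, _ in sorted_vocab]

# build a beam-search decoder backed by the KenLM n-gram model
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="5gram_correct.arpa",  # placeholder path to the KenLM model
)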

Thank you for your response!

I took a look at the guide, but couldn't find an equivalent of Wav2Vec2ProcessorWithLM for w2v-bert-2.0.
Does the n-gram LM or the decoder need to be called manually? If so, could you point me to another guide, if one exists?

@ylacombe I'm not sure w2v-bert-2.0 can even be combined with an n-gram LM, since the Wav2Vec2ProcessorWithLM class in the linked blog post expects a Wav2Vec2FeatureExtractor, while w2v-bert-2.0 uses a SeamlessM4TFeatureExtractor.

Apparently this has already been discussed and fixed by the repo owners, but the fix just hasn't been merged into the main branch yet. For anyone else facing this problem, setting Wav2Vec2ProcessorWithLM.feature_extractor_class = "AutoFeatureExtractor" before creating the processor is a hacky workaround for now.

@UmarRamzan I'm slightly confused about how to use that line of code. Could you please show it in a bit more elaborate way (more lines, the entire function/class implementation)?

@StephennFernandes

from transformers import Wav2Vec2ProcessorWithLM

# `processor` and `decoder` are built exactly as in the blog post linked
# above (the processor via AutoProcessor, the decoder via pyctcdecode)

# just have to add this line: it patches the class check so that the
# SeamlessM4T feature extractor of w2v-bert-2.0 is accepted
Wav2Vec2ProcessorWithLM.feature_extractor_class = "AutoFeatureExtractor"

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)

and the rest of the code is as provided here https://huggingface.co/blog/wav2vec2-with-ngram
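
Putting it together, inference with the patched processor should then look roughly like the following sketch (assuming model is your fine-tuned Wav2Vec2BertForCTC, its checkpoint ID here is a placeholder, and audio_array is a 16 kHz mono waveform):

import torch
from transformers import Wav2Vec2BertForCTC

model = Wav2Vec2BertForCTC.from_pretrained("your-username/your-w2v-bert-2.0-checkpoint")

# extract input features with the SeamlessM4T feature extractor
inputs = processor_with_lm(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs pyctcdecode's beam search with the n-gram LM over the logits
transcription = processor_with_lm.batch_decode(logits.numpy()).text

Note that if you save the processor with processor_with_lm.save_pretrained(...) and reload it later via Wav2Vec2ProcessorWithLM.from_pretrained(...), the same feature_extractor_class patch presumably has to be applied again first.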
