anton-l/wav2vec2-base-lang-id · Interval inference?

When I run speech-to-text (STT), I have trouble with speakers who mix two languages in one utterance.
E.g. Би өнөөдөр crypto currency худалдаж авсан, which means "Today I bought crypto currency".
What I am trying to do is split the audio into intervals at the points where the language changes.

input: Би өнөөдөр crypto currency худалдаж авсан
      |----MN----|-------EN------|------MN------|

Is there any way to get per-interval language predictions on an audio file using this model?
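
To make the idea concrete, here is a rough sketch of what I have in mind: cut the waveform into fixed-size windows, classify each window, and merge neighbouring windows that get the same label. The function name `language_intervals`, the `classify` callable, and the 1-second window are my own assumptions for illustration, not part of this model or of any library.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float, str]  # (start_seconds, end_seconds, label)


def language_intervals(
    samples: Sequence[float],
    sample_rate: int,
    classify: Callable[[Sequence[float]], str],  # hypothetical: window -> language label
    window_s: float = 1.0,  # assumed window length, would need tuning
) -> List[Interval]:
    """Label non-overlapping windows and merge consecutive windows with the same label."""
    window = max(1, int(window_s * sample_rate))
    intervals: List[Interval] = []
    for start in range(0, len(samples), window):
        end = min(start + window, len(samples))
        label = classify(samples[start:end])
        t0, t1 = start / sample_rate, end / sample_rate
        if intervals and intervals[-1][2] == label:
            # Same language as the previous window: extend that interval.
            intervals[-1] = (intervals[-1][0], t1, label)
        else:
            # Language changed (or first window): open a new interval.
            intervals.append((t0, t1, label))
    return intervals


if __name__ == "__main__":
    # Toy stand-in for a real classifier: "en" if the window contains any energy.
    toy = lambda chunk: "en" if sum(chunk) > 0 else "mn"
    audio = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
    print(language_intervals(audio, sample_rate=2, classify=toy))
```

In practice `classify` would presumably wrap this model, for example through `transformers.pipeline("audio-classification", model="anton-l/wav2vec2-base-lang-id")`, and return the top label for each window. I do not know how reliable the predictions are on windows this short, and the boundaries can only be as precise as the window size.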