metadata
language: fr
pipeline_tag: token-classification
widget:
- text: je voudrais réserver une chambre à paris pour demain et lundi
- text: d'accord pour l'hôtel à quatre vingt dix euros la nuit
- text: deux nuits s'il vous plait
- text: dans un hôtel avec piscine à marseille
tags:
- bert
- flaubert
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
- MEDIA
vpelloin/MEDIA_NLU-flaubert_base_cased
This is a Natural Language Understanding (NLU) model for the French MEDIA benchmark. It maps each input word to an output concept tag (76 tags available).
This model was trained using flaubert/flaubert_base_cased as its initial checkpoint. It obtained 13.20% CER (lower is better) on the MEDIA test set, in our Interspeech 2023 publication, using Kaldi ASR transcriptions.
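To make the reported metric concrete, here is a minimal sketch of how an error rate over concept sequences can be computed. It assumes CER stands for Concept Error Rate and is scored like a word error rate: the edit distance between reference and predicted concept sequences, divided by the number of reference concepts. The scoring script used for the published numbers is not described in this card, so the function names and the toy concept labels below are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of concept labels."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference concept
                curr[j - 1] + 1,     # insertion of a spurious concept
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def concept_error_rate(references, hypotheses):
    """Total edit errors divided by the total number of reference concepts."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Hypothetical concept sequences, invented for illustration only.
refs = [["command", "object", "location"], ["number", "unit"]]
hyps = [["command", "location"], ["number", "unit"]]

print(f"{100 * concept_error_rate(refs, hyps):.2f}%")  # prints 20.00%
```

In this toy case one concept out of five reference concepts is missing from the prediction, which gives 20%.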
Available MEDIA NLU models:
- vpelloin/MEDIA_NLU-flaubert_base_cased: MEDIA NLU model trained using flaubert/flaubert_base_cased. Obtains 13.20% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_base_uncased: MEDIA NLU model trained using flaubert/flaubert_base_uncased. Obtains 12.40% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_ft: MEDIA NLU model trained using nherve/flaubert-oral-ft. Obtains 11.98% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_mixed: MEDIA NLU model trained using nherve/flaubert-oral-mixed. Obtains 12.47% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_asr: MEDIA NLU model trained using nherve/flaubert-oral-asr. Obtains 12.43% CER on MEDIA test.
- vpelloin/MEDIA_NLU-flaubert_oral_asr_nb: MEDIA NLU model trained using nherve/flaubert-oral-asr_nb. Obtains 12.24% CER on MEDIA test.
Usage with Pipeline
from transformers import pipeline

# Load the token-classification pipeline for this model
generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_base_cased",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Print (word, concept tag) pairs for each sentence
for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
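The pipeline returns one dictionary per token, so a concept that spans several words appears as several entries. The sketch below shows one possible way to merge adjacent tokens that share the same tag. It is not part of the model or of the pipeline API: `group_by_entity` is a hypothetical helper, the input is a hand-written stand-in for pipeline output, and the labels `LABEL_A` and `LABEL_B` are placeholders rather than real MEDIA tags. It also assumes the `word` strings are plain words; any subword markers produced by the tokenizer would need extra cleanup.

```python
from itertools import groupby


def group_by_entity(tokens):
    """Merge runs of adjacent tokens that carry the same `entity` label."""
    spans = []
    for entity, run in groupby(tokens, key=lambda tok: tok["entity"]):
        words = " ".join(tok["word"] for tok in run)
        spans.append((entity, words))
    return spans


# Hand-written stand-in for pipeline output; the labels are placeholders.
fake_output = [
    {"word": "deux", "entity": "LABEL_A"},
    {"word": "nuits", "entity": "LABEL_A"},
    {"word": "à", "entity": "LABEL_B"},
    {"word": "paris", "entity": "LABEL_B"},
]

print(group_by_entity(fake_output))
# [('LABEL_A', 'deux nuits'), ('LABEL_B', 'à paris')]
```

This grouping treats two neighbouring concepts of the same type as a single span, which may or may not be what an application needs.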
Usage with AutoTokenizer/AutoModel
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_cased"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_cased"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Tokenize as a padded batch and run the model
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
outputs = model(**inputs).logits

# One predicted tag per token position (including special and padding tokens)
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])
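Because the batch is padded, the printed lists also contain predictions for padding positions. The following sketch shows one way to pair each token with its tag and drop the padded positions using the attention mask. `labels_without_padding` is a hypothetical helper, not something shipped with the model, and the toy tokens, ids and label names are invented so the example runs without downloading anything.

```python
def labels_without_padding(token_rows, label_id_rows, mask_rows, id2label):
    """Pair tokens with predicted labels, skipping positions where mask == 0."""
    results = []
    for tokens, label_ids, mask in zip(token_rows, label_id_rows, mask_rows):
        results.append([
            (tok, id2label[lab])
            for tok, lab, keep in zip(tokens, label_ids, mask)
            if keep == 1
        ])
    return results


# Toy data standing in for tokenizer and model outputs (all values invented).
toy_tokens = [["<s>", "deux", "nuits", "</s>", "<pad>"]]
toy_label_ids = [[0, 1, 1, 0, 0]]
toy_mask = [[1, 1, 1, 1, 0]]
toy_id2label = {0: "LABEL_X", 1: "LABEL_Y"}

print(labels_without_padding(toy_tokens, toy_label_ids, toy_mask, toy_id2label))
# [[('<s>', 'LABEL_X'), ('deux', 'LABEL_Y'), ('nuits', 'LABEL_Y'), ('</s>', 'LABEL_X')]]
```

With the objects from the snippet above, the token rows could come from `tokenizer.convert_ids_to_tokens(row)` for each row of `inputs["input_ids"].tolist()`, the label ids from `outputs.argmax(dim=-1).tolist()`, and the masks from `inputs["attention_mask"].tolist()`. Special tokens are kept here; filtering them out would be a further step.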
Reference
If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the following paper:
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine Laurent and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}