File size: 7,680 Bytes
e96e4c5 98e3643 e96e4c5 99162de 43aff3f 3ddb297 4e19e4d 5e9250f 99162de e96e4c5 b20530c 3121ec5 b20530c 06769d9 f287bfe 06769d9 68e6993 aef1c1e f287bfe 099ecde 516e152 f287bfe 516e152 dddb018 22e7078 1a516d8 dddb018 ef292d4 22e7078 68e6993 facd6d9 e502d25 667124b ef292d4 1a516d8 dddb018 5a6773d dddb018 e502d25 bf3d76a 667124b 1a516d8 dddb018 67cdd40 6bffc61 aef1c1e 6bffc61 b38403b 6bffc61 1737f7e fc8fd39 6bffc61 b38403b 6bffc61 68e6993 fc8fd39 68e6993 fc8fd39 6bffc61 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
---
language: fr
license: mit
tags:
- zero-shot-classification
- nli
pipeline_tag: zero-shot-classification
widget:
- text: "Selon certains physiciens, un univers parallèle, miroir du nôtre ou relevant de ce que l'on appelle la théorie des branes, autoriserait des neutrons à sortir de notre Univers pour y entrer à nouveau. L'idée a été testée une nouvelle fois avec le réacteur nucléaire de l'Institut Laue-Langevin à Grenoble, plus précisément en utilisant le détecteur de l'expérience Stereo initialement conçu pour chasser des particules de matière noire potentielles, les neutrinos stériles."
candidate_labels: "politique, science, sport, santé"
hypothesis_template: "Ce texte parle de {}."
datasets:
- flue
---
DistilCamemBERT-NLI
===================
We present DistilCamemBERT-NLI which is [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for the Natural Language Inference (NLI) task for the french language, also known as recognizing textual entailment (RTE). This model is constructed on the XNLI dataset which consists to determine whether a premise entails, contradicts or neither entails nor contradicts a hypothesis.
This modelization is close to [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) based on [CamemBERT](https://huggingface.co/camembert-base) model. The problem of the modelizations based on CamemBERT is at the scaling moment, for the production phase for example. Indeed, inference cost can be a technological issue. To counteract this effect, we propose this modelization which divides the inference time by 2 with the same consumption power thanks to DistilCamemBERT.
Dataset
-------
The dataset XNLI from [FLUE](https://huggingface.co/datasets/flue) is composed of 392,702 premises with their hypothesis for the train and 5,010 couples for the test. The goal is to predict textual entailment (does sentence A imply/contradict/neither sentence B?) and is a classification task (given two sentences, predict one of three labels). The sentence A is called *premise* and sentence B is called *hypothesis*, then the goal of modelization is determined as follows:
$$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$
Evaluation results
------------------
| **class** | **precision (%)** | **f1-score (%)** | **support** |
| :----------------: | :---------------: | :--------------: | :---------: |
| **global** | 77.70 | 77.45 | 5,010 |
| **contradiction** | 78.00 | 79.54 | 1,670 |
| **entailment** | 82.90 | 78.87 | 1,670 |
| **neutral** | 72.18 | 74.04 | 1,670 |
Benchmark
---------
We compare the [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) model with 2 other modelizations working on french language. The first one [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) is based on well named [CamemBERT](https://huggingface.co/camembert-base), the french RoBERTa model and the second one [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) based on [mDeBERTav3](https://huggingface.co/microsoft/mdeberta-v3-base) a multilingual model. To compare the performances the metrics of accuracy and [MCC (Matthews Correlation Coefficient)](https://en.wikipedia.org/wiki/Phi_coefficient) was used and for the mean inference time measure, an **AMD Ryzen 5 4500U @ 2.3GHz with 6 cores** was used:
| **model** | **time (ms)** | **accuracy (%)** | **MCC (x100)** |
| :--------------: | :-----------: | :--------------: | :------------: |
| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | **51.35** | 77.45 | 66.24 |
| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 105.0 | 81.72 | 72.67 |
| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 299.18 | **83.43** | **75.15** |
Zero-shot classification
------------------------
The main advantage of such modelization is to create a zero-shot classifier allowing text classification without training. This task can be summarized by:
$$P(hypothesis=i\in\mathcal{C}|premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$
For this part, we use 2 datasets, the first one: [allocine](https://huggingface.co/datasets/allocine) used to train the sentiment analysis models. The dataset is composed of 2 classes: "positif" and "négatif" appreciation of movies reviews. Here we use "Ce commentaire est {}." as the hypothesis template and "positif" and "négatif" as candidate labels.
| **model** | **time (ms)** | **accuracy (%)** | **MCC (x100)** |
| :--------------: | :-----------: | :--------------: | :------------: |
| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | **195.54** | 80.59 | 63.71 |
| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 378.39 | **86.37** | **73.74** |
| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 520.58 | 84.97 | 70.05 |
The second one: [mlsum](https://huggingface.co/datasets/mlsum) used to train the summarization models. We use the articles summary part to predict their topics. In this aim, we aggregate sub-topics and select a few of them. In this case, the hypothesis template used is "C'est un article traitant de {}." and the candidate labels are: "économie", "politique", "sport" and "science".
| **model** | **time (ms)** | **accuracy (%)** | **MCC (x100)** |
| :--------------: | :-----------: | :--------------: | :------------: |
| [cmarkea/distilcamembert-base-nli](https://huggingface.co/cmarkea/distilcamembert-base-nli) | **217.77** | **79.30** | **70.55** |
| [BaptisteDoyen/camembert-base-xnli](https://huggingface.co/BaptisteDoyen/camembert-base-xnli) | 448.27 | 70.7 | 64.10 |
| [MoritzLaurer/mDeBERTa-v3-base-mnli-xnli](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli) | 591.34 | 64.45 | 58.67 |
How to use DistilCamemBERT-NLI
------------------------------
```python
from transformers import pipeline
classifier = pipeline(
task='zero-shot-classification',
model="cmarkea/distilcamembert-base-nli",
tokenizer="cmarkea/distilcamembert-base-nli"
)
result = classifier (
sequences="Le style très cinéphile de Quentin Tarantino "
"se reconnaît entre autres par sa narration postmoderne "
"et non linéaire, ses dialogues travaillés souvent "
"émaillés de références à la culture populaire, et ses "
"scènes hautement esthétiques mais d'une violence "
"extrême, inspirées de films d'exploitation, d'arts "
"martiaux ou de western spaghetti.",
candidate_labels="cinéma, technologie, littérature, politique",
hypothesis_template="Ce texte parle de {}."
)
result
{"labels": ["cinéma",
"littérature",
"technologie",
"politique"],
"scores": [0.7164115309715271,
0.12878799438476562,
0.1092301607131958,
0.0455702543258667]}
``` |