---
language:
- en
tags:
- text-classification
- zero-shot-classification
license: mit
metrics:
- accuracy
datasets:
- multi_nli
- anli
- fever
- lingnli
- alisawuffles/WANLI
#pipeline_tag:
#- text-classification
widget:
- text: "I first thought that I really liked the movie, but upon second thought it was actually disappointing. [SEP] The movie was good."

model-index: # info: https://github.com/huggingface/hub-docs/blame/main/modelcard.md
- name: DeBERTa-v3-large-mnli-fever-anli-ling-wanli
  results:
  - task:
      type: text-classification # Required. Example: automatic-speech-recognition
      name: Natural Language Inference # Optional. Example: Speech Recognition
    dataset:
      type: multi_nli # Required. Example: common_voice. Use dataset id from https://hf.co/datasets
      name: MultiNLI-matched # Required. A pretty name for the dataset. Example: Common Voice (French)
      split: validation_matched # Optional. Example: test
    metrics:
    - type: accuracy # Required. Example: wer. Use metric id from https://hf.co/metrics
      value: 0.912 # Required. Example: 20.90
      #name: # Optional. Example: Test WER
      verified: false # Optional. If true, indicates that evaluation was generated by Hugging Face (vs. self-reported).
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      type: multi_nli
      name: MultiNLI-mismatched
      split: validation_mismatched
    metrics:
    - type: accuracy
      value: 0.908
      #name:
      verified: false
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      type: anli
      name: ANLI-all
      split: test_r1+test_r2+test_r3
    metrics:
    - type: accuracy
      value: 0.702
      #name:
      verified: false
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      type: anli
      name: ANLI-r3
      split: test_r3
    metrics:
    - type: accuracy
      value: 0.64
      #name:
      verified: false
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      type: alisawuffles/WANLI
      name: WANLI
      split: test
    metrics:
    - type: accuracy
      value: 0.77
      #name:
      verified: false
  - task:
      type: text-classification
      name: Natural Language Inference
    dataset:
      type: lingnli
      name: LingNLI
      split: test
    metrics:
    - type: accuracy
      value: 0.87
      #name:
      verified: false

---

# DeBERTa-v3-large-mnli-fever-anli-ling-wanli
## Model description
This model was fine-tuned on the [MultiNLI](https://huggingface.co/datasets/multi_nli), [Fever-NLI](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md), Adversarial-NLI ([ANLI](https://huggingface.co/datasets/anli)), [LingNLI](https://arxiv.org/pdf/2104.07179.pdf) and [WANLI](https://huggingface.co/datasets/alisawuffles/WANLI) datasets, which together comprise 885,242 NLI hypothesis-premise pairs. As of 06.06.22, this model is the best-performing NLI and zero-shot classification model on the Hugging Face Hub and significantly outperforms all other large models on the [ANLI benchmark](https://github.com/facebookresearch/anli).

The foundation model is [DeBERTa-v3-large from Microsoft](https://huggingface.co/microsoft/deberta-v3-large). Released on 06.12.21, DeBERTa-v3-large is currently the best large-sized foundation model for text classification. It combines several recent innovations over classical masked language models such as BERT and RoBERTa; see the [paper](https://arxiv.org/abs/2111.09543) for details.

## Intended uses & limitations
#### How to use the model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU if available

model_name = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
output = model(**inputs)
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
```
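Since the model is also tagged for zero-shot classification, here is a minimal sketch using the Transformers `zero-shot-classification` pipeline; the input text and candidate labels are purely illustrative.

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
)
text = "Angela Merkel is a politician in Germany and leader of the CDU"  # illustrative input
candidate_labels = ["politics", "economy", "entertainment", "environment"]  # illustrative labels
print(classifier(text, candidate_labels, multi_label=False))
```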

### Training data
DeBERTa-v3-large-mnli-fever-anli-ling-wanli was trained on the [MultiNLI](https://huggingface.co/datasets/multi_nli), [Fever-NLI](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md), Adversarial-NLI ([ANLI](https://huggingface.co/datasets/anli)), [LingNLI](https://arxiv.org/pdf/2104.07179.pdf) and [WANLI](https://huggingface.co/datasets/alisawuffles/WANLI) datasets, which together comprise 885,242 NLI hypothesis-premise pairs. Note that [SNLI](https://huggingface.co/datasets/snli) was explicitly excluded due to quality issues with the dataset. More data does not necessarily make for better NLI models.

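As an illustrative sketch (not the author's preprocessing script), the Hub-hosted parts of this mixture could be loaded and combined with a recent version of the `datasets` library; Fever-NLI and LingNLI come from external sources and are omitted here.

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative only: load the NLI datasets hosted on the Hub and align them
# on the three columns needed for NLI fine-tuning (labels: 0=entailment, 1=neutral, 2=contradiction).
mnli = load_dataset("multi_nli", split="train").select_columns(["premise", "hypothesis", "label"])
anli = load_dataset("anli", split="train_r1+train_r2+train_r3").select_columns(["premise", "hypothesis", "label"])
nli_train = concatenate_datasets([mnli, anli])
print(nli_train)
```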

### Training procedure
DeBERTa-v3-large-mnli-fever-anli-ling-wanli was trained with the Hugging Face Trainer using the following hyperparameters. Note that longer training with more epochs hurt performance in my tests (overfitting).

```
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # checkpoint directory (illustrative)
    num_train_epochs=4,              # total number of training epochs
    learning_rate=5e-06,
    per_device_train_batch_size=16,  # batch size per device during training
    gradient_accumulation_steps=2,   # doubles the effective batch size to 32 while decreasing memory requirements
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_ratio=0.06,               # fraction of training steps used for learning rate warmup
    weight_decay=0.01,               # strength of weight decay
    fp16=True,                       # mixed-precision training
)
```
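A minimal sketch of how these arguments plug into the Trainer, assuming the `model`, `tokenizer`, and combined `nli_train` dataset from the snippets above; the tokenization function and column handling are illustrative, not the author's exact training script.

```python
from transformers import Trainer

def tokenize(batch):
    # Pair premise and hypothesis so the model sees both segments.
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

train_dataset = nli_train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # the default collator then pads each batch dynamically
)
trainer.train()
```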

### Eval results
The model was evaluated on the test sets for MultiNLI, ANLI, LingNLI and WANLI, and on the dev set for Fever-NLI. The metric used is accuracy.
The model achieves state-of-the-art performance on each dataset. Surprisingly, it outperforms the previous [state-of-the-art on ANLI](https://github.com/facebookresearch/anli) (ALBERT-XXL) by 8.3%. I assume that this is because ANLI was created to fool masked language models like RoBERTa (or ALBERT), while DeBERTa-v3 uses a better pre-training objective (RTD) and disentangled attention, and I fine-tuned it on higher-quality NLI data.

|Datasets|mnli_test_m|mnli_test_mm|anli_test|anli_test_r3|ling_test|wanli_test|
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|Accuracy|0.912|0.908|0.702|0.64|0.87|0.77|
|Speed (text/sec, A100 GPU)|696.0|697.0|488.0|425.0|828.0|980.0|

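For reference, accuracy on one of these splits can be checked roughly as in the following sketch, which reuses `model`, `tokenizer`, and `device` from the usage example above; the batch size and split choice are illustrative.

```python
from datasets import load_dataset
import torch

# MultiNLI uses 0=entailment, 1=neutral, 2=contradiction, matching the label order above.
data = load_dataset("multi_nli", split="validation_matched")
model.eval()
correct = 0
for i in range(0, len(data), 64):
    batch = data[i:i + 64]
    inputs = tokenizer(batch["premise"], batch["hypothesis"], truncation=True,
                       padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(dim=-1).cpu()
    correct += (preds == torch.tensor(batch["label"])).sum().item()
print(f"accuracy: {correct / len(data):.3f}")
```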

## Limitations and bias
Please consult the original DeBERTa-v3 paper and the literature on the different NLI datasets for more information on the training data and potential biases. The model will reproduce statistical patterns present in the training data.

### BibTeX entry and citation info
If you want to cite this model, please cite my [preprint on low-resource text classification](https://osf.io/74b8k/) and the original DeBERTa-v3 paper.

### Ideas for cooperation or questions?
If you have questions or ideas for cooperation, contact me at m{dot}laurer{at}vu{dot}nl or on [LinkedIn](https://www.linkedin.com/in/moritz-laurer/).

### Debugging and issues
Note that DeBERTa-v3 was released on 06.12.21, and older versions of HF Transformers seem to have issues running the model (e.g. tokenizer errors). Upgrading to Transformers>=4.13 might solve some issues.
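A quick, illustrative way to check the installed version before loading the model; the 4.13 floor follows the note above, and `packaging` ships as a Transformers dependency.

```python
from packaging import version
import transformers

# Warn if the installed Transformers release predates the version suggested above.
if version.parse(transformers.__version__) < version.parse("4.13"):
    print(f"transformers {transformers.__version__} is older than 4.13 and may fail on the DeBERTa-v3 tokenizer")
```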