AntoineBlanot committed
Commit 1ba8f2e · 1 Parent(s): f4440cf
Update README.md
README.md
CHANGED
@@ -7,5 +7,22 @@ metrics:
 - accuracy
 - f1
 pipeline_tag: zero-shot-classification
+language:
+- en
 ---
+# T5ForSequenceClassification
+**T5ForSequenceClassification** adapts the original [T5](https://github.com/google-research/text-to-text-transfer-transformer) architecture for sequence classification tasks.
+
+T5 was originally built for text-to-text tasks and excels at them.
+It can handle any NLP task that has been converted to a text-to-text format, including sequence classification!
+You can see [here](https://huggingface.co/google/flan-t5-base?text=Premise%3A++At+my+age+you+will+probably+have+learnt+one+lesson.+Hypothesis%3A++It%27s+not+certain+how+many+lessons+you%27ll+learn+by+your+thirties.+Does+the+premise+entail+the+hypothesis%3F) how the original T5 is used for a sequence classification task.
+
+Our motivation for building **T5ForSequenceClassification** is that the full original T5 architecture is not needed for most NLU tasks. Indeed, NLU tasks generally do not require generating text, so a large decoder is unnecessary.
+By removing the decoder we can *halve the original number of parameters* (and thus the computation cost) and *efficiently optimize* the network for the given task.
+
+# Why use T5ForSequenceClassification?
+Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture, such as [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge), have shown very strong performance on sequence classification tasks and are still widely used today.
+However, those models only scale up to ~1.5B parameters (DeBERTa xxlarge), resulting in limited knowledge compared to bigger models.
+On the other hand, models based on the T5 architecture scale up to ~11B parameters (t5-xxl), and innovations with this architecture are recent and keep improving (T5, [mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).
+
 Model of philschmid/flan-t5-xxl-sharded-fp16 with a single decoder layer and a classification head on top.
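For context on the paragraph in the diff that links to the Flan-T5 widget: the original, unmodified text-to-text T5 answers such a classification query by generating the label as text. Below is a minimal sketch of that usage with the standard `transformers` API, using the small `google/flan-t5-base` checkpoint from the link; the decoding settings are an assumption added for illustration, not part of this repository.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Plain text-to-text classification with the original Flan-T5 (not this model):
# the NLI query is written as a prompt and the label comes back as generated text.
model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = (
    "Premise: At my age you will probably have learnt one lesson. "
    "Hypothesis: It's not certain how many lessons you'll learn by your thirties. "
    "Does the premise entail the hypothesis?"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # a textual label such as "no"
```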
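The last line of the README describes this model as philschmid/flan-t5-xxl-sharded-fp16 with a single decoder layer and a classification head on top. The repository ships its own implementation of that architecture; the sketch below is only an illustration of the idea using generic `transformers` T5 classes. The small `google/flan-t5-base` stand-in checkpoint, the 3-way label head, and the pooling choice are all assumptions, not the author's code.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, T5Model

# Illustrative sketch only: start from a full T5 checkpoint, keep the encoder,
# truncate the decoder to a single block, and put a classification head on top.
model_id = "google/flan-t5-base"  # small stand-in; the card uses philschmid/flan-t5-xxl-sharded-fp16
tokenizer = AutoTokenizer.from_pretrained(model_id)
backbone = T5Model.from_pretrained(model_id)

# Keep only the first decoder layer (halves most of the remaining decoder parameters).
backbone.decoder.block = backbone.decoder.block[:1]

num_labels = 3  # assumed, e.g. entailment / neutral / contradiction
classifier = nn.Linear(backbone.config.d_model, num_labels)

inputs = tokenizer("Premise: ... Hypothesis: ...", return_tensors="pt")
# The T5 decoder still needs an input; feed it the decoder start token only.
decoder_input_ids = torch.full((1, 1), backbone.config.decoder_start_token_id)

with torch.no_grad():
    out = backbone(**inputs, decoder_input_ids=decoder_input_ids)
    # Classify from the single decoder position's hidden state.
    logits = classifier(out.last_hidden_state[:, 0, :])
print(logits.softmax(dim=-1))
```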