---
tags:
- generation
language:
- multilingual
- cs
- en
---

# mT5-base for Primed Czech+English Generative Question Answering

This is the [mt5-base](https://huggingface.co/google/mt5-base) model with an LM head for generating extractive answers,
given a small set of 2-5 demonstrations (i.e. primes).

## Priming

Note that **this is a priming model** that expects a **set of demonstrations** of your task of interest,
similarly to GPT-3.
Rather than aiming for strong performance on conventional question answering, it learns to extrapolate the pattern of the given demonstrations
to novel tasks, such as Named Entity Recognition or keyword extraction, based purely on the given pattern.

## Data & Training

This model was trained on a combination of the [AdversarialQA](https://adversarialqa.github.io)
and [Czech SQAD 3.0](https://lindat.cz/repository/xmlui/handle/11234/1-3069)
question answering datasets.

To train the model to use the demonstrations, we've **clustered** the samples by the question word(s)
in English AdversarialQA and by the category in Czech SQAD, and used examples from the same cluster as demonstrations
of the task during training.

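
As a rough illustration of this demonstration-selection step (not the actual training code; the sample data and the `pick_demonstrations` helper are hypothetical), grouping QA samples by their leading question word and drawing demonstrations from the same cluster could look like this:

```python
import random
from collections import defaultdict

# Hypothetical sketch of the clustering step described above: group QA samples
# by their leading question word and pick demonstrations from the same cluster.
samples = [
    {"question": "Who wrote the novel?", "context": "...", "answer": "..."},
    {"question": "Who painted the ceiling?", "context": "...", "answer": "..."},
    {"question": "When was the bridge built?", "context": "...", "answer": "..."},
]

clusters = defaultdict(list)
for sample in samples:
    question_word = sample["question"].split()[0].lower()  # e.g. "who", "when"
    clusters[question_word].append(sample)

def pick_demonstrations(sample, k=3):
    """Pick up to k other same-cluster samples to serve as priming demonstrations."""
    cluster = clusters[sample["question"].split()[0].lower()]
    candidates = [s for s in cluster if s is not sample]
    return random.sample(candidates, min(k, len(candidates)))
```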
We find that the specific algorithm for selecting these demonstrations makes a big difference in the model's ability to extrapolate
to new tasks; it will be shared in a follow-up article, so stay tuned!

For Czech SQAD 3.0, the original contexts (whole Wikipedia pages) were limited to a maximum of 8000 characters
per sequence of priming demonstrations.
The pre-processing script for Czech SQAD is available [here](https://huggingface.co/gaussalgo/xlm-roberta-large_extractive-QA_en-cs/blob/main/parse_czech_squad.py).

For training the model (and hence also as the intended inference format), we've used the following patterns of 2-7 demonstrations:

For English samples:

*input*:
```
Question: {Q1} Context: {C1} Answer: {A1},
Question: {Q2} Context: {C2} Answer: {A2},
[...possibly more demonstrations...]

Question: {Q} Context: {C} Answer:
```
=> *target*:
```
{A}
```

For Czech samples:

*input*:
```
Otázka: {Q1} Kontext: {C1} Odpověď: {A1},
Otázka: {Q2} Kontext: {C2} Odpověď: {A2},
[...possibly more demonstrations...]

Otázka: {Q} Kontext: {C} Odpověď:
```
=> *target*:
```
{A}
```

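
For illustration, a minimal sketch of assembling an input in this format, shown for the English variant; the Czech variant only swaps the keywords for `Otázka:`, `Kontext:` and `Odpověď:`. The `build_input` helper is hypothetical and not part of the released code:

```python
# Hypothetical helper that assembles priming demonstrations and the final
# question into the English input pattern documented above.
def build_input(demonstrations, question, context):
    # One demonstration per line, each ending with a comma, as in the pattern
    demos = "\n".join(
        f"Question: {d['question']} Context: {d['context']} Answer: {d['answer']},"
        for d in demonstrations
    )
    # Blank line, then the unanswered query the model should complete
    query = f"Question: {question} Context: {context} Answer:"
    return demos + "\n\n" + query
```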

The best checkpoint was picked to maximize the model's zero-shot performance on unseen Named Entity Recognition
over an out-of-distribution domain of texts and labels.

## Intended uses & limitations

This model is intended for few-shot application to any text extraction task in English and Czech where the prompt can be stated
as a natural question. For example, to use this model for extracting customer names from text,
prompt it with demonstrations in the following format:

```python
input_text = """
Question: What is the customer's name?
Context: Origin: Barrack Obama, Customer id: Bill Moe.
Answer: Bill Moe,
Question: What is the customer's name?
Context: Customer id: Barrack Obama, if not deliverable, return to Bill Clinton.
Answer:"""
```
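
The same prompt format carries over to the other extraction tasks mentioned in the Priming section, e.g. keyword extraction; the demonstration below is purely illustrative:

```python
input_text = """
Question: What are the keywords of the text?
Context: The mT5 model was pre-trained on the multilingual mC4 corpus.
Answer: mT5, mC4,
Question: What are the keywords of the text?
Context: Czech SQAD is a question answering dataset built over Czech Wikipedia texts.
Answer:"""
```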
Note that despite its size, English AdversarialQA has a variety of reported biases,
conditioned on the relative position or type of the answer in the context, which can affect the model's performance on new data
(see, e.g., [L. Mikula (2022)](https://is.muni.cz/th/adh58/?lang=en), Chap. 4.1).

## Usage

Here is how to use this model to answer a question about a given context, using 🤗 Transformers in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("gaussalgo/mt5-base-priming-QA_en-cs")
model = AutoModelForSeq2SeqLM.from_pretrained("gaussalgo/mt5-base-priming-QA_en-cs")

# For the expected format of input_text, see the Intended uses section above
inputs = tokenizer(input_text, return_tensors="pt")

# Generate the answer continuation for the primed input
outputs = model.generate(**inputs)

# generate() returns a batch of token ids; decode the first (and only) sequence
print("Answer:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
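
The same code also works with Czech prompts in the Czech pattern shown above; a small illustrative example (the demonstration and query below are made up):

```python
# Illustrative Czech-primed input, following the Czech pattern documented above.
# Demonstration: "What is the capital of the Czech Republic?" -> "Prague";
# query: "What is the capital of Slovakia?"
input_text = """
Otázka: Jaké je hlavní město České republiky?
Kontext: Praha je hlavní město České republiky.
Odpověď: Praha,
Otázka: Jaké je hlavní město Slovenska?
Kontext: Bratislava je hlavní město a největší město Slovenska.
Odpověď:"""

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```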