## Persian XLM-RoBERTa Large For Question Answering Task

XLM-RoBERTa is a multilingual language model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116v2) by Conneau et al.

The multilingual [XLM-RoBERTa large for QA on various languages](https://huggingface.co/deepset/xlm-roberta-large-squad2) is fine-tuned on several QA datasets, but not on PQuAD, which is the largest Persian QA dataset to date. That model serves as our base model for fine-tuning.

Paper presenting the PQuAD dataset: [arXiv:2202.06219](https://arxiv.org/abs/2202.06219)

---

## Introduction

This model is fine-tuned on the PQuAD train set and is ready to use. Because training takes a long time, I am publishing this model to make life easier for those who need it.

## Hyperparameters

I set batch_size to 4 because of the limited GPU memory available on Google Colab.
Pre-processing, training, and evaluating the model took about 4 hours.
```
batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
```
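
These values map directly onto Hugging Face `TrainingArguments`. A minimal sketch of that mapping is shown below; it assumes the Trainer API, a placeholder `output_dir`, and that the PQuAD splits are tokenized separately with `max_seq_len = 256` (the original training script is not part of this card).

```python
# Hypothetical sketch only, not the original training script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="persian_xlm_roberta_large",  # placeholder path
    per_device_train_batch_size=4,           # batch_size = 4
    num_train_epochs=1,                      # n_epochs = 1
    learning_rate=3e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    warmup_ratio=0.1,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
)
# These arguments would then be passed to a `Trainer` together with
# "deepset/xlm-roberta-large-squad2" as the base model and the tokenized
# PQuAD train/validation sets.
```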

## Performance

Evaluated on the PQuAD Persian test set with the [official PQuAD test file](https://huggingface.co/datasets/newsha/PQuAD/blob/main/Test_v8.0.json).
I also trained for more than one epoch, but the results were worse.
```
"exact_match": 66.85,
"f1": 87.56
```
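
For reference, scores in this format can be computed with the `evaluate` library's `squad` metric, as in the minimal sketch below; the ids and answers are made-up placeholders, and the exact evaluation script behind the numbers above is not included here.

```python
# Hypothetical sketch only, with placeholder data.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "1", "prediction_text": "پدرام"}]
references = [{"id": "1", "answers": {"text": ["پدرام"], "answer_start": [9]}}]

# Returns a dict like {"exact_match": ..., "f1": ...}
print(squad_metric.compute(predictions=predictions, references=references))
```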

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained('pedramyazdipoor/persian_xlm_roberta_large')
model = AutoModelForQuestionAnswering.from_pretrained('pedramyazdipoor/persian_xlm_roberta_large')

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)

text = 'سلام من پدرامم 26 سالمه'  # "Hi, I am Pedram, I am 26 years old."
question = 'نامم چیست؟'  # "What is my name?"

print(tokenizer.tokenize(text + question))
# ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']

encoding = tokenizer(text, question,
                     add_special_tokens = True,
                     return_token_type_ids = True,
                     return_tensors = 'pt',
                     padding = True,
                     return_offsets_mapping = True,
                     truncation = 'only_first',
                     max_length = 32)

out = model(encoding['input_ids'].to(device),
            encoding['attention_mask'].to(device),
            encoding['token_type_ids'].to(device))
# out.start_logits and out.end_logits hold the start/end scores for the answer span.
```
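
Continuing from the snippet above, a minimal (hypothetical) way to turn `out.start_logits` and `out.end_logits` into an answer string is to take the argmax of each; note that this simple decoding does not mask out special tokens or the question, and does not enforce `start <= end`, which a production setup would handle.

```python
# Hypothetical decoding sketch; reuses `out`, `encoding`, and `tokenizer` from above.
start_idx = out.start_logits.argmax(dim=-1).item()
end_idx = out.end_logits.argmax(dim=-1).item()

answer_ids = encoding['input_ids'][0][start_idx:end_idx + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```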

## Acknowledgments

We hereby express our gratitude to [Newsha Shahbodaghkhan](https://huggingface.co/datasets/newsha/PQuAD/tree/main) for facilitating dataset gathering.

## Contributors

- Pedram Yazdipoor: [LinkedIn](https://www.linkedin.com/in/pedram-yazdipour/)

## Releases

### Release v0.1 (Sep 14, 2022)

This is the first version of our Persian XLM-RoBERTa Large.