## Persian XLM-RoBERTa Large For Question Answering Task

XLM-RoBERTa is a multilingual language model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116v2) by Conneau et al.

The multilingual [XLM-RoBERTa large for QA on various languages](https://huggingface.co/deepset/xlm-roberta-large-squad2) model is fine-tuned on several QA datasets, but not on PQuAD, which is the largest Persian QA dataset so far. That model serves as our base model for further fine-tuning.

Paper presenting the PQuAD dataset: [arXiv:2202.06219](https://arxiv.org/abs/2202.06219)

---

## Introduction

This model is fine-tuned on the PQuAD train set and is ready to use. The long training time encouraged me to publish it to make life easier for those who need it.

## Hyperparameters
I set batch_size to 4 because of the limited GPU memory in Google Colab.
Pre-processing, training, and evaluating the model took about 4 hours.
```
batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
```
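The names above mirror the fields of the Hugging Face `TrainingArguments` class. Below is a minimal sketch of how these values might be wired into a `Trainer` run; `train_dataset`, `eval_dataset`, and the `output_dir` name are placeholders (the tokenized PQuAD splits and preprocessing are not shown), so this is an illustration of the setup rather than the exact training script.
```python
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments, Trainer)

base_LM_model = "deepset/xlm-roberta-large-squad2"
tokenizer = AutoTokenizer.from_pretrained(base_LM_model)
model = AutoModelForQuestionAnswering.from_pretrained(base_LM_model)

training_args = TrainingArguments(
    output_dir="persian_xlm_roberta_large",  # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,           # effective batch size of 32
    num_train_epochs=1,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

# Placeholders: assume train_dataset / eval_dataset are the tokenized PQuAD splits.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```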
## Performance
Evaluated on the PQuAD Persian test set ([official PQuAD test file](https://huggingface.co/datasets/newsha/PQuAD/blob/main/Test_v8.0.json)).
I also trained for more than one epoch, but it gave worse results.
```
"exact_match": 66.85,
"f1": 87.56
```
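If you want to reproduce the evaluation, a short sketch of loading that official test file with the `datasets` library is shown below. It assumes the file follows the SQuAD-style JSON layout (a top-level `data` field of articles); the actual evaluation script is not included here.
```python
from datasets import load_dataset

# Raw-file URL for the official PQuAD test split linked above.
test_url = "https://huggingface.co/datasets/newsha/PQuAD/resolve/main/Test_v8.0.json"

# Assumption: SQuAD-style JSON with a top-level "data" list of articles.
pquad_test = load_dataset("json", data_files={"test": test_url}, field="data")
print(pquad_test["test"])
```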
## How to use
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained('pedramyazdipoor/persian_xlm_roberta_large')
model = AutoModelForQuestionAnswering.from_pretrained('pedramyazdipoor/persian_xlm_roberta_large')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)

text = 'سلام من پدرامم 26 سالمه'  # "Hi, I am Pedram and I am 26 years old."
question = 'نامم چیست؟'  # "What is my name?"
print(tokenizer.tokenize(text + question))
# ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']

encoding = tokenizer(text, question,
                     add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)

with torch.no_grad():
    out = model(encoding['input_ids'].to(device),
                encoding['attention_mask'].to(device),
                encoding['token_type_ids'].to(device))
# out.start_logits and out.end_logits score each token as the answer start/end.
```
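The offset mapping requested above can be used to turn the logits into an answer string. Below is a minimal greedy decoding sketch (my own illustration, not necessarily the decoding used to produce the scores above): it picks the highest-scoring start and end positions and assumes the predicted span falls inside `text`, the first sequence in the encoding.
```python
# Continues from the snippet above.
start_idx = int(torch.argmax(out.start_logits, dim=1)[0])
end_idx = int(torch.argmax(out.end_logits, dim=1)[0])

# Per-token (char_start, char_end) pairs; special tokens map to (0, 0).
offsets = encoding['offset_mapping'][0]

if start_idx <= end_idx:
    answer = text[int(offsets[start_idx][0]):int(offsets[end_idx][1])]
else:
    answer = ''  # no valid span found
print(answer)
```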
## Acknowledgments
We hereby express our gratitude to [Newsha Shahbodaghkhan](https://huggingface.co/datasets/newsha/PQuAD/tree/main) for facilitating dataset gathering.
## Contributors
- Pedram Yazdipoor: [LinkedIn](https://www.linkedin.com/in/pedram-yazdipour/)
## Releases
### Release v0.1 (Sep 14, 2022)
This is the first version of our Persian XLM-RoBERTa-Large.