EC2 Default User committed on
Commit
bb8e52c
0 Parent(s):

the first commit

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
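The patterns above decide which files in the repository are stored through Git LFS. As a rough illustration, the sketch below checks a few file names from this commit against a handful of those patterns. It uses Python's `fnmatchcase`, which only approximates gitattributes matching (it is adequate for simple `*.ext` globs, but not for path patterns such as `saved_model/**/*`); the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# A few patterns copied from the .gitattributes hunk above.
LFS_PATTERNS = ["*.bin", "*.h5", "*.safetensors", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(filename: str, patterns=LFS_PATTERNS) -> bool:
    """Return True if the bare file name matches any listed pattern (approximation)."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# File names taken from this commit.
for name in ["model.safetensors", "tf_model.h5", "config.json", "tokenizer.json"]:
    print(name, is_lfs_tracked(name))
```

This agrees with the rest of the commit: `model.safetensors` and `tf_model.h5` appear as LFS pointer files, while the JSON files are stored as plain text.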
README.md ADDED
@@ -0,0 +1,134 @@
+ ---
+ language:
+ - tr
+ arXiv: 2403.01308
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ widget:
+   - text: >-
+       Soru yarat: cevap: Alan Mathison Turing İngiliz matematikçi, bilgisayar
+       bilimcisi ve kriptolog. II. Dünya Savaşı sırasında Alman şifrelerinin
+       kırılmasında çok önemli bir rol oynadığı için savaş kahramanı sayılmıştır.
+       Ayrıca Manchester Üniversitesi'nde çalıştığı yıllarda, Turing makinesi
+       denilen algoritma tanımı ile modern bilgisayarların kavramsal temelini
+       atmıştır.
+     example_title: Question generation
+   - text: >-
+       Soru cevapla: Turing makinesi denilen algoritma tanımı ile modern
+       bilgisayarların kavramsal temelini atan bilim insanı kimdir? kaynak: Alan
+       Mathison Turing İngiliz matematikçi, bilgisayar bilimcisi ve kriptolog. II.
+       Dünya Savaşı sırasında Alman şifrelerinin kırılmasında çok önemli bir rol
+       oynadığı için savaş kahramanı sayılmıştır. Ayrıca Manchester
+       Üniversitesi'nde çalıştığı yıllarda, Turing makinesi denilen algoritma
+       tanımı ile modern bilgisayarların kavramsal temelini atmıştır.
+     example_title: Question answering
+   - text: >-
+       yanıtları çıkar: Alan Mathison Turing İngiliz matematikçi, bilgisayar
+       bilimcisi ve kriptolog. II. Dünya Savaşı sırasında Alman şifrelerinin
+       kırılmasında çok önemli bir rol oynadığı için savaş kahramanı sayılmıştır.
+       <hl> Ayrıca Manchester Üniversitesi'nde çalıştığı yıllarda, Turing makinesi
+       denilen algoritma tanımı ile modern bilgisayarların kavramsal temelini
+       atmıştır <hl> .
+     example_title: Answer Extraction
+ license: cc-by-nc-sa-4.0
+ ---
+ # VBART Model Card
+
+ ## Model Description
+
+ VBART is the first sequence-to-sequence model trained on Turkish corpora from scratch. It was trained by VNGRS in February 2023.
+ With fine-tuning, the model can perform text transformation tasks such as summarization, paraphrasing, and title generation.
+
+ It outperforms its multilingual counterparts despite being much smaller than they are.
+
+ This repository contains TensorFlow and safetensors weights of VBART fine-tuned for the question-answering and question-generation tasks described in the [paper](https://doi.org/10.55730/1300-0632.3914).
+
+ - **Developed by:** [VNGRS-AI](https://vngrs.com/ai/)
+ - **Model type:** Transformer encoder-decoder based on the mBART architecture
+ - **Language(s) (NLP):** Turkish
+ - **License:** CC BY-NC-SA 4.0
+ - **Finetuned from:** VBART-Large
+ - **Paper:** [arXiv](https://arxiv.org/abs/2403.01308)
+ ## How to Get Started with the Model
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("vngrs-ai/VBART-Large-QAQG",
+                                           model_input_names=['input_ids', 'attention_mask'])
+ # To run on GPU, uncomment the device_map kwarg and delete the closing bracket before it
+ model = AutoModelForSeq2SeqLM.from_pretrained("vngrs-ai/VBART-Large-QAQG")  # , device_map="auto")
+
+ context = "..."
+ question = "..."
+ highlighted_context = "..."
+
+ # Prompt for question generation
+ qg_prompt = f'Soru yarat: cevap: {context}'
+ # Prompt for question answering
+ qa_prompt = f'Soru cevapla: {question} kaynak: {context}'
+ # Prompt for answer extraction
+ ae_prompt = f'yanıtları çıkar: {highlighted_context}'
+
+
+ # Swap in qg_prompt or qa_prompt to run one of the other tasks
+ token_input = tokenizer(ae_prompt, return_tensors="pt")  # .to('cuda') when the model is on GPU
+
+ # Generate an output sequence and decode it back to text
+ outputs = model.generate(**token_input)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ## Training Details
+ ### Fine-tuning prompt
+ This model is trained on three tasks:
+ - question answering: Answers a question from a given context. Prompted with
+ ```Soru cevapla: <question> kaynak: <context>```
+ - question generation: Generates a question from a given context. Accepts a highlight token (`<hl>`, without spaces) to mark the answer that the generated question should target. Prompted with
+ ```Soru yarat: <context>```
+ - answer extraction: Extracts possible answers from a highlighted range (marked with the same highlight token). Prompted with
+ ```yanıtları çıkar: <context with highlighted parts>```
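The three prompt formats can be wrapped in small helpers, sketched below under a few assumptions. The function names are hypothetical. The question-generation prefix follows the quick-start snippet and widget examples (`Soru yarat: cevap:`), which differs from the shorter form in the list above; which of the two the model expects is not confirmed here. Placing a space on each side of `<hl>` mirrors the widget example and is likewise an assumption.

```python
HL = "<hl>"  # highlight token named in the model card


def qa_prompt(question: str, context: str) -> str:
    """Question answering: question first, then the source passage."""
    return f"Soru cevapla: {question} kaynak: {context}"


def qg_prompt(context: str) -> str:
    """Question generation, using the prefix from the quick-start snippet."""
    return f"Soru yarat: cevap: {context}"


def ae_prompt(highlighted_context: str) -> str:
    """Answer extraction over a context that already contains highlight tokens."""
    return f"yanıtları çıkar: {highlighted_context}"


def highlight(context: str, span: str) -> str:
    """Wrap the first occurrence of `span` in highlight tokens (spacing is an assumption)."""
    start = context.find(span)
    if start == -1:
        raise ValueError("span not found in context")
    end = start + len(span)
    return f"{context[:start]}{HL} {span} {HL}{context[end:]}"


print(ae_prompt(highlight("a b c", "b")))
```

The resulting strings can be passed to the tokenizer in place of the hand-built f-strings from the quick-start code.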
+
+ ### Training Data
+ The base model is pre-trained on cleaned and filtered versions of a mixed corpus made of the Turkish parts of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and [mC4](https://huggingface.co/datasets/mc4) datasets. These datasets consist of unstructured web-crawl documents. More information about each dataset can be found on its respective page. The data is filtered using a set of heuristics and rules, explained in the appendix of our [paper](https://arxiv.org/abs/2403.01308).
+
+ The fine-tuning dataset is [TQuAD](https://github.com/obss/turkish-question-generation), which has two versions. We concatenated them and dropped duplicate samples. More information about this process can be found in Appendix B of our [paper](https://arxiv.org/abs/2403.01308).
+
+ ### Limitations
+ This model is fine-tuned for question-answering and question-generation tasks with specific prompts. It is not intended for any other use and cannot be fine-tuned on another task with the full performance of the base model. It is also not guaranteed to work without the specified prompts.
+
+ ### Training Procedure
+ Pre-trained for 30 days on a total of 708B tokens. Fine-tuned for 5 epochs.
+ #### Hardware
+ - **GPUs**: 8 x Nvidia A100-80 GB
+ #### Software
+ - TensorFlow
+ #### Hyperparameters
+ ##### Pretraining
+ - **Training regime:** fp16 mixed precision
+ - **Training objective**: Sentence permutation and span masking (using mask lengths sampled from a Poisson distribution with λ=3.5, masking 30% of tokens)
+ - **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 1e-6)
+ - **Scheduler**: Linear decay scheduler (20,000 warm-up steps)
+ - **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 165k and 205 steps, respectively)
+ - **Initial Learning rate**: 5e-6
+ - **Training tokens**: 708B
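To make the span-masking part of the training objective more concrete, here is a minimal standard-library sketch. It is not the authors' implementation: it covers only span masking (not sentence permutation), works on a plain list of string tokens, skips zero-length spans, and merges adjacent spans into one mask. The function names and the retry limit are hypothetical; only λ=3.5, the 30% ratio, and the `<MASK>` token string come from this commit.

```python
import math
import random

MASK = "<MASK>"  # mask token string listed in special_tokens_map.json


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def mask_spans(tokens, mask_ratio=0.3, lam=3.5, seed=0):
    """Hide about mask_ratio of the tokens in Poisson-length spans, one MASK per span."""
    rng = random.Random(seed)
    budget = round(len(tokens) * mask_ratio)  # number of tokens still to hide
    hidden = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 100 * len(tokens):
        attempts += 1
        length = min(sample_poisson(lam, rng), budget)
        if length == 0:
            continue  # zero-length spans are skipped in this simplified sketch
        start = rng.randrange(len(tokens) - length + 1)
        if any(hidden[start:start + length]):
            continue  # do not overlap a span chosen earlier
        hidden[start:start + length] = [True] * length
        budget -= length
    # Collapse every run of hidden tokens into a single mask token.
    out, in_span = [], False
    for token, is_hidden in zip(tokens, hidden):
        if not is_hidden:
            out.append(token)
        elif not in_span:
            out.append(MASK)
        in_span = is_hidden
    return out


tokens = [f"w{i}" for i in range(100)]
print(mask_spans(tokens))
```

Because each span collapses to a single mask token, the corrupted sequence is shorter than the input, and the decoder has to infer how many tokens each mask replaces.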
+
+ ##### Fine-tuning
+ - **Training regime:** fp16 mixed precision
+ - **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 1e-6)
+ - **Scheduler**: Linear decay scheduler
+ - **Dropout**: 0.1
+ - **Learning rate**: 5e-5
+ - **Fine-tune epochs**: 5
+
+ #### Metrics
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/D-Epasj5C4icAu0ykqt10.png)
+
+ ## Citation
+ ```
+ @article{turker2024vbart,
+   title={VBART: The Turkish LLM},
+   author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
+   journal={arXiv preprint arXiv:2403.01308},
+   year={2024}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_name_or_path": "tfhf_model",
+   "activation_dropout": 0.0,
+   "activation_function": "gelu",
+   "architectures": [
+     "MBartForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 2,
+   "classifier_dropout": 0.0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 12,
+   "decoder_start_token_id": 2,
+   "dropout": 0.1,
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 12,
+   "eos_token_id": 3,
+   "forced_eos_token_id": 3,
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "max_position_embeddings": 1024,
+   "model_type": "mbart",
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "scale_embedding": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.38.2",
+   "use_cache": true,
+   "vocab_size": 32000
+ }
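The configuration above is enough for a back-of-the-envelope estimate of the model size. The sketch below counts only the large weight matrices and ignores biases and layer norms. It assumes a standard Transformer layer layout (four attention projections per attention block, two feed-forward matrices per layer) and a single token-embedding matrix shared by the encoder, decoder, and output layer; neither assumption is stated in this commit. The function name is hypothetical.

```python
import json

# Subset of the config.json values shown above.
CONFIG = json.loads("""{
  "d_model": 1024, "vocab_size": 32000, "max_position_embeddings": 1024,
  "encoder_layers": 12, "decoder_layers": 12,
  "encoder_ffn_dim": 4096, "decoder_ffn_dim": 4096,
  "encoder_attention_heads": 16
}""")


def approx_weight_count(cfg: dict) -> int:
    """Count the main weight matrices, ignoring biases and layer norms."""
    d = cfg["d_model"]
    attention = 4 * d * d  # query, key, value, and output projections
    enc_layer = attention + 2 * d * cfg["encoder_ffn_dim"]
    dec_layer = 2 * attention + 2 * d * cfg["decoder_ffn_dim"]  # self- and cross-attention
    embeddings = cfg["vocab_size"] * d  # assumed shared (tied) embedding matrix
    positions = 2 * cfg["max_position_embeddings"] * d  # encoder and decoder positions
    return (cfg["encoder_layers"] * enc_layer
            + cfg["decoder_layers"] * dec_layer
            + embeddings + positions)


head_dim = CONFIG["d_model"] // CONFIG["encoder_attention_heads"]
print(f"per-head dimension: {head_dim}")
print(f"approximate weights: {approx_weight_count(CONFIG) / 1e6:.1f}M")
```

The estimate comes to roughly 387 million weights. That is in line with the `model.safetensors` size recorded in this commit (1,550,557,280 bytes), which at 4 bytes per `float32` value corresponds to about 387.6 million values.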
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 2,
+   "decoder_start_token_id": 2,
+   "eos_token_id": 3,
+   "forced_eos_token_id": 3,
+   "pad_token_id": 0,
+   "transformers_version": "4.38.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1a5f5db735b604098beb9b331361b42a143ef0944f1f3ee2742e5951a3ffc257
+ size 1550557280
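The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm, the digest, and the byte size of the real file. The sketch below parses such a pointer and checks a byte stream against it. The function names are hypothetical, and the demonstration uses a tiny in-memory payload rather than the actual 1.5 GB file.

```python
import hashlib
import io

# Pointer text copied from the model.safetensors hunk above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1a5f5db735b604098beb9b331361b42a143ef0944f1f3ee2742e5951a3ffc257
size 1550557280
"""


def parse_pointer(text: str) -> dict:
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algorithm": algorithm,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(stream, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hash a binary stream in chunks and compare size and digest with the pointer."""
    hasher = hashlib.new(pointer["algorithm"])
    total = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
        total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


info = parse_pointer(POINTER)
print(info["algorithm"], info["size"])

# Demonstration on a small in-memory payload with a matching, made-up pointer.
demo = b"abc"
demo_pointer = {"algorithm": "sha256",
                "digest": hashlib.sha256(demo).hexdigest(),
                "size": len(demo)}
print(matches_pointer(io.BytesIO(demo), demo_pointer))
```

For a downloaded file, the same check would apply by opening it in binary mode and passing the file object as `stream`.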
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "bos_token": "<BOS>",
+   "eos_token": "<EOS>",
+   "mask_token": "<MASK>",
+   "pad_token": "<PAD>",
+   "unk_token": "<UNK>"
+ }
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c30087012e88164bda070f62b685d9c0e39d55f362ae0252965a33dc6ede3e0
+ size 1551059288
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token": "<BOS>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<EOS>",
+   "mask_token": "<MASK>",
+   "model_max_length": 1024,
+   "pad_token": "<PAD>",
+   "padding_side": "right",
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "truncation_side": "right",
+   "unk_token": "<UNK>"
+ }