---
license: apache-2.0
language:
- en
- sl
- hr
- sr
- bs
library_name: transformers
---

# Model Card for OPT_GaMS 1B

We proudly present the family of GaMS (Generative Model for Slovene) models. The 1B version is based on [Facebook's OPT model](https://huggingface.co/facebook/opt-1.3b) and is adapted for Slovene. OPT_GaMS models use the original OPT tokenizer.

## Acknowledgment

The model was developed within the [PoVeJMo](https://www.cjvt.si/povejmo/en/project/) research program (Adaptive Natural Language Processing with Large Language Models), particularly within the research project titled SloLLaMai -- Open-access computationally efficient models for Slovenian. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU. The authors also acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P6-0411 -- Language Resources and Technologies for Slovene).

We thank everyone who worked on data collection and preparation, enabling us to train our model. Special thanks go to Nikola Ljubešić, Tjaša Arčon, Jaka Čibej, Simon Krek, Tomaž Erjavec and Iztok Kosem.

## Basic information

- **Developed by:** a team of researchers at the University of Ljubljana, Faculty of Computer and Information Science, and XLAB.doo. Team members: Domen Vreš, Martin Božič, Aljaž Potočnik, Tomaž Martinčič, Iztok Lebar Bajec, Timotej Petrič and Marko Robnik-Šikonja.
- **Languages:** Slovene (primary), English, Croatian, Bosnian and Serbian (secondary)
- **License:** Apache 2.0
- **Repository:** https://github.com/SloLama/NeMo
- **Paper:** https://www.sdjt.si/wp/wp-content/uploads/2024/09/JT-DH-2024_Vres_Bozic_Potocnik_Martincic_Robnik.pdf

## Intended usage

This version of the model is quite small and lacks instruction and safety tuning. Hence, using it as a general-purpose model is **STRONGLY DISCOURAGED!** The model may also contain certain biases. We do not recommend using this model in any language other than Slovene.

The model can be efficiently tuned for specific use cases, as suggested by the promising results of the fine-tuned models on the SuperGLUE and SI-NLI benchmarks.
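As an illustration only, such fine-tuning could be set up with the Hugging Face `Trainer` roughly as in the sketch below. The dataset, column names, and hyperparameters are placeholders and do not correspond to the setup used for the benchmark results reported later in this card.

```python
# Hypothetical fine-tuning sketch: causal-LM tuning of OPT_GaMS-1B with the
# Hugging Face Trainer. The dataset, column names, and hyperparameters are
# illustrative placeholders, not the setup used for the reported results.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "cjvt/OPT_GaMS-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any text dataset with a "text" column works here; replace with your own task data.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="opt_gams_1b_finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```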

## How to Get Started with the Model

Inference can be run using the following code snippet:

```python
from transformers import pipeline

model_id = "cjvt/OPT_GaMS-1B"

pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)

prompts = [
    "The examples of antonyms are:\nhigh => low\nwide => narrow\nbig =>",
    "Pristanek je bil prvi nadzorovani spust ameriškega vesoljskega plovila na površje Lune po Apollu 17 leta 1972, ko je na Luni pristala zadnja Nasina misija s posadko.\nDoslej so na Luni pristala vesoljska plovila le iz štirih drugih držav –",
    "U četvrtak je bila prva polufinalna večer Dore, a komentari na društvenim mrežama ne prestaju. U nedjeljno finale prošli su:"
]

sequences = pline(
    prompts,
    max_length=1000,
    do_sample=False,
    num_return_sequences=1
)

for seq in sequences:
    print("--------------------------")
    print(f"Result: {seq[0]['generated_text']}")
    print("--------------------------\n")
```

## Training Details

### Training Data

The model was additionally pretrained on the following Slovene, English, and Croatian-Bosnian-Serbian (CBS) corpora:
| Corpus | Language | # Tokens | Percentage |
| :----- | :------- | :------: | :--------: |
| MetaFida | Slovene | 6.59 B | 13.89 % |
| KAS | Slovene | 3.61 B | 7.62 % |
| Trendi | Slovene | 1.4 B | 2.96 % |
| mC4 | Slovene | 5.5 B | 11.6 % |
| MaCoCu | Slovene | 4.68 B | 9.86 % |
| CC100 | Slovene | 0.54 B | 1.14 % |
| Riznica | Croatian | 0.21 B | 0.44 % |
| Hr News | Croatian | 4.16 B | 8.77 % |
| MaCoCu HBS | CBS | 15.65 B | 32.98 % |
| Wikipedia | English | 4.7 B | 9.9 % |
| CC-News | English | 0.4 B | 0.83 % |

The total size of additional training data is **47.44 B** tokens.
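The percentages in the table follow directly from the token counts; for instance, they can be recomputed with the short snippet below (counts copied from the table, so small rounding differences are expected).

```python
# Recompute the corpus shares from the (rounded) token counts in the table above.
corpora = {
    "MetaFida": 6.59, "KAS": 3.61, "Trendi": 1.4, "mC4": 5.5,
    "MaCoCu": 4.68, "CC100": 0.54, "Riznica": 0.21, "Hr News": 4.16,
    "MaCoCu HBS": 15.65, "Wikipedia": 4.7, "CC-News": 0.4,
}
total = sum(corpora.values())  # 47.44 B tokens
for name, tokens in corpora.items():
    print(f"{name}: {100 * tokens / total:.2f} %")
```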

### Training Procedure

The model was trained using the NeMo framework on the Slovene HPC Vega, utilizing 64 A100 GPUs simultaneously. Training took approximately 16 hours. The model was trained with a batch size of 1024 (2 million tokens), using the Adam optimizer and a cosine learning rate scheduler with 1000 warmup and constant steps.
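For orientation, the sketch below shows one way to express the stated optimization setup (Adam plus a cosine learning-rate schedule with warmup and constant steps) in plain PyTorch. The step counts, learning rates, and the placement of the constant phase are assumptions; this is not the actual NeMo configuration used for training.

```python
# Illustrative PyTorch sketch of the stated optimization setup: Adam with a
# cosine learning-rate schedule, linear warmup, and a constant phase (assumed
# here to be held at the minimum LR after the decay). All step counts and
# learning rates are placeholders; the real run used NeMo on 64 A100 GPUs.
import math
import torch

warmup_steps = 1000       # linear warmup, as stated above
constant_steps = 1000     # assumed length of the constant phase
total_steps = 20_000      # placeholder for the total number of update steps
peak_lr = 3e-4            # placeholder peak learning rate
min_lr = 3e-5             # placeholder minimum learning rate

model = torch.nn.Linear(10, 10)  # stand-in for the actual 1B-parameter model
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    """Return the LR multiplier (relative to peak_lr) for a given step."""
    if step < warmup_steps:                                   # linear warmup
        return step / max(1, warmup_steps)
    decay_steps = total_steps - warmup_steps - constant_steps
    if step < warmup_steps + decay_steps:                     # cosine decay
        progress = (step - warmup_steps) / max(1, decay_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr
    return min_lr / peak_lr                                   # constant tail

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```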

## Evaluation

The models were evaluated using [Slovene SuperGLUE](https://slobench.cjvt.si/leaderboard/view/3) and [SI-NLI](https://slobench.cjvt.si/leaderboard/view/9) tasks on [SloBench](https://slobench.cjvt.si). Additionally, the models were evaluated on an improved version of the Slovenian-LLM-eval benchmark introduced by Aleksa Gordić. All decoder-type models were evaluated using few-shot prompts and were not finetuned on the benchmarks (except for the versions with *finetuned* in the name).
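To illustrate what few-shot prompting looks like in practice, the sketch below builds a prompt from a couple of labelled examples and reads the model's continuation as the predicted label. The examples, labels, and formatting are hypothetical placeholders, not the exact prompts used on the benchmarks.

```python
# Hypothetical illustration of few-shot prompting for an NLI-style task.
# The examples, labels, and formatting are placeholders, not the exact
# prompts used for the SuperGLUE / SI-NLI / Slovenian-LLM-eval results.
from transformers import pipeline

pline = pipeline("text-generation", model="cjvt/OPT_GaMS-1B", device_map="auto")

few_shot_examples = [
    ("Premisa: Pes teče po parku.\nHipoteza: Žival se giblje.", "sosledje"),
    ("Premisa: Zunaj dežuje.\nHipoteza: Sonce sije in nebo je jasno.", "nasprotovanje"),
]
test_item = "Premisa: Otrok bere knjigo.\nHipoteza: Knjiga je zelo stara."

# Concatenate labelled examples before the test item, ending with an open "Answer:".
prompt = ""
for text, label in few_shot_examples:
    prompt += f"{text}\nOdgovor: {label}\n\n"
prompt += f"{test_item}\nOdgovor:"

# Generate only a few tokens; the predicted label is read off the continuation.
output = pline(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
print(output[len(prompt):].strip())
```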

### SuperGLUE results

| Model | SuperGLUE Average | BoolQ Accuracy | CB Accuracy | CB F1 Score | CB Average | COPA Accuracy | MultiRC EM | MultiRC F1a Score | MultiRC Average | RTE Accuracy | WSC Accuracy |
| :---- | :---------------: | :------------: | :---------: | :---------: | :--------: | :-----------: | :--------: | :---------------: | :-------------: | :----------: | :----------: |
| OPT_GaMS-1B                | 0.4408     | 0.5667     | 0.5040     | 0.3885     | 0.4463     | 0.5020     | 0.0961     | 0.2543     | 0.1752     | 0.4138     | 0.5411     |
| GaMS-1B                    | 0.4604     | 0.5000     | 0.6200     | 0.4565     | 0.5382     | 0.4920     | 0.1351     | 0.2675     | 0.2013     | 0.4828     | 0.5479     |
| OPT_GaMS-1B-Chat           | 0.4165     | 0.7000     | 0.3720     | 0.2961     | 0.3341     | 0.4600     | 0.1111     | 0.3448     | 0.2280     | 0.4138     | 0.3630     |
| GaMS-1B-Chat               | 0.4570     | **0.8000** | 0.4880     | 0.3023     | 0.3951     | 0.4840     | 0.1081     | 0.2428     | 0.1755     | 0.5172     | 0.3699     |
| OPT_GaMS-1B-Chat finetuned | 0.5645     | 0.7000     | 0.8040     | 0.5884     | 0.6962     | 0.5860     | 0.1021     | 0.4808     | 0.2914     | 0.5862     | 0.5274     |
| GaMS-1B-Chat finetuned     | 0.5806     | 0.7333     | **0.8120** | 0.5592     | 0.6856     | 0.5080     | 0.1381     | 0.4882     | 0.3132     | 0.5862     | **0.6575** |
| SlovenianGPT-Chat*         | 0.5078     | 0.7333     | 0.3920     | 0.3829     | 0.3874     | **0.6840** | **0.2432** | 0.4944     | **0.3688** | 0.5172     | 0.3562     |
| CroSloEngual BERT          | **0.6078** | 0.7333     | 0.7920     | **0.7437** | **0.7679** | 0.5720     | 0.0931     | **0.5241** | 0.3086     | **0.6552** | 0.6096     |

*SlovenianGPT-Chat was obtained by instruction-tuning Aleksa Gordić's [SlovenianGPT](https://huggingface.co/gordicaleksa/SlovenianGPT) on our instruction dataset.

### SI-NLI results

| Model | Accuracy | P(entailment) | R(entailment) | F1(entailment) | P(neutral) | R(neutral) | F1(neutral) | P(contradiction) | R(contradiction) | F1(contradiction) |
| :---- | :------: | :-----------: | :-----------: | :------------: | :--------: | :---------: | :---------: | :---------------: | :---------------: | :----------------: |
| OPT_GaMS-1B                | 0.3277     | 0.3407     | 0.6754     | 0.4529     | 0.3538     | 0.1402     | 0.2009     | 0.2632     | 0.1524     | 0.1931     |
| GaMS-1B                    | 0.3317     | 0.3418     | 0.4327     | 0.3819     | 0.3353     | 0.5122     | 0.4053     | 0.2344     | 0.0457     | 0.0765     |
| OPT_GaMS-1B-Chat           | 0.3447     | 0.3515     | 0.6784     | 0.4631     | 0.3386     | 0.3293     | 0.3338     | 0.2105     | 0.0122     | 0.0231     |
| GaMS-1B-Chat               | 0.3417     | 0.3405     | **0.9737** | 0.5045     | 0.2857     | 0.0061     | 0.0119     | 0.4615     | 0.0183     | 0.0352     |
| OPT_GaMS-1B-Chat finetuned | 0.7244     | 0.7065     | 0.8304     | 0.7634     | 0.7269     | 0.6006     | 0.6578     | 0.7446     | 0.7378     | 0.7412     |
| GaMS-1B-Chat finetuned     | 0.7144     | 0.8037     | 0.6345     | 0.7092     | 0.7247     | 0.6341     | 0.6764     | 0.6531     | **0.8780** | 0.7490     |
| SlovenianGPT-Chat*         | 0.4729     | 0.4399     | 0.7281     | 0.5485     | 0.3719     | 0.1372     | 0.2004     | 0.5723     | 0.5427     | 0.5571     |
| GPT-3.5-Turbo finetuned    | **0.8567** | **0.8464** | 0.8538     | **0.8501** | **0.8041** | **0.8384** | **0.8209** | **0.9260** | **0.8780** | **0.9014** |
| SloBERTa                   | 0.7375     | 0.8127     | 0.7105     | 0.7582     | 0.6844     | 0.7470     | 0.7143     | 0.7273     | 0.7561     | 0.7414     |
| CroSloEngual BERT          | 0.6623     | 0.7147     | 0.6667     | 0.6899     | 0.6072     | 0.6646     | 0.6346     | 0.6719     | 0.6555     | 0.6636     |

*SlovenianGPT-Chat was obtained by instruction-tuning Aleksa Gordić's [SlovenianGPT](https://huggingface.co/gordicaleksa/SlovenianGPT) on our instruction dataset.

### Slovenian-LLM-eval results

| Model | ARC-Challenge Accuracy | ARC-Easy Accuracy | BoolQ Accuracy | HellaSwag Accuracy | NQ-Open EM | OpenBookQA Accuracy | PIQA Accuracy | WinoGrande Accuracy |
| :---- | :--------------------: | :---------------: | :------------: | :----------------: | :--------------: | :-----------------: | :-----------: | :-----------------: |
| OPT_GaMS-1B        | 0.2227 ± 0.0122     | 0.436  ± 0.0102     | 0.378  ± 0.0085     | 0.3394 ± 0.0047     | 0.0003 ± 0.0003     | 0.214 ± 0.0184     | 0.6083 ± 0.0114     | 0.5533 ± 0.014      |
| GaMS-1B            | 0.2329 ± 0.0124     | 0.4743 ± 0.0102     | 0.3813 ± 0.0085     | 0.3555 ± 0.0048     | 0.0036 ± 0.001      | 0.22  ± 0.0185     | 0.624  ± 0.0113     | 0.532  ± 0.014      |
| OPT_GaMS-1B-Chat   | 0.2355 ± 0.0124     | 0.3960 ± 0.0100     | 0.4398 ± 0.0087     | 0.3459 ± 0.0047     | 0.0011 ± 0.0006     | 0.20  ± 0.0179     | 0.5778 ± 0.0115     | 0.5359 ± 0.014      |
| GaMS-1B-Chat       | 0.2517 ± 0.0127     | 0.4394 ± 0.0102     | 0.4502 ± 0.0087     | 0.3634 ± 0.0048     | 0      ± 0          | 0.196 ± 0.0178     | 0.6115 ± 0.0114     | 0.5572 ± 0.014      |
| YugoGPT            | 0.2961 ± 0.0133     | 0.4781 ± 0.0102     | 0.3783 ± 0.0085     | 0.3890 ± 0.0047     | 0.0385 ± 0.0032     | 0.226 ± 0.0187     | 0.5816 ± 0.0115     | 0.5588 ± 0.014      |
| SlovenianGPT       | **0.3805 ± 0.0142** | **0.6498 ± 0.0098** | 0.4523 ± 0.0087     | **0.4935 ± 0.0050** | **0.0432 ± 0.0034** | **0.27  ± 0.0199** | **0.6937 ± 0.0108** | **0.644  ± 0.0135** |
| SlovenianGPT-Chat* | 0.3567 ± 0.014      | 0.5901 ± 0.0101     | **0.4706 ± 0.0087** | 0.4719 ± 0.0050     | 0.0003 ± 0.0003     | **0.27  ± 0.0199** | 0.6861 ± 0.0108     | 0.6425 ± 0.0135     |

*SlovenianGPT-Chat was obtained by instruction-tuning Aleksa Gordić's [SlovenianGPT](https://huggingface.co/gordicaleksa/SlovenianGPT) on our instruction dataset.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/652d40a78fa1fbb0aae165bb/_2h977RjIu0nI_IJG_9bL.png)

## Citation

```bibtex
@inproceedings{GaMS,
 author = {Vre{\v s}, Domen and Bo{\v z}i{\v c}, Martin and Poto{\v c}nik, Alja{\v z} and Martin{\v c}i{\v c}, Toma{\v z} and Robnik-{\v S}ikonja, Marko},
 booktitle = {Language Technologies and Digital Humanities Conference},
 title = {{Generative Model for Less-Resourced Language with 1 billion parameters}},
 url = {https://www.sdjt.si/wp/wp-content/uploads/2024/09/JT-DH-2024_Vres_Bozic_Potocnik_Martincic_Robnik.pdf},
 year = {2024}
}
```