---
license: apache-2.0
language:
  - en
  - sl
  - hr
  - sr
  - bs
library_name: transformers
---

OPT_GaMS 1B

This is the 1B OPT model, additionally pretrained on Slovene, English, and Croatian-Bosnian-Serbian (CBS) data. The model was created as part of the Povejmo project: https://www.cjvt.si/povejmo/.

This is the base version of the model and is not instruction-tuned.
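Because a base model only continues text, a task is typically posed as a pattern to complete rather than as an instruction. The following is a minimal sketch of a helper that builds such a few-shot prompt in the `input => output` style used by the first prompt in the usage example. The function name `build_few_shot_prompt` and its parameters are hypothetical and are not part of the model or of `transformers`.

```python
def build_few_shot_prompt(header, examples, query, sep=" => "):
    """Build a few-shot completion prompt for a base (non-instruction-tuned) model.

    The prompt ends right after the separator, so the most natural
    continuation is the answer for `query`.
    """
    lines = [header]
    lines += [f"{source}{sep}{target}" for source, target in examples]
    # Leave the last line open: drop the trailing space so it ends with "=>".
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "The examples of antonyms are:",
    [("high", "low"), ("wide", "narrow")],
    "big",
)
print(prompt)
```

The resulting string can be passed to the text-generation pipeline like any other prompt.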

Data

The model was additionally pretrained on the following Slovene, English, and Croatian-Bosnian-Serbian (CBS) corpora:

| Corpus | Language | # Tokens | Percentage |
|---|---|---|---|
| Metafida | Slovene | 6.59 B | 13.89 % |
| KAS | Slovene | 3.61 B | 7.62 % |
| Trendi | Slovene | 1.4 B | 2.96 % |
| mC4 | Slovene | 5.5 B | 11.6 % |
| MaCoCu | Slovene | 4.68 B | 9.86 % |
| CC100 | Slovene | 0.54 B | 1.14 % |
| Rižnica | Croatian | 0.21 B | 0.44 % |
| Hr News | Croatian | 4.16 B | 8.77 % |
| MaCoCu HBS | CBS | 15.65 B | 32.98 % |
| Wikipedia | English | 4.7 B | 9.9 % |
| CC-News | English | 0.4 B | 0.83 % |

The additional training data totals 47.44 B tokens.
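As a quick sanity check, the total and the per-language split can be recomputed from the token counts listed above. This is only an illustrative sketch: the counts are copied from the table (in billions of tokens), and the variable names are made up for this example. Because the listed counts are rounded, recomputed percentages may differ from the table in the last digit.

```python
from collections import defaultdict

# (corpus, language, tokens in billions), copied from the table above.
corpora = [
    ("Metafida", "Slovene", 6.59),
    ("KAS", "Slovene", 3.61),
    ("Trendi", "Slovene", 1.4),
    ("mC4", "Slovene", 5.5),
    ("MaCoCu", "Slovene", 4.68),
    ("CC100", "Slovene", 0.54),
    ("Rižnica", "Croatian", 0.21),
    ("Hr News", "Croatian", 4.16),
    ("MaCoCu HBS", "CBS", 15.65),
    ("Wikipedia", "English", 4.7),
    ("CC-News", "English", 0.4),
]

total = sum(tokens for _, _, tokens in corpora)

# Sum the token counts per language.
by_language = defaultdict(float)
for _, language, tokens in corpora:
    by_language[language] += tokens

print(f"Total: {total:.2f} B tokens")
for language, tokens in by_language.items():
    print(f"{language}: {tokens:.2f} B ({100 * tokens / total:.2f} %)")
```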

Model usage

Inference can be run with the following code snippet:

```python
from transformers import AutoTokenizer, pipeline

model_id = "cjvt/OPT_GaMS-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a text-generation pipeline.
# device_map="auto" requires the accelerate package to be installed.
pline = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

# Example prompts in English, Slovene, and Croatian.
prompts = [
    "The examples of antonyms are:\nhigh => low\nwide => narrow\nbig =>",
    "Pristanek je bil prvi nadzorovani spust ameriškega vesoljskega plovila na površje Lune po Apollu 17 leta 1972, ko je na Luni pristala zadnja Nasina misija s posadko.\nDoslej so na Luni pristala vesoljska plovila le iz štirih drugih držav –",
    "U četvrtak je bila prva polufinalna večer Dore, a komentari na društvenim mrežama ne prestaju. U nedjeljno finale prošli su:",
]

# Greedy decoding (do_sample=False).
# max_length counts the prompt tokens plus the generated tokens.
sequences = pline(
    prompts,
    max_length=1000,
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

# Each entry in `sequences` is a list of candidates for one prompt.
for seq in sequences:
    print("--------------------------")
    print(f"Result: {seq[0]['generated_text']}")
    print("--------------------------\n")
```
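By default, the `generated_text` printed above contains the prompt followed by the continuation. A minimal sketch of separating the two is shown below. It assumes the output shape that the loop above iterates over (one list of candidate dicts per prompt). The helper name `extract_continuations` is hypothetical, and the demo data is a made-up placeholder, not actual model output, so the sketch runs without downloading the model.

```python
def extract_continuations(prompts, sequences):
    """Return only the newly generated text for each prompt.

    `sequences` is assumed to hold one list per prompt, each containing
    dicts with a "generated_text" key.
    """
    continuations = []
    for prompt, candidates in zip(prompts, sequences):
        text = candidates[0]["generated_text"]
        # Drop the echoed prompt if the generated text starts with it.
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


# Placeholder standing in for real pipeline output (not produced by the model).
demo_prompts = ["big =>"]
demo_sequences = [[{"generated_text": "big => small"}]]
print(extract_continuations(demo_prompts, demo_sequences))
```

Alternatively, passing `return_full_text=False` to the text-generation pipeline call returns only the continuation, which makes a helper like this unnecessary.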