license: apache-2.0

Introduction

Eval

Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark)

Model               Accuracy
mistral7b           0.4992
csmpt-130k          0.5004
csmpt-100k          0.4959
csmpt-75k           0.4895
csmpt-50k steps     0.4755
csmpt-26.5k steps   0.4524

We ran validation on CS-HellaSwag throughout training. After 100k steps, the improvements were very noisy, if there were any at all. The improvement of csmpt-130k over mistral7b (0.5004 vs. 0.4992) is not significant.
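To give a feel for why a gap of 0.0012 in accuracy is not significant, here is a rough sketch using an unpaired two-proportion z-test. This is an illustration only, not the authors' procedure. The dev-set size `N_EXAMPLES` is an assumption (the README does not state it), and the function name is hypothetical.

```python
import math


def two_proportion_z_test(acc_a: float, acc_b: float, n: int) -> tuple[float, float]:
    """Two-sided z-test for the difference of two accuracies,
    each measured on n examples (unpaired approximation)."""
    # With equal sample sizes, the pooled proportion is the mean of the two.
    pooled = (acc_a + acc_b) / 2
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (acc_a - acc_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMPTION: the number of dev examples is not given in this README.
N_EXAMPLES = 10_000

# Accuracies taken from the table above: csmpt-130k vs. mistral7b.
z, p = two_proportion_z_test(0.5004, 0.4992, N_EXAMPLES)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Because both models are scored on the same examples, a paired test such as McNemar's would be the more appropriate choice, but it requires per-example predictions, which are not part of this README.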

Usage

How to Set Up the Environment

pip install transformers==4.37.2 torch==2.1.2 einops==0.7.0

# Be sure to install the flash-attn wheel that matches your setup. We use torch compiled with CUDA 12.1, no C++11 ABI, Python 3.9, Linux x86_64.
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.3/flash_attn-2.5.3+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
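The wheel file name above encodes the flash-attn version, CUDA build, torch version, C++11 ABI flag, and Python tag. As a small sketch of how those pieces fit together, the helper below rebuilds the name from its parts. The function names are hypothetical, and the naming pattern is inferred from the single wheel URL in this README, so check the flash-attention release page before relying on it for other combinations.

```python
import sys


def flash_attn_wheel_name(version: str, cuda_tag: str, torch_tag: str,
                          cxx11_abi: bool, python_tag: str,
                          platform_tag: str = "linux_x86_64") -> str:
    """Assemble a wheel file name following the pattern seen in this README."""
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (f"flash_attn-{version}+{cuda_tag}torch{torch_tag}"
            f"cxx11abi{abi}-{python_tag}-{python_tag}-{platform_tag}.whl")


def current_python_tag() -> str:
    """CPython tag of the running interpreter, e.g. 'cp39' for Python 3.9."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


# Values taken from the install command above.
wheel = flash_attn_wheel_name("2.5.3", "cu122", "2.1", False, "cp39")
print(wheel)
print("Python tag of this interpreter:", current_python_tag())
```

If the printed Python tag differs from `cp39`, the wheel from the command above will not match your interpreter. After installing torch, `torch.__version__` and `torch.version.cuda` report the torch and CUDA versions to compare against.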

Running the Code

import torch
import transformers
from transformers import pipeline

name = 'BUT-FIT/csmpt7b'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:0'  # For fast initialization directly on GPU!
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # Load model weights in bfloat16
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Nejznámějším českým spisovatelem ',
             max_new_tokens=100,
             top_p=0.95,
             repetition_penalty=1.0,
             do_sample=True,
             use_cache=True))
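The call above samples with `top_p=0.95` (nucleus sampling) and `repetition_penalty=1.0`, which leaves repetition unpenalized. To illustrate what `top_p` does, here is a minimal pure-Python sketch: it keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. It is not the transformers implementation, and the toy distribution and function names are made up for the example.

```python
import random


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution, for illustration only.
toy = {"Karel": 0.5, "Franz": 0.3, "Milan": 0.15, "xyz": 0.05}
nucleus = top_p_filter(toy, top_p=0.9)
print(nucleus)
print(sample(nucleus, random.Random(0)))
```

Lowering `top_p` cuts off more of the low-probability tail and makes the output more conservative. A value of 1.0 keeps the whole distribution.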

Training Data

We release most of our training data here [TBD MDocekal.].

Our Release Plan

Stage  Description                                                                Date
1      'Best' model + training data                                               11.03.2024
2      All checkpoints + training code
3      Benczechmark, a collection of Czech datasets for few-shot LLM evaluation. Get in touch if you want to contribute!
4      Preprint publication

Getting in Touch

For further questions, email martin.fajcik@vut.cz.

Disclaimer

This is a probabilistic model, and the authors are not responsible for its outputs. Use at your own risk.

Acknowledgement

This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT ("Sémantický průzkumník textového kulturního dědictví"), grant no. DH23P03OVV060, and by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID:90254).