Maral 7B Alpha 1

What is Maral?

Maral is a new large language model specializing in the Persian language. It is based on Mistral and trained on an Alpaca Persian dataset. It is one of the few efforts in the Persian-speaking community to bring our language to new life in the era of AI.

Also, since Maral is based on Mistral, it's capable of producing English answers as well.

What does "Maral" mean?

Maral is the Persian name for the red deer, a deer species native to Iran. The name was chosen for a few reasons: first, the environmental concerns we have, and second, since it is a Persian LLM made by Iranian people, it deserves an Iranian name.

Inference

Prompt Format

This model requires the Guanaco format, which looks like this:

### Human: <prompt>
### Assistant: <answer>

So in your code, you may write prompts like this:

prompt = "در سال ۱۹۹۶ چه کسی رییس جمهور آمریکا بود؟"
prompt = f"### Human:{prompt}\n### Assistant:"

More information about this is available in the inference sections.
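The formatting step above can be wrapped in a small helper. This is only a minimal sketch: the function name build_prompt is hypothetical and not part of the model's repository, and it reproduces the exact string from the example above (including the absence of a space after "### Human:").

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the Guanaco-style template shown above."""
    # Same f-string as in the example: the model is expected to
    # continue the text right after "### Assistant:".
    return f"### Human:{user_message}\n### Assistant:"


prompt = build_prompt("در سال ۱۹۹۶ چه کسی رییس جمهور آمریکا بود؟")
print(prompt)
```

Keeping the template in one place makes it less likely that different parts of your code drift apart in how they format prompts.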

4 bit Quantization

If you want to use 4 bit quantization, we have a PEFT for you here. Also, you can find Google Colab notebooks here.

Installing Libraries

pip install transformers accelerate bitsandbytes

NOTE: The bitsandbytes library is only needed for the 8-bit version. Otherwise, it's not necessary.

Inference on a big GPU

If you have a big enough GPU, such as an A100, in your possession, this code is for you.

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name_or_id = "MaralGPT/Maral-7B-alpha-1"

model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)

prompt = "در سال ۱۹۹۶ چه کسی رییس جمهور آمریکا بود؟"
prompt = f"### Human:{prompt}\n### Assistant:"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,
    temperature=0.5,
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Inference on a small GPU (Consumer Hardware/Free Colab)

The code is pretty much the same as above, but with a slight difference.

Make sure bitsandbytes is installed correctly.
Your model loading must be model = AutoModelForCausalLM.from_pretrained(model_name_or_id, load_in_8bit=True, torch_dtype=torch.bfloat16, device_map="auto")

On the free version of Google Colab, you may run into RAM problems. Passing low_cpu_mem_usage=True when loading the model may help.
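Putting the changes above together, the 8-bit loading step might look like the sketch below. The helper names build_load_kwargs and load_maral are hypothetical; the keyword arguments are the ones mentioned in this section. Whether low_cpu_mem_usage actually avoids the Colab RAM problem is an untested assumption, and newer transformers releases may prefer passing a BitsAndBytesConfig instead of load_in_8bit directly.

```python
def build_load_kwargs(use_8bit: bool = True, low_cpu_mem: bool = False) -> dict:
    """Collect the from_pretrained keyword arguments described above."""
    kwargs = {"device_map": "auto"}
    if use_8bit:
        kwargs["load_in_8bit"] = True  # requires bitsandbytes
    if low_cpu_mem:
        kwargs["low_cpu_mem_usage"] = True  # may reduce peak RAM on free Colab
    return kwargs


def load_maral(model_name_or_id: str = "MaralGPT/Maral-7B-alpha-1", **options):
    """Load the model and tokenizer; needs torch, transformers and a GPU."""
    # Imported lazily so build_load_kwargs can be inspected without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_id,
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(**options),
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
    return model, tokenizer


# Inspect the arguments without downloading anything.
print(build_load_kwargs(use_8bit=True, low_cpu_mem=True))
```

Calling load_maral(use_8bit=True, low_cpu_mem=True) would then return a model and tokenizer that can be used with the same generation code as in the big-GPU example.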

Known Issues

The model produces GPT-3.5 level answers in terms of grammar (especially in Persian) but is prone to severe hallucinations. This problem can be addressed with a better dataset and better training procedures (such as DPO).
Related to the previous issue, the model can also generate misleading answers, especially when dealing with reasoning problems in Persian.
The model is large, so it requires a lot of resources to run properly. However, we may provide GPTQ or GGUF versions as well.
The prompt format works and proves our concept of an instruction-following LLM, but since we haven't changed eos_token and bos_token to our own, you may see unnecessary information being generated by the model.
Related to the previous issue, the model may repeat itself. As a temporary workaround, keep the temperature below 1. According to our tests, somewhere between 0.5 and 0.7 is a sweet spot.
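Until the eos_token issue is addressed, one possible workaround for the extra text is to trim the decoded output yourself. The sketch below is an assumption on our side, not something shipped with the model: the function name extract_answer is hypothetical, and it simply keeps the text between the first "### Assistant:" marker and the next "### Human:" marker, if any.

```python
def extract_answer(decoded: str) -> str:
    """Return only the first assistant reply from a decoded generation."""
    marker = "### Assistant:"
    # Everything before the first marker is the prompt; keep what follows.
    _, found, tail = decoded.partition(marker)
    if not found:
        # No marker at all: return the text unchanged apart from whitespace.
        return decoded.strip()
    # If the model kept going and started a new turn, cut it off there.
    answer = tail.split("### Human:", 1)[0]
    return answer.strip()


decoded = "### Human:Hi\n### Assistant: Hello!\n### Human: more text\n### Assistant: again"
print(extract_answer(decoded))
```

This only hides the extra turns after generation; it does not stop the model from spending tokens on them, so keeping max_new_tokens reasonably low still matters.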

Our Team

Muhammadreza Haghiri (Website - Github - LinkedIn)
Mahi Mohrechi (Website - Github - LinkedIn)

Special Thanks

Mistral Team for providing the best open source base model ever.
Sina Rashidi, who translated Alpaca dataset to Persian.
Jupyto team for providing our infrastructure.
