Faseeh (ูุตูŠุญ)

A language model designed to translate into eloquent Classical Arabic, because what currently dominates translation is a modernized Arabic ("Aranjiyya").

And what is Aranjiyya?

A language that is Arabic on the surface and Frankish (European) underneath. Examples of this are plentiful, among them: ู†ู…ุท ุญูŠุงุฉ ("lifestyle") instead of ู…ุนูŠุดุฉ; ุฃุฑุถูŠุฉ ู…ุดุชุฑูƒุฉ ("common ground") instead of ูƒู„ู…ุฉ ุณูˆุงุก; ุณู„ุงู… ุฏุงุฎู„ูŠ ("inner peace") instead of ุทู…ุฃู†ูŠู†ุฉ or ุณูƒูŠู†ุฉ; and ุณู„ุจูŠุงุช ูˆุฅูŠุฌุงุจูŠุงุช ("negatives and positives") instead of a thing's merits and faults, its virtues and vices.
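To make the idea concrete, here is a toy lookup of such calques and the classical expressions preferred above. This is an illustrative sketch only; the `CALQUES` table and `declassicalize` helper are hypothetical and are not part of the Faseeh model.

```python
# Illustrative only: a few modern calques ("Aranjiyya") mapped to the
# classical Arabic expressions preferred in the text above.
CALQUES = {
    "ู†ู…ุท ุญูŠุงุฉ": "ู…ุนูŠุดุฉ",          # "lifestyle"
    "ุฃุฑุถูŠุฉ ู…ุดุชุฑูƒุฉ": "ูƒู„ู…ุฉ ุณูˆุงุก",   # "common ground"
    "ุณู„ุงู… ุฏุงุฎู„ูŠ": "ุทู…ุฃู†ูŠู†ุฉ",       # "inner peace"
}

def declassicalize(text: str) -> str:
    """Replace known calques with their classical counterparts."""
    for calque, classical in CALQUES.items():
        text = text.replace(calque, classical)
    return text
```

A real system would need morphological analysis rather than string lookup, but the table captures the kind of substitution the model is trained to prefer.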


Faseeh

A machine translation model (MTM) designed to translate into authentic Classical Arabic

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

model_name = "Abdulmohsena/Faseeh"

tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="arb_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
generation_config = GenerationConfig.from_pretrained(model_name)


# Example English input to translate into Classical Arabic
dummy = "And the Saudi Arabian Foreign Minister assured the visitors of the importance to seek the security."

encoded_ar = tokenizer(dummy, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, generation_config=generation_config)

# Decode the generated token IDs back into Arabic text
tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

Model Details

  • Fine-tuned version of Facebook's NLLB-200 Distilled (600M parameters)

Bias, Risks, and Limitations

  • The language pairs outside of the Quran were mostly generated with Google Translate. Thus, translation quality depends on the quality of Google's translation from Classical Arabic to English.
  • The metric used for this model is BERTScore/E5 score. It is far from perfect in terms of alignment, but it is the best available metric for semantic translation; until a better substitute appears, it remains the main evaluation metric.
  • Metrics generally used to evaluate translation quality into Arabic are trained on Modern Standard Arabic, making them poorly aligned with the goals of this model.
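The intuition behind embedding-based metrics such as BERTScore or an E5-based score is that sentences are mapped to vectors and compared by cosine similarity, so a faithful paraphrase can score well even with little word overlap. A minimal sketch of that comparison, using made-up 3-dimensional vectors (real sentence embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: a paraphrase should land close to the reference,
# an unrelated sentence far from it.
reference  = [0.9, 0.1, 0.2]
paraphrase = [0.85, 0.15, 0.25]
unrelated  = [0.0, 1.0, 0.1]

paraphrase_score = cosine(reference, paraphrase)  # high
unrelated_score = cosine(reference, unrelated)    # low
```

This is only the comparison step; BERTScore additionally aligns token-level embeddings, and E5-style scores depend on the encoder used.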

Improvements

  • A much better approach to generating language pairs from Classical Arabic text is to use GPT-4o (at the time of writing, the only model capable of understanding complex Arabic sentences).
  • Evaluation metrics designed for this model's goal are needed. Currently, I have only built a binary classifier that judges whether a sentence is classical or not, producing a score from 0 to 1; it is neither sufficient nor flexible, so more work needs to be done on evaluation.
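The binary classifier described above produces a score in [0, 1] that must then be thresholded into a label. A minimal sketch of that final step, with a hypothetical `label_from_score` helper and a 0.5 cutoff chosen purely for illustration:

```python
def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Turn a [0, 1] classicalness score into a binary label.

    The real scorer is a trained classifier; this only shows how its
    output would be thresholded.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return "classical" if score >= threshold else "modern"
```

The rigidity of a single threshold is exactly the inflexibility noted above: it gives no partial credit for text that is, say, mostly classical with one modern calque.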

Training Data

  • Arabic text outside of Hugging Face datasets is scraped from the Shamela Library

Metrics

  • COMET: rewards preserving the overall meaning rather than matching individual words (semantic translation, not syntactic translation)
  • Fluency Score: a custom-built metric that classifies whether a sentence is classical or not.