metadata
language:
  - en
  - yo
metrics:
  - perplexity
pipeline_tag: question-answering

Model Card for YorubaLlama

Use this notebook to run inference. We recommend running inference on an L4 or A100 GPU.
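As a rough illustration of running inference outside the notebook, the sketch below loads the model with the Hugging Face transformers library. The repository ID `Jacaranda/YorubaLlama`, the prompt template, and the generation settings are assumptions, not values confirmed by this card. It also assumes the repository hosts full model weights rather than a standalone adapter, and that `torch`, `transformers`, and `accelerate` are installed.

```python
# Hypothetical Hub repository ID; replace it with the model's actual ID.
MODEL_ID = "Jacaranda/YorubaLlama"


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in a simple template (an assumed, illustrative format)."""
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"


def generate(instruction: str, model_id: str = MODEL_ID, max_new_tokens: int = 256) -> str:
    """Load the model and return only the newly generated text."""
    # Heavy imports live inside the function so the helper above works without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # bfloat16 is supported on both L4 and A100
        device_map="auto",           # requires the accelerate package
    )
    inputs = tokenizer(build_prompt(instruction), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the model's continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Kí ni ó ń fa ìrora orí?")` would download the weights on first use, so it is best run on the recommended GPU hardware.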

Model Details

YorubaLlama3 is an 8B-parameter language model built on meta-llama/Meta-Llama-3-8B. It has been specifically enhanced to process and generate text in the Yoruba language. The model aims to improve natural language understanding and generation for Yoruba-speaking users and researchers.

Model Description

Key features:

  • Improved performance on Yoruba language tasks
  • Maintains general language capabilities of the original Llama 3 model
  • Optimized for both understanding and generating Yoruba text

Training

The training process for YorubaLlama involved two main stages:

1. LoRA-based Continual Pre-training:

We conducted continual pre-training on publicly available Yoruba corpora, which we pre-processed using the Meta/Llama3 tokenizer. The primary focus was causal language modeling, specifically training the model to predict the next Yoruba token from the preceding Yoruba tokens. We implemented the LoRA technique: we froze the base parameters of the foundation Meta/Llama3 model and introduced additional lightweight components (adapters). These adapters were trained to capture the intricacies, terminology, and nuances of the Yoruba language. This approach balanced leveraging the knowledge embedded in the pre-trained Meta/Llama3 model with optimizing it for Yoruba, all without incurring the computational cost of retraining the entire Llama3 model.
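To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer. The frozen weight `W` is left untouched and a low-rank product `B·A`, scaled by `alpha / r`, is added to its output. The rank of 16 and the `alpha` values are illustrative assumptions; this card does not state the actual LoRA hyperparameters or which layers received adapters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    r = len(A)  # the rank equals the number of rows of A (A is r x d_in, B is d_out x r)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, r):
    """Return (parameters in the full matrix, parameters in its LoRA adapter)."""
    return d_out * d_in, r * (d_in + d_out)


# Tiny example: with B initialised to zeros the adapter leaves the base output unchanged.
W = [[1, 0, 0], [0, 1, 0]]
A = [[1, 1, 1]]   # rank r = 1
B_zero = [[0], [0]]
x = [1, 2, 3]
unchanged = lora_forward(x, W, A, B_zero, alpha=2)

# A 4096 x 4096 projection (the Llama 3 8B hidden size) with an assumed rank of 16.
full, adapter = lora_param_counts(4096, 4096, 16)
```

For that single projection the adapter holds 131,072 parameters against 16,777,216 in the frozen matrix, which is why adapter training is so much cheaper than retraining the full model.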

2. LoRA-based Instruction Tuning:

  • Fine-tuned on a curated dataset of Yoruba instructions and responses
  • Included task-specific data to improve performance on common language tasks
  • Emphasized maintaining coherence and contextual understanding in Yoruba
  • Incorporated safety datasets to improve the model's ability to generate safe and ethical responses
  • Included examples of harmful content and appropriate non-harmful alternatives
  • Focused on reducing biases and improving the model's understanding of cultural sensitivities in the Yoruba context

The use of LoRA (Low-Rank Adaptation) for both continual pre-training and instruction tuning allowed for efficient adaptation of the base Llama 3 model while minimizing computational resources and reducing the risk of catastrophic forgetting of the original model's capabilities.

Approximate dataset sizes:

  • Continual pre-training: 8.1 GB of text
  • Instruction tuning: 66,280 instruction-response pairs
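The instruction-tuning stage pairs each instruction with a response. One common way to prepare such pairs for causal language modeling is to concatenate prompt and response tokens and compute the loss only on the response. The sketch below shows that label-masking step on plain lists of token IDs. Whether YorubaLlama's training masked the prompt this way is an assumption, not something this card states; the value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response IDs and mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Made-up token IDs standing in for a tokenized instruction and its response.
input_ids, labels = build_labels([5, 6, 7], [8, 9])
```

With this layout the model still sees the full instruction as context, but only the response tokens contribute to the training loss.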

  • Developed by: Jacaranda Health
  • Model type: Llama
  • Language(s) (NLP): Yoruba and English
  • License: CC BY-NC-SA 4.0
  • Model Developers: Stanslaus Mwongela, Jay Patel, Sathy Rajasekharan, Lyvia Lusiji, Francesco Piccino, Mfoniso Ukwak, Ellen Sebastian

Uses

YorubaLlama is optimized for downstream tasks, notably those demanding instructional datasets in Yoruba, English, or both. Organizations can further fine-tune it for their specific domains. Potential areas include:

  • Question-answering within specific domains.
  • Assistant-driven chat capabilities: healthcare, agriculture, legal, education, tourism and hospitality, public services, financial sectors, communication, customer assistance, commerce, etc.

Sample Evaluation Prompts

Prompt 1: Kọ ìtàn kan nípa ẹhoro
(Translation: Write a story about a Rabbit)

YorubaLlama response: Lọjọ kan, ọkan lẹwa ehoro wa ti a npè ni flopsy. O ngbe ninu igbo pẹlu awọn ọrẹ rẹ, rabbito ati giraffe-giraffe. Ọjọ kan, wọn pinnu lati lọ si irin-ajo nla kan. Wọn rin kakiri igbo, wọn n ṣawari gbogbo igun ati igi, titi wọn fi de adágún omi kan. Wọn dun pupọ lati wo oorun didan lori omi naa ki wọn si mu ara wọn ni isinmi. Lẹhinna, wọn tẹsiwaju irin-ajo wọn, wọn n gbadun ẹwa igbo naa titi wọn fi pada sile.
(Translation: One day, a lovely rabbit named Flopsy lived in the forest with her friends, Rabbito and the giraffes. One day, they decided to go on a grand adventure. They wandered through the forest, exploring every corner and tree, until they reached a serene pond. They were delighted to see the sun shimmering on the water and took some time to relax. Afterwards, they continued their journey, enjoying the beauty of the forest until they returned home.)

Meta-Llama-3-8B-Instruct response: A Yoruba phrase! 😊 "Kọ ìtàn kan nípa ẹhoro" translates to "Tell me one story about a horse". Is that correct

Prompt 2: Ṣàlàyé ètò ẹ̀kọ́ ní Nàìjíríà
(Translation: Explain the Education System in Nigeria)

YorubaLlama response: Ètò ẹ̀kọ́ ní nàìjíríà jẹ́ ìlànà àwọn ilé-ẹ̀kọ́ alákọ̀ọ́bẹ̀rẹ̀, alágbèéká àti gíga. Ilé-ẹ̀kọ́ alákọ̀ọ́bẹ̀rẹ̀ pèsè ètò ẹ̀kọ́ fún àwọn ọmọdé láti ọmọ ọdún mẹ́fà sí mọ́kànlá, nígbà tí ilé-ẹ̀kọ́ alágbèéká àti gíga pèsè ètò ẹ̀kọ́ fún àwọn akẹ́kọ̀ọ́ láti ọmọ ọdún mọ́kànlá sí méjìdínlógún. Àwọn ilé-ẹ̀kọ́ alákọ̀ọ́bẹ̀rẹ̀ àti alágbèéká ni ìjọba ṣe àtìlẹ́yìn fún, nígbà tí àwọn ilé-ẹ̀kọ́ gíga jẹ́ ti àwọn aládàní. Gbogbo àwọn ilé-ẹ̀kọ́ gbọdọ̀ tẹ̀lé àwọn ìlànà àti ìbéèrè ìwé-ẹ̀rí ti ìgbìmọ̀ ẹ̀kọ́ orílẹ̀-èdè (national education board) (neb)
(Translation: The education system in Nigeria consists of primary, secondary, and tertiary levels. Primary schools provide education for children from six to eleven years old, while secondary and tertiary institutions offer education for students from eleven to eighteen years old. Primary and secondary schools are supported by the government, while tertiary institutions are primarily private. All schools must adhere to the guidelines and certification requirements set by the National Education Board (NEB).)

Meta-Llama-3-8B-Instruct response: A delightful topic! Education in Nigeria has been a subject of much discussion and concern. Here's a breakdown of the current state of education in Nigeria: Challenges: 1. Funding: Education is one of the most underfunded sectors in Nigeria, with inadequate budgetary allocation. 2. Infrastructure: Many schools lack basic facilities such as classrooms, toilets, and electricity, making it difficult for students to learn effectively. 3. Teacher Quality: The quality of teachers is a major concern, with many not having the necessary qualifications or training. 4. Enrollment: Despite efforts to increase enrollment, many children, especially girls, are still out of school due to poverty, cultural beliefs, and lack of access to education. 5. Curriculum: The curriculum is often outdated and not relevant to the needs of the 21st century, leading to a lack of skills and knowledge among graduates. Reforms: 1. Universal Basic Education (UBE) Program: Launched in 2005, this program aims to provide free and compulsory education for all children between ages 6-15. 2. Almajiri Education Program: This program targets out-of-school children, particularly those in the North, by providing education and skills training. 3. Private Sector Participation: The government has encouraged private sector participation in education to improve infrastructure and quality. 4. Digital Literacy: Efforts are being made to integrate digital literacy into the curriculum to prepare students for the digital age. 5. Skills Acquisition: There is a growing emphasis on vocational training and skills acquisition to equip students with practical skills for the job market. Opportunities: 1. Technology Integration: The use of technology can improve access to education, especially for remote and disadvantaged communities. 2. Private Sector Support: Private companies are investing in education through scholarships, mentorship programs, and infrastructure development. 3. Government Initiatives: The government has launched initiatives such

Prompt 3: Kí ni ó ń fa ìrora orí?
(Translation: What causes headaches?)

YorubaLlama response: Irora ori le jẹ abajade orisirisi ifosiwewe, bii aini isinmi to peye, ipo ara ti ko dara, tabi wahala. O ṣe pataki lati kan si alamọdaju ilera fun ayẹwo ati itọju to tọ.
(Translation: Headaches can be the result of various factors, such as lack of adequate rest, poor posture, or stress. It's important to consult a healthcare professional for proper examination and treatment.)

Meta-Llama-3-8B-Instruct response: A Yoruba phrase! "Kí ni ó ń fa ìrora orí" translates to "Who is it that is making the head (or mind) to wander?" In Yoruba culture, "orí" refers to the head or mind, and "ìrora" means to wander or be distracted. So, this phrase is asking who or what is causing someone's thoughts to stray or their mind to wander.

Out-of-Scope Use

The developed Large Language Model (LLM) capabilities are for research, social good, and internal use only. For commercial use and distribution, organisations/individuals are encouraged to contact Jacaranda Health. To ensure the ethical and responsible use of YorubaLlama, we have outlined a set of guidelines. These guidelines categorize activities and practices into three main areas: prohibited actions, high-risk activities, and deceptive practices. By understanding and adhering to these directives, users can contribute to a safer and more trustworthy environment.

  1. Prohibited Actions:
  • Illegal Activities: Avoid promoting violence, child exploitation, human trafficking, and other crimes.
  • Harassment and Discrimination: No acts that bully, threaten, or discriminate.
  • Unauthorized Professions: No unlicensed professional activities.
  • Data Misuse: Handle personal data with proper consents.
  • Rights Violations: Respect third-party rights.
  • Malware Creation: Avoid creating harmful software.
  2. High-Risk Activities:
  • Dangerous Industries: No usage in military, nuclear, or espionage domains.
  • Weapons and Drugs: Avoid illegal arms or drug activities.
  • Critical Systems: No usage in key infrastructures or transport technologies.
  • Promotion of Harm: Avoid content advocating self-harm or violence.
  3. Deceptive Practices:
  • Misinformation: Refrain from creating/promoting fraudulent or misleading info.
  • Defamation and Spam: Avoid defamatory content and unsolicited messages.
  • Impersonation: No pretending to be someone without authorization.
  • Misrepresentation: No false claims about YorubaLlama outputs.
  • Fake Online Engagement: No promotion of false online interactions.

Bias, Risks, and Limitations

YorubaLlama is a cutting-edge technology with considerable potential, yet it is not without inherent risks. Testing conducted so far has been predominantly in Yoruba and English, leaving many scenarios untested. Consequently, as with other LLMs, YorubaLlama's outputs cannot be fully predicted, and it may occasionally generate responses that are inaccurate, biased, or otherwise objectionable. Before deploying YorubaLlama in any application, developers must therefore carry out safety testing and fine-tuning tailored to their specific use cases.

Contact Us

For any questions, feedback, or commercial inquiries, please reach out at ai@jacarandahealth.org