Model Card for Bodo-LLaMA3

Model Details

Fine-Tuned LLaMA 3 Model for Bodo Language

This repository contains a fine-tuned LLaMA 3 model that understands and processes the Bodo language. The fine-tuning was performed on datasets curated specifically for Bodo, including a Bodo word dictionary and a set of grammar rules. These datasets were created by the repository owner and are available on Hugging Face as the Bodo Language Dataset.

Model Description

This model is a fine-tuned version of LLaMA 3, a state-of-the-art transformer model, trained with the Unsloth fine-tuning framework. It has been specifically adapted to understand, process, and generate content in the Bodo language.

Key Details:

  • Developed by: Ayush Sisodiya
  • Dataset: Bodo Language Dataset
  • Language: Bodo
  • Fine-tuned from: LLaMA 3
  • License: Apache 2.0

Uses

Direct Use

This model can be used for:

  • Translating text to and from Bodo (see the example after this list).
  • Understanding and generating grammatically correct Bodo sentences.
  • Supporting linguistic research on the Bodo language.
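
For example, a translation request might look like the sketch below. The prompt wording is an assumption; the card does not document the instruction template used during fine-tuning, so adjust it as needed.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AyushSisodiya/Bodo-LLaMA3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical translation prompt -- replace with whatever template
# the model was actually fine-tuned with.
prompt = "Translate the following English sentence into Bodo:\nHow are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))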

Downstream Use

This model can be integrated into applications such as:

  • Language learning tools for Bodo.
  • Chatbots or virtual assistants designed for Bodo speakers (a minimal integration sketch follows this list).
  • Documentation or media translation into the Bodo language.
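
As an illustration of the chatbot use case, the model can be wrapped in the transformers pipeline API. This is a minimal sketch; the input string is a placeholder standing in for real Bodo user input.

from transformers import pipeline

# Wrap the model in a text-generation pipeline for easy integration.
generator = pipeline("text-generation", model="AyushSisodiya/Bodo-LLaMA3")

# Placeholder message -- a real application would pass Bodo user input.
reply = generator("A user message in Bodo goes here.", max_new_tokens=64)
print(reply[0]["generated_text"])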

Out-of-Scope Use

The model is not suitable for:

  • Applications requiring high accuracy in domains beyond the training data (e.g., scientific or technical Bodo content).
  • Generating biased, harmful, or inappropriate content.

Bias, Risks, and Limitations

While the model performs well in general Bodo language tasks, it has the following limitations:

  • Biases: The model inherits biases present in the training data.
  • Limited Scope: Performance may degrade for niche or highly technical Bodo vocabulary.
  • Language Nuances: Certain cultural or linguistic subtleties might not be perfectly captured.

Recommendations

Users should:

  • Evaluate the model on their specific use cases.
  • Avoid using the model for applications requiring complete linguistic precision.
  • Consider additional fine-tuning if the model is to be used in specialized domains.

How to Get Started

Here is an example of how to use the model in your Python code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "AyushSisodiya/Bodo-LLaMA3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example input
input_text = "Bodo example sentence."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate output
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
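
By default, generate uses greedy decoding and a short output budget; for longer or more varied text you can pass sampling parameters. Continuing the example above, with illustrative values that are not tuned for this model:

# Sampling-based generation, reusing `model`, `tokenizer`, and `inputs`
# from the snippet above.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,  # allow a longer completion
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,     # illustrative value, not tuned for this model
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))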

Training Details

Training Data

The model was fine-tuned using the Bodo Language Dataset. This dataset includes:

  • A comprehensive Bodo word dictionary (see the serialization sketch after this list).
  • Detailed Bodo grammar rules.
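
The card does not specify how these resources were turned into training text. The sketch below shows one plausible instruction-style serialization of a dictionary entry; the field names and format are invented for illustration.

# Hypothetical record layout -- the real dataset schema may differ.
entry = {
    "bodo_word": "...",      # placeholder
    "english_gloss": "...",  # placeholder
}

def to_training_text(e: dict) -> str:
    # One plausible instruction-style rendering of a dictionary entry.
    return (
        f"### Instruction:\nDefine the Bodo word '{e['bodo_word']}'.\n"
        f"### Response:\n{e['english_gloss']}"
    )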

Training Procedure

The model was fine-tuned using the Unsloth framework.
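
Since the card does not record the hyperparameters, the sketch below only illustrates what a typical Unsloth LoRA fine-tune looks like; the base checkpoint, dataset repo id, text field name, and all training arguments are assumptions.

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Assumed base checkpoint and settings -- not the card's actual values.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# Hypothetical dataset repo id and field name.
dataset = load_dataset("AyushSisodiya/bodo-language-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="bodo-llama3-finetune",
    ),
)
trainer.train()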

Evaluation

Testing Data and Metrics

The model was evaluated using a subset of the Bodo Language Dataset, with metrics such as:

  • Perplexity for language modeling.
  • BLEU Score for translation tasks.
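
A rough sketch of how both metrics can be computed: perplexity as the exponentiated cross-entropy of the model on held-out text, and BLEU via sacrebleu from the evaluate library. The example strings are placeholders.

import math

import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AyushSisodiya/Bodo-LLaMA3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Perplexity: exponentiated average cross-entropy on held-out text.
text = "A held-out Bodo sentence goes here."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print("perplexity:", math.exp(loss.item()))

# BLEU: model translations scored against reference translations.
bleu = evaluate.load("sacrebleu")
score = bleu.compute(
    predictions=["model translation goes here"],
    references=[["reference translation goes here"]],
)
print("BLEU:", score["score"])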

Results

  • Perplexity: 12.4
  • BLEU Score: 35.6

Environmental Impact

  • Hardware: NVIDIA A100 GPUs
  • Training Time: ~12 hours
  • Carbon Emission Estimate: ~6.5 kg CO2eq
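
The card does not say how the emission figure was derived. A common back-of-the-envelope method multiplies GPU-hours by average power draw and grid carbon intensity; under the assumed values below it lands near the reported ~6.5 kg.

# Every value here is an assumption for illustration, not a measurement.
gpu_hours = 12 * 4       # 12 hours on 4 A100s (GPU count assumed)
avg_power_kw = 0.3       # assumed average draw per GPU, in kW
grid_kg_per_kwh = 0.45   # assumed grid carbon intensity, kg CO2eq/kWh

print(gpu_hours * avg_power_kw * grid_kg_per_kwh, "kg CO2eq")  # ~6.5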

Citation

If you use this model in your work, please cite it as follows:

@misc{bodo_llama3,
  author = {Ayush Sisodiya},
  title = {Fine-Tuned LLaMA 3 Model for Bodo Language},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AyushSisodiya/BODOAI}
}

Contact

For questions or issues, please reach out via the repository's Issues tab or email Ayush Sisodiya.


Thank you for exploring the fine-tuned LLaMA 3 model for Bodo! Feel free to contribute or provide feedback.
