	---
	language:
	- id
	- en
	base_model:
	- meta-llama/Llama-3.2-11B-Vision-Instruct
	- openai/whisper-large
	tags:
	- multimodal
	- indonesian
	- english
	- vision
	- audio
	- text
	---

	# LaBahasa 11B

	## Model Information
	LaBahasa 11B is a multimodal LLM that processes text, audio, and images. Built on OpenAI's Whisper and Meta's Llama 3.2 architectures, it is optimized for Indonesian language understanding while retaining English capability. The model was trained on a 9-billion-scale, high-quality bilingual dataset of Indonesian and English speech and text data.

	Model Architecture: LaBahasa 11B uses a feed-forward network to project audio embeddings from the Whisper Large encoder into Llama's input embedding space. The projected audio embeddings are combined with image and text inputs to enable multimodal understanding.
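
To make the projection step concrete, here is a minimal, dependency-free sketch of a feed-forward projector. It is not the model's actual implementation: the two-layer shape, the GELU activation, the toy dimensions, and every name (`make_linear`, `project_audio`, `AUDIO_DIM`, `LLM_DIM`) are assumptions chosen for illustration only.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gelu(x):
    """Exact GELU, applied element-wise (assumed activation)."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def project_audio(frames, hidden_layer, output_layer):
    """Map each audio-encoder frame to a vector of the LLM's embedding width."""
    return [linear(gelu(linear(frame, hidden_layer)), output_layer)
            for frame in frames]


# Toy sizes only; the real encoder and LLM widths are much larger.
AUDIO_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12

rng = random.Random(0)
hidden_layer = make_linear(AUDIO_DIM, HIDDEN_DIM, rng)
output_layer = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Five stand-in audio frames, as an audio encoder might emit.
frames = [[rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)] for _ in range(5)]
projected = project_audio(frames, hidden_layer, output_layer)

print(len(projected), len(projected[0]))  # 5 12
```

The point of the sketch is the shape change: the number of audio frames is preserved, while each frame is mapped to the width the language model expects for its input embeddings.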

	Model Developer: Bahasa AI and LintasArta

	## Intended Use
	This model is intended for NLP tasks that require understanding text, audio, or image inputs and generating Indonesian-language output.

	## Usage

	### Installation
	```bash
	pip install --upgrade pip
	pip install --upgrade transformers torch accelerate librosa pillow requests
	```

	### Use with Transformers
	For audio input, LaBahasa 11B uses a special placeholder token, `<|audio|>`, which is then replaced with the projected audio embedding.

	```python
	import librosa
	import requests
	import torch
	import transformers
	from PIL import Image

	# Load the model and processor; trust_remote_code loads the repository's custom model code.
	model = transformers.AutoModel.from_pretrained(
	    'LABahasa/llama-labahasa-11B',
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	    device_map='cuda',
	)
	processor = transformers.AutoProcessor.from_pretrained(
	    'LABahasa/llama-labahasa-11B',
	    trust_remote_code=True,
	)

	# Example with all modalities
	url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
	image = Image.open(requests.get(url, stream=True).raw)
	audio_path = "deskripsi.mp3"
	# Load the audio at 16 kHz so it matches the sampling_rate passed to the processor below.
	audio, _ = librosa.load(audio_path, sr=16000)

	messages = [
	    {
	        'role': 'system',
	        'content': 'You are a helpful AI assistant.'
	    },
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            # <|audio|> marks where the projected audio embedding is inserted.
	            {"type": "text", "text": "\n<|audio|>\n"},
	        ],
	    }
	]

	# Render the chat messages into a single prompt string.
	input_text = processor.tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, tokenize=False
	)

	inputs = processor(
	    images=image,
	    text=input_text,
	    audio=audio,
	    return_tensors="pt",
	    sampling_rate=16000,
	).to(model.device)

	# Decode only the newly generated tokens, skipping the prompt.
	input_len = inputs.input_ids.shape[1]
	outputs = model.generate(**inputs, max_new_tokens=100)
	print(processor.decode(outputs[0][input_len:]))
	```
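
The placeholder substitution mentioned above can be pictured as a splice over the embedded input sequence. The sketch below is a simplified illustration, not the model's code: `AUDIO_TOKEN_ID`, `splice_audio`, and the tiny 4-dimensional embeddings are all hypothetical.

```python
AUDIO_TOKEN_ID = 99  # hypothetical id standing in for the <|audio|> token


def splice_audio(token_ids, token_embeddings, audio_embeddings,
                 audio_token_id=AUDIO_TOKEN_ID):
    """Replace the placeholder's embedding with the projected audio frames."""
    merged = []
    for token_id, embedding in zip(token_ids, token_embeddings):
        if token_id == audio_token_id:
            # One placeholder expands into every projected audio frame.
            merged.extend(audio_embeddings)
        else:
            merged.append(embedding)
    return merged


# Four prompt tokens, one of which is the audio placeholder.
token_ids = [1, 5, AUDIO_TOKEN_ID, 7]
token_embeddings = [[float(t)] * 4 for t in token_ids]

# Three projected audio frames, already at the LLM embedding width.
audio_embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]

merged = splice_audio(token_ids, token_embeddings, audio_embeddings)
print(len(merged))  # 6
```

The single placeholder position grows into as many positions as there are audio frames, so the sequence the language model sees is longer than the tokenized prompt.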

	## Evaluation
	| Metric | Qwen2.5-14B | llama-labahasa-11B |
	|-------------------|-------------|----------------------|
	| MMLU | 66.3 | 67.2 |
	| Multi-Mathematics | 63.7 | 64.5 |
	| MMMU | 68.2 | 68.2 |
	| id-MMLU | 63.1 | 72.2 |
	## Training Details
	Training regime: BF16 mixed precision training

	Training Infrastructure: 8xH100 GPUs

	Training Time: 25 hours

	### Training Data
	LaBahasa 11B was trained on a 9-billion-scale, high-quality bilingual dataset of Indonesian and English speech and text data.

	### Training Procedure
	LaBahasa 11B was trained with a customized training methodology designed to enhance:
	* Image input processing capabilities through integration with Llama 3.2's vision features
	* Indonesian language understanding and generation