---
language:
- id
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
- openai/whisper-large
tags:
- multimodal
- indonesian
- english
- vision
- audio
- text
---

# LaBahasa 11B

## Model Information
LaBahasa 11B is a multimodal LLM that combines text, audio, and image processing capabilities. Built on OpenAI's Whisper and Meta's Llama 3.2 architectures, the model has been specifically optimized for Indonesian language understanding while retaining English capability. It was trained on a high-quality bilingual dataset of 9 billion Indonesian and English speech and text samples.

**Model Architecture**: LaBahasa 11B uses a feed-forward network to project audio embeddings from the Whisper Large encoder into Llama's input embedding space; the projected audio embeddings are combined with image and text inputs to enable multimodal understanding.
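
The projection described above can be pictured as a small module that maps Whisper encoder states to pseudo-token embeddings, which Llama consumes in place of the `<|audio|>` placeholder. The sketch below is illustrative only: the layer count, activation, and hidden sizes (1280 for the Whisper Large encoder, 4096 for the Llama text backbone) are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

WHISPER_HIDDEN = 1280  # Whisper Large encoder hidden size (assumed)
LLAMA_HIDDEN = 4096    # Llama 3.2 text backbone embedding size (assumed)

class AudioProjector(nn.Module):
    """Feed-forward projector from audio encoder states to text embedding space."""

    def __init__(self, audio_dim: int = WHISPER_HIDDEN, text_dim: int = LLAMA_HIDDEN):
        super().__init__()
        # A two-layer MLP is one plausible choice; the card only says "feed-forward network".
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, n_audio_frames, audio_dim) from the Whisper encoder
        # returns:      (batch, n_audio_frames, text_dim), spliced into the input
        #               embeddings at the position of the <|audio|> placeholder
        return self.proj(audio_states)

# A 30-second clip yields 1500 Whisper encoder frames, i.e. 1500 audio "tokens" for Llama.
dummy = torch.randn(1, 1500, WHISPER_HIDDEN)
print(AudioProjector()(dummy).shape)  # torch.Size([1, 1500, 4096])
```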

**Model Developers**: Bahasa AI and LintasArta

## Intended Use
This model is intended for NLP tasks that require text, audio, and image understanding and Indonesian language generation.

## Usage

### Installation
```bash
pip install --upgrade pip
pip install --upgrade transformers
```

### Use with Transformers
For audio input, LaBahasa 11B uses a special placeholder token `<|audio|>`, which is then replaced with the projected audio embedding.

```python
import librosa
import requests
import torch
import transformers
from PIL import Image

# Load the model and its multimodal processor (custom code shipped with the repo)
model = transformers.AutoModel.from_pretrained(
    'LABahasa/llama-labahasa-11B',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='cuda',
)
processor = transformers.AutoProcessor.from_pretrained(
    'LABahasa/llama-labahasa-11B',
    trust_remote_code=True,
)

# Example with all modalities: an image, an audio clip, and a text prompt
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)
audio_path = "deskripsi.mp3"
# Load the audio at 16 kHz to match the sampling_rate passed to the processor
audio, _ = librosa.load(audio_path, sr=16000)

messages = [
    {
        'role': 'system',
        'content': 'You are a helpful AI assistant.'
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "\n<|audio|>\n"},
        ],
    }
]

# Render the chat template to a prompt string without tokenizing
input_text = processor.tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = processor(
    images=image,
    text=input_text,
    audio=audio,
    return_tensors="pt",
    sampling_rate=16000,
).to(model.device)

# Generate and decode only the newly produced tokens
input_len = inputs.input_ids.shape[1]
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0][input_len:]))
```
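
The example above uses greedy decoding with a 100-token budget. Standard sampling parameters can also be passed to `generate()`; the snippet below reuses `model`, `processor`, and `inputs` from the example above, and the `temperature` and `top_p` values are illustrative choices, not recommendations from the model authors.

```python
# Sampling-based decoding; reuses `model`, `processor`, and `inputs` from above.
# temperature/top_p values are illustrative, not tuned recommendations.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(processor.decode(outputs[0][inputs.input_ids.shape[1]:]))
```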
89
+
90
+ ## Evaluation
91
+ | Metric | Qwen2.5-14B | llama-labahasa-11B |
92
+ |-------------------|-------------|----------------------|
93
+ | MMLU | 66.3 | **67.2** |
94
+ | Multi-Mathematics | 63.7 | **64.5** |
95
+ | MMMU | 68.2 | **68.2** |
96
+ | id-MMLU | 63.1 | **72.2** |
97
+
98
+ ## Training Details
99
+ **Training regime**: BF16 mixed precision training
100
+
101
+ **Training Infrastructure**: 8xH100 GPUs
102
+
103
+ **Training Time**: 25 hours
104
+
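The BF16 regime noted above is standard bfloat16 mixed-precision training. As a point of reference, the minimal sketch below shows how such a setup is typically expressed with the Hugging Face `TrainingArguments`; it is not the actual LaBahasa training configuration, and every value is a placeholder assumption.

```python
# Minimal sketch of a BF16 mixed-precision setup with the Hugging Face Trainer API.
# This is NOT the LaBahasa training code; output_dir, batch sizes, and learning
# rate are placeholder assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="labahasa-finetune",    # hypothetical output directory
    bf16=True,                         # BF16 mixed precision, as stated in the card
    per_device_train_batch_size=4,     # placeholder; real values are not published
    gradient_accumulation_steps=8,     # placeholder
    learning_rate=1e-5,                # placeholder
    num_train_epochs=1,                # placeholder
    logging_steps=50,
)
# These arguments would then be passed to transformers.Trainer together with the
# model, processor/tokenizer, and dataset, and launched across the 8 GPUs with
# torchrun or accelerate.
```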
105
+ ### Training Data
106
+ LaBahasa 11B was trained on an extensive 9 billion high quality bilingual dataset comprising Indonesian and English speech and text data.
107
+
108
+ ### Training Procedure
109
+ LaBahasa 11B was trained on customized training methodology modifications to enhance:
110
+ * Image input processing capabilities through integration with Llama 3.2's vision features
111
+ * Indonesian language understanding and generation