---
library_name: transformers
license: apache-2.0
datasets:
- RekaAI/VibeEval
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: image-text-to-text
---

# Model Card for hiiamsid/llama-3.2-vision-11B-ROCO

This is a fine-tuned version of meta-llama/Llama-3.2-11B-Vision-Instruct, trained on the MedIR/roco dataset using FSDP on two A100 GPUs.



## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->


- **Developed by:** hiiamsid
- **Model type:** multimodal (Image/Text to Text)
- **Language(s) (NLP):** multilingual
- **License:** Apache License 2.0
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct


## How to Get Started with the Model
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "hiiamsid/llama-3.2-vision-11B-ROCO"

# The processor bundles the image preprocessor and the tokenizer.
processor = AutoProcessor.from_pretrained(model_id)

# Load the weights in bfloat16 and let accelerate place them on the available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

url = "https://lh7-rt.googleusercontent.com/docsz/AD_4nXcz-J3iR2bEGcCSLzay07Rqfj5tTakp2EMTTN0x6nKYGLS5yWl0unoSpj2S0-mrWpDtMqjl1fAgH6pVkKJekQEY_kwzL6QNOdf143Yt66znQ0EpfLvx6CLFOqw41oeOYmhPZ6Qrlb5AjEr4AenIOgBMTWTD?key=vhLUYntaS9QOx531XpJH3g"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn containing an image placeholder followed by the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the tutorial feature image."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# The chat template already inserts the begin-of-text token, so do not add it again.
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0]))
```
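
The snippet above prints the prompt together with the answer, because `generate` returns the input tokens followed by the newly generated ones. A minimal sketch of keeping only the continuation is shown below; the helper name `new_tokens_only` is hypothetical and not part of `transformers`.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt and return only the generated tokens.

    output_ids: one sequence returned by generate (tensor or list).
    prompt_length: number of tokens in the prompt that was passed in.
    """
    return output_ids[prompt_length:]
```

With the variables from the example above, this could be used as `processor.decode(new_tokens_only(output[0], inputs["input_ids"].shape[-1]), skip_special_tokens=True)`, which also hides special tokens in the printed text.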

## Training Details

### Training Data
MedIR/roco: https://huggingface.co/datasets/MedIR/roco (only 1,000 samples were used for training)

### Training Procedure

- Trained with FSDP using an auto-wrapping policy, a bfloat16 mixed-precision policy, and activation checkpointing, among other settings. Checkpoints were saved with the `FULL_STATE_DICT` state-dict type.
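
As a rough illustration of the pieces named above, the sketch below builds a transformer auto-wrap policy and a bfloat16 mixed-precision policy with PyTorch's FSDP API, and gathers a full state dict on rank 0. It is not the training script used for this model: the function names `build_fsdp_kwargs` and `save_full_state_dict` are hypothetical, and the layer classes to wrap are left as a parameter because the card does not list them.

```python
import functools

import torch
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    StateDictType,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def build_fsdp_kwargs(layer_classes):
    """Return keyword arguments for FSDP(...) (hypothetical helper).

    layer_classes: the transformer block classes to wrap as FSDP units.
    """
    # Wrap each listed transformer block in its own FSDP unit.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls=set(layer_classes),
    )
    # Keep parameters, gradient reduction, and buffers in bfloat16.
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return {"auto_wrap_policy": wrap_policy, "mixed_precision": bf16_policy}


def save_full_state_dict(model, path, rank):
    """Gather all shards on rank 0 and save them as a single file."""
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if rank == 0:
        torch.save(state, path)
```

In a distributed run, the returned dictionary would be passed as `FSDP(model, **build_fsdp_kwargs(...))` after the process group is initialized. Activation checkpointing is applied separately and is omitted here to keep the sketch short.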

#### Training Hyperparameters

```python
from dataclasses import dataclass

from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType


@dataclass
class train_config:
    model_name: str = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    batch_size_training: int = 8
    batching_strategy: str = "padding"  # alternative is packing, but the vision model doesn't work with packing.
    context_length: int = 4096
    gradient_accumulation_steps: int = 1
    num_epochs: int = 3
    lr: float = 1e-5
    weight_decay: float = 0.0
    gamma: float = 0.85  # multiplicatively decay the learning rate by gamma after each epoch
    seed: int = 42
    use_fp16: bool = False
    mixed_precision: bool = True
    val_batch_size: int = 1
    use_peft: bool = False
    output_dir: str = "workspace/models"
    enable_fsdp: bool = True
    dist_checkpoint_root_folder: str = "workspace/FSDP/model"  # will be used if using FSDP
    dist_checkpoint_folder: str = "fine-tuned"  # will be used if using FSDP
    save_optimizer: bool = False  # will be used if using FSDP


@dataclass
class fsdp_config:
    mixed_precision: bool = True
    use_fp16: bool = False
    sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD  # HYBRID_SHARD "Full Shard within a node DDP cross Nodes", SHARD_GRAD_OP "Shard only Gradients and Optimizer States", NO_SHARD "Similar to DDP".
    hsdp: bool = False  # Requires HYBRID_SHARD to be set. This flag can extend HYBRID_SHARD by allowing sharding a model on a customized number of GPUs (Sharding_group) and Replicas over Sharding_group.
    sharding_group_size: int = 0  # requires hsdp to be set. This specifies the sharding group size, the number of GPUs that your model can fit into to form a replica of a model.
    replica_group_size: int = 0  # requires hsdp to be set. This specifies the replica group size, which is world_size/sharding_group_size.
    checkpoint_type: StateDictType = StateDictType.FULL_STATE_DICT  # alternatively SHARDED_STATE_DICT can be used. SHARDED_STATE_DICT saves one file with sharded weights per rank while FULL_STATE_DICT will collect all weights on rank 0 and save them in a single file.
    fsdp_activation_checkpointing: bool = True
    fsdp_cpu_offload: bool = False
    pure_bf16: bool = True
    optimizer: str = "AdamW"
```
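
To make the `lr`, `gamma`, and batch-size settings concrete, the following sketch works out the learning rate in each epoch and the number of optimizer steps per epoch. It rests on assumptions not stated in this card: that `batch_size_training` is per GPU, that all 1,000 samples went into the training split, and that the last partial batch is kept. The function names are hypothetical.

```python
import math


def lr_per_epoch(base_lr, gamma, num_epochs):
    """Learning rate used in each epoch when decayed by gamma after every epoch."""
    return [base_lr * gamma ** epoch for epoch in range(num_epochs)]


def steps_per_epoch(num_samples, per_device_batch, num_devices, grad_accum):
    """Optimizer steps per epoch, keeping the last partial batch (assumption)."""
    effective_batch = per_device_batch * num_devices * grad_accum
    return math.ceil(num_samples / effective_batch)


# Values taken from the config above; 1000 samples and 2 GPUs come from this card.
rates = lr_per_epoch(base_lr=1e-5, gamma=0.85, num_epochs=3)
steps = steps_per_epoch(num_samples=1000, per_device_batch=8, num_devices=2, grad_accum=1)
print(rates, steps)
```

Under these assumptions the learning rate falls from 1e-5 to roughly 7.2e-6 by the third epoch, and each epoch takes 63 optimizer steps at an effective batch size of 16.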

### Model Architecture and Objective
This model was trained as an experiment to see how much improvement fine-tuning Llama 3.2 Vision can yield.

### Compute Infrastructure
Trained on 2 A100 (80 GB) GPUs rented from RunPod.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
https://github.com/meta-llama/llama-recipes 