---
library_name: transformers
license: apache-2.0
datasets:
- RekaAI/VibeEval
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: image-text-to-text
---
# Model Card for hiiamsid/llama-3.2-vision-11B-ROCO
This is a fine-tuned version of meta-llama/Llama-3.2-11B-Vision-Instruct, trained on the MedIR/roco dataset using FSDP on 2 A100 GPUs.
## Model Details
### Model Description
- **Developed by:** hiiamsid
- **Model type:** multimodal (Image/Text to Text)
- **Language(s) (NLP):** multilingual
- **License:** Apache License 2.0
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
## How to Get Started with the Model
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Hub ID of the fine-tuned checkpoint
base_model = "hiiamsid/llama-3.2-vision-11B-ROCO"

processor = AutoProcessor.from_pretrained(base_model)
model = MllamaForConditionalGeneration.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Download the example image
url = "https://lh7-rt.googleusercontent.com/docsz/AD_4nXcz-J3iR2bEGcCSLzay07Rqfj5tTakp2EMTTN0x6nKYGLS5yWl0unoSpj2S0-mrWpDtMqjl1fAgH6pVkKJekQEY_kwzL6QNOdf143Yt66znQ0EpfLvx6CLFOqw41oeOYmhPZ6Qrlb5AjEr4AenIOgBMTWTD?key=vhLUYntaS9QOx531XpJH3g"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn: an image placeholder followed by the text prompt
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the tutorial feature image."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# The chat template already inserts the BOS token, so do not add special tokens again
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0]))
```
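The snippet above decodes `output[0]` as a whole, and `generate` returns the prompt tokens followed by the newly generated ones, so the printed text includes the prompt. A minimal sketch of trimming the prompt is shown below on plain Python lists; the helper name `strip_prompt_tokens` and the token ids are hypothetical and only illustrate the slicing.

```python
from typing import List, Sequence


def strip_prompt_tokens(sequence: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the tokens that follow the first `prompt_length` tokens."""
    if prompt_length < 0 or prompt_length > len(sequence):
        raise ValueError("prompt_length must be between 0 and len(sequence)")
    return list(sequence[prompt_length:])


# Toy example with made-up token ids: 4 prompt tokens followed by 3 generated ones.
prompt_ids = [128000, 128006, 882, 128007]
generated = prompt_ids + [791, 2217, 128009]

new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)  # [791, 2217, 128009]
```

With the real objects, the same idea would be `output[0][inputs["input_ids"].shape[-1]:]`, which can then be passed to `processor.decode(..., skip_special_tokens=True)` to drop the chat-template markers as well.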
## Training Details
### Training Data
MedIR/roco: https://huggingface.co/datasets/MedIR/roco (only 1000 samples were used for training)
### Training Procedure
- Trained using FSDP with a wrapping policy, a mixed-precision policy (bfloat16), activation checkpointing, and related options. The checkpoint was saved with state dict type `FULL_STATE_DICT`.
#### Training Hyperparameters
```python
from dataclasses import dataclass

from torch.distributed.fsdp import ShardingStrategy, StateDictType


@dataclass
class train_config:
    model_name: str = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    batch_size_training: int = 8
    batching_strategy: str = "padding"  # alternative is packing, but the vision model doesn't work with packing
    context_length: int = 4096
    gradient_accumulation_steps: int = 1
    num_epochs: int = 3
    lr: float = 1e-5
    weight_decay: float = 0.0
    gamma: float = 0.85  # multiplicatively decay the learning rate by gamma after each epoch
    seed: int = 42
    use_fp16: bool = False
    mixed_precision: bool = True
    val_batch_size: int = 1
    use_peft: bool = False
    output_dir: str = "workspace/models"
    enable_fsdp: bool = True
    dist_checkpoint_root_folder: str = "workspace/FSDP/model"  # used only with FSDP
    dist_checkpoint_folder: str = "fine-tuned"  # used only with FSDP
    save_optimizer: bool = False  # used only with FSDP


@dataclass
class fsdp_config:
    mixed_precision: bool = True
    use_fp16: bool = False
    # Alternatives: HYBRID_SHARD (full shard within a node, DDP across nodes),
    # SHARD_GRAD_OP (shard only gradients and optimizer states), NO_SHARD (similar to DDP).
    sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
    # Requires HYBRID_SHARD. Extends HYBRID_SHARD by sharding the model over a custom
    # number of GPUs (sharding group) and replicating across sharding groups.
    hsdp: bool = False
    sharding_group_size: int = 0  # requires hsdp; number of GPUs your model fits into to form one replica
    replica_group_size: int = 0  # requires hsdp; equals world_size / sharding_group_size
    # Alternatively SHARDED_STATE_DICT can be used: it saves one file with sharded weights per rank,
    # while FULL_STATE_DICT collects all weights on rank 0 and saves them in a single file.
    checkpoint_type: StateDictType = StateDictType.FULL_STATE_DICT
    fsdp_activation_checkpointing: bool = True
    fsdp_cpu_offload: bool = False
    pure_bf16: bool = True
    optimizer: str = "AdamW"
```
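To make the schedule implied by these values concrete, here is a rough sketch of the per-epoch learning rate and the number of optimizer steps per epoch. It rests on several assumptions that the card does not state: that `batch_size_training` is a per-GPU batch size, that all 1000 samples went into the training split, that the last partial batch is kept, and that the decay is applied once per epoch (as a scheduler such as PyTorch's `StepLR` with `step_size=1` would do). The function names are illustrative only.

```python
import math
from typing import List


def lr_per_epoch(base_lr: float, gamma: float, num_epochs: int) -> List[float]:
    """Learning rate used in each epoch when it is multiplied by gamma after every epoch."""
    return [base_lr * gamma ** epoch for epoch in range(num_epochs)]


def steps_per_epoch(num_samples: int, per_gpu_batch: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Optimizer steps per epoch, assuming the last partial batch is kept."""
    effective_batch = per_gpu_batch * num_gpus * grad_accum
    return math.ceil(num_samples / effective_batch)


# Values from train_config plus the assumed 1000 training samples on 2 GPUs.
schedule = lr_per_epoch(base_lr=1e-5, gamma=0.85, num_epochs=3)
steps = steps_per_epoch(num_samples=1000, per_gpu_batch=8, num_gpus=2, grad_accum=1)

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4e}")
print(f"steps per epoch: {steps}")
```

Under these assumptions the learning rate goes from 1e-5 to about 8.5e-6 and then about 7.2e-6, with 63 optimizer steps per epoch, so the whole run is on the order of 190 steps.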
### Model Architecture and Objective
This model was trained as an experiment to see how much improvement fine-tuning Llama 3.2 Vision can yield.
### Compute Infrastructure
Trained on 2 A100 (80GB) GPUs from RunPod.
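As a back-of-the-envelope check on why two 80GB GPUs can be enough for full fine-tuning with `FULL_SHARD`, the sketch below estimates the sharded training state per GPU. It is only an estimate under stated assumptions: a nominal 11e9 parameters (taken from the model name), bfloat16 weights and gradients (2 bytes each), and two AdamW moment buffers also held in bfloat16 because `pure_bf16` is set. Activations, the vision inputs, and framework overhead are not counted.

```python
def sharded_state_gib(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """Approximate per-GPU memory (GiB) for state that is sharded evenly across GPUs."""
    return num_params * bytes_per_param / num_gpus / 2**30


# Assumed byte budget per parameter: weights + gradients + two AdamW moments, all bfloat16.
BYTES_WEIGHTS = 2
BYTES_GRADS = 2
BYTES_ADAMW_STATES = 2 * 2
bytes_per_param = BYTES_WEIGHTS + BYTES_GRADS + BYTES_ADAMW_STATES

NOMINAL_PARAMS = 11e9  # assumption: nominal size implied by "11B"
per_gpu = sharded_state_gib(NOMINAL_PARAMS, bytes_per_param, num_gpus=2)
print(f"approx. sharded state per GPU: {per_gpu:.1f} GiB")
```

Under these assumptions the sharded state takes roughly half of each 80GB card, leaving the remainder for activations, which activation checkpointing helps keep small.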
## Citation
https://github.com/meta-llama/llama-recipes