Tars97 committed on
Commit
885c7fd
1 Parent(s): 5ca79fd

Update README.md

Files changed (1)
  1. README.md +132 -3
README.md CHANGED

---
license: apache-2.0
---
<div align="center">
<img src="assets/logo.png" alt="HawkLlama" width="200"/>

# HawkLlama
[🤗**Huggingface**](https://huggingface.co/AIM-ZJU/HawkLlama_8b) | [🗂️**Github**](https://github.com/aim-uofa/VLModel) | [📖**Technical Report**](assets/technical_report.pdf)

Zhejiang University, China

</div>


This is the official implementation of HawkLlama, an open-source multimodal large language model designed for real-world vision and language understanding applications. Our model has the following highlights:

1. HawkLlama-8B is constructed using:
   - [Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), the latest open-source large language model, trained on over 15 trillion tokens.
   - [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384), an enhancement of CLIP that uses a sigmoid loss and achieves superior performance in image recognition.
   - An efficient vision-language connector, designed to capture high-resolution details without increasing the number of visual tokens, which reduces the training overhead associated with high-resolution images.

2. For training, we use the [Llava-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) dataset for pretraining, and a mixed dataset curated specifically for instruction tuning, containing both multimodal and language-only data, for supervised fine-tuning.

3. HawkLlama-8B is developed on the [NeMo](https://github.com/NVIDIA/NeMo.git) framework, which facilitates 3D parallelism and offers scalability for future extensions.

Our model is open-source and reproducible. Please check our [technical report](assets/technical_report.pdf) for more details.
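
For a quick look at how these components fit together in the released checkpoint, the sketch below simply loads the model and prints its module tree and config, reusing the loading code from the inference example further down. The exact submodule names in the printout depend on the implementation and are not specified by this README.

```Python
# Sketch: inspect the released checkpoint. Printing the model lists its
# submodules (the SigLIP-based vision encoder, the vision-language connector,
# and the Llama3-8B language model); printing the config shows their
# hyper-parameters. Exact attribute/module names depend on the implementation.
import torch
from HawkLlama.model import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "AIM-ZJU/HawkLlama_8b", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)
print(model)         # module tree: vision encoder, connector, language model
print(model.config)  # combined vision/text configuration
```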


<!-- ## News

[04/30] Llama3-LaMMly-8B is released, trained on a larger dataset, supporting higher resolution images, and also supporting Llama3 as the backbone. For LaMMly, we constructed a multimodal dataset containing 2.6M SFT sample, ensuring that LaMMly can achieve better generalization and improved image understanding. For more details, please refer to our [blog] and [technical report]. -->

## Contents
- [Setup](#setup)
- [Model Weights](#model-weights)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Demo](#demo)


## Setup

1. Create the environment and activate it.
```Shell
conda create -n hawkllama python=3.10 -y
conda activate hawkllama
```

2. Clone and install this repo.
```Shell
git clone https://github.com/aim-uofa/VLModel.git
cd VLModel
pip install -e .
pip install -e third_party/VLMEvalKit
```
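
If the editable install succeeded, the package should be importable. A minimal sanity check, assuming the package is exposed as `HawkLlama` as in the inference example below:

```Python
# Minimal post-install check: these imports mirror the inference example in
# this README. If they fail, the editable install above did not succeed.
from HawkLlama.model import LlavaNextProcessor, LlavaNextForConditionalGeneration
from HawkLlama.utils.conversation import conv_llava_llama_3

print("HawkLlama imports OK")
```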

## Model Weights

Please refer to our [HuggingFace repository](https://huggingface.co/AIM-ZJU/HawkLlama_8b) to download the pretrained model weights.
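
The weights can also be fetched programmatically. The sketch below uses `huggingface_hub.snapshot_download` to mirror the Hub repository into a local folder; the `local_dir` path is only an example, and `from_pretrained` in the inference code can also take the repo id directly and rely on the default cache.

```Python
# Download the HawkLlama-8B checkpoint from the Hugging Face Hub.
# `local_dir` is an arbitrary example path.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="AIM-ZJU/HawkLlama_8b", local_dir="checkpoints/HawkLlama_8b")
```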

## Inference

We provide example code for inference.

```Python
import torch
from PIL import Image
from HawkLlama.model import LlavaNextProcessor, LlavaNextForConditionalGeneration
from HawkLlama.utils.conversation import conv_llava_llama_3, DEFAULT_IMAGE_TOKEN

# Load the processor and the model, and move the model to GPU.
processor = LlavaNextProcessor.from_pretrained("AIM-ZJU/HawkLlama_8b")

model = LlavaNextForConditionalGeneration.from_pretrained("AIM-ZJU/HawkLlama_8b", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
model.to("cuda:0")

# Load the input image.
image_file = "assets/coin.png"
image = Image.open(image_file).convert('RGB')

# Prepend the image placeholder token to the user question.
prompt = "what coin is that?"
prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt

# Build the Llama-3 chat prompt from the conversation template.
conversation = conv_llava_llama_3.copy()
user_role_ind = 0
bot_role_ind = 1
conversation.append_message(conversation.roles[user_role_ind], prompt)
conversation.append_message(conversation.roles[bot_role_ind], "")
prompt = conversation.get_prompt()

# Tokenize the prompt, preprocess the image, and run greedy decoding.
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
output = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, max_new_tokens=2048, do_sample=False, use_cache=True)

print(processor.decode(output[0], skip_special_tokens=True))
```
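
Note that decoding `output[0]` directly prints the prompt together with the answer, since `generate` typically returns the input tokens followed by the newly generated ones (the standard `transformers` convention, which this model appears to follow). To print only the model's reply, the example above can be extended like this:

```Python
# Continues the example above: skip the prompt tokens and decode only the
# newly generated part of the sequence (assumes the standard `transformers`
# convention that `generate` returns prompt tokens followed by new tokens).
answer_ids = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(answer_ids, skip_special_tokens=True))
```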

## Evaluation

Evaluation is based on a modified version of the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) codebase, bundled under `third_party/VLMEvalKit`.

```bash
# single GPU
python third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose
# multiple GPUs
torchrun --nproc-per-node=8 third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose
```

The results are shown below:

| Benchmark      | HawkLlama-8B (ours) | LLaVA-Llama3-v1.1 | LLaVA-Next |
|----------------|---------------------|-------------------|------------|
| MMMU val       | **37.8**            | 36.8              | 36.9       |
| SEEDBench img  | **71.0**            | 70.1              | 70.0       |
| MMBench-EN dev | **70.6**            | 70.4              | 68.0       |
| MMBench-CN dev | **64.4**            | 64.2              | 60.6       |
| CCBench        | **33.9**            | 31.6              | 24.7       |
| AI2D test      | 65.6                | **70.0**          | 67.1       |
| ScienceQA test | **76.1**            | 72.9              | 70.4       |
| HallusionBench | 41.0                | **47.7**          | 35.2       |
| MMStar         | 43.0                | **45.1**          | 38.1       |

## Demo

Feel free to try our [demo](http://115.236.57.99:30020/)!


## Acknowledgements

We express our appreciation to the following projects for their outstanding contributions to academia and code development: [LLaVA](https://github.com/haotian-liu/LLaVA), [NeMo](https://github.com/NVIDIA/NeMo), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and [xtuner](https://github.com/InternLM/xtuner).

## License

HawkLlama is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.