---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
library_name: transformers
tags:
- AIGC
- LLaVA
datasets:
- OpenFace-CQUPT/FaceCaption-15M
metrics:
- accuracy
pipeline_tag: visual-question-answering
---
# Human-LLaVA-8B
## DEMO
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/TpN2t19Poe5YbHHP8uN7_.mp4"></video>
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/ur3sls4faPNlOMZ6sA_qK.png)
### Introduction
Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of many downstream visual-language understanding tasks. However, models trained on general-domain data often underperform in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to provide a unified multimodal vision-language model for human-related tasks.
Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** collected from the Internet for domain-specific alignment in the first stage (coming soon); (2) we then construct **multi-granularity captions for human-related images** (coming soon), covering the human face, the human body, and the whole image, and use them to fine-tune a large language model. Finally, we evaluate our model on a series of downstream tasks. **Human-LLaVA** achieves the best overall performance among multimodal models of similar scale and, in particular, performs best on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
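To make the idea of a multi-granularity caption concrete, the sketch below shows what one such record could look like. The field names and text are purely illustrative assumptions, not the schema of the released dataset.

``` python
# Purely illustrative sketch of a multi-granularity caption record.
# Field names and values are assumptions, not the released dataset schema.
example_record = {
    "image": "example_000001.jpg",
    "face_caption": "A middle-aged man with short gray hair, glasses, and a slight smile.",
    "body_caption": "He wears a dark suit and stands with his arms crossed.",
    "image_caption": "A man in a dark suit stands in front of a conference backdrop, facing the camera.",
}
```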
## Result
Human-LLaVA performs well in both the general domain and the specialized human domain.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/OIsyRrbpdvgTCvzSj1WKy.png)
## News and Updates 🔥🔥🔥
* Sep. 12, 2024: **🤗[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released! 🎉🎉🎉**
* Sep. 8, 2024: **🤗[HumanLLaVA-llama-3-8B](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released! 🎉🎉🎉**
## 🤗 Transformers
Running inference with Human-LLaVA only takes a few lines of code, as demonstrated below. Please make sure you are using an up-to-date version of `transformers`.
``` python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "OpenFace-CQUPT/Human_LLaVA"
device = "cuda:0"

# Load the model in half precision and move it to the GPU.
model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt in the expected "USER: <image>\n...\nASSISTANT:" format.
text = "Please describe this picture"
prompt = "USER: <image>\n" + text + "\nASSISTANT:"

# Load the image from a local file (or from a URL via requests, see the commented line).
image_file = "./test1.jpg"
raw_image = Image.open(image_file)
# raw_image = Image.open(requests.get(image_file, stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(device, torch.float16)

output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
predict = processor.decode(output[0], skip_special_tokens=True)
print(predict)
```
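As a small usage sketch, the helper below wraps the steps above to ask several questions about the same image. It assumes `model`, `processor`, `device`, and `raw_image` from the snippet above are already defined; the helper name and the example questions are our own illustration.

``` python
def ask(raw_image, question, max_new_tokens=400):
    """Ask Human-LLaVA one question about a PIL image (reuses the setup above)."""
    prompt = "USER: <image>\n" + question + "\nASSISTANT:"
    inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return processor.decode(output[0], skip_special_tokens=True)

for q in ["Please describe this picture",
          "Describe the person's facial expression",
          "What is the person wearing?"]:
    print(ask(raw_image, q))
```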
Our training code has been released publicly on GitHub: [ddw2AIGROUP2CQUPT/Human-LLaVA-8B](https://github.com/ddw2AIGROUP2CQUPT/Human-LLaVA-8B).
## Get the Dataset
#### Dataset Example
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/vRojQxm8IMybBV0X5CKbf.png)
#### Domain Alignment Stage
[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) (self-constructed): released!
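A minimal sketch for loading the released caption dataset with the `datasets` library; the split name and record fields are assumptions and should be checked against the dataset card.

``` python
from datasets import load_dataset

# Stream the dataset to avoid downloading everything at once
# (assumes a "train" split; check the dataset card for the actual splits and fields).
ds = load_dataset("OpenFace-CQUPT/HumanCaption-10M", split="train", streaming=True)
print(next(iter(ds)))  # inspect one record to see the actual columns
```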
#### Instruction Tuning Stage
**All public datasets have been filtered, and we will consider releasing all of the processed text in the future.**
* HumanCaptionHQ-300K (self-constructed): Coming soon!
* Face_hq (self-constructed): Coming soon!
* humanvg_high_reg (self-constructed): Coming soon!
* humanvg_high_rec (self-constructed): Coming soon!
* celeba_attribute (self-constructed): https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
* ShareGPT4V: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md
* LLaVA-Instruct_zh: https://huggingface.co/datasets/openbmb/llava_zh
* verified_ref3rec: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json
* verified_ref3reg: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json
* verified_shikra: https://github.com/shikras/shikra
## Citation
```
Coming soon!!!
```
## Contact
Email: [S230201133@stu.cqupt.edu.cn](mailto:S230201133@stu.cqupt.edu.cn) or [dw_dai@163.com](mailto:dw_dai@163.com)