Zhiminli committed on
Commit 0dae0c9
Parent: 3767fa1

Create README.md

---
library_name: hunyuan-dit
license: other
license_name: tencent-hunyuan-community
license_link: https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/LICENSE.txt
language:
- en
- zh
---

## Hunyuan-Captioner
Hunyuan-Captioner meets the needs of text-to-image techniques by generating captions with a high degree of image-text consistency. It can produce high-quality image descriptions from a variety of angles, including object description, object relationships, background information, and image style. Our code is based on the [LLaVA](https://github.com/haotian-liu/LLaVA) implementation.

### Instructions
a. Install dependencies

The dependencies and installation steps are basically the same as for the [**base model**](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.1).

b. Data download
```shell
cd HunyuanDiT
wget -O ./dataset/data_demo.zip https://dit.hunyuan.tencent.com/download/HunyuanDiT/data_demo.zip
unzip ./dataset/data_demo.zip -d ./dataset
mkdir ./dataset/porcelain/arrows ./dataset/porcelain/jsons
```
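
If `wget` or `unzip` is unavailable, the same download and extraction can be done from Python; the sketch below mirrors the paths used in the shell commands above.

```python
# Python equivalent of the wget/unzip/mkdir steps above.
import urllib.request
import zipfile
from pathlib import Path

url = "https://dit.hunyuan.tencent.com/download/HunyuanDiT/data_demo.zip"
dest = Path("dataset/data_demo.zip")
dest.parent.mkdir(parents=True, exist_ok=True)

urllib.request.urlretrieve(url, dest)
with zipfile.ZipFile(dest) as zf:
    zf.extractall("dataset")

# The two empty directories the demo expects.
Path("dataset/porcelain/arrows").mkdir(parents=True, exist_ok=True)
Path("dataset/porcelain/jsons").mkdir(parents=True, exist_ok=True)
```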

c. Model download
```shell
# Use the huggingface-cli tool to download the model.
huggingface-cli download Tencent-Hunyuan/HunyuanCaptioner --local-dir ./ckpts/captioner
```
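
If you prefer to fetch the checkpoint from Python instead of the CLI, `huggingface_hub` exposes an equivalent call; the target directory below mirrors the command above.

```python
# Programmatic equivalent of the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Tencent-Hunyuan/HunyuanCaptioner",
    local_dir="./ckpts/captioner",
)
```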

### Inference

Currently supported prompts:

| Target | Prompt |
| --- | --- |
| Caption in Chinese | 描述这张图片 |
| Caption in Chinese with tags | 根据提示词“{}”,描述这张图片 |
| Caption in English | Please describe the content of this image |

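The `{}` in the tagged prompt is a placeholder that is presumably filled with the tag passed via `--content` in the `insert_content` mode shown below; a minimal illustration of that substitution:

```python
# Minimal sketch: how the "{}" slot in the tagged Chinese prompt is
# presumably filled with the tag supplied via --content.
template = "根据提示词“{}”,描述这张图片"
prompt = template.format("宫保鸡丁")
print(prompt)  # 根据提示词“宫保鸡丁”,描述这张图片
```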

a. Single picture inference in Chinese

```bash
python mllm/caption_demo.py --mode "caption_zh" --image_file "mllm/images/demo1.png" --model_path "./ckpts/captioner"
```

b. Single picture inference with a tag in Chinese

```bash
python mllm/caption_demo.py --mode "insert_content" --content "宫保鸡丁" --image_file "mllm/images/demo2.png" --model_path "./ckpts/captioner"
```

c. Single picture inference in English

```bash
python mllm/caption_demo.py --mode "caption_en" --image_file "mllm/images/demo3.png" --model_path "./ckpts/captioner"
```

d. Multiple-picture inference in Chinese

```bash
### Convert multiple pictures to a CSV file.
python mllm/make_csv.py --img_dir "mllm/images" --input_file "mllm/images/demo.csv"

### Run inference on the pictures listed in the CSV file.
python mllm/caption_demo.py --mode "caption_zh" --input_file "mllm/images/demo.csv" --output_file "mllm/images/demo_res.csv" --model_path "./ckpts/captioner"
```
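
For reference, a minimal sketch of what building such a CSV might look like, assuming `mllm/make_csv.py` simply collects image paths into a single-column file (the column name below is an assumption; check the script for its actual schema):

```python
# Hypothetical stand-in for mllm/make_csv.py: gather image paths under
# a directory into a one-column CSV. The "img_path" header is an
# assumption, not the script's confirmed schema.
import csv
from pathlib import Path

img_dir = Path("mllm/images")
paths = sorted(
    p for p in img_dir.iterdir()
    if p.suffix.lower() in {".png", ".jpg", ".jpeg"}
)

with open(img_dir / "demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["img_path"])  # assumed header
    writer.writerows([str(p)] for p in paths)
```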

(Optional) To convert the output CSV file to Arrow format, please refer to
[Data Preparation #3](https://github.com/Tencent/HunyuanDiT?tab=readme-ov-file#data-preparation) for detailed instructions.
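
The linked tooling is the authoritative conversion path; purely as an illustration, a generic CSV-to-Arrow conversion with `pyarrow` might look like the sketch below (the file names and destination are assumptions, and the training pipeline likely expects its own schema):

```python
# Generic CSV -> Arrow conversion with pyarrow, for illustration only;
# HunyuanDiT's own data tools (see link above) define the real schema.
import pyarrow.csv as pacsv
import pyarrow.feather as feather

table = pacsv.read_csv("mllm/images/demo_res.csv")
feather.write_feather(table, "dataset/porcelain/arrows/demo_res.arrow")
```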

### Gradio
To launch a Gradio demo locally, please run the following commands one by one (each starts a long-running process, so use a separate terminal for each). For more detailed instructions, please refer to [LLaVA](https://github.com/haotian-liu/LLaVA).
```bash
cd mllm
python -m llava.serve.controller --host 0.0.0.0 --port 10000

python -m llava.serve.gradio_web_server --controller http://0.0.0.0:10000 --model-list-mode reload --port 443

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://0.0.0.0:10000 --port 40000 --worker http://0.0.0.0:40000 --model-path "./ckpts/captioner" --model-name LlavaMistral
```
The demo can then be accessed at http://0.0.0.0:443. Note that 0.0.0.0 here must be replaced with X.X.X.X, your server's actual IP address.