quarterturn committed on
Commit
d8143df
1 Parent(s): 83af106

caption.py now supports 4-bit; added 4-bit quant of molmo

README.md CHANGED
@@ -11,7 +11,10 @@ Install:
11
  2. cd to "models" and clone Molmo-7B-D-0924:
12
  ```
13
  git lfs install
14
- git clone https://huggingface.co/allenai/Molmo-7B-D-0924 ```
 
 
 
15
  1. create a python3 venv or use conda to create an environment, eg:
16
  ``` conda create -n caption python=3.11 ```
17
  2. activate your environment, eg:
@@ -26,11 +29,11 @@ Install:
26
  4. click the button to download the caption zip file, the link is at the top of the page
27
 
28
  run the command-line version:
29
- ``` python3 caption.py ```
 
30
  1. make sure your images are in the "images" directory
31
  2. captions will be placed in the "images" directory
32
 
33
  Note:
34
- - The scripts are configured to load the model at bf16 precision, for max precision and lower memory utilization. This should fit in a single 24GB GPU.
35
- - You can edit the scripts to use a lower quant of the model, such as fp8, though accuracy may be lower.
36
  - If torch sees your first GPU supports flash attention and the others do not, it will assume all the cards do and it will throw an exception. A workaround is to use, for example, "CUDA_VISIBLE_DEVICES=0 python3 main.py (or caption.py)", to force torch to ignore the card supporting flash attention, so that it will use your other cards without it. Or, use it to exclude non-flash-attention-supporting GPUs.
 
11
  2. cd to "models" and clone Molmo-7B-D-0924:
12
  ```
13
  git lfs install
14
+ git clone https://huggingface.co/allenai/Molmo-7B-D-0924
15
+ ```
16
+ Since the 4-bit quant isn't that large, I have included it here. There's no need to clone it separately. The full 32-bit version is big, so I leave it up to you to clone it if you want it.
17
+
18
  1. create a python3 venv or use conda to create an environment, eg:
19
  ``` conda create -n caption python=3.11 ```
20
  2. activate your environment, eg:
 
29
  4. click the button to download the caption zip file, the link is at the top of the page
30
 
31
  run the command-line version:
32
+ ``` python3 caption.py ``` (use molmo at bf16 for more accuracy; needs 24GB GPU)
33
+ ``` python3 caption.py -q ``` (use molmo at int4; should be fine with 12GB GPU)
34
  1. make sure your images are in the "images" directory
35
  2. captions will be placed in the "images" directory
36
 
37
  Note:
38
+ - main.py (the gradio version) does not yet support the quantized model
 
39
  - If torch sees your first GPU supports flash attention and the others do not, it will assume all the cards do and it will throw an exception. A workaround is to use, for example, "CUDA_VISIBLE_DEVICES=0 python3 main.py (or caption.py)", to force torch to ignore the card supporting flash attention, so that it will use your other cards without it. Or, use it to exclude non-flash-attention-supporting GPUs.
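The `CUDA_VISIBLE_DEVICES` workaround in the note above can also be applied from a small launcher. The variable lists the GPU indices that stay visible to torch; every other card is hidden. This is only a sketch: `env_with_visible_gpus` and `run_caption` are hypothetical helper names, not part of this repository.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable at start-up; indices not listed are hidden.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


def run_caption(gpu_indices, script="caption.py", extra_args=()):
    """Run the caption script in a child process restricted to some GPUs."""
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=env_with_visible_gpus(gpu_indices), check=True)
```

For example, `run_caption([0], extra_args=["-q"])` would be roughly equivalent to `CUDA_VISIBLE_DEVICES=0 python3 caption.py -q`. The variable is set for a child process because it must be in place before CUDA is initialized.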
caption.py CHANGED
@@ -1,9 +1,15 @@
1
  import os
 
2
  import torch
3
  from PIL import Image
4
  import requests
5
  from transformers import AutoProcessor, AutoModelForCausalLM, GenerationConfig, BitsAndBytesConfig
6
 
 
 
 
 
 
7
  if torch.cuda.is_available():
8
      device = torch.device("cuda")
9
      print("GPU is available. Using CUDA.")
@@ -11,7 +17,7 @@ else:
11
      device = torch.device("cpu")
12
      print("GPU is not available. Using CPU.")
13
 
14
- # load the processor from local path
15
  local_path = "./model/Molmo-7B-D-0924"
16
  processor = AutoProcessor.from_pretrained(
17
      local_path,
@@ -21,15 +27,25 @@ processor = AutoProcessor.from_pretrained(
21
      device_map='auto'
22
  )
23
 
24
- model = AutoModelForCausalLM.from_pretrained(
25
-     local_path,
26
-     trust_remote_code=True,
27
-     torch_dtype='auto',
28
-     device_map='auto',
29
- )
30
-
31
-
32
- model.to(dtype=torch.bfloat16)
 
 
 
 
 
 
 
 
 
 
33
 
34
  # directory containing the images
35
  image_directory = "./images"
 
1
  import os
2
+ import argparse
3
  import torch
4
  from PIL import Image
5
  import requests
6
  from transformers import AutoProcessor, AutoModelForCausalLM, GenerationConfig, BitsAndBytesConfig
7
 
8
+ # Parse command-line arguments
9
+ parser = argparse.ArgumentParser(description="Load and use a quantized model")
10
+ parser.add_argument("-q", "--use_quant", action="store_true", help="Use quantized model")
11
+ args = parser.parse_args()
12
+
13
  if torch.cuda.is_available():
14
      device = torch.device("cuda")
15
      print("GPU is available. Using CUDA.")
 
17
      device = torch.device("cpu")
18
      print("GPU is not available. Using CPU.")
19
 
20
+ # Load the processor
21
  local_path = "./model/Molmo-7B-D-0924"
22
  processor = AutoProcessor.from_pretrained(
23
      local_path,
 
27
      device_map='auto'
28
  )
29
 
30
+ # Load the model
31
+ if args.use_quant:
32
+     # Load the quantized model
33
+     quantized_local_path = "./model/molmo-7B-D-bnb-4bit"
34
+     model = AutoModelForCausalLM.from_pretrained(
35
+         quantized_local_path,
36
+         trust_remote_code=True,
37
+         torch_dtype='auto',
38
+         device_map='auto',
39
+     )
40
+ else:
41
+     # Load the non-quantized model
42
+     model = AutoModelForCausalLM.from_pretrained(
43
+         local_path,
44
+         trust_remote_code=True,
45
+         torch_dtype='auto',
46
+         device_map='auto',
47
+     )
48
+     model.to(dtype=torch.bfloat16)
49
 
50
  # directory containing the images
51
  image_directory = "./images"
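The branch added in this hunk can be read as a small selection step followed by the load. A minimal sketch of that reading: the paths and the `-q`/`--use_quant` flag are taken from the diff, while `build_parser`, `select_model` and `load_model` are hypothetical names introduced here. It also assumes the bf16 cast applies only to the non-quantized branch, since the diff view does not preserve indentation.

```python
import argparse

# Paths as they appear in the diff.
FULL_MODEL_PATH = "./model/Molmo-7B-D-0924"
QUANT_MODEL_PATH = "./model/molmo-7B-D-bnb-4bit"


def build_parser():
    parser = argparse.ArgumentParser(description="Load and use a quantized model")
    parser.add_argument("-q", "--use_quant", action="store_true",
                        help="Use quantized model")
    return parser


def select_model(argv):
    """Return (model_path, cast_to_bf16) for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    if args.use_quant:
        # Assumption: the pre-quantized 4-bit checkpoint is not cast to bf16.
        return QUANT_MODEL_PATH, False
    return FULL_MODEL_PATH, True


def load_model(argv):
    """Load the selected checkpoint; heavy imports are kept local."""
    import torch
    from transformers import AutoModelForCausalLM

    path, cast_to_bf16 = select_model(argv)
    model = AutoModelForCausalLM.from_pretrained(
        path,
        trust_remote_code=True,
        torch_dtype='auto',
        device_map='auto',
    )
    if cast_to_bf16:
        model.to(dtype=torch.bfloat16)
    return model
```

Keeping the choice in `select_model` makes the flag handling checkable without a GPU or the model weights on disk.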
images/00000-203229662.png ADDED

Git LFS Details

  • SHA256: e948bb833c60ca18596f2ddee627d8d85eef57360c8f2f85d6c06d44bdec7782
  • Pointer size: 131 Bytes
  • Size of remote file: 702 kB
images/00000-203229662.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ The image depicts a striking scene from the movie Black Swan. The ballerina, portrayed by Natalie Portman, is standing in a powerful, dramatic pose. Her body is elongated, with her arms outstretched to the sides, creating a sense of balance and tension. Her head is tilted back, gazing upwards, which adds to the intensity of the composition. The dancer's skin is pale, and her eyes are wide open, likely a deep blue or green, though the exact color is difficult to discern in this still. Her hair is dark, possibly black or dark brown, and appears to be styled in an elegant updo, though the exact style is not clear from this angle. The ballerina's body type is slender and athletic, reflecting her profession as a dancer. She is wearing a black leotard, which contrasts sharply with her pale skin and dark hair. The lighting in the scene is dramatic, with shadows playing across her face and body, emphasizing the intensity of the moment. This image captures the essence of the film's themes of obsession, dedication, and the psychological toll of pursuing perfection in dance.
images/00006-2234503665.png ADDED

Git LFS Details

  • SHA256: 9c6a6e63cdf23370ce5f20160e15a89d2b52fe92fd39fc289746e3418b7365d2
  • Pointer size: 131 Bytes
  • Size of remote file: 866 kB
images/00006-2234503665.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ The image depicts a striking scene from the movie Black Swan. The ballerina, portrayed by Natalie Portman, is standing in a powerful, dramatic pose. Her body is elongated, with her arms outstretched to the sides, creating a sense of balance and tension. Her head is tilted back, gazing upwards, which adds to the intensity of the composition. The dancer's skin is pale, and her eyes are wide open, likely a deep blue or green, though the exact color is difficult to discern in this still. Her hair is dark, possibly black or dark brown, and appears to be styled in an elegant updo, though the exact style is not clear from this angle. The ballerina's body type is slender and athletic, reflecting her profession as a dancer. She is wearing a black leotard, which contrasts sharply with her pale skin and dark hair. The lighting in the scene is dramatic, with shadows playing across her face and body, emphasizing the intensity of the moment. This image captures the essence of the film's themes of obsession, dedication, and the psychological toll of pursuing perfection in dance.
images/00030-4075734474.png ADDED

Git LFS Details

  • SHA256: 987d9f0ef256156f6182cefe1bfabbd304914af854a5e06c6e787fff795050aa
  • Pointer size: 131 Bytes
  • Size of remote file: 957 kB
images/00030-4075734474.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ The image depicts a striking scene from the movie Black Swan. The ballerina, portrayed by Natalie Portman, is standing in a powerful, dramatic pose. Her body is elongated, with her arms outstretched to the sides, creating a sense of balance and tension. Her head is tilted back, gazing upwards, which adds to the intensity of the composition. The dancer's skin is pale, and her eyes are wide open, likely a deep blue or green, though the exact color is difficult to discern in this still. Her hair is dark, possibly black or dark brown, and appears to be styled in an elegant updo, though the exact style is not clear from this angle. The ballerina's body type is slender and athletic, reflecting her profession as a dancer. She is wearing a black leotard, which contrasts sharply with her pale skin and dark hair. The lighting in the scene is dramatic, with shadows playing across her face and body, emphasizing the intensity of the moment. This image captures the essence of the film's themes of obsession, dedication, and the psychological toll of pursuing perfection in dance.
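Each image added above is paired with a `.txt` file of the same base name in the same directory. The naming logic in caption.py is not part of this diff, so the following is only a sketch of how such a pairing could be derived; `caption_path_for` is a hypothetical helper.

```python
from pathlib import Path


def caption_path_for(image_path):
    """Return the caption file path next to an image: same stem, .txt suffix.

    Hypothetical helper; the actual naming code in caption.py is not shown
    in this commit.
    """
    return Path(image_path).with_suffix(".txt")


def write_caption(image_path, caption):
    """Write a caption beside its image and return the path written."""
    out_path = caption_path_for(image_path)
    out_path.write_text(caption, encoding="utf-8")
    return out_path
```

With this mapping, `images/00000-203229662.png` corresponds to `images/00000-203229662.txt`, matching the file pairs in this commit.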
model/molmo-7B-D-bnb-4bit ADDED
@@ -0,0 +1 @@
 
 
1
+ Subproject commit 51097c4251a023d72485963c1ab69f3b6d6a1ec6