quarterturn/molmo-flux-captioner

quarterturn committed on Oct 16

Commit d8143df • 1 Parent(s): 83af106

caption.py now supports 4-bit; added 4-bit quant of molmo

Files changed (9)

README.md +7 -4
caption.py +26 -10
images/00000-203229662.png +3 -0
images/00000-203229662.txt +1 -0
images/00006-2234503665.png +3 -0
images/00006-2234503665.txt +1 -0
images/00030-4075734474.png +3 -0
images/00030-4075734474.txt +1 -0
model/molmo-7B-D-bnb-4bit +1 -0

README.md CHANGED

@@ -11,7 +11,10 @@ Install:
 2. cd to "models" and clone Molmo-7B-D-0924:
    ```
        git lfs install
-       git clone https://huggingface.co/allenai/Molmo-7B-D-0924 ```
 1. create a python3 venv or use conda to create an environment, eg:
    ``` conda create -n caption python=3.11 ```
 2. activate your environment, eg:
@@ -26,11 +29,11 @@ Install:
    4. click the button to download the caption zip file, the link is at the top of the page
    run the command-line version:
-   ``` python3 caption.py ```
    1. make sure your images are in the "images" directory
    2. captions will be placed in the "images" directory
 Note:
-- The scripts are configured to load the model at bf16 precision, for max precision and lower memory utilization. This should fit in a single 24GB GPU.
-- You can edit the scripts to use a lower quant of the model, such as fp8, though accuracy may be lower.
 - If torch sees that your first GPU supports flash attention and the others do not, it will assume all the cards do and throw an exception. A workaround is to set CUDA_VISIBLE_DEVICES, for example "CUDA_VISIBLE_DEVICES=0 python3 main.py (or caption.py)", so that torch only sees the GPUs you list. Use it either to hide the card supporting flash attention, so that torch uses your other cards without it, or to exclude the non-flash-attention-supporting GPUs.

 2. cd to "models" and clone Molmo-7B-D-0924:
    ```
        git lfs install
+       git clone https://huggingface.co/allenai/Molmo-7B-D-0924
+```
+  Since the 4-bit quant isn't that large, I have included it here. There's no need to clone it separately. The full 32-bit version is big, so I leave it up to you to clone it if you want it.
 1. create a python3 venv or use conda to create an environment, eg:
    ``` conda create -n caption python=3.11 ```
 2. activate your environment, eg:
    4. click the button to download the caption zip file, the link is at the top of the page
    run the command-line version:
+   ``` python3 caption.py ``` (use molmo at bf16 for more accuracy; needs 24GB GPU)
+   ``` python3 caption.py -q ``` (use molmo at int4; should be fine with 12GB GPU)
    1. make sure your images are in the "images" directory
    2. captions will be placed in the "images" directory
 Note:
+- main.py (gradio version) does not yet support the quant model
 - If torch sees that your first GPU supports flash attention and the others do not, it will assume all the cards do and throw an exception. A workaround is to set CUDA_VISIBLE_DEVICES, for example "CUDA_VISIBLE_DEVICES=0 python3 main.py (or caption.py)", so that torch only sees the GPUs you list. Use it either to hide the card supporting flash attention, so that torch uses your other cards without it, or to exclude the non-flash-attention-supporting GPUs.
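
The `CUDA_VISIBLE_DEVICES` workaround from the README note can also be applied from a small launcher script instead of the shell. The following is a minimal sketch and not part of this commit: the helper names `env_with_visible_gpus` and `run_captioner` are hypothetical, and the GPU indices are placeholders for whichever cards you want torch to see. The variable has to be set before the child process initializes CUDA, which is why it is passed through the child's environment.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the listed GPUs.

    Hypothetical helper for illustration; it only sets CUDA_VISIBLE_DEVICES
    and leaves every other variable untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


def run_captioner(gpu_indices, use_quant=False):
    """Launch caption.py in a child process restricted to the given GPUs."""
    cmd = [sys.executable, "caption.py"]
    if use_quant:
        cmd.append("-q")  # the 4-bit flag added in this commit
    return subprocess.run(cmd, env=env_with_visible_gpus(gpu_indices), check=True)
```

Calling `run_captioner([1, 2])` would, under these assumptions, hide GPU 0 from torch and run the bf16 path on the remaining two cards; `run_captioner([0], use_quant=True)` would run the 4-bit path on the first card only.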

caption.py CHANGED

@@ -1,9 +1,15 @@
 import os
 import torch
 from PIL import Image
 import requests
 from transformers import AutoProcessor, AutoModelForCausalLM, GenerationConfig, BitsAndBytesConfig
 if torch.cuda.is_available():
     device = torch.device("cuda")
     print("GPU is available. Using CUDA.")
@@ -11,7 +17,7 @@ else:
     device = torch.device("cpu")
     print("GPU is not available. Using CPU.")
-# load the processor from local path
 local_path = "./model/Molmo-7B-D-0924"
 processor = AutoProcessor.from_pretrained(
     local_path,
@@ -21,15 +27,25 @@ processor = AutoProcessor.from_pretrained(
     device_map='auto'
 )
-model = AutoModelForCausalLM.from_pretrained(
-    local_path,
-    trust_remote_code=True,
-    torch_dtype='auto',
-    device_map='auto',
-)
-model.to(dtype=torch.bfloat16)
 # directory containing the images
 image_directory = "./images"

 import os
+import argparse
 import torch
 from PIL import Image
 import requests
 from transformers import AutoProcessor, AutoModelForCausalLM, GenerationConfig, BitsAndBytesConfig
+# Parse command-line arguments
+parser = argparse.ArgumentParser(description="Load and use a quantized model")
+parser.add_argument("-q", "--use_quant", action="store_true", help="Use quantized model")
+args = parser.parse_args()
 if torch.cuda.is_available():
     device = torch.device("cuda")
     print("GPU is available. Using CUDA.")
     device = torch.device("cpu")
     print("GPU is not available. Using CPU.")
+# Load the processor
 local_path = "./model/Molmo-7B-D-0924"
 processor = AutoProcessor.from_pretrained(
     local_path,
     device_map='auto'
 )
+# Load the model
+if args.use_quant:
+    # Load the quantized model
+    quantized_local_path = "./model/molmo-7B-D-bnb-4bit"
+    model = AutoModelForCausalLM.from_pretrained(
+        quantized_local_path,
+        trust_remote_code=True,
+        torch_dtype='auto',
+        device_map='auto',
+    )
+else:
+    # Load the non-quantized model
+    model = AutoModelForCausalLM.from_pretrained(
+        local_path,
+        trust_remote_code=True,
+        torch_dtype='auto',
+        device_map='auto',
+    )
+    model.to(dtype=torch.bfloat16)
 # directory containing the images
 image_directory = "./images"
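
The flag handling added to caption.py comes down to choosing one of two local model directories and casting to bf16 only on the non-quantized path. Below is a minimal stand-alone sketch of that selection logic, using only the standard library so it can be tried without loading the model. The function names `build_parser` and `select_model_path` are hypothetical and not part of the commit; the two paths and the `-q`/`--use_quant` flag are taken from the diff.

```python
import argparse

# Paths as they appear in caption.py in this commit.
FULL_MODEL_PATH = "./model/Molmo-7B-D-0924"
QUANT_MODEL_PATH = "./model/molmo-7B-D-bnb-4bit"


def build_parser():
    """Build the same single-flag parser that caption.py defines."""
    parser = argparse.ArgumentParser(description="Load and use a quantized model")
    parser.add_argument("-q", "--use_quant", action="store_true", help="Use quantized model")
    return parser


def select_model_path(argv=None):
    """Hypothetical helper: return (model_path, cast_to_bf16) for the given arguments.

    cast_to_bf16 mirrors the diff, where model.to(dtype=torch.bfloat16)
    is only called in the non-quantized branch.
    """
    args = build_parser().parse_args(argv)
    if args.use_quant:
        return QUANT_MODEL_PATH, False
    return FULL_MODEL_PATH, True


print(select_model_path([]))      # ('./model/Molmo-7B-D-0924', True)
print(select_model_path(["-q"]))  # ('./model/molmo-7B-D-bnb-4bit', False)
```

Keeping the choice in one function would also make it easier to reuse from main.py once the gradio version gains quantized support, though that is a design suggestion rather than something this commit does.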

images/00000-203229662.png ADDED

Git LFS Details

SHA256: e948bb833c60ca18596f2ddee627d8d85eef57360c8f2f85d6c06d44bdec7782
Pointer size: 131 Bytes
Size of remote file: 702 kB

images/00000-203229662.txt ADDED

	@@ -0,0 +1 @@

+ The image depicts a striking scene from the movie Black Swan. The ballerina, portrayed by Natalie Portman, is standing in a powerful, dramatic pose. Her body is elongated, with her arms outstretched to the sides, creating a sense of balance and tension. Her head is tilted back, gazing upwards, which adds to the intensity of the composition. The dancer's skin is pale, and her eyes are wide open, likely a deep blue or green, though the exact color is difficult to discern in this still. Her hair is dark, possibly black or dark brown, and appears to be styled in an elegant updo, though the exact style is not clear from this angle. The ballerina's body type is slender and athletic, reflecting her profession as a dancer. She is wearing a black leotard, which contrasts sharply with her pale skin and dark hair. The lighting in the scene is dramatic, with shadows playing across her face and body, emphasizing the intensity of the moment. This image captures the essence of the film's themes of obsession, dedication, and the psychological toll of pursuing perfection in dance.

images/00006-2234503665.png ADDED

Git LFS Details

SHA256: 9c6a6e63cdf23370ce5f20160e15a89d2b52fe92fd39fc289746e3418b7365d2
Pointer size: 131 Bytes
Size of remote file: 866 kB

images/00006-2234503665.txt ADDED

	@@ -0,0 +1 @@

images/00030-4075734474.png ADDED

Git LFS Details

SHA256: 987d9f0ef256156f6182cefe1bfabbd304914af854a5e06c6e787fff795050aa
Pointer size: 131 Bytes
Size of remote file: 957 kB

images/00030-4075734474.txt ADDED

	@@ -0,0 +1 @@

model/molmo-7B-D-bnb-4bit ADDED

	@@ -0,0 +1 @@


+ Subproject commit 51097c4251a023d72485963c1ab69f3b6d6a1ec6