VCL3D committed · Commit 4accd6b · verified · 1 Parent(s): 8cfe749

update readme

Files changed (1):
  1. README.md +52 -23
README.md CHANGED
@@ -1,8 +1,59 @@
  # Welcome to the VOXReality Horizon Europe Project

- Below you'll find the instructions needed to run our provided code. They cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language Visual Spatial Question Answering model.


  ## 1. Requirements
  ---
  1. CUDA compatible GPU.
@@ -13,25 +64,3 @@ Below you'll find the necessary instructions in order to run our provided code.
  2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
  3. For Windows (tested on Windows 10 and 11).
     1. Make sure Docker is installed on your system.
- 
- 
- Once you have Docker up and running you can move on to cloning the repository.
- 
- ## 2. Cloning the repository
- ---
- 1. Start by cloning the repository with the following command:
- `git clone https://gitlab.com/horizon-europe-voxreality/vision-and-language-models/rgb-language_vqa.git`
- 
- ## 3. Building the docker image
- ---
- To build the docker image from scratch, follow the instructions below:
- 
- 1. To build the image you need to have already downloaded the model (`https://huggingface.co/voxreality/rgb_language_vqa`) to `code\blip_model`.
- 2. Open a terminal in the folder where the Dockerfile resides and execute `docker build .`
- 
- ## 4. Running the docker image
- ---
- 1. You can start the docker container by running the `LINUX_Start_Contrainer.sh` or the `WIN_GPU_Start_Container.bat` script respectively.
- 2. A message in the terminal will indicate whether the container can access the GPU. If a GPU is available it will be used, otherwise the service runs on CPU. If for some reason you want to force a CPU run manually, you can execute `sudo docker run -it -p 5041:5041 --network host --name voxreality_rgb-language_vqa voxreality/rgb-language:vqa`.
- 3. Once the container has started, open `https://localhost:5041/docs` or `https://YourExternalIP:5041/docs` in your browser and start using the API.
- 4. To stop the container run the `LINUX_Stop_Container.sh` or the `WIN_GPU_Stop_Container.bat` script respectively. This script creates a `Dumps` folder and copies in all the tests you have run, e.g. uploaded photos and generated captions.
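If you want to call the service from code instead of the Swagger UI, the sketch below shows roughly what such a request could look like. It is only a guess, not the service's documented interface: the real endpoint path, field names, and response format are the ones listed at `:5041/docs`, and the `/vqa` path, the `image`/`question` fields, and `example.jpg` used here are hypothetical placeholders.

```python
# Hypothetical sketch only: take the real endpoint path and field names
# from the service's interactive docs at http://localhost:5041/docs.
import requests

url = "http://localhost:5041/vqa"      # hypothetical endpoint path
question = "Where is the cup?"         # illustrative example question

with open("example.jpg", "rb") as f:   # illustrative example image
    response = requests.post(
        url,
        files={"image": f},            # hypothetical field name
        data={"question": question},   # hypothetical field name
    )

response.raise_for_status()
print(response.json())
```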
 
+ ---
+ license: apache-2.0
+ language:
+ - en
+ metrics:
+ - code_eval
+ library_name: transformers
+ pipeline_tag: image-to-text
+ tags:
+ - text-generation-inference
+ ---
+ <u><b>We are creating a spatial aware vision-language(VL) model.</b></u>
13
+
14
+ This is a trained model on COCO dataset images including extra information regarding the spatial relationship between the entities of the image.
15
+
16
+ This is a sequence to sequence model for visual question-answering. The architecture is <u><b>BLIP.(BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)</b></u>
17
+
18
+ <details>
19
+ <summary>Requirements!</summary>
20
+ - 4GB GPU RAM.
21
+ - CUDA enabled docker
22
+ </details>
+ 
+ The way to download and run this:
+ ```python
+ from transformers import BlipProcessor, BlipForQuestionAnswering
+ import torch
+ from PIL import Image
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+ # Model identifier on the Hugging Face Hub (or the path to a local copy of the model)
+ model_path = "voxreality/rgb_language_vqa"
+ # Load the model in half precision on the selected device
+ model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16)
+ question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object"
+ image_path = 'path/to/file'
+ image = Image.open(image_path).convert("RGB")
+ 
+ # Load the processor used during training for consistent preprocessing
+ processor = BlipProcessor.from_pretrained(model_path)
+ # Prepare the inputs on the same device and dtype as the model
+ encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
  # Welcome to the VOXReality Horizon Europe Project
 
+ out = model.generate(**encoding, max_new_tokens=200)
+ generated_text = processor.decode(out[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+ Below you'll find the instructions needed to run our provided code. They cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language spatial visual question answering (open type) model.
+ 
+ 

+ The model is trained to produce a spatial answer to any question about the spatial relationships between objects in the image.

+ <i>The output of this dialogue takes the following form:
+ 
+ Q: Where is "Object1"? A: To the "left/right etc." of another "Object2".</i>
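To make that template concrete, the snippet below asks one such question, reusing the `model`, `processor`, and `device` objects created in the code block above; the image path and question are illustrative placeholders, and the printed answer is an open-ended spatial phrase like the one in the template.

```python
# Assumes `model`, `processor`, and `device` were created as in the snippet above.
# "living_room.jpg" and the question text are illustrative placeholders.
from PIL import Image
import torch

image = Image.open("living_room.jpg").convert("RGB")
question = "Where is the lamp?"

encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
answer_ids = model.generate(**encoding, max_new_tokens=200)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
# e.g. an answer of the form "to the left of the sofa"
```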
  ## 1. Requirements
  ---
  1. CUDA compatible GPU.

  2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
  3. For Windows (tested on Windows 10 and 11).
     1. Make sure Docker is installed on your system.