VCL3D committed · Commit 4accd6b · verified · 1 Parent(s): 8cfe749

update readme

Files changed (1):
  1. README.md +52 -23
README.md CHANGED
@@ -1,8 +1,59 @@
  # Welcome to the VOXReality Horizon Europe Project

- Below you'll find the instructions needed to run our provided code. They cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language Visual Spatial Question Answering model.


  ## 1. Requirements
  ---
  1. CUDA compatible GPU.
@@ -13,25 +64,3 @@ Below you'll find the necessary instructions in order to run our provided code.
  2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
  3. For Windows (tested on Windows 10 and 11).
     1. Make sure Docker is installed on your system.
- 
- 
- Once you have Docker up and running you can move on to cloning the repository.
- 
- ## 2. Cloning the repository
- ---
- 1. Start by cloning the repository with the following command:
- `git clone https://gitlab.com/horizon-europe-voxreality/vision-and-language-models/rgb-language_vqa.git`
- 
- ## 3. Building the docker image
- ---
- To build the docker image from scratch, follow the instructions below:
- 
- 1. To build the image you need to have already downloaded the model (`https://huggingface.co/voxreality/rgb_language_vqa`) to `code\blip_model`.
- 2. Open a terminal in the folder where the Dockerfile resides and execute `docker build .`
- 
- ## 4. Running the docker image
- ---
- 1. You can start the docker container by running the `LINUX_Start_Contrainer.sh` or the `WIN_GPU_Start_Container.bat` script respectively.
- 2. A message in the terminal will indicate whether the container can access the GPU. If a GPU is available it will be used, otherwise the service runs on CPU. If for some reason you want to force a CPU run manually, you can execute `sudo docker run -it -p 5041:5041 --network host --name voxreality_rgb-language_vqa voxreality/rgb-language:vqa`.
- 3. Once the container has started, open `https://localhost:5041/docs` or `https://YourExternalIP:5041/docs` in your browser and start using the API.
- 4. To stop the container run the `LINUX_Stop_Container.sh` or the `WIN_GPU_Stop_Container.bat` script respectively. This script creates a `Dumps` folder and copies in all the tests you have run, e.g. uploaded photos and generated captions.
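If you want to call the service from code instead of the Swagger UI, the sketch below shows roughly what such a request could look like. It is only a guess, not the service's documented interface: the real endpoint path, field names, and response format are the ones listed at `:5041/docs`, and the `/vqa` path, the `image`/`question` fields, and `example.jpg` used here are hypothetical placeholders.

```python
# Hypothetical sketch only: take the real endpoint path and field names
# from the service's interactive docs at http://localhost:5041/docs.
import requests

url = "http://localhost:5041/vqa"      # hypothetical endpoint path
question = "Where is the cup?"         # illustrative example question

with open("example.jpg", "rb") as f:   # illustrative example image
    response = requests.post(
        url,
        files={"image": f},            # hypothetical field name
        data={"question": question},   # hypothetical field name
    )

response.raise_for_status()
print(response.json())
```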
 
+ ---
+ license: apache-2.0
+ language:
+ - en
+ metrics:
+ - code_eval
+ library_name: transformers
+ pipeline_tag: image-to-text
+ tags:
+ - text-generation-inference
+ ---
+ <u><b>We are creating a spatial aware vision-language(VL) model.</b></u>
13
+
14
+ This is a trained model on COCO dataset images including extra information regarding the spatial relationship between the entities of the image.
15
+
16
+ This is a sequence to sequence model for visual question-answering. The architecture is <u><b>BLIP.(BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)</b></u>
17
+
18
+ <details>
19
+ <summary>Requirements!</summary>
20
+ - 4GB GPU RAM.
21
+ - CUDA enabled docker
22
+ </details>
+ 
+ The way to download and run this:
+ ```python
+ from transformers import BlipProcessor, BlipForQuestionAnswering
+ import torch
+ from PIL import Image
+ device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+ # Model identifier on the Hugging Face Hub (or the path to a local copy of the model)
+ model_path = "voxreality/rgb_language_vqa"
+ # Load the model in half precision on the selected device
+ model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16)
+ question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object"
+ image_path = 'path/to/file'
+ image = Image.open(image_path).convert("RGB")
+ 
+ # Load the processor used during training for consistent preprocessing
+ processor = BlipProcessor.from_pretrained(model_path)
+ # Prepare the inputs on the same device and dtype as the model
+ encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
  # Welcome to the VOXReality Horizon Europe Project
 
+ out = model.generate(**encoding, max_new_tokens=200)
+ generated_text = processor.decode(out[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+ Below you'll find the instructions needed to run our provided code. They cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language spatial visual question answering (open type) model.
+ 
+ 

+ The model is trained to produce a spatial answer to any question about the spatial relationships between objects in the image.

+ <i>The output of this dialogue takes the following form:
+ 
+ Q: Where is "Object1"? A: To the "left/right etc." of another "Object2".</i>
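To make that template concrete, the snippet below asks one such question, reusing the `model`, `processor`, and `device` objects created in the code block above; the image path and question are illustrative placeholders, and the printed answer is an open-ended spatial phrase like the one in the template.

```python
# Assumes `model`, `processor`, and `device` were created as in the snippet above.
# "living_room.jpg" and the question text are illustrative placeholders.
from PIL import Image
import torch

image = Image.open("living_room.jpg").convert("RGB")
question = "Where is the lamp?"

encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
answer_ids = model.generate(**encoding, max_new_tokens=200)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
# e.g. an answer of the form "to the left of the sofa"
```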
  ## 1. Requirements
  ---
  1. CUDA compatible GPU.

  2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
  3. For Windows (tested on Windows 10 and 11).
     1. Make sure Docker is installed on your system.