alanzhuly committed on
Commit ab88027
1 Parent(s): c9118f4

Update README.md

Files changed (1)
  1. README.md +12 -7
README.md CHANGED
@@ -17,14 +17,14 @@ Omni-Vision is a compact multimodal model that processes both visual and text in
  - Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space
  The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
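To make the projection step concrete, here is a minimal sketch of such an MLP projector. The shapes are illustrative assumptions (a 1152-dim vision embedding, Qwen2.5-0.5B's 896-dim hidden size), and `VisionProjector` is a hypothetical name, not code from this repository:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Hypothetical MLP projector: maps vision-encoder patch embeddings
    into the language model's token-embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        # Linear -> GELU -> Linear is a common projector design in
        # vision-language models; the exact layout here is assumed.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(patch_embeddings)

# Dummy batch: 729 patch embeddings from the vision encoder.
patches = torch.randn(1, 729, 1152)
projected = VisionProjector()(patches)  # -> shape (1, 729, 896)
```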
 
- **Feedback:** Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama-models/tree/main/models/llama3_2). For more technical information about generation parameters and recipes for how to use Llama 3.2-Vision in applications, please go [here](https://github.com/meta-llama/llama-recipes).
+ **Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
 
  ## Intended Use Cases
 
- 1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it.
- 2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story.
+ 1. Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it. E.g. "What kind of cat is this?"
+ 2. Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story. E.g. "Describe this image."
 
- ## ## Benchmarks
+ ## Benchmarks
 
  | Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
  |-------------------|----------------------|-----------|-------------|
@@ -36,22 +36,27 @@ The vision encoder first transforms input images into embeddings, which are then
  | ScienceQA (Test) | 64.5 | 59.0 | NA |
  | POPE | 89.4 | 84.1 | NA |
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztPlo5TgBAsFvZpGMy9H.png)
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztP1o5TgBAsFVzpGMy9H.png" alt="Benchmark Radar Chart" width="500"/>
 
  ## How to use
 
- This repository contains two versions of Llama-3.2-11B-Vision-Instruct, for use with transformers and with the original `llama` codebase.
-
  **Test in HuggingFace Space**
 
  **Run Locally**
 
  Install Nexa-SDK
 
+ Install Nexa-SDK:
+
+ > **Note**: To run our models locally, you'll need to install Nexa-SDK. It's an on-device inference framework that enables efficient and flexible deployment of our models directly on your hardware. With support for text, image, audio, and multimodal processing, Nexa-SDK brings powerful AI capabilities to your local environment.
+
  ```bash
  nexa run omnivision
  ```
 
+ ## Technical Innovations for Edge Deployment
+ - 9x Token Reduction through Token Compression
+ - Minimal-Edit DPO for Enhanced Response Quality
 
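One plausible reading of the 9x token-reduction bullet above, sketched with assumed shapes (729 patch embeddings folded into 81 wider ones); the model's actual compression may group tokens differently or learn the mapping:

```python
import torch

def compress_tokens(patch_embeddings: torch.Tensor, group: int = 9) -> torch.Tensor:
    """Illustrative 9x token compression: fold each run of `group`
    consecutive patch embeddings into one wider embedding, cutting the
    sequence length the language model must attend over."""
    batch, num_tokens, dim = patch_embeddings.shape
    assert num_tokens % group == 0, "token count must be divisible by group"
    # (batch, 729, dim) -> (batch, 81, dim * 9)
    return patch_embeddings.reshape(batch, num_tokens // group, dim * group)

compressed = compress_tokens(torch.randn(1, 729, 1152))
print(compressed.shape)  # torch.Size([1, 81, 10368])
```

A compressed token would then pass through the projection MLP, so the language model attends over 81 image tokens instead of 729.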
  ## Training