# Φ Pheye - a family of efficient small vision-language models

- These models train a fraction of the parameters that other models of similar size train
- They incorporate vision into language tasks more efficiently (dense cross-attention rather than a LLaVA-style architecture)
- They process high-resolution input images more efficiently
- They use less training data yet achieve competitive results (redoing this recipe with more data should yield even better results)
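The dense cross-attention idea can be illustrated with a minimal NumPy sketch (hypothetical shapes and random weights standing in for learned parameters; this is not the actual Pheye implementation): text hidden states act as queries that attend over vision features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, vision_feats, d_head=64, seed=0):
    """Minimal single-head cross-attention: text tokens (queries)
    attend to vision features (keys/values)."""
    rng = np.random.default_rng(seed)
    d_text = text_hidden.shape[-1]
    d_vis = vision_feats.shape[-1]
    # Random projections stand in for learned weight matrices.
    Wq = rng.standard_normal((d_text, d_head)) / np.sqrt(d_text)
    Wk = rng.standard_normal((d_vis, d_head)) / np.sqrt(d_vis)
    Wv = rng.standard_normal((d_vis, d_head)) / np.sqrt(d_vis)
    q = text_hidden @ Wq             # (n_text, d_head)
    k = vision_feats @ Wk            # (n_vis, d_head)
    v = vision_feats @ Wv            # (n_vis, d_head)
    scores = q @ k.T / np.sqrt(d_head)   # (n_text, n_vis)
    return softmax(scores) @ v       # (n_text, d_head)

text = np.random.default_rng(1).standard_normal((5, 128))    # 5 text tokens
vision = np.random.default_rng(2).standard_normal((9, 256))  # 9 image patches
out = cross_attention(text, vision)
print(out.shape)  # (5, 64)
```

Inserting layers like this into an otherwise frozen language model keeps the trainable-parameter count small, since only the new attention blocks are updated.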

## Benchmarks

| Model | Resolution | Trainable Params | Data | VQAv2 | Nocaps | TextVQA | TextCaps |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoE-LLaVA-1.6B×4 | 336 | 2.9B | 5.9M | 76.0 | - | 47.8 | - |
| MoE-LLaVA-2.7B×4 | 336 | 5.3B | 5.9M | 77.1 | - | 50.2 | - |
| moondream1 | 384 | 1.86B | 3.9M | 74.7 | - | 35.6 | - |
| moondream2 | 384 | 1.86B | - | 77.7 | 92.2 | 49.7 | 124.7 |
| [Pheye-x4 🤗](https://huggingface.co/miguelcarv/Pheye-x4-448) | 448 | 295M | 2.9M | 75.2 | 110.1 | 45.9 | 107.5 |
| [Pheye-x4 🤗](https://huggingface.co/miguelcarv/Pheye-x4-672) | 672 | 295M | 2.9M | 75.5 | 110.5 | 49.3 | 112.8 |
| [Pheye-x2 🤗](https://huggingface.co/miguelcarv/Pheye-x2-448) | 448 | 578M | 2.9M | 75.9 | 111.4 | 47.3 | 109.7 |
| [Pheye-x2 🤗](https://huggingface.co/miguelcarv/Pheye-x2-672) | 672 | 578M | 2.9M | 76.4 | 110.0 | 50.5 | 116.1 |

## Examples

| Image | Example |
| --- | --- |
| <img src="https://c5.staticflickr.com/6/5463/17191308944_ae0b20bb7e_o.jpg" width="500"/> | **How much do these popcorn packets weigh?**<br>These popcorn packets weigh 90g each.<br><br>**What brand are these popcorn packets?**<br>These popcorn packets are from the brand "Magic Pop". |
| <img src="https://farm2.staticflickr.com/2708/5836100440_6e1117d36f_o.jpg" width="500"/> | **Can I pet that dog?**<br>No, you cannot pet the dog in the image. The dog is a black bear, which is a wild animal and should not be approached or interacted with by humans. It is important to maintain a safe distance from wild animals to avoid any potential harm to both the animal and the person. |
| <img src="https://farm4.staticflickr.com/3898/15152219465_4fb6d266ff_o.jpg" width="500"/> | **Why is the person wearing protective gear?**<br>The person in the image is wearing protective gear, such as a beekeeper's suit and a helmet, because they are engaged in beekeeping. Beekeeping involves handling bees and their hives, which can be potentially dangerous due to the risk of bee stings. The protective gear helps to minimize the risk of bee stings and ensures the beekeeper's safety while working with bees. |

## Usage

To generate a sample response from a prompt, use `generate.py`.
Use Python >= 3.11. Start by cloning the repo and creating a virtual environment with the necessary packages:

```bash
git clone https://github.com/miguelscarv/pheye.git
cd pheye
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```

Then run `generate.py`:

```bash
python3 generate.py --image_path images/dog_flower.jpg --prompt "What is the dog holding in its mouth?" --device cuda
```
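Since the checkpoints linked above live on the Hugging Face Hub, they can also typically be loaded through `transformers` with `trust_remote_code=True` (needed for architectures that ship their own modeling code). The sketch below is an assumption, not the documented Pheye API — check the model card for the exact interface:

```python
def load_pheye(repo_id="miguelcarv/Pheye-x4-448", device="cuda"):
    """Sketch (assumed interface): load a Hub checkpoint that ships
    custom modeling code. Requires `transformers` to be installed."""
    from transformers import AutoModelForCausalLM  # deferred: optional dependency
    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
    return model.to(device).eval()
```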

## Acknowledgments

This implementation was inspired by the [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) repository.