---
license: apache-2.0
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/iVKgqK6vTzCpCLVnWxmjA.png)

# Model Card for SpaceLLaVA

**SpaceLLaVA** uses LoRA to fine-tune [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main) on a dataset designed with [VQASynth](https://github.com/remyxai/VQASynth/tree/main) to enhance spatial reasoning, as in [SpatialVLM](https://spatial-vlm.github.io/).

## Model Details

### Model Description

This model uses data synthesis techniques and publicly available models to reproduce the approach described in SpatialVLM, enhancing the spatial reasoning of multimodal models.
With a pipeline of expert models, we can infer spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.
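
As a purely hypothetical illustration (the field names, question, and answer below are assumptions, not the actual VQASynth schema), a synthesized record might pair an image with a metric spatial question and a templated answer:

```python
# Hypothetical shape of one synthesized spatial-VQA record; not the real VQASynth output format.
example = {
    "image": "scene_0001.jpg",  # source image of the scene
    "question": "How far is the chair from the sofa?",
    "answer": "The chair is roughly 1.2 meters from the sofa.",  # phrased from estimated 3D geometry
}
```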

- **Developed by:** remyx.ai
- **Model type:** Multimodal Model, Vision Language Model, LLaVA
- **License:** Apache-2.0
- **Finetuned from model:** LLaVA

### Model Sources

- **Repository:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Paper:** [SpatialVLM](https://arxiv.org/abs/2401.12168)

## Uses

Use this model to query spatial relationships between objects in a scene.
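
For example, here is a minimal inference sketch with Transformers. The repo id, image URL, and prompt template are assumptions rather than details taken from this card, and the checkpoint is assumed to load through the `transformers` LLaVA integration; the GGUF weights can instead be run with llama.cpp's LLaVA support.

```python
# Minimal sketch, assuming the checkpoint is compatible with transformers' LLaVA classes.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "remyx-ai/SpaceLLaVA"  # assumed repo id; replace with the actual checkpoint path

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Any scene image works; the URL here is a placeholder.
image = Image.open(requests.get("https://example.com/room.jpg", stream=True).raw)
prompt = "USER: <image>\nHow far is the chair from the table? ASSISTANT:"  # LLaVA-1.5-style template

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```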

## Citation

@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}

@misc{liu2023llava,
  title     = {Visual Instruction Tuning},
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher = {NeurIPS},
  year      = {2023},
}