![Text Meme](meme.jpg)

Is text really all you need? Probably not, but the least we can do is try. This repo contains a QLoRA fine-tune of [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) on the original [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) dataset; however, each image is encoded as its base64 representation. With enough data, can an LLM learn to "see" just from text? Early results say absolutely not, but I am committed to burning my GPU credits regardless of how bad the result is.
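
For the curious, here is roughly what "encoding an image as text" looks like. This is a minimal sketch: the `<image>` delimiters and the USER/ASSISTANT prompt layout are placeholders for illustration, not necessarily the exact template used for this fine-tune.

```python
import base64


def encode_image_as_text(image_path: str) -> str:
    """Return the base64 string of an image so it can live inside a plain-text prompt."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def build_example(image_path: str, question: str, answer: str) -> dict:
    # The <image> tags and USER/ASSISTANT layout are illustrative placeholders.
    prompt = (
        f"<image>{encode_image_as_text(image_path)}</image>\n"
        f"USER: {question}\n"
        f"ASSISTANT:"
    )
    return {"prompt": prompt, "completion": f" {answer}"}


example = build_example("meme.jpg", "What is in this image?", "A meme.")
```

One practical catch: base64 inflates the payload by roughly 4/3, so even a modest JPEG yields tens of thousands of characters. Images presumably have to be downscaled aggressively to fit inside the model's context window at all.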

I do believe that in the future we will see a "simplification" of architectures designed to work across multiple modalities. LLaVA, for example, combines a vision encoder with a pre-trained LLM. Perhaps models of the future will have a joint representation for both images and text, and will not have to rely on splicing two models together. [Token-Free Models](https://arxiv.org/html/2401.13660v1), for instance, could perhaps be trained on multi-modal byte representations of inputs. Of course, this would be extremely computationally expensive compared to modern vision models, but maybe 10-20 years down the line that's not such a big deal?
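
As a toy illustration of what a "multi-modal byte representation" could mean: both image and text collapse into a single vocabulary of 256 byte values, so one token-free model could in principle consume them as a single sequence. The separator bytes below are arbitrary placeholders (and would collide with raw image bytes in practice); this is a sketch of the idea, not a proposed scheme.

```python
IMG_START, IMG_END = bytes([0x00]), bytes([0x01])  # arbitrary placeholder delimiters


def to_byte_sequence(image_path: str, text: str) -> bytes:
    """Concatenate raw image bytes and UTF-8 text into one byte stream."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    return IMG_START + image_bytes + IMG_END + text.encode("utf-8")


seq = to_byte_sequence("meme.jpg", "What is in this image?")
token_ids = list(seq)  # every byte is already an integer in [0, 255]
```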