TeLVE v1.0dep released. Because of an addressing problem during training, it was trained on a dataset roughly half the intended size, so it is not recommended for use.
Browse files
- README.md: +80 -79
- models/TeLVE_v1.0dep.pth: +3 -0
README.md
CHANGED
@@ -1,79 +1,80 @@
---
license: cc-by-4.0
language:
- en
- tr
tags:
- VLM
- image2text
- lm
---

# TeLVE: Turkish efficient Language Vision Engine 🧿

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Models: v1.0, v1.0dep](https://img.shields.io/badge/Models-v1.0%2c%20v1.0dep-blue)](https://huggingface.co/outsu/TeLVE)

## First Turkish VLM ever!

TeLVE is the first Visual Language Model specifically designed for Turkish language understanding and image description generation. Built on Vision Transformer (ViT) and BERT pre-trained encoder architectures, it bridges the gap in Turkish visual-linguistic processing.
![TeLVE logo](<teLVE_logo.png>)

## Model Description

TeLVE combines:
- 🖼️ Vision Transformer (ViT-base-patch16-224)
- 📝 Turkish BERT (dbmdz/bert-base-turkish-cased)
- 🔄 Cross-attention mechanism for vision-language fusion
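For a concrete picture of how these three components fit together, here is a minimal sketch of a ViT encoder, the Turkish BERT encoder, and a cross-attention layer wired in PyTorch. The class name, the `google/vit-base-patch16-224` checkpoint id, and the single `nn.MultiheadAttention` layer are illustrative assumptions; TeLVE's actual implementation lives in the repository code and may differ.

```python
import torch.nn as nn
from transformers import ViTModel, BertModel

class VisionLanguageFusionSketch(nn.Module):
    """Illustrative ViT + Turkish BERT fusion; not TeLVE's actual code."""
    def __init__(self):
        super().__init__()
        # Both base encoders produce 768-dimensional hidden states.
        self.vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.text_encoder = BertModel.from_pretrained("dbmdz/bert-base-turkish-cased")
        # Cross-attention: text tokens attend to image patch embeddings.
        self.cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

    def forward(self, pixel_values, input_ids, attention_mask):
        image_states = self.vision_encoder(pixel_values=pixel_values).last_hidden_state
        text_states = self.text_encoder(input_ids=input_ids,
                                        attention_mask=attention_mask).last_hidden_state
        fused, _ = self.cross_attention(query=text_states, key=image_states, value=image_states)
        return fused
```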
### Version Logs
- **TeLVE v1.0**: Trained on the Unsplash Lite dataset
- **TeLVE v1.0dep**: Dataset enhanced with selected images from Pexels; the encoder problem with the letter "ü" was fixed. *(Deprecated: performance decreased because of a dataset addressing problem; not recommended for use.)*
## Usage

The model can be used in two ways:

### Inference (imagine.py)
```bash
# Generate captions for images
python imagine.py
```
This script:
- Loads a trained TeLVE model
- Takes images from the `images` directory
- Generates Turkish captions for each image
- Outputs the results to the console
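The README does not show `imagine.py`'s internals, so the following is only a hypothetical outline of the flow described above. The checkpoint path, the use of `torch.load` on the `.pth` file, the `*.jpg` glob, and the `generate()` call are all assumptions, not the script's actual API.

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import BertTokenizerFast, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-turkish-cased")

# Assumed: the checkpoint stores the full model object (path and format are guesses).
model = torch.load("models/TeLVE_v1.0.pth", map_location="cpu")
model.eval()

for image_path in sorted(Path("images").glob("*.jpg")):
    pixel_values = processor(images=Image.open(image_path).convert("RGB"),
                             return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values)  # hypothetical generation API
    print(image_path.name, tokenizer.decode(output_ids[0], skip_special_tokens=True))
```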
### Training (main.py)
Users can train their own models with ViT and BERT encoders.
```bash
# Train a new model
python main.py
```

This script:
- Loads and preprocesses image-caption pairs
- Initializes ViT and BERT encoders
- Trains the combined model
- Saves the model and tokenizer
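`main.py`'s data format, loss, and hyperparameters are not documented in this README, so the loop below is only a generic caption-training sketch. The batch keys, the transformers-style `outputs.loss`, the hyperparameters, and the output path are assumptions.

```python
import torch
from torch.utils.data import DataLoader

def train_captioner(model, dataset, epochs=3, lr=5e-5):
    """Generic image-caption training loop; batch keys and hyperparameters are assumptions."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).train()
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            outputs = model(
                pixel_values=batch["pixel_values"].to(device),
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                labels=batch["labels"].to(device),  # caption token ids as targets
            )
            optimizer.zero_grad()
            outputs.loss.backward()  # assumes a transformers-style output with .loss
            optimizer.step()
    torch.save(model, "models/TeLVE_custom.pth")  # hypothetical output path
```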
## Performance

Performance scores will be evaluated.
<!--
| Model Version | Dataset         | BLEU-4 | METEOR | CIDEr |
|---------------|-----------------|--------|--------|-------|
| TeLVE v1.0    | Unsplash        | *TBD*  | *TBD*  | *TBD* |
| TeLVE v1.1    | Unsplash+Pexels | *TBD*  | *TBD*  | *TBD* |
-->
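The repository does not specify an evaluation script. Once reference captions are available, BLEU-4 and METEOR could be computed with the Hugging Face `evaluate` library as sketched below; the captions here are placeholders, and CIDEr would need a separate package (e.g. pycocoevalcap), so it is not shown.

```python
import evaluate

# Placeholder Turkish captions; replace with real model outputs and references.
predictions = ["sahilde gün batımı"]
references = [["sahilde gün batımı manzarası"]]

bleu = evaluate.load("bleu")      # BLEU-4 by default (max_order=4)
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```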
## Citation

```bibtex
@software{telve2024,
  author = {Öğüt Su Karagün},
  title  = {TeLVE: Turkish efficient Language Vision Engine},
  year   = {2024},
  url    = {https://huggingface.co/outsu/TeLVE}
}
```

## License

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
models/TeLVE_v1.0dep.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5e74ea3f021a45ff9f888c841e8f07924b175fe2a50c73696daa7039be10df48
size 904212666
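The checkpoint above is stored as a Git LFS pointer. One way to fetch the actual weights programmatically is `huggingface_hub`, as sketched below; the repo id is taken from the README URL, and the filename from the pointer entry above. Note that the README marks v1.0dep as deprecated.

```python
from huggingface_hub import hf_hub_download

# v1.0dep is deprecated; swap in another checkpoint name as needed.
checkpoint_path = hf_hub_download(repo_id="outsu/TeLVE",
                                  filename="models/TeLVE_v1.0dep.pth")
print(checkpoint_path)
```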