---
tags:
- generated_from_trainer
model-index:
- name: persian-clip
  results: []
---

# persian-clip

This model is a fine-tuned version of an unspecified base model on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 0.7629

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 5

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.4072        | 0.12  | 100  | 2.1627          |
| 1.7146        | 0.25  | 200  | 1.6432          |
| 1.5058        | 0.37  | 300  | 1.4523          |
| 1.3836        | 0.49  | 400  | 1.4799          |
| 1.4946        | 0.62  | 500  | 1.3101          |
| 1.2544        | 0.74  | 600  | 1.2073          |
| 1.1984        | 0.86  | 700  | 1.1801          |
| 1.3243        | 0.99  | 800  | 1.1652          |
| 0.8373        | 1.11  | 900  | 1.0860          |
| 0.8625        | 1.23  | 1000 | 1.0731          |
| 0.791         | 1.36  | 1100 | 1.0427          |
| 0.8975        | 1.48  | 1200 | 1.0786          |
| 0.7767        | 1.6   | 1300 | 1.0248          |
| 0.9041        | 1.73  | 1400 | 1.0311          |
| 0.8474        | 1.85  | 1500 | 0.9649          |
| 0.7435        | 1.98  | 1600 | 0.9552          |
| 0.5126        | 2.1   | 1700 | 0.9909          |
| 0.4871        | 2.22  | 1800 | 0.9188          |
| 0.48          | 2.35  | 1900 | 0.9151          |
| 0.4715        | 2.47  | 2000 | 0.9056          |
| 0.408         | 2.59  | 2100 | 0.8885          |
| 0.4999        | 2.72  | 2200 | 0.8911          |
| 0.5169        | 2.84  | 2300 | 0.8727          |
| 0.3574        | 2.96  | 2400 | 0.8477          |
| 0.2749        | 3.09  | 2500 | 0.8666          |
| 0.2719        | 3.21  | 2600 | 0.8520          |
| 0.2779        | 3.33  | 2700 | 0.8379          |
| 0.3407        | 3.46  | 2800 | 0.8386          |
| 0.223         | 3.58  | 2900 | 0.8245          |
| 0.2649        | 3.7   | 3000 | 0.8149          |
| 0.2698        | 3.83  | 3100 | 0.7983          |
| 0.1863        | 3.95  | 3200 | 0.7959          |
| 0.1831        | 4.07  | 3300 | 0.7957          |
| 0.172         | 4.2   | 3400 | 0.7963          |
| 0.1457        | 4.32  | 3500 | 0.7879          |
| 0.1503        | 4.44  | 3600 | 0.7794          |
| 0.1783        | 4.57  | 3700 | 0.7788          |
| 0.166         | 4.69  | 3800 | 0.7753          |
| 0.1598        | 4.81  | 3900 | 0.7673          |
| 0.1618        | 4.94  | 4000 | 0.7629          |

### Framework versions

- Transformers 4.38.2
- Pytorch 2.1.2+cu121
- Datasets 2.10.1
- Tokenizers 0.15.0

### How to use

Both encoders generate vectors with 768 dimensions: text goes through a RoBERTa text encoder, images through a CLIP vision encoder.

```python
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer, CLIPFeatureExtractor

# download the pre-trained encoders and their preprocessing utilities
vision_encoder = CLIPVisionModel.from_pretrained('SeyedAli/Persian-CLIP')
preprocessor = CLIPFeatureExtractor.from_pretrained('SeyedAli/Persian-CLIP')
text_encoder = RobertaModel.from_pretrained('SeyedAli/Persian-CLIP')
tokenizer = AutoTokenizer.from_pretrained('SeyedAli/Persian-CLIP')

# define the input image and input text
text = 'something'
image = Image.open('my_favorite_image.jpg')

# compute the embeddings (each a 1 x 768 tensor)
text_embedding = text_encoder(**tokenizer(text, return_tensors='pt')).pooler_output
image_embedding = vision_encoder(**preprocessor(image, return_tensors='pt')).pooler_output
```
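The example above stops at computing the two embeddings. A minimal sketch of the usual next step (an addition, not part of the original card) is to L2-normalize both vectors and score the pair by cosine similarity, continuing from the variables defined above:

```python
import torch.nn.functional as F

# L2-normalize the 768-d embeddings, then take their dot product (cosine similarity)
text_embedding = F.normalize(text_embedding, dim=-1)
image_embedding = F.normalize(image_embedding, dim=-1)
similarity = (text_embedding @ image_embedding.T).item()  # higher = better text/image match
```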
### Demo

The following are just a few example use cases of Persian-CLIP on 25K Unsplash images.

* First install the demo helper: `pip install -q git+https://github.com/sajjjadayobi/clipfa.git`

```python
from clipfa import CLIPDemo
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer

# download the pre-trained encoders
vision_encoder = CLIPVisionModel.from_pretrained('SeyedAli/Persian-CLIP')
text_encoder = RobertaModel.from_pretrained('SeyedAli/Persian-CLIP')
tokenizer = AutoTokenizer.from_pretrained('SeyedAli/Persian-CLIP')

# wrap the encoders in the clipfa demo helper
demo = CLIPDemo(vision_encoder, text_encoder, tokenizer)
demo.compute_text_embeddings(['متن 3', 'متن 2', 'متن 1'])  # 'متن' means "text"
demo.compute_image_embeddings(['my_favorite_image.jpg'])
demo.zero_shot(image_path='my_favorite_image.jpg')
```
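If you would rather not depend on clipfa, the same text-to-image search can be sketched directly with the transformers API used above. This is a minimal illustration, not part of the original card: the image paths are placeholders and the query string is just an example.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer, CLIPFeatureExtractor

vision_encoder = CLIPVisionModel.from_pretrained('SeyedAli/Persian-CLIP')
preprocessor = CLIPFeatureExtractor.from_pretrained('SeyedAli/Persian-CLIP')
text_encoder = RobertaModel.from_pretrained('SeyedAli/Persian-CLIP')
tokenizer = AutoTokenizer.from_pretrained('SeyedAli/Persian-CLIP')

image_paths = ['img1.jpg', 'img2.jpg', 'img3.jpg']  # placeholder paths
query = 'غروب خورشید'  # Persian for "sunset"

with torch.no_grad():
    # embed the candidate images as an (N, 768) matrix, L2-normalized
    inputs = preprocessor(images=[Image.open(p) for p in image_paths], return_tensors='pt')
    image_embeddings = F.normalize(vision_encoder(**inputs).pooler_output, dim=-1)
    # embed the query text as a (1, 768) vector, L2-normalized
    text_embedding = F.normalize(
        text_encoder(**tokenizer(query, return_tensors='pt')).pooler_output, dim=-1
    )

# rank the images by cosine similarity to the query, best match first
scores = (text_embedding @ image_embeddings.T).squeeze(0)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f'{score:.3f}  {path}')
```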