fixed typos and added info pointed out in evaluation
README.md CHANGED
@@ -7,7 +7,7 @@ tags:

 ## Model Details

-This model is a finetuned [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed with aim to improve zero-shot image classification, text-to-image and image-to-image retrieval
+This model is a fine-tuned [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed with the aim of improving zero-shot image classification, text-to-image and image-to-image retrieval, specifically on remote sensing images.

 ### Model Date

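The updated description centres on zero-shot classification of remote sensing images. As a point of reference, a minimal sketch of querying such a checkpoint through the `transformers` CLIP API could look as follows; the checkpoint id, image file and label list are illustrative assumptions rather than values taken from this card, whose own "Use with Transformers" section remains the authoritative snippet.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "flax-community/clip-rsicd-v2"  # assumed checkpoint name
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example_scene.jpg")  # hypothetical remote sensing image
labels = ["residential area", "playground", "stadium", "forest", "airport"]

# Score the image against one prompt per label and normalise to probabilities
inputs = processor(
    text=[f"a photo of a {label}" for label in labels],
    images=image,
    return_tensors="pt",
    padding=True,
)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
for label, prob in zip(labels, probs[0]):
    print(f"{label:<18} {prob.item():.4f}")
```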
@@ -19,22 +19,22 @@ The base model uses a ViT-B/32 Transformer architecture as an image encoder and

 ### Model Version

-We release several checkpoints for `clip-rsicd` model. Refer to [our github repo](https://github.com/arampacha/CLIP-rsicd) for zero-shot classification for each of those.
+We release several checkpoints for the `clip-rsicd` model. Refer to [our github repo](https://github.com/arampacha/CLIP-rsicd#evaluation-results) for performance metrics on zero-shot classification for each of those.

 ### Training

 To reproduce the fine-tuning procedure one can use released [script](https://github.com/arampacha/CLIP-rsicd/blob/master/run_clip_flax_tv.py).
-The model was trained using batch size 1024, adafactor optimizer with linear warmup and decay with
-Full log of the training run
+The model was trained with a batch size of 1024, using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4, on a single TPU v3-8.
+The full log of the training run can be found on [WandB](https://wandb.ai/wandb/hf-flax-clip-rsicd/runs/1ts243k3).

 ### Demo

-
+Check out the model's text-to-image and image-to-image retrieval capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo).


 ### Documents

-- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU]()
+- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/Finetuning_CLIP_with_HF_and_jax.ipynb)


 ### Use with Transformers
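The training line above specifies the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4. A minimal `optax` sketch of that kind of setup is shown below, assuming hypothetical warmup and total step counts; the released `run_clip_flax_tv.py` script is the authoritative reference.

```python
import optax

total_steps = 10_000   # assumed; not stated in the card
warmup_steps = 1_000   # assumed; not stated in the card
peak_lr = 1e-4         # peak learning rate from the card

# Linear warmup to the peak learning rate, then linear decay back to zero
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adafactor driven by the schedule, as in a flax/jax fine-tuning loop
optimizer = optax.adafactor(learning_rate=schedule)
```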
@@ -67,7 +67,12 @@ for l, p in zip(labels, probs[0]):

 ### Intended Use

-The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification.
+The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification.
+
+In addition, we can imagine applications in defense and law enforcement, climate change and global warming, and even some consumer applications. A partial list of applications can be found [here](https://github.com/arampacha/CLIP-rsicd#applications). In general we think such models can be useful as digital assistants for humans engaged in searching through large collections of images.
+
+We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
+

 #### Primary intended uses

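The added intended-use text frames the model as an assistant for searching large image collections, and the demo linked above showcases text-to-image retrieval. A small sketch of such retrieval using CLIP embeddings and cosine similarity follows; the checkpoint id, image files and query text are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "flax-community/clip-rsicd-v2"  # assumed checkpoint name
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Hypothetical local collection of remote sensing images
image_paths = ["scene_0001.jpg", "scene_0002.jpg", "scene_0003.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(
        **processor(text=["two airplanes parked next to a terminal"], return_tensors="pt", padding=True)
    )

# Cosine similarity between the query and every image embedding
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)

# Rank images by similarity to the text query
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```

Image-to-image retrieval works the same way, with a query image embedded via `get_image_features` in place of the text query.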
@@ -79,7 +84,7 @@ We primarily imagine the model will be used by researchers to better understand

 ## Data

-The model was trained on publicly available remote sensing image
+The model was trained on publicly available remote sensing image captioning datasets, namely [RSICD](https://github.com/201528014227051/RSICD_optimal), [UCM](https://mega.nz/folder/wCpSzSoS#RXzIlrv--TDt3ENZdKN8JA) and [Sydney](https://mega.nz/folder/pG4yTYYA#4c4buNFLibryZnlujsrwEQ). More information on the datasets used can be found on [our project page](https://github.com/arampacha/CLIP-rsicd#dataset).


 ## Performance and Limitations
|