# SentenceTransformer
This model is experimental.
For details, see the blog post; for the related source code, see the repository.
The text embedding model uses cl-nagoya/ruri-large, and the image encoder uses the ViT from Qwen/Qwen2-VL-2B-Instruct as its base model.
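As a rough illustration of the resulting dual-encoder setup (the tensors below are random stand-ins; the real towers live in this model's remote code):

```python
import torch
import torch.nn.functional as F

# Random stand-ins for the two towers' outputs (illustration only):
# the ruri-large-based text tower and the Qwen2-VL-ViT-based image tower
# both embed into the same 1024-dimensional space.
text_vec = torch.randn(1, 1024)
image_vec = torch.randn(1, 1024)

# Because the embedding space is shared, cross-modal relevance is just
# the cosine similarity between a text vector and an image vector.
score = F.cosine_similarity(text_vec, image_vec)
print(score)
```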
## Model Details
### Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
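These properties correspond to the standard Sentence Transformers accessors (the values shown are the ones listed above; a custom trust_remote_code model may not populate all of them):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oshizo/japanese-clip-qwen2_vl-exp-0101", trust_remote_code=True)

print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024
print(model.similarity_fn_name)                  # "cosine"
```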
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oshizo/japanese-clip-qwen2_vl-exp-0101", trust_remote_code=True)
```
```python
# Japanese captions (roughly: "A monochrome portrait photo of a man in a
# military uniform, sitting on stone steps." / "A brown dog sitting in a
# garden, facing the camera.")
sentences = [
    "モノクロの男性の肖像写真。軍服を着て石の階段に座っている。",
    "庭で茶色の犬がこちらを向いて座っている。",
]

text_embeddings = model.encode(sentences)
text_embeddings.shape
# (2, 1024)
```
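The card does not say whether these embeddings come back unit-normalized; if you want to use raw dot products instead of `model.similarity`, a quick check (a small sketch, assuming the default NumPy output of `encode`):

```python
import numpy as np

# Norms of ~1.0 across the board would mean the embeddings are already
# L2-normalized, so dot product and cosine similarity coincide.
print(np.linalg.norm(text_embeddings, axis=1))
```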
```python
import io

import requests
from PIL import Image

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/7/73/Shigenobu_Okuma_5.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/7/78/Akita_inu.jpeg",
]
images = [
    Image.open(io.BytesIO(requests.get(image_urls[0]).content)).resize((150, 240)),
    Image.open(io.BytesIO(requests.get(image_urls[1]).content)).resize((240, 150)),
]

image_embeddings = model.encode(images)
image_embeddings.shape
# (2, 1024)
```
```python
similarities = model.similarity(text_embeddings, image_embeddings)
similarities
# tensor([[0.2573, 0.0105],
#         [0.0282, 0.2982]])
```
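Each row scores one sentence against both images, and the dominant diagonal matches the expected pairing (portrait text with the portrait photo, dog text with the dog photo). For retrieval, the row-wise argmax picks the best image per text (a follow-up sketch reusing the variables above, assuming `model.similarity` returned the torch tensor shown):

```python
# For each text, pick the image with the highest cosine similarity.
best = similarities.argmax(dim=1)
for sentence, idx in zip(sentences, best.tolist()):
    print(f"{sentence} -> {image_urls[idx]}")
```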