
BRAHMAI-CLIP-v0.1

MODEL TYPE:

The base model uses a ViT-L/14 Transformer architecture as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained with a contrastive loss to maximize the similarity of matching image-text pairs. The original implementation offered two variants: one with a ResNet image encoder and one with a Vision Transformer; this repository contains the Vision Transformer variant.
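
To make the training objective concrete, the following is a schematic sketch of a CLIP-style symmetric contrastive loss in PyTorch. It is illustrative only: the function name and the temperature value are assumptions for this sketch, not the actual training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so dot products become cosine similarities
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: rows are images, columns are texts
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy pulls matching pairs together
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2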

DATE: June 2024

CODE:

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Define the model and processor
model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Load the image from URL
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Define the text descriptions
descriptions = ["a cat's photograph", "a dog's photograph"]

# Process the inputs
inputs = processor(text=descriptions, images=image, return_tensors="pt", padding=True)

# Get the outputs from the model
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image

# Calculate the label probabilities
probs = logits_per_image.softmax(dim=1)

# Print the results
print(probs)
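
For readability, the probabilities can be paired with their descriptions. This short addition is illustrative and not part of the original snippet.

# Pair each description with its probability (illustrative only)
for description, prob in zip(descriptions, probs[0].tolist()):
    print(f"{description}: {prob:.4f}")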

Model Use

Intended Use

The model is designed as a research tool for academic and research communities. It is intended to help researchers explore zero-shot, arbitrary image classification and to support interdisciplinary studies of the potential impacts of such models. The CLIP paper provides an example of these analyses by discussing potential downstream effects.
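
As a sketch of what zero-shot, arbitrary image classification looks like in practice, the snippet below builds text prompts from an arbitrary class taxonomy. The class names are hypothetical placeholders and can be swapped for whatever taxonomy a researcher wants to study; the model identifier and image URL are the same as in the code example above.

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Hypothetical, arbitrary class taxonomy chosen by the researcher
class_names = ["airplane", "bicycle", "bird", "boat"]
prompts = [f"a photo of a {name}" for name in class_names]

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.4f}")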

Primary Intended Users:

  • AI researchers.

We expect researchers to use this model to gain insights into the robustness, generalization, capabilities, biases, and constraints of computer vision models.

Out-of-Scope Use Cases

  • Deployed Use Cases: Any deployment of the model, whether commercial or not, is currently out of scope. Non-deployed uses, such as image search in a controlled environment, are also not advised unless there has been thorough in-domain testing with a specific, fixed class taxonomy. This caution is due to the variability in CLIP's performance with different class taxonomies, as highlighted in our safety assessment.

  • Surveillance and Facial Recognition: Use cases involving surveillance and facial recognition are always out of scope. The premature application of AI in these domains, given the current lack of testing norms and fairness checks, is potentially harmful.

  • Non-English Languages: The model has not been specifically trained or evaluated in languages other than English. Therefore, its use should be limited to English language applications.


Limitations

CLIP and our analysis of it have several limitations. The model currently struggles with tasks such as fine-grained classification and counting objects. Additionally, CLIP raises concerns regarding fairness and bias, which we discuss in the paper and briefly in the next section. An important limitation of our testing approach is the use of linear probes to evaluate CLIP's performance, as there is evidence suggesting that linear probes can underestimate model performance.
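
To make the linear-probe protocol concrete, the sketch below fits a logistic-regression classifier on frozen image embeddings. The dataset variables (train_images, train_labels, test_images, test_labels) are placeholders, and the use of scikit-learn here is an assumption for illustration, not the evaluation code from the paper.

import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPProcessor, CLIPModel

model_id = "brahmairesearch/brahmai-clip-v0.1"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

def embed(images):
    # Frozen CLIP image features; the encoder is not fine-tuned
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).cpu().numpy()

# train_images, train_labels, test_images, test_labels are placeholders
# for whatever labelled dataset the probe is evaluated on
X_train, X_test = embed(train_images), embed(test_images)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, train_labels)
print("Linear-probe accuracy:", probe.score(X_test, test_labels))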

Bias and Fairness

The performance and specific biases of CLIP can vary significantly based on class design and the choices made for including or excluding categories. We assessed the risk of certain types of denigration by classifying images of people from the Fairface dataset into crime-related and non-human animal categories. Significant disparities were found concerning race and gender, and these disparities could shift based on the class construction. Details of these findings are captured in the Broader Impacts section of the paper.

We also evaluated CLIP's performance on gender, race, and age classification using the Fairface dataset. For gender classification, we found accuracy above 96% across all races, with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). For racial classification, CLIP averaged around 93% accuracy, and for age classification, it averaged around 63% accuracy. Our evaluations of gender, race, and age classification, as well as denigration harms, are intended to assess the model's performance across different demographics and to highlight potential risks, rather than to endorse or promote such tasks.
