Model description

iGPT-fr 🇫🇷 is a pre-trained incremental GPT language model for French, developed by the Laboratoire de Linguistique Formelle (LLF). We adapted the GPT-fr 🇫🇷 model to generate images conditioned on text inputs.

Intended uses & limitations

The model can be used for text-to-image generation tasks. It is still under development.

How to use

The model can be used through the 🤗 Transformers library. You will also need to install the Taming Transformers library for high-resolution image synthesis:

pip install git+https://github.com/CompVis/taming-transformers.git

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
from taming.models import vqgan
import torch
from PIL import Image
import numpy as np

# Load VQGAN model
vqgan_ckpt = hf_hub_download(repo_id="boris/vqgan_f16_16384", filename="model.ckpt", force_download=False)
vqgan_config = hf_hub_download(repo_id="boris/vqgan_f16_16384", filename="config.yaml", force_download=False)

config = OmegaConf.load(vqgan_config)
vqgan_model = vqgan.VQModel(**config.model.params)
vqgan_model.eval().requires_grad_(False)
vqgan_model.init_from_ckpt(vqgan_ckpt)

# Load pretrained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPT2LMHeadModel.from_pretrained("asi/igpt-fr-cased-base").to(device)
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("asi/igpt-fr-cased-base")

# Generate a sample image from a text prompt
input_sentence = "Une carte de l'europe"
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')
input_ids = torch.cat((input_ids, torch.tensor([[50000]])), 1)  # Add image generation token

greedy_output = model.generate(
  input_ids.to(device), 
  max_length=256+input_ids.shape[1],
  do_sample=True, 
  top_p=0.92, 
  top_k=0)

def custom_to_pil(x):
  x = x.detach().cpu()
  x = torch.clamp(x, -1., 1.)
  x = (x + 1.)/2.
  x = x.permute(1,2,0).numpy()
  x = (255*x).astype(np.uint8)
  x = Image.fromarray(x)
  if not x.mode == "RGB":
    x = x.convert("RGB")
  return x

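# Map generated token ids to VQGAN codebook indices (moved to CPU, where the VQGAN lives)
z_idx = greedy_output[0, input_ids.shape[1]:].cpu() - 50001
z_quant = vqgan_model.quantize.get_codebook_entry(z_idx, shape=(1, 16, 16, 256))
x_rec = vqgan_model.decode(z_quant).to('cpu')[0]
display(custom_to_pil(x_rec))  # display() is available in Jupyter/IPython notebooks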
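
The snippet above subtracts 50001 from every generated token id and assumes that exactly 256 image tokens came back. Sampling can in principle stop early or emit a text token, in which case the codebook lookup fails or returns a wrong entry. The following is a minimal sketch of a sanity check, not part of the released code: the helper name `to_codebook_indices` is hypothetical, and the codebook size of 16384 is an assumption inferred from the checkpoint name `vqgan_f16_16384`.

```python
IMAGE_START_ID = 50000      # image generation token appended to the prompt above
FIRST_CODE_ID = 50001       # offset subtracted from generated ids above
CODEBOOK_SIZE = 16384       # assumption: inferred from the name "vqgan_f16_16384"
NUM_IMAGE_TOKENS = 16 * 16  # 16x16 latent grid, as in shape=(1, 16, 16, 256)


def to_codebook_indices(generated_ids, prompt_len):
    """Map generated token ids to VQGAN codebook indices, or raise ValueError.

    generated_ids: flat list of ints (prompt ids followed by generated ids).
    prompt_len: number of prompt ids, including the image generation token.
    """
    image_ids = generated_ids[prompt_len:]
    if len(image_ids) != NUM_IMAGE_TOKENS:
        raise ValueError(
            f"expected {NUM_IMAGE_TOKENS} image tokens, got {len(image_ids)}"
        )
    indices = [token_id - FIRST_CODE_ID for token_id in image_ids]
    out_of_range = [i for i in indices if not 0 <= i < CODEBOOK_SIZE]
    if out_of_range:
        raise ValueError(
            f"{len(out_of_range)} generated tokens fall outside the codebook"
        )
    return indices
```

With the variables from the snippet above, it could be called as `to_codebook_indices(greedy_output[0].tolist(), input_ids.shape[1])`, and a failed sample can simply be regenerated.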

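You can also generate several candidate images and keep those that CLIP scores as closest to the prompt. The French prompt is first translated to English before being passed to CLIP: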

from tqdm import tqdm

def hallucinate(prompt, num_images=64):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    input_ids = torch.cat((input_ids, torch.tensor([[50000]])), 1).to(device)  # Add image generation token

    all_images = []
    for i in tqdm(range(num_images)):
        greedy_output = model.generate(
          input_ids.to(device), 
          max_length=256+input_ids.shape[1],
          do_sample=True,
          top_p=0.92, 
          top_k=0)

        z_idx = greedy_output[0, input_ids.shape[1]:].cpu() - 50001  # VQGAN runs on CPU
        z_quant = vqgan_model.quantize.get_codebook_entry(z_idx, shape=(1, 16, 16, 256))
        x_rec = vqgan_model.decode(z_quant).to('cpu')[0]
        all_images.append(custom_to_pil(x_rec))
    return all_images

input_sentence = "Une carte de l'europe"
all_images = hallucinate(input_sentence)

from transformers import pipeline

opus_model = "Helsinki-NLP/opus-mt-fr-en"
opus_translator = pipeline("translation", model=opus_model)

opus_translator(input_sentence)

from transformers import CLIPProcessor, CLIPModel

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_top_k(prompt, images, k=8):
  # Translate the French prompt to English before scoring with CLIP
  prompt_en = opus_translator(prompt)[0]['translation_text']
  inputs = clip_processor(text=prompt_en, images=images, return_tensors="pt", padding=True)
  with torch.no_grad():
    outputs = clip_model(**inputs)
  logits = outputs.logits_per_text  # text-image similarity scores, shape (1, num_images)
  top_indices = logits[0].cpu().numpy().argsort()[-k:][::-1]  # k best images, best first
  return [images[i] for i in top_indices]

filtered_images = clip_top_k(input_sentence, all_images)

for fi in filtered_images:
  display(fi)
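
The ranking step inside `clip_top_k` can be checked without loading any model. The sketch below reproduces the "k highest scores, best first" selection in plain Python; the helper name `top_k_indices` and the score values are made up for illustration.

```python
def top_k_indices(scores, k=8):
    """Return the indices of the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Made-up similarity scores for four candidate images
scores = [21.3, 25.1, 19.8, 27.4]
best = top_k_indices(scores, k=2)
print(best)
```

If `k` is larger than the number of images, all images are returned in ranked order.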

Training data

We created a dedicated corpus to train our generative model. The training corpus consists of text-image pairs, aggregated from portions of two existing corpora: LAION-5B and WIT. The final dataset includes 10,807,534 samples.

Training procedure

We pre-trained the model on the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. Training took a total of 140 hours of computation on Tesla V100 hardware (TDP of 300 W) and was distributed over 8 compute nodes of 8 GPUs each. We used data parallelism to split each micro-batch across the computing units. We estimated the total emissions at 1161.22 kgCO2eq, using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
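
As a rough cross-check of the emissions figure, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not the calculator's actual implementation. It assumes that the 140 hours are wall-clock time during which all 64 GPUs draw their full TDP, and it uses a carbon intensity of 0.432 kgCO2eq/kWh, which is not stated in the text and is chosen here because it reproduces the reported total.

```python
NUM_NODES = 8
GPUS_PER_NODE = 8
GPU_POWER_KW = 0.300      # TDP of 300 W per GPU, from the text
TRAINING_HOURS = 140      # assumption: wall-clock hours for the whole job
CARBON_INTENSITY = 0.432  # kgCO2eq per kWh; assumption, not stated in the text

num_gpus = NUM_NODES * GPUS_PER_NODE
energy_kwh = num_gpus * GPU_POWER_KW * TRAINING_HOURS
emissions_kg = energy_kwh * CARBON_INTENSITY

print(f"{num_gpus} GPUs, {energy_kwh:.0f} kWh, {emissions_kg:.2f} kgCO2eq")
```

Under these assumptions the job uses 64 GPUs and about 2688 kWh. The estimate ignores CPU, cooling, and other datacentre overheads.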
