Issues Generating Paired Sequences
#1 by ollieturnbull - opened
Hi,
I'm trying to generate paired sequences using the sampling parameters described in the paper, but generation stops after the heavy chain (i.e. an <|endoftext|> token is emitted). How can I ensure that full paired sequences are generated?
Thanks,
Ollie
Generation code:
sample = model.generate(
    tokenizer("Ḣ", return_tensors='pt')['input_ids'][:, :-1],
    max_new_tokens=256,
    top_p=0.95,
    temperature=1.0,
    do_sample=True,
)
Thanks Ollie - apologies for any confusion from the paper (which was benchmarked using FAbCon-large). For this, you can either append the light-chain token type after the heavy-chain generation, or use a LogitsProcessor. Hope that helps!
ideasbyjin changed discussion status to closed
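For the LogitsProcessor route mentioned above, here is a minimal sketch. It assumes the light-chain start token is the single token "Ḷ" (looked up via tokenizer.convert_tokens_to_ids) and works by masking <|endoftext|> until that token has appeared, so sampling cannot stop after the heavy chain alone. The RequireLightChainProcessor class and the max_new_tokens budget are illustrative assumptions, not confirmed FAbCon behaviour.

import torch
from transformers import (PreTrainedTokenizerFast, FalconForCausalLM,
                          LogitsProcessor, LogitsProcessorList)

tokenizer = PreTrainedTokenizerFast.from_pretrained("alchemab/fabcon-medium")
model = FalconForCausalLM.from_pretrained("alchemab/fabcon-medium")

class RequireLightChainProcessor(LogitsProcessor):
    """Masks the end-of-text token until the light-chain token has been
    generated, preventing early stops after the heavy chain alone."""

    def __init__(self, light_token_id, eos_token_id):
        self.light_token_id = light_token_id
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids, scores):
        # For each sequence in the batch, forbid <|endoftext|> while the
        # light-chain token has not yet been sampled
        for i in range(input_ids.shape[0]):
            if (input_ids[i] == self.light_token_id).sum() == 0:
                scores[i, self.eos_token_id] = float("-inf")
        return scores

# "Ḷ" as the light-chain token string is an assumption based on this thread
light_id = tokenizer.convert_tokens_to_ids("Ḷ")
processors = LogitsProcessorList(
    [RequireLightChainProcessor(light_id, tokenizer.eos_token_id)]
)
sample = model.generate(
    tokenizer("Ḣ", return_tensors='pt')['input_ids'][:, :-1],
    max_new_tokens=512,  # budget for both chains
    top_p=0.95,
    temperature=1.0,
    do_sample=True,
    logits_processor=processors,
)

Blocking the stop token is gentler than forcing "Ḷ" outright: the model still chooses where the heavy chain ends, it just cannot terminate before a light chain exists.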
Here’s the code I used. Hope it helps!
from transformers import PreTrainedTokenizerFast, FalconForCausalLM
import torch

# Load tokenizer and model
tokenizer = PreTrainedTokenizerFast.from_pretrained("alchemab/fabcon-medium")
model = FalconForCausalLM.from_pretrained("alchemab/fabcon-medium")

# Generate the heavy-chain sequence from the heavy-chain start token.
# Note: if the tokenizer appends an end-of-text token to encoded prompts
# (the original post slices it off with [:, :-1]), you may want to drop
# it here and from the "Ḷ" encoding below as well.
heavy_input = tokenizer("Ḣ", return_tensors='pt')['input_ids']
heavy_output = model.generate(
    heavy_input,
    max_new_tokens=256,
    do_sample=True,
    top_k=50,
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id
)

# Append the light-chain start token and generate the light-chain sequence
light_input = torch.cat([heavy_output, tokenizer("Ḷ", return_tensors='pt')['input_ids']], dim=-1)
full_output = model.generate(
    light_input,
    max_new_tokens=256,
    do_sample=True,
    top_k=50,
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the full paired sequence, keeping the special tokens visible
decoded_seq = tokenizer.decode(full_output[0], skip_special_tokens=False)
print(decoded_seq)
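If you then want the two chains as separate strings, one option (assuming "Ḣ", "Ḷ", and <|endoftext|> are the only special tokens in the decoded output, and that decode does not insert spaces between residues) is to split on the light-chain token:

# Split the decoded string at the light-chain token and strip the markers
heavy_part, light_part = decoded_seq.split("Ḷ", 1)
heavy_seq = heavy_part.replace("Ḣ", "").replace("<|endoftext|>", "").strip()
light_seq = light_part.replace("<|endoftext|>", "").strip()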