Titletor

This model is a fine-tuned version of microsoft/phi-1_5 on the zelalt/scientific-papers-3.5-withprompt dataset. It achieves the following results on the evaluation set:

  • Loss: 2.1587

Requirements

!pip install accelerate transformers einops datasets peft bitsandbytes

Test Dataset

If you prefer, you can use a test dataset such as zelalt/scientific-papers or zelalt/arxiv-papers, or read your own PDF as text with PyPDF2.PdfReader and pass that text to the LLM together with the prompt "What is the title of this paper?".
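For the PDF route, a minimal sketch using PyPDF2 (the paper.pdf path is a placeholder):

from PyPDF2 import PdfReader

# Read every page and join the extracted text.
reader = PdfReader("paper.pdf")  # placeholder path
pdf_text = "".join(page.extract_text() or "" for page in reader.pages)

# Only the opening of the paper is needed to locate the title.
prompt = f"What is the title of this paper? {pdf_text[:180]}\n\nAnswer: "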

from datasets import load_dataset

# Load the papers and rename the text column to 'text'.
test_dataset = load_dataset("zelalt/scientific-papers", split='train')
test_dataset = test_dataset.rename_column('full_text', 'text')

# Build the prompt from the first 180 characters of each paper.
def formatting(example):
    text = f"What is the title of this paper? {example['text'][:180]}\n\nAnswer: "
    return {'text': text}

formatted_dataset = test_dataset.map(formatting)
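To sanity-check the prompts, print one example (index 120 is the one reused in the sample code below):

print(formatted_dataset['text'][120])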

Sample Code


import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base phi-1_5 model and tokenizer, then attach the fine-tuned adapter.
peft_model_id = "zelalt/titletor-phi_1-5"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(model, peft_model_id)

# From the dataset
inputs = tokenizer(formatted_dataset['text'][120], return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)

# As a plain string
inputs = tokenizer("What is the title of this paper? ...[your pdf as text]..\n\nAnswer: ", return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
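The decoded text contains the prompt followed by the answer (see the Output section below), so a small post-processing step recovers just the title:

# Keep only the text after "Answer:" and drop the end-of-text token.
title = text.split("Answer:")[-1].replace("<|endoftext|>", "").strip()
print(title)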

Notes

  • After the first run has loaded the model and tokenizer, you can re-run only the generation part (see the sketch below) to avoid a RAM crash.
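Concretely, once the model and tokenizer are in memory, a later run only needs the generation calls:

# Re-run only this part; the model and tokenizer stay loaded.
inputs = tokenizer(formatted_dataset['text'][120], return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs)[0])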

Output

Input:

What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains\nFrançois Golse ∗ †\nEcole Polytechnique, CMLS\n91128 Palaiseau Cedex, France\nAlex Mahalov ‡and Basil Nicolaenko §\n\nAnswer:

Output from LLM:

What is the title of this paper? Bursting Dynamics of the 3D Euler Equations
in Cylindrical Domains
François Golse ∗ †
Ecole Polytechnique, CMLS
91128 Palaiseau Cedex, France
Alex Mahalov ‡and Basil Nicolaenko §

Answer:  Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>

Training and evaluation data

Training and validation dataset: zelalt/scientific-papers-3.5-withprompt

Training procedure

Training hyperparameters

  • total_train_batch_size: 8
  • lr_scheduler_type: cosine
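Only the two settings above are reported. A sketch of a matching TrainingArguments, in which every value other than those two is an assumption, not the card's actual config:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="titletor-phi_1-5",    # hypothetical output path
    per_device_train_batch_size=4,    # assumption: 4 x 2 accumulation = reported total of 8
    gradient_accumulation_steps=2,    # assumption
    lr_scheduler_type="cosine",       # reported
    learning_rate=2e-4,               # assumption
    num_train_epochs=1,               # assumption
)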

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0