Text Generation
Transformers
Safetensors
English
mistral
genomics
medical
conversational
text-generation-inference
Inference Endpoints
BioTATA-7B / README.md
kimou605's picture
Update README.md
60460d4 verified
metadata
license: apache-2.0
library_name: transformers
tags:
  - genomics
  - medical
datasets:
  - kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks
  - InstaDeepAI/nucleotide_transformer_downstream_tasks
language:
  - en
pipeline_tag: text-generation
BIOTATA logo

⚠️ STATE OF THE ART ⚠️

Model Card for Model ID

BioTATA 7B V1 is a hybrid model merged between BioMistral 7B Dare and a 4bit QLORA adapter trained on TATA/NO TATA sequences from InstaDeepAI nucleotide_transformer_downstream_tasks dataset (promoters_all subset)

Model Details

Model Description

  • Developed by: Karim Akkari (kimou605)
  • Model type: FP32
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: BioMistral 7B Dare

Model Sources

How to Get Started with the Model

!pip install transformers
!pip install  accelerate
!pip install bitsandbytes
import os
import torch
import transformers
from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  BitsAndBytesConfig,
  pipeline
)
model_name='kimou605/BioTATA-7B'
model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,

)
messages = [{"role": "user", "content": "What is TATA"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=200, do_sample=True, temperature=0.01, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

This will inference the model on 4.8GB Vram

Bias, Risks, and Limitations

This model has been developped to show how can a medical LLM adapt itself to identify sequences as TATA/NO TATA The adapter has been trained on a 53.3k rows for only 1 epoch (due to hardware limitations)

THIS MODEL IS FOR RESEARCH PURPOSES DO NOT USE IN PRODUCTION

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Training Details

train1 train2

You can view training report here.

Training Data

kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks

Training Procedure

Training Hyperparameters

  • Training regime: BF16 4bits

Speeds, Sizes, Times

7h/ epoch batch_per_gpu 32 GPU: NVIDIA A40 45GB Vram

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA A40
  • Hours used: 11H
  • Cloud Provider: vast.ai
  • Compute Region: Europe

Model Card Contact

Karim Akkari (kimou605)