---
license: apache-2.0
library_name: transformers
tags:
- biology
datasets:
- kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks
- InstaDeepAI/nucleotide_transformer_downstream_tasks
language:
- en
pipeline_tag: text-generation
---

<img src="BIOTATA.png" alt="BIOTATA logo" width="800" style="margin-left:auto; margin-right:auto; display:block"/>

# Model Card for BioTATA 7B V1

<!-- Provide a quick summary of what the model is/does. -->

BioTATA 7B V1 is a hybrid model that merges BioMistral 7B DARE with a 4-bit QLoRA adapter trained on TATA/NO TATA sequences from the [InstaDeepAI nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks) dataset (promoters_all subset).

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Karim Akkari (kimou605)
- **Model type:** FP32
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** BioMistral 7B Dare

### Model Sources 

<!-- Provide the basic links for the model. -->

- **Repository:** [kimou605/BioTATA-7B](https://huggingface.co/kimou605/BioTATA-7B)
- **Demo:** [BioTATA 7B Space](https://huggingface.co/spaces/kimou605/BioTATA-7B)

## How to Get Started with the Model

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
!pip install transformers accelerate bitsandbytes
```

```python
import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
```

```python
model_name = "kimou605/BioTATA-7B"
model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Reuse the end-of-sequence token for padding and pad on the right
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```

```python
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True
```

```python
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
```

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
```

```python
# Wrap the already-loaded quantized model in a text-generation pipeline.
# The variable is named `pipe` so it does not shadow transformers.pipeline.
pipe = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,
)
```

```python
# Format the question with the model's chat template, then generate a reply
messages = [{"role": "user", "content": "What is TATA"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=200, do_sample=True, temperature=0.01, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

With this 4-bit configuration, inference needs about 4.8 GB of VRAM.

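
The VRAM figure above is consistent with a back-of-the-envelope, weight-only estimate. The sketch below assumes a round 7 billion parameters and models nothing except the weights; activations, the KV cache, any layers left unquantized, and CUDA overhead are ignored, which is presumably where the gap between the estimate and the reported 4.8 GB comes from. The function name and the parameter count are illustrative assumptions, not values taken from the model repository.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate in GB (1 GB = 10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: a round parameter count for a "7B" model.
N_PARAMS = 7.0e9

# Compare full precision, half precision, and 4-bit quantized weights.
for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {estimate_weight_memory_gb(N_PARAMS, bits):.1f} GB")
```
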
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model was developed to show how a medical LLM can be adapted to classify sequences as TATA/NO TATA.
The adapter was trained on 53.3k rows for only 1 epoch (due to hardware limitations).

THIS MODEL IS FOR RESEARCH PURPOSES ONLY. DO NOT USE IT IN PRODUCTION.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
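
One way to act on this recommendation is to score the model on labeled sequences before relying on its answers. The sketch below is a minimal, hypothetical scoring helper: this card does not document the model's output format, so the label strings (`"TATA"`, `"NO TATA"`) and the parsing rules in `parse_label` are assumptions that should be adjusted to the outputs actually observed.

```python
import re
from typing import Iterable, Optional


def parse_label(generated_text: str) -> Optional[str]:
    """Map free-form model output to 'TATA', 'NO TATA', or None if unclear.

    Assumption: the answer mentions one of the two labels in plain text.
    """
    text = generated_text.upper()
    # Check the negative label first, because "NO TATA" contains "TATA".
    if re.search(r"\bNO[\s_-]*TATA\b", text):
        return "NO TATA"
    if re.search(r"\bTATA\b", text):
        return "TATA"
    return None


def accuracy(predictions: Iterable[Optional[str]], references: Iterable[str]) -> float:
    """Fraction of predictions equal to the reference label (None counts as wrong)."""
    pairs = list(zip(predictions, references))
    if not pairs:
        raise ValueError("no examples to score")
    return sum(p == r for p, r in pairs) / len(pairs)
```

Pass only the newly generated text to `parse_label` (for example by calling the text-generation pipeline with `return_full_text=False`), since a prompt that itself mentions TATA would otherwise be matched. Phrasings such as "not a TATA promoter" are not handled by this simple pattern.
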



## Training Details

<img src="train1.png" alt="train1" width="800" style="margin-left:auto; margin-right:auto; display:block"/>

<img src="train2.png" alt="train2" width="800" style="margin-left:auto; margin-right:auto; display:block"/>

You can view the training report [here](https://wandb.ai/esprit-innovision/Fine%20tuning%20mistral%207B%20instadeep/reports/BioTATA--Vmlldzo3ODIwNTU3).

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

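[kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks)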

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Training Hyperparameters

- **Training regime:** BF16 with 4-bit quantization (QLoRA) <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times 

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- **Training time:** 7 h/epoch
- **Batch size per GPU:** 32
- **GPU:** NVIDIA A40, 45 GB VRAM


## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA A40
- **Hours used:** 11
- **Cloud Provider:** vast.ai
- **Compute Region:** Europe
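
As a rough illustration of how these values feed an emissions estimate, the sketch below multiplies energy use by grid carbon intensity. It is not the calculator's implementation. The 300 W figure is the A40's rated board power used as an upper bound, and the carbon intensity is a hypothetical placeholder, since this card reports neither measured power draw nor a grid value for the compute region.

```python
def estimate_emissions_kg(power_watts: float, hours: float,
                          intensity_kg_per_kwh: float, pue: float = 1.0) -> float:
    """Estimate emissions in kg CO2eq as energy (kWh) times grid carbon intensity.

    pue is the data-center power usage effectiveness (1.0 means no overhead).
    """
    energy_kwh = power_watts / 1000.0 * hours * pue
    return energy_kwh * intensity_kg_per_kwh


A40_MAX_POWER_W = 300.0  # assumption: rated board power, used as an upper bound
HOURS_USED = 11.0        # from the list above
INTENSITY = 0.3          # hypothetical placeholder (kg CO2eq/kWh); replace with the real grid value

emissions = estimate_emissions_kg(A40_MAX_POWER_W, HOURS_USED, INTENSITY)
print(f"{emissions:.2f} kg CO2eq (upper-bound sketch)")
```
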



## Model Card Contact

Karim Akkari (kimou605)