---
license: apache-2.0
library_name: transformers
tags:
- biology
datasets:
- kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks
- InstaDeepAI/nucleotide_transformer_downstream_tasks
language:
- en
pipeline_tag: text-generation
---
<img src="BIOTATA.png" alt="BIOTATA logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
# Model Card for Model ID
<!-- Provide a quick summary of what the model is/does. -->
BioTATA 7B V1 is a hybrid model created by merging BioMistral 7B DARE with a 4-bit QLoRA adapter trained on TATA/no-TATA promoter sequences from the [InstaDeepAI nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks) dataset (promoters_all subset).
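This repository hosts the merged checkpoint. For reference, a minimal sketch of how a QLoRA adapter can be folded back into its base model with `peft` looks roughly as follows; the base model ID and adapter path below are illustrative assumptions, not the exact artifacts used to build BioTATA:

```python
# Hedged sketch: merging a LoRA/QLoRA adapter into its base model with peft.
# "BioMistral/BioMistral-7B-DARE" and "path/to/tata-qlora-adapter" are assumed
# placeholders, not confirmed by this model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "BioMistral/BioMistral-7B-DARE"
adapter_path = "path/to/tata-qlora-adapter"

# Load the base model in half precision so the merged weights can be saved.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the adapter, then fold its low-rank updates into the base weights.
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

merged.save_pretrained("BioTATA-7B-merged")
tokenizer.save_pretrained("BioTATA-7B-merged")
```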
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Karim Akkari (kimou605)
- **Model type:** FP32
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** BioMistral 7B Dare
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [kimou605/BioTATA-7B](https://huggingface.co/kimou605/BioTATA-7B)
- **Demo:** [BioTATA 7B Space](https://huggingface.co/spaces/kimou605/BioTATA-7B)
## How to Get Started with the Model
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
!pip install transformers
!pip install accelerate
!pip install bitsandbytes
```
```python
import os
import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
)
```
```python
model_name = 'kimou605/BioTATA-7B'
model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```
```python
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True
```
```python
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
```
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
```
```python
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,
)
```
```python
messages = [{"role": "user", "content": "What is TATA"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=200, do_sample=True, temperature=0.01, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
Running inference this way uses roughly 4.8 GB of VRAM.
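To check the memory footprint on your own hardware, a small sketch using PyTorch's CUDA memory statistics (assuming a single-GPU setup and the `tokenizer` and `pipeline` defined above) is:

```python
# Hedged sketch: measuring peak GPU memory for one generation pass.
# Assumes a single CUDA device; numbers vary with drivers, batch size,
# and generation length.
import torch

torch.cuda.reset_peak_memory_stats()

messages = [{"role": "user", "content": "What is TATA"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_ = pipeline(prompt, max_new_tokens=200, do_sample=True, temperature=0.01, top_k=50, top_p=0.95)

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.2f} GB")
```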
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
This model was developed to show how a medical LLM can adapt to classify sequences as TATA or no-TATA.
The adapter was trained on 53.3k rows for only 1 epoch (due to hardware limitations).
THIS MODEL IS FOR RESEARCH PURPOSES ONLY. DO NOT USE IN PRODUCTION.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
## Training Details
<img src="train1.png" alt="train1" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
<img src="train2.png" alt="train2" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
You can view the training report [here](https://wandb.ai/esprit-innovision/Fine%20tuning%20mistral%207B%20instadeep/reports/BioTATA--Vmlldzo3ODIwNTU3).
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks), derived from the promoters_all subset of [InstaDeepAI/nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks).
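A minimal sketch for inspecting the fine-tuning data with the `datasets` library (the split name and column layout are assumptions; check the dataset card for the exact schema):

```python
# Hedged sketch: loading the fine-tuning dataset and inspecting its schema.
# The split names and fields printed here are whatever the dataset exposes;
# consult the dataset card for the authoritative layout.
from datasets import load_dataset

ds = load_dataset("kimou605/TATA-NOTATA-FineMistral-nucleotide_transformer_downstream_tasks")

print(ds)                  # available splits and row counts
split = next(iter(ds))     # first available split, e.g. "train"
print(ds[split].features)  # column names and types
print(ds[split][0])        # a single example
```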
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Training Hyperparameters
- **Training regime:** 4-bit QLoRA with BF16 compute <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
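The exact hyperparameters are not reported here, but a hedged sketch of what a 4-bit/BF16 QLoRA setup typically looks like is shown below; the rank, alpha, dropout, and target modules are illustrative assumptions, not the configuration used for BioTATA.

```python
# Hedged sketch: a typical 4-bit QLoRA setup with BF16 compute.
# LoRA rank, alpha, dropout, and target modules are illustrative assumptions only.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="biotata-qlora",
    per_device_train_batch_size=32,   # batch size per GPU, as reported below
    num_train_epochs=1,               # one epoch, per the limitations above
    bf16=True,                        # BF16 mixed precision
)
```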
#### Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- ~7 h per epoch
- Batch size per GPU: 32
- GPU: NVIDIA A40 (45 GB VRAM)
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** NVIDIA A40
- **Hours used:** 11 hours
- **Cloud Provider:** vast.ai
- **Compute Region:** Europe
## Model Card Contact
Karim Akkari (kimou605)