---
library_name: transformers
license: apache-2.0
language:
- ru
base_model:
- t-tech/T-lite-it-1.0
pipeline_tag: text-generation
---
# T-lite-it-1.0_Q4_0
T-lite-it-1.0_Q4_0 is a quantized version of the **T-lite-it-1.0** model, originally based on the Qwen 2.5 7B architecture and fine-tuned for Russian-language tasks. This version is optimized for memory-constrained environments, making it suitable for fine-tuning and inference on GPUs with as little as **8GB VRAM**. The quantization was performed using **BitsAndBytes**, reducing the model to 4-bit precision.
## Model Description
- **Language:** Russian
- **Base Model:** T-lite-it-1.0 (derived from Qwen 2.5 7B)
- **Quantization:** 4-bit precision using `BitsAndBytes` (see the sketch after this list)
- **Tasks:** Text generation, conversation, question answering, and chain-of-thought reasoning
- **Fine-Tuning Ready:** Ideal for further fine-tuning in low-resource environments.
- **VRAM Requirements:** Fine-tuning and inference are possible with **8GB VRAM** or more.
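As a rough illustration, a 4-bit BitsAndBytes quantization of the base model could be produced as sketched below. The exact settings used for this checkpoint are not documented here, so the config values (`nf4`, double quantization, compute dtype) and the output directory name are assumptions.
```python
# Hypothetical sketch of quantizing the base model to 4-bit with bitsandbytes;
# the specific options below are assumptions, not the recorded settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

quantized = AutoModelForCausalLM.from_pretrained(
    "t-tech/T-lite-it-1.0",          # original full-precision base model
    quantization_config=bnb_config,
    device_map="auto",
)

# The quantized weights can then be saved locally or pushed to the Hub.
quantized.save_pretrained("T-lite-it-1.0_Q4_0")
```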
## Usage
To load the model, ensure you have the required dependencies installed:
```bash
pip install transformers bitsandbytes accelerate
```
Then, load the model with the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "MilyaShams/T-lite-it-1.0_Q4_0"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the checkpoint in 4-bit precision via bitsandbytes and let
# device_map="auto" place the layers on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```
## Fine-Tuning
The model is designed for fine-tuning under resource constraints. Use tools like Hugging Face's `Trainer` or `peft` (Parameter-Efficient Fine-Tuning) to adapt the model to specific tasks; a sketch follows the list below.
Example configuration for fine-tuning:
- Batch size: adjust to fit within 8GB VRAM (e.g., `batch_size=2`).
- Gradient accumulation: use to simulate larger effective batch sizes.
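A minimal LoRA-style fine-tuning sketch using `peft` and the Hugging Face `Trainer`. The hyperparameters, `target_modules`, output directory, and the training dataset are illustrative placeholders, not the values used for this model.
```python
# Hypothetical LoRA fine-tuning sketch; dataset, hyperparameters, and
# target_modules are illustrative, not the recorded training setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "MilyaShams/T-lite-it-1.0_Q4_0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Prepare the quantized model for training and attach small LoRA adapters
# so only a fraction of the parameters are updated.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Small per-device batches plus gradient accumulation keep memory use low
# while simulating a larger effective batch size.
args = TrainingArguments(
    output_dir="t-lite-lora",          # hypothetical output directory
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: supply your own tokenized dataset
)
trainer.train()
```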
## Inference
Perform inference with low latency:
```python
# Tokenize a Russian prompt ("Hello! How can I help?"), move it to the
# model's device, and generate a short reply.
input_text = "Привет! Как я могу помочь?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
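Since the base model is instruction-tuned, conversational prompts are normally formatted with the tokenizer's chat template. A minimal sketch, reusing the `model` and `tokenizer` loaded in the Usage section; the prompt text and generation settings are illustrative.
```python
# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [
    # "Write a short poem about spring."
    {"role": "user", "content": "Напиши короткое стихотворение о весне."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```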
## Model Card Authors
[Milyausha Shamsutdinova](https://github.com/MilyaushaShamsutdinova)