|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- ru |
|
base_model: |
|
- t-tech/T-lite-it-1.0 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# T-lite-it-1.0_Q4_0 |
|
|
|
|
|
|
T-lite-it-1.0_Q4_0 is a quantized version of the **T-lite-it-1.0** model, originally based on the Qwen 2.5 7B architecture and fine-tuned for Russian-language tasks. This version is optimized for memory-constrained environments, making it suitable for fine-tuning and inference on GPUs with as little as **8GB VRAM**. The quantization was performed using **BitsAndBytes**, reducing the model to 4-bit precision. |
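
As a rough illustration, a 4-bit BitsAndBytes quantization like this one can be reproduced along the following lines. This is a minimal sketch, not the exact procedure used to build this checkpoint, and saving the 4-bit weights assumes sufficiently recent `transformers`/`bitsandbytes` releases:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the original T-lite-it-1.0 weights directly in 4-bit precision.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "t-tech/T-lite-it-1.0",
    quantization_config=bnb_config,
    device_map="auto",
)

# Serializing 4-bit weights requires a recent bitsandbytes/transformers stack.
model.save_pretrained("T-lite-it-1.0_Q4_0")
```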
|
|
|
## Model Description |
|
|
|
|
|
|
- **Language:** Russian |
|
- **Base Model:** T-lite-it-1.0 (derived from Qwen 2.5 7B)
|
- **Quantization:** 4-bit precision using `BitsAndBytes` |
|
- **Tasks:** Text generation, conversation, question answering, and chain-of-thought reasoning |
|
- **Fine-Tuning Ready:** Ideal for further fine-tuning in low-resource environments

- **VRAM Requirements:** Fine-tuning and inference possible with **8GB VRAM** or more
|
|
|
|
|
## Usage |
|
|
|
To load the model, ensure you have the required dependencies installed: |
|
```bash
# accelerate is required for device_map="auto"
pip install transformers bitsandbytes accelerate
```
|
|
|
Then, load the model with the following code: |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "MilyaShams/T-lite-it-1.0_Q4_0"

# Request 4-bit loading via BitsAndBytesConfig (preferred over passing
# load_in_4bit=True directly to from_pretrained).
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # dispatches layers automatically; requires accelerate
)
```
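
Once loaded, the model can be used like any other `transformers` causal LM. A minimal generation sketch, assuming the tokenizer ships the Qwen-style chat template (the prompt is only an illustration):

```python
# Build a chat prompt with the model's chat template.
# The example prompt means "Hi! Tell me about yourself."
messages = [{"role": "user", "content": "Привет! Расскажи о себе."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a short reply and decode it.
output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```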
|
|
|
## Fine-Tuning |
|
|
|
The model is designed for fine-tuning under resource constraints. Use tools such as Hugging Face's `Trainer` or `peft` (Parameter-Efficient Fine-Tuning) to adapt it to specific tasks; a minimal LoRA sketch is shown after the configuration notes below.
|
|
|
Example configuration for fine-tuning: |
|
|
|
- **Batch size:** Adjust to fit within 8GB VRAM (e.g., `batch_size=2`).

- **Gradient accumulation:** Use it to simulate a larger effective batch size.
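
As a rough illustration, a QLoRA-style setup with `peft` under an ~8GB budget might look like the following. The adapter targets and hyperparameters are illustrative placeholders, not tuned values:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

# Prepare the 4-bit model for parameter-efficient training.
model = prepare_model_for_kbit_training(model)

# LoRA adapter configuration; rank and target modules are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Small per-device batch plus gradient accumulation to stay within ~8GB VRAM.
training_args = TrainingArguments(
    output_dir="t-lite-it-1.0-qlora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)
```

These arguments can then be passed to `Trainer` together with a tokenized dataset.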
|
|
|
<!-- ## Inference |
|
|
|
Perform inference with low latency: |
|
|
|
```python |
|
input_text = "Привет! Как я могу помочь?" |
|
inputs = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
output = model.generate(**inputs, max_new_tokens=50) |
|
print(tokenizer.decode(output[0], skip_special_tokens=True)) |
|
``` --> |
|
|
|
|
|
## Model Card Authors |
|
|
|
[Milyausha Shamsutdinova](https://github.com/MilyaushaShamsutdinova) |