|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- ru |
|
base_model: |
|
- t-tech/T-lite-it-1.0 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# T-lite-it-1.0_Q4_0 |
|
|
|
|
|
|
T-lite-it-1.0_Q4_0 is a quantized version of the **T-lite-it-1.0** model, originally based on the Qwen 2.5 7B architecture and fine-tuned for Russian-language tasks. This version is optimized for memory-constrained environments, making it suitable for fine-tuning and inference on GPUs with as little as **8GB VRAM**. The quantization was performed using **BitsAndBytes**, reducing the model to 4-bit precision. |
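
As a rough illustration, a 4-bit BitsAndBytes quantization like this one can be reproduced along the following lines. This is a minimal sketch, not the exact procedure used to build this checkpoint, and saving the 4-bit weights assumes sufficiently recent `transformers`/`bitsandbytes` releases:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the original T-lite-it-1.0 weights directly in 4-bit precision.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "t-tech/T-lite-it-1.0",
    quantization_config=bnb_config,
    device_map="auto",
)

# Serializing 4-bit weights requires a recent bitsandbytes/transformers stack.
model.save_pretrained("T-lite-it-1.0_Q4_0")
```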
|
|
|
## Model Description |
|
|
|
|
|
|
- **Language:** Russian |
|
- **Base Model:** T-lite-it-1.0 (derived from Qwen 2.5 7B)
|
- **Quantization:** 4-bit precision using `BitsAndBytes` |
|
- **Tasks:** Text generation, conversation, question answering, and chain-of-thought reasoning |
|
- **Fine-Tuning Ready:** Ideal for further fine-tuning in low-resource environments

- **VRAM Requirements:** Fine-tuning and inference possible with **8GB VRAM** or more
|
|
|
|
|
## Usage |
|
|
|
To load the model, ensure you have the required dependencies installed: |
|
```bash
# accelerate is required for device_map="auto"
pip install transformers bitsandbytes accelerate
```
|
|
|
Then, load the model with the following code: |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "MilyaShams/T-lite-it-1.0_Q4_0"

# Request 4-bit loading via BitsAndBytesConfig (preferred over passing
# load_in_4bit=True directly to from_pretrained).
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # dispatches layers automatically; requires accelerate
)
```
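
Once loaded, the model can be used like any other `transformers` causal LM. A minimal generation sketch, assuming the tokenizer ships the Qwen-style chat template (the prompt is only an illustration):

```python
# Build a chat prompt with the model's chat template.
# The example prompt means "Hi! Tell me about yourself."
messages = [{"role": "user", "content": "Привет! Расскажи о себе."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a short reply and decode it.
output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```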
|
|
|
## Fine-Tuning |
|
|
|
The model is designed for fine-tuning under resource constraints. Use tools such as Hugging Face's `Trainer` or `peft` (Parameter-Efficient Fine-Tuning) to adapt it to specific tasks; a minimal LoRA sketch is shown after the configuration notes below.
|
|
|
Example configuration for fine-tuning: |
|
|
|
- **Batch size:** Adjust to fit within 8GB VRAM (e.g., `batch_size=2`).

- **Gradient accumulation:** Use it to simulate a larger effective batch size.
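
As a rough illustration, a QLoRA-style setup with `peft` under an ~8GB budget might look like the following. The adapter targets and hyperparameters are illustrative placeholders, not tuned values:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

# Prepare the 4-bit model for parameter-efficient training.
model = prepare_model_for_kbit_training(model)

# LoRA adapter configuration; rank and target modules are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Small per-device batch plus gradient accumulation to stay within ~8GB VRAM.
training_args = TrainingArguments(
    output_dir="t-lite-it-1.0-qlora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)
```

These arguments can then be passed to `Trainer` together with a tokenized dataset.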
|
|
|
<!-- ## Inference |
|
|
|
Perform inference with low latency: |
|
|
|
```python |
|
input_text = "Привет! Как я могу помочь?" |
|
inputs = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
output = model.generate(**inputs, max_new_tokens=50) |
|
print(tokenizer.decode(output[0], skip_special_tokens=True)) |
|
``` --> |
|
|
|
|
|
## Model Card Authors |
|
|
|
[Milyausha Shamsutdinova](https://github.com/MilyaushaShamsutdinova) |