|
--- |
|
license: cc-by-nc-4.0 |
|
base_model: Qwen/Qwen2-7B-Instruct |
|
model-index: |
|
- name: Dolphin |
|
results: [] |
|
tags: |
|
- RAG |
|
- on-device language model |
|
- Retrieval Augmented Generation |
|
inference: false |
|
spaces: false |
|
language: |
|
- en |
|
--- |
|
# Dolphin: Long Context as a New Modality for On-Device RAG
|
|
|
<p align="center"> |
|
- <a href="https://www.nexaai.com/models" target="_blank">Nexa Model Hub</a> |
|
- <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a> |
|
</p> |
|
|
|
<p align="center" width="100%"> |
|
<a><img src="logo.png" alt="nexa-octopus" style="width: 30%; min-width: 300px; display: block; margin: auto;"></a> |
|
</p> |
|
|
|
## Overview |
|
Dolphin is a novel approach that accelerates language model inference by treating long context as a new modality, analogous to the image, audio, and video modalities in vision-language models. It uses a small language encoder to compress context information into embeddings, applying multimodal-model concepts to make long-context inference more efficient. Model highlights:
|
- 🧠 Context as a distinct modality |
|
- 🗜️ Language encoder for context compression |
|
- 🔗 Multimodal techniques applied to language processing |
|
- ⚡ Optimized for energy efficiency and on-device use |
|
- 📜 Specialized for long context understanding |
|
|
|
## Model Architecture |
|
Dolphin employs a decoder-decoder framework with two main components:

1. A smaller decoder (0.5B parameters) that compresses information from long contexts into embeddings

2. A larger decoder (7B parameters) that understands and generates responses to the current query

A projector aligns the context embeddings produced by the smaller decoder with the embedding space of the main decoder; a minimal sketch follows the diagram below.
|
|
|
![Model Architecture](modelstructure.jpg) |
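To illustrate the decoder-decoder pattern, the following minimal PyTorch sketch shows how projected context embeddings could be concatenated with query embeddings before entering the main decoder. The module names and hidden sizes are assumptions for exposition (896 and 3584 match the Qwen2 0.5B and 7B hidden sizes), not the released implementation:

```python
import torch
import torch.nn as nn

class ContextProjector(nn.Module):
    """Illustrative projector: maps context-encoder hidden states
    into the main decoder's embedding space (dimensions assumed)."""

    def __init__(self, encoder_dim: int = 896, decoder_dim: int = 3584):
        super().__init__()
        # Two-layer MLP projector, a common choice in vision-language models.
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, context_hidden: torch.Tensor) -> torch.Tensor:
        # context_hidden: (batch, num_context_tokens, encoder_dim)
        return self.proj(context_hidden)

# Toy tensors standing in for real model outputs.
projector = ContextProjector()
context_hidden = torch.randn(1, 128, 896)   # hidden states from the 0.5B context decoder
query_embeds = torch.randn(1, 32, 3584)     # embedded query tokens for the 7B decoder
decoder_inputs = torch.cat([projector(context_hidden), query_embeds], dim=1)
print(decoder_inputs.shape)  # torch.Size([1, 160, 3584])
```

Because the context is consumed as a short sequence of embeddings rather than thousands of raw tokens, the expensive 7B decoder attends over far fewer positions, which is where the efficiency gain comes from.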
|
|
|
## Running the Model |
|
|
|
```python |
|
import time

from transformers import AutoTokenizer

# DolphinForCausalLM is provided by configuration_dolphin.py in the model
# repository; keep that file alongside this script (trust_remote_code fetches
# the rest of the custom code).
from configuration_dolphin import DolphinForCausalLM

# Load the tokenizer and model; trust_remote_code is required for the
# custom Dolphin architecture.
tokenizer = AutoTokenizer.from_pretrained('nexa-collaboration/dolphin_instruct_1M_0805', trust_remote_code=True)
model = DolphinForCausalLM.from_pretrained('nexa-collaboration/dolphin_instruct_1M_0805', trust_remote_code=True)

def inference(input_text):
    # Tokenize the prompt, generate up to 100 new tokens, and decode.
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
|
|
|
input_text = "Take a selfie for me with front camera" |
|
nexa_query = f"Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: {input_text} \n\nResponse:" |
|
|
|
start_time = time.time() |
|
result = inference(nexa_query) |
|
print("Dolphin model result:\n", result) |
|
print("Latency:", time.time() - start_time, "s") |
|
``` |
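For retrieval-augmented use, a retrieved document can be included alongside the query. The prompt template below is a hypothetical example for illustration, not the documented training format; see the Nexa Model Hub page for canonical usage:

```python
# Hypothetical RAG-style prompt; the template wording is an assumption.
retrieved_context = "..."  # long passage returned by your retriever
question = "What does the passage say about energy efficiency?"
rag_query = (
    "Below is a query from the user, please answer it based on the given "
    f"context.\n\nContext: {retrieved_context}\n\nQuery: {question}\n\nResponse:"
)
print("Dolphin model result:\n", inference(rag_query))
```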
|
|
|
## Training Process |
|
Dolphin's training involves three stages: |
|
1. Restoration Training: Reconstructing original context from compressed embeddings |
|
2. Continual Training: Generating context continuations from partial compressed contexts |
|
3. Instruction Fine-tuning: Generating responses to queries given compressed contexts |
|
|
|
This multi-stage approach progressively enhances the model's ability to handle long contexts and generate appropriate responses. |
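To make the three objectives concrete, the sketch below builds illustrative (input, target) pairs for each stage. The `<ctx>` placeholder stands in for the compressed context embeddings; the token name and formatting are assumptions, not the released training format:

```python
# Illustrative construction of training pairs for the three stages.
# "<ctx>" stands for the compressed context embeddings produced by the
# 0.5B encoder; the token name is an assumption for exposition.
context = "Dolphin treats long context as a separate modality, compressed into embeddings."
split = len(context) // 2
query = "How does Dolphin handle long context?"
answer = "It compresses the context into embeddings with a small decoder."

# Stage 1, restoration training: reconstruct the original context
# from its compressed embeddings.
restoration = {"input": "<ctx>", "target": context}

# Stage 2, continual training: continue the text given a partially
# compressed context (only context[:split] is compressed into <ctx>).
continual = {"input": "<ctx>", "target": context[split:]}

# Stage 3, instruction fine-tuning: answer a query given the
# compressed context.
instruction = {"input": f"<ctx>\n\nQuery: {query}\n\nResponse:", "target": answer}

for name, pair in [("restoration", restoration), ("continual", continual),
                   ("instruction", instruction)]:
    print(f"{name}: input={pair['input']!r} target={pair['target'][:40]!r}")
```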
|
|
|
## Citation |
|
If you use Dolphin in your research, please cite our paper: |
|
|
|
```bibtex |
|
@article{dolphin2024, |
|
title={Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models}, |
|
author={[Author Names]}, |
|
journal={arXiv preprint arXiv:[paper_id]}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
## Contact |
|
For questions or feedback, please [contact us](mailto:octopus@nexa4ai.com).