llama3.1-s-instruct-v0.2-GGUF / README.md

aashish1904

Upload README.md with huggingface_hub

c2b956e verified 3 months ago

6.51 kB


	---

	datasets:
	- homebrewltd/instruction-speech-whispervq-v2
	language:
	- en
	license: apache-2.0
	tags:
	- sound language model

	---

	![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)

	# QuantFactory/llama3.1-s-instruct-v0.2-GGUF
	This is quantized version of [homebrewltd/llama3.1-s-instruct-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-instruct-v0.2) created using llama.cpp

	# Original Model Card


	## Model Details

	We have developed and released the family [llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405). This family is natively understanding audio and text input.

	We expand the Semantic tokens experiment with WhisperVQ as a tokenizer for audio files from [homebrewltd/llama3.1-s-base-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-base-v0.2) with nearly 1B tokens from [Instruction Speech WhisperVQ v2](https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v2) dataset.

	Model developers Homebrew Research.

	Input Text and sound.

	Output Text.

	Model Architecture Llama-3.

	Language(s): English.

	## Intended Use

	Intended Use Cases This family is primarily intended for research applications. This version aims to further improve the LLM on sound understanding capabilities.

	Out-of-scope The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

	## How to Get Started with the Model

	Try this model using [Google Colab Notebook](https://colab.research.google.com/drive/18IiwN0AzBZaox5o0iidXqWD1xKq11XbZ?usp=sharing).

	First, we need to convert the audio file to sound tokens

	```python
	device = "cuda" if torch.cuda.is_available() else "cpu"
	if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
	hf_hub_download(
	repo_id="jan-hq/WhisperVQ",
	filename="whisper-vq-stoks-medium-en+pl-fixed.model",
	local_dir=".",
	)
	vq_model = RQBottleneckTransformer.load_model(
	"whisper-vq-stoks-medium-en+pl-fixed.model"
	).to(device)
	def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
	vq_model.ensure_whisper(device)

	wav, sr = torchaudio.load(audio_path)
	if sr != 16000:
	wav = torchaudio.functional.resample(wav, sr, 16000)
	with torch.no_grad():
	codes = vq_model.encode_audio(wav.to(device))
	codes = codes[0].cpu().tolist()

	result = ''.join(f'<\|sound_{num:04d}\|>' for num in codes)
	return f'<\|sound_start\|>{result}<\|sound_end\|>'

	def audio_to_sound_tokens_transcript(audio_path, target_bandwidth=1.5, device=device):
	vq_model.ensure_whisper(device)

	wav, sr = torchaudio.load(audio_path)
	if sr != 16000:
	wav = torchaudio.functional.resample(wav, sr, 16000)
	with torch.no_grad():
	codes = vq_model.encode_audio(wav.to(device))
	codes = codes[0].cpu().tolist()

	result = ''.join(f'<\|sound_{num:04d}\|>' for num in codes)
	return f'<\|reserved_special_token_69\|><\|sound_start\|>{result}<\|sound_end\|>'
	```

	Then, we can inference the model the same as any other LLM.

	```python
	def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
	tokenizer = AutoTokenizer.from_pretrained(model_path)

	model_kwargs = {"device_map": "auto"}

	if use_4bit:
	model_kwargs["quantization_config"] = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_compute_dtype=torch.bfloat16,
	bnb_4bit_use_double_quant=True,
	bnb_4bit_quant_type="nf4",
	)
	elif use_8bit:
	model_kwargs["quantization_config"] = BitsAndBytesConfig(
	load_in_8bit=True,
	bnb_8bit_compute_dtype=torch.bfloat16,
	bnb_8bit_use_double_quant=True,
	)
	else:
	model_kwargs["torch_dtype"] = torch.bfloat16

	model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

	return pipeline("text-generation", model=model, tokenizer=tokenizer)

	def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
	generation_args = {
	"max_new_tokens": max_new_tokens,
	"return_full_text": False,
	"temperature": temperature,
	"do_sample": do_sample,
	}

	output = pipe(messages, **generation_args)
	return output[0]['generated_text']

	# Usage
	llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
	pipe = setup_pipeline(llm_path, use_8bit=True)
	```

	## Training process
	Training Metrics Image: Below is a snapshot of the training loss curve visualized.

	![training_](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/pQ8y9GoSvtv42MgkKRDt0.png)

	### Hardware

	GPU Configuration: Cluster of 8x NVIDIA H100-SXM-80GB.
	GPU Usage:
	- Continual Training: 6 hours.

	### Training Arguments

	We utilize [torchtune](https://github.com/pytorch/torchtune) library for the latest FSDP2 training code implementation.

	\| Parameter \| Continual Training \|
	\|----------------------------\|-------------------------\|
	\| Epoch \| 1 \|
	\| Global batch size \| 128 \|
	\| Learning Rate \| 0.5e-4 \|
	\| Learning Scheduler \| Cosine with warmup \|
	\| Optimizer \| Adam torch fused \|
	\| Warmup Ratio \| 0.01 \|
	\| Weight Decay \| 0.005 \|
	\| Max Sequence Length \| 512 \|


	## Examples

	1. Good example:

	<details>
	<summary>Click to toggle Example 1</summary>

	```

	```
	</details>

	<details>
	<summary>Click to toggle Example 2</summary>

	```

	```
	</details>


	2. Misunderstanding example:

	<details>
	<summary>Click to toggle Example 3</summary>

	```

	```
	</details>

	3. Off-tracked example:

	<details>
	<summary>Click to toggle Example 4</summary>

	```

	```
	</details>


	## Citation Information

	BibTeX:

	```
	@article{Llama3-S: Sound Instruction Language Model 2024,
	title={Llama3-S},
	author={Homebrew Research},
	year=2024,
	month=August},
	url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
	```

	## Acknowledgement

	- [WhisperSpeech](https://github.com/collabora/WhisperSpeech)

	- [Meta-Llama-3.1-8B-Instruct ](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)