---
license: llama3.1
tags:
- gguf
- llama3
pipeline_tag: text-generation
datasets:
- froggeric/imatrix
language:
- en
library_name: ggml
---
# Meta-Llama-3.1-405B-Instruct-GGUF
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/o7DiWuILyzaPLh4Ne1JKr.png)
Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, quantized from the ollama Q4_0 GGUF with llama.cpp [b3449](https://github.com/ggerganov/llama.cpp/releases/tag/b3449).
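As a minimal sketch of how such a requantization is done with llama.cpp's `llama-quantize` (file names and the target quant are illustrative, and the `--imatrix` flag applies only to the imatrix-assisted low-bit quants described at the end of this card):
```
# Requantize the source GGUF to IQ2_S, guided by an importance matrix.
# File names are illustrative; see the imatrix section below.
./llama-quantize --imatrix imatrix.dat \
  Meta-Llama-3.1-405B-Instruct.Q4_0.gguf \
  Meta-Llama-3.1-405B-Instruct.IQ2_S.gguf IQ2_S
```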
| Quant | Notes |
|-------------|--------------------------------------------|
| BF16 | Brain floating point, very high quality, same size as F16 |
| Q8_0 | 8-bit quantization, high quality, larger size |
| Q6_K | 6-bit quantization, very good quality-to-size ratio |
| Q5_K | 5-bit quantization, good balance of quality and size |
| Q5_0 | Alternative 5-bit quantization, slightly different balance |
| Q4_K_M | 4-bit quantization, good for production use |
| Q4_K_S | 4-bit quantization, faster inference, efficient for scaling |
| Q4_0 | Basic 4-bit quantization, good for experimentation |
| Q3_K_L | 3-bit quantization, high-quality with more VRAM requirement |
| Q3_K_M | 3-bit quantization, good balance between speed and accuracy |
| Q3_K_S | 3-bit quantization, faster inference with minor quality loss |
| Q2_K | 2-bit quantization, suitable for general inference tasks |
| IQ2_S | 2-bit i-quant, optimized for small VRAM environments |
| IQ2_XXS | 2-bit i-quant, best for ultra-low memory footprint |
| IQ1_M | 1-bit i-quant, usable |
| IQ1_S | 1-bit i-quant, not recommended |
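To run one of the quants in the table above, point llama.cpp's `llama-cli` at the downloaded GGUF. For files split into shards, loading the first shard is enough; llama.cpp picks up the rest automatically. The file name below is an assumption, so substitute the actual name from this repository:
```
# Shard name is illustrative; llama.cpp loads the remaining
# -0000X-of-0000Y.gguf shards automatically.
./llama-cli -m Meta-Llama-3.1-405B-Instruct.Q2_K-00001-of-00005.gguf \
  -p "Why is the sky blue?" -n 256
```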
For higher-quality quantizations (Q4 and above), please refer to [nisten/meta-405b-instruct-cpu-optimized-gguf](https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf).
Regarding the `smaug-bpe` pre-tokenizer these files report: it makes no practical difference, as the `smaug-bpe` and `llama-bpe` pre-tokenizers are identical. However, if you have concerns, you can set the `llama-bpe` pre-tokenizer with the following command:
```
./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
```
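To confirm the change took effect, you can dump the file's metadata with `gguf_dump.py`, which ships in the same `gguf-py/scripts` directory:
```
# Print metadata only (no tensor listing) and look for the pre-tokenizer key.
./gguf-py/scripts/gguf_dump.py --no-tensors Llama-3.1-405B-Instruct-fixed.gguf | grep tokenizer.ggml.pre
```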
## imatrix
The importance matrix was generated from the Q2_K quant, using `groups_merged.txt` from [froggeric/imatrix](https://huggingface.co/datasets/froggeric/imatrix) as the calibration data.
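As a rough sketch, such an imatrix can be computed with llama.cpp's `llama-imatrix` (file names are illustrative):
```
# Compute an importance matrix from the Q2_K quant over the calibration text.
./llama-imatrix -m Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
  -f groups_merged.txt -o imatrix.dat
```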