---
license: llama3.2
---

# FineLlama-3.2-3B-Instruct-ead-GGUF

GGUF quantized versions of the [Geraldine/FineLlama-3.2-3B-Instruct-ead](https://huggingface.co/Geraldine/FineLlama-3.2-3B-Instruct-ead) model, optimized for efficient inference with llama.cpp.

## Model Description

- **Base Model**: FineLlama-3.2-3B-Instruct-ead
- **Quantization**: Various GGUF formats
- **Purpose**: EAD tag generation and archival metadata encoding
- **Framework**: llama.cpp

## Available Variants

The following quantized versions are available:

- Q2_K variant (1.36 GB)
- Q3_K_M variant (1.69 GB)
- Q4_K_M variant (2.02 GB)
- Q5_K_M variant (2.32 GB)
- Q6_K variant (2.64 GB)
- Q8_0 variant (3.42 GB)
- FP16 variant (6.43 GB)

## Installation

1. Download the desired GGUF model variant (see the `huggingface_hub` sketch at the end of this card)
2. Install llama.cpp following the official instructions
3. Place the model file in your llama.cpp `models` directory

## Usage

```bash
# Example using the Q4_K_M quantization
./main -m models/FineLlama-3.2-3B-Instruct-ead-Q4_K_M.gguf -n 1024 --repeat_penalty 1.1

# Example using server mode
./server -m models/FineLlama-3.2-3B-Instruct-ead-Q4_K_M.gguf -c 4096
```

Note that recent llama.cpp releases name these binaries `llama-cli` and `llama-server`.

### Example using the llama-cpp-python library

```python
from llama_cpp import Llama

query = "..."  # your EAD generation prompt

llm = Llama.from_pretrained(
    repo_id="Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF",
    filename="*Q8_0.gguf",
    n_ctx=1024,
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD format."},
        {"role": "user", "content": query},
    ]
)
print(output["choices"][0]["message"]["content"])
```

### Example using Ollama

```bash
ollama run hf.co/Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF:Q4_K_M
```

## Quantization Details

- Q2_K: 2-bit quantization, optimized for minimal size and memory use
- Q3_K_M: 3-bit quantization with medium precision
- Q4_K_M: 4-bit quantization with medium precision
- Q5_K_M: 5-bit quantization with medium precision
- Q6_K: 6-bit quantization
- Q8_0: 8-bit quantization, highest precision among the quantized versions
- FP16: Full 16-bit floating point, no quantization

## Performance Considerations

- Lower-bit quantizations (Q2_K, Q3_K_M) offer smaller file sizes but may slightly reduce output quality
- Higher-bit quantizations (Q6_K, Q8_0) provide better accuracy but require more storage and memory
- FP16 preserves full precision but requires significantly more resources
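
## Additional Python Examples

The installation steps above leave the download method open. The sketch below uses the `huggingface_hub` library to fetch a single quantized file instead of cloning the whole repository; the exact filename is an assumption based on the variant list and should be checked against the files in this repository.

```python
from huggingface_hub import hf_hub_download

# Download one quantized variant into the local Hugging Face cache.
# NOTE: the filename is assumed from the variant list above; verify it
# against the actual file names in this repository.
model_path = hf_hub_download(
    repo_id="Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF",
    filename="FineLlama-3.2-3B-Instruct-ead-Q4_K_M.gguf",
)
print(model_path)  # pass this path to llama.cpp via -m, or to Llama(model_path=...)
```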
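
When the model is running in server mode (see Usage above), the llama.cpp server exposes an OpenAI-compatible HTTP API. The sketch below assumes the default `localhost:8080` address and the `/v1/chat/completions` endpoint; adjust the host, port, and generation parameters to your setup.

```python
import requests

# Query a llama.cpp server started as shown in the Usage section.
# Host, port, prompt, and max_tokens are illustrative assumptions.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are an archivist expert in EAD format."},
            {"role": "user", "content": "Generate a minimal EAD <archdesc> element for a small photograph collection."},
        ],
        "max_tokens": 512,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```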