Llama 3 70B Instruct quantized to the Q40 format supported by Distributed Llama.
License
Before downloading this repository, please accept the Llama 3 Community License.
How to run
- Clone this repository.
- Clone Distributed Llama:
git clone https://github.com/b4rtaz/distributed-llama.git
- Build Distributed Llama:
make dllama
- Run Distributed Llama:
sudo nice -n -20 ./dllama inference --model /path/to/dllama_model_llama3-70b-instruct_q40.m --tokenizer /path/to/dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4
Chat Template
Please keep in mind that this model expects the prompt to use the Llama 3 chat template.
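As a reference, a single-turn prompt in the Llama 3 chat template looks like the sketch below. The helper function name is illustrative, not part of Distributed Llama; the special tokens are those defined by the Llama 3 template.

```python
def llama3_chat_prompt(system: str, user: str) -> str:
    # Illustrative helper: formats one system + user turn using the
    # Llama 3 chat template special tokens, ending with an open
    # assistant header so the model generates the reply.
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = llama3_chat_prompt("You are a helpful assistant.", "Hello world")
print(prompt)
```

A string formatted this way can be passed to the `--prompt` argument in place of the bare "Hello world" shown above.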