Doctor-Shotgun
committed on
Update README.md
README.md CHANGED
@@ -16,6 +16,10 @@ Created using [TinyLlama-1.1B](https://huggingface.co/TinyLlama/tinyLlama-interm
 
 Of note, the base checkpoint used was from commit "final model" fad4f1a5cd0563ac41349b8fec2e6e51156568a0 which was subsequently reverted, and not the current main branch 3T checkpoint of TinyLlama-1.1B.
 
+[EXL2 Quants by turboderp](https://huggingface.co/turboderp/TinyLlama-1B-32k-exl2)
+
+The quantized model fits alongside a 4.25bpw 70B model at 32k sequence length on a single A6000 and provides noticeable speed-up with speculative decoding.
+
 ### Wikitext (wikitext-2-raw-v1_train) Perplexity (64 rows) as evaluated via [exllamav2](https://github.com/turboderp/exllamav2):
 
 | Model | 2048 | 4096 | 8192 | 16384 | 32768 |
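
The added README lines lean on speculative decoding, so a minimal sketch of that pairing with exllamav2's Python API follows. It is an illustration, not part of the commit: the model directories are placeholder paths, the target is assumed to be any Llama-architecture 70B EXL2 quant that shares TinyLlama's tokenizer, and the `draft_model` / `draft_cache` / `num_speculative_tokens` arguments to `ExLlamaV2StreamingGenerator` reflect one version of exllamav2's streaming API and may differ in the version you have installed.

```python
# Sketch: use the TinyLlama-1B-32k EXL2 quant as a draft model for a larger
# target model in exllamav2. Paths are placeholders; generator arguments are
# assumptions based on exllamav2's streaming API and may vary by version.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler


def load(model_dir: str, max_seq_len: int):
    """Load an EXL2 model with a lazily-allocated cache, splitting across GPUs as needed."""
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    config.max_seq_len = max_seq_len
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy = True)
    model.load_autosplit(cache)
    return model, cache


# Target: a 4.25bpw 70B EXL2 quant; draft: this model's EXL2 quant (placeholder paths).
target, target_cache = load("/models/70b-4.25bpw-exl2", 32768)
draft, draft_cache = load("/models/TinyLlama-1B-32k-exl2", 32768)

tokenizer = ExLlamaV2Tokenizer(target.config)

# The draft model proposes a few tokens per step and the target verifies them, so
# the output matches target-only sampling while running fewer target forward passes.
generator = ExLlamaV2StreamingGenerator(
    target, target_cache, tokenizer,
    draft_model = draft,
    draft_cache = draft_cache,
    num_speculative_tokens = 5,  # tune against the observed acceptance rate
)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

input_ids = tokenizer.encode("Explain speculative decoding in one paragraph.")
generator.begin_stream(input_ids, settings)

generated = 0
while generated < 256:
    chunk, eos, _ = generator.stream()
    print(chunk, end = "", flush = True)
    generated += 1
    if eos:
        break
```

Greedy or low-temperature sampling tends to raise the draft acceptance rate, which is where the speed-up mentioned in the README comes from.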