---
language:
- en
author: froggeric
---

# Input files for generating the Importance Matrix

## Which file to use for generating the importance matrix

Not all importance matrices are equal. The best results are obtained when using a source file similar to the
training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs),
the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed;
for example:
```
cat technical.txt multilingual.txt wiki.txt >custom.matrix
```
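One caveat with plain `cat`: if a file does not end with a newline, its last word is glued to the first word of the next file. A minimal sketch that guards against this, assuming a POSIX shell; the helper name `combine_inputs` is made up for illustration:

```shell
# Concatenate input files into one, making sure each file ends with a
# newline so that text from adjacent files does not run together.
# Usage: combine_inputs <output_file> <input_file>...
combine_inputs() {
    out=$1
    shift
    : > "$out"                       # create or truncate the output file
    for f in "$@"; do
        cat "$f" >> "$out"
        # tail -c 1 prints the last byte; command substitution strips a
        # trailing newline, so a non-empty result means none was present.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$out"
        fi
    done
}
```

For example, `combine_inputs custom.matrix technical.txt multilingual.txt wiki.txt` produces the same file as the `cat` command above whenever every input already ends with a newline.
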
Note on **context size** when generating the matrix: in general, a small context size such as 512 is recommended, and community
tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the
source data you are using: with random tokens or short text a small context makes sense; but when using larger texts, a larger
context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly corresponds
to the number of words, not characters.

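To get a feel for how a file's size relates to the context size, the number of full chunks it can supply can be estimated from its word count. This is only a rough sketch: it assumes one whitespace-separated word per token, whereas real tokenizers usually produce somewhat more tokens than words, and the helper name `estimate_chunks` is hypothetical:

```shell
# Rough estimate of how many full context-sized chunks a text file provides.
# Assumption: one word ~ one token, so treat the result as approximate.
# Usage: estimate_chunks <text_file> [context_size]   (context defaults to 512)
estimate_chunks() {
    file=$1
    ctx=${2:-512}
    words=$(( $(wc -w < "$file") ))   # arithmetic expansion drops wc's padding
    echo $(( words / ctx ))           # integer division: only full chunks count
}
```

For example, `estimate_chunks groups_merged.txt 512` prints the approximate number of 512-token chunks in that file, which can be compared with the number of chunks you intend to process.
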
You will find below descriptions for the various input files provided, to help you choose the correct one.

## Community provided files

**groups_merged**\
_"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
I get lower KL div than wikitext for the same length and the outputs seem qualitatively better."_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

**group_10_merged**\
(superseded by groups_merged)\
_"This is about ~50k pseudo-random tokens.
I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b"_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233

**20k_random_data**\
(superseded by group_10_merged)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190

**8k_random_data**\
(superseded by 20k_random_data)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829

**badwords**\
402 English words that can be considered dirty, naughty, obscene, or otherwise bad words.
This could be useful to remove guard rails.
Compiled from [Shutterstock GitHub repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

**badwords_multilingual**\
2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages.
This could be useful to remove guard rails.
Compiled from [Shutterstock GitHub repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

**ptb.train**\
Penn Treebank (PTB) is a widely used, preprocessed large dataset designed for language training. Casing,
punctuation and numbers have been removed from the training data. More recently it has largely been superseded
by WikiText, which does not have these removals and features a larger vocabulary and full articles (better
suited for models that can take advantage of long-term dependencies). However, for importance matrix training,
PTB is still a valid dataset, which has the advantage of being manually curated, and similar to WikiText
without being WikiText; this can help against bias.

**WikiText**\
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of
verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and
WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models
that can take advantage of long-term dependencies.\
https://huggingface.co/datasets/wikitext

**WikiText_FR**\
70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles"
or "good articles".\
https://huggingface.co/datasets/asi/wikitext_fr

**c4**\
The C4 dataset is a collection of text sourced from the public Common Crawl web scrape.
It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish)
in addition to extensive deduplication. The C4 dataset was explicitly designed to be English only:
any page that was not given a probability of at least 99% of being English by langdetect was discarded.

**code** (exllamav2)\
Programming.

**multilingual** (exllamav2)\
English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew,
Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish.

**technical** (exllamav2)\
Technical writing.

**tiny**\
Very short stories. Be mindful of the prevalence of _"Once upon a time"_ and _"<|endoftext|>"_.
Extract from [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories)

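To gauge how dominant these markers are before using the file, one option is to count the lines that contain them. A minimal sketch using `grep`; the file name `tiny.txt` and the helper name `count_phrase` are assumptions for illustration:

```shell
# Count the lines of a file that contain a fixed string (no regex matching).
# Usage: count_phrase <phrase> <file>
count_phrase() {
    # grep exits non-zero when nothing matches but still prints 0,
    # so the exit status is deliberately ignored.
    grep -F -c -e "$1" -- "$2" || true
}

# Example calls (commented out, since they need the actual file):
# count_phrase 'Once upon a time' tiny.txt
# count_phrase '<|endoftext|>' tiny.txt
```

Comparing these counts with the total number of lines (`wc -l`) gives a rough idea of how repetitive the file is.
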
**wiki** (exllamav2)\
Small Wikipedia dump. Unclean, contains many unwanted tags.

exllamav2 calibration data taken from:\
https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data

## How to quantize using an imatrix, with llama.cpp

1. Get one of the input files collected here, or elsewhere.
2. Convert or download the model you want to quantise, in fp16 GGUF format.
3. Generate an imatrix file specific to the model you want to quantise.
```
cd <llama.cpp directory>
./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512

# -ngl 144 : layers offloaded to GPU (recommended: the number of layers the model contains)
# -t 12 : number of threads (should probably match the number of CPU cores)
# -c 512 : context size, testing seems to show 512 is recommended (default=512, 0=loaded from model)
# -b 512 : batch size (default=512)
# --chunks 100 : number of chunks to process (100 recommended)
# --mlock : keep model in RAM (only use if you have sufficient RAM for the whole fp16 model)
```
4. Use the generated matrix file to quantise the model.
```
./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, eg:IQ4_XS>
```
Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is
better for higher quantisation.
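
Steps 3 and 4 can be chained in a small helper. The following is only a sketch under stated assumptions: it is meant to be run from the llama.cpp directory where the `./imatrix` and `./quantize` binaries live, it passes the matrix to `quantize` through its `--imatrix` option, and the names `run`, `imatrix_quantize`, `DRY_RUN` and the `.imatrix` output suffix are made up for illustration. Thread and GPU options are left out; add them as needed for your machine.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Generate an importance matrix for a model, then quantise with it.
# Usage: imatrix_quantize <fp16_model.gguf> <calibration_text_file> <quant_level>
imatrix_quantize() {
    model=$1                          # fp16 GGUF model
    input=$2                          # plain-text calibration file
    quant=$3                          # quantisation level, e.g. IQ4_XS
    matrix="${model%.gguf}.imatrix"   # hypothetical output file name

    # Only quantise if the matrix was generated successfully.
    run ./imatrix -m "$model" -f "$input" -o "$matrix" -c 512 -b 512 --chunks 100 &&
    run ./quantize --imatrix "$matrix" "$model" "$quant"
}
```

Setting `DRY_RUN=1` before calling `imatrix_quantize ggml-model-f16.gguf groups_merged.txt IQ4_XS` prints the two commands without running them, which is a cheap way to check paths before a long run.
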