language:
  - en
author: >-
  Joseph717171 & froggeric
  (https://huggingface.co/datasets/froggeric/imatrix/edit/main/README.md)

All credit for this wonderful Repo Card goes to froggeric. It explains the similarities and differences between computed imatrices, and between the training datasets used to compute them, including what each dataset is intended for and which large language models it suits.

Note: All imatrices uploaded to this repo are pre-computed and therefore ready to be used in llama.cpp's quantization process.

Note: Imatrices uploaded to this repo use the naming convention model-name_training-dataset.imatrix (the hyphens are used in this example purely to enhance readability...)

Instructions: Download the imatrix for your chosen LLM (Large Language Model), and quantize to your preferred QuantType. (Note: the following example assumes you have already converted your model to GGUF.)

llama.cpp % ./quantize --imatrix path_to_imatrix path_to_model/ggml-model-f16.gguf model_name-QuantType.gguf QuantType
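If you want several quant types from the same imatrix, the command above can be wrapped in a loop. The following is a minimal sketch, not part of llama.cpp: the helper name `print_quantize_cmds` and the file names in the example call are hypothetical placeholders. It only prints the commands, so you can check them before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print one quantize command per requested QuantType.
# It only echoes the commands; nothing is executed.
# Usage: print_quantize_cmds <imatrix> <f16_gguf> <model_name> <QuantType>...
print_quantize_cmds() {
    imatrix=$1
    model=$2
    name=$3
    shift 3
    for qtype in "$@"; do
        # Output file follows the model_name-QuantType.gguf pattern used above.
        printf './quantize --imatrix %s %s %s-%s.gguf %s\n' \
            "$imatrix" "$model" "$name" "$qtype" "$qtype"
    done
}

# Example call with placeholder file names (assumptions, not files in this repo):
print_quantize_cmds my-model.imatrix models/ggml-model-f16.gguf my-model Q4_K_M IQ3_XS
```

Once the printed lines look right, they can be pasted into a shell opened in the llama.cpp directory.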

Note: If you need detailed steps to convert your Large Language Model to GGUF, please scroll to the bottom of this page and check out the section: How to Convert Supported LLMs (Large Language Models) to GGUF Format

Supplementary Learning: Training Datasets, Their Similarities and Differences, and How to Determine Which One Will Be Right for Computing Your Imatrix

Input files for generating the Importance Matrix

Which file to use for generating the importance matrix

Not all importance matrices are equal. The best results are obtained when using a source file similar to the training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs), the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed; for example:

cat technical.txt multilingual.txt wiki.txt >custom.matrix
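One caveat with plain `cat`: if an input file does not end with a newline, its last line is joined to the first line of the next file. The sketch below is one possible way to guard against that; the helper name `combine_inputs` is hypothetical and not part of llama.cpp.

```shell
#!/bin/sh
# Hypothetical helper: concatenate plain-text input files into one file,
# adding a newline after any file that does not already end with one.
# Usage: combine_inputs <output_file> <input_file>...
combine_inputs() {
    out=$1
    shift
    : > "$out"                      # create or truncate the output file
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "skipping missing or empty file: $f" >&2
            continue
        fi
        cat "$f" >> "$out"
        # $(tail -c 1 ...) is empty when the last byte is a newline,
        # because command substitution strips trailing newlines.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$out"
        fi
    done
}
```

With this helper, the example above would become `combine_inputs custom.matrix technical.txt multilingual.txt wiki.txt`.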

Note on context size when generating the matrix: in general, a small context size such as 512 is recommended, and community tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the source data you are using: with random tokens or short text a small context makes sense, but with longer texts a larger context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly corresponds to the number of words, not characters.
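To get a feel for how much text a given context size needs, the rule of thumb above (tokens roughly equal words) can be turned into a quick estimate. This is only a sketch under that assumption: the helper name `estimate_chunks` is hypothetical, and the real token count depends on the model's tokenizer.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many context-sized chunks a text file yields,
# using the word count as a rough stand-in for the token count (an assumption).
# Usage: estimate_chunks <text_file> [context_size]
estimate_chunks() {
    file=$1
    ctx=${2:-512}                   # default to a context size of 512
    words=$(wc -w < "$file" | tr -d ' ')
    # Integer division: this sketch ignores a partial trailing chunk.
    echo $(( words / ctx ))
}
```

For example, `estimate_chunks custom.matrix 512` prints the approximate number of 512-token chunks in the combined file. By the same arithmetic, 100 chunks at a context size of 512 need roughly 51,200 tokens of input.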

You will find below descriptions for the various input files provided, to help you choose the correct one.

Community provided files

groups_merged
"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!) This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data. I get lower KL div than wikitext for the same length and the outputs seem qualitatively better." (kalomaze)
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

group_10_merged
(superseded by groups_merged)
"This is about ~50k pseudo-random tokens. I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b" (kalomaze)
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233

20k_random_data
(superseded by group_10_merged)
https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190

8k_random_data
(superseded by 20k_random_data)
https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829

badwords
402 English words that can be considered dirty, naughty, obscene, or otherwise bad words. This could be useful to remove guard rails. Compiled from the Shutterstock GitHub repo.

badwords_multilingual
2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages. This could be useful to remove guard rails. Compiled from the Shutterstock GitHub repo.

ptb.train
Penn Treebank (PTB) is a widely used, preprocessed, large dataset designed for language training. Casing, punctuation and numbers have been removed from the training data. It has recently been largely superseded by WikiText, which does not have these removals and features a larger vocabulary and full articles (better suited for models that can take advantage of long-term dependencies). However, for importance matrix training, PTB is still a valid dataset: it has the advantage of being manually curated, and of being similar to WikiText without being WikiText, which can help against bias.

WikiText
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
https://huggingface.co/datasets/wikitext

WikiText_FR
70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles" or "good articles".
https://huggingface.co/datasets/asi/wikitext_fr

c4
The C4 dataset is a collection of text sourced from the public Common Crawl web scrape. It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish) in addition to extensive deduplication. The C4 dataset was explicitly designed to be English only: any page that was not given a probability of at least 99% of being English by langdetect was discarded.

code (exllamav2)
Programming

multilingual (exllamav2)
English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew, Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish.

technical (exllamav2)
Technical writing.

tiny
Very short stories. Be mindful of the prevalence of "Once upon a time" and "<|endoftext|>". Extracted from the TinyStories dataset.

wiki (exllamav2)
Small Wikipedia dump. Unclean, contains many unwanted tags.

exllamav2 calibration data taken from:
https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data

How to Convert Supported LLMs (Large Language Models) to GGUF Format:

llama.cpp % python convert.py path_to_model --outtype f16

How to quantize using an imatrix, with llama.cpp

  1. Get one of the input files collected here, or elsewhere.
  2. Convert or download the model you want to quantise, in fp16 GGUF format.
  3. Generate an imatrix file specific to the model you want to quantise
cd <llama.cpp directory>
./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512

# -ngl 144 : layers offloaded to GPU (recommended: the number of layers the model contains)
# -t 12    : number of threads (should probably match the number of CPUs)
# -c 512   : context size; testing seems to show 512 is recommended (default=512, 0=loaded from model)
# -b 512   : batch size (default=512)
# --chunks 100 (recommended)
# --mlock  : keep model in RAM (only use if you have sufficient RAM for the whole fp16 model)
  4. Use the generated matrix file to quantise the model
./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, eg:IQ4_XS>

Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is better for higher quantisation.
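The imatrix and quantize steps above can be chained in one small script. The following is a sketch only: the function names and the `DRY_RUN` switch are hypothetical conveniences, all paths are placeholders, and `-ngl` is left out because the right value depends on your GPU and model. By default the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: generate an imatrix, then quantise with it.
# By default the commands are only printed; set DRY_RUN=0 to execute them
# from inside the llama.cpp directory.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: imatrix_then_quantize <f16_gguf> <text_file> <imatrix_out> <QuantType>
imatrix_then_quantize() {
    model=$1
    text=$2
    matrix=$3
    qtype=$4
    # One thread per online CPU, falling back to 4 if getconf cannot tell.
    threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
    # The quantize step only runs if the imatrix step succeeded.
    run ./imatrix -m "$model" -f "$text" -o "$matrix" \
        -t "$threads" --chunks 100 -b 512 -c 512 &&
    run ./quantize --imatrix "$matrix" "$model" "$qtype"
}
```

A call such as `imatrix_then_quantize models/ggml-model-f16.gguf groups_merged.txt output.matrix IQ4_XS` (placeholder paths) prints the two commands for review; add `-ngl` by hand if you offload layers to a GPU.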