---
language:
- en
author: froggeric
---

# Input files for generating the Importance Matrix

## Which file to use for generating the importance matrix

Not all importance matrices are equal. The best results are obtained when using a source file similar to the
training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs),
the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed;
for example:
```
cat technical.txt multilingual.txt wiki.txt >custom.matrix
```
Note on **context size** when generating the matrix: in general, a small context size such as 512 is recommended, and community
tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the
source data you are using: with random tokens or short texts a small context makes sense, but when using larger texts, a larger
context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly translates
to the number of words, not characters.

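As a quick sanity check on how much data a source file contains, a word count gives a rough approximation of its
token count (the exact count depends on the model's tokenizer); for example, for the combined file created above:
```
# word count roughly approximates token count
wc -w custom.matrix
```
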
Below you will find descriptions of the various input files provided, to help you choose the correct one.

## Community provided files

**groups_merged**\
_"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
I get lower KL div than wikitext for the same length and the outputs seem qualitatively better."_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

**group_10_merged**\
(superseded by groups_merged)\
_"This is about ~50k pseudo-random tokens.
I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b"_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233

**20k_random_data**\
(superseded by group_10_merged)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190

**8k_random_data**\
(superseded by 20k_random_data)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829

**badwords**\
402 English words that can be considered dirty, naughty, obscene, or otherwise bad words.
This could be useful to remove guard rails.
Compiled from the [Shutterstock GitHub repository](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

**badwords_multilingual**\
2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages.
This could be useful to remove guard rails.
Compiled from the [Shutterstock GitHub repository](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

**ptb.train**\
Penn Treebank (PTB) is a widely used, preprocessed large dataset designed for language model training. Casing,
punctuation and numbers have been removed from the training data. It has more recently been largely superseded
by WikiText, which does not apply these removals, features a larger vocabulary and consists of full articles
(better suited for models that can take advantage of long-term dependencies). However, for importance matrix generation,
PTB is still a valid dataset; it has the advantage of being manually curated, and it is similar to WikiText
without being WikiText, which can help against bias.

**WikiText**\
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of
verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and
WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models
that can take advantage of long-term dependencies.\
https://huggingface.co/datasets/wikitext

**WikiText_FR**\
70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles"
or "good articles".\
https://huggingface.co/datasets/asi/wikitext_fr

**c4**\
The C4 dataset is a collection of text sourced from the public Common Crawl web scrape.
It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish),
in addition to extensive deduplication. The C4 dataset was explicitly designed to be English-only:
any page that was not given a probability of at least 99% of being English by langdetect was discarded.

**code** (exllamav2)\
Programming code.

**multilingual** (exllamav2)\
English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew,
Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish.

**technical** (exllamav2)\
Technical writing.

**tiny**\
Very short stories. Be mindful of the prevalence of _"Once upon a time"_ and _"<|endoftext|>"_.
Extracted from the [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories)

**wiki** (exllamav2)\
Small Wikipedia dump. Unclean, contains many unwanted tags.

The exllamav2 calibration data was taken from:\
https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data

## How to quantize using an imatrix, with llama.cpp

1. Get one of the input files collected here, or elsewhere.
2. Convert or download the model you want to quantise, in fp16 GGUF format.
3. Generate an imatrix file specific to the model you want to quantise:
```
cd <llama.cpp directory>
./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512

# -ngl : number of layers offloaded to the GPU (recommended: use the number of layers the model contains)
# -t 12 : number of threads (should match the number of physical CPU cores)
# -c 512 : context size; testing seems to show 512 is recommended (default=512, 0=loaded from the model)
# -b 512 : batch size (default=512)
# --chunks 100 : maximum number of chunks to process (recommended)
# --mlock : keep the model in RAM (only use if you have sufficient RAM for the whole fp16 model)
```
4. Use the generated matrix file to quantise the model (a complete end-to-end sketch is given below):
```
./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, e.g. IQ4_XS>
```
Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is
better for higher quantisation levels.
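
Putting the steps together, here is a minimal end-to-end sketch. The model name, paths and calibration file are
hypothetical, the exact binary and conversion script names depend on your llama.cpp version (e.g. convert.py vs
convert_hf_to_gguf.py, quantize vs llama-quantize), and -t / -ngl should be adjusted to your hardware:
```
# 1. convert the original model to fp16 GGUF (conversion script name depends on the llama.cpp version)
python convert.py ./models/my-model-7b --outtype f16

# 2. generate the importance matrix from a calibration file (here: groups_merged.txt)
./imatrix -m ./models/my-model-7b/ggml-model-f16.gguf -f groups_merged.txt \
  -o my-model-7b.matrix -t 8 -ngl 33 --chunks 100 -b 512 -c 512

# 3. quantise using the generated matrix
./quantize --imatrix my-model-7b.matrix ./models/my-model-7b/ggml-model-f16.gguf IQ4_XS
```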