Reuse your `ggml-dbrx-instruct-16x12b-q8_0-imatrix.dat` file?

#1
by jukofyork - opened

If I want to quantize my own Q4_K_M can I just reuse your ggml-dbrx-instruct-16x12b-q8_0-imatrix.dat file instead of creating a new one?

Is there any benefit to rerunning on an fp16 instead of a q8 to generate the imatrix file myself?

@jukofyork Yes, you should be able to reuse the imatrix. I used Q8_0 to generate the imatrix because of resource constraints, and Q8_0 should be very close to FP16; when quantizing using the imatrix, however, I always use the FP16 weights. Quantizing from Q8_0 should also work, but you'll need to use the allow-requantize option.
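
For what it's worth, a minimal sketch of that requantize invocation (the file names here are placeholders, and quantizing from the FP16 GGUF works the same way minus --allow-requantize):

./quantize --allow-requantize --imatrix ggml-dbrx-instruct-16x12b-q8_0-imatrix.dat dbrx-instruct-q8_0.gguf dbrx-instruct-q4_k_m.gguf Q4_K_M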

Thanks!

What data did you use and what command line?

I'm just reading about the imatrix stuff here: https://github.com/ggerganov/llama.cpp/pull/4861

Is "wikitext-2-raw" what people normally use for this too?

@jukofyork On this model card page I have all the info :-) And yes, wikitext appears to be quite popular for training the imatrix. There is a "groups_merged.txt" as well that appears to sometimes give good results; I have used it in some of my previous quants.

Sorry, I should have looked on the model card :/

So looking at this: https://github.com/ggerganov/llama.cpp/pull/4957

./imatrix -m input-fp16.gguf -f wiki.train.raw -o out-fp16.imatrix -c 512

A context of 512 seems to be recommended; should I also set -b to, say, 1024 (or higher if possible)?

I won't be able to offload all of it but it looks like I can use -ngl and -t to help too.

I think the earliest I can do it is Monday, or else it will completely kill my internet connection all weekend.

I use --chunks 200 with wiki.train, and -c 512 is already the default so I don't specify it. And yes, I also offload all layers to the GPU with -ngl 999; otherwise running imatrix on the CPU would require a monk's level of patience. I recall reading in the imatrix thread on llama.cpp that a batch size of 512 was most likely best for most cases, and that going beyond that didn't produce conclusively better results.
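
Putting those flags together, the full invocation looks something like this (a sketch only; the model/output names are placeholders, and -t should match your core count):

./imatrix -m dbrx-instruct-f16.gguf -f wiki.train.raw -o dbrx-instruct-f16.imatrix --chunks 200 -ngl 999 -t 16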

There is a "groups_merged.txt" as well that appears to sometimes give good results; I have used it in some of my previous quants.

https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

groups_merged.txt
Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
I get lower KL div than wikitext for the same length and the outputs seem qualitatively better.

Next up, I will be looking into a way to preselect individual pieces of a larger dataset for higher KL div / higher PPL outlier sections so that the quantization is more robust to outliers (instead of throwing random data at it).

So perhaps for DBRX this might be a better choice?

Maybe, but without testing it we can't really know. I think I'll rent a box and run the full wiki.train (no chunk limit) and also groups_merged, then compare PPL. This will take a while though, and it might be worth having someone else do a second run to compare the results.

I'm very interested in this now. Up to now, when running coding models I've always had to use FP16, or Q8_0 if forced to, as the drop in performance is very dramatic for lower quants (eg: the models will start suggesting to change variable names to CamelCase when they already are, or claim to make all sorts of changes but just end up writing out the code unchanged, and so on), and even Q6_K runs pretty badly... BUT: with the advent of all these recent large models, even with 96GB of VRAM that's not going to be possible, and perhaps custom imatrix creation based on code might be the way to go.

Yes, and imatrix quants are not really tested, because doing so requires a lot of compute. I'm running the FP16 imatrix on both wiki and groups_merged right now, so we'll see. Also, I tested the IQ4_XS from phymbert (non-imatrix) vs mine that uses the imatrix, and they underperform to about the same degree. Ideally I want three sets of PPL results, one for each of wiki, groups_merged, and non-imatrix.
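
If anyone wants to reproduce the PPL comparison, the runs are along these lines (a sketch only; file names are placeholders):

./perplexity -m dbrx-instruct-q4_k_m.gguf -f wiki.test.raw -c 512 -ngl 999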

So I just quantized my own Q4_K_M using the new ggml-dbrx-instruct-16x12b-f16_imatrix-wiki.dat file (everything else was using the HF safetensors, llama.cpp pulled after yesterday's PR, etc.) and it's working awfully? Like really, really awfully compared to the Q4_0 I downloaded from phymbert yesterday:

https://huggingface.co/phymbert/dbrx-16x12b-instruct-q4_0-gguf

I'm retrying now without the imatrix file to see if it's any different, then will try and make my own Q4_0 to compare after.

I did notice in the PR that the Q8_0 and Q4_0 gave the same output with deterministic settings, while the Q6_K was different, but until I can recreate a comparable Q4_0 I can't be sure what the cause is.

I was expecting to squeeze a slightly better model out of the Q4_K_M, but what I've created is worse than attempted frankenmerges of coding models: very simple examples to do with C++, Boost, Eclipse plugins, etc. that I tested it with yesterday (and that I've used on other models) it just totally failed at, acting almost as though some of the tensor blocks were missing or scrambled and some of its knowledge was gone (like in a frankenmerge).

It seems even my FP16 model is garbage too, so I've got no idea what's happened (I double-checked, and it was the instruct HF repo I cloned too).

Gonna try getting phymbert's fp16 GGUF.

@jukofyork All quants posted here are working for me, but only using the model's template otherwise they produce rubbish.

Hi @dranger003 ,

I've redone everything and it's all working now! I think I had somehow corrupted the FP16 GGUF yesterday and everything I made from that was also corrupt... :/

I also created an imatrix file from groups_merged.txt if it's any use: https://huggingface.co/jukofyork/dbrx-instruct-imatrix

It only used 91 chunks though, as that was the maximum it could get from the file, but since I want to use DBRX for coding I thought it might be better to use the more diverse groups_merged.txt.

@jukofyork Thanks! Can you produce the PPL result for IQ4_XS? Let's compare with the results I posted on this model card, if we can.

Sorry - I just deleted everything to free up space to start on the command-r-plus model, and only kept the imatrix and Q4_K_M :/

I did do some brief coding tests against your IQ4_XS, phymbert's Q4_0, and my Q4_K_M made with groups_merged.txt, and the Q4_K_M definitely did seem better.

The only result I have saved is this from OpenWebUI:


phymbert's Q4_0 (correction: I believe this is actually my Q4_0 made using the groups_merged.txt imatrix; see below)

USER

Can you create a simple C++ program using GSL for logistic regression? Define the logistic_cdf as a separate function and use it to create the train and test functions. Add a small test driver with hard-coded values in main to show it working.

ASSISTANT

Sure, here's a simple C++ program using the GNU Scientific Library (GSL) for logistic regression. This program defines a logistic CDF function, uses it to create train and test functions, and includes a small test driver with hard-coded values in main.

#include <gsl/gsl_math.h>
#include <gsl/gsl_cdf.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_multimin.h>
#include <iostream>

// Logistic CDF function
double logistic_cdf(double x) {
    return 1.0 / (1.0 + exp(-x));
}

// Train function using logistic regression
void train(const gsl_matrix *X, const gsl_vector *y, gsl_vector *beta) {
    const size_t n = X->size1;
    const size_t p = X->size2;

    // Initialize the design matrix and response vector
    gsl_matrix *X_design = gsl_matrix_alloc(n, p + 1);
    gsl_vector_view y_view = gsl_vector_view_array(y->data, n);

    // Add a column of ones to the design matrix for the intercept term
    for (size_t i = 0; i < n; ++i) {
        gsl_matrix_set(X_design, i, 0, 1.0);
        for (size_t j = 1; j < p + 1; ++j) {
            gsl_matrix_set(X_design, i, j, gsl_matrix_get(X, i, j - 1));
        }
    }

    // Initialize the objective function and its gradient
    gsl_multimin_function_fdf func;
    func.f = [](const gsl_vector *x, void *params) {
        const gsl_matrix *X_design = static_cast<const gsl_matrix *>(params);
        double f = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double z = gsl_matrix_get(X_design, i, 0);
            for (size_t j = 1; j < p + 1; ++j) {
                z += gsl_vector_get(x, j - 1) * gsl_matrix_get(X_design, i, j);
            }
            f += y_view.vector.data[i] * log(logistic_cdf(z)) +
                 (1.0 - y_view.vector.data[i]) * log(1.0 - logistic_cdf(z));
        }
        return -f;
    };
    func.df = [](const gsl_vector *x, void *params, gsl_vector *g) {
        const gsl_matrix *X_design = static_cast<const gsl_matrix *>(params);
        gsl_vector_set_zero(g);
        for (size_t i = 0; i < n; ++i) {
            double z = gsl_matrix_get(X_design, i, 0);
            for (size_t j = 1; j < p + 1; ++j) {
                z += gsl_vector_get(x, j - 1) * gsl_matrix_get(X_design, i, j);
            }
            double p = logistic_cdf(z);
            for (size_t j = 1; j < p + 1; ++j) {
                gsl_vector_set(g, j - 1,
                               gsl_vector_get(g, j - 1) -
                                   (y_view.vector.data[i] - p) *
                                       gsl_matrix_get(X_design, i, j));
            }
        }
    };
    func.fdf = [](const gsl_vector *x, void *params, double *f, gsl_vector *g) {
        *f = func.f(x, params);
        func.df(x, params, g);
    };
    func.n = p;
    func.params = X_design;

    // Initialize the minimizer
    const gsl_multimin_fdfminimizer_type *T;
    gsl_multimin_fdfminimizer *s;
    T = gsl_multimin_fdfminimizer_nmsimplex;
    s = gsl_multimin_fdfminimizer_alloc(T, p);
    gsl_vector *x_init = gsl_vector_alloc(p);
    for (size_t j = 0; j < p; ++j) {
        gsl_vector_set(x_init, j, 0.0);
    }
    gsl_multimin_fdfminimizer_set(s, &func, x_init, 0.1, 1e-4);

    // Minimize the objective function
    int status;
    size_t iter = 0;
    do {
        iter++;
        status = gsl_multimin_fdfminimizer_iterate(s);
        if (status) {
            break;
        }
        status = gsl_multimin_test_gradient(s->gradient, 1e-3);
    } while (status == GSL_CONTINUE && iter < 1000);

    // Copy the solution to the output vector
    for (size_t j = 0; j < p; ++j) {
        gsl_vector_set(beta, j, gsl_vector_get(s->x, j));
    }

    // Free memory
    gsl_multimin_fdfminimizer_free(s);
    gsl_vector_free(x_init);
    gsl_matrix_free(X_design);
}

// Test function using logistic regression
void test(const gsl_matrix *X, const gsl_vector *beta, gsl_vector *y_pred) {
    const size_t n = X->size1;
    const size_t p = X->size2;

    // Initialize the design matrix and predicted response vector
    gsl_matrix *X_design = gsl_matrix_alloc(n, p + 1);
    gsl_vector_view y_pred_view = gsl_vector_view_array(y_pred->data, n);

    // Add a column of ones to the design matrix for the intercept term
    for (size_t i = 0; i < n; ++i) {
        gsl_matrix_set(X_design, i, 0, 1.0);
        for (size_t j = 1; j < p + 1; ++j) {
            gsl_matrix_set(X_design, i, j, gsl_matrix_get(X, i, j - 1));
        }
    }

    // Compute the predicted response
    for (size_t i = 0; i < n; ++i) {
        double z = gsl_matrix_get(X_design, i, 0);
        for (size_t j = 1; j < p + 1; ++j) {
            z += gsl_vector_get(beta, j - 1) * gsl_matrix_get(X_design, i, j);
        }
        gsl_vector_set(&y_pred_view.vector, i, logistic_cdf(z));
    }

    // Free memory
    gsl_matrix_free(X_design);
}

int main() {
    // Define the input data
    const size_t n = 4;
    const size_t p = 2;
    gsl_matrix *X = gsl_matrix_alloc(n, p);
    gsl_vector *y = gsl_vector_alloc(n);
    gsl_vector *beta = gsl_vector_alloc(p);
    gsl_vector *y_pred = gsl_vector_alloc(n);

    // Set the input data
    gsl_matrix_set(X, 0, 0, 1.0);
    gsl_matrix_set(X, 0, 1, 2.0);
    gsl_matrix_set(X, 1, 0, 1.0);
    gsl_matrix_set(X, 1, 1, 3.0);
    gsl_matrix_set(X, 2, 0, 1.0);
    gsl_matrix_set(X, 2, 1, 4.0);
    gsl_matrix_set(X, 3, 0, 1.0);
    gsl_matrix_set(X, 3, 1, 5.0);
    gsl_vector_set(y, 0, 0.0);
    gsl_vector_set(y, 1, 0.0);
    gsl_vector_set(y, 2, 1.0);
    gsl_vector_set(y, 3, 1.0);

    // Train the logistic regression model
    train(X, y, beta);

    // Test the logistic regression model
    test(X, beta, y_pred);

    // Print the results
    std::cout << "Beta: ";
    for (size_t j = 0; j < p; ++j) {
        std::cout << gsl_vector_get(beta, j) << " ";
    }
    std::cout << std::endl;
    std::cout << "Predicted response: ";
    for (size_t i = 0; i < n; ++i) {
        std::cout << gsl_vector_get(y_pred, i) << " ";
    }
    std::cout << std::endl;

    // Free memory
    gsl_matrix_free(X);
    gsl_vector_free(y);
    gsl_vector_free(beta);
    gsl_vector_free(y_pred);

    return 0;
}

This program defines a logistic CDF function, uses it to create train and test functions, and includes a small test driver with hard-coded values in main. The train function uses the Nelder-Mead simplex algorithm from GSL to minimize the negative log-likelihood of the logistic regression model. The test function uses the trained model to predict the response for new input data. The main function initializes the input data, trains the model, tests the model, and prints the results.


My Q4_K_M made using groups_merged.txt imatrix

USER

Can you create a simple C++ program using GSL for logistic regression? Define the logistic_cdf as a separate function and use it to create the train and test functions. Add a small test driver with hard-coded values in main to show it working.

ASSISTANT

Sure, here's a simple C++ program using the GNU Scientific Library (GSL) for logistic regression. This program defines a logistic CDF function, and uses it to create train and test functions for logistic regression. A small test driver with hard-coded values is included in the main function to show it working.

#include <gsl/gsl_math.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_multimin.h>
#include <gsl/gsl_cdf.h>
#include <iostream>

// Define the logistic CDF function
double logistic_cdf(double x) {
    return 1.0 / (1.0 + exp(-x));
}

// Define the logistic regression model
double logistic_regression_model(const gsl_vector *x, const gsl_vector *theta) {
    double y = gsl_vector_get(x, 0);
    double z = gsl_vector_get(theta, 0) + gsl_vector_get(theta, 1) * y;
    return logistic_cdf(z);
}

// Define the logistic regression cost function
double logistic_regression_cost(const gsl_vector *theta, void *data) {
    gsl_matrix *X = static_cast<gsl_matrix*>(data);
    size_t n = X->size1;
    double J = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double h = logistic_regression_model(gsl_matrix_row(X, i).data, theta);
        double y = gsl_matrix_get(X, i, 2);
        J += -y * log(h) - (1 - y) * log(1 - h);
    }

    return J / n;
}

// Define the logistic regression gradient function
void logistic_regression_gradient(const gsl_vector *theta, void *data, gsl_vector *g) {
    gsl_matrix *X = static_cast<gsl_matrix*>(data);
    size_t n = X->size1;
    gsl_vector_set_zero(g);

    for (size_t i = 0; i < n; ++i) {
        double h = logistic_regression_model(gsl_matrix_row(X, i).data, theta);
        double y = gsl_matrix_get(X, i, 2);
        gsl_vector_set(g, 0, gsl_vector_get(g, 0) + (h - y) / n);
        gsl_vector_set(g, 1, gsl_vector_get(g, 1) + (h - y) * gsl_matrix_get(X, i, 1) / n);
    }
}

// Define the train function for logistic regression
gsl_vector *train_logistic_regression(gsl_matrix *X) {
    const gsl_multimin_fminimizer_type *T;
    gsl_multimin_fminimizer *s;
    gsl_vector *ss, *theta;
    double step_size = 0.1;
    double epsabs = 1e-4;
    size_t iter = 0;
    int status;
    size_t n = X->size1;

    T = gsl_multimin_fminimizer_nmsimplex;
    ss = gsl_multimin_initial_vertex(n, step_size);
    theta = gsl_vector_alloc(n);
    gsl_vector_set(theta, 0, 0.0);
    gsl_vector_set(theta, 1, 0.0);

    s = gsl_multimin_fminimizer_alloc(T, n);
    gsl_multimin_fminimizer_set(s, logistic_regression_cost, theta, ss, epsabs);

    do {
        iter++;
        status = gsl_multimin_fminimizer_iterate(s);

        if (status) {
            break;
        }

        status = gsl_multimin_test_size(s->size, epsabs);

    } while (status == GSL_CONTINUE && iter < 1000);

    gsl_vector_memcpy(theta, s->x);
    gsl_multimin_fminimizer_free(s);
    gsl_vector_free(ss);
    gsl_vector_free(theta);

    return theta;
}

// Define the test function for logistic regression
double test_logistic_regression(gsl_vector *theta, gsl_matrix *X) {
    size_t n = X->size1;
    double accuracy = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double h = logistic_regression_model(gsl_matrix_row(X, i).data, theta);
        double y = gsl_matrix_get(X, i, 2);
        if ((h > 0.5 && y == 1) || (h <= 0.5 && y == 0)) {
            accuracy += 1.0;
        }
    }

    return accuracy / n;
}

int main() {
    // Define the training data
    gsl_matrix *X_train = gsl_matrix_alloc(4, 3);
    gsl_matrix_set(X_train, 0, 0, 1.0);
    gsl_matrix_set(X_train, 0, 1, 34.62365962451697);
    gsl_matrix_set(X_train, 0, 2, 0.0);
    gsl_matrix_set(X_train, 1, 0, 1.0);
    gsl_matrix_set(X_train, 1, 1, 37.66661804151512);
    gsl_matrix_set(X_train, 1, 2, 0.0);
    gsl_matrix_set(X_train, 2, 0, 1.0);
    gsl_matrix_set(X_train, 2, 1, 42.05936865676254);
    gsl_matrix_set(X_train, 2, 2, 0.0);
    gsl_matrix_set(X_train, 3, 0, 1.0);
    gsl_matrix_set(X_train, 3, 1, 46.16261400242737);
    gsl_matrix_set(X_train, 3, 2, 1.0);

    // Train the logistic regression model
    gsl_vector *theta = train_logistic_regression(X_train);

    // Define the test data
    gsl_matrix *X_test = gsl_matrix_alloc(2, 3);
    gsl_matrix_set(X_test, 0, 0, 1.0);
    gsl_matrix_set(X_test, 0, 1, 34.50846563095807);
    gsl_matrix_set(X_test, 0, 2, 0.0);
    gsl_matrix_set(X_test, 1, 0, 1.0);
    gsl_matrix_set(X_test, 1, 1, 47.33234614583333);
    gsl_matrix_set(X_test, 1, 2, 1.0);

    // Test the logistic regression model
    double accuracy = test_logistic_regression(theta, X_test);
    std::cout << "Accuracy: " << accuracy << std::endl;

    // Free memory
    gsl_matrix_free(X_train);
    gsl_matrix_free(X_test);
    gsl_vector_free(theta);

    return 0;
}

This program defines a logistic CDF function, and uses it to create train and test functions for logistic regression. The main function includes a small test driver with hard-coded values to show it working. The training data consists of four examples with two features and a binary label, and the test data consists of two examples with the same features and labels. The program trains a logistic regression model on the training data and tests it on the test data, printing the accuracy of the model.


These were both with temperature=0 (and rep-penalty=1), and the Q4_K_M clearly wrote in a better coding style and understood the problem slightly better IMO.

But the other tests I did to do with Eclipse and C++/Boost all showed better understanding of the task and better domain knowledge (eg: to make an Eclipse plug-in, the Q4_K_M knew to create a MANIFEST.MF and a plugin.xml file).

If you want I can probably recreate the FP16 and do the tests tonight after I get rid of all the command-r-plus stuff?

Actually, I think that first one is the Q4_0 I made with my imatrix file, because I asked this question of phymbert's Q4_0 yesterday and it didn't know GSL stood for 'GNU Scientific Library': it used a different/random word for the 'G' (but did use and import the GSL library!).

This is pretty good IMO, as previously anything less than Q5_K_M has been terrible at coding, and even Q5_K_M has been noticeably worse than Q8_0 or FP16. I don't think the perplexity measure is all that valid for coding models, where one tiny mistake can send the model down a completely wrong avenue, whereas for creative writing or chatting it probably makes no difference... I tried to use Mixtral at Q4_K_S on a 4090 card and it was really bad compared to Q8_0 (last December, before the imatrix PR was added).

It would be very hard for me to subjectively test IQ4_XS though, as Ollama doesn't support it yet and I can only ask questions in OpenWebUI, without any of the detailed system messages or prompts I have for other tasks like refactoring or documentation, etc. :/

I just tried to recreate this model and I'm back to the broken-frankenmerge level of bad again lol?!

Just about to recreate the FP16 yet again, and I'll see if it matches your PPL for wiki.test.raw, as the PPL reported by imatrix looked suspiciously high compared to any other model on groups_merged.txt.

I'm also using the llama.cpp release tag "b2665" to be 100% sure.
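
For anyone following along, pinning the build is roughly this (assuming a git clone and the Makefile build; b2665 is a release tag):

git -C llama.cpp checkout b2665
make -C llama.cpp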

It might be worth trying this again using the new tokenizer (and recreating the imatrix, because of the different tokens getting activated, etc.).

I'm just redoing dbrx-instruct and already noticing a vastly lower perplexity during the imatrix computation compared to before.

Sign up or log in to comment