imatrix support
continued from https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/78
Sadly, I have no clue why I don't have permission to push; whoami says the huggingface-cli login is valid.
The diffs are naturally still off, because after the denials I uploaded through huggingface-cli upload ggml-org/gguf-my-repo . . --repo-type space --revision refs/pr/80 instead. Looking forward to a solution for the auth issue. The PR is still in draft mode, so all good until we find a way :)
Will we be able to submit our own .txt file for imatrix generation? That would be really cool. I hope this gets merged soon, it's a game changer.
Of course! The fallback file is there solely for less familiar users that would try to quantize without providing their own :)
@SixOpen - can you try creating a pull request like this: https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-advanced-usage - this way you should have the correct diff.
Otherwise maybe you can open a PR through the UI :/
All ready! Oddly, it still didn't push after the huggingface-cli login, but git remote set-url origin with username and token did! Glad I didn't have to clutter you with that many separate PRs, and thanks for the patience 😆
Brilliant! Reviewing it now!
This generally looks good to me! Thanks for keeping it clean! Would really like it if @ggerganov can give it a review too!
- Looks like we are calling make twice: one time in the Dockerfile ENTRYPOINT and one more time in start.sh. Maybe it is better to just call it in the Dockerfile like this:
ENTRYPOINT ["/bin/bash", "-c", "cd llama.cpp && LLAMA_CUDA=1 make -j quantize gguf-split imatrix && cd .. && /bin/sh start.sh"]
And simplify start.sh to just:
python app.py
- In app.py, is it necessary to compile again? If not, then generate_importance_matrix can be simplified.
- Since the imatrix computation can take a lot of time if the training data is too big, we can put a time limit on the imatrix command, let's say 1 minute. If the process does not finish within this time limit, it gets killed and we use whatever imatrix.dat was generated last (the imatrix tool periodically outputs the current result to imatrix.dat, see the --output-frequency CLI argument).
Great calls :) The time limit is definitely something we should have, will add that in a bit! It looks like start.sh stayed on the version from before the entrypoint tweaks while I was stashing, and some LFS shenanigans might have affected the txt as well, but I'll update the branch to take care of all of that 😄 along with the superfluous compile in app.py
Thanks @ggerganov for the review! and thanks @SixOpen for updating the PR.
Small comment - let's keep the build process in start.sh. This is because Spaces sometimes build the Dockerfile in a different environment and the final Space separately.
If the build happens during start.sh, then we make sure that the build is correct for the hardware assigned to the Space (this also makes it easy for people to duplicate this Space).
Question: how are we ensuring the imatrix process goes on only for a minute?
EDIT: Never mind, saw the signal code. LGTM.
Agree to move the build inside start.sh. Btw the 1 minute timeout was an example - I'm not sure what number would make sense, so feel free to experiment if it is too short or too long
Kalomaze's groups_merged.txt is a very popular imatrix dataset: https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384
I suggest running that on a large model and seeing how long it takes, then adding a few minutes in case people want to add something to it.
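One rough way to turn that suggestion into a number is sketched below in Python. Both function names and the 180-second default headroom are assumptions for illustration only, not values agreed on in this thread.

```python
import subprocess
import time


def measure_runtime_s(cmd):
    """Run cmd to completion and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return time.monotonic() - start


def suggest_time_limit_s(measured_s, headroom_s=180.0):
    """Add fixed headroom to a measured runtime (headroom value is a guess)."""
    return measured_s + headroom_s
```

Measuring once with the reference dataset on a large model and passing the result to suggest_time_limit_s gives a starting point, which can then be tuned if it turns out too short or too long in practice.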
Of course! Good to know about spaces :) will update soon covering all above
Very much looking forward to this PR getting merged.
Lovely! Looks good to me! 🚀