Lewdiculous (AetherArchitectural)

AI & ML interests

[Personal Profile] General tech and LLM stuff! More information at: https://rentry.co/Lewdiculous | Mancer LLM Inference Service Referral: https://mancer.tech/?ref_code=82759764e0 | Backyard.ai Roleplay Referral: https://backyard.ai/ref/RDUtfMrITf3voh

Recent Activity

Upvoted the "Social" collection 11 days ago
Updated the "Social" collection 11 days ago

Organizations

Blog-explorers, CyberHarem, LWDCLS Research, Social Post Explorers, AetherArchitectural

Posts 2

More context for your Pascal GPU or older!

Update: Now available in the official releases of KoboldCpp!
[releases] https://github.com/LostRuins/koboldcpp/releases/latest

This is great news for everyone with a GTX 10XX, P40, or similar card.

A Flash Attention implementation for older NVIDIA GPUs that does not require Tensor Cores landed in llama.cpp in the last few days and should be merged into the next version of KoboldCpp. In the meantime, you can already try it with another fork or by building it yourself.

[Mentioned KCPP fork] https://github.com/Nexesenex/kobold.cpp/releases/latest

[PR] https://github.com/ggerganov/llama.cpp/pull/7188

You should expect lower VRAM usage for the same context size, allowing you to run larger contexts on your current GPU.
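
For a rough feel of where the savings come from, here is a back-of-the-envelope sketch. It assumes a naive attention pass that stores the full score matrix in FP32 for every head, which Flash Attention avoids materializing. The function name and the example numbers (32 heads, batch of 512, 8192 context) are only illustrative assumptions, and the real buffers llama.cpp allocates will differ.

```python
def attention_scores_bytes(n_heads: int, n_batch: int, n_ctx: int,
                           bytes_per_value: int = 4) -> int:
    """Size of a fully materialized attention score matrix.

    Hypothetical helper for illustration only: one score per
    (head, query token in the batch, key token in the context).
    """
    return n_heads * n_batch * n_ctx * bytes_per_value


# Illustrative numbers, not measurements from any specific GPU or build.
naive_bytes = attention_scores_bytes(n_heads=32, n_batch=512, n_ctx=8192)
print(f"{naive_bytes / 2**20:.0f} MiB")  # 512 MiB
```

The buffer grows linearly with context size, so doubling the context doubles this estimate, which is why skipping it leaves room for longer contexts.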

There have also been reports of improved final tokens/second speeds for inference, so that's also grand!

If you have tried it, I'd like to hear your experiences with --flashattention so far, especially for this implementation and for the large number of Pascal (GTX 10XX, P40...) cards.
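
If you want a quick way to compare runs with and without the flag, a small launcher sketch like this could help. The helper name, model file name, context size, and layer count are placeholders I made up for illustration; adjust the flags to whatever your KoboldCpp build and GPU actually need.

```python
import shlex


def build_koboldcpp_command(model_path: str, context_size: int = 8192,
                            gpu_layers: int = 33,
                            flash_attention: bool = True) -> list[str]:
    """Assemble a KoboldCpp command line (hypothetical helper, placeholder defaults)."""
    cmd = [
        "python", "koboldcpp.py",
        "--model", model_path,
        "--contextsize", str(context_size),
        "--gpulayers", str(gpu_layers),
        "--usecublas",
    ]
    if flash_attention:
        # The flag this post is asking for feedback on.
        cmd.append("--flashattention")
    return cmd


# "model.gguf" is a placeholder path, not a real file.
cmd_on = build_koboldcpp_command("model.gguf")
cmd_off = build_koboldcpp_command("model.gguf", flash_attention=False)
print(shlex.join(cmd_on))
print(shlex.join(cmd_off))
```

Running both variants with the same model and context, then comparing VRAM usage and tokens/second, should give numbers worth sharing in the discussion.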

Discussion linked below, with more links to relevant information:

https://huggingface.co/LWDCLS/LLM-Discussions/discussions/11

Cheers!

Updated: Lumimaid and TheSpice-v0.8.3

I have uploaded version 2 (v2) files for the Llama-3-Lumimaid-8B-v0.1-OAS GGUF Imatrix quants.

[model] Lewdiculous/Llama-3-Lumimaid-8B-v0.1-OAS-GGUF-IQ-Imatrix

You can recognize the new files by their v2 prefix.

Imatrix data was generated from the FP16 model, and the quants were converted directly from the BF16 model.
This should hopefully avoid any losses in the model conversion, a topic that has been discussed a lot recently for Llama-3 and GGUF.

This is more disk- and compute-intensive, so let's hope we get GPU inference support for BF16 models in llama.cpp.

If you are able to test them and notice any issues compared to the original quants, let me know in the corresponding discussions.

---

Additionally, L3-TheSpice-8b-v0.8.3 GGUF Imatrix quants were also updated.

[model] Lewdiculous/L3-TheSpice-8b-v0.8.3-GGUF-IQ-Imatrix
