Not even a trillion

#1
opened by distantquant

imo you should make a 7B x100 next

if you actually do, use UNA-TheBeagle-7b-v1 if you can

120b-x8 next please :huggingface:

> imo you should make a 7B x100 next

alright, so I see this a lot in the community. Let me drop a TL;DR:

for experts it goes 2x, 4x, 8x, 16x, 32x, 64x, 128x, etc.

this is because of the routing/sorting algorithm, I believe. Expert scaling has to go in powers of 2 (for "proper routing", anyway, in a clown-car MoE).
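
For anyone curious what the "routing" actually is: in a Mixtral-style MoE the router is just a linear layer that scores every expert for each token, plus a top-k pick. A minimal sketch (assuming PyTorch; the class name and dimensions are made up for illustration, and note that top-k itself works for any expert count, power-of-two sizes are just the convention these clown-car merges follow):

```python
import torch
import torch.nn.functional as F

class TopKGate(torch.nn.Module):
    """Toy Mixtral-style router: score every expert, keep the top-k per token."""
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # one routing score per expert for each token
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        logits = self.gate(hidden_states)                     # [tokens, num_experts]
        weights, expert_idx = torch.topk(logits, self.top_k)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)                  # renormalize over the chosen k
        return weights, expert_idx

# toy usage: 8 experts, 4 tokens, 2 experts active per token
gate = TopKGate(hidden_size=16, num_experts=8, top_k=2)
weights, idx = gate(torch.randn(4, 16))
print(idx.shape)  # torch.Size([4, 2])
```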

x128 :trol:

.GGUF and less than 570GB and I can run it on my server.

> .GGUF and less than 570GB and I can run it on my server.

with 512+GB you could merge your own 128x and then quantize it. You'd be the first if you made a 128x7B
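
The "will it fit under 570GB" part is just parameter count times bits per weight. A rough back-of-the-envelope sketch (the parameter count and bits-per-weight figures below are approximations I'm assuming, not measurements, and a real clown-car MoE shares the attention and embedding weights across experts, so the actual file comes out smaller than this):

```python
# Rough GGUF size estimate: params * bits_per_weight / 8 bytes,
# ignoring quantization overhead and shared (non-expert) weights.
def est_size_gb(total_params: float, bits_per_weight: float) -> float:
    return total_params * bits_per_weight / 8 / 1e9  # decimal GB

# assumption: a 128x7B frankenMoE is on the order of 128 * 7e9 params
# before accounting for the attention layers being shared
naive_params = 128 * 7e9
for name, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name}: ~{est_size_gb(naive_params, bits):.0f} GB")
```

So even at 4-bit-ish quants a naive 128x7B lands right around that 570GB budget; the shared attention/embedding weights would knock a decent chunk off the naive number.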

I made Phalanx out of 410M models https://huggingface.co/Kquant03/Phalanx-512x460M-MoE

Can you explain how you did it?
How does it perform?
I tried TheBloke/Falcon-180B-Chat-GGUF at Q8 and it performs better than the full Llama 2 70B. I will try to merge Falcon 180B with Llama-2 70B to get the best of the biggest LLMs available, then quantize it to fit in 512GB-570GB of RAM. My server also has two Tesla P40s, and I am upgrading to more GPUs in the future, but the main workhorses are the 2x 16-core CPUs.
I am going to download your model from the link. Right now I am downloading the full Falcon 180B, which will take some time to finish. Then your model will perhaps take two days to download.

> Can you explain how you did it? How does it perform?

I have no idea... I was able to merge it, but I ran out of swap space trying to load it into the transformers model loader 🥴
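
For what it's worth: if the goal is just to sanity-check that the merged model builds at all, accelerate can instantiate it on the "meta" device with no weights materialized, so swap never comes into it. A sketch, assuming reasonably recent transformers/accelerate and a standard config.json in the repo:

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model structure with no weight tensors actually allocated,
# so a huge MoE can be inspected without touching RAM or swap.
config = AutoConfig.from_pretrained("Kquant03/Phalanx-512x460M-MoE")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(sum(p.numel() for p in model.parameters()) / 1e9, "B params")
```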

I assume, considering a fly has around 38B parameters... a 410M parameter model might be a bit stupid and broken...

I used the mixtral branch of arceeai's mergekit https://github.com/arcee-ai/mergekit/tree/mixtral
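
For anyone who wants to try this flavor of merge themselves: with that branch the MoE assembly is driven by a YAML config listing a base model plus the expert source models with routing prompts. Below is a hypothetical sketch that writes such a config from Python; the model names and prompts are placeholders, and you should check the mergekit README for the exact schema and CLI before relying on it:

```python
import yaml  # pip install pyyaml

# Placeholder Mixtral-style frankenMoE config for mergekit's MoE tooling.
# Swap in whatever 7B experts you want (e.g. UNA-TheBeagle-7b-v1 from above).
moe_config = {
    "base_model": "mistralai/Mistral-7B-v0.1",
    "gate_mode": "hidden",      # route by hidden-state similarity to the prompts
    "dtype": "bfloat16",
    "experts": [
        {"source_model": "mistralai/Mistral-7B-Instruct-v0.2",
         "positive_prompts": ["chat", "general reasoning"]},
        {"source_model": "mistralai/Mistral-7B-v0.1",
         "positive_prompts": ["continue the text"]},
        # ...keep adding until you hit your 2^n expert count
    ],
}

with open("moe.yml", "w") as f:
    yaml.safe_dump(moe_config, f, sort_keys=False)

# then, roughly: mergekit-moe moe.yml ./my-moe-output
```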

> Can you explain how you did it? How does it perform?

also, I just re-read this... you cannot merge models of different sizes, I don't think... they have to have the same number of layers. You might be able to try a passthrough merge of Llama-2 70B, then passthrough that resulting merge again, to get enough layers to line up with Falcon.
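
For reference, that passthrough trick is expressed in mergekit as stacked layer slices of the same model. A hypothetical sketch (the layer ranges here are arbitrary, and stacking layers still would not make Falcon and Llama-2 architecturally compatible since their hidden sizes and attention layouts differ, so treat this purely as the depth-extension step):

```python
import yaml

# Hypothetical passthrough ("frankenmerge") config: stack overlapping layer
# ranges of one model to get a deeper model. Ranges chosen for illustration.
passthrough_config = {
    "merge_method": "passthrough",
    "dtype": "float16",
    "slices": [
        {"sources": [{"model": "meta-llama/Llama-2-70b-hf", "layer_range": [0, 60]}]},
        {"sources": [{"model": "meta-llama/Llama-2-70b-hf", "layer_range": [40, 80]}]},
    ],
}

with open("passthrough.yml", "w") as f:
    yaml.safe_dump(passthrough_config, f, sort_keys=False)

# then, roughly: mergekit-yaml passthrough.yml ./llama2-deep
```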

good luck!
