Oh man

opened by BoscoTheDog

Not again ;-)

Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?

If so, does it run with CPU, GPU, or both?

And does Transformers.js allow for the loading of lora extensions? I was toying with it because I was interested in how this experiment enabled that: https://www.reddit.com/r/LocalLLaMA/comments/1dsfpb4/gemini_nano_running_locally_in_brave_using/

> Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?

That's a goal, but for now this repo will only "signal" to the browser to use the window.ai functionality, if present.

> If so, does it run with CPU, GPU, or both?

It will run on GPU

> And does Transformers.js allow for the loading of lora extensions?

Not currently - this is a limitation of ONNX (/ ONNX Runtime Web), so feel free to open feature requests there! :)

Would my script, which converts the MediaPipe format Gemini Nano to fp32 safetensors, be helpful? https://github.com/ethanc8/Gemini-Nano/blob/master/playground/converter.py

I haven't really tested it, since it takes more than 2 hours to finish dequantizing, and runs out of memory while it tries to save to safetensors. I'm trying various mitigations to get around this.
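
One possible mitigation, as a rough sketch rather than anything converter.py currently does: write the dequantized tensors out as several smaller safetensors shards plus an index file, so only one shard's worth of arrays has to sit in RAM at save time. `iter_tensors` here is a hypothetical generator yielding (name, numpy array) pairs one tensor at a time.

```python
import json
from safetensors.numpy import save_file

def save_sharded(iter_tensors, prefix, shard_bytes=2 * 1024**3):
    """Write tensors into ~2GB safetensors shards plus a weight_map index."""
    shard, size, shard_id, weight_map = {}, 0, 0, {}
    for name, arr in iter_tensors:
        # Flush the current shard before it grows past the size budget.
        if shard and size + arr.nbytes > shard_bytes:
            save_file(shard, f"{prefix}-{shard_id:05d}.safetensors")
            shard, size, shard_id = {}, 0, shard_id + 1
        shard[name] = arr
        size += arr.nbytes
        weight_map[name] = f"{prefix}-{shard_id:05d}.safetensors"
    if shard:
        save_file(shard, f"{prefix}-{shard_id:05d}.safetensors")
    # Index file so loaders know which shard holds which tensor.
    with open(f"{prefix}.index.json", "w") as f:
        json.dump({"weight_map": weight_map}, f)
```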

That is indeed very useful! If you can get a Gemma model running with those weights, I can convert it to ONNX and get it running with Transformers.js!
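
For anyone who wants to try, a very rough sketch of what "getting a Gemma model running with those weights" could look like, assuming the dequantized state dict roughly matches Gemma's layout. The config values, key names, and file name here are guesses, not confirmed facts about the checkpoint.

```python
from transformers import GemmaConfig, GemmaForCausalLM
from safetensors.torch import load_file

# Hypothetical config: 256128x2048 matches the embedding shape discussed later in
# the thread; the remaining sizes would need to be matched to the real checkpoint too.
config = GemmaConfig(vocab_size=256128, hidden_size=2048)
model = GemmaForCausalLM(config)

state_dict = load_file("gemini_nano.safetensors")  # hypothetical converter output
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"{len(missing)} missing / {len(unexpected)} unexpected keys")
```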

@ethanc8 Cool!

I tried running the script, but got an error:

python3 convert_gemini.py weights.bin gemini_nano.safetensors fp16

model: tflite.Model.Model = tflite.Model.Model.GetRootAs(buf)

I changed that to `model: tflite.Model = tflite.Model.GetRootAs(buf)` and got a bit further:

return packer_type.unpack_from(memoryview_type(buf), head)[0]
struct.error: unpack_from requires a buffer of at least 1802465126 bytes for unpacking 4 bytes at offset 1802465122 (actual buffer size is 824)

Which means I have ridiculously little memory available, I take it? :-D

@BoscoTheDog You need to enter the conda environment and use `converter.py`. Also, `tflite.Model` is a module, not a class (it's located in `playground/tflite/Model.py`), so we need to use `tflite.Model.Model`. Finally, the fact that your buffer size is 824 means that you opened an 824-byte file instead of the Gemini Nano weights. Check what's actually inside `weights.bin`.
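
For a quick sanity check that the weights file parses at all with the bindings generated into `playground/tflite/`, something like the following should work (the accessor names assume the usual flatbuffers codegen; adjust if the generated `Model.py` differs):

```python
from tflite import Model  # Model is a module; the class inside is Model.Model

with open("weights.bin", "rb") as f:
    buf = f.read()

model = Model.Model.GetRootAs(buf)        # note the doubled name: module.class
print("schema version:", model.Version())
print("subgraphs:", model.SubgraphsLength())
```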

I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.

Do we have much of any knowledge about what it'd take to restore multimodal support to this model?

I assume that they're using a ViT-VQGAN for their image decoder (the other ways I know about to use transformers for image generation use dVAE, VQVAE, or VQGAN, and the only image-generation research they cited in the architecture paragraph was OpenAI DALL-E, which uses dVAE, and Google Parti, which uses ViT-VQGAN), and I'd hope that the input tokens and output tokens are from the same vocabulary, so the image encoder should also be a ViT-VQGAN. They mentioned that they used a Google USM for the speech encoder. It might be useful if we could get the model to generate image tokens.

I'm also thinking of trying to restore the image output on Meta Chameleon, which should be much easier because they released the VQGAN; I think they must've just fine-tuned the model to avoid generating images after giving it the ability to do so. Maybe the LoRA adapter which ships with Gemini Nano does something similar, so running the model without the LoRA adapter might cause it to generate image tokens if you prompt it to. I'm really not sure, though.


> I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.

I wish, but your Actions runs max out at a 6-hour hard cap and won't run any longer. That's sad; I would love an FP32 PyTorch/safetensors-format Gemini Nano.

@piotr25691 Thanks! Yeah, the newer scripts should be much faster, but they seem more likely to cause OOM.

Good news, I actually got it converted! Check it out here!

good job @QuietImpostor

I guess I'll create the FP32 pytorch_model.bin while you hold the FP32 safetensors format

Very cool! I have some json files I'll create a PR for.

There's this 256128x2048 tensor that is likely the image recognition tensor; how long does it take to convert it to FP32?

@piotr25691 That's the embedding layer :) (see here)


(256128x2048 = 524550144)
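
A small way to double-check this in the dequantized checkpoint, assuming the safetensors file from this thread: scan the tensor shapes and flag the 256128x2048 one without loading everything into memory.

```python
from safetensors import safe_open

with safe_open("gemini_nano.safetensors", framework="pt") as f:
    for name in f.keys():
        shape = f.get_slice(name).get_shape()  # shape only, no tensor data loaded
        if shape == [256128, 2048]:
            print("embedding candidate:", name, shape)
```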

Thank you, as I was worried what the heck it was; it took 20+ hours to convert from int4 to FP32 lol

I think it could be possible to add the Gemini architecture to transformers (.py) 😊

We can bring it to llama.cpp by using the PyTorch format and making GGUFs. It would also be nice to make a "gemini_surgery.py", which would be fundamentally the same as "llava_surgery_v2.py" but made for Gemini Nano image model extraction instead

> Gemini Nano image model extraction

It looked like the Gemini Nano weights were just the LLM (transformer decoder), and not the image model. I think the image model is likely to be some kind of VQVAE (I'd make a guess and say it's a ViT-VQGAN because that's what they used in Parti). If anyone knows how VQVAE weights look and spots a VQVAE in the model, that'd be helpful, but otherwise we'd need to create a new training algorithm to train a VQVAE on the tokens outputted by Gemini Nano, or just finetune both the Gemini Nano LLM and a third-party VQVAE to work together, ignoring the image tokens designed for Google's VQVAE.

For speech, we'd need a Universal Speech Model, which Google claimed was 2B parameters. I don't think they ever released the USM weights, which would mean that we'd somehow have to train a USM that outputted exactly the same tokens, or finetune Gemini Nano on our new USM.

I think that finetuning Gemini Nano on new tokenizers for the same modalities will likely cause it to randomly switch between the two kinds of tokens, producing erratic results. So unless we can find Google's VQVAE and USM, or we wait for them to deploy those to Android and extract them from a rooted Pixel 8, it might not be useful to use Gemini Nano for non-text modalities (which unfortunately is really its main advantage given that it's quite bad at text-only tasks, at least according to official benchmark results).

How does it compare against Gemma 2B and the fine-tuned Sappha 2B, which is fundamentally the same thing?

| Benchmark | Gemma 2B | Sappha 2B | Gemini Nano |
|---|---|---|---|
| MMLU (5-shot) | 36.98 | 38.02 | ??? |
| HellaSwag (0-shot) | 49.22 | 51.70 | ??? |
| PIQA (1-shot) | 75.08 | 75.46 | ??? |
| TruthfulQA (0-shot) | 37.51 | 31.65 | ??? |

We will need to benchmark it ourselves.
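
One way that could look, assuming the converted checkpoint can be loaded through transformers (e.g. with a Gemma-style config as discussed above); the model path is a placeholder. This uses lm-evaluation-harness, as a sketch rather than a tested recipe.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./gemini-nano-fp32,dtype=float32",  # placeholder path
    tasks=["hellaswag"],   # 0-shot HellaSwag, to mirror the table above
    num_fewshot=0,
    batch_size=4,
)
print(results["results"]["hellaswag"])
```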

@QuietImpostor where did you get the weights from? Which version of Chrome Canary?

I think it's the same weights that were on your GitHub repository which provides the base quant

> We will need to benchmark it ourselves.

And due to the loss of precision from FP32 to int4 and back to FP32, benchmark scores will be slightly reduced

> @QuietImpostor where did you get the weights from? Which version of Chrome Canary?

I followed the instructions from your GitHub, so version 128.0.6557.0.

@ethanc8

There's a link to an 'adapter' here, perhaps it's useful:
https://www.reddit.com/r/LocalLLaMA/comments/1dsfpb4/gemini_nano_running_locally_in_brave_using/

Direct links:

https://huggingface.co/wave-on-discord/gemini-nano

https://huggingface.co/wave-on-discord/gemini-nano-adapter

This is the code I've been using to get it running in browsers other than Chrome.

> This is the code I've been using to get it running in browsers other than Chrome.

This never worked with my Chromium browsers, as it kept spitting out some internal "assertion id >= 0 failed, -1 != 0" error, whatever that meant, but regular models like the Gemma ones linked by Google worked

And there is no difference between the weights from @wave-on-discord and the ones provided by @ethanc8, as the hashes match on both.

> And there is no difference between the weights from @wave-on-discord and the ones provided by @ethanc8, as the hashes match on both.

Nice! Then we know that we have GEMINI_XS version 2024.06.05.2205.

> https://huggingface.co/wave-on-discord/gemini-nano-adapter

Maybe I'll go try reading those protobufs. They might include a description of the adapter's purpose.
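
If anyone else wants to poke at them, something like this dumps a protobuf's raw field structure without needing the schema (requires `protoc` on PATH; the adapter file name is only a guess):

```python
import subprocess

# protoc --decode_raw reads a binary protobuf message from stdin and prints
# its field numbers and values, which helps guess what the adapter contains.
with open("adapter.binarypb", "rb") as f:  # hypothetical file name
    raw = subprocess.run(["protoc", "--decode_raw"], stdin=f,
                         capture_output=True, check=True)

print(raw.stdout.decode("utf-8", errors="replace"))
```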

@Xenova What is the standard way to store quantized weights in safetensors so that they can be converted to quantized ONNX for use in Transformers.js? The quantized weights are channel-wise quantized with the scaling factors in an fp32 array.

> @Xenova What is the standard way to store quantized weights in safetensors so that they can be converted to quantized ONNX for use in Transformers.js? The quantized weights are channel-wise quantized with the scaling factors in an fp32 array.

What happened to us just forcing these weights to fp32 anyway, skipping the scaling factors? Are the scaling factors used to reduce the PPL introduced by the quantization that Google performed?

> What happened to us just forcing these weights to fp32 anyway, skipping the scaling factors? Are the scaling factors used to reduce the PPL introduced by the quantization that Google performed?

Skipping the scaling factors would mean that we are just coercing integers from -7 to +8 (I believe) into fp32. What we did is multiply by the scaling factors.
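
A toy illustration of that channel-wise dequantization: each output channel gets one fp32 scale, and the int4 values are multiplied by it.

```python
import numpy as np

q = np.array([[-7,  3,  8],
              [ 1, -2,  4]], dtype=np.int8)          # int4 values stored in int8
scales = np.array([0.02, 0.05], dtype=np.float32)    # one fp32 scale per channel (row)

w_fp32 = q.astype(np.float32) * scales[:, None]      # multiply each row by its scale
print(w_fp32)
```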

This still carries the loss of precision from the original quantization, since the shipped tensor data was never FP32 to begin with; that adds PPL, and converting to FP32 doesn't restore it. It may only be useful for converting to other formats like GGUF for llama.cpp
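
A tiny demo of that point: once weights have been rounded to a handful of int4 levels, multiplying back by the scale can't recover the original values (the symmetric quantization scheme here is only an assumption for illustration):

```python
import numpy as np

w = np.array([0.1234, -0.9876, 0.5555], dtype=np.float32)  # pretend original fp32 weights
scale = np.abs(w).max() / 7                                 # assumed int4-style scale
q = np.clip(np.round(w / scale), -7, 8).astype(np.int8)     # quantize
w_back = q.astype(np.float32) * scale                       # dequantize

print(w - w_back)   # non-zero: the rounding error is permanent
```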

> I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.

16GB may be very dangerously close to the actual requirements, as everything becomes unresponsive

> 16GB may be very dangerously close to the actual requirements, as everything becomes unresponsive

When I ran it, it used ~84% of my 32GBs, so 16GBs is not even close to what's needed.

> When I ran it, it used ~84% of my 32GBs, so 16GBs is not even close to what's needed.

I managed to page through 90% of the model before it ran out; 20GB might be a good call.

Also, here's a PyTorch format: https://huggingface.co/piotr25691/gemini-nano-pytorch

> I managed to page through 90% of the model before it ran out; 20GB might be a good call.
>
> Also, here's a PyTorch format: https://huggingface.co/piotr25691/gemini-nano-pytorch

Neat! But yea, I think 20-24GBs should be just enough to get by.
