Oh man
Not again ;-)
Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?
If so, does it run with CPU, GPU, or both?
And does Transformers.js allow for the loading of lora extensions? I was toying with it because I was interested in how this experiment enabled that: https://www.reddit.com/r/LocalLLaMA/comments/1dsfpb4/gemini_nano_running_locally_in_brave_using/
Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?
That's a goal, but for now this repo will only "signal" to the browser to use the window.ai functionality, if present.
If so, does it run with CPU, GPU, or both?
It will run on GPU
And does Transformers.js allow for the loading of lora extensions?
Not currently - this is a limitation of ONNX (/ ONNX Runtime Web), so feel free to open feature requests there! :)
Would my script, which converts the MediaPipe format Gemini Nano to fp32 safetensors, be helpful? https://github.com/ethanc8/Gemini-Nano/blob/master/playground/converter.py
I haven't really tested it, since it takes more than 2 hours to finish dequantizing, and runs out of memory while it tries to save to safetensors. I'm trying various mitigations to get around this.
That is indeed very useful! If you can get a gemma model running with those weights, I can convert to ONNX and get it running with transformers.js!
@ethanc8 Cool!
I tried running the script, but got an error:
python3 convert_gemini.py weights.bin gemini_nano.safetensors fp16
model: tflite.Model.Model = tflite.Model.Model.GetRootAs(buf)
I changed that to model: tflite.Model = tflite.Model.GetRootAs(buf)
and got a bit further:
return packer_type.unpack_from(memoryview_type(buf), head)[0]
struct.error: unpack_from requires a buffer of at least 1802465126 bytes for unpacking 4 bytes at offset 1802465122 (actual buffer size is 824)
Which means I have ridiculously little memory available I take it? :-D
@BoscoTheDog
You need to enter the conda environment and use converter.py. Also, tflite.Model is a module, not a class (it's located in playground/tflite/Model.py), so we need to use tflite.Model.Model. Finally, the fact that your buffer size is 824 means that you opened an 824-byte file instead of the Gemini Nano weights. Check what's actually inside weights.bin.
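A quick way to catch the "824-byte file" problem before handing the buffer to the FlatBuffers parser is to check the file size first. This is only a sketch: the function name and the 1 MB threshold are assumptions chosen for illustration, not part of converter.py.

```python
import os


def check_weights_file(path: str, min_bytes: int = 1_000_000) -> int:
    """Return the file size, or raise if the file is too small to be model weights."""
    size = os.path.getsize(path)
    if size < min_bytes:
        # Peek at the start of the file so the error shows what was
        # actually downloaded (for example an error page or a placeholder).
        with open(path, "rb") as f:
            head = f.read(80)
        raise ValueError(f"{path} is only {size} bytes; first bytes: {head!r}")
    return size
```

Calling check_weights_file("weights.bin") before GetRootAs would turn the confusing struct.error into a message that shows what the file actually contains.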
I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.
Do we have much of any knowledge about what it'd take to restore multimodal support to this model? I assume that they're using a ViT-VQGAN for their image decoder (the other ways I know about to use transformers for image generation use dVAE, VQVAE, or VQGAN, and the only image gen research they cited in the architecture paragraph was OpenAI DALL-E using dVAE and Google Parti using ViT-VQGAN), and I'd hope that the input tokens and output tokens are from the same vocabulary, so the image encoder should also be a ViT-VQGAN. They mentioned that they used a Google USM for the speech encoder. It might be useful if we could get the model to generate image tokens. I'm also thinking of trying to restore the image output on Meta Chameleon, which should be much easier because they released the VQGAN, so I think they must've just fine-tuned the model to avoid generating images, after giving it the ability to generate images. Maybe the LoRA adapter which ships with Gemini Nano does something similar, so maybe running the model without the LoRA adapter might cause it to generate image tokens if you prompt it to. I'm really not sure though.
I wish, but GitHub Actions jobs have a hard cap of 6 hours and won't run any longer. That's sad; I would love an FP32 PyTorch/Safetensors version of Gemini Nano.
@piotr25691 Thanks! Yeah, the newer scripts should be much faster, but it seems to be more likely to cause OOM.
good job @QuietImpostor
i guess i will create FP32 pytorch_model.bin while you'll hold FP32 safetensors format
Very cool! I have some json files I'll create a PR for.
There may be this 256128x2048 tensor that is likely the image recognition tensor, how long does it take to make it FP32?
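For a sense of scale, the storage needed for a 256128x2048 tensor at different bit widths can be worked out directly. This is just arithmetic, not a measurement of conversion time; the converter posted later in this thread maps this shape to params.lm.softmax.logits_ffn.w, and the helper name below is made up for illustration.

```python
def tensor_bytes(shape, bits_per_element):
    """Bytes needed to store a dense tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * bits_per_element // 8


shape = (256_128, 2048)
for label, bits in [("int4", 4), ("int8", 8), ("fp32", 32)]:
    size = tensor_bytes(shape, bits)
    print(f"{label}: {size:,} bytes ({size / 2**30:.2f} GiB)")
```

At fp32 this single tensor comes to 2,098,200,576 bytes, roughly 2 GiB, which helps explain why a converter that holds everything in memory at once needs so much RAM.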
Thank you as I was worried what the heck was it as it took 20+ hours to convert from int4 to FP32 lol
I think it could be possible to add the Gemini architecture to transformers (.py) 😊
We could get it into llama.cpp by converting the PyTorch format to GGUF. It would also be nice to make a "gemini_surgery.py", which would be fundamentally the same as "llava_surgery_v2.py" but made for Gemini Nano image model extraction instead
Gemini Nano image model extraction
It looked like the Gemini Nano weights were just the LLM (transformer decoder), and not the image model. I think the image model is likely to be some kind of VQVAE (I'd make a guess and say it's a ViT-VQGAN because that's what they used in Parti). If anyone knows how VQVAE weights look and spots a VQVAE in the model, that'd be helpful, but otherwise we'd need to create a new training algorithm to train a VQVAE on the tokens outputted by Gemini Nano, or just finetune both the Gemini Nano LLM and a third-party VQVAE to work together, ignoring the image tokens designed for Google's VQVAE.
For speech, we'd need a Universal Speech Model, which Google claimed was 2B parameters. I don't think they ever released the USM weights, which would mean that we'd somehow have to train a USM that outputted exactly the same tokens, or finetune Gemini Nano on our new USM.
I think that finetuning Gemini Nano on new tokenizers for the same modalities will likely cause it to randomly switch between the two kinds of tokens, producing erratic results. So unless we can find Google's VQVAE and USM, or we wait for them to deploy those to Android and extract them from a rooted Pixel 8, it might not be useful to use Gemini Nano for non-text modalities (which unfortunately is really its main advantage given that it's quite bad at text-only tasks, at least according to official benchmark results).
We will need to benchmark it ourselves.
@QuietImpostor where did you get the weights from? Which version of Chrome Canary?
I think it's the same weights that were on your GitHub repository which provides the base quant
We will need to benchmark it ourselves.
And due to loss of precision from FP32 to int8 and back to FP32, benchmark scores will be slightly reduced
@QuietImpostor where did you get the weights from? Which version of Chrome Canary?
I followed the instructions from your GitHub, so version 128.0.6557.0.
There's a link to an 'adapter' here, perhaps it's useful:
https://www.reddit.com/r/LocalLLaMA/comments/1dsfpb4/gemini_nano_running_locally_in_brave_using/
Direct links:
https://huggingface.co/wave-on-discord/gemini-nano
https://huggingface.co/wave-on-discord/gemini-nano-adapter
This is the code I've been using to get it to run on other browsers than Chrome.
This never worked in my Chromium browsers: it kept spitting out an internal "assertion id >= 0 failed, -1 != 0" error, whatever that meant. Regular models like the Gemma ones linked by Google worked, though.
And there is no difference of the weights by @wave-on-discord and the ones provided by @ethanc8 , as the hashes match on both.
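For anyone who wants to repeat the comparison locally, here is a minimal sketch of hashing two large files in chunks. The thread doesn't say which hash was compared; SHA-256 is an assumption, and the function names are made up.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB weights never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_contents(path_a: str, path_b: str) -> bool:
    """True if both files have identical SHA-256 digests."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```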
Nice! Then we know that we have GEMINI_XS version 2024.06.05.2205.
Maybe I'll go try reading those protobufs. They might include a description of the adapter's purpose.
@Xenova What is the standard way to store quantized weights in safetensors so that they can be converted to quantized ONNX for use in Transformers.js? The quantized weights are channel-wise quantized with the scaling factors in an fp32 array.
What happened to us trying to force these weights to fp32 anyways, skipping the scaling factors? Is that used to reduce PPL introduced during quantization that Google performed?
Skipping the scaling factors would mean that we are just coercing integers from -8 to +7 (I believe) into fp32. What we did instead is multiply by the scaling factors.
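To make "multiply by the scaling factors" concrete, here is a tiny pure-Python sketch. The nibble order (low nibble first) and the zero point of 8 are taken from the converter posted later in this thread; treating the scales as one value per row is an assumption about what "channel-wise" means here, and the function names are made up.

```python
def unpack_int4(packed: bytes, zero_point: int = 8) -> list:
    """Split each byte into two 4-bit values (low nibble first) and recenter them."""
    values = []
    for byte in packed:
        values.append((byte & 0x0F) - zero_point)
        values.append((byte >> 4) - zero_point)
    return values


def dequantize_rows(packed: bytes, scales: list, cols: int, zero_point: int = 8) -> list:
    """Turn packed int4 data into rows of floats, one scale factor per row."""
    ints = unpack_int4(packed, zero_point)
    rows = len(ints) // cols
    if rows * cols != len(ints) or rows != len(scales):
        raise ValueError("packed data, cols and scales do not agree")
    return [[ints[r * cols + c] * scales[r] for c in range(cols)] for r in range(rows)]


# Two bytes hold four weights; with cols=2 that is a 2x2 matrix.
print(dequantize_rows(bytes([0x80, 0xF8]), scales=[0.5, 2.0], cols=2))
# prints [[-4.0, 0.0], [0.0, 14.0]]
```

Without the scale step, the same bytes would just give the raw integers -8, 0, 0 and 7.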
This still leaves a loss of precision, because the weights were only ever available to us in quantized form: the PPL penalty from quantization remains, and storing the result as FP32 doesn't restore what was lost. It may only be useful for converting to other formats like GGUF for llama.cpp
I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.
16GB may be very dangerously close to the actual requirements, as everything becomes unresponsive
When I ran it, it used ~84% of my 32GBs, so 16GBs is not even close to what's needed.
I managed to page through 90% of the model until it ran out, 20GB might be a good call.
Also, here's a PyTorch format: https://huggingface.co/piotr25691/gemini-nano-pytorch
Neat! But yea, I think 20-24GBs should be just enough to get by.
I have a question.
I also looked at the analyzed content, but where is the lm_head weights of Gemini Nano?
they don't exist because of the fact that this is not a Transformers model and would have to be adapted.
what we did here, is to expand the quantized weight back to FP32 (which does not restore the loss in PPL metrics)
you will have to wait if you need to load this into Llama.cpp and alike.
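The point about FP32 not restoring the loss can be shown with a toy round trip. The symmetric per-row scheme below is an assumption chosen for illustration, not necessarily what Google used, and the numbers are invented.

```python
def quantize_row(row, levels=7):
    """Symmetric per-row quantization to integers in [-levels, levels]."""
    scale = max(abs(x) for x in row) / levels or 1.0
    return [round(x / scale) for x in row], scale


def dequantize_row(ints, scale):
    """Expand the integers back to floats; the rounding error stays baked in."""
    return [i * scale for i in ints]


original = [0.11, -0.52, 0.37, 0.70]
ints, scale = quantize_row(original)
restored = dequantize_row(ints, scale)
errors = [abs(a - b) for a, b in zip(original, restored)]
print(ints)         # the 4-bit integers that actually get stored
print(max(errors))  # nonzero: the float copy is no closer to the original
```

However wide the output type is, the restored values can only land on multiples of the scale, so the rounding error introduced at quantization time is carried along unchanged.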
After a crap ton of work with Claude 3.5 Sonnet, we finally managed to get a working set of Gemmafied Gemini Nano weights!
Check it out here: QuietImpostor/Gemini-Nano-Gemmafied
(It is still not the best, but it's better "working" than not!)
Edit: I am actively trying to fix converting to gguf, and hopefully also fixing the current issues alongside it.
@QuietImpostor Do you have the script you used to convert the model?
Yup, I'll put it in the repo if you'll allow me a second.
Edit: It's in there as gemmafy_gemini.py
Won't this cause hallucinations, since the architecture doesn't match the weights?
@piotr25691 Oh, most definitely. Take a look at this response:
<unused10><unused0>t exorbitanttt roused|tgtr<unused17>tt<unused91><unused17>t<unused0>trNttttt/t<unused17>tt 1<unused17><unused0><unused47>
Not really the most coherent responses. (I did use a Q4_0 gguf though, but it would likely be the same at the higher quants.)
We are definitely gonna have to finetune it or something to get any bit of coherence out of it.
(Edit: I'm sorry for such the late response.)
I would think the better way would be to add support for the Gemini 1.0 architecture to Transformers and llama.cpp.
Also, I wonder when Google is going to deploy the vision and speech parts of Gemini Nano (I assume they're using a ViT-VQGAN and a USM model)
If you managed to create a gguf, could you perhaps share that on HuggingFace?
I'd like to try running it in the browser through Wllama (https://github.com/ngxson/wllama).
@BoscoTheDog That is actually what I did! I was too lazy to set up llama cpp. And it is already available as a FP16 GGUF in the repo!
Oh god the weird tokens, you're gonna have to retrain the model at this point, or add Gemini architecture support to llama.cpp
@QuietImpostor Ah cool. I can turn the FP16 GGUF into a Q4 without much loss, right? I'll try that.
@BoscoTheDog I already did try a Q4 quant. It produced that mess from earlier. Wouldn’t recommend it personally.
OK. I was just curious.
At the moment the Gemma/Gemini hybrid is basically stupid and shouldn't be considered as a usable model right now.
Either that, or I somehow messed up the weights which seems more likely. I’ll keep working on it when I get a chance.
Update: I made a V2 that should be better! But, I haven’t exactly tested it as of now. Once I get a chance, I will.
Cool! if you create any smaller GGUF quants, please share them :-)
Whenever I get a chance, I will! (I’ll edit this message once I have one up and running.)
I don't know if this will work. Ordinarily Gemma 2 should be exported to BF16, but Gemini Nano is a fake FP32 converted from what is essentially a Q4_0 model, so GGUF quants don't make sense here.
There's a few things wrong with that. First, we're using Gemma 1's arch, not Gemma 2's as a base. Secondly, it's not fake. We did something called upcasting which can make it FP32.
Well why are you using the old architecture, and not the new one, they're probably similar enough. And about upcasting, it does not restore the lost precision, hence why I said "fake FP32", as it doesn't inherit the same level of precision that a FP32 would have.
Ah, I'm sorry for the misinterpretation. And I'm not really sure the difference between Gemma 1's and Gemma 2's arch is going to be all that significant for our purpose.
Update: V2 safetensors are broken, GGUFs don't work. At this point it'd definitely be easier to just support Gemini's architecture which is out of my paygrade.
Yep there is no point trying to force it anyways. We'd rather have llama.cpp adapt for Gemini Nano's architecture
From the Chromium source (HEAD): components/optimization_guide/proto/features/model_prototyping.proto
// Copyright 2024 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

syntax = "proto3";

package optimization_guide.proto;

import "components/optimization_guide/proto/features/common_quality_data.proto";

option optimize_for = LITE_RUNTIME;
option java_package = "org.chromium.components.optimization_guide.features.proto";
option java_outer_classname = "ModelPrototypingProto";

// DO NOT EDIT THIS FILE DIRECTLY!
//
// This file is generated in g3 and then synced to Chrome. Instead, please
// refer to http://go/chrome-intelligence-feature-protos (Google-internal link),
// and then changes will be synced with Chrome automatically.

message ModelPrototypingLoggingData {
  ModelPrototypingRequest request = 1;
  ModelPrototypingResponse response = 2;
}

// Next ID: 4
message ModelPrototypingRequest {
  ModelingInputs modeling_inputs = 1;

  // The series of prompts to send to the model(s). The calls are run in series
  // and the responses can be used in future calls allowing piping the output of
  // one query into the input of the next.
  repeated PrototypingPrompt prototyping_prompts = 2;

  // The responses from previous calls to the model. Can be used in future
  // prompts. Syntax for accessing them is golang text/templates
  // e.g., something like {{index .GetModelResponses 0}}.
  repeated string model_responses = 3;

  // Next ID: 6
  // Defines a single prompt to be sent to the model.
  message PrototypingPrompt {
    // Prompt variables that can be used in the rest of the prompt. These are in
    // addition to any prompt variables defined in the prompt template in the
    // config for the model sequence. Prompt variables are helper functions that
    // can be used in the prompt. For example, a prompt variable could be
    // something like:
    // {{ $funVar := "1" }}
    // This would define a function that can be used in the prompt as
    // {{$funVar}}. The value of the function is "1".
    string prompt_variables = 1;

    // The prompt is composed by inserting the following roles into the prompt
    // template in the order they are defined.
    // Role system is generally the instructions for the model to follow.
    string system_instructions_template = 2;

    // Role context is the information around the user interaction such as page
    // state.
    string context_area_template = 3;

    // Role user is the information from the user such as a user input they
    // typed.
    string user_input_template = 4;

    // Information about the model to use.
    ModelInformation model_information = 5;

    message ModelInformation {
      ModelEnum model_enum = 1;

      enum ModelEnum {
        MODEL_UNSPECIFIED = 0;
        // Returns the filled templates without running an LLM.
        MODEL_RETURN_FILLED_TEMPLATES = 1;
        // The compose s-dense model.
        MODEL_COMPOSE = 2;
      }
    }
  }

  // All the information collected from the browser along with the user input
  // (for features like Compose).
  message BrowserCollectedInformation {
    // The page context of the page the model is acting on.
    PageContext page_context = 1;

    // The inner text of the page the model is acting on (excluding x-origin
    // frames)
    string inner_text = 2;

    // The offset of the focused element into the |inner_text|.
    uint64 inner_text_offset = 3;

    // Custom text that a prototyper can inject into prompts. If the browser
    // collected information is not sufficient, an early stage prototype can
    // build a string in Chrome/colab to be used in the prompt. This allows
    // separation of prompt definition and call specific data.
    repeated string custom_data = 4;
  }

  // Next ID: 3
  // Data specific to the feature.
  message ModelingInputs {
    BrowserCollectedInformation browser_collected_information = 1;
    string user_input = 2;
  }
}

message ModelPrototypingResponse {
  // The series of prompts sent to the model corresponding to the
  // |prototyping_prompts| in the request.
  repeated string model_prompts = 1;

  // The responses from the model corresponding to |model_prompts|.
  repeated string model_responses = 2;
}
Reviving this thread to say that I’ve actually made some rather significant progress! Turns out the conversion code was bugged and making all tensors 1D where they weren’t needed. This time, o1-preview made significant optimizations to the int# to FP and it now completes in at most a minute (minus saving the weights individually which was to save memory). I will be sharing this code as soon as I get the opportunity. But for now, take the repo.
You actually bought ChatGPT Plus just so o1 could fix it? Why o1 of all things?
Also read https://www.huggingface.co/QuietImpostor/Gemini-Nano-Safetensors-V2/discussions/1 for some minor issues.
I’ve had ChatGPT Plus for a while now. And o1-preview is extremely good at debugging in my experience. And I’ll take a look at the discussion.
@QuietImpostor Can you share the conversion code?
laughs in deep seek r1 lite preview :joy:
Oh yes! I totally forgot. Give me a minute and it'll be in the updated repo.
Edit: Found it! updated convert.py
How's the RAM usage on this? Does it flatline your computer due to RAM in use?
Depends on how much RAM you've got. I'd recommend 32GB, as I believe it took around 27GB when I ran it. You might be able to get away with it on Kaggle's 30GB if you wanted to reproduce it yourself.
It's possible to optimize it for memory by processing only one tensor at a time instead of as many as possible.
Oh most definitely, I just went with what got it done quickest.
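The "one tensor at a time" idea can be sketched without any ML libraries. This is a standard-library illustration of the pattern only: it writes raw float32 dumps rather than safetensors, and load_one, the function names and the ".f32" extension are all hypothetical.

```python
import os
from array import array
from typing import Callable, Iterable, Iterator, List, Tuple


def dequantized_tensors(
    names: Iterable[str], load_one: Callable[[str], array]
) -> Iterator[Tuple[str, array]]:
    """Yield (name, values) pairs lazily, so tensors are built one by one."""
    for name in names:
        yield name, load_one(name)


def save_individually(tensors: Iterable[Tuple[str, array]], out_dir: str) -> List[str]:
    """Write each tensor to its own raw file and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, values in tensors:
        path = os.path.join(out_dir, name + ".f32")  # hypothetical extension
        with open(path, "wb") as f:
            values.tofile(f)
        written.append(path)
        # Drop our reference so the array can be freed before the next
        # tensor is loaded, instead of collecting everything in a dict.
        del values
    return written
```

The trade-off is the one mentioned above: peak memory is bounded by the largest single tensor, at the cost of many small output files that have to be merged or indexed afterwards.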
Just asked Claude 3.5 Haiku for this refactor.
I don't know if it works.
import sys
import torch
import safetensors.torch as st
import logging
import tflite.Model
import tflite.SubGraph
from tflite.TensorType import TensorType
# Set up logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    level=logging.INFO
)

# Define scale and size mappings
name_of_tensor_type = {
    0: "FLOAT32",
    9: "INT8 ",
    17: "INT4 ",
}
dtype_for_tensor_type = {
    0: torch.float32,
    9: torch.int8,
    17: torch.uint8,  # Because torch.int4 doesn't exist
}
size_for_tensor_type = {
    0: 4,
    9: 1,
    17: 0.5,
}
def update_target_name(target_name: str) -> str:
    """Updates the target name to match the tensor name convention."""
    def reverse_replace(theStr: str, a, b):
        return theStr.replace(b, a)

    target_name = reverse_replace(target_name, ".weight", ".w")
    target_name = reverse_replace(
        target_name, "model.layers.", "params.lm.transformer.x_layers_"
    )
    target_name = reverse_replace(
        target_name, "mlp.gate_proj", "ff_layer.ffn_layer1_gate"
    )
    target_name = reverse_replace(target_name, "mlp.up_proj", "ff_layer.ffn_layer1")
    target_name = reverse_replace(target_name, "mlp.down_proj", "ff_layer.ffn_layer2")
    target_name = reverse_replace(
        target_name, "post_layer_norm.weight", "post_layer_norm.scale"
    )
    target_name = reverse_replace(
        target_name, "post_attention_layernorm", "post_layer_norm"
    )
    target_name = reverse_replace(
        target_name, "pre_layer_norm.weight", "pre_layer_norm.scale"
    )
    target_name = reverse_replace(target_name, "input_layernorm", "pre_layer_norm")
    target_name = reverse_replace(target_name, "self_attn.q_proj", "self_attention.q")
    target_name = reverse_replace(target_name, "self_attn.k_proj", "self_attention.k")
    target_name = reverse_replace(target_name, "self_attn.v_proj", "self_attention.v")
    target_name = reverse_replace(target_name, "self_attn.o_proj", "self_attention.post")
    target_name = reverse_replace(
        target_name, "model.embed_tokens", "params.lm.softmax.logits_ffn"
    )
    target_name = reverse_replace(target_name, "final_ln.weight", "final_ln.scale")
    target_name = reverse_replace(target_name, "model.norm", "params.lm.final_ln")
    return target_name
def convert_quantized_int4_to_fp(quantized_data, scale_data, dims, dim_scale, dtype):
    zero_point = 8
    # Reshape quantized data to 1D tensor
    quantized_data = quantized_data.view(-1)
    # Extract low and high 4 bits
    low_bits = (quantized_data & 0x0F).type(torch.int8)
    high_bits = (quantized_data >> 4).type(torch.int8)
    # Concatenate low and high bits
    int4_values = torch.stack((low_bits, high_bits), dim=1).view(-1)
    int4_values = int4_values - zero_point  # Adjust zero point
    # Apply scaling
    scaled_data = int4_values.type(dtype) * scale_data
    # Reshape to original dimensions
    scaled_data = scaled_data.view(dims[0], dims[1])
    return scaled_data
def convert_quantized_int8_to_fp(quantized_data, scale_data, dims, dim_scale, dtype):
    zero_point = 0  # Assuming zero_point=0 for int8
    # Reshape quantized data to 1D tensor
    quantized_data = quantized_data.view(-1).type(torch.int8)
    # Handle scale_data based on dim_scale
    if dim_scale:
        # Per-column scaling
        scale_data = scale_data.repeat_interleave(2)
    else:
        # Per-row scaling
        scale_data = scale_data.repeat_interleave(2)
    # Convert scale_data to the same dtype
    scale_data = scale_data.to(dtype=dtype)
    # Apply scaling
    scaled_data = (quantized_data - zero_point).type(dtype) * scale_data
    # Reshape to original dimensions
    scaled_data = scaled_data.view(dims[0], dims[1])
    return scaled_data
def infer_tensor_shape(tensor_name: str, tensor_size: int) -> tuple:
    """Infer tensor shape based on name and size."""
    if (".self_attention.q." in tensor_name
            or ".self_attention.post." in tensor_name) and tensor_size == 4_194_304:
        return (2048, 2048)
    elif (".self_attention.k." in tensor_name
            or ".self_attention.v." in tensor_name) and tensor_size == 524_288:
        return (256, 2048)
    elif (".ff_layer.ffn_layer1_gate." in tensor_name
            or ".ff_layer.ffn_layer1." in tensor_name) and tensor_size == 25_165_824:
        return (12_288, 2048)
    elif ".ff_layer.ffn_layer2." in tensor_name and tensor_size == 25_165_824:
        return (2048, 12_288)
    elif "params.lm.softmax.logits_ffn.w" == tensor_name and tensor_size == 524_550_144:
        return (256_128, 2048)
    # LayerNorm weights are of shape {1, 1, 2048}
    elif "layer_norm" in tensor_name and tensor_size == 2048:
        return (1, 1, 2048)
    else:
        # Default to None if shape cannot be inferred
        return None
def process_single_tensor(model, buf, tensor, scale_tensors, target_dtype, logger):
    """Process a single tensor with memory efficiency."""
    tensor_name = tensor.Name().decode("utf-8")
    tensor_type = tensor.Type()
    # Get buffer metadata
    buffer_meta = model.Buffers(tensor.Buffer())
    # Infer tensor size and shape
    tensor_buf_size = tensor.Shape(0)
    tensor_size = tensor_buf_size // size_for_tensor_type[tensor_type]
    tensor_dims = infer_tensor_shape(tensor_name, tensor_size)
    # Update target name for conversion
    target_name = update_target_name(tensor_name)
    # Process based on tensor type
    if tensor_type == TensorType.FLOAT32:
        # Load FP32 tensor
        tensor_data = torch.frombuffer(
            buffer=buf,
            dtype=torch.float32,
            offset=buffer_meta.Offset(),
            count=buffer_meta.Size() // 4
        )
        # Reshape if dimensions are known
        if tensor_dims is not None:
            tensor_data = tensor_data.reshape(tensor_dims)
        # Convert dtype if needed
        if target_dtype != torch.float32:
            tensor_data = tensor_data.to(dtype=target_dtype)
        return target_name, tensor_data
    elif tensor_type in [TensorType.INT8, TensorType.INT4]:
        # Determine scale tensor
        scale_tensor_name = tensor_name + "_quantized_scale"
        scale_buf_meta = model.Buffers(scale_tensors[scale_tensor_name].Buffer())
        # Load quantized data
        if tensor_type == TensorType.INT8:
            quantized_buf = torch.frombuffer(
                buffer=buf,
                dtype=torch.int8,
                offset=buffer_meta.Offset(),
                count=buffer_meta.Size()
            )
            quantization_func = convert_quantized_int8_to_fp
        else:  # INT4
            quantized_buf = torch.frombuffer(
                buffer=buf,
                dtype=torch.uint8,
                offset=buffer_meta.Offset(),
                count=buffer_meta.Size()
            )
            quantization_func = convert_quantized_int4_to_fp
        # Load scale data
        scale_buf = torch.frombuffer(
            buffer=buf,
            dtype=torch.float32,
            offset=scale_buf_meta.Offset(),
            count=scale_buf_meta.Size() // 4
        )
        # Special handling for 'logits_ffn.w_quantized_scale'
        if 'logits_ffn.w_quantized_scale' in tensor_name:
            if scale_buf.numel() % 2 != 0:
                logger.error(f"Scale data size for {tensor_name} is not even. Cannot average.")
                return None
            scale_data = scale_buf.view(-1, 2).mean(dim=1)  # Average every two scale factors
            scale_data = scale_data.repeat_interleave(2)
        else:
            # General handling: per-row scaling, repeat each scale factor twice
            scale_data = scale_buf.repeat_interleave(2)
        # Dequantize tensor
        tensor_data = quantization_func(
            quantized_data=quantized_buf,
            scale_data=scale_data,
            dims=tensor_dims,
            dim_scale=0,
            dtype=target_dtype
        )
        return target_name, tensor_data
    return None
def main():
    # Check command-line arguments
    if len(sys.argv) < 3:
        print("Usage: python converter.py <path_to_tflite_model> <output_safetensors_file> [fp32|fp16|bf16]")
        sys.exit(1)
    tflite_model_path = sys.argv[1]
    output_safetensors_path = sys.argv[2]
    dtype_arg = sys.argv[3] if len(sys.argv) >= 4 else "fp32"
    if dtype_arg == "fp32":
        TARGET_DTYPE = torch.float32
    elif dtype_arg == "fp16":
        TARGET_DTYPE = torch.float16
    elif dtype_arg == "bf16":
        TARGET_DTYPE = torch.bfloat16
    else:
        print("Unsupported dtype. Choose from fp32, fp16, bf16.")
        sys.exit(1)
    logger.info(f"Starting conversion with TARGET_DTYPE={TARGET_DTYPE}")
    # Read the TFLite model
    with open(tflite_model_path, "rb") as input_file:
        buf = bytearray(input_file.read())
    model: tflite.Model.Model = tflite.Model.Model.GetRootAs(buf)
    graph: tflite.SubGraph.SubGraph = model.Subgraphs(0)
    # Identify and sort out scale tensor references
    scale_tensors = {}
    for i in range(graph.TensorsLength()):
        tensor = graph.Tensors(i)
        tensor_name = tensor.Name().decode("utf-8")
        if tensor_name.endswith("_quantized_scale"):
            scale_tensors[tensor_name] = tensor
    # Dictionary to hold processed tensors
    tensor_dict = {}
    # Process each tensor individually
    for i in range(graph.TensorsLength()):
        # Get the tensor and its type
        tensor = graph.Tensors(i)
        tensor_type: TensorType = tensor.Type()
        # Skip scale tensors
        if tensor.Name().decode("utf-8").endswith("_quantized_scale"):
            continue
        # Process tensor individually
        result = process_single_tensor(
            model, buf, tensor, scale_tensors,
            TARGET_DTYPE, logger
        )
        # Store processed tensor
        if result:
            target_name, tensor_data = result
            tensor_dict[target_name] = tensor_data
            # Log memory usage for tracking
            logger.info(f"Processed: {target_name} - Shape: {tensor_data.shape}")
    # Save to safetensors
    logger.info(f"Saving to {output_safetensors_path}...")
    st.save_file(tensor_dict, output_safetensors_path)
    logger.info(f"Success! Saved to {output_safetensors_path}")


if __name__ == "__main__":
    main()
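For reference, the renaming that update_target_name performs can also be written as an ordered table, which makes it easier to try on sample names without torch or the tflite module. This is a restatement under the assumption that the tensor names look like the ones in the script above; to_hf_name and RENAMES are made-up names.

```python
# (source substring, replacement) pairs, applied in order. The order matters:
# "ff_layer.ffn_layer1_gate" has to be handled before "ff_layer.ffn_layer1".
RENAMES = [
    (".w", ".weight"),
    ("params.lm.transformer.x_layers_", "model.layers."),
    ("ff_layer.ffn_layer1_gate", "mlp.gate_proj"),
    ("ff_layer.ffn_layer1", "mlp.up_proj"),
    ("ff_layer.ffn_layer2", "mlp.down_proj"),
    ("post_layer_norm.scale", "post_layer_norm.weight"),
    ("post_layer_norm", "post_attention_layernorm"),
    ("pre_layer_norm.scale", "pre_layer_norm.weight"),
    ("pre_layer_norm", "input_layernorm"),
    ("self_attention.q", "self_attn.q_proj"),
    ("self_attention.k", "self_attn.k_proj"),
    ("self_attention.v", "self_attn.v_proj"),
    ("self_attention.post", "self_attn.o_proj"),
    ("params.lm.softmax.logits_ffn", "model.embed_tokens"),
    ("final_ln.scale", "final_ln.weight"),
    ("params.lm.final_ln", "model.norm"),
]


def to_hf_name(name: str) -> str:
    """Apply every rename in order and return the resulting tensor name."""
    for old, new in RENAMES:
        name = name.replace(old, new)
    return name


print(to_hf_name("params.lm.transformer.x_layers_0.self_attention.q.w"))
# prints model.layers.0.self_attn.q_proj.weight
```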