Oh man

#1
opened by BoscoTheDog

Not again ;-)

Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?

If so, does it run with CPU, GPU, or both?

And does Transformers.js allow for the loading of lora extensions? I was toying with it because I was interested in how this experiment enabled that: https://www.reddit.com/r/LocalLLaMA/comments/1dsfpb4/gemini_nano_running_locally_in_brave_using/

Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?

That's a goal, but for now this repo will only "signal" to the browser to use the window.ai functionality, if present.

If so, does it run with CPU, GPU, or both?

It will run on the GPU.

And does Transformers.js allow for the loading of lora extensions?

Not currently - this is a limitation of ONNX (/ ONNX Runtime Web), so feel free to open feature requests there! :)

Would my script, which converts the MediaPipe format Gemini Nano to fp32 safetensors, be helpful? https://github.com/ethanc8/Gemini-Nano/blob/master/playground/converter.py

I haven't really tested it, since it takes more than 2 hours to finish dequantizing, and runs out of memory while it tries to save to safetensors. I'm trying various mitigations to get around this.

That is indeed very useful! If you can get a Gemma model running with those weights, I can convert it to ONNX and get it running with Transformers.js!
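
Something like this is roughly what I'd try for loading the converted weights into a Gemma architecture (completely untested; the config values below are guesses and will need adjusting to the real tensor shapes, and the remapped names may not line up exactly):

from safetensors.torch import load_file
from transformers import GemmaConfig, GemmaForCausalLM

# Untested sketch: all config values are guesses, to be adjusted to the
# shapes found in the converted checkpoint
config = GemmaConfig(
    vocab_size=256_128,
    hidden_size=2048,
    intermediate_size=12_288,
    num_hidden_layers=32,     # guess; count the layer indices in the checkpoint
    num_attention_heads=8,
    num_key_value_heads=1,
    head_dim=256,
)

model = GemmaForCausalLM(config)
state_dict = load_file("gemini_nano.safetensors")

# strict=False so mismatches are reported instead of raising
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"{len(missing)} missing keys, {len(unexpected)} unexpected keys")

If the missing/unexpected key lists are small, the rest is mostly a naming exercise.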

@ethanc8 Cool!

I tried running the script, but got an error:

python3 convert_gemini.py weights.bin gemini_nano.safetensors fp16

model: tflite.Model.Model = tflite.Model.Model.GetRootAs(buf)

I changed that to model: tflite.Model = tflite.Model.GetRootAs(buf) and got a bit further:

return packer_type.unpack_from(memoryview_type(buf), head)[0]
struct.error: unpack_from requires a buffer of at least 1802465126 bytes for unpacking 4 bytes at offset 1802465122 (actual buffer size is 824)

Which means I have ridiculously little memory available I take it? :-D

@BoscoTheDog You need to enter the conda environment and use converter.py. Also, tflite.Model is a module, not a class (it's located in playground/tflite/Model.py), so we need to use tflite.Model.Model. Finally, the fact that your buffer size is 824 means that you opened an 824-byte file instead of the Gemini Nano weights. Check what's actually inside weights.bin.
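
If you want a quick sanity check before running the converter, something like this will tell you whether weights.bin is plausibly the model (just a sketch, assuming the weights are a plain TFLite flatbuffer, which carries the file identifier "TFL3" at bytes 4-8):

import os

path = "weights.bin"
print(f"{path}: {os.path.getsize(path) / 1e9:.2f} GB")  # should be several GB, not 824 bytes

with open(path, "rb") as f:
    header = f.read(8)

# Flatbuffer file identifiers sit at byte offset 4; TFLite uses "TFL3"
print("looks like a TFLite flatbuffer:", header[4:8] == b"TFL3")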

I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.

Do we have much of any knowledge about what it'd take to restore multimodal support to this model? I assume they're using a ViT-VQGAN for their image decoder (the other ways I know of to use transformers for image generation use a dVAE, VQVAE, or VQGAN, and the only image-generation research they cited in the architecture paragraph was OpenAI DALL-E, which uses a dVAE, and Google Parti, which uses a ViT-VQGAN). I'd hope that the input tokens and output tokens come from the same vocabulary, so the image encoder should also be a ViT-VQGAN. They mentioned that they used a Google USM for the speech encoder.

It might be useful if we could get the model to generate image tokens. I'm also thinking of trying to restore image output on Meta Chameleon, which should be much easier because they released the VQGAN, so I think they must have just fine-tuned the model to avoid generating images after giving it the ability to generate them. Maybe the LoRA adapter that ships with Gemini Nano does something similar, so running the model without the LoRA adapter might cause it to generate image tokens if you prompt it to. I'm really not sure, though.

@QuietImpostor Can you share the conversion code?

Reviving this thread to say that I've actually made some rather significant progress! Turns out the conversion code was bugged and was flattening tensors to 1D where it shouldn't have. This time, o1-preview made significant optimizations to the intN-to-FP conversion, and it now completes in at most a minute (not counting saving the weights individually, which was done to save memory). I will be sharing this code as soon as I get the opportunity. But for now, take the repo.
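
Roughly speaking, the hot path is just unpacking the packed int4 nibbles and applying the scales, which torch can do fully vectorized instead of looping in Python; I'd guess that's where most of the speed-up comes from. A toy illustration (made-up bytes, scales, and shapes, not the real layout):

import torch

# 4 packed bytes = 8 int4 values, low nibble first, with an assumed zero point of 8
packed = torch.tensor([0x21, 0x43, 0x65, 0x87], dtype=torch.uint8)
scales = torch.tensor([0.1, 0.2])  # one made-up scale per row of a 2x4 matrix

low = (packed & 0x0F).to(torch.int8)
high = (packed >> 4).to(torch.int8)
values = torch.stack((low, high), dim=1).reshape(-1) - 8  # interleave and shift the zero point

dequant = values.to(torch.float32).reshape(2, 4) * scales[:, None]
print(dequant)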

You actually bought ChatGPT Plus just so o1 could fix it? Why o1 of all things?

Also read https://www.huggingface.co/QuietImpostor/Gemini-Nano-Safetensors-V2/discussions/1 for some minor issues.

I’ve had ChatGPT Plus for a while now. And o1-preview is extremely good at debugging in my experience. And I’ll take a look at the discussion.

laughs in DeepSeek R1 Lite Preview :joy:

@QuietImpostor Can you share the conversion code?

Oh yes! I totally forgot. Give me a minute and it'll be in the updated repo.

Edit: Found it! updated convert.py

How's the RAM usage on this? Does it flatline your computer due to RAM in use?

Depends on how much RAM you've got. I'd recommend 32 GB, as I believe it took around 27 GB when I ran it. You might be able to get away with Kaggle's 30 GB if you wanted to reproduce it yourself.

It's possible to optimize it for memory by processing only one tensor at a time instead of as many as possible.

Oh most definitely, I just went with what got it done quickest.
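
If anyone wants to try the lower-memory route, the rough shape would be something like this (a sketch only; process_tensor stands in for whatever dequantization routine you're using, and the sharding scheme is arbitrary):

import gc
from safetensors.torch import save_file

def convert_in_shards(tensors, process_tensor, shard_size=8):
    """Dequantize one tensor at a time and flush groups of them to separate
    .safetensors shards, so the whole model never has to sit in RAM at once.
    `tensors` yields (name, raw) pairs; `process_tensor` is a stand-in for
    the actual dequantization routine."""
    shard, shard_idx = {}, 0
    for name, raw in tensors:
        shard[name] = process_tensor(name, raw)
        if len(shard) >= shard_size:
            save_file(shard, f"model-{shard_idx:05d}.safetensors")
            shard_idx += 1
            shard.clear()
            gc.collect()  # give the allocator a chance to release the old buffers
    if shard:
        save_file(shard, f"model-{shard_idx:05d}.safetensors")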

Just asked Claude 3.5 Haiku for this refactor.

I don't know if it works.

import sys
import torch
import safetensors.torch as st
import logging
import tflite.Model
import tflite.SubGraph
from tflite.TensorType import TensorType

# Set up logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    level=logging.INFO
)

# Mappings from TFLite tensor type to a readable name, a torch dtype,
# and the storage size in bytes per element
name_of_tensor_type = {
    TensorType.FLOAT32: "FLOAT32",
    TensorType.INT8: "INT8",
    TensorType.INT4: "INT4",
}

dtype_for_tensor_type = {
    TensorType.FLOAT32: torch.float32,
    TensorType.INT8: torch.int8,
    TensorType.INT4: torch.uint8,  # torch has no int4; two values are packed per uint8
}

size_for_tensor_type = {
    TensorType.FLOAT32: 4,
    TensorType.INT8: 1,
    TensorType.INT4: 0.5,
}

def update_target_name(target_name: str) -> str:
    """Maps a TFLite tensor name (params.lm...) to the HF/Gemma-style name
    used in the safetensors output."""
    def reverse_replace(theStr: str, a, b):
        # Note the argument order: this replaces substring b with substring a,
        # so each mapping below reads as (hf_substring, tflite_substring)
        return theStr.replace(b, a)
    
    target_name = reverse_replace(target_name, ".weight", ".w")
    target_name = reverse_replace(target_name, 
        "model.layers.", "params.lm.transformer.x_layers_"
    )

    target_name = reverse_replace(target_name, 
        "mlp.gate_proj", "ff_layer.ffn_layer1_gate"
    )
    target_name = reverse_replace(target_name, "mlp.up_proj", "ff_layer.ffn_layer1")
    target_name = reverse_replace(target_name, "mlp.down_proj", "ff_layer.ffn_layer2")

    target_name = reverse_replace(target_name,
        "post_layer_norm.weight", "post_layer_norm.scale"
    )
    target_name = reverse_replace(target_name,
        "post_attention_layernorm", "post_layer_norm"
    )
    
    target_name = reverse_replace(target_name, 
        "pre_layer_norm.weight", "pre_layer_norm.scale"
    )
    target_name = reverse_replace(target_name, "input_layernorm", "pre_layer_norm")
    
    target_name = reverse_replace(target_name, "self_attn.q_proj", "self_attention.q")
    target_name = reverse_replace(target_name, "self_attn.k_proj", "self_attention.k")
    target_name = reverse_replace(target_name, "self_attn.v_proj", "self_attention.v")
    target_name = reverse_replace(target_name, "self_attn.o_proj", "self_attention.post")
    target_name = reverse_replace(target_name, 
        "model.embed_tokens", "params.lm.softmax.logits_ffn"
    )
    target_name = reverse_replace(target_name, "final_ln.weight", "final_ln.scale")
    target_name = reverse_replace(target_name, "model.norm", "params.lm.final_ln")
    
    return target_name

def convert_quantized_int4_to_fp(quantized_data, scale_data, dims, dim_scale, dtype):
    zero_point = 8

    # Reshape quantized data to 1D tensor
    quantized_data = quantized_data.view(-1)

    # Extract low and high 4 bits
    low_bits = (quantized_data & 0x0F).type(torch.int8)
    high_bits = (quantized_data >> 4).type(torch.int8)

    # Concatenate low and high bits
    int4_values = torch.stack((low_bits, high_bits), dim=1).view(-1)
    int4_values = int4_values - zero_point  # Adjust zero point

    # Apply scaling
    scaled_data = int4_values.type(dtype) * scale_data

    # Reshape to the original dimensions if they are known
    if dims is not None:
        scaled_data = scaled_data.view(dims[0], dims[1])

    return scaled_data

def convert_quantized_int8_to_fp(quantized_data, scale_data, dims, dim_scale, dtype):
    zero_point = 0  # Assuming zero_point=0 for int8

    # Reshape quantized data to 1D tensor
    quantized_data = quantized_data.view(-1).type(torch.int8)
    
    # Expand the scale factors; per-row and per-column scaling (dim_scale)
    # are treated identically here, with each scale factor repeated twice
    scale_data = scale_data.repeat_interleave(2)
    
    # Convert scale_data to the same dtype
    scale_data = scale_data.to(dtype=dtype)

    # Apply scaling
    scaled_data = (quantized_data - zero_point).type(dtype) * scale_data

    # Reshape to the original dimensions if they are known
    if dims is not None:
        scaled_data = scaled_data.view(dims[0], dims[1])

    return scaled_data

def infer_tensor_shape(tensor_name: str, tensor_size: int) -> tuple:
    """Infer tensor shape based on name and size."""
    if (".self_attention.q." in tensor_name
        or ".self_attention.post." in tensor_name) and tensor_size == 4_194_304:
        return (2048, 2048)
    elif (".self_attention.k." in tensor_name
          or ".self_attention.v." in tensor_name) and tensor_size == 524_288:
        return (256, 2048)
    elif (".ff_layer.ffn_layer1_gate." in tensor_name
          or ".ff_layer.ffn_layer1." in tensor_name) and tensor_size == 25_165_824:
        return (12_288, 2048)
    elif ".ff_layer.ffn_layer2." in tensor_name and tensor_size == 25_165_824:
        return (2048, 12_288)
    elif "params.lm.softmax.logits_ffn.w" == tensor_name and tensor_size == 524_550_144:
        return (256_128, 2048)
    # LayerNorm weights are of shape {1, 1, 2048}
    elif "layer_norm" in tensor_name and tensor_size == 2048:
        return (1, 1, 2048)
    else:
        # Default to None if shape cannot be inferred
        return None

def process_single_tensor(model, buf, tensor, scale_tensors, target_dtype, logger):
    """Process a single tensor with memory efficiency."""
    tensor_name = tensor.Name().decode("utf-8")
    tensor_type = tensor.Type()

    # Get buffer metadata
    buffer_meta = model.Buffers(tensor.Buffer())
    
    # Infer the element count and shape: Shape(0) is treated as the flat
    # buffer length in bytes, divided by the per-element size from the table
    tensor_buf_size = tensor.Shape(0)
    tensor_size = int(tensor_buf_size / size_for_tensor_type[tensor_type])
    tensor_dims = infer_tensor_shape(tensor_name, tensor_size)

    # Update target name for conversion
    target_name = update_target_name(tensor_name)

    # Process based on tensor type
    if tensor_type == TensorType.FLOAT32:
        # Load FP32 tensor
        tensor_data = torch.frombuffer(
            buffer=buf, 
            dtype=torch.float32, 
            offset=buffer_meta.Offset(),
            count=buffer_meta.Size() // 4
        )
        
        # Reshape if dimensions are known
        if tensor_dims is not None:
            tensor_data = tensor_data.reshape(tensor_dims)

        # Convert dtype if needed
        if target_dtype != torch.float32:
            tensor_data = tensor_data.to(dtype=target_dtype)
        
        return target_name, tensor_data

    elif tensor_type in [TensorType.INT8, TensorType.INT4]:
        # Determine the matching scale tensor
        scale_tensor_name = tensor_name + "_quantized_scale"
        if scale_tensor_name not in scale_tensors:
            logger.error(f"No scale tensor found for {tensor_name}; skipping.")
            return None
        scale_buf_meta = model.Buffers(scale_tensors[scale_tensor_name].Buffer())
        
        # Load quantized data
        if tensor_type == TensorType.INT8:
            quantized_buf = torch.frombuffer(
                buffer=buf, 
                dtype=torch.int8, 
                offset=buffer_meta.Offset(),
                count=buffer_meta.Size()
            )
            quantization_func = convert_quantized_int8_to_fp
        else:  # INT4
            quantized_buf = torch.frombuffer(
                buffer=buf, 
                dtype=torch.uint8, 
                offset=buffer_meta.Offset(),
                count=buffer_meta.Size()
            )
            quantization_func = convert_quantized_int4_to_fp
        
        # Load scale data
        scale_buf = torch.frombuffer(
            buffer=buf,
            dtype=torch.float32,
            offset=scale_buf_meta.Offset(),
            count=scale_buf_meta.Size() // 4
        )
        
        # Special handling for the 'logits_ffn.w' scale tensor
        if 'logits_ffn.w_quantized_scale' in scale_tensor_name:
            if scale_buf.numel() % 2 != 0:
                logger.error(f"Scale data size for {tensor_name} is not even. Cannot average.")
                return None

            scale_data = scale_buf.view(-1, 2).mean(dim=1)  # Average every two scale factors
            scale_data = scale_data.repeat_interleave(2)
        else:
            # General handling: per-row scaling, repeat each scale factor twice
            scale_data = scale_buf.repeat_interleave(2)
        
        # Dequantize tensor
        tensor_data = quantization_func(
            quantized_data=quantized_buf,
            scale_data=scale_data,
            dims=tensor_dims,
            dim_scale=0,
            dtype=target_dtype
        )
        
        return target_name, tensor_data

    return None

def main():
    # Check command-line arguments
    if len(sys.argv) < 3:
        print("Usage: python converter.py <path_to_tflite_model> <output_safetensors_file> [fp32|fp16|bf16]")
        sys.exit(1)

    tflite_model_path = sys.argv[1]
    output_safetensors_path = sys.argv[2]
    dtype_arg = sys.argv[3] if len(sys.argv) >= 4 else "fp32"

    if dtype_arg == "fp32":
        TARGET_DTYPE = torch.float32
    elif dtype_arg == "fp16":
        TARGET_DTYPE = torch.float16
    elif dtype_arg == "bf16":
        TARGET_DTYPE = torch.bfloat16
    else:
        print("Unsupported dtype. Choose from fp32, fp16, bf16.")
        sys.exit(1)

    logger.info(f"Starting conversion with TARGET_DTYPE={TARGET_DTYPE}")

    # Read the TFLite model
    with open(tflite_model_path, "rb") as input_file:
        buf = bytearray(input_file.read())

    model: tflite.Model.Model = tflite.Model.Model.GetRootAs(buf)
    graph: tflite.SubGraph.SubGraph = model.Subgraphs(0)

    # Identify and sort out scale tensor references
    scale_tensors = {}
    for i in range(graph.TensorsLength()):
        tensor = graph.Tensors(i)
        tensor_name = tensor.Name().decode("utf-8")
        if tensor_name.endswith("_quantized_scale"):
            scale_tensors[tensor_name] = tensor

    # Dictionary to hold processed tensors
    tensor_dict = {}

    # Process each tensor individually
    for i in range(graph.TensorsLength()):
        # Get the tensor and its type
        tensor = graph.Tensors(i)
        tensor_type: TensorType = tensor.Type()

        # Skip scale tensors
        if tensor.Name().decode("utf-8").endswith("_quantized_scale"):
            continue

        # Process tensor individually
        result = process_single_tensor(
            model, buf, tensor, scale_tensors, 
            TARGET_DTYPE, logger
        )

        # Store processed tensor
        if result:
            target_name, tensor_data = result
            tensor_dict[target_name] = tensor_data
            
            # Log memory usage for tracking
            logger.info(f"Processed: {target_name} - Shape: {tensor_data.shape}")

    # Save to safetensors
    logger.info(f"Saving to {output_safetensors_path}...")
    st.save_file(tensor_dict, output_safetensors_path)
    logger.info(f"Success! Saved to {output_safetensors_path}")

if __name__ == "__main__":
    main()
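
If it does run, a quick way to sanity-check the output is to open the safetensors file and list a few tensor names and shapes without loading everything into RAM (the filename is just whatever you passed as the second argument):

from safetensors import safe_open

with safe_open("gemini_nano.safetensors", framework="pt") as f:
    for name in list(f.keys())[:10]:
        print(name, f.get_slice(name).get_shape())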
