Oh man

#1
by BoscoTheDog - opened

Not again ;-)

Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?

If so, does it run with CPU, GPU, or both?

And does Transformers.js allow for the loading of lora extensions? I was toying with it because I was interested in how this experiment enabled that: https://www.reddit.com/r/LocalLLaMA/comments/1dsfpb4/gemini_nano_running_locally_in_brave_using/

Owner

Does this mean Gemini nano can run without MediaPipe, through Transformers.js only?

That's a goal, but for now this repo will only "signal" to the browser to use the window.ai functionality, if present.

If so, does it run with CPU, GPU, or both?

It will run on the GPU.

And does Transformers.js allow for the loading of lora extensions?

Not currently - this is a limitation of ONNX (/ ONNX Runtime Web), so feel free to open feature requests there! :)

Would my script, which converts the MediaPipe format Gemini Nano to fp32 safetensors, be helpful? https://github.com/ethanc8/Gemini-Nano/blob/master/playground/converter.py

I haven't really tested it, since it takes more than 2 hours to finish dequantizing, and runs out of memory while it tries to save to safetensors. I'm trying various mitigations to get around this.
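One possible mitigation for the out-of-memory save (a minimal sketch, not the actual converter; iter_quantized_tensors, dequantize_tensor, and the shard size are placeholders) is to dequantize tensor by tensor and shard the output across several safetensors files instead of holding the full fp32 state dict in RAM:

# Minimal sketch of shard-by-shard saving to limit peak memory.
# iter_quantized_tensors() and dequantize_tensor() are placeholders for
# however the real converter walks and dequantizes the MediaPipe weights.
import numpy as np
from safetensors.numpy import save_file

MAX_SHARD_BYTES = 2 * 1024**3  # ~2 GB per shard (arbitrary choice)

def save_sharded(iter_quantized_tensors, dequantize_tensor, prefix="gemini_nano"):
    shard, shard_bytes, shard_idx = {}, 0, 0
    for name, qtensor in iter_quantized_tensors():
        fp32 = dequantize_tensor(qtensor).astype(np.float32)
        shard[name] = fp32
        shard_bytes += fp32.nbytes
        if shard_bytes >= MAX_SHARD_BYTES:
            save_file(shard, f"{prefix}-{shard_idx:05d}.safetensors")
            shard, shard_bytes = {}, 0  # drop references so the memory can be freed
            shard_idx += 1
    if shard:
        save_file(shard, f"{prefix}-{shard_idx:05d}.safetensors")

With sharding like this, peak memory stays roughly at the size of one shard, and the shards can be merged or loaded file by file later.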

Owner

That is indeed very useful! If you can get a Gemma model running with those weights, I can convert it to ONNX and get it running with Transformers.js!
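For reference, "getting a Gemma model running with those weights" would look roughly like the sketch below: instantiate a Gemma config and load the converted safetensors into it. The config values and file name here are placeholders, not Gemini Nano's real hyperparameters, and the checkpoint's keys have to match Gemma's naming for this to work.

# Rough sketch: load a converted safetensors checkpoint into a Gemma model.
# The config values and file name are placeholders, NOT Gemini Nano's real
# hyperparameters; the checkpoint's keys must also match Gemma's naming.
from transformers import GemmaConfig, GemmaForCausalLM
from safetensors.torch import load_file

config = GemmaConfig(
    vocab_size=256128,      # placeholder
    hidden_size=1536,       # placeholder
    num_hidden_layers=28,   # placeholder
    num_attention_heads=8,  # placeholder
)
model = GemmaForCausalLM(config)

state_dict = load_file("gemini_nano.safetensors")  # the converter's output
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", len(missing), "unexpected keys:", len(unexpected))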

@ethanc8 Cool!

I tried running the script, but got an error:

python3 convert_gemini.py weights.bin gemini_nano.safetensors fp16

model: tflite.Model.Model = tflite.Model.Model.GetRootAs(buf)

I changed that to model: tflite.Model = tflite.Model.GetRootAs(buf) and got a bit further:

return packer_type.unpack_from(memoryview_type(buf), head)[0]
struct.error: unpack_from requires a buffer of at least 1802465126 bytes for unpacking 4 bytes at offset 1802465122 (actual buffer size is 824)

Which means I have ridiculously little memory available I take it? :-D

@BoscoTheDog You need to enter the conda environment and use converter.py. Also, tflite.Model is a module, not a class (it's located in playground/tflite/Model.py), so we need to use tflite.Model.Model. Finally, the fact that your buffer size is 824 means that you opened an 824-byte file instead of the Gemini Nano weights. Check what's actually inside weights.bin.
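A quick way to rule out the wrong-file problem before running the converter (a small sketch, assuming weights.bin is a plain TFLite flatbuffer, which is how converter.py treats it) is to check the file size and the flatbuffer file identifier:

# Sanity check before parsing: the real Gemini Nano weights are multiple GB,
# and a TFLite flatbuffer carries the "TFL3" file identifier at bytes 4..8.
import os, sys

path = sys.argv[1] if len(sys.argv) > 1 else "weights.bin"
size = os.path.getsize(path)
print(f"{path}: {size} bytes ({size / 1e9:.2f} GB)")

with open(path, "rb") as f:
    header = f.read(8)
print("file identifier:", header[4:8])  # expect b'TFL3' for a TFLite flatbuffer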

I am now running the dequantization at https://github.com/ethanc8/Gemini-Nano/actions. I kept running out of memory on my host machine, but hopefully GitHub Actions' 16GB RAM should allow the dequantization to finish successfully.

Do we have any real knowledge of what it'd take to restore multimodal support to this model? I assume they're using a ViT-VQGAN for the image decoder (the other ways I know of to use transformers for image generation rely on a dVAE, VQVAE, or VQGAN, and the only image-generation research cited in the architecture paragraph was OpenAI DALL-E, which uses a dVAE, and Google Parti, which uses a ViT-VQGAN). I'd hope the input and output tokens come from the same vocabulary, so the image encoder should also be a ViT-VQGAN. They mentioned that they used a Google USM for the speech encoder. It might be useful if we could get the model to generate image tokens.

I'm also thinking of trying to restore image output on Meta Chameleon, which should be much easier because they released the VQGAN; I think they must have just fine-tuned the model to avoid generating images after giving it the ability to do so. Maybe the LoRA adapter that ships with Gemini Nano does something similar, so running the model without the LoRA adapter might cause it to generate image tokens if you prompt it to. I'm really not sure, though.

Whenever I get a chance, I will! (I’ll edit this message once I have one up and running.)

I don't know if this will work. Also, ordinarily Gemma 2 would be exported to BF16, but Gemini Nano is a fake FP32 converted from what is essentially a Q4_0 model, which means GGUF quants don't make sense here.

There are a few things wrong with that. First, we're using Gemma 1's architecture as a base, not Gemma 2's. Second, it's not fake: we did something called upcasting, which converts it to FP32.

Well, why are you using the old architecture and not the new one? They're probably similar enough. And about upcasting: it does not restore the lost precision, which is why I said "fake FP32"; the weights don't have the precision a real FP32 model would.
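To make the "fake FP32" point concrete: a Q4_0-style block stores 4-bit codes plus a single scale, so dequantizing can produce at most 16 distinct values per block no matter how wide the output dtype is. A rough numpy illustration (the block size and formula roughly follow llama.cpp's Q4_0 convention; the exact MediaPipe scheme may differ):

# Why upcasting 4-bit weights to FP32 does not restore precision: each
# 32-element block is reconstructed as scale * (q - 8) with q in 0..15,
# so at most 16 distinct FP32 values can appear per block.
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(size=32).astype(np.float32)   # stand-in for the original weights

scale = np.abs(original).max() / 8.0                # one scale per block
q = np.clip(np.round(original / scale) + 8, 0, 15)  # 4-bit codes
upcast = (scale * (q - 8)).astype(np.float32)       # "FP32" after dequantizing

print("distinct values in the upcast block:", len(np.unique(upcast)))  # <= 16
print("max abs error vs. original:", float(np.abs(upcast - original).max()))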


Ah, I'm sorry for the misinterpretation. And I'm not really sure the difference between Gemma 1's and Gemma 2's architecture is going to be all that significant for our purposes.

Update: the V2 safetensors are broken and the GGUFs don't work. At this point it'd definitely be easier to just support Gemini's architecture directly, which is above my pay grade.


Yep, there's no point trying to force it anyway. We'd rather have llama.cpp add support for Gemini Nano's architecture.

From components/optimization_guide/proto/features/model_prototyping.proto in the Chromium source:

// Copyright 2024 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

syntax = "proto3";

package optimization_guide.proto;

import "components/optimization_guide/proto/features/common_quality_data.proto";

option optimize_for = LITE_RUNTIME;
option java_package = "org.chromium.components.optimization_guide.features.proto";
option java_outer_classname = "ModelPrototypingProto";

// DO NOT EDIT THIS FILE DIRECTLY!
//
// This file is generated in g3 and then synced to Chrome. Instead, please
// refer to http://go/chrome-intelligence-feature-protos (Google-internal link),
// and then changes will be synced with Chrome automatically.

message ModelPrototypingLoggingData {
  ModelPrototypingRequest request = 1;
  ModelPrototypingResponse response = 2;
}

// Next ID: 4
message ModelPrototypingRequest {
  ModelingInputs modeling_inputs = 1;

  // The series of prompts to send to the model(s). The calls are run in series
  // and the responses can be used in future calls allowing piping the output of
  // one query into the input of the next.
  repeated PrototypingPrompt prototyping_prompts = 2;

  // The responses from previous calls to the model. Can be used in future
  // prompts. Syntax for accessing them is golang text/templates
  // e.g., something like {{index .GetModelResponses 0}}.
  repeated string model_responses = 3;

  // Next ID: 6
  // Defines a single prompt to be sent to the model.
  message PrototypingPrompt {
    // Prompt variables that can be used in the rest of the prompt. These are in
    // addition to any prompt variables defined in the prompt template in the
    // config for the model sequence. Prompt variables are helper functions that
    // can be used in the prompt. For example, a prompt variable could be
    // something like:
    //   {{ $funVar := "1" }}
    // This would define a function that can be used in the prompt as
    // {{$funVar}}. The value of the function is "1".
    string prompt_variables = 1;

    // The prompt is composed by inserting the following roles into the prompt
    // template in the order they are defined.
    // Role system is generally the instructions for the model to follow.
    string system_instructions_template = 2;

    // Role context is the information around the user interaction such as page
    // state.
    string context_area_template = 3;

    // Role user is the information from the user such as a user input they
    // typed.
    string user_input_template = 4;

    // Information about the model to use.
    ModelInformation model_information = 5;

    message ModelInformation {
      ModelEnum model_enum = 1;

      enum ModelEnum {
        MODEL_UNSPECIFIED = 0;
        // Returns the filled templates without running an LLM.
        MODEL_RETURN_FILLED_TEMPLATES = 1;
        // The compose s-dense model.
        MODEL_COMPOSE = 2;
      }
    }
  }

  // All the information collected from the browser along with the user input
  // (for features like Compose).
  message BrowserCollectedInformation {
    // The page context of the page the model is acting on.
    PageContext page_context = 1;

    // The inner text of the page the model is acting on (excluding x-origin
    // frames)
    string inner_text = 2;

    // The offset of the focused element into the |inner_text|.
    uint64 inner_text_offset = 3;

    // Custom text that a prototyper can inject into prompts. If the browser
    // collected information is not sufficient, an early stage prototype can
    // build a string in Chrome/colab to be used in the prompt. This allows
    // separation of prompt definition and call specific data.
    repeated string custom_data = 4;
  }

  // Next ID: 3
  // Data specific to the feature.
  message ModelingInputs {
    BrowserCollectedInformation browser_collected_information = 1;
    string user_input = 2;
  }
}

message ModelPrototypingResponse {
  // The series of prompts sent to the model corresponding to the
  // |prototyping_prompts| in the request.
  repeated string model_prompts = 1;

  // The responses from the model corresponding to |model_prompts|.
  repeated string model_responses = 2;
}

https://chromium.googlesource.com/chromium/src/+/HEAD/components/optimization_guide/proto/features/model_prototyping.proto
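For anyone who wants to poke at this interface: a request is a ModelPrototypingRequest holding one or more PrototypingPrompt templates, and earlier model responses can be spliced into later prompts with Go text/template syntax such as {{index .GetModelResponses 0}}. Below is a hypothetical sketch using protoc-generated Python bindings; the module name model_prototyping_pb2 is an assumption (Chromium itself consumes this proto from C++):

# Hypothetical sketch: a two-step ModelPrototypingRequest where the second
# prompt reuses the first model response via Go text/template syntax.
# Assumes protoc-generated Python bindings named model_prototyping_pb2.
import model_prototyping_pb2 as pb

req = pb.ModelPrototypingRequest()
req.modeling_inputs.user_input = "Write a short thank-you note."

first = req.prototyping_prompts.add()
first.system_instructions_template = "You draft text for the user."
first.user_input_template = "Write a short thank-you note."
first.model_information.model_enum = (
    pb.ModelPrototypingRequest.PrototypingPrompt.ModelInformation.MODEL_COMPOSE
)

second = req.prototyping_prompts.add()
second.system_instructions_template = "Shorten the previous draft."
second.user_input_template = "{{index .GetModelResponses 0}}"  # pipe step 1's output in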

Reviving this thread to say that I've actually made some rather significant progress! It turns out the conversion code was bugged and was flattening all tensors to 1D where it shouldn't have. This time, o1-preview made significant optimizations to the int#-to-FP conversion, and it now completes in at most a minute (not counting saving the weights individually, which was done to save memory). I will share the code as soon as I get the opportunity, but for now, take the repo.
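For anyone reproducing this, the gist of the shape fix is that each flat dequantized buffer has to be reshaped to the shape recorded in the TFLite tensor metadata instead of being left 1D. A rough sketch using the repo's generated flatbuffer bindings (dequantize_buffer is a placeholder for the actual dequantization step):

# Rough sketch of the shape fix: reshape each dequantized buffer to the shape
# stored in the TFLite tensor metadata instead of leaving it 1D.
# dequantize_buffer is a placeholder for the real dequantization routine.
import numpy as np
import tflite.Model

def iter_reshaped_tensors(buf, dequantize_buffer):
    model = tflite.Model.Model.GetRootAs(buf)
    subgraph = model.Subgraphs(0)
    for i in range(subgraph.TensorsLength()):
        tensor = subgraph.Tensors(i)
        name = tensor.Name().decode("utf-8")
        shape = tensor.ShapeAsNumpy()                    # shape recorded in the flatbuffer
        raw = model.Buffers(tensor.Buffer()).DataAsNumpy()
        flat = dequantize_buffer(raw).astype(np.float32)
        yield name, flat.reshape(shape)                  # the fix: keep the original shape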


You actually bought ChatGPT Plus just so o1 could fix it? Why o1 of all things?

Also read https://www.huggingface.co/QuietImpostor/Gemini-Nano-Safetensors-V2/discussions/1 for some minor issues.


I've had ChatGPT Plus for a while now, and o1-preview is extremely good at debugging in my experience. I'll take a look at the discussion.
