Add Gemma3 12B Support #402
Conversation
Really? That's very curious considering their distill LoRA doesn't have weights for that... but yeah, it's true, I initially took it from the distill model under the impression those weights weren't distilled.
Yeah, seems so. For example, in the LTX 2.0 dev T2V tests 2 and 3 above, I only swapped the connectors between the distilled and dev models, and the first frames of the videos already show noticeable differences. Another test: 0_00003.mp4
Yeah I can confirm, I've uploaded the dev version and renamed the distilled version now, thanks for the heads up.
Thank you for this PR @jarz76. If I may add a suggestion: if you want to make it work with PR #399, you have to change this code to
PS: A way to download the tokenizer.model file without having to fill out a form for Google is to download this one:
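For anyone who prefers to script that download, here is a minimal sketch using the `huggingface_hub` package. The repo id below is an illustrative assumption, not necessarily the mirror linked above; any ungated repo that actually ships a `tokenizer.model` works:

```python
# Sketch: fetch tokenizer.model from an ungated Hugging Face mirror.
# The repo_id is an assumption; substitute a mirror you actually trust.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/gemma-3-12b-it",          # assumed ungated mirror
    filename="tokenizer.model",
    local_dir="ComfyUI/models/text_encoders",  # where this PR looks for it
)
print("saved to", path)
```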
@jarz76 thanks for your great PR, but I failed to run it. Can you tell us which GGUF you are using?
@BigStationW you are right, thanks. But IQ2 is not supported.
I've added my own PR for this which works (at least for me). It requires the mmproj model and the serialized sentencepiece tokenizer, which are embedded in the gemma-3 safetensors distributed with LTX-2 (they are not present in all the pre-existing gemma-3-12b-it GGUFs on Hugging Face). https://huggingface.co/smhf72/gemma-3-12b-it-extras-comfy I extracted them and pushed them to a new HF repo. You can just put them in the
I did experiment with embedding the
It was not quite as complicated as this PR seems to make it, fortunately.
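For reference, the kind of extraction described there could look roughly like this. The `mmproj.` key prefix and the input filename are assumptions on my part, not the actual tensor names in the LTX-2 checkpoint:

```python
# Sketch: split assumed mmproj tensors out of a combined safetensors checkpoint.
from safetensors.torch import load_file, save_file

state = load_file("gemma-3-12b-it-ltx2.safetensors")  # hypothetical filename

# Keep only tensors under the (assumed) multimodal projector prefix.
mmproj = {k: v for k, v in state.items() if k.startswith("mmproj.")}
if mmproj:
    save_file(mmproj, "mmproj.safetensors")
    print(f"extracted {len(mmproj)} mmproj tensors")
```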
@scottmudge cool, but is mmproj.gguf necessary? This PR doesn't need that.
Yes, the original gemma3 safetensors provided at LTX-2 release has the mmproj tensors included. It is required for visual reasoning (I2V, prompt enhancement based on the input image, etc.). It is technically not needed for T2V, but it is useful to have regardless.
If you're using it on top of #399, try to merge it again or edit loader.py manually. So far it's tested to work with
Find-replace it with
As for vision, I don't think it's really necessary. Without vision it works fine, and as far as I know, ComfyUI also doesn't currently implement the prompt enhancer. The T2V and I2V examples above are using it without an mmproj. Vision is used in their own custom node for the prompt enhancer: https://github.com/Lightricks/ComfyUI-LTXVideo
@kijai does the Comfy native implementation use the vision part of Gemma or not?
You're missing the gguf_sd_loader return tuple update in the commits? (The return still has only 2 values in the tuple.)
Both PRs' (#402 + #399) merged files:
Yes, it expects #399 to be merged first, which adds the metadata to the return tuple (needed for the LTX-2 base GGUF models). I should have mentioned that in the PR. Normally I wouldn't make a PR dependent on another PR, but given #399 has been posted all over the place for people to merge in to use LTX-2 transformer GGUFs, I assumed it was going to be merged eventually.
I've copied the #399 nodes.py and replaced the loader with the above, but when using the GGUF Dual clip loader I'm getting an error:
File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 303, in _async_map_node_over_list
Based on the logs, you're using these nodes to load it: https://github.com/calcuis/gguf which don't have Gemma-3 support. You probably shouldn't even have both installed, since you're going to get confused as to which node goes with which repo. Use the Dual CLIP GGUF loader from this repo's nodes, like in Kijai's example. Load the GGUF in the first slot, and then one of these: https://huggingface.co/Kijai/LTXV2_comfy/tree/main/text_encoders in the second (depending on whether you're using the distill model or the normal dev model).
Don't do it manually, just use git. I tested it and it's working fine:
```
git clone https://github.com/city96/ComfyUI-GGUF.git
cd ComfyUI-GGUF
# Fetch PR #399 and create a branch for it
git fetch origin pull/399/head:pr-399
git checkout pr-399
# Fetch PR #402 and merge it on top
git fetch origin pull/402/head:pr-402
git merge pr-402
```
Thanks, it went well until the last part, where git wanted to know the email and user name. It looked like it would have made changes to the repo and not just a local merge, so I canceled the console screen, after which I tried again and there's now something active in the background (not a dev lol):
```
D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>git fetch origin pull/402/head:pr-402
D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>git merge pr-402
D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>
```
Try to set your email first:
```
git config user.email "you@example.com"
git config user.name "Your Name"
git commit --no-edit
```
But yeah, if both PRs get merged, don't forget to switch back to the official repo. I'm not a dev either, I just can't wait to try LTX2 on my potato PC lol, and GGUF is the way.
city96 left a comment
Thank you for this implementation. I added some small nitpick comments, but once those are resolved we should be good to go ahead with merging this.
Not sure if you tested, but it might make sense to do a small test, or even just a single still image, to check differences between the original safetensors and the BF16 GGUF text encoder with the new tokenizer logic, to make sure it doesn't behave weirdly for some of the classic cases (newlines, numbers, etc.). I think that's what was causing weirdness for me with gemma2, though your spm tokenizer spec looks a lot better than my half-assed one from my attempt lol
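A tiny helper along those lines, assuming the reference `tokenizer.model` is on disk; the `candidate_encode` callable is a hypothetical stand-in for whatever tokenizer this PR reconstructs from GGUF metadata:

```python
# Sketch: compare a candidate tokenizer against the reference sentencepiece
# model on classic edge cases (newlines, numbers, leading spaces).
import sentencepiece as spm

def check_tokenizer(candidate_encode, ref_model_path="tokenizer.model"):
    ref = spm.SentencePieceProcessor(model_file=ref_model_path)
    cases = ["hello\nworld", "  two leading spaces", "12345", "3.14", "a\n\nb"]
    for text in cases:
        ref_ids = ref.encode(text)
        got_ids = candidate_encode(text)
        mark = "OK  " if got_ids == ref_ids else "DIFF"
        print(f"{mark} {text!r}: ref={ref_ids} got={got_ids}")
```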
(force-pushed from 978399d to 243a525)
I got this——
Update the PR and the nodes.
Download the relevant test workflow:
Same result.
Does anyone else find that LTX2 doesn't follow our instructions well, not as good as Wan 2.2? Wan can do almost anything we want, while LTX2 can't. Maybe Gemma3 is a reason, but I don't think it is the main reason. I have tried three different Gemma3s, with totally different results, but none as good as Wan.
While LTX2 T2V is very good, I feel this happens with I2V. It's not that this Gemma3 GGUF implementation doesn't work, because previously I was using FP8 TE and BF16 checkpoints, and it seems this is just how LTX2 performs. Anyway, a few days ago on Reddit, the LTX CEO did an AMA, and he answered a question confirming that there are some issues with I2V and portrait/vertical video. He also said they will update LTX2 periodically, and that LTX2.1 may be released soon. So let's see how it goes.
LTX2.1? Maybe we have to wait for that version. 2.0 is not a perfect tool for filming, just a toy. T2V is almost useless except for some video materials like alpha. What we need is a powerful I2V, which can do whatever we want.
I think we can also make GGUF versions of these two files; it would be incredibly useful for low RAM/VRAM users: https://huggingface.co/Kijai/LTXV2_comfy/tree/main/text_encoders
city96 left a comment
The new code changes look good to me so I'll go ahead and merge this to main. Thanks again for working on this.
Okay, here’s a new test of the tokenizer logic. What do you think?
Looks like it's not exactly 1:1, but those results are very close, and it probably isn't the tokenizer, since that tends to be more obvious.
(We could possibly add some tokenizer tests against the reference at some point to check for correctness but yeah, definitely out of scope for this PR and probably overkill)
Can you add support for enhanced prompts? I found it can get better results with an enhanced prompt. And it is really quite special: it's only useful with Gemma3; other LLMs are not that good. PS: I2V needs the mmproj file 😄
An enhanced-prompt feature is not really the purpose of this repo. Prompt enhancer nodes and models exist, and it's better to keep that separate. As for ComfyUI native requiring the mmproj file, I am going to test and report with and without the mmproj file.
mmproj is specific to .gguf, and native ComfyUI doesn't support .gguf, which is why city96 created this repo. So Comfy obviously doesn't require an mmproj file. And it's easy to tell whether Comfy really uses the Gemma 3 vision / sees the image when given an image input (I2V).

As a casual user, just run the I2V ComfyUI native workflow first using the .safetensors file, which already has vision built into it (unlike .gguf, which needs a separate mmproj.gguf file), then save the output. After that, swap the model loader and text encoder to GGUF. If it produces similar output, then that confirms that Comfy doesn't implement or use the vision capabilities Gemma-3 has. And that means vision isn't an essential component in LTX2, since when you run it with a GGUF text encoder, it isn't using mmproj/vision at all.

Another way to tell is that in the native Comfy workflow, the CLIP output from the TE loader is connected directly to the normal CLIP Text Encode node. It's just a standard text-encode node, not a special one like TextEncodeQwenImageEdit, which has an image input. Even in the ComfyUI-LTX2 custom node, I believe it's only used to provide context for the prompt-enhancer node, so Gemma-3 sees the image and tries to enhance the prompt based on the image context. Unlike Qwen-Image-Edit, which requires vision to make image edits; that's why mmproj is a must for qwen-image-edit GGUF.
Native ComfyUI only uses the text encoder as a CLIP node, but LLMs have evolved so much that a string result is really needed, so that we can do whatever we want with the LLM, without using a custom 'LLM loader' to load the LLM again.




Trying to add support for Gemma 3 12B GGUF. It can be used with the `DualClipLoader (GGUF)` node. CLIP 1 is a Gemma 3 GGUF, and CLIP 2 uses embedding connectors from: https://huggingface.co/Kijai/LTXV2_comfy/tree/main/text_encoders

Note: The connectors from @kijai seem to come from the distilled models, and testing shows different results compared to connectors extracted from the dev models. I've uploaded the connectors-dev here: https://huggingface.co/jayn7/LTXV2/tree/main if anyone wants to try it. Kijai has updated the repo and now provides both dev and distilled connectors.

This approach uses the Gemma 3 `tokenizer.model` (4.5MB) file directly, instead of attempting to recreate the tokenizer from metadata. It searches for and loads `tokenizer.model` or `gemma3-tokenizer.model` inside the `ComfyUI/models/text_encoders` folder. The tokenizer can be found here: https://huggingface.co/google/gemma-3-12b-it/tree/main
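A minimal sketch of that search-and-load behaviour, using ComfyUI's `folder_paths` registry and `sentencepiece`; this mirrors the description above, not necessarily the PR's exact code, and is superseded by the Edit below:

```python
# Sketch: locate and load the Gemma 3 sentencepiece tokenizer from the
# ComfyUI text_encoders folders, as described above.
import os
import folder_paths  # available inside a ComfyUI environment
import sentencepiece as spm

def find_gemma_tokenizer():
    for base in folder_paths.get_folder_paths("text_encoders"):
        for name in ("tokenizer.model", "gemma3-tokenizer.model"):
            candidate = os.path.join(base, name)
            if os.path.isfile(candidate):
                return spm.SentencePieceProcessor(model_file=candidate)
    raise FileNotFoundError("No Gemma 3 tokenizer.model found in text_encoders")
```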
Edit: We no longer need `tokenizer.model`; the approach is now the same as the others, and it will attempt to recreate the tokenizer from metadata. #402 (comment)

GGUF quants tested so far, but as long as they contain the required metadata, any release should work fine (a quick way to check is sketched after the list):
https://huggingface.co/unsloth/gemma-3-12b-it-GGUF - IQ4_XS
https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF - Q8 & BF16
https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf
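If you want to check whether a particular GGUF carries the tokenizer metadata, a quick probe with the `gguf` Python package could look like this; the keys below are the standard GGUF tokenizer fields, though exactly which ones a sentencepiece reconstruction needs is my assumption:

```python
# Sketch: list which standard GGUF tokenizer metadata keys a file carries.
from gguf import GGUFReader

reader = GGUFReader("gemma-3-12b-it-Q8_0.gguf")  # example filename
for key in ("tokenizer.ggml.model", "tokenizer.ggml.tokens",
            "tokenizer.ggml.scores", "tokenizer.ggml.token_type"):
    print(key, "present" if key in reader.fields else "MISSING")
```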
Some example results (workflow embedded)
LTX 2.0 DEV T2V
BF16.mp4
Q8.mp4
Q8_distill.conncetor.mp4
LTX 2.0 Distilled T2V
bf16.mp4
Q8.mp4
LTX 2.0 DEV I2V
BF16.mp4
IQ4_XS.mp4