
Conversation

@jarz76
Contributor

@jarz76 jarz76 commented Jan 9, 2026

Trying to add support for Gemma 3 12B GGUF; it can be used with the DualClipLoader (GGUF) node.

CLIP 1 is a Gemma 3 GGUF, and CLIP 2 uses embedding connectors from: https://huggingface.co/Kijai/LTXV2_comfy/tree/main/text_encoders

Note: The connectors from @kijai seem to come from the distilled models, and when testing they show different results compared to connectors extracted from the dev models. I’ve uploaded the connectors-dev here: https://huggingface.co/jayn7/LTXV2/tree/main if anyone wants to try them. Kijai has since updated the repo and now provides both dev and distilled connectors.

chrome_iVsVdBcPKv

This approach uses the Gemma 3 tokenizer.model (4.5 MB) file directly, instead of attempting to recreate the tokenizer from metadata. It searches for tokenizer.model or gemma3-tokenizer.model inside the ComfyUI/models/text_encoders folder and loads it from there.
The tokenizer can be found here: https://huggingface.co/google/gemma-3-12b-it/tree/main
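For reference, that file lookup works roughly like the sketch below (illustrative only, not the exact PR code, and superseded by the metadata approach mentioned in the edit below; it assumes ComfyUI's folder_paths helper and the sentencepiece package are available):

import os
import folder_paths  # ComfyUI helper that knows the configured model directories
from sentencepiece import SentencePieceProcessor

def load_gemma_tokenizer():
    # Check every configured text_encoders directory for a usable tokenizer file.
    for base in folder_paths.get_folder_paths("text_encoders"):
        for name in ("gemma3-tokenizer.model", "tokenizer.model"):
            path = os.path.join(base, name)
            if os.path.isfile(path):
                return SentencePieceProcessor(model_file=path)
    raise FileNotFoundError(
        "Put tokenizer.model (or gemma3-tokenizer.model) in ComfyUI/models/text_encoders"
    )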

Edit: We no longer need tokenizer.model; the approach is now the same as the others, and it will attempt to recreate the tokenizer from metadata. #402 (comment)

GGUF quants tested so far (but as long as they contain the required metadata, any release should work fine):
https://huggingface.co/unsloth/gemma-3-12b-it-GGUF - IQ4_XS
https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF - Q8 & BF16
https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf

Some example results (workflow embedded)

LTX 2.0 DEV T2V

# | Description | Result (video)
1 | ComfyUI LTXV Audio Text Encoder Node (gemma_3_12B_it_bf16.safetensors) | BF16.mp4
2 | DualClipLoader GGUF Node (Gemma3 GGUF Q8 + embedding connectors extracted from dev models) | Q8.mp4
3 | DualClipLoader GGUF Node (Gemma3 GGUF Q8 + embedding connectors by kijai) | Q8_distill.conncetor.mp4

LTX 2.0 Distilled T2V

# | Description | Result (video)
1 | ComfyUI LTXV Audio Text Encoder Node (gemma_3_12B_it_bf16.safetensors) | bf16.mp4
2 | DualClipLoader GGUF Node (Gemma3 GGUF Q8 + embedding connectors) | Q8.mp4

LTX 2.0 DEV I2V

# | Description | Result (video)
1 | ComfyUI LTXV Audio Text Encoder Node (gemma_3_12B_it_bf16.safetensors) | BF16.mp4
2 | DualClipLoader GGUF Node (Gemma3 GGUF IQ4_XS + embedding connectors) | IQ4_XS.mp4

@kijai

kijai commented Jan 9, 2026

Note: The connectors from @kijai seem to come from the distilled models, and when testing they show different results compared to connectors extracted from the dev models. For now, I’ve uploaded the connectors-dev here: https://huggingface.co/jayn7/LTXV2/tree/main if anyone wants to try them.

Really? That's very curious considering their distill LoRA doesn't have weights for that.... but yeah it's true I initially took it from the distill model under the impression those weights weren't distilled.

@jarz76
Contributor Author

jarz76 commented Jan 9, 2026

Note: The connectors from @kijai seem to come from the distilled models, and when testing they show different results compared to connectors extracted from the dev models. For now, I’ve uploaded the connectors-dev here: https://huggingface.co/jayn7/LTXV2/tree/main if anyone wants to try them.

Really? That's very curious considering their distill LoRA doesn't have weights for that.... but yeah it's true I initially took it from the distill model under the impression those weights weren't distilled.

Yeah, seems so. For example, in the LTX 2.0 dev T2V tests 2 and 3 above, I only swapped the connectors between distilled and dev, and the first frame of the videos already shows noticeable differences.

Another test.

0_00003.mp4

@kijai

kijai commented Jan 9, 2026

Note: The connectors from @kijai seem to come from the distilled models, and when testing they show different results compared to connectors extracted from the dev models. For now, I’ve uploaded the connectors-dev here: https://huggingface.co/jayn7/LTXV2/tree/main if anyone wants to try them.

Really? That's very curious considering their distill LoRA doesn't have weights for that.... but yeah it's true I initially took it from the distill model under the impression those weights weren't distilled.

Yeah, seems so. For example, in the LTX 2.0 dev T2V tests 2 and 3 above, I only swapped the connectors between distilled and dev, and the first frame of the videos already shows noticeable differences.

Another test.

0_00003.mp4

Yeah I can confirm, I've uploaded the dev version and renamed the distilled version now, thanks for the heads up.

@BigStationW

BigStationW commented Jan 9, 2026

Thank you for this PR @jarz76. If I may add a suggestion: if you want to make it work with PR #399, you have to change this code

sd, arch = gguf_sd_loader(path, return_arch=True, is_text_model=True)

to

sd, arch, metadata = gguf_sd_loader(path, return_arch=True, is_text_model=True)
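If you want the same file to keep working whether or not PR #399 is applied, you could also unpack defensively (just a sketch, not the actual PR code):

# Works with both the 2-value and 3-value return of gguf_sd_loader
result = gguf_sd_loader(path, return_arch=True, is_text_model=True)
if len(result) == 3:
    sd, arch, metadata = result
else:
    sd, arch = result
    metadata = None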

PS: A way to download the tokenizer.model file without having to fill out a form for Google is to download this one:
https://huggingface.co/unsloth/gemma-3-4b-it/blob/main/tokenizer.model

@zwukong

zwukong commented Jan 10, 2026

@jarz76 thanks for your great PR, but I failed to run it. Can you tell us which GGUF you are using?

  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 266, in load_clip
    return (self.load_patcher(clip_paths, clip_type, self.load_data(clip_paths)),)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 220, in load_data
    sd = gguf_clip_loader(p)
         ^^^^^^^^^^^^^^^^^^^
  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\loader.py", line 460, in gguf_clip_loader
    sd, arch = gguf_sd_loader(path, return_arch=True, is_text_model=True)
    ^^^^^^^^
ValueError: too many values to unpack (expected 2)

@zwukong

zwukong commented Jan 10, 2026

@BigStationW you are right, thanks. But IQ2 is not supported.

@scottmudge

scottmudge commented Jan 10, 2026

I've added my own PR for this which works (at least for me). It requires the mmproj model and the serialized SentencePiece tokenizer, which are embedded in the gemma-3 safetensors distributed with LTX-2 (they are not present in all of the pre-existing gemma-3-12b-it GGUFs on Hugging Face).

https://huggingface.co/smhf72/gemma-3-12b-it-extras-comfy

I extracted them and pushed them to a new hf repo. You can just put them in the models/clip folder, and they should automatically load alongside the main GGUF.

I did experiment with embedding the spiece and mmproj models directly into the GGUF, but it seemed like too much work to regenerate all of them, and having them separate allows people to use abliterated models if they want.

#404

It was not quite as complicated as this PR seems to make it, fortunately.
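For anyone who wants to reproduce the extraction, splitting tensors out of a combined safetensors checkpoint is roughly this (a sketch under assumptions: the actual key prefixes in the LTX-2 Gemma-3 file may differ, "vision_tower." and "multi_modal_projector." here are guesses based on the usual HF Gemma 3 naming):

from safetensors import safe_open
from safetensors.torch import save_file

# Assumed prefixes for the vision/projector weights; verify against the real key names first.
VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def extract_vision_tensors(src_path, out_path):
    vision = {}
    with safe_open(src_path, framework="pt") as f:
        for key in f.keys():
            if key.startswith(VISION_PREFIXES):
                vision[key] = f.get_tensor(key)
    save_file(vision, out_path)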

@zwukong

zwukong commented Jan 10, 2026

@scottmudge cool, but is mmproj.gguf necessary? This PR doesn't need that.

@scottmudge

Yes, the original gemma3 safetensors provided at the LTX-2 release has the mmproj tensors included. It is required for visual reasoning (I2V, prompt enhancement based on the input image, etc.). It is technically not needed for T2V, but it is useful to have regardless.

@jarz76
Contributor Author

jarz76 commented Jan 10, 2026

@jarz76 thanks for your great PR, but I failed to run it. Can you tell us which GGUF you are using?

  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 266, in load_clip
    return (self.load_patcher(clip_paths, clip_type, self.load_data(clip_paths)),)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 220, in load_data
    sd = gguf_clip_loader(p)
         ^^^^^^^^^^^^^^^^^^^
  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\loader.py", line 460, in gguf_clip_loader
    sd, arch = gguf_sd_loader(path, return_arch=True, is_text_model=True)
    ^^^^^^^^
ValueError: too many values to unpack (expected 2)

If you're using it on top of #399, try to merge it again or edit loader.py manually. So far it has been tested and works with
https://huggingface.co/unsloth/gemma-3-12b-it-GGUF
https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF
https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf

Find

sd, arch = gguf_sd_loader(path, return_arch=True, is_text_model=True)

replace it with

sd, arch, metadata = gguf_sd_loader(path, return_arch=True, is_text_model=True)

As for vision, I don't think it's really necessary. Without vision, it works fine, and as far as I know, ComfyUI also doesn't currently implement the prompt enhancer. The T2V and I2V examples above are using it without an mmproj.

Vision is used in their own custom node for the prompt enhancer: https://github.com/Lightricks/ComfyUI-LTXVideo

@MeiYi-dev

@jarz76 thanks for your great PR, but I failed to run it. Can you tell us which GGUF you are using?

  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 266, in load_clip
    return (self.load_patcher(clip_paths, clip_type, self.load_data(clip_paths)),)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\nodes.py", line 220, in load_data
    sd = gguf_clip_loader(p)
         ^^^^^^^^^^^^^^^^^^^
  File "M:\ComfyUI\312_cu128\ComfyUI\custom_nodes\ComfyUI-GGUF\loader.py", line 460, in gguf_clip_loader
    sd, arch = gguf_sd_loader(path, return_arch=True, is_text_model=True)
    ^^^^^^^^
ValueError: too many values to unpack (expected 2)

If you're using it on top of #399, try to merge it again or edit loader.py manually. So far it has been tested and works with https://huggingface.co/unsloth/gemma-3-12b-it-GGUF https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf

Find

sd, arch = gguf_sd_loader(path, return_arch=True, is_text_model=True)

replace it with

sd, arch, metadata = gguf_sd_loader(path, return_arch=True, is_text_model=True)

As for vision, I don't think it's really necessary. Without vision, it works fine, and as far as I know, ComfyUI also doesn't currently implement the prompt enhancer. The T2V and I2V examples above are using it without an mmproj.

Vision is used in their own custom node for the prompt enhancer: https://github.com/Lightricks/ComfyUI-LTXVideo

@kijai does the Comfy native implementation use the vision part of Gemma or not?

@builder-main

builder-main commented Jan 10, 2026

You're missing the gguf_sd_loader return-tuple update in the commits? (The return still has only 2 values in the tuple.)
[Edit]: that's because the fork does not include the aforementioned commit.

@builder-main

Both PR (402+399) merged files :
loader.py

@scottmudge

Yes, it expects #399 to be merged first, which adds the metadata to the return tuple (needed for the LTX-2 base GGUF models). I should have mentioned that in the PR.

Normally wouldn't make a PR dependent on another PR, but given #399 has been posted all over the place for people to merge in to use LTX-2 transformer GGUFs, I assumed it was going to be merged eventually.

@novice-101

Both PR (402+399) merged files : loader.py

I've copied the #399 nodes.py and replaced the loader with the above, but when using the GGUF Dual CLIP loader I'm getting an error:
(using gemma-3-12b-it-heretic-x.i1-Q3_K_S.gguf)

File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 303, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 291, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 583, in load_clip
return (self.load_patcher(clip_paths, get_clip_type(type), self.load_data(clip_paths)), get_device(device))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 546, in load_data
sd = load_gguf_clip(p)
^^^^^^^^^^^^^^^^^
File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 430, in load_gguf_clip
sd, arch = load_gguf_sd(path, return_arch=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 343, in load_gguf_sd
raise ValueError(f"Unknown architecture: {arch_str!r}")
ValueError: Unknown architecture: 'gemma3'

@scottmudge

Based on the logs, you're using these nodes to load it:

https://github.com/calcuis/gguf

Which don't have Gemma-3 support. You probably shouldn't even have both installed, since you're going to get confused as to which node goes with which repo.

Use the Dual CLIP GGUF loader from this repo's nodes, like in Kijai's example. Load the GGUF in the first slot, and then one of these:

https://huggingface.co/Kijai/LTXV2_comfy/tree/main/text_encoders

in the second (depending on whether you're using the distill model or the normal dev model).

@muljanis45

Both PR (402+399) merged files : loader.py

I've copied the #399 nodes.py and replaced the loader with the above but when using GGUF Dual clip loader I'm getting an error: (using gemma-3-12b-it-heretic-x.i1-Q3_K_S.gguf)

File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 303, in _async_map_node_over_list await process_inputs(input_dict, i) File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 291, in process_inputs result = f(**inputs) ^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 583, in load_clip return (self.load_patcher(clip_paths, get_clip_type(type), self.load_data(clip_paths)), get_device(device)) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 546, in load_data sd = load_gguf_clip(p) ^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 430, in load_gguf_clip sd, arch = load_gguf_sd(path, return_arch=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 343, in load_gguf_sd raise ValueError(f"Unknown architecture: {arch_str!r}") ValueError: Unknown architecture: 'gemma3'

Don't do it manually, just use git. I tested it and it's working fine:

git clone https://github.com/city96/ComfyUI-GGUF.git
cd ComfyUI-GGUF

# Fetch PR #399 and create a branch for it
git fetch origin pull/399/head:pr-399
git checkout pr-399

# Fetch PR #402 and apply it on top
git fetch origin pull/402/head:pr-402

# Apply PR #402 on top of PR #399
git merge pr-402

@novice-101

Both PR (402+399) merged files : loader.py

I've copied the #399 nodes.py and replaced the loader with the above but when using GGUF Dual clip loader I'm getting an error: (using gemma-3-12b-it-heretic-x.i1-Q3_K_S.gguf)
File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 303, in _async_map_node_over_list await process_inputs(input_dict, i) File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 291, in process_inputs result = f(**inputs) ^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 583, in load_clip return (self.load_patcher(clip_paths, get_clip_type(type), self.load_data(clip_paths)), get_device(device)) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 546, in load_data sd = load_gguf_clip(p) ^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 430, in load_gguf_clip sd, arch = load_gguf_sd(path, return_arch=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 343, in load_gguf_sd raise ValueError(f"Unknown architecture: {arch_str!r}") ValueError: Unknown architecture: 'gemma3'

Don't do it manually, just use git. I tested it and it's working fine:

git clone https://github.com/city96/ComfyUI-GGUF.git
cd ComfyUI-GGUF

# Fetch PR #399 and create a branch for it
git fetch origin pull/399/head:pr-399
git checkout pr-399

# Fetch PR #402 and apply it on top
git fetch origin pull/402/head:pr-402

# Apply PR #402 on top of PR #399
git merge pr-402

Thanks, it went well until the last part, where git wanted to know the email and user name. It looked like it would have made changes to the repo and not just a local merge, so I canceled the console screen, after which I tried again and now there's something active in the background (not a dev lol):

D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>git fetch origin pull/402/head:pr-402

D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>git merge pr-402
fatal: You have not concluded your merge (MERGE_HEAD exists).
Please, commit your changes before you merge.

D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>_

@muljanis45

muljanis45 commented Jan 10, 2026

Both PR (402+399) merged files : loader.py

I've copied the #399 nodes.py and replaced the loader with the above but when using GGUF Dual clip loader I'm getting an error: (using gemma-3-12b-it-heretic-x.i1-Q3_K_S.gguf)
File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 303, in _async_map_node_over_list await process_inputs(input_dict, i) File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 291, in process_inputs result = f(**inputs) ^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 583, in load_clip return (self.load_patcher(clip_paths, get_clip_type(type), self.load_data(clip_paths)), get_device(device)) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 546, in load_data sd = load_gguf_clip(p) ^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 430, in load_gguf_clip sd, arch = load_gguf_sd(path, return_arch=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\gguf\pig.py", line 343, in load_gguf_sd raise ValueError(f"Unknown architecture: {arch_str!r}") ValueError: Unknown architecture: 'gemma3'

Don't do it manually, just use git. I tested it and it's working fine:

git clone https://github.com/city96/ComfyUI-GGUF.git
cd ComfyUI-GGUF

# Fetch PR #399 and create a branch for it
git fetch origin pull/399/head:pr-399
git checkout pr-399

# Fetch PR #402 and apply it on top
git fetch origin pull/402/head:pr-402

# Apply PR #402 on top of PR #399
git merge pr-402

Thanks, it went well until the last part, where git wanted to know the email and user name. It looked like it would have made changes to the repo and not just a local merge, so I canceled the console screen, after which I tried again and now there's something active in the background (not a dev lol):

D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>git fetch origin pull/402/head:pr-402

D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>git merge pr-402 fatal: You have not concluded your merge (MERGE_HEAD exists). Please, commit your changes before you merge.

D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF>_

Try setting your email first:

git config user.email "you@example.com"
git config user.name "Your Name"

git commit --no-edit

Or for an instant solution, you can just use this one; I've already merged 399 & 402 here.
https://github.com/muljanis45/ComfyUI-GGUF

But yeah, once both PRs get merged, don't forget to switch back to the official repo. I'm not a dev either, I just can't wait to try LTX2 on my potato PC lol, and GGUF is the way.

@novice-101

novice-101 commented Jan 10, 2026

Based on the logs, you're using these nodes to load it:

https://github.com/calcuis/gguf

Yes, interestingly, I thought I needed to keep the "workflow-encrypt" custom node since I was using it to load the audio VAE, and I was wondering why it still referenced KJ (Kijai) in the node name. After deleting the calcuis nodes (encrypt), the original workflow now displays the KJ nodes like it should.

image

vs before;
image

@jarz76
Contributor Author

jarz76 commented Jan 11, 2026

chrome_m2DaXTnQJ0

Just tried adding support for recreating the tokenizer directly from the GGUF metadata instead of the previous approach using the extra tokenizer.model file, and it seems to work fine.
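Conceptually, that means reading the tokenizer.ggml.* arrays out of the GGUF metadata and rebuilding a SentencePiece model proto from them. A simplified sketch, not the exact PR code; it assumes the token/score/token-type lists have already been pulled from the GGUF metadata, and it skips special-token and normalizer handling:

from sentencepiece import SentencePieceProcessor
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def spm_from_gguf_metadata(tokens, scores, toktypes):
    # tokens/scores/toktypes come from tokenizer.ggml.tokens / .scores / .token_type
    proto = sp_pb2.ModelProto()
    for text, score, ttype in zip(tokens, scores, toktypes):
        piece = proto.pieces.add()
        piece.piece = text
        piece.score = score
        piece.type = ttype  # GGUF token-type values line up with SentencePiece piece types
    sp = SentencePieceProcessor()
    sp.LoadFromSerializedProto(proto.SerializeToString())
    return sp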

Owner

@city96 city96 left a comment


Thank you for this implementation. I added some small nitpick comments, but once those are resolved we should be good to go ahead with merging this.

Not sure if you tested, but it might make sense to do a small test or even just a single still image to check differences between the original safetensors and the BF16 gguf text encoder with the new tokenizer logic to make sure it doesn't behave weirdly for some of the classic cases (newlines, numbers, etc). I think that's what was causing weirdness for me with gemma2, though your spm tokenizer spec looks a lot better than my half-assed one for my attempt lol

@jarz76
Contributor Author

jarz76 commented Jan 12, 2026

Thank you for this implementation. I added some small nitpick comments, but once those are resolved we should be good to go ahead with merging this.

Not sure if you tested, but it might make sense to do a small test or even just a single still image to check differences between the original safetensors and the BF16 gguf text encoder with the new tokenizer logic to make sure it doesn't behave weirdly for some of the classic cases (newlines, numbers, etc). I think that's what was causing weirdness for me with gemma2, though your spm tokenizer spec looks a lot better than my half-assed one for my attempt lol

Okay, here’s a new test of the tokenizer logic. What do you think?

Prompt:

A teenage boy stands in front of a birthday cake with lit candles, wearing a blue striped shirt and a paper party hat. Colorful balloons and streamers decorate the living room behind him. His family surrounds him, their faces glowing with excitement in the warm candlelight. He takes a deep breath and begins counting down with enthusiasm: "5... 4... 3... 2... 1!" His eyes light up with each number.

The camera slowly zooms in on his face as he counts. When he reaches "1!" everyone cheers and he leans forward to blow out the candles. Confetti falls from above as smoke wisps from the extinguished flames. The room fills with laughter and applause.

# | Description | Result (video)
1 | ComfyUI LTXV Audio Text Encoder Node (gemma_3_12B_it_bf16.safetensors) | bf16_safetensor.mp4
2 | DualClipLoader GGUF Node (Gemma3 GGUF BF16) | bf16_gguf.mp4
3 | DualClipLoader GGUF Node (Gemma3 GGUF IQ4_XS) | iq4_xs_gguf.mp4

Prompt with timestamp:

A humorous cinematic scene in a cozy living room with warm lighting and subtle futuristic design.

0–1.5s
An elderly man wearing a 1960s tweed suit looks around, confused.

Man:
“What year is it?”

1.5–3s
An elderly woman wearing a modern futuristic suit answers calmly.

Woman:
“2050.”

3–4.5s
The man snaps upright, shocked and irritated.

Man:
“What the hell? No way.”

4.5–6.5s
The woman activates her suit. A sleek Iron-Man-style nanotech mask smoothly deploys and covers her face, mechanical panels forming in a clean, elegant motion.

Woman (through the mask, calm):
“Technology moved on.”

6.5–8.5s
The man leans forward, squinting at her masked face in disbelief.

Man:
“…This is wrong.”

8.5–10s
The camera slowly pulls back as the woman stands silently, mask glowing softly.

Expressive facial animation before mask deployment, clear lip sync, precise mechanical motion, natural body language, comedic timing through pauses, cinematic realism, shallow depth of field, smooth camera movement, warm interior lighting, light humor.

# | Description | Result (video)
1 | ComfyUI LTXV Audio Text Encoder Node (gemma_3_12B_it_bf16.safetensors) | BF16_Safetensors.mp4
2 | DualClipLoader GGUF Node (Gemma3 GGUF BF16) | BF16_GGUF.mp4
3 | DualClipLoader GGUF Node (Gemma3 GGUF IQ4_XS) | IQ4_XS.mp4

For images, I'm not sure if there are specific good settings for text-to-image using LTX-2; this is the result when I simply set length = 1.

Prompt:

A cozy art gallery room with warm wooden floors and soft overhead lighting. 5 paintings hang on the white wall in a neat horizontal row, each in an elegant gold frame. The paintings depict different landscapes: a mountain scene, a seascape, a forest path, a sunset over fields, and a snowy valley. A velvet rope barrier runs along the floor in front of the wall.

The camera slowly pans from left to right, revealing each painting in detail. Shadows from the frames create subtle depth on the wall. A small brass plaque sits beneath each artwork. The room is quiet and peaceful, with the gentle hum of climate control in the background. Natural light filters in from a window off to the side, casting a warm glow across the 5 framed pieces on display.
ComfyUI_temp_fhdpf_00002_

@Sostay

Sostay commented Jan 12, 2026

Both PR (402+399) merged files : loader.py

I got this:
The size of tensor a (3520) must match the size of tensor b (466816) at non-singleton dimension 2

LTX2 TI2V(1).json

@MeiYi-dev

Both PR (402+399) merged files : loader.py

I got this: The size of tensor a (3520) must match the size of tensor b (466816) at non-singleton dimension 2
LTX2 TI2V(1).json

Update the PR and the nodes

@niceqwer55555 niceqwer55555 mentioned this pull request Jan 12, 2026
@juntaosun

Download the relevant test workflow:
#400 (comment)

@Sostay

Sostay commented Jan 12, 2026

Both PR (402+399) merged files : loader.py

I got this: The size of tensor a (3520) must match the size of tensor b (466816) at non-singleton dimension 2
LTX2 TI2V(1).json

Update the PR and the nodes

same result

@zwukong

zwukong commented Jan 12, 2026

Does anyone find that LTX2 doesn't follow our instructions that well, not as good as Wan 2.2? Wan can do almost anything we want, while LTX2 can't. Maybe Gemma3 is one reason, but I don't think it is the main reason. I have tried three different Gemma3 models; the results are totally different, but all not as good as Wan.

@muljanis45

Does anyone find that LTX2 doesn't follow our instructions that well, not as good as Wan 2.2? Wan can do almost anything we want, while LTX2 can't. Maybe Gemma3 is one reason, but I don't think it is the main reason. I have tried three different Gemma3 models; the results are totally different, but all not as good as Wan.

While LTX2 T2V is very good, I feel this happens with I2V. It’s not that this Gemma3 GGUF implementation doesn’t work, because previously I was using FP8 TE and BF16 checkpoints, and it seems this is just how LTX2 performs.

Anyway, a few days ago on Reddit, the LTX CEO did an AMA, and he answered a question confirming that there are some issues with I2V and portrait/vertical video. He also said they will update LTX2 periodically, and that LTX 2.1 may be released soon. So let’s see how it goes.

@zwukong

zwukong commented Jan 12, 2026

LTX 2.1? Maybe we have to wait for that version. 2.0 is not a perfect tool for filmmaking, just a toy. T2V is almost useless except for some video material like alpha. What we need is a powerful I2V, which can do whatever we want.

@MeiYi-dev

MeiYi-dev commented Jan 12, 2026

I think we can also make GGUF versions of these two files; it would be incredibly useful for low-RAM/VRAM users.

https://huggingface.co/Kijai/LTXV2_comfy/tree/main/text_encoders
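If someone wants to try that, wrapping a small safetensors file into a GGUF is roughly this with the gguf Python package (a sketch only; the arch string is a placeholder, whether the loaders here would accept such a file is an open question, and the repo's own convert tooling may do this differently):

import gguf
from safetensors.torch import load_file

def safetensors_to_gguf(src_path, dst_path, arch="connector"):  # arch name is a placeholder
    writer = gguf.GGUFWriter(dst_path, arch)
    for name, tensor in load_file(src_path).items():
        # Keep the tiny connector weights in F32; quantizing them would save very little.
        writer.add_tensor(name, tensor.float().numpy())
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()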

Owner

@city96 city96 left a comment


The new code changes look good to me so I'll go ahead and merge this to main. Thanks again for working on this.

Okay, here’s a new test of the tokenizer logic. What do you think?

Looks like it's not exactly 1:1 but those results are very close and it probably isn't the tokenizer since that tends to be more obvious.

(We could possibly add some tokenizer tests against the reference at some point to check for correctness but yeah, definitely out of scope for this PR and probably overkill)
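For anyone who wants to spot-check the tokenizer later, a comparison against the reference HF tokenizer could look something like this (a sketch; build_gguf_tokenizer is a hypothetical stand-in for however the reconstructed tokenizer ends up being exposed):

from transformers import AutoTokenizer

reference = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
gguf_tok = build_gguf_tokenizer("gemma-3-12b-it-Q8_0.gguf")  # hypothetical helper

cases = ["hello world", "line one\nline two", "12345", "3.14", "  leading spaces", "café 😄"]
for text in cases:
    ref_ids = reference.encode(text, add_special_tokens=False)
    got_ids = gguf_tok.encode(text)
    print("OK  " if ref_ids == got_ids else "DIFF", repr(text))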

@city96 city96 merged commit 9ecc3c4 into city96:main Jan 12, 2026
@zwukong

zwukong commented Jan 12, 2026

Can you add support for prompt enhancement? I found it can get better results with an enhanced prompt. And it is really quite special; it only works this well with Gemma3, other LLMs are not that good. PS: I2V needs the mmproj file 😄

  def _enhance(
        self,
        messages: list[dict[str, str]],
        image: torch.Tensor | None = None,
        max_new_tokens: int = 512,
        seed: int = 42,
    ) -> str:
        if self.processor is None:
            self._init_image_processor()
        text = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        model_inputs = self.processor(
            text=text,
            images=image,
            return_tensors="pt",
        ).to(self.model.device)
        pad_token_id = self.processor.tokenizer.pad_token_id if self.processor.tokenizer.pad_token_id is not None else 0
        model_inputs = _pad_inputs_for_attention_alignment(model_inputs, pad_token_id=pad_token_id)

        with torch.inference_mode(), torch.random.fork_rng(devices=[self.model.device]):
            torch.manual_seed(seed)
            outputs = self.model.generate(
                **model_inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
            )
            generated_ids = outputs[0][len(model_inputs.input_ids[0]) :]
            enhanced_prompt = self.processor.tokenizer.decode(generated_ids, skip_special_tokens=True)

        return enhanced_prompt

    def enhance_t2v(
        self,
        prompt: str,
        max_new_tokens: int = 512,
        system_prompt: str | None = None,
        seed: int = 42,
    ) -> str:
        """Enhance a text prompt for T2V generation."""

        system_prompt = system_prompt or self.default_gemma_t2v_system_prompt

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"user prompt: {prompt}"},
        ]

        return self._enhance(messages, max_new_tokens=max_new_tokens, seed=seed)

    def enhance_i2v(
        self,
        prompt: str,
        image: torch.Tensor,
        max_new_tokens: int = 512,
        system_prompt: str | None = None,
        seed: int = 42,
    ) -> str:
        """Enhance a text prompt for I2V generation using a reference image."""
        system_prompt = system_prompt or self.default_gemma_i2v_system_prompt
        messages = [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": f"User Raw Input Prompt: {prompt}."},
                ],
            },
        ]
        return self._enhance(messages, image=image, max_new_tokens=max_new_tokens, seed=seed)

    @functools.cached_property
    def default_gemma_i2v_system_prompt(self) -> str:
        return _load_system_prompt("gemma_i2v_system_prompt.txt")

    @functools.cached_property
    def default_gemma_t2v_system_prompt(self) -> str:
        return _load_system_prompt("gemma_t2v_system_prompt.txt")

    def forward(self, text: str, padding_side: str = "left") -> tuple[torch.Tensor, torch.Tensor]:
        raise NotImplementedError("This method is not implemented for the base class")


@MeiYi-dev

Can you add support for prompt enhancement? I found it can get better results with an enhanced prompt. And it is really quite special; it only works this well with Gemma3, other LLMs are not that good. PS: I2V needs the mmproj file 😄

  def _enhance(
        self,
        messages: list[dict[str, str]],
        image: torch.Tensor | None = None,
        max_new_tokens: int = 512,
        seed: int = 42,
    ) -> str:
        if self.processor is None:
            self._init_image_processor()
        text = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        model_inputs = self.processor(
            text=text,
            images=image,
            return_tensors="pt",
        ).to(self.model.device)
        pad_token_id = self.processor.tokenizer.pad_token_id if self.processor.tokenizer.pad_token_id is not None else 0
        model_inputs = _pad_inputs_for_attention_alignment(model_inputs, pad_token_id=pad_token_id)

        with torch.inference_mode(), torch.random.fork_rng(devices=[self.model.device]):
            torch.manual_seed(seed)
            outputs = self.model.generate(
                **model_inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
            )
            generated_ids = outputs[0][len(model_inputs.input_ids[0]) :]
            enhanced_prompt = self.processor.tokenizer.decode(generated_ids, skip_special_tokens=True)

        return enhanced_prompt

    def enhance_t2v(
        self,
        prompt: str,
        max_new_tokens: int = 512,
        system_prompt: str | None = None,
        seed: int = 42,
    ) -> str:
        """Enhance a text prompt for T2V generation."""

        system_prompt = system_prompt or self.default_gemma_t2v_system_prompt

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"user prompt: {prompt}"},
        ]

        return self._enhance(messages, max_new_tokens=max_new_tokens, seed=seed)

    def enhance_i2v(
        self,
        prompt: str,
        image: torch.Tensor,
        max_new_tokens: int = 512,
        system_prompt: str | None = None,
        seed: int = 42,
    ) -> str:
        """Enhance a text prompt for I2V generation using a reference image."""
        system_prompt = system_prompt or self.default_gemma_i2v_system_prompt
        messages = [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": f"User Raw Input Prompt: {prompt}."},
                ],
            },
        ]
        return self._enhance(messages, image=image, max_new_tokens=max_new_tokens, seed=seed)

    @functools.cached_property
    def default_gemma_i2v_system_prompt(self) -> str:
        return _load_system_prompt("gemma_i2v_system_prompt.txt")

    @functools.cached_property
    def default_gemma_t2v_system_prompt(self) -> str:
        return _load_system_prompt("gemma_t2v_system_prompt.txt")

    def forward(self, text: str, padding_side: str = "left") -> tuple[torch.Tensor, torch.Tensor]:
        raise NotImplementedError("This method is not implemented for the base class")

A prompt-enhancement feature is not really the purpose of this repo. Prompt enhancer nodes and models already exist, and it's better to keep that separate. As for ComfyUI native requiring the mmproj file, I am going to test and report with and without the mmproj file.

@muljanis45

muljanis45 commented Jan 12, 2026

As for ComfyUI native requiring the mmproj file, I am going to test and report with and without the mmproj file.

mmproj is specific to .gguf, and native ComfyUI doesn’t support .gguf, which is why city96 created this repo. So Comfy obviously doesn’t require an mmproj file.

And it's easy to tell whether Comfy really uses the Gemma 3 vision / sees the image when there is an image input (I2V).

As a casual user, just run the I2V ComfyUI native workflow first using the .safetensors file, which already has vision built into it (unlike .gguf, which needs a separate mmproj.gguf file), then save the output. After that, swap the model loader and text encoder to GGUF.

If it produces similar output, then that confirms that Comfy doesn’t implement or use the vision capabilities Gemma-3 has. And that means vision isn’t an essential component in LTX2, since when you run it using a GGUF text encoder, it isn’t using mmproj/vision at all.

Another way to tell is that in the native Comfy workflow, the CLIP output from the TE loader is connected directly to the normal CLIP Text Encode node. It’s just a standard text-encode node, not a special one like TextEncodeQwenImageEdit, which has an image input.

Even in the ComfyUI-LTX2 custom node, I believe it’s only used to provide context for the prompt-enhancer node, so Gemma-3 sees the image and tries to enhance the prompt based on the image context.

This is unlike Qwen-Image-Edit, which requires vision to perform the image edit; that's why mmproj is a must for the qwen-image-edit GGUF.

@zwukong

zwukong commented Jan 13, 2026

Native ComfyUI only uses the text encoder as a CLIP node, but LLMs have evolved so much that a string result is really needed, so that we can do whatever we want with the LLM without using a custom 'LLM loader' to load the LLM again.
