Conversation

@vantagewithai
Contributor

@vantagewithai vantagewithai commented Jan 7, 2026

The Diffusion Loader in ComfyUI can read the model config directly from the safetensors metadata. I’ve added the same support here, so newer models like LTX2, whose model configuration ComfyUI loads from the safetensors file header metadata, are handled correctly.

@kijai

kijai commented Jan 9, 2026

Thanks, can confirm this works, example GGUF that includes the metadata and runs with this PR:

https://huggingface.co/Kijai/LTXV2_comfy/blob/main/diffusion_models/ltx-2-19b-distilled_Q4_K_M.gguf
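(For anyone who wants to check whether a downloaded GGUF actually carries this metadata before loading it, here is a minimal sketch using the gguf Python package; it simply dumps every header key, since which keys the conversion script writes is not spelled out in this thread.)

from gguf import GGUFReader, GGUFValueType

def print_gguf_metadata(path):
    # Dump every metadata key/value pair stored in the GGUF header.
    reader = GGUFReader(path)
    for name, field in reader.fields.items():
        part = field.parts[field.data[-1]]  # last part holds the value
        if field.types and field.types[0] == GGUFValueType.STRING:
            value = bytes(part).decode("utf-8")
        else:
            value = part.tolist()  # numeric scalars/arrays are shown raw
        print(f"{name}: {value}")

print_gguf_metadata("ltx-2-19b-distilled_Q4_K_M.gguf")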

@vantagewithai
Contributor Author

vantagewithai commented Jan 9, 2026

Yes, I have them all working, both the dev and distilled versions.
https://huggingface.co/vantagewithai/LTX-2-GGUF/tree/main

@MeiYi-dev MeiYi-dev mentioned this pull request Jan 9, 2026
@Heliumrich

Thanks, can confirm this works, example GGUF that includes the metadata and runs with this PR:

https://huggingface.co/Kijai/LTXV2_comfy/blob/main/diffusion_models/ltx-2-19b-distilled_Q4_K_M.gguf

Hi, can we efficiently use a dev GGUF with the BF16 LoRA to make it distilled?
I mean in terms of memory usage; or is it better to download a distilled GGUF directly?

@vantagewithai
Contributor Author

vantagewithai commented Jan 9, 2026

You can use the distilled model directly, or run a LoRA with the dev version. In practice, 20 full steps with CFG 4 give better results than the distilled setup alone. If memory is constrained, use the distilled version. I tested the dev version with the LoRA on an RTX 3060 12GB and it ran without OOM.

Using the dev version gives you more flexibility — you can run the full 20 steps at a higher CFG (without having to download the complete model again) or run 8 steps with LoRA. This model is very fast (at least 5× faster than Wan), so even 20-step runs execute quickly.
The distilled LoRA is especially useful for upscaling, where higher CFG or more steps aren’t necessary. Overall, the best approach is to use the dev version together with the distilled LoRA.

@Heliumrich

Heliumrich commented Jan 9, 2026

You can use the distilled model directly, or run a LoRA with the dev version. In practice, 20 full steps with CFG 4 give better results than the distilled setup alone. If memory is constrained, use the distilled version. I tested the dev version with the LoRA on an RTX 3060 12GB and it ran without OOM.

Using the dev version gives you more flexibility — you can run the full 20 steps at a higher CFG (without having to download the complete model again) or run 8 steps with LoRA. This model is very fast (at least 5× faster than Wan), so even 20-step runs execute quickly. The distilled LoRA is especially useful for upscaling, where higher CFG or more steps aren’t necessary. Overall, the best approach is to use the dev version together with the distilled LoRA.

How is the distilled model better for low VRAM? I mean, it should just scale linearly; fewer steps make it faster regardless of VRAM or offloading.
With 16GB VRAM (and 32GB RAM), I will probably use Q6 with a tiny bit of offloading.

I feel like Q5_K_M or similar will degrade quality too much.
And big resolutions and/or long videos will take even more VRAM, so only Q4 would have enough headroom for 720p 5 sec or so.

So I think I’ll use the LoRA just to "learn to prompt LTX2" and mess with it, but later use only dev.

@vantagewithai
Contributor Author

vantagewithai commented Jan 9, 2026

How is the distilled model better for low VRAM? I mean, it should just scale linearly; fewer steps make it faster regardless of VRAM or offloading.
With 16GB VRAM (and 32GB RAM), I will probably use Q6 with a tiny bit of offloading.

Using the distilled version directly means you don’t need to load the extra LoRA. That’s only a minimal saving, but if you’re very tight on VRAM and RAM and are running distilled mode at only 8 steps, there’s no need to load another 7GB+ as a LoRA.

With 16GB VRAM and 32GB RAM, you can even run the full dev BF16 version. I was able to run it on 12GB; the output was better, and because of offloading the added time was not that much, so quality won over the small time increase. I was also able to generate 20 seconds @ 25fps without OOM.

@zwukong

zwukong commented Jan 9, 2026

@vantagewithai
Great work, thanks so much. And it would be even better if we could have a Gemma-3 GGUF or 4-bit like this: https://huggingface.co/unsloth/gemma-3-12b-it-qat-bnb-4bit/tree/main

@vantagewithai
Contributor Author

vantagewithai commented Jan 9, 2026

@vantagewithai Great work, thanks so much. And it would be even better if we could have a Gemma-3 GGUF or 4-bit like this: https://huggingface.co/unsloth/gemma-3-12b-it-qat-bnb-4bit/tree/main

You can use the GGUF DualCLIP Loader node from this repo to load Gemma-3 GGUF models from Unsloth or Google. This node supports loading both GGUF and safetensors.

For 4-bit quantized models, you can load them directly using ComfyUI’s built-in Dual CLIP Loader node.

@theOliviaRossi

@vantagewithai Great work, thanks so much. And it would be even better if we could have a Gemma-3 GGUF or 4-bit like this: https://huggingface.co/unsloth/gemma-3-12b-it-qat-bnb-4bit/tree/main

here you go: https://huggingface.co/mradermacher/gemma-3-12b-it-heretic-x-i1-GGUF/tree/main

@zwukong

zwukong commented Jan 9, 2026

Really? 🤔

Can ComfyUI’s built-in Dual CLIP Loader node load a folder?

The GGUF from https://huggingface.co/unsloth/gemma-3-12b-it-GGUF/tree/main got this error:

Unexpected text model architecture type in GGUF file: 'gemma3'

@rmcc3

rmcc3 commented Jan 9, 2026

This PR keeps giving me 'VAE' object has no attribute 'latent_frequency_bins', even when using "working" workflows.

@vantagewithai
Contributor Author

vantagewithai commented Jan 9, 2026

This PR keeps giving me 'VAE' object has no attribute 'latent_frequency_bins', even when using "working" workflows.

This happens because the audio VAE can’t be loaded with ComfyUI’s internal VAE Loader. Either use Kijai’s VAE Loader (which loads from the vae folder), or use ComfyUI’s LTXV Audio VAE Loader and copy the audio VAE into the checkpoints folder (that location only applies to the LTXV Audio VAE Loader).

@zwukong

zwukong commented Jan 9, 2026

@vantagewithai thanks for your reply, but I have tried again and neither of your methods works: the int4 folder can’t be loaded by the Dual CLIP Loader, and the Gemma3 GGUF gives the architecture error with the GGUF DualCLIP Loader.

@vantagewithai
Contributor Author

Really? 🤔

Can ComfyUI’s built-in Dual CLIP Loader node load a folder?

The GGUF from https://huggingface.co/unsloth/gemma-3-12b-it-GGUF/tree/main got this error:

Unexpected text model architecture type in GGUF file: 'gemma3'

Nope — the Dual CLIP Loader can’t load folders.

You’re right about the Gemma GGUF though — that one can be loaded using the GGUF Dual CLIP Loader node. I’ll take a look and see how it can be supported.

@Bradley-Liu

@vantagewithai I use kj's loadvaekj to load the audio VAE, but the code gives me a "'VAE' object has no attribute 'latent_frequency_bins'" error. Could you tell me how I can fix this?

@LIQUIDMIND111

Tried the GGUF q4 from kijai, and got this AMAZING RESULT, LMDAO!!!!!

LTX_2.0_i2v_00102_.mp4

@YarvixPA
Contributor

YarvixPA commented Jan 9, 2026

You should replace the "nodes.py" and "loader.py" of the custom node with the ones from this PR.

This is an example output (Q3_K_S dev model + distilled lora at 1080p) of my GGUF quant available at https://huggingface.co/QuantStack/LTX-2-GGUF

LTX-2.20GGUFs.20coming.mp4

@LIQUIDMIND111

You should replace the "nodes.py" and "loader.py" of the custom node with the ones from this PR.

This is an example output (Q3_K_S dev model + distilled lora at 1080p) of my GGUF quant available at https://huggingface.co/QuantStack/LTX-2-GGUF

LTX-2.20GGUFs.20coming.mp4

I got the files downloaded. Now, if I am using the NATIVE Comfy workflows with kijai nodes, where do I find them in Comfy? I can't see them in the search bar.

@YarvixPA
Contributor

YarvixPA commented Jan 9, 2026

You should replace the "nodes.py" and "loader.py" of the custom node with the ones from this PR.

This is an example output (Q3_K_S dev model + distilled lora at 1080p) of my GGUF quant available at https://huggingface.co/QuantStack/LTX-2-GGUF

LTX-2.20GGUFs.20coming.mp4

I got the files downloaded. Now, if I am using the NATIVE Comfy workflows with kijai nodes, where do I find them in Comfy? I can't see them in the search bar.

You should go to the custom_nodes folder > ComfyUI-GGUF folder and replace those files there.

@LIQUIDMIND111

You should replace the "nodes.py" and "loader.py" of the custom node with the ones from this PR.

This is an example output (Q3_K_S dev model + distilled lora at 1080p) of my GGUF quant available at https://huggingface.co/QuantStack/LTX-2-GGUF

LTX-2.20GGUFs.20coming.mp4

I got the files downloaded. Now, if I am using the NATIVE Comfy workflows with kijai nodes, where do I find them in Comfy? I can't see them in the search bar.

You should go to the custom_nodes folder > ComfyUI-GGUF folder and replace those files there.

ahh got it, done! let me try now a render, thanks!

@Bradley-Liu

@YarvixPA do you think using dev+lora has a better result than using distilled alone?

@vantagewithai
Contributor Author

@YarvixPA do you think using dev+lora has a better result than using distilled alone?

I tried it both ways and the results were very similar — I didn’t notice any major degradation or improvement. Since this model is quite fast, I prefer using the dev version with 20 steps and CFG 4 (without distillation) for production, and dev + distilled LoRA for prototyping.

@LostnD

LostnD commented Jan 9, 2026

Workflow? Please, anyone: I downloaded the QuantStack GGUFs! Now which workflow should I use, kijai's or the official ComfyUI LTX one?

@LostnD

LostnD commented Jan 9, 2026

I'm getting this error
image
I tried the Gemma 4-bit safetensors files (both of them from one folder) and also the Gemma fp8_e4m3fn.

@YarvixPA
Contributor

YarvixPA commented Jan 9, 2026

@YarvixPA do you think using dev+lora has a better result than using distilled alone?

No, I'm also going to upload the GGUF quants for the distilled version once I'm back. Since 'Dev' is the base, you can just apply the LoRA to it. However, Dev will always give you better quality as it's meant for higher step counts.

@guiteubeuh

Tried the GGUF q4 from kijai, and got this AMAZING RESULT, LMDAO!!!!!
LTX_2.0_i2v_00102_.mp4

I have the same issue. I downloaded the 2 files; did you fix it?

@Heliumrich

Heliumrich commented Jan 10, 2026

Can anybody here post those two files, loader.py and nodes.py, as attachments, so we can download and use them?

https://raw.githubusercontent.com/city96/ComfyUI-GGUF/5f715d6fda151d21f621d9ec801975d938332305/loader.py
https://raw.githubusercontent.com/city96/ComfyUI-GGUF/f083506720f2f049631ed6b6e937440f5579f6c7/nodes.py

Right-click, "Save target as...", and replace the same files in the ComfyUI-GGUF folder under custom_nodes.
If, for some reason, this PR gets updated further, these links won't reflect the newer changes.
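(The same thing scripted, for anyone who prefers not to click around; a small sketch that assumes it is run from the ComfyUI root folder, so that custom_nodes/ComfyUI-GGUF is the right relative path.)

import urllib.request

# Pinned raw files from this PR's branch (same links as above).
files = {
    "loader.py": "https://raw.githubusercontent.com/city96/ComfyUI-GGUF/5f715d6fda151d21f621d9ec801975d938332305/loader.py",
    "nodes.py": "https://raw.githubusercontent.com/city96/ComfyUI-GGUF/f083506720f2f049631ed6b6e937440f5579f6c7/nodes.py",
}

for name, url in files.items():
    # Overwrites the existing file in the custom node folder.
    urllib.request.urlretrieve(url, f"custom_nodes/ComfyUI-GGUF/{name}")
    print(f"Updated {name}")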

@kijai

kijai commented Jan 10, 2026

Hmm, don't unsloth run some tests to find which blocks are more "important" and apply some logic specific to each model? That's what they do for LLMs at least. They auto-correct in an iterative way.

Okay, on some of them below Q5 it does seem there are some mixed weights, yes.

@Heliumrich

Heliumrich commented Jan 10, 2026

I mean your quants are already pretty great and really similar, nothing would beat Nunchaku either way...
(please Kijai fork nunchaku and add support, the project is basically dead 😭 )

SVDQuant formats (W4A16/W4A4/W8A8/W4A8KV4) are so much better (and faster) than GGUF
DeepCompressor and Nunchaku projects are so slow to add new models :/

@vantagewithai
Contributor Author

Hmm, don't unsloth run some tests to find which blocks are more "important" and apply some logic specific to each model?
That's what they do for LLMs at least. They auto-correct in an iterative way.

Usually, the most important blocks in diffusion models are the first few blocks, which refine the initial latent, and the last blocks, which produce the final latent output.

But you’re right — in the case of Qwen Image Layered, I also had to quantize two middle blocks to get the best results. There’s no mathematical formula for this; you really have to test and see what works best.

@kijai

kijai commented Jan 10, 2026

I mean your quants are already pretty great and really similar, nothing would beat Nunchaku either way... (please Kijai fork nunchaku and add support, the project is basically dead 😭 )

SVDQuant formats (W4A16/W4A4/W8A8/W4A8KV4) are so much better (and faster) than GGUF

I didn't do anything special with them, just city96's script. The mixed models should technically perform better, I wish they were marked as such though, because not all of them are and for precisions such as Q6 or Q8 there's no difference.

@vantagewithai
Contributor Author

vantagewithai commented Jan 10, 2026

I mean your quants are already pretty great and really similar, nothing would beat Nunchaku either way... (please Kijai fork nunchaku and add support, the project is basically dead 😭 )

SVDQuant formats (W4A16/W4A4/W8A8/W4A8KV4) are so much better (and faster) than GGUF

Nunchaku is quite good — they use SVDQuant, a 4-bit quantization scheme that significantly improves speed. GGUF, on the other hand, follows a different architecture.

They were planning to add support for Wan and video models. I haven’t followed up recently, but at the time it seemed they were still limited to image models. I might be mistaken though — the last time I checked the Nunchaku project was at least a month ago.

Let's hope Kijai takes it over. :)

@vantagewithai
Contributor Author

vantagewithai commented Jan 10, 2026

I mean your quants are already pretty great and really similar, nothing would beat Nunchaku either way... (please Kijai fork nunchaku and add support, the project is basically dead 😭 )
SVDQuant formats (W4A16/W4A4/W8A8/W4A8KV4) are so much better (and faster) than GGUF

I didn't do anything special with them, just city96's script. The mixed models should technically perform better, I wish they were marked as such though, because not all of them are and for precisions such as Q6 or Q8 there's no difference.

I modified llama.cpp and added support for keeping 6 blocks at higher precision in the lower-quant versions.

if (arch == LLM_ARCH_LTXV) {
    if (
        (name.find("transformer_blocks.0.") != std::string::npos) ||
        (name.find("transformer_blocks.1.") != std::string::npos) ||
        (name.find("transformer_blocks.2.") != std::string::npos) ||
        // (name.find("transformer_blocks.29.") != std::string::npos) ||
        // (name.find("transformer_blocks.30.") != std::string::npos) ||
        (name.find("transformer_blocks.45.") != std::string::npos) ||
        (name.find("transformer_blocks.46.") != std::string::npos) ||
        (name.find("transformer_blocks.47.") != std::string::npos) // this should be dynamic
    ) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0 || ftype == LLAMA_FTYPE_MOSTLY_Q4_1) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_0 || ftype == LLAMA_FTYPE_MOSTLY_Q5_1) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_S) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
            new_type = GGML_TYPE_Q6_K;
        }
    }
}

@YarvixPA
Contributor

YarvixPA commented Jan 10, 2026

Same thing on the QuantStack GGUFs. This is something that has already been implemented in the Qwen Image quants.

+    // LTX-2: first/last block high precision for lower quants
+    if (arch == LLM_ARCH_LTXV) {
+        if (
+            (name.find("transformer_blocks.0.") != std::string::npos) ||
+            (name.find("transformer_blocks.47.") != std::string::npos) // 48 blocks total (0-47)
+        ) {
+            if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q4_0 ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q4_1 ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S ||
+                ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
+                new_type = GGML_TYPE_Q5_K;  // Minimum Q5_K for low quants
+            }
+            else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
+                new_type = GGML_TYPE_Q6_K;
+            }
+        }
+    }

@vantagewithai
Contributor Author

vantagewithai commented Jan 10, 2026

Same thing on the QuantStack GGUFs. This is something that has already been implemented in the Qwen Image quants.

For Qwen Image Layered, I found the best results using this setup.

static bool qwen_image_needs_protection(enum llama_ftype ftype) {
    switch (ftype) {
        case LLAMA_FTYPE_MOSTLY_Q2_K:
        case LLAMA_FTYPE_MOSTLY_Q3_K_S:
        case LLAMA_FTYPE_MOSTLY_Q3_K_M:
        case LLAMA_FTYPE_MOSTLY_Q4_K_M:
        case LLAMA_FTYPE_MOSTLY_Q4_K_S:
        case LLAMA_FTYPE_MOSTLY_Q5_K_S:
        case LLAMA_FTYPE_MOSTLY_Q5_0:
        case LLAMA_FTYPE_MOSTLY_Q4_0:
        case LLAMA_FTYPE_MOSTLY_Q4_1:
            return true;
        default:
            return false;
    }
}

static bool qwen_image_force_q5(const std::string & name, int block_id) {
    // Attention Q/K projections
    if (name.find(".attn.") != std::string::npos) {
        if (name.find("q_proj") != std::string::npos ||
            name.find("k_proj") != std::string::npos ||
            name.find("to_q") != std::string::npos ||
            name.find("to_k") != std::string::npos) {
            return true;
        }
    }

    // Early MLP + modulation layers
    if (block_id >= 0 && block_id <= 5) {
        if (name.find(".img_mlp.") != std::string::npos ||
            name.find(".txt_mlp.") != std::string::npos ||
            name.find(".img_mod.") != std::string::npos ||
            name.find(".txt_mod.") != std::string::npos) {
            return true;
        }
    }

    return false;
}

if (arch == LLM_ARCH_QWEN_IMAGE) {
    if (
        (name.find("transformer_blocks.0.") != std::string::npos) ||
        (name.find("transformer_blocks.1.") != std::string::npos) ||
        (name.find("transformer_blocks.2.") != std::string::npos) ||
        (name.find("transformer_blocks.29.") != std::string::npos) ||
        (name.find("transformer_blocks.30.") != std::string::npos) ||
        (name.find("transformer_blocks.57.") != std::string::npos) ||
        (name.find("transformer_blocks.58.") != std::string::npos) ||
        (name.find("transformer_blocks.59.") != std::string::npos) // this should be dynamic
    ) {
        if (qwen_image_needs_protection(ftype)) {
            const int block_id = get_block_id(name);
            if (qwen_image_force_q5(name, block_id)) {
                new_type = GGML_TYPE_Q4_K;
            }
        }
    }
}

For LTX-2, I used the first 3 and last 3 blocks.

@Arjun-Haridasan

Arjun-Haridasan commented Jan 10, 2026

@kijai @vantagewithai I don't understand shit about what you guys are talking about (though I know a little bit of what is happening here). Can you guys help me set up a working workflow to use LTX-2 in ComfyUI on a 12 GB VRAM device?

@vantagewithai
Contributor Author

vantagewithai commented Jan 10, 2026

@kijai @vantagewithai I don't understand shit about what you guys are talking about. Can you guys help me set up a working workflow to use LTX-2 in ComfyUI on a 12 GB VRAM device?

Try this. It supports both T2V/I2V and safetensors/GGUF in one workflow. I have tested it on a 12GB VRAM and 48GB system RAM setup.

Since the PR hasn’t been merged yet, this workflow is using my own custom node based on ComfyUI-GGUF to load GGUF models. If you already have the merged PR changes on your side, you can safely replace it with the ComfyUI-GGUF node.

https://github.com/vantagewithai/Vantage-Nodes

To run I2V mode in the workflow, you’ll need to do a lot of offloading by running ComfyUI with these params. T2V mode works fine without needing to reserve VRAM.

python main.py --lowvram --reserve-vram 10

https://huggingface.co/vantagewithai/LTX-2-Split/resolve/main/Vantage-LTX2-Advanced-Workflow-GGUF-Support.json?download=true

workflow

@Arjun-Haridasan

Arjun-Haridasan commented Jan 10, 2026

I was able to run the I2V Wan 14B model with 12 GB VRAM and 32 GB RAM (though it takes 45 minutes to generate a 1080p 5-second video). I need something like this that works. But after updating, none of the workflows I have are working; everything throws one error or another... Could you help me by explaining how to set up ComfyUI so that errors can be avoided after updating?
Wan 2 2 I2V 14b GGUF + Lora

@FlowDownTheRiver

Great job! I will try this one, as my video gens with GGUF had that pixelated effect and broken audio. Hey @vantagewithai, your YouTube channel is also a really good one. Thanks for the PR!

@vantagewithai
Contributor Author

Great job! I will try this one, as my video gens with GGUF had that pixelated effect and broken audio. Hey @vantagewithai, your YouTube channel is also a really good one. Thanks for the PR!

You are most welcome! :)

@LostnD

LostnD commented Jan 10, 2026

@vantagewithai please dude, do something about the Gemma part, I have 6GB VRAM! Make the Gemma GGUF work with LTX 2; it gets stuck exactly at the Gemma part and doesn't get past it in ComfyUI!

@zwukong

zwukong commented Jan 10, 2026

@vantagewithai A Qwen Image Layered GGUF is needed. There is no quality Q3 yet (Unsloth and QuantStack are both bad, Unsloth worse). Thanks for your code. Can you provide a GGUF link?

@shimmyshimmer

shimmyshimmer commented Jan 11, 2026

@vantagewithai A Qwen Image Layered GGUF is needed. There is no quality Q3 yet (Unsloth and QuantStack are both bad, Unsloth worse). Thanks for your code. Can you provide a GGUF link?

You might've used an old version of the Qwen Image layered by Unsloth. We just updated like a day ago for dynamic quantization. Try it out and see if you still get bad performance: https://huggingface.co/unsloth/Qwen-Image-Layered-GGUF/tree/main

We're always trying to improve our formula. And we run an analysis/search to find quant configs and we are continuing to evolve methodology.

@zwukong

zwukong commented Jan 11, 2026

OK, I will try. The most useful way to check GGUF quality is Q3, I think; if Q3 is fine, the others will be very good.

@vantagewithai
Contributor Author

vantagewithai commented Jan 11, 2026

@vantagewithai A Qwen Image Layered GGUF is needed. There is no quality Q3 yet (Unsloth and QuantStack are both bad, Unsloth worse). Thanks for your code. Can you provide a GGUF link?

You might've used an old version of the Qwen Image layered by Unsloth. We just updated like a day ago for dynamic quantization. Try it out and see if you still get bad performance: https://huggingface.co/unsloth/Qwen-Image-Layered-GGUF/tree/main

We're always trying to improve our formula. And we run an analysis/search to find quant configs and we are continuing to evolve methodology.

@shimmyshimmer No, I didn’t use one from Unsloth. I always quantize the models myself. What I shared was simply the method that gave me the best results, especially in terms of layer-splitting accuracy.

@vantagewithai
Contributor Author

vantagewithai commented Jan 11, 2026

OK, I will try. The most useful way to check GGUF quality is Q3, I think; if Q3 is fine, the others will be very good.

@zwukong @shimmyshimmer @YarvixPA @kijai Unsloth puts a lot of effort into quantizing models and keeps improving them. Also, a shoutout to QuantStack, and of course Kijai for his fp8 versions — they all do great work for the community by providing high-quality quantized models for everyone. That said, for Qwen Image Layered, I’d recommend sticking with Q4_K_M or higher — even the FP8 version doesn’t perform as well as the BF16 weights.

@nizamani

I keep getting this error when trying GGUF Gemma 3:
image
I tried the Unsloth Gemma 3 GGUF as well as this one: https://huggingface.co/mradermacher/gemma-3-12b-it-heretic-x-i1-GGUF/tree/main

@JosephMillsAtWork

JosephMillsAtWork commented Jan 11, 2026

@vantagewithai thanks for this. After applying the patch and altering #398, I was able to get this "running" on my laptop with 6GB of VRAM (some audio sync issues, but hey, it's 6GB of VRAM on a laptop). I also tested on my 8GB and 16GB cards. Again, thanks for your time and the effort you put into this.

@city96 I know we all get busy with life and everything; just a friendly bump for the merge.

@city96
Owner

city96 commented Jan 11, 2026

Sorry, yeah, I've had a lot of stuff to deal with and barely have a working PC to test on, so I'm really behind on new models and issues.

Anyway, I checked out this PR. It does seem to break loading quantized text encoders as-is, since it changes the number of elements that gguf_sd_loader returns. I'll push a fix to this branch that changes it around a bit to return a dict instead; hopefully that's a better long-term solution, since we can add more stuff to it (and we no longer need the return_arch arg either). It does mean a one-time breaking change either way for any node pack that calls gguf_sd_loader directly, though I'm not sure how common that really is.

For the sake of speed, I'll merge it with those changes. If anything breaks, I'll be around for at least a few days so I can try and fix stuff faster. I'll also try to look at gemma3.
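(For third-party node packs that do call gguf_sd_loader directly, a defensive wrapper along these lines could smooth over the change; the dict key name "sd" below is only a guess at the post-merge return shape, so treat the whole thing as a sketch rather than the actual API.)

def load_gguf_sd_compat(gguf_sd_loader, path):
    # gguf_sd_loader is passed in because the real import path depends on how
    # the node pack vendors or imports ComfyUI-GGUF's loader module.
    result = gguf_sd_loader(path)
    if isinstance(result, tuple):
        return result[0]            # this PR's original form: (state_dict, extra)
    if isinstance(result, dict) and "sd" in result:
        return result["sd"]         # hypothetical key for the merged dict form
    return result                   # old behavior: plain state dict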

This should be more future proof in case we need to return other attributes in the future. Possible breaking change for anyone using `gguf_sd_loader` directly either way, though.
@city96 city96 merged commit 58625e1 into city96:main Jan 11, 2026
@vantagewithai
Contributor Author

vantagewithai commented Jan 11, 2026

@city96 Thanks a lot.

Since you’ve merged the PR and mentioned that it breaks a few things, I think this can be handled in a non-breaking way. We can add a helper function that simply checks whether the config key is present in the metadata; if it is, it returns the required metadata key or the full metadata. We do this in just this node's definition, so all other functions remain the same.

This way, the old implementation remains untouched, nothing breaks, and the new metadata support works seamlessly alongside it.
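(A minimal sketch of the kind of helper meant here, reading straight from the GGUF header with the gguf package; the key name "config" is purely illustrative, not necessarily what ComfyUI actually looks for.)

from gguf import GGUFReader, GGUFValueType

def get_config_from_metadata(path, key="config"):  # key name is illustrative only
    # Return the value of a single metadata key from the GGUF header, or None if absent.
    reader = GGUFReader(path)
    field = reader.fields.get(key)
    if field is None or not field.types or field.types[0] != GGUFValueType.STRING:
        return None
    return bytes(field.parts[field.data[-1]]).decode("utf-8")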

I also made a small local change to convert.py to add support for carrying metadata into the generated GGUF files. I think it would be better if this were parameter-driven, so could you please consider adding this functionality to convert.py?

Or, if you’re currently busy, I could add a parameter like --add-metadata and, based on that flag, call the required functions when generating the GGUF. I can then submit a separate PR for these changes.

Apart from these changes, I’ve also added support for several new, previously unrecognized models in my local copy of convert.py. I can submit a separate PR for those changes as well, if you’d like.

I added the following helper function:

def load_state_dict_with_metadata(path):
    # Load state dict and extract safetensors metadata
    if any(path.endswith(x) for x in [".ckpt", ".pt", ".bin", ".pth"]):
        state_dict = torch.load(path, map_location="cpu", weights_only=True)
        metadata = {}  # Legacy formats do not contain metadata
        for subkey in ["model", "module"]:
            if subkey in state_dict:
                state_dict = state_dict[subkey]
                break
    else:
        # Parse safetensors header for metadata
        import struct, json
        with open(path, "rb") as f:
            length = struct.unpack("<Q", f.read(8))[0]
            header = json.loads(f.read(length))
            metadata = header.get("__metadata__", {})
        
        state_dict = load_file(path)
        logging.info(f"Extracted {len(metadata)} metadata keys from safetensors")

    state_dict = strip_prefix(state_dict)
    return state_dict, metadata

Then, inside convert_file, I made the following changes:

state_dict, safetensors_metadata = load_state_dict_with_metadata(path)

# After writer creation
add_metadata_with_type(writer, safetensors_metadata)
logging.info(f"Copied {len(safetensors_metadata)} metadata keys to GGUF")

@city96
Owner

city96 commented Jan 11, 2026

@vantagewithai

Since you’ve merged the PR and mentioned that it breaks a few things, I think this can be handled in a non-breaking way.

I think the current approach makes the most sense long term, since we might need to add other returned info to the sd loader eventually, so just ripping the bandaid off and changing it once like this is likely the least painful. I checked a few custom nodes that I could think of; it shouldn't break ComfyUI-MultiGPU, and the other node packs, I think, just have a copy of the loader code instead of relying on this code directly.

I also made a small local change to convert.py to add support for carrying metadata into the generated GGUF files. I think it would be better if this were parameter-driven, so could you please consider adding this functionality to convert.py?

Yeah, I think that makes a lot of sense to have, though the convert code at the moment is a bit all over the place since half the updates are on a different branch. It'll have to be merged to master first, plus I guess a lot of the new model architectures are likely to be missing.

As a bonus, we keep the actual metadata from any model that does have it, though I guess we might want to wrap a try-catch per metadata line on the off-chance one has something weird in it that might break it. E.g. flux schnell just straight up has a base64 encoded jpeg thumbnail in the metadata. Not sure how well that gets handled.

image

@FlowDownTheRiver

FlowDownTheRiver commented Jan 11, 2026

@city96 Thanks for merging with the fix. After I tried @vantagewithai's implementation, which was working great for the LTX models, I realized it was breaking GGUF loading for the Qwen CLIPs. Now that's supported too. However, there is still a minor request I'd like you to push to the main repo if possible: PR #402 adds Gemma3 12b support, and in that same topic this file #402 (comment) was shared on top of @vantagewithai's implementation, which basically supported the LTX models and the Gemma model at the same time. So now that you have pushed this to the main repo, can you also include the Gemma support in the CLIP loaders, especially the dual CLIP loader? Then we can have everything fixed and supported.

Edit: I have seen your recent comments on that PR, so you know about the subject. Thanks for the great work to date...

city96 added a commit that referenced this pull request Jan 12, 2026
For #407 since old comfy versions don't support passing metadata (added in #399 )