Support system role in hf data processing #2970

ChingTsai · 2026-01-20T08:35:19Z

Description

Supporting system role in hf data processing
Add test case for Qwen in sft_data_processing_test.py
Refactoring sft_data_processing_test.py to support testing multiple tokenizer
Fix incorrect left shift in targets_segmentation in shift_and_refine
- see more details here

FIXES: b/471740714

Tests

Unit test

MAXTEXT_PKG_DIR=<xxx>/maxtext/src/MaxText pytest tests/sft_data_processing_test.py

E2E sft

python3 -m MaxText.sft.sft_trainer \
   src/MaxText/configs/sft.yml \
    run_name=$(date +%Y-%m-%d-%H-%M-%S) \
    base_output_directory=<xxx> \
    model_name=qwen3-4b \
    load_parameters_path=<xxx>/0/items \
    tokenizer_path=Qwen/Qwen3-4B \
    steps=53 \
    profiler=xplane \
    hf_path=arrow \
    dataset_type=hf \
    train_split=train \
    hf_train_files=<xxx>.arrow \
    hf_eval_files=<xxx>.arrow \
    per_device_batch_size=16 \
    max_target_length=1024 \
    learning_rate=5e-6 \
    warmup_steps_fraction=0.05 \
   data_shuffle_seed=42 \
    gradient_clipping_threshold=1 \
    weight_dtype=bfloat16

logs

unpatched gpaste
patched gpaste

Step 1 loss drop from 6.100826 to 5.839514

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

github-actions · 2026-01-20T09:00:59Z

🤖 Hi @ChingTsai, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

📋 Review Summary

This pull request adds support for the "system" role in Hugging Face input processing, which is a valuable enhancement for chat models. The implementation is well-structured, particularly the logic within apply_chat_template. The accompanying tests are thorough and have been significantly improved by parameterization, allowing for easier testing with different tokenizers.

🔍 General Feedback

The refactoring in sft_data_processing_test.py to use parameterized_class is a great improvement for test maintainability and extensibility.
The addition of _get_pad_id in _hf_data_processing.py is a good example of code deduplication.
The fix in shift_and_refine to also shift targets_segmentation is a subtle but important correction.

Overall, this is a high-quality contribution. I have one minor suggestion for improving test consistency.

tests/sft_data_processing_test.py

codecov · 2026-01-20T09:18:44Z

Codecov Report

❌ Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/MaxText/input_pipeline/_hf_data_processing.py	66.66%	2 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

ChingTsai · 2026-01-20T09:24:43Z

src/MaxText/input_pipeline/_input_pipeline_utils.py

 def shift_and_refine(x, ignored_ids, axis=1):
  """Shift inputs, set segmentation to 0 when target element is in ignored_ids if provided"""
  x["targets"] = shift_left(x["targets"], ignored_ids[0], axis=axis)
+  x["targets_segmentation"] = shift_left(x["targets_segmentation"], 0, axis=axis)


For Llama 2, since token_id 0 is unknown_id, the segmentation for pad_ids (added during packing) is updated correctly in lines 699-700. However, for Qwen3, token_id 0 is !, therefore, we need to apply an explicit left shift to the segmentation here.

- Add unit tests in hf_data_prcessing for Qwen3 - Refactor sft_data_processing_test - Fix incorrect left shift in targets_segmentation

ChingTsai added the gemini-review label Jan 20, 2026

github-actions bot reviewed Jan 20, 2026

View reviewed changes

tests/sft_data_processing_test.py Show resolved Hide resolved

ChingTsai commented Jan 20, 2026

View reviewed changes

- Support system prompt for hf data processing

b14f116

- Add unit tests in hf_data_prcessing for Qwen3 - Refactor sft_data_processing_test - Fix incorrect left shift in targets_segmentation

ChingTsai force-pushed the support-system-role-in-hf-input branch from db379a3 to b14f116 Compare January 20, 2026 10:08

ChingTsai marked this pull request as ready for review January 20, 2026 10:44

ChingTsai requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, aireenmei, bvandermoon, gagika, gobbleturk, hengtaoguo, jesselu-google, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners January 20, 2026 10:44

ChingTsai changed the title ~~Support system role in hf input~~ Support system role in hf data processing Jan 20, 2026

ChingTsai self-assigned this Jan 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support system role in hf data processing #2970

Support system role in hf data processing #2970

ChingTsai commented Jan 20, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Jan 20, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

Uh oh!

codecov bot commented Jan 20, 2026 •

edited

Loading

Uh oh!

ChingTsai Jan 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Support system role in hf data processing #2970

Are you sure you want to change the base?

Support system role in hf data processing #2970

Conversation

ChingTsai commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests

Unit test

E2E sft

logs

Checklist

Uh oh!

github-actions bot commented Jan 20, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

📋 Review Summary

🔍 General Feedback

Uh oh!

Uh oh!

codecov bot commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ChingTsai Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChingTsai commented Jan 20, 2026 •

edited

Loading

codecov bot commented Jan 20, 2026 •

edited

Loading

ChingTsai Jan 20, 2026 •

edited

Loading