Skip to content

Conversation

@ChingTsai
Copy link
Collaborator

@ChingTsai ChingTsai commented Jan 20, 2026

Description

  • Supporting system role in hf data processing
  • Add test case for Qwen in sft_data_processing_test.py
  • Refactoring sft_data_processing_test.py to support testing multiple tokenizer
  • Fix incorrect left shift in targets_segmentation in shift_and_refine
    • see more details here

FIXES: b/471740714

Tests

Unit test

MAXTEXT_PKG_DIR=<xxx>/maxtext/src/MaxText pytest tests/sft_data_processing_test.py

E2E sft

python3 -m MaxText.sft.sft_trainer \
   src/MaxText/configs/sft.yml \
    run_name=$(date +%Y-%m-%d-%H-%M-%S) \
    base_output_directory=<xxx> \
    model_name=qwen3-4b \
    load_parameters_path=<xxx>/0/items \
    tokenizer_path=Qwen/Qwen3-4B \
    steps=53 \
    profiler=xplane \
    hf_path=arrow \
    dataset_type=hf \
    train_split=train \
    hf_train_files=<xxx>.arrow \
    hf_eval_files=<xxx>.arrow \
    per_device_batch_size=16 \
    max_target_length=1024 \
    learning_rate=5e-6 \
    warmup_steps_fraction=0.05 \
   data_shuffle_seed=42 \
    gradient_clipping_threshold=1 \
    weight_dtype=bfloat16

logs

unpatched gpaste
patched gpaste

Step 1 loss drop from 6.100826 to 5.839514

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@github-actions
Copy link

🤖 Hi @ChingTsai, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Review Summary

This pull request adds support for the "system" role in Hugging Face input processing, which is a valuable enhancement for chat models. The implementation is well-structured, particularly the logic within apply_chat_template. The accompanying tests are thorough and have been significantly improved by parameterization, allowing for easier testing with different tokenizers.

🔍 General Feedback

  • The refactoring in sft_data_processing_test.py to use parameterized_class is a great improvement for test maintainability and extensibility.
  • The addition of _get_pad_id in _hf_data_processing.py is a good example of code deduplication.
  • The fix in shift_and_refine to also shift targets_segmentation is a subtle but important correction.

Overall, this is a high-quality contribution. I have one minor suggestion for improving test consistency.

@codecov
Copy link

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/MaxText/input_pipeline/_hf_data_processing.py 66.66% 2 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

def shift_and_refine(x, ignored_ids, axis=1):
"""Shift inputs, set segmentation to 0 when target element is in ignored_ids if provided"""
x["targets"] = shift_left(x["targets"], ignored_ids[0], axis=axis)
x["targets_segmentation"] = shift_left(x["targets_segmentation"], 0, axis=axis)
Copy link
Collaborator Author

@ChingTsai ChingTsai Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Llama 2, since token_id 0 is unknown_id, the segmentation for pad_ids (added during packing) is updated correctly in lines 699-700. However, for Qwen3, token_id 0 is !, therefore, we need to apply an explicit left shift to the segmentation here.

- Add unit tests in hf_data_prcessing for Qwen3
  - Refactor sft_data_processing_test
  - Fix incorrect left shift in targets_segmentation
@ChingTsai ChingTsai force-pushed the support-system-role-in-hf-input branch from db379a3 to b14f116 Compare January 20, 2026 10:08
@ChingTsai ChingTsai marked this pull request as ready for review January 20, 2026 10:44
@ChingTsai ChingTsai changed the title Support system role in hf input Support system role in hf data processing Jan 20, 2026
@ChingTsai ChingTsai self-assigned this Jan 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant