Support system role in hf data processing #2970
base: main
Conversation
🤖 Hi @ChingTsai, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This pull request adds support for the "system" role in Hugging Face input processing, which is a valuable enhancement for chat models. The implementation is well-structured, particularly the logic within apply_chat_template. The accompanying tests are thorough and have been significantly improved by parameterization, allowing for easier testing with different tokenizers.
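For context, here is a minimal sketch of what system-role handling around `apply_chat_template` can look like. The message-dict structure and the `tokenizer.apply_chat_template` call follow the standard Hugging Face transformers API; the wrapper function, field names, and the checkpoint used here are illustrative assumptions, not the PR's actual code.

```python
from transformers import AutoTokenizer

def render_example(tokenizer, example):
    """Illustrative: build a chat-message list, prepending a system turn when present."""
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    messages.append({"role": "user", "content": example["prompt"]})
    messages.append({"role": "assistant", "content": example["completion"]})
    # tokenize=True returns token ids rendered through the model's chat template.
    return tokenizer.apply_chat_template(messages, tokenize=True)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
ids = render_example(tok, {"system": "You are helpful.", "prompt": "Hi", "completion": "Hello!"})
```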
🔍 General Feedback
- The refactoring in `sft_data_processing_test.py` to use `parameterized_class` is a great improvement for test maintainability and extensibility (a minimal sketch follows below).
- The addition of `_get_pad_id` in `_hf_data_processing.py` is a good example of code deduplication.
- The fix in `shift_and_refine` to also shift `targets_segmentation` is a subtle but important correction.

Overall, this is a high-quality contribution. I have one minor suggestion for improving test consistency.
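As referenced in the first bullet, here is a hedged sketch of class-level parameterization over tokenizers using `parameterized_class` from the `parameterized` package; the actual test class, attribute names, and tokenizer paths in `sft_data_processing_test.py` may differ.

```python
import unittest
from parameterized import parameterized_class

@parameterized_class(
    ("tokenizer_path",),
    [
        ("meta-llama/Llama-2-7b-hf",),  # illustrative tokenizer choices
        ("Qwen/Qwen3-0.6B",),
    ],
)
class SftDataProcessingTest(unittest.TestCase):
    def test_tokenizer_path_is_wired_through(self):
        # The real test would build the SFT pipeline with this tokenizer and
        # inspect the processed batches; this stub only shows the class wiring.
        self.assertTrue(self.tokenizer_path)

if __name__ == "__main__":
    unittest.main()
```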
```python
def shift_and_refine(x, ignored_ids, axis=1):
  """Shift inputs, set segmentation to 0 when target element is in ignored_ids if provided"""
  x["targets"] = shift_left(x["targets"], ignored_ids[0], axis=axis)
  x["targets_segmentation"] = shift_left(x["targets_segmentation"], 0, axis=axis)
```
For Llama 2, since token_id 0 is unknown_id, the segmentation for pad_ids (added during packing) is updated correctly in lines 699-700. However, for Qwen3, token_id 0 is `!`, so we need to apply an explicit left shift to the segmentation here.
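To make the point concrete, here is a toy illustration under an assumed semantics for `shift_left` (shift by one position along the axis and fill the vacated last slot with the given value); the pad id used is illustrative, not Qwen3's real one.

```python
import numpy as np

def shift_left(x, fill, axis=1):
    """Assumed semantics: shift left by one along `axis`, filling the last slot with `fill`."""
    shifted = np.roll(x, -1, axis=axis)
    idx = [slice(None)] * x.ndim
    idx[axis] = -1
    shifted[tuple(idx)] = fill
    return shifted

pad_id = 99  # illustrative non-zero pad id, standing in for a Qwen3-style tokenizer
targets = np.array([[11, 12, 13, pad_id]])
segmentation = np.array([[1, 1, 1, 0]])

shifted_targets = shift_left(targets, pad_id)       # [[12, 13, 99, 99]]
# Without the explicit shift, segmentation would stay [[1, 1, 1, 0]] and still mark
# position 2 (now a pad token) as real; shifting it keeps the two arrays aligned.
shifted_segmentation = shift_left(segmentation, 0)  # [[1, 1, 0, 0]]
```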
- Add unit tests in hf_data_processing for Qwen3
- Refactor sft_data_processing_test
- Fix incorrect left shift in targets_segmentation
db379a3 to b14f116
Description
- Support `system` role in hf data processing
- Add unit tests in `sft_data_processing_test.py`
- Refactor `sft_data_processing_test.py` to support testing multiple tokenizers
- Fix left shift in `shift_and_refine`

FIXES: b/471740714
Tests
- Unit test
- E2E sft logs:
  - unpatched gpaste
  - patched gpaste
- Step 1 loss drops from 6.100826 to 5.839514

Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.