Add IFEval #414
Conversation
- 112 total constraints (54 IFEval/IFTrain + 58 IFBench OOD)
- Self-contained module with no external repo dependencies
- Partial credit scoring (fraction of constraints satisfied)
- Automatic `<think>` tag stripping for reasoning models
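The scoring approach can be illustrated with a small sketch. This is not the PR's exact code; the `checkers` argument is a hypothetical list of instruction objects exposing `check_following`, as the module's checker classes do.

```python
import re

def ifeval_partial_credit(response: str, checkers) -> float:
    """Sketch of the partial-credit idea: strip a <think>...</think> block
    (for reasoning models), then return the fraction of constraint checkers
    satisfied by the remaining text."""
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if not checkers:
        return 0.0
    satisfied = sum(1 for checker in checkers if checker.check_following(cleaned))
    return satisfied / len(checkers)
```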
Cursor Bugbot has reviewed your changes and found 4 potential issues.
```python
    self._description_pattern = "Every {N}st of your response must be in Japanese."
if N % 10 == 2:
    self._description_pattern = "Every {N}nd of your response must be in Japanese."
elif N % 10 == 3:
```
TypeError when N is None in NthWordJapaneseChecker
High Severity
The build_description method uses N % 10 directly on lines 947-951, but N can be None (the default parameter value). Although self._japanese_position is correctly assigned a random value when N is None (lines 942-944), the subsequent modulo operations use N instead of self._japanese_position, causing a TypeError: unsupported operand type(s) for %: 'NoneType' and 'int' when the method is called without providing N.
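A straightforward fix, sketched below as a method of the checker, is to resolve the position once before any modulo branch runs. Attribute names follow the excerpt; the default random range is an assumption.

```python
import random

def build_description(self, *, N=None):
    # Resolve the position first so no branch ever sees a None value.
    if N is None:
        N = random.randint(2, 20)  # assumed default range
    self._japanese_position = N

    if N % 10 == 1:
        self._description_pattern = "Every {N}st of your response must be in Japanese."
    elif N % 10 == 2:
        self._description_pattern = "Every {N}nd of your response must be in Japanese."
    elif N % 10 == 3:
        self._description_pattern = "Every {N}rd of your response must be in Japanese."
    else:
        self._description_pattern = "Every {N}th of your response must be in Japanese."
    return self._description_pattern.format(N=N)
```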
```python
lines = value.split('\n')
for line in lines:
    if not line.strip():
        lines.remove(line)
```
Modifying list while iterating causes skipped elements
Medium Severity
In IndentStairsChecker.check_following, the code iterates over lines with a for loop while calling lines.remove(line) inside the loop. This is a classic Python antipattern - when an element is removed, subsequent elements shift down but the iterator advances, causing the next element to be skipped. This can lead to empty lines not being properly filtered out, causing incorrect validation results.
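A minimal fix, assuming `value` is the response text as in the excerpt, is to build a new filtered list rather than removing elements from the list being iterated:

```python
def non_blank_lines(value: str) -> list[str]:
    """Return the response's lines with blank lines removed.

    A list comprehension avoids the skip-on-remove pitfall of calling
    lines.remove() while iterating over the same list.
    """
    return [line for line in value.split("\n") if line.strip()]
```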
```python
for sentence in sentences:
    if len(sentence.strip()) != char_count:
        return False
return True
```
Missing word uniqueness check in CharacterCountUniqueWordsChecker
Medium Severity
The check_following method only verifies that sentences have equal character counts, but the instruction requires "using all different words." The word uniqueness constraint is completely missing from the implementation. Responses that reuse the same words across sentences will incorrectly pass validation as long as the character counts match.
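One way to add the missing constraint is to track the words seen so far, as sketched here with a naive whitespace split; the real checker's sentence splitting and attribute names may differ.

```python
def sentences_meet_constraint(sentences: list[str], char_count: int) -> bool:
    """Sketch: every sentence must have exactly `char_count` characters,
    and no word may be reused anywhere in the response."""
    seen_words: set[str] = set()
    for sentence in sentences:
        if len(sentence.strip()) != char_count:
            return False
        words = sentence.lower().split()
        if len(set(words)) != len(words) or any(word in seen_words for word in words):
            return False  # a word was repeated within or across sentences
        seen_words.update(words)
    return True
```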
```python
for bullet in bullets[1:]:
    if "-" not in bullet:
        return False
return True
```
SubBulletPointsChecker passes when no bullets exist
Medium Severity
When the response contains no * characters, value.split('*') returns a single-element list, making bullets[1:] empty. The for loop never executes and the method returns True, incorrectly validating responses that have no bullet points at all despite the instruction requiring them.
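A small guard fixes this, as sketched below; splitting on `*` follows the excerpt.

```python
def has_sub_bullets(value: str) -> bool:
    """Sketch: require at least one '*' bullet before demanding that every
    bullet contain a '-' sub-bullet, so an empty bullets[1:] no longer passes."""
    bullets = value.split("*")
    if len(bullets) <= 1:
        return False  # no bullet points at all
    return all("-" in bullet for bullet in bullets[1:])
```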
dphuang2 left a comment:
can we move this out of rewards? I think that's an abandoned module.
also, how long does the eval take to run? Can we move it under tests? If it's too long, we can throw it under eval_protocol/benchmarks/ifeval
dphuang2 left a comment:
oh also I don't see an evaluation_test decorator anywhere. Can we make this an actual eval-protocol eval?
…in the docstring of the test_ifeval_benchmark function.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
else:
    self._n_end = n_end
self._description_pattern = (
    "Copy the span of words that lies between (and including) index {n_start} and {n_end}, the indices are character indices!")
```
Description says character indices but uses word indices
High Severity
RepeatSpanChecker has a mismatch between its description and implementation. The instruction description explicitly states "the indices are character indices!" but check_following validates using word indices via split()[self._n_start:self._n_end]. A model following the description literally would extract characters, but the checker expects words, causing correct responses to fail validation.
Additional Locations (1)
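One way to reconcile the two, sketched below with the excerpt's attribute names, is to keep the word-index behaviour of `check_following` and state it in the description. The slice follows the excerpt; the real class has more configuration and randomization.

```python
class RepeatSpanChecker:
    """Sketch only, not the PR's implementation."""

    def build_description(self, n_start, n_end, prompt_to_repeat):
        self._n_start, self._n_end = n_start, n_end
        self._prompt_to_repeat = prompt_to_repeat
        # Say "word index" so the instruction matches what the checker verifies.
        self._description_pattern = (
            "Copy the span of words that lies between word index "
            "{n_start} and {n_end} of the prompt; the indices are word indices."
        )
        return self._description_pattern.format(n_start=n_start, n_end=n_end)

    def check_following(self, value: str) -> bool:
        # Same word-based slice as the excerpt, now consistent with the description.
        span = " ".join(self._prompt_to_repeat.split()[self._n_start:self._n_end])
        return span.strip().lower() in value.strip().lower()
```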
…olloutProcessor for improved ground truth extraction and updating pyproject.toml to include new benchmark data files.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
        count_bullets += 1
    elif sentences:
        sentences = instructions_util.split_into_sentences(line.strip())
        count_sentences += len(sentences)
```
Variable reuse causes blank line failures
Low Severity
In SomeBulletPointsChecker.check_following, the sentences variable is initially a boolean (True) but gets overwritten with a list from split_into_sentences() at line 1483. If a blank line appears in the text section, split_into_sentences("") returns an empty list, making sentences falsy. The next non-blank text line then falls through to else: return False, incorrectly failing valid responses that contain blank lines between sentences.
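A sketch of the fix: give the per-line result its own name so it never overwrites the flag the real checker keeps in `sentences`. A naive regex split stands in for `instructions_util.split_into_sentences` so the example is self-contained; the real checker's branching is richer.

```python
import re

def count_bullets_and_sentences(value: str) -> tuple[int, int]:
    """Sketch: per-line sentences go into `line_sentences`, a separate
    variable, so an empty list from a blank line cannot poison any state."""
    count_bullets = 0
    count_sentences = 0
    for line in value.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are simply skipped
        if stripped.startswith("*"):
            count_bullets += 1
        else:
            # Stand-in for instructions_util.split_into_sentences(stripped).
            line_sentences = re.split(r"(?<=[.!?])\s+", stripped)
            count_sentences += len(line_sentences)
    return count_bullets, count_sentences
```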
```python
if not n_end:
    self._n_end = random.randint(self._n_start + 1, len(self._prompt_to_repeat.split()) - 1)
else:
    self._n_end = n_end
```
Falsy check ignores valid index zero values
Medium Severity
In RepeatSpanChecker.build_description, the conditions if not n_start: and if not n_end: at lines 2150 and 2154 treat 0 as falsy. If a caller provides n_start=0 (a valid word index meaning "start from the first word"), the code incorrectly ignores it and generates a random value instead. The checks should use if n_start is None: to distinguish between "not provided" and "provided as zero".
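A hypothetical helper (not the PR's code) shows the `is None` pattern:

```python
import random

def resolve_span_indices(prompt_to_repeat: str, n_start=None, n_end=None):
    """Sketch: `is None` keeps an explicitly passed 0 from being discarded,
    unlike the `if not n_start:` check in the excerpt."""
    num_words = len(prompt_to_repeat.split())
    if n_start is None:
        n_start = random.randint(0, num_words - 2)
    if n_end is None:
        n_end = random.randint(n_start + 1, num_words - 1)
    return n_start, n_end
```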
```python
else:
    self._prompt_to_repeat = prompt_to_repeat
if not n_start:
    self._n_start = random.randint(0, len(self._prompt_to_repeat.split()) - 2)
```
Short prompts cause ValueError in random range
Low Severity
In RepeatSpanChecker.build_description, when n_start is not provided and prompt_to_repeat contains only one word, len(self._prompt_to_repeat.split()) - 2 evaluates to -1. This causes random.randint(0, -1) to raise ValueError because the upper bound cannot be less than the lower bound. The code lacks validation that the prompt has sufficient words.
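A guard before the random draw avoids the crash (again a hypothetical helper, not the PR's code):

```python
import random

def pick_random_start(prompt_to_repeat: str) -> int:
    """Sketch: refuse prompts that cannot hold a two-word span, so
    random.randint never sees an upper bound below its lower bound."""
    num_words = len(prompt_to_repeat.split())
    if num_words < 2:
        raise ValueError(
            f"prompt_to_repeat needs at least 2 words to pick a span, got {num_words}"
        )
    return random.randint(0, num_words - 2)
```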
Note
Adds a self-contained IFEval benchmark module with IFBench OOD constraint support and public reward API.
- `eval_protocol/benchmarks/ifeval/` package with a README and `__init__` exposing `ifeval_partial_credit_reward`
- `ifbench_instructions.py` (extensive checkers), `ifbench_registry.py` (instruction IDs → classes), `ifbench_util.py` (NLTK setup, tokenization helpers)
- `data/ifbench_test_sample.jsonl`

Written by Cursor Bugbot for commit 05ceb80.