Skip to content

Conversation

@SandyYuan
Copy link
Collaborator

@SandyYuan SandyYuan commented Jan 16, 2026

  • 112 total constraints (54 IFEval/IFTrain + 58 IFBench OOD)
  • Self-contained module with no external repo dependencies
  • Partial credit scoring (fraction of constraints satisfied)
  • Automatic tag stripping for reasoning models

Note

Adds a self-contained IFEval benchmark module with IFBench OOD constraint support and public reward API.

  • New eval_protocol/benchmarks/ifeval/ package with README and __init__ exposing ifeval_partial_credit_reward
  • Adds IFBench implementations: ifbench_instructions.py (extensive checkers), ifbench_registry.py (instruction IDs → classes), ifbench_util.py (NLTK setup, tokenization helpers)
  • Includes sample dataset data/ifbench_test_sample.jsonl
  • NLTK resources auto-downloaded and local data path configured

Written by Cursor Bugbot for commit 05ceb80. This will update automatically on new commits. Configure here.

- 112 total constraints (54 IFEval/IFTrain + 58 IFBench OOD)
- Self-contained module with no external repo dependencies
- Partial credit scoring (fraction of constraints satisfied)
- Automatic <think> tag stripping for reasoning models
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 4 potential issues.

self._description_pattern = "Every {N}st of your response must be in Japanese."
if N % 10 == 2:
self._description_pattern = "Every {N}nd of your response must be in Japanese."
elif N % 10 == 3:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeError when N is None in NthWordJapaneseChecker

High Severity

The build_description method uses N % 10 directly on lines 947-951, but N can be None (the default parameter value). Although self._japanese_position is correctly assigned a random value when N is None (lines 942-944), the subsequent modulo operations use N instead of self._japanese_position, causing a TypeError: unsupported operand type(s) for %: 'NoneType' and 'int' when the method is called without providing N.

Fix in Cursor Fix in Web

lines = value.split('\n')
for line in lines:
if not line.strip():
lines.remove(line)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying list while iterating causes skipped elements

Medium Severity

In IndentStairsChecker.check_following, the code iterates over lines with a for loop while calling lines.remove(line) inside the loop. This is a classic Python antipattern - when an element is removed, subsequent elements shift down but the iterator advances, causing the next element to be skipped. This can lead to empty lines not being properly filtered out, causing incorrect validation results.

Fix in Cursor Fix in Web

for sentence in sentences:
if len(sentence.strip()) != char_count:
return False
return True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing word uniqueness check in CharacterCountUniqueWordsChecker

Medium Severity

The check_following method only verifies that sentences have equal character counts, but the instruction requires "using all different words." The word uniqueness constraint is completely missing from the implementation. Responses that reuse the same words across sentences will incorrectly pass validation as long as the character counts match.

Fix in Cursor Fix in Web

for bullet in bullets[1:]:
if "-" not in bullet:
return False
return True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubBulletPointsChecker passes when no bullets exist

Medium Severity

When the response contains no * characters, value.split('*') returns a single-element list, making bullets[1:] empty. The for loop never executes and the method returns True, incorrectly validating responses that have no bullet points at all despite the instruction requiring them.

Fix in Cursor Fix in Web

Copy link
Collaborator

@dphuang2 dphuang2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this out of rewards? I think thats an abandoned module.

also how long does the eval take to run, can we move it under tests? If its too long, we can throw it under eval_protocol/benchmarks/ifeval

Copy link
Collaborator

@dphuang2 dphuang2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh also I don't see an evaluation_test decorator anywhere. Can we make this an actual eval-protocol eval?

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

else:
self._n_end = n_end
self._description_pattern = (
"Copy the span of words that lies between (and including) index {n_start} and {n_end}, the indices are character indices!")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description says character indices but uses word indices

High Severity

RepeatSpanChecker has a mismatch between its description and implementation. The instruction description explicitly states "the indices are character indices!" but check_following validates using word indices via split()[self._n_start:self._n_end]. A model following the description literally would extract characters, but the checker expects words, causing correct responses to fail validation.

Additional Locations (1)

Fix in Cursor Fix in Web

…olloutProcessor for improved ground truth extraction and updating pyproject.toml to include new benchmark data files.
@SandyYuan SandyYuan requested a review from dphuang2 January 16, 2026 19:02
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

count_bullets += 1
elif sentences:
sentences = instructions_util.split_into_sentences(line.strip())
count_sentences += len(sentences)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable reuse causes blank line failures

Low Severity

In SomeBulletPointsChecker.check_following, the sentences variable is initially a boolean (True) but gets overwritten with a list from split_into_sentences() at line 1483. If a blank line appears in the text section, split_into_sentences("") returns an empty list, making sentences falsy. The next non-blank text line then falls through to else: return False, incorrectly failing valid responses that contain blank lines between sentences.

Fix in Cursor Fix in Web

if not n_end:
self._n_end = random.randint(self._n_start + 1, len(self._prompt_to_repeat.split()) - 1)
else:
self._n_end = n_end
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Falsy check ignores valid index zero values

Medium Severity

In RepeatSpanChecker.build_description, the conditions if not n_start: and if not n_end: at lines 2150 and 2154 treat 0 as falsy. If a caller provides n_start=0 (a valid word index meaning "start from the first word"), the code incorrectly ignores it and generates a random value instead. The checks should use if n_start is None: to distinguish between "not provided" and "provided as zero".

Fix in Cursor Fix in Web

else:
self._prompt_to_repeat = prompt_to_repeat
if not n_start:
self._n_start = random.randint(0, len(self._prompt_to_repeat.split()) - 2)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short prompts cause ValueError in random range

Low Severity

In RepeatSpanChecker.build_description, when n_start is not provided and prompt_to_repeat contains only one word, len(self._prompt_to_repeat.split()) - 2 evaluates to -1. This causes random.randint(0, -1) to raise ValueError because the upper bound cannot be less than the lower bound. The code lacks validation that the prompt has sufficient words.

Fix in Cursor Fix in Web

@SandyYuan SandyYuan changed the title Add IFEval reward function Add IFEval Jan 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants