Add IFEval #414
Conversation
- 112 total constraints (54 IFEval/IFTrain + 58 IFBench OOD)
- Self-contained module with no external repo dependencies
- Partial credit scoring (fraction of constraints satisfied)
- Automatic `<think>` tag stripping for reasoning models
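The scoring approach can be illustrated with a small sketch. This is not the PR's exact code; the `checkers` argument is a hypothetical list of instruction objects exposing `check_following`, as the module's checker classes do.

```python
import re

def ifeval_partial_credit(response: str, checkers) -> float:
    """Sketch of the partial-credit idea: strip a <think>...</think> block
    (for reasoning models), then return the fraction of constraint checkers
    satisfied by the remaining text."""
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if not checkers:
        return 0.0
    satisfied = sum(1 for checker in checkers if checker.check_following(cleaned))
    return satisfied / len(checkers)
```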
Cursor Bugbot has reviewed your changes and found 4 potential issues.
```python
    self._description_pattern = "Every {N}st of your response must be in Japanese."
if N % 10 == 2:
    self._description_pattern = "Every {N}nd of your response must be in Japanese."
elif N % 10 == 3:
```
TypeError when N is None in NthWordJapaneseChecker
High Severity
The build_description method uses N % 10 directly on lines 947-951, but N can be None (the default parameter value). Although self._japanese_position is correctly assigned a random value when N is None (lines 942-944), the subsequent modulo operations use N instead of self._japanese_position, causing a TypeError: unsupported operand type(s) for %: 'NoneType' and 'int' when the method is called without providing N.
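A straightforward fix, sketched below as a method of the checker, is to resolve the position once before any modulo branch runs. Attribute names follow the excerpt; the default random range is an assumption.

```python
import random

def build_description(self, *, N=None):
    # Resolve the position first so no branch ever sees a None value.
    if N is None:
        N = random.randint(2, 20)  # assumed default range
    self._japanese_position = N

    if N % 10 == 1:
        self._description_pattern = "Every {N}st of your response must be in Japanese."
    elif N % 10 == 2:
        self._description_pattern = "Every {N}nd of your response must be in Japanese."
    elif N % 10 == 3:
        self._description_pattern = "Every {N}rd of your response must be in Japanese."
    else:
        self._description_pattern = "Every {N}th of your response must be in Japanese."
    return self._description_pattern.format(N=N)
```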
```python
lines = value.split('\n')
for line in lines:
    if not line.strip():
        lines.remove(line)
```
Modifying list while iterating causes skipped elements
Medium Severity
In IndentStairsChecker.check_following, the code iterates over lines with a for loop while calling lines.remove(line) inside the loop. This is a classic Python antipattern - when an element is removed, subsequent elements shift down but the iterator advances, causing the next element to be skipped. This can lead to empty lines not being properly filtered out, causing incorrect validation results.
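A minimal fix, assuming `value` is the response text as in the excerpt, is to build a new filtered list rather than removing elements from the list being iterated:

```python
def non_blank_lines(value: str) -> list[str]:
    """Return the response's lines with blank lines removed.

    A list comprehension avoids the skip-on-remove pitfall of calling
    lines.remove() while iterating over the same list.
    """
    return [line for line in value.split("\n") if line.strip()]
```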
```python
for sentence in sentences:
    if len(sentence.strip()) != char_count:
        return False
return True
```
Missing word uniqueness check in CharacterCountUniqueWordsChecker
Medium Severity
The check_following method only verifies that sentences have equal character counts, but the instruction requires "using all different words." The word uniqueness constraint is completely missing from the implementation. Responses that reuse the same words across sentences will incorrectly pass validation as long as the character counts match.
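One way to add the missing constraint is to track the words seen so far, as sketched here with a naive whitespace split; the real checker's sentence splitting and attribute names may differ.

```python
def sentences_meet_constraint(sentences: list[str], char_count: int) -> bool:
    """Sketch: every sentence must have exactly `char_count` characters,
    and no word may be reused anywhere in the response."""
    seen_words: set[str] = set()
    for sentence in sentences:
        if len(sentence.strip()) != char_count:
            return False
        words = sentence.lower().split()
        if len(set(words)) != len(words) or any(word in seen_words for word in words):
            return False  # a word was repeated within or across sentences
        seen_words.update(words)
    return True
```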
```python
for bullet in bullets[1:]:
    if "-" not in bullet:
        return False
return True
```
SubBulletPointsChecker passes when no bullets exist
Medium Severity
When the response contains no * characters, value.split('*') returns a single-element list, making bullets[1:] empty. The for loop never executes and the method returns True, incorrectly validating responses that have no bullet points at all despite the instruction requiring them.
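A small guard fixes this, as sketched below; splitting on `*` follows the excerpt.

```python
def has_sub_bullets(value: str) -> bool:
    """Sketch: require at least one '*' bullet before demanding that every
    bullet contain a '-' sub-bullet, so an empty bullets[1:] no longer passes."""
    bullets = value.split("*")
    if len(bullets) <= 1:
        return False  # no bullet points at all
    return all("-" in bullet for bullet in bullets[1:])
```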
dphuang2 left a comment:
can we move this out of rewards? I think that's an abandoned module.
also, how long does the eval take to run? Can we move it under tests? If it's too long, we can throw it under eval_protocol/benchmarks/ifeval
dphuang2 left a comment:
oh also I don't see an evaluation_test decorator anywhere. Can we make this an actual eval-protocol eval?
…in the docstring of the test_ifeval_benchmark function.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
else:
    self._n_end = n_end
self._description_pattern = (
    "Copy the span of words that lies between (and including) index {n_start} and {n_end}, the indices are character indices!")
```
Description says character indices but uses word indices
High Severity
RepeatSpanChecker has a mismatch between its description and implementation. The instruction description explicitly states "the indices are character indices!" but check_following validates using word indices via split()[self._n_start:self._n_end]. A model following the description literally would extract characters, but the checker expects words, causing correct responses to fail validation.
Additional Locations (1)
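One way to reconcile the two, sketched below with the excerpt's attribute names, is to keep the word-index behaviour of `check_following` and state it in the description. The slice follows the excerpt; the real class has more configuration and randomization.

```python
class RepeatSpanChecker:
    """Sketch only, not the PR's implementation."""

    def build_description(self, n_start, n_end, prompt_to_repeat):
        self._n_start, self._n_end = n_start, n_end
        self._prompt_to_repeat = prompt_to_repeat
        # Say "word index" so the instruction matches what the checker verifies.
        self._description_pattern = (
            "Copy the span of words that lies between word index "
            "{n_start} and {n_end} of the prompt; the indices are word indices."
        )
        return self._description_pattern.format(n_start=n_start, n_end=n_end)

    def check_following(self, value: str) -> bool:
        # Same word-based slice as the excerpt, now consistent with the description.
        span = " ".join(self._prompt_to_repeat.split()[self._n_start:self._n_end])
        return span.strip().lower() in value.strip().lower()
```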
…olloutProcessor for improved ground truth extraction and updating pyproject.toml to include new benchmark data files.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
        count_bullets += 1
    elif sentences:
        sentences = instructions_util.split_into_sentences(line.strip())
        count_sentences += len(sentences)
```
Variable reuse causes blank line failures
Low Severity
In SomeBulletPointsChecker.check_following, the sentences variable is initially a boolean (True) but gets overwritten with a list from split_into_sentences() at line 1483. If a blank line appears in the text section, split_into_sentences("") returns an empty list, making sentences falsy. The next non-blank text line then falls through to else: return False, incorrectly failing valid responses that contain blank lines between sentences.
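A sketch of the fix: give the per-line result its own name so it never overwrites the flag the real checker keeps in `sentences`. A naive regex split stands in for `instructions_util.split_into_sentences` so the example is self-contained; the real checker's branching is richer.

```python
import re

def count_bullets_and_sentences(value: str) -> tuple[int, int]:
    """Sketch: per-line sentences go into `line_sentences`, a separate
    variable, so an empty list from a blank line cannot poison any state."""
    count_bullets = 0
    count_sentences = 0
    for line in value.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are simply skipped
        if stripped.startswith("*"):
            count_bullets += 1
        else:
            # Stand-in for instructions_util.split_into_sentences(stripped).
            line_sentences = re.split(r"(?<=[.!?])\s+", stripped)
            count_sentences += len(line_sentences)
    return count_bullets, count_sentences
```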
```python
if not n_end:
    self._n_end = random.randint(self._n_start + 1, len(self._prompt_to_repeat.split()) - 1)
else:
    self._n_end = n_end
```
Falsy check ignores valid index zero values
Medium Severity
In RepeatSpanChecker.build_description, the conditions if not n_start: and if not n_end: at lines 2150 and 2154 treat 0 as falsy. If a caller provides n_start=0 (a valid word index meaning "start from the first word"), the code incorrectly ignores it and generates a random value instead. The checks should use if n_start is None: to distinguish between "not provided" and "provided as zero".
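A hypothetical helper (not the PR's code) shows the `is None` pattern:

```python
import random

def resolve_span_indices(prompt_to_repeat: str, n_start=None, n_end=None):
    """Sketch: `is None` keeps an explicitly passed 0 from being discarded,
    unlike the `if not n_start:` check in the excerpt."""
    num_words = len(prompt_to_repeat.split())
    if n_start is None:
        n_start = random.randint(0, num_words - 2)
    if n_end is None:
        n_end = random.randint(n_start + 1, num_words - 1)
    return n_start, n_end
```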
```python
else:
    self._prompt_to_repeat = prompt_to_repeat
if not n_start:
    self._n_start = random.randint(0, len(self._prompt_to_repeat.split()) - 2)
```
Short prompts cause ValueError in random range
Low Severity
In RepeatSpanChecker.build_description, when n_start is not provided and prompt_to_repeat contains only one word, len(self._prompt_to_repeat.split()) - 2 evaluates to -1. This causes random.randint(0, -1) to raise ValueError because the upper bound cannot be less than the lower bound. The code lacks validation that the prompt has sufficient words.
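A guard before the random draw avoids the crash (again a hypothetical helper, not the PR's code):

```python
import random

def pick_random_start(prompt_to_repeat: str) -> int:
    """Sketch: refuse prompts that cannot hold a two-word span, so
    random.randint never sees an upper bound below its lower bound."""
    num_words = len(prompt_to_repeat.split())
    if num_words < 2:
        raise ValueError(
            f"prompt_to_repeat needs at least 2 words to pick a span, got {num_words}"
        )
    return random.randint(0, num_words - 2)
```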
Note
Adds a self-contained IFEval benchmark module with IFBench OOD constraint support and public reward API.
- `eval_protocol/benchmarks/ifeval/` package with a README and `__init__` exposing `ifeval_partial_credit_reward`
- `ifbench_instructions.py` (extensive checkers), `ifbench_registry.py` (instruction IDs → classes), `ifbench_util.py` (NLTK setup, tokenization helpers)
- `data/ifbench_test_sample.jsonl`

Written by Cursor Bugbot for commit 05ceb80.