A coding benchmark for evaluating LLM agents on multi-phase programming tasks. Tests three critical capabilities:
- Hidden requirement discovery — inferring undisclosed constraints from structured feedback
- Long-context retention — maintaining state and hypotheses across many iterations
- Iterative refinement — systematically improving solutions based on violation signals
```
                         FluxCodeBench Flow

┌─────────┐      ┌─────────────┐      ┌─────────┐      ┌───────────┐
│  Agent  │─────▶│ solution.py │─────▶│ Runner  │─────▶│ Evaluator │
└─────────┘      └─────────────┘      └─────────┘      └───────────┘
     ▲                                     │                 │
     │                                     │                 │
     │            ┌─────────────┐          │                 ▼
     └────────────│feedback.json│◀─────────┘          ┌─────────────┐
                  └─────────────┘                     │ Test Cases  │
                         │                            │  (hidden)   │
                         ▼                            └─────────────┘
                  ┌─────────────┐
                  │ Violations  │
                  │ + Coverage  │
                  └─────────────┘

 Phase 0      Phase 1      Phase 2       ...      Phase N
┌──────┐     ┌──────┐     ┌──────┐              ┌──────┐
│ Rule │     │ Rule │     │ Rule │              │ Rule │
│  A   │  +  │  A   │  +  │  A   │  +  ...  +   │  A   │
└──────┘     │  B   │     │  B   │              │  B   │
             └──────┘     │  C   │              │ ...  │
                          └──────┘              │  Z   │
                                                └──────┘
◀──────────────── Rules accumulate across phases ────────────────▶
```
The agent receives only minimal initial information (input/output types, basic problem description). The actual correctness constraints are not fully disclosed — the agent must infer them from structured feedback on failed attempts.
Key properties:
- Agent starts with incomplete specification
- Hidden constraints are revealed indirectly through violation feedback
- Each phase introduces new undisclosed requirements
- Success requires systematic exploration, not just code generation
Install the package:

```bash
pip install -e .
```

List available tasks:

```bash
flux-code-bench list --tasks-dir tasks
```

Validate a task definition:

```bash
flux-code-bench validate --task tasks/task_00_filter_numbers
```

Run a single evaluation attempt:

```bash
flux-code-bench run --task tasks/task_00_filter_numbers --workspace ./workspace --single
```

Run in interactive mode:

```bash
flux-code-bench run --task tasks/task_00_filter_numbers --workspace ./workspace
```

In interactive mode, the runner watches for changes to `workspace/solution.py` and evaluates each update.
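In interactive mode, an agent harness only needs to write `solution.py` and wait for the runner to refresh `feedback.json`. Below is a minimal sketch of that loop, assuming only the workspace layout described in the next section; it polls file mtimes rather than using a file-watching library, and `submit`/`wait_for_feedback` are illustrative names, not part of the benchmark:

```python
import json
import time
from pathlib import Path

WORKSPACE = Path("./workspace")

def wait_for_feedback(previous_mtime: float, timeout: float = 30.0) -> dict:
    """Poll feedback.json until the runner writes a newer version."""
    feedback_path = WORKSPACE / "feedback.json"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if feedback_path.exists() and feedback_path.stat().st_mtime > previous_mtime:
            return json.loads(feedback_path.read_text())
        time.sleep(0.2)
    raise TimeoutError("runner did not produce new feedback")

def submit(code: str) -> dict:
    """Write a candidate solution and block until it has been evaluated."""
    feedback_path = WORKSPACE / "feedback.json"
    before = feedback_path.stat().st_mtime if feedback_path.exists() else 0.0
    (WORKSPACE / "solution.py").write_text(code)
    return wait_for_feedback(before)
```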
When running a task, the runner creates a workspace directory with:
| File | Description |
|---|---|
| `problem.md` | Problem description (agent-visible) |
| `task.json` | Task metadata and limits |
| `phase.json` | Current phase info and rules |
| `solution.py` | Agent writes solution here |
| `feedback.json` | Evaluation feedback after each attempt |
| `report.json` | Final metrics report |
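For example, the currently active rule set can be read straight from `phase.json`. A small sketch, assuming `phase.json` mirrors the fields of a phase entry in the `task.yaml` example further below (this field layout is an assumption):

```python
import json
from pathlib import Path

# Assumes phase.json carries the same fields as a phase entry in
# task.yaml: id, description, and rules with id/description/scopes.
phase = json.loads(Path("workspace/phase.json").read_text())
print(f"Phase {phase['id']}: {phase['description']}")
for rule in phase["rules"]:
    print(f"  {rule['id']}: {rule['description']} (scopes: {rule['scopes']})")
```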
Each evaluation returns structured JSON feedback:
```json
{
  "phase_id": 1,
  "attempt_id": 5,
  "status": "partially_valid",
  "status_reason": "Fails checks: no_mutation",
  "violations": [
    {"rule_id": "no_mutation", "scope": "direct", "count": 2}
  ],
  "summary": {
    "rules_total": 3,
    "rules_passed": 2,
    "rules_failed": 1,
    "coverage": 0.85
  },
  "delta": {
    "coverage_change": 0.15,
    "new_failures": [],
    "fixed_failures": ["correct_output"]
  }
}
```
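An agent can drive its next attempt from these fields alone. A minimal triage sketch follows; the prioritization policy is illustrative, and it assumes a `valid` status complements the `partially_valid` shown above:

```python
import json
from pathlib import Path

def triage(feedback: dict) -> list[str]:
    """Pick which rules to target next, using only fields from the
    feedback schema shown above."""
    if feedback["status"] == "valid":  # assumed passing status value
        return []
    # Fix regressions introduced by the last attempt first,
    # then any other currently violated rules.
    targets = list(feedback["delta"]["new_failures"])
    targets += [v["rule_id"] for v in feedback["violations"]
                if v["rule_id"] not in targets]
    return targets

feedback = json.loads(Path("workspace/feedback.json").read_text())
print("next targets:", triage(feedback))
```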
Each task is a directory with:

```
tasks/task_00_filter_numbers/
├── task.yaml       # Task metadata, phases, rules
├── problem.md      # Agent-visible problem description
├── evaluator.py    # Evaluation logic (check_* methods)
└── tests.py        # Test cases (not agent-visible)
```
To create a new task:

1. Create a new directory under `tasks/`
2. Define `task.yaml` with phases and rules
3. Write `problem.md` (what the agent sees)
4. Implement `evaluator.py` with `check_{rule_id}` methods (sketched below)
5. Create `tests.py` with a `TEST_CASES` list
6. Validate with `flux-code-bench validate --task tasks/your_task`
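This README does not pin down the `check_*` signature or the exact shape of `TEST_CASES`, so the following is only a plausible sketch for the filter-numbers task. It assumes each check receives the solution function plus one test case, and that the task (per its real `problem.md`) keeps even numbers:

```python
# evaluator.py — hypothetical check_* shape; the real interface is
# defined by the benchmark, not shown in this README.

def check_correct_output(solution_fn, case) -> bool:
    """Rule correct_output: output matches the expected list."""
    return solution_fn(list(case["input"])) == case["expected"]

def check_no_mutation(solution_fn, case) -> bool:
    """Rule no_mutation: the input list must not be modified."""
    snapshot = list(case["input"])
    solution_fn(case["input"])
    return case["input"] == snapshot
```

```python
# tests.py — hypothetical TEST_CASES shape, keyed by the scopes
# declared in task.yaml ("basic", "zeros", "negatives", ...).
TEST_CASES = [
    {"scope": "basic",     "input": [1, 2, 3, 4], "expected": [2, 4]},
    {"scope": "zeros",     "input": [0, 0, 1],    "expected": [0, 0]},
    {"scope": "negatives", "input": [-2, -1, 3],  "expected": [-2]},
]
```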
id: "task_00_filter_numbers"
name: "Filter Numbers"
difficulty: "easy"
interface:
function_name: "filter_numbers"
signature: "def filter_numbers(numbers: list[int]) -> list[int]"
allowed_imports: []
execution:
timeout_seconds: 10
phases:
- id: 0
description: "Basic filtering"
rules:
- id: "correct_output"
description: "Output matches expected"
scopes: ["basic"]
- id: 1
description: "Handle edge cases"
rules:
- id: "correct_output"
description: "Output matches expected"
scopes: ["basic", "zeros", "negatives"]
- id: "no_mutation"
description: "Input must not be modified"
scopes: ["direct"]
limits:
max_attempts_per_phase: 5
max_total_attempts: 15| Tier | Phases | Description |
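Under this configuration, a phase 1 solution must satisfy `correct_output` and `no_mutation` simultaneously. For illustration (again assuming an even-number filter, which the real `problem.md` defines):

```python
def filter_numbers(numbers: list[int]) -> list[int]:
    # Build a new list instead of removing elements in place,
    # so the phase 1 no_mutation rule is also satisfied.
    return [n for n in numbers if n % 2 == 0]
```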
Difficulty tiers:

| Tier | Phases | Description |
|---|---|---|
| Easy | 3–5 | Basic transformations, simple rules |
| Medium | 6–15 | Moderate complexity, multiple interacting rules |
| Hard | 16–30 | Complex algorithms, many edge cases |
| Expert | 31–50 | Deep challenges, extensive hidden state |
License: MIT