FluxCodeBench: a coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
python benchmark machine-learning code-generation evaluation-framework ai-agents llm llm-evaluation agent-evaluation coding-benchmark
Updated Jan 18, 2026 - Python