The Complexity Trap: Multi-Model Evaluation


Fork of JetBrains Research's "The Complexity Trap" with multi-model evaluation infrastructure

Original README: README_upstream.md


What's New

This fork extends the original paper's experiments with:

  • Multi-provider LLM support via LiteLLM: GLM-4.7, Kimi-K2, MiniMax-M2.1, DeepSeek, AWS Bedrock (Qwen3), OpenRouter, Anthropic, OpenAI (see the sketch after this list)
  • WandB sweep orchestration with parallel agent support for systematic hyperparameter search
  • Query CLI (scripts/query.py) for quick results analysis and paper comparison
  • Dashboards: Streamlit web UI and Rich-based TUI for experiment monitoring
  • VPS deployment scripts for long-running sweeps on remote servers
  • Decoupled evaluation architecture to prevent Docker hangs from blocking sweeps
  • 902 evaluated instances across 4 context management strategies
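
To make the multi-provider routing concrete, here is a minimal LiteLLM sketch. The model string, prompt, and environment-variable name are illustrative assumptions, not this repo's code; the presets the fork actually defines are listed by scripts/run_model.py --list.

# Minimal multi-provider call via LiteLLM (sketch, not the repo's code).
# LiteLLM routes on the "provider/model" prefix and reads the matching
# API key from the environment (e.g. DEEPSEEK_API_KEY).
import litellm

response = litellm.completion(
    model="deepseek/deepseek-chat",  # illustrative; swap the prefix per provider
    messages=[{"role": "user", "content": "Explain this failing test: ..."}],
)
print(response.choices[0].message.content)

Switching providers is then just a different model prefix (e.g. an anthropic/ or bedrock/ string), which is what the preset table below maps to.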

Evaluation Results

902 instances evaluated on SWE-bench Verified with GLM-4.7, spanning the four context management strategies below

Strategy Comparison

| Strategy | Solve Rate | vs Raw | Cost/Instance | Cost Savings |
|---|---|---|---|---|
| raw | 64.0% | | $1.00 | baseline |
| observation_masking | 62.0% | -2.0% | $0.61 | -39% |
| llm_summary | 54.2% | -9.8% | $0.50 | -50% |
| hybrid | 56.0% | -8.0% | $0.32 | -68% |
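
Here "vs Raw" is the solve-rate change in percentage points relative to the raw baseline, and "Cost Savings" is the relative per-instance cost reduction; hybrid, for example, saves 1 - 0.32/1.00 = 68%.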

Key Finding

Context management hurts GLM-4.7 performance, the opposite of the paper's findings with qwen3-coder-480b. The raw baseline achieves the highest solve rate (64.0%); observation masking drops it by 2.0 percentage points (vs the paper's +1.4 gain) and LLM summarization drops it by 9.8 points (vs the paper's +0.4). Cost savings remain substantial at 39-68%.

This suggests context management strategies may be model-dependent rather than universally beneficial.

Paper Comparison

| Metric | Paper (qwen3-coder-480b) | Ours (GLM-4.7) |
|---|---|---|
| Raw solve rate | 53.4% | 64.0% |
| Best strategy | hybrid (+1.6%) | raw (baseline) |
| Cost reduction | 50-58% | 39-68% |
| Dataset size | 500 instances | 902 instances |

Quick Start

# Clone and setup
git clone https://github.com/youkad/the-complexity-trap.git
cd the-complexity-trap
uv venv .venv --python 3.12 --seed
source .venv/bin/activate
uv sync --extra dev

# Configure API keys
cp .env.example .env
# Edit .env with your API keys (ZHIPUAI_API_KEY, MOONSHOT_API_KEY, etc.)

# Run a quick test (5 instances)
python scripts/run_sweep.py \
  --model glm-4.7 \
  --strategy raw \
  --instances-slice :5
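
The --instances-slice argument appears to take slice-style notation: per the comment above, :5 selects the first five instances, which is handy as a sanity check before a full run.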

Query Results

# Quick status summary
python scripts/query.py summary
# Best: glm-4.7 raw @ 64.0% ($1.00) — 4 strategies, 902 instances

# Leaderboard (markdown for presentations)
python scripts/query.py --format markdown leaderboard

# Paper comparison
python scripts/query.py --model glm-4.7 paper-comparison

# Strategy breakdown
python scripts/query.py --model glm-4.7 compare-strategies

# Exit status analysis
python scripts/query.py failures

Available Models

| Preset | Provider | Description | Context |
|---|---|---|---|
| glm-4.7 | Z.AI | GLM-4.7 (agentic coding) | 131K |
| glm-4.6 | Z.AI | GLM-4.6 (355B MoE) | 200K |
| kimi-k2 | Moonshot | Kimi K2 (1T MoE, 32B active) | 262K |
| minimax-m2.1 | MiniMax | M2.1 (enhanced multilingual) | 205K |
| deepseek-chat | DeepSeek | DeepSeek V3 | 128K |
| bedrock-qwen3-coder-480b | AWS Bedrock | Qwen3 Coder 480B | 131K |
| bedrock-qwen3-32b | AWS Bedrock | Qwen3 32B | 32K |
| gpt-4o | OpenAI | GPT-4o | 128K |
| claude-sonnet-4.5 | Anthropic | Claude Sonnet 4.5 | 200K |

List all presets: python scripts/run_model.py --list


Strategies

| Strategy | Config | Description |
|---|---|---|
| raw | default_no_demo_raw.yaml | No context management (baseline) |
| observation_masking | default_no_demo_N=1_M=10.yaml | Keep last M=10 observations |
| llm_summary | default_no_demo_checkpoint_same_model_openhands_N=21_M=10.yaml | Summarize every N=21 turns |
| hybrid | default_no_demo_checkpoint_same_model_openhands_N=21_M=10_masking_M=10.yaml | Both strategies combined |
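
For intuition, here is a minimal sketch of the observation-masking idea, assuming an OpenAI-style message list where tool outputs carry role "tool". The message shape and placeholder text are assumptions for illustration, not this repo's implementation:

# Illustrative sketch of observation masking: keep the last M tool
# observations verbatim and replace older ones with a short placeholder.
# Message format and placeholder are assumptions, not the repo's code.
MASK_PLACEHOLDER = "[old observation elided]"

def mask_observations(messages, m=10):
    # Indices of observation messages (tool outputs), oldest first.
    obs_indices = [i for i, msg in enumerate(messages) if msg["role"] == "tool"]
    masked = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i in obs_indices[:-m]:
            # Older than the last m observations: mask the content.
            masked.append({**msg, "content": MASK_PLACEHOLDER})
        else:
            masked.append(msg)
    return masked

llm_summary instead replaces elided history with a model-written summary every N=21 turns, and hybrid applies both.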

WandB Sweeps

# Create and run a sweep
wandb sweep sweeps/smart_search.yaml
wandb agent <SWEEP_ID>

# Or use the convenience script
./scripts/vps_sweep.sh sweeps/smart_search.yaml --start

Sweep configs in sweeps/:

  • smart_search.yaml - Bayesian search across 6 models x 4 strategies
  • bedrock_repro.yaml - Paper reproduction on AWS Bedrock
  • quick_test.yaml - Sanity check (small slice)
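
If you prefer launching sweeps from Python rather than the CLI, the equivalent wandb API calls look roughly like this. The parameter grid and metric name below are illustrative assumptions modeled on smart_search.yaml, not its actual contents:

# Rough Python equivalent of `wandb sweep` + `wandb agent`.
# Parameter values and the metric name are illustrative assumptions.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "solve_rate", "goal": "maximize"},
    "parameters": {
        "model": {"values": ["glm-4.7", "kimi-k2", "deepseek-chat"]},
        "strategy": {"values": ["raw", "observation_masking", "llm_summary", "hybrid"]},
    },
}

def run_one():
    # Placeholder run function; a real agent would invoke
    # scripts/run_sweep.py with wandb.config.model / .strategy.
    with wandb.init() as run:
        run.log({"solve_rate": 0.0})

sweep_id = wandb.sweep(sweep_config, project="the-complexity-trap")
wandb.agent(sweep_id, function=run_one)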

Dashboards

# Web dashboard (Streamlit)
streamlit run scripts/dashboard.py

# Terminal dashboard (Rich) - for SSH/VPS
DASHBOARD_PROJECT=the-complexity-trap python scripts/dashboard_tui.py

Documentation

For the original SWE-agent documentation and the paper's methodology, see README_upstream.md.


Attribution

This is a fork of "The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management" by Lindenbauer et al. (JetBrains Research), presented at the Fourth Deep Learning for Code (DL4Code) workshop at NeurIPS 2025 and part of Tobias Lindenbauer's Master's Thesis.

@misc{lindenbauer2025complexitytrapsimpleobservation,
      title={The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management},
      author={Tobias Lindenbauer and Igor Slinko and Ludwig Felder and Egor Bogomolov and Yaroslav Zharov},
      year={2025},
      eprint={2508.21433},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2508.21433},
}

Paper: arXiv:2508.21433
Dataset: HuggingFace


License

MIT (same as upstream). See LICENSE.
