Python 3.12.3 | License: MIT | Research

MASBench

MASBench is a unified benchmark suite for systematically analyzing architectural design choices in LLM-based agent frameworks, spanning orchestration, memory, planning, specialization, and multi-agent coordination under controlled execution. By isolating framework-level effects from model capabilities and task complexity, it enables controlled comparison of single-agent and multi-agent architectures.

Resources: Website | Paper (coming soon)

Why MASBench?

Existing benchmarks primarily test isolated agent capabilities (reasoning, tool use, memory) without addressing how framework architecture governs performance and scalability. MASBench fills this gap by:

  • Providing controlled evaluation of architectural design decisions under fixed models and tasks
  • Isolating framework-level effects from model capabilities and task complexity
  • Enabling systematic comparison across orchestration patterns, memory architectures, and coordination mechanisms
  • Supporting reproducible analysis of scalability and resource utilization characteristics

How MASBench Works

  • Unified execution pipeline: Standardized interfaces normalize execution across diverse frameworks (see the adapter sketch after this list)
  • Standardized configuration & logging: Consistent measurement and artifact collection
  • Controlled architectural isolation: Framework behavior evaluated independently of model and task variations
  • Cost-aware backend routing: Abstracted LLM backends support efficient, framework-agnostic evaluation
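
To make the unified pipeline concrete, here is a minimal sketch of what a normalized adapter interface could look like in Python. The names (FrameworkAdapter, RunRecord, configure, run, evaluate) are illustrative assumptions, not MASBench's actual API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RunRecord:
    """Normalized per-run artifact (illustrative fields)."""
    task_id: str
    output: str
    tokens_used: int = 0
    wall_time_s: float = 0.0
    trace: list[dict[str, Any]] = field(default_factory=list)


class FrameworkAdapter(ABC):
    """Adapter each framework under evaluation would implement so the
    pipeline can drive it with the same configuration, backend, and logging."""

    @abstractmethod
    def configure(self, config: dict[str, Any]) -> None:
        """Apply a standardized configuration (backend, limits, seeds)."""

    @abstractmethod
    def run(self, task: dict[str, Any]) -> RunRecord:
        """Execute one benchmark task and return a normalized run record."""


def evaluate(adapter: FrameworkAdapter, tasks: list[dict[str, Any]],
             config: dict[str, Any]) -> list[RunRecord]:
    """Drive any framework through the same controlled loop."""
    adapter.configure(config)
    return [adapter.run(task) for task in tasks]
```

A per-framework adapter along these lines is what allows configuration, logging, and backend selection to stay identical while only the framework under test changes.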

Architectural Taxonomy

MASBench organizes frameworks along three primary paradigms:

  • Graph-based orchestration: Workflows modeled as directed graphs with nodes representing computational steps and edges defining control flow (illustrated in the sketch after this list)
  • Role-based agent systems: Agents structured around specialized roles with coordination mechanisms routing tasks based on role assignments
  • Environment/simulation-mediated systems: Agents situated within shared environments where interaction occurs through state and action interfaces
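
For concreteness, the following toy example illustrates the graph-based paradigm: nodes are computational steps over a shared state and edges define control flow. It is a hand-rolled sketch for illustration only, not code from MASBench or any particular framework:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """A computational step: receives the shared state dict, returns updates."""
    name: str
    fn: Callable[[dict], dict]


@dataclass
class Graph:
    """A tiny directed workflow graph; edges define control flow between steps."""
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: dict[str, str] = field(default_factory=dict)  # node -> next node

    def add(self, node: Node, next_name: str | None = None) -> None:
        self.nodes[node.name] = node
        if next_name is not None:
            self.edges[node.name] = next_name

    def run(self, start: str, state: dict) -> dict:
        current = start
        while current is not None:
            state |= self.nodes[current].fn(state)
            current = self.edges.get(current)
        return state


# Example control flow: plan -> act -> summarize over shared state.
g = Graph()
g.add(Node("plan", lambda s: {"plan": f"steps for {s['task']}"}), "act")
g.add(Node("act", lambda s: {"result": f"executed {s['plan']}"}), "summarize")
g.add(Node("summarize", lambda s: {"summary": s["result"].upper()}))
print(g.run("plan", {"task": "demo"}))
```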

The suite evaluates key architectural dimensions (an illustrative experiment configuration follows the list):

  • Orchestration & control flow: How frameworks structure task execution and manage dependencies
  • Memory architecture: Long-term retention, learning, and forgetting mechanisms
  • Planning interfaces: Multi-step reasoning under framework constraints
  • Specialization mechanisms: Role assignment, task routing, and capability distribution
  • Communication topology & coordination: Information flow patterns, coordination mechanisms, and topology-induced interaction patterns in multi-agent settings
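
To show how the controlled-evaluation setup plays out across these dimensions, here is a hypothetical experiment specification in which the model and task set are held fixed while the framework paradigm varies. All keys and values are illustrative assumptions, not MASBench configuration options:

```python
# Hypothetical experiment grid: the backbone model and task suite stay fixed
# while one architectural dimension (here, the orchestration paradigm) varies,
# so metric differences can be attributed to the framework rather than to
# model capability or task mix.
experiment = {
    "model": "gpt-4o-mini",          # fixed backbone LLM
    "tasks": "planning_suite_v1",    # fixed task set
    "frameworks": [                  # varied: orchestration paradigm
        {"name": "graph_based", "paradigm": "graph"},
        {"name": "role_based", "paradigm": "roles"},
        {"name": "environment_mediated", "paradigm": "environment"},
    ],
    "metrics": ["accuracy", "tokens", "wall_time", "coordination_overhead"],
    "seeds": [0, 1, 2],              # repeated runs for variance estimates
}
```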

Benchmark Modules

Single-Agent Evaluation

  • Memory — Long-term retention, learning, forgetting
  • Planning — Multi-step reasoning under interface constraints
  • Specialization — Role assignment and capability distribution
  • Framework Overhead — Orchestration and execution efficiency (a possible overhead metric is sketched after this list)
  • Tool Use — Architectural integration patterns
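
As one example of what such a measurement might look like, the sketch below defines framework overhead as the share of wall-clock time spent outside LLM calls. This definition assumes per-run wall time and cumulative LLM call time are both recorded; the metric MASBench actually reports may differ:

```python
def framework_overhead(wall_time_s: float, llm_time_s: float) -> float:
    """Fraction of execution time attributable to orchestration, not the model."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return max(0.0, wall_time_s - llm_time_s) / wall_time_s


print(framework_overhead(wall_time_s=12.4, llm_time_s=9.1))  # ~0.27
```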

Multi-Agent Evaluation

  • Communication Topology & Coordination — Information flow patterns, coordination mechanisms, and topology-induced interaction patterns across agents

Reproducibility & Artifacts

MASBench enforces reproducibility through:

  • Fixed Python version (3.12.3) and pinned dependencies (requirements.lock)
  • Unified execution pipeline with standardized configuration and logging
  • Backend abstraction supporting cost-aware, framework-agnostic evaluation
  • Experimental results preserved in results/ for transparency

Analysis and interpretation are documented within individual experiment directories and associated publications.

Citation

If you use MASBench in academic work, please cite:

@article{orogat2026mafbench,
  title={Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis},
  author={Orogat, Abdelghny and Rostam, Ana and Mansour, Essam},
  journal={arXiv preprint arXiv:submit/7225627},
  year={2026}
}

Contact

Abdelghny Orogat — Concordia University
Email: Abdelghny.Orogat@concordia.ca
