Python 3.12.3 | License: MIT | Research

MASBench

MASBench is a unified benchmark suite for systematically analyzing architectural design choices in LLM-based agent frameworks, spanning orchestration, memory, planning, specialization, and multi-agent coordination under controlled execution. By isolating framework-level effects from model capabilities and task complexity, it enables controlled comparison of single-agent and multi-agent architectures.

Resources: Website | Paper (coming soon)

Why MASBench?

Existing benchmarks primarily test isolated agent capabilities (reasoning, tool use, memory) without addressing how framework architecture governs performance and scalability. MASBench fills this gap by:

  • Providing controlled evaluation of architectural design decisions under fixed models and tasks
  • Isolating framework-level effects from model capabilities and task complexity
  • Enabling systematic comparison across orchestration patterns, memory architectures, and coordination mechanisms
  • Supporting reproducible analysis of scalability and resource utilization characteristics

How MASBench Works

  • Unified execution pipeline: Standardized interfaces normalize execution across diverse frameworks (see the adapter sketch after this list)
  • Standardized configuration & logging: Consistent measurement and artifact collection
  • Controlled architectural isolation: Framework behavior evaluated independently of model and task variations
  • Cost-aware backend routing: Abstracted LLM backends support efficient, framework-agnostic evaluation
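
To make the unified pipeline concrete, here is a minimal sketch of what a normalized adapter interface could look like in Python. The names (FrameworkAdapter, RunRecord, configure, run, evaluate) are illustrative assumptions, not MASBench's actual API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RunRecord:
    """Normalized per-run artifact (illustrative fields)."""
    task_id: str
    output: str
    tokens_used: int = 0
    wall_time_s: float = 0.0
    trace: list[dict[str, Any]] = field(default_factory=list)


class FrameworkAdapter(ABC):
    """Adapter each framework under evaluation would implement so the
    pipeline can drive it with the same configuration, backend, and logging."""

    @abstractmethod
    def configure(self, config: dict[str, Any]) -> None:
        """Apply a standardized configuration (backend, limits, seeds)."""

    @abstractmethod
    def run(self, task: dict[str, Any]) -> RunRecord:
        """Execute one benchmark task and return a normalized run record."""


def evaluate(adapter: FrameworkAdapter, tasks: list[dict[str, Any]],
             config: dict[str, Any]) -> list[RunRecord]:
    """Drive any framework through the same controlled loop."""
    adapter.configure(config)
    return [adapter.run(task) for task in tasks]
```

A per-framework adapter along these lines is what allows configuration, logging, and backend selection to stay identical while only the framework under test changes.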

Architectural Taxonomy

MASBench organizes frameworks along three primary paradigms:

  • Graph-based orchestration: Workflows modeled as directed graphs with nodes representing computational steps and edges defining control flow (illustrated in the sketch after this list)
  • Role-based agent systems: Agents structured around specialized roles with coordination mechanisms routing tasks based on role assignments
  • Environment/simulation-mediated systems: Agents situated within shared environments where interaction occurs through state and action interfaces
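
For concreteness, the following toy example illustrates the graph-based paradigm: nodes are computational steps over a shared state and edges define control flow. It is a hand-rolled sketch for illustration only, not code from MASBench or any particular framework:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """A computational step: receives the shared state dict, returns updates."""
    name: str
    fn: Callable[[dict], dict]


@dataclass
class Graph:
    """A tiny directed workflow graph; edges define control flow between steps."""
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: dict[str, str] = field(default_factory=dict)  # node -> next node

    def add(self, node: Node, next_name: str | None = None) -> None:
        self.nodes[node.name] = node
        if next_name is not None:
            self.edges[node.name] = next_name

    def run(self, start: str, state: dict) -> dict:
        current = start
        while current is not None:
            state |= self.nodes[current].fn(state)
            current = self.edges.get(current)
        return state


# Example control flow: plan -> act -> summarize over shared state.
g = Graph()
g.add(Node("plan", lambda s: {"plan": f"steps for {s['task']}"}), "act")
g.add(Node("act", lambda s: {"result": f"executed {s['plan']}"}), "summarize")
g.add(Node("summarize", lambda s: {"summary": s["result"].upper()}))
print(g.run("plan", {"task": "demo"}))
```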

The suite evaluates key architectural dimensions (an illustrative experiment configuration follows the list):

  • Orchestration & control flow: How frameworks structure task execution and manage dependencies
  • Memory architecture: Long-term retention, learning, and forgetting mechanisms
  • Planning interfaces: Multi-step reasoning under framework constraints
  • Specialization mechanisms: Role assignment, task routing, and capability distribution
  • Communication topology & coordination: Information flow patterns, coordination mechanisms, and topology-induced interaction patterns in multi-agent settings
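
To show how the controlled-evaluation setup plays out across these dimensions, here is a hypothetical experiment specification in which the model and task set are held fixed while the framework paradigm varies. All keys and values are illustrative assumptions, not MASBench configuration options:

```python
# Hypothetical experiment grid: the backbone model and task suite stay fixed
# while one architectural dimension (here, the orchestration paradigm) varies,
# so metric differences can be attributed to the framework rather than to
# model capability or task mix.
experiment = {
    "model": "gpt-4o-mini",          # fixed backbone LLM
    "tasks": "planning_suite_v1",    # fixed task set
    "frameworks": [                  # varied: orchestration paradigm
        {"name": "graph_based", "paradigm": "graph"},
        {"name": "role_based", "paradigm": "roles"},
        {"name": "environment_mediated", "paradigm": "environment"},
    ],
    "metrics": ["accuracy", "tokens", "wall_time", "coordination_overhead"],
    "seeds": [0, 1, 2],              # repeated runs for variance estimates
}
```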

Benchmark Modules

Single-Agent Evaluation

  • Memory — Long-term retention, learning, forgetting
  • Planning — Multi-step reasoning under interface constraints
  • Specialization — Role assignment and capability distribution
  • Framework Overhead — Orchestration and execution efficiency (a possible overhead metric is sketched after this list)
  • Tool Use — Architectural integration patterns
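
As one example of what such a measurement might look like, the sketch below defines framework overhead as the share of wall-clock time spent outside LLM calls. This definition assumes per-run wall time and cumulative LLM call time are both recorded; the metric MASBench actually reports may differ:

```python
def framework_overhead(wall_time_s: float, llm_time_s: float) -> float:
    """Fraction of execution time attributable to orchestration, not the model."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return max(0.0, wall_time_s - llm_time_s) / wall_time_s


print(framework_overhead(wall_time_s=12.4, llm_time_s=9.1))  # ~0.27
```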

Multi-Agent Evaluation

  • Communication Topology & Coordination — Information flow patterns, coordination mechanisms, and topology-induced interaction patterns across agents

Reproducibility & Artifacts

MASBench enforces reproducibility through:

  • Fixed Python version (3.12.3) and pinned dependencies (requirements.lock)
  • Unified execution pipeline with standardized configuration and logging
  • Backend abstraction supporting cost-aware, framework-agnostic evaluation
  • Experimental results preserved in results/ for transparency

Analysis and interpretation are documented within individual experiment directories and associated publications.

Citation

If you use MASBench in academic work, please cite:

@article{orogat2026mafbench,
  title={Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis},
  author={Orogat, Abdelghny and Rostam, Ana and Mansour, Essam},
  journal={arXiv preprint arXiv:submit/7225627},
  year={2026}
}

Contact

Abdelghny Orogat — Concordia University
Email: Abdelghny.Orogat@concordia.ca
