Evaluation of LLM Oracles for Robotic Task Decomposition

Introduction

A Multi-Robot Task Planner with LLM Oracles

This repository contains the code for our experiments analyzing and evaluating LLM Oracles for multi-robot task planning.

Prerequisites

  • Python 3.9+ with a virtual environment (recommended).
  • An OpenRouter API key (set in the environment as OPENROUTER_API_KEY).
  • Tested on Ubuntu 24.04.3 LTS.

Example environment setup:

  • Create and activate a virtual environment.
  • Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

  • Export your OpenRouter key in the shell before running any scripts.
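
For reference, here is a minimal sketch of how a script can pick up the exported key and call a model through OpenRouter's OpenAI-compatible endpoint. The client library, model name, and prompt below are illustrative assumptions, not necessarily what the benchmark scripts use.

import os
from openai import OpenAI  # pip install openai

# Reads the key exported in the shell, e.g. `export OPENROUTER_API_KEY=...`
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Placeholder model and prompt, for illustration only.
response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Decompose this task into robot subtasks: tidy the lab bench."}],
)
print(response.choices[0].message.content)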

Data Layout

  • Task definitions are stored in tasks.csv. This file contains 30 tasks and their decompositions.
  • The list of robots and their skills is stored in PROMPT.txt.
  • Model lists (for batch runs) can be kept in models.txt.
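
As an illustration, the three files can be loaded along these lines (a sketch only; no particular column layout is assumed for tasks.csv, and models.txt is assumed to hold one identifier per line):

import pandas as pd

tasks = pd.read_csv("tasks.csv")    # 30 tasks and their decompositions
prompt = open("PROMPT.txt").read()  # robot list and skills used in the prompt
models = [m.strip() for m in open("models.txt") if m.strip()]  # assumed: one model per line

print(f"{len(tasks)} tasks, {len(models)} models")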

How to Use the Python Scripts

Purpose: Run the main benchmark pipeline across tasks and models.

What it does:

  • Loads tasks from tasks.csv.
  • Uses prompts from PROMPT.txt.
  • Queries models (via OpenRouter) and stores outputs in model-specific CSV files.

How to store outputs:

  • Place results in a dedicated folder (e.g., outputs/, outputs_malicious/, or a custom run folder such as T1-claude/).
  • Each model produces a file like llm_<model>.csv in the chosen output folder.

How to generate outputs:

  • Run the script with your configured environment and OpenRouter key.
  • Verify that the output folder contains new llm_*.csv files after the run completes.
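
The overall flow looks roughly like the sketch below. Everything here is illustrative: the query_model() helper, the output columns, and the use of outputs/ are assumptions standing in for the actual script.

import csv, os, time

def query_model(model: str, prompt: str, task: str) -> str:
    # Hypothetical wrapper around the OpenRouter call shown in the Prerequisites section.
    return "step 1; step 2; step 3"

prompt = open("PROMPT.txt").read()
models = [m.strip() for m in open("models.txt") if m.strip()]
with open("tasks.csv", newline="") as f:
    tasks = list(csv.reader(f))[1:]  # skip header; column layout not assumed here

os.makedirs("outputs", exist_ok=True)
for model in models:
    out_path = os.path.join("outputs", f"llm_{model.replace('/', '_')}.csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["task_id", "response", "latency_s"])  # illustrative columns
        for task_id, task_row in enumerate(tasks, start=1):
            start = time.time()
            answer = query_model(model, prompt, ", ".join(task_row))
            writer.writerow([task_id, answer, round(time.time() - start, 2)])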

Purpose: Compute similarity metrics between model outputs and ground-truth decompositions.

What it does:

  • Reads the model output CSVs (e.g., from outputs/).
  • Computes similarity scores and writes a summary CSV such as similarity.csv.

How to store outputs:

  • Place similarity summaries in the same folder as the model outputs (e.g., outputs/) or in a dedicated evaluation folder.

How to generate outputs:

  • Run after you have model output CSVs.
  • Confirm that similarity.csv (or a similarly named file) appears in the output folder.
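
One plausible way to score outputs, sketched below, is a plain string-similarity ratio between each response and its ground-truth decomposition; the repository's actual metric (for example, an embedding-based score) may differ, and the file and column names here are assumptions.

import csv
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # String-similarity ratio in [0, 1]; the repository may use a different metric.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# File and column names below ("decomposition", "response") are assumptions for illustration.
model = "example-model"
with open("tasks.csv", newline="") as f:
    truth = [row["decomposition"] for row in csv.DictReader(f)]
with open(f"outputs/llm_{model}.csv", newline="") as f:
    responses = [row["response"] for row in csv.DictReader(f)]

with open("outputs/similarity.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "task_id", "score"])
    for i, (gt, resp) in enumerate(zip(truth, responses), start=1):
        writer.writerow([model, i, round(similarity(gt, resp), 3)])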

Purpose: Compute pairwise similarity between model responses or task outputs.

What it does:

  • Reads model output CSVs for each task.
  • Produces per-task pairwise similarity tables (TSV/CSV).

How to store outputs:

  • Place per-task pairwise similarity tables alongside the model outputs (e.g., in outputs/) or in a dedicated evaluation folder.

How to generate outputs:

  • Run after model outputs are available.
  • Verify that per-task files (e.g., task_1_pairwise_similarity.tsv) are created.
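
A sketch of building one per-task table across models follows; the glob pattern, column names, and TSV layout are assumptions consistent with the earlier sketches, not the repository's guaranteed format.

import csv, glob
from difflib import SequenceMatcher

task_id = 1
answers = {}
# Collect each model's response for this task (assumed "task_id"/"response" columns).
for path in glob.glob("outputs/llm_*.csv"):
    model = path.split("llm_", 1)[1].removesuffix(".csv")
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["task_id"]) == task_id:
                answers[model] = row["response"]

models = sorted(answers)
with open(f"outputs/task_{task_id}_pairwise_similarity.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow([""] + models)
    for a in models:
        ratios = [SequenceMatcher(None, answers[a], answers[b]).ratio() for b in models]
        writer.writerow([a] + [round(x, 3) for x in ratios])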

Purpose: Generate accuracy plots from similarity summaries.

How to store outputs:

  • Store figures in figures/ or in a per-run figures subfolder.

How to generate outputs:

  • Run after similarity summaries exist.
  • Check for new image files in figures/.
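
For example, a per-model accuracy bar chart could be drawn as below, assuming a summary with one (model, task_id, score) row per response as in the similarity sketch above; adjust the column names to the real similarity.csv.

import csv, os
from collections import defaultdict
import matplotlib.pyplot as plt

scores = defaultdict(list)
with open("outputs/similarity.csv", newline="") as f:
    for row in csv.DictReader(f):
        scores[row["model"]].append(float(row["score"]))

models = sorted(scores)
means = [sum(scores[m]) / len(scores[m]) for m in models]

os.makedirs("figures", exist_ok=True)
plt.figure(figsize=(8, 4))
plt.bar(models, means)
plt.ylabel("Mean similarity to ground truth")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("figures/accuracy.png", dpi=200)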

Purpose: Generate latency plots from benchmark results.

How to store outputs:

  • Store figures in figures/ or in a per-run figures subfolder.

How to generate outputs:

  • Run after benchmark outputs exist with timing data.
  • Check for new image files in figures/.
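
A sketch under the assumption that each llm_<model>.csv records a per-task latency column (called latency_s in the earlier sketch):

import csv, glob, os
import matplotlib.pyplot as plt

labels, latencies = [], []
for path in sorted(glob.glob("outputs/llm_*.csv")):
    with open(path, newline="") as f:
        vals = [float(row["latency_s"]) for row in csv.DictReader(f)]
    labels.append(os.path.basename(path).removeprefix("llm_").removesuffix(".csv"))
    latencies.append(vals)

os.makedirs("figures", exist_ok=True)
plt.figure(figsize=(8, 4))
plt.boxplot(latencies)
plt.xticks(range(1, len(labels) + 1), labels, rotation=45, ha="right")
plt.ylabel("Response latency (s)")
plt.tight_layout()
plt.savefig("figures/latency.png", dpi=200)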

Purpose: Generate reputation or quality plots for model outputs.

How to store outputs:

  • Store figures in figures/ or in a per-run figures subfolder.

How to generate outputs:

  • Run after similarity summaries exist.
  • Check for new image files in figures/.
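
This README does not spell out how "reputation" is computed; purely for illustration, the sketch below plots a running mean of each model's similarity scores across tasks, using the same assumed similarity.csv layout as above.

import csv, os
from collections import defaultdict
import matplotlib.pyplot as plt

scores = defaultdict(list)
with open("outputs/similarity.csv", newline="") as f:
    for row in csv.DictReader(f):
        scores[row["model"]].append((int(row["task_id"]), float(row["score"])))

os.makedirs("figures", exist_ok=True)
plt.figure(figsize=(8, 4))
for model, pairs in sorted(scores.items()):
    ordered = [s for _, s in sorted(pairs)]
    running = [sum(ordered[: i + 1]) / (i + 1) for i in range(len(ordered))]
    plt.plot(range(1, len(running) + 1), running, label=model)
plt.xlabel("Task index")
plt.ylabel("Running mean similarity (illustrative proxy)")
plt.legend()
plt.tight_layout()
plt.savefig("figures/reputation.png", dpi=200)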

Purpose: Create per-task pairwise similarity heatmaps.

How to store outputs:

  • Store heatmap images in figures/ or a per-run figures folder.

How to generate outputs:

  • Run after pairwise similarity TSV/CSV files exist.
  • Confirm heatmap images are generated.
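
A sketch of rendering one per-task pairwise TSV (as produced in the earlier sketch) as a heatmap; the styling, colormap, and file names are illustrative.

import csv, os
import matplotlib.pyplot as plt

# First row and column hold model names; the body holds pairwise similarity values.
with open("outputs/task_1_pairwise_similarity.tsv", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
models = rows[0][1:]
matrix = [[float(x) for x in r[1:]] for r in rows[1:]]

os.makedirs("figures", exist_ok=True)
plt.figure(figsize=(6, 5))
plt.imshow(matrix, vmin=0.0, vmax=1.0, cmap="viridis")
plt.colorbar(label="Pairwise similarity")
plt.xticks(range(len(models)), models, rotation=90)
plt.yticks(range(len(models)), models)
plt.title("Task 1 pairwise similarity")
plt.tight_layout()
plt.savefig("figures/task_1_pairwise_heatmap.png", dpi=200)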

SkillChain-RTC benchmark

The task catalog is in tasks.csv and contains 30 tasks with their decompositions. The list of robots and their skills is also available in PROMPT.txt.
