starbench/
├── starbench_data/ # default data download directory
├── starbench_tasks/ # default task metadata download directory
├── starbench/ # Python evaluation package
├── scripts/ # Setup + data helper scripts
│ ├── install.sh
│ ├── install_virtualhome.sh
│ └── download.sh
├── virtualhome/ # VirtualHome dependency
├── starbench_example.py # Minimal example
├── pyproject.toml # Python build + dependencies
└── README.md
- Python (3.10+ recommended)
- (Optional) Docker — only required if you want to run VirtualHome through the provided installation script.
git clone https://github.com/ut-amrl/STARBench.git
cd STARBench
git submodule update --init --recursive
Option A: quick installation for everything
bash scripts/install.sh
This script also installs VirtualHome through Docker. If you prefer to install VirtualHome another way, see the details on their webpage.
Option B: install STARBench only (no VirtualHome)
bash scripts/install_starbench.sh
bash scripts/download_data.sh
This script downloads data to starbench_data and starbench_tasks by default.
To start the simulation in docker:
cd virtualhome
podman run --name virtualhome_container \
--mount type=bind,source="$(pwd)"/unity_vol,target=/unity_vol/ \
--mount type=bind,source="$(pwd)"/unity_output,target=/Output/ \
-p 8080:8080 -it virtualhome
If you see the error Error: rootlessport listen tcp 0.0.0.0:8080: bind: address already in use, run:
lsof -i :8080
You'll see output like:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
your_app 12345 user ... TCP ... 0 LISTEN 0.0.0.0:8080
Kill this process by running:
kill -9 <PID>
and restart the container.
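Once the container is running, you can sanity-check that the simulator is reachable from Python. The snippet below is a minimal sketch that assumes the stock UnityCommunication client shipped with VirtualHome; the exact import path depends on how VirtualHome is installed on your system.

# Sketch: confirm the VirtualHome simulator is listening on the mapped port.
# Assumes the standard UnityCommunication client from the virtualhome repo;
# adjust the import path to match your installation.
from virtualhome.simulation.unity_simulator.comm_unity import UnityCommunication

comm = UnityCommunication(url="127.0.0.1", port="8080")
comm.reset(0)                                  # load the first environment
success, graph = comm.environment_graph()      # (bool, scene-graph dict)
print("simulator reachable:", success, "objects in scene:", len(graph["nodes"]))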
Check out the example.py script for details.
Plug in your algorithm (replace BaseRobot):
example.py uses a BaseRobot(actions=...) placeholder. Replace it with your own robot implementation.
Your agent is expected to call the following primitive actions (provided in starbench.action_utils):
- navigate_then_observe
- detect
- pick
- open
During evaluation, STARBench traces actions via task_context(..., sink=robot.on_action, stop_after={"pick"}), and the episode terminates after pick is attempted.
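As a reference, here is a minimal sketch of what a replacement for BaseRobot could look like. Only the action names and the starbench.action_utils module come from this README; the call signatures, the on_action arguments, and the search loop are illustrative assumptions, so check example.py for the real interfaces.

# Minimal sketch of a custom robot replacing the BaseRobot placeholder.
# The argument lists below are assumptions for illustration only; consult
# starbench.action_utils and example.py for the actual signatures.
from starbench import action_utils

class MyRobot:
    def __init__(self, actions):
        # Mirrors the BaseRobot(actions=...) placeholder in example.py.
        self.actions = actions
        self.picked = False

    def on_action(self, action, result=None):
        # Sink passed to task_context(..., sink=robot.on_action); the
        # episode terminates after "pick" is attempted.
        if action == "pick":
            self.picked = True

    def run_episode(self, target_object):
        # A naive search loop: move, observe, detect, then try to pick.
        # action_utils.open(...) can also be called to open containers.
        while not self.picked:
            observation = action_utils.navigate_then_observe(self.next_waypoint())
            if action_utils.detect(observation, target_object):
                action_utils.pick(target_object)
                break

    def next_waypoint(self):
        # Hypothetical helper: plug in your own exploration strategy here.
        raise NotImplementedError

However you structure the class, the key contract is that task_context can call robot.on_action for each traced primitive action.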
To start the evaluation, run the following in another terminal:
python example.py \
--agent-name <NAME_OF_ALGORITHM> \
--benchmark-dir <PATH_TO_BENCHMARK_DIR such as starbench_tasks> \
--data-dir <PATH_TO_DATA_DIR such as starbench_data> \
--task-file <PATH_TO_TASK_SUMMARY_CSV such as starbench_tasks/tasks_summary.csv> \
--output-dir <PATH_TO_OUTPUT_DIR> \
--port 18080
@misc{chen2025searchingspacetimeunified,
title={Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval},
author={Taijing Chen and Sateesh Kumar and Junhong Xu and George Pavlakos and Joydeep Biswas and Roberto Martín-Martín},
year={2025},
eprint={2511.14004},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.14004},
}
