RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

1University of Massachusetts Amherst, 2Princeton University, 3Stanford University, 4Carnegie Mellon University

Abstract

The ability to reason, adapt, and creatively solve problems when faced with unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bimanual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality, reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curate 30 diverse seed tasks and 200 mutated tasks with graded difficulty across geometry-, material-, and assembly-based reasoning. We benchmark popular robot policy models, pre-trained vision-language-action models (VLAs), and oracle state-based planners in both single-task fine-tuning and multi-task learning settings. Our results show that while pre-trained VLAs achieve some preliminary success on the seed tasks under single-task fine-tuning, they struggle on mutated tasks, indicating brittleness in manipulation tasks that require non-trivial reasoning, strategy adaptation, and robustness to deceptive or constrained environments.

Task Generation

Our automated task generation pipeline consists of two main stages. Left: Seed task generation and verification at the specification (verbal) level. Right: A verified seed task is expanded into diverse creative problem-solving task variants through iterative mutation and verification, and is then instantiated with 3D scene configurations and executable evaluation metrics in the simulator to produce the final benchmark tasks.

Task Generation Pipeline
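The two-stage flow described above can be sketched as a small control loop. This is a minimal illustration, not the RoboWits implementation: the `TaskSpec` structure, the `verify` and mutator callables, and the `rounds` parameter are all assumptions made for the example, and the agents that the pipeline uses for generation and verification are replaced by plain Python functions. Scene and metric instantiation in the simulator are omitted.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Verbal-level task specification (hypothetical structure)."""
    name: str
    instruction: str
    objects: Tuple[str, ...] = ()
    history: Tuple[str, ...] = ()  # names of mutations applied so far


Verifier = Callable[[TaskSpec], bool]
Mutator = Callable[[TaskSpec], TaskSpec]


def expand_seed(seed: TaskSpec, mutators: Sequence[Mutator],
                verify: Verifier, rounds: int = 1) -> List[TaskSpec]:
    """Stage 2: iteratively mutate a verified seed, keeping verified variants."""
    accepted: List[TaskSpec] = []
    frontier = [seed]
    for _ in range(rounds):
        candidates = [mutate(task) for task in frontier for mutate in mutators]
        frontier = [c for c in candidates if verify(c)]
        accepted.extend(frontier)
    return accepted


def build_benchmark(seeds: Sequence[TaskSpec], mutators: Sequence[Mutator],
                    verify: Verifier, rounds: int = 1) -> List[TaskSpec]:
    """Stage 1 filters seeds by verification; stage 2 expands each survivor."""
    benchmark: List[TaskSpec] = []
    for seed in seeds:
        if not verify(seed):
            continue  # rejected at the specification level
        benchmark.append(seed)
        benchmark.extend(expand_seed(seed, mutators, verify, rounds))
    return benchmark


# Toy stand-ins for the mutation and verification agents (assumptions).
def add_distractor(task: TaskSpec) -> TaskSpec:
    return replace(task,
                   objects=task.objects + ("distractor",),
                   history=task.history + ("add_distractor",))


def non_empty(task: TaskSpec) -> bool:
    return bool(task.instruction.strip()) and len(task.objects) > 0


seeds = [
    TaskSpec("retrieve_cube", "Retrieve the cube.", ("cube", "stick")),
    TaskSpec("malformed", "", ()),  # fails verification and is dropped
]
tasks = build_benchmark(seeds, [add_distractor], non_empty, rounds=2)
```

In this toy run the malformed seed is dropped, and the remaining seed yields one variant per mutation round. In a fuller version, each accepted specification would then be paired with a 3D scene configuration and an executable evaluation metric.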

Task Gallery

Align Blocks

Retrieve Cube

Gap Retrieve

Pinch Card

Roll Up Ball

Dominos

Stand Pages

Hold Cup

Collect Screw

Place Tall Box

Ball Into Bottle

Cover With Lid

Stack Cubes

Stand Bulb

Ball Onto Tower

Cylinder Through Hole

Stack Bowls

Place Book

Raise Platform

Align Chopsticks

Retrieve Roll

Move Cubes

Balance Board

Retrieve Cube

Mutation Examples

Seed Task

Mutation 1 (Add irrelevant objects)

Mutation 2 (Fix two blocks)

Seed Task

Mutation 1 (Add irrelevant objects)

Mutation 2 (Add an obstacle)

Seed Task

Mutation 1 (Add irrelevant objects)

Mutation 2 (Replace pusher)

Seed Task

Mutation 1 (Replace coaster)

Mutation 2 (Fix coaster)
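The mutation types named in the captions above (adding irrelevant objects, fixing an object in place, replacing an object) can be illustrated as simple operators on a scene description. The sketch below is only an assumption about how such operators might look: the dictionary layout, the `fixed` and `relevant` flags, and all object names are hypothetical and are not taken from the benchmark.

```python
from copy import deepcopy
from typing import Dict, Iterable

Scene = Dict[str, Dict[str, dict]]  # hypothetical layout: {"objects": {name: props}}


def add_irrelevant_objects(scene: Scene, distractors: Iterable[str]) -> Scene:
    """Return a copy of the scene with extra task-irrelevant objects."""
    mutated = deepcopy(scene)
    for name in distractors:
        mutated["objects"][name] = {"fixed": False, "relevant": False}
    return mutated


def fix_objects(scene: Scene, names: Iterable[str]) -> Scene:
    """Return a copy of the scene in which the named objects cannot be moved."""
    mutated = deepcopy(scene)
    for name in names:
        mutated["objects"][name]["fixed"] = True
    return mutated


def replace_object(scene: Scene, old: str, new: str) -> Scene:
    """Return a copy of the scene with one object swapped for another."""
    mutated = deepcopy(scene)
    mutated["objects"][new] = mutated["objects"].pop(old)
    return mutated


# Hypothetical seed scene; the object names are placeholders.
seed_scene: Scene = {
    "objects": {
        "block_a": {"fixed": False, "relevant": True},
        "block_b": {"fixed": False, "relevant": True},
        "pusher": {"fixed": False, "relevant": True},
    }
}

with_distractors = add_irrelevant_objects(seed_scene, ["toy_car", "mug"])
with_fixed_blocks = fix_objects(seed_scene, ["block_a", "block_b"])
with_new_pusher = replace_object(seed_scene, "pusher", "ruler")
```

Each operator returns a new scene and leaves the seed untouched, so several variants can be derived from the same seed independently.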