Robotics & Dexterous Manipulation · IROS 2026 Submission

Catch It All!: Generalizable Dynamic Catching
with a Dexterous Hand

Anonymous Authors

§01 Abstract

Catching arbitrary objects requires robots to adapt in real time to changing visual observations, unseen object properties, and diverse contact dynamics. This paper presents Catch It All!, an autonomous catching framework for humanoid robots that addresses this problem without motion capture or teleoperation.

Catch It All! combines a generalizable egocentric control pipeline running at 50 Hz with a safety-aware sim-to-real curriculum, enabling real-time catching of novel objects. The policy is trained entirely in simulation, where the curriculum ensures safe behavior near torque limits and avoids self-collision. The resulting policy is deployed zero-shot on a Unitree G1 humanoid robot without any fine-tuning.

We evaluate on 11 novel objects that vary widely in size, shape, mass, restitution, and toss range, in both simulation and real-world experiments. In the real-world trials, the average catch success rate is 40.9%.

40.9% Real-world catch rate
11 Novel test objects
50 Hz Control frequency
2.75 m Max toss distance
<20 ms Perception latency

§02 Contributions

§03 Egocentric Perspective

All perception in our system is purely egocentric — the robot sees the world only through an overhead ZED2i stereo camera mounted 0.3 m above the neck at a −45° angle. There is no external motion capture, no third-person tracking, and no privileged state at test time. The video below is captured directly from this onboard camera and represents the robot's actual visual input during catching trials.

Robot's point of view

This footage is recorded from the robot's own egocentric camera — exactly what the policy sees. SAM2 runs on this stream at up to 50 Hz to produce 2D bounding boxes, which are projected into 3D via stereo depth. No external cameras or motion capture systems are used at any point during deployment.

Egocentric camera feed — onboard ZED2i · robot's actual visual input
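
The projection of a 2D bounding box into 3D described above can be illustrated with a standard pinhole back-projection. This is a minimal sketch under stated assumptions, not the authors' implementation: the intrinsics are placeholder values rather than a ZED2i calibration, the box center is used as the object pixel, and a single depth value for the box is taken as given (how depth is sampled inside the box is not specified in the text). The names `Intrinsics`, `bbox_center`, and `backproject` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (placeholder values, not a real calibration)."""
    fx: float
    fy: float
    cx: float
    cy: float


def bbox_center(bbox):
    """Return the pixel center (u, v) of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def backproject(bbox, depth_m, K):
    """Lift the box center to a 3D point in the camera frame, in meters.

    Assumed convention: x right, y down, z forward along the optical axis.
    """
    if depth_m <= 0.0:
        raise ValueError("depth must be positive")
    u, v = bbox_center(bbox)
    x = (u - K.cx) * depth_m / K.fx
    y = (v - K.cy) * depth_m / K.fy
    return (x, y, depth_m)


# Hypothetical detection: a 40 x 40 px box whose stereo depth is 2.0 m.
K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point = backproject((400.0, 200.0, 440.0, 240.0), depth_m=2.0, K=K)
```

With these placeholder numbers the box center lies 100 px right of and 20 px above the principal point, so the recovered point sits slightly to the right of and above the optical axis at 2.0 m depth.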

§04 Results — Successes

Each clip below shows a successful catch from our real-world evaluation on a Unitree G1 humanoid equipped with a custom 20-DoF, five-fingered House of Dextra hand. Objects were thrown by a human from 1.0–2.75 m at randomly selected distances. The same zero-shot deployed DAgger student policy is used for all catches — no object-specific tuning. Despite limited information on object geometry, the policy consistently achieves 4–5 finger grasps.

Spiky Ball
Pink Star
Orange Block
Green Ellipsoid
Purple Polygon
Yellow Sandbag
Pineapple
Panda
Octopus
Object            Palm Contact ↑   Catch Success ↑
Pink Star         75%              50%
Orange Block      50%              50%
Green Ellipsoid   50%              50%
Purple Polygon    75%              50%
Yellow Sandbag    100%             50%
Strawberry        25%              25%
Pineapple         75%              50%
Panda             50%              25%
Turtle            25%              0%
Octopus           50%              50%
Spiky Ball        50%              50%
Average           56.8%            40.9%

4 trials per object · 44 total tosses · 1 long (>2.0 m), 2 medium (1.45–2.0 m), 1 short (1.0–1.45 m)
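
As a quick consistency check on the table, the averages can be recomputed from the per-object rates. The sketch below copies the percentages from the table and assumes, as stated in the note above, 4 trials per object. The function names are illustrative only.

```python
TRIALS_PER_OBJECT = 4

# (palm contact %, catch success %) per object, copied from the results table.
RESULTS = {
    "Pink Star": (75, 50),
    "Orange Block": (50, 50),
    "Green Ellipsoid": (50, 50),
    "Purple Polygon": (75, 50),
    "Yellow Sandbag": (100, 50),
    "Strawberry": (25, 25),
    "Pineapple": (75, 50),
    "Panda": (50, 25),
    "Turtle": (25, 0),
    "Octopus": (50, 50),
    "Spiky Ball": (50, 50),
}


def mean_rate(column):
    """Unweighted mean of one percentage column across objects."""
    values = [row[column] for row in RESULTS.values()]
    return sum(values) / len(values)


def total_count(column):
    """Convert per-object percentages back to trial counts and sum them."""
    return sum(row[column] * TRIALS_PER_OBJECT // 100 for row in RESULTS.values())


total_tosses = TRIALS_PER_OBJECT * len(RESULTS)
palm_avg = round(mean_rate(0), 1)
catch_avg = round(mean_rate(1), 1)
palm_contacts = total_count(0)
catches = total_count(1)
```

Because every object has the same number of trials, the unweighted mean of the per-object rates equals the pooled rate: 25 palm contacts and 18 catches out of 44 tosses.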

§05 Results — Failures

We include representative failure cases for transparency. Most failures arose from two sources. First, SAM2 lost track of the projectile during flight and failed to reacquire it, which degraded the object state estimates. This was most pronounced for objects with ambiguous visual features: multi-colored, dark-colored, or texturally complex objects such as the toy turtle. The yellow bear was never successfully detected and was excluded from the main evaluation.

Second, rapid arm extension induced compensatory whole-body displacement, causing the torso and camera frame to shift forward and rebound. This introduced systematic error in camera-relative position estimates, exceeding the 5 cm localization threshold. This is a known limitation of operating without a lower-body stabilization controller; future work should incorporate a standing balance controller to decouple arm dynamics from base motion. A third, less prevalent failure mode was system latency, occasionally resulting in near-miss catches.

Failure cases — real-world evaluation set

§06 Results — Adaptation

We highlight the policy's reactivity after imperfect initial contact. In one representative qualitative example, the Pink Star hits the robot's palm, bounces onto the torso, and is caught as it rebounds off the body. This reactive behavior emerges from the closed-loop 50 Hz egocentric control pipeline, not from any explicit rebound-recovery reward, and indicates that the learned policy can recover in situations beyond clean first-contact catches.

The egocentric student policy operates from a 50-step history buffer of noisy 3D object positions (±5 cm noise; ±7.5 cm near the hand), robot proprioception for the 7-DoF arm and 20-DoF hand, and 3D bounding box dimensions. Perception latency is kept below 20 ms by running SAM2 Hiera Tiny on parallel threads on a single workstation (Ryzen 5 5600X + RTX 4090).

Adaptation — mid-flight correction & rebound recovery
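
The 50-step history of noisy object positions described above can be sketched as a fixed-length buffer with distance-dependent noise. This is an illustration under assumptions, not the training code: the text gives only the noise magnitudes, so the uniform noise distribution, the 0.30 m radius that defines "near the hand", the zero padding at start-up, and all names below are hypothetical.

```python
import math
import random
from collections import deque

HISTORY_STEPS = 50         # from the text: 50-step history buffer
FAR_NOISE_M = 0.05         # from the text: +/-5 cm
NEAR_NOISE_M = 0.075       # from the text: +/-7.5 cm near the hand
NEAR_HAND_RADIUS_M = 0.30  # assumption: the text does not define "near the hand"


def noise_bound(distance_to_hand_m):
    """Pick the noise magnitude based on how close the object is to the hand."""
    return NEAR_NOISE_M if distance_to_hand_m < NEAR_HAND_RADIUS_M else FAR_NOISE_M


def observe(true_pos, hand_pos, rng):
    """Return a noisy 3D position (uniform noise per axis is an assumption)."""
    bound = noise_bound(math.dist(true_pos, hand_pos))
    return tuple(p + rng.uniform(-bound, bound) for p in true_pos)


class PositionHistory:
    """Fixed-length buffer of the most recent 3D positions, oldest first."""

    def __init__(self, steps=HISTORY_STEPS):
        # Zero padding before the first observations is an assumption.
        self._buffer = deque([(0.0, 0.0, 0.0)] * steps, maxlen=steps)

    def push(self, position):
        self._buffer.append(tuple(position))

    def flatten(self):
        """Concatenate the history into one flat list of 3 * steps floats."""
        return [c for pos in self._buffer for c in pos]


rng = random.Random(0)
history = PositionHistory()
hand = (0.4, 0.0, 1.0)  # hypothetical hand position in the camera frame
for step in range(60):
    true_pos = (0.4, 0.0, 3.0 - 0.04 * step)  # object approaching along z
    history.push(observe(true_pos, hand, rng))
observation = history.flatten()
```

If one entry is pushed per control step, a 50-step buffer at 50 Hz spans one second of object motion. The flattened history would then be concatenated with proprioception and bounding box dimensions to form the policy input.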