Catch It All! — Generalizable Dynamic Catching with a Dexterous Hand

§01 Abstract

Catching arbitrary objects requires robots to adapt in real time to changing visual observations, unseen object properties, and diverse contact dynamics. This paper presents Catch It All!, an autonomous catching framework for humanoid robots that addresses this problem without motion capture or teleoperation.

Catch It All! pairs a generalizable egocentric control pipeline running at 50 Hz with a safety-aware sim-to-real curriculum, enabling real-time catching of novel objects. The policy is trained entirely in simulation, and the curricula keep its behavior safe near torque limits and discourage self-collision. The resulting policy is deployed zero-shot on a Unitree G1 humanoid robot without any fine-tuning.

We evaluate on 11 novel objects that vary widely in size, shape, mass, and restitution, thrown over a range of toss distances. In real-world experiments the policy catches 40.9% of 44 tosses, and simulation results follow the same trends across object categories.

40.9% Real-world catch rate

11 Novel test objects

50 Hz Control frequency

2.75 m Max toss distance

<20 ms Perception latency

§02 Contributions

C1
Generalizing across objects, tosses, and physical properties
We design rewards and training curricula that yield policies capable of catching objects unseen during training, spanning diverse shapes, sizes, masses, and restitution coefficients. Toss conditions are similarly varied: toss distances range from 1.0 to 2.75 m and include angled and lateral tosses (including behind-the-robot side catches), with randomized time-to-impact from 0.4 to 5.0 s and object angular velocities up to ±25 rad/s for small objects. This breadth of training conditions lets the policy handle a wide range of real-world throws without any object- or toss-specific tuning, achieving non-trivial catching performance on 10 of 11 unseen real-world objects.
C2
Safe policy training via progressive curricula
We introduce a joint reward-penalty and torque-clipping curriculum that balances exploration early in training with strict safety enforcement as the policy matures. The resulting catching behaviors respect hardware constraints, including wrist torque limits as low as 5 N·m, without sacrificing performance.
C3
Zero-shot sim-to-real transfer without fine-tuning
We distill a privileged teacher policy trained on ground-truth state into an egocentric student that operates from noisy camera observations. Combined with extensive domain randomization, this lets the policy transfer directly to a real Unitree G1 humanoid. Sim and real performance trends are closely matched across object categories, which supports using simulation as a proxy for real-world evaluation. A human-teleoperated baseline achieved only 6.8% success across the same 44 tosses, underscoring both the task difficulty and the strength of our autonomous approach.
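To make the progressive curriculum in C2 concrete, here is a minimal sketch of one way such a schedule could look. It is not the authors' implementation: the linear ramp, the class and method names, the loose starting limit, the penalty weight, and the ramp length are all assumptions chosen for illustration. Only the 5 N·m wrist limit comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyCurriculum:
    """Hypothetical linear schedule for torque clipping and penalty weight."""

    loose_limit_nm: float = 15.0     # assumed permissive clip early in training
    strict_limit_nm: float = 5.0     # wrist torque limit quoted in the text
    max_penalty_weight: float = 1.0  # assumed final weight of the torque penalty
    ramp_steps: int = 10_000         # assumed number of steps over which to tighten

    def progress(self, step: int) -> float:
        """Fraction of the ramp completed, clamped to [0, 1]."""
        return min(max(step / self.ramp_steps, 0.0), 1.0)

    def torque_limit(self, step: int) -> float:
        """Clip limit interpolated from the loose to the strict value."""
        a = self.progress(step)
        return (1.0 - a) * self.loose_limit_nm + a * self.strict_limit_nm

    def clip(self, torque: float, step: int) -> float:
        """Clamp a commanded torque to the current limit."""
        limit = self.torque_limit(step)
        return max(-limit, min(limit, torque))

    def penalty(self, torque: float, step: int) -> float:
        """Reward penalty for exceeding the strict limit, growing with progress."""
        excess = max(abs(torque) - self.strict_limit_nm, 0.0)
        return -self.progress(step) * self.max_penalty_weight * excess ** 2


curriculum = SafetyCurriculum()
early = curriculum.clip(20.0, step=0)        # clipped at the loose limit
late = curriculum.clip(20.0, step=10_000)    # clipped at the strict limit
```

Under these assumptions the policy can command large torques with no penalty at the start of training, while by the end of the ramp any torque above the hardware limit is both clipped and penalized.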

§03 Egocentric Perspective

All perception in our system is purely egocentric — the robot sees the world only through an overhead ZED2i stereo camera mounted 0.3 m above the neck at a −45° angle. There is no external motion capture, no third-person tracking, and no privileged state at test time. The video below is captured directly from this onboard camera and represents the robot's actual visual input during catching trials.
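Because every object position is measured relative to the camera, it has to be mapped into a body-fixed frame before it can be used for control. The sketch below shows that mapping for the mounting geometry described above. The frame convention (x forward, y left, z up, pitch as a rotation about y), the sign of the pitch angle, and the function name are assumptions; the text does not specify them.

```python
import math


def camera_to_neck_frame(p_cam, pitch_deg=-45.0, mount_height_m=0.3):
    """Map a 3D point from the camera frame into a neck-fixed frame.

    Assumed convention (not stated in the source): both frames use
    x forward, y left, z up; the camera is rotated about the y axis by
    pitch_deg and offset mount_height_m along the neck frame's z axis.
    """
    x, y, z = p_cam
    theta = math.radians(pitch_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Rotation about the y axis, followed by the vertical mounting offset.
    x_neck = c * x + s * z
    z_neck = -s * x + c * z
    return (x_neck, y, z_neck + mount_height_m)


# A point 1 m along the camera's forward axis, expressed in the neck frame.
p_neck = camera_to_neck_frame((1.0, 0.0, 0.0))
```

The transform is rigid, so any unmodeled motion of the camera mount carries over one-to-one into the estimated object position.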

Egocentric camera feed — onboard ZED2i · robot's actual visual input

§04 Results — Successes

Each clip below shows a successful catch from our real-world evaluation on a Unitree G1 humanoid equipped with a custom 20-DoF, five-fingered House of Dextra hand. Objects were thrown by a human from 1.0–2.75 m at randomly selected distances. The same zero-shot deployed DAgger student policy is used for all catches — no object-specific tuning. Despite limited information on object geometry, the policy consistently achieves 4–5 finger grasps.
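For a sense of what these toss distances demand, a drag-free ballistic model relates toss distance and flight time to the launch velocity. The sketch below is a simplified illustration, not the toss model used in training or evaluation: it ignores air drag and spin, and the function names and the 1.0 s flight time in the example are assumptions.

```python
G = 9.81  # gravitational acceleration in m/s^2


def launch_velocity(distance_m, time_to_impact_s, height_change_m=0.0):
    """Horizontal and vertical launch speed for a drag-free toss.

    Solves p(T) = p0 + v0*T - 0.5*G*T^2 (vertical axis) for v0, given the
    horizontal distance, the flight time T, and the net height change.
    """
    vx = distance_m / time_to_impact_s
    vz = height_change_m / time_to_impact_s + 0.5 * G * time_to_impact_s
    return vx, vz


def position(vx, vz, t):
    """Horizontal and vertical displacement after t seconds of free flight."""
    return vx * t, vz * t - 0.5 * G * t * t


# Longest evaluated toss distance with an assumed 1.0 s flight time.
vx, vz = launch_velocity(distance_m=2.75, time_to_impact_s=1.0)
x_end, z_end = position(vx, vz, 1.0)
```

Shorter flight times at the same distance require a faster, flatter throw, which leaves the policy fewer control steps to react.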

Spiky Ball

Pink Star

Orange Block

Green Ellipsoid

Purple Polygon

Yellow Sandbag

Pineapple

Panda

Octopus

Object	Palm Contact ↑	Catch Success ↑
Pink Star	75%	50%
Orange Block	50%	50%
Green Ellipsoid	50%	50%
Purple Polygon	75%	50%
Yellow Sandbag	100%	50%
Strawberry	25%	25%
Pineapple	75%	50%
Panda	50%	25%
Turtle	25%	0%
Octopus	50%	50%
Spiky Ball	50%	50%
Average	56.8%	40.9%

4 trials per object · 44 total tosses · per object: 1 long (>2.0 m), 2 medium (1.45–2.0 m), 1 short (1.0–1.45 m)
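The averages in the table can be reproduced from the per-object rows. The short check below converts each percentage back to a trial count, assuming every object received exactly 4 trials as stated, and recomputes the pooled rates. The variable names are illustrative.

```python
TRIALS_PER_OBJECT = 4

# object: (palm contact %, catch success %), copied from the table above
results = {
    "Pink Star": (75, 50),
    "Orange Block": (50, 50),
    "Green Ellipsoid": (50, 50),
    "Purple Polygon": (75, 50),
    "Yellow Sandbag": (100, 50),
    "Strawberry": (25, 25),
    "Pineapple": (75, 50),
    "Panda": (50, 25),
    "Turtle": (25, 0),
    "Octopus": (50, 50),
    "Spiky Ball": (50, 50),
}


def to_count(percent):
    """Convert a per-object percentage back to a number of trials."""
    return round(percent / 100 * TRIALS_PER_OBJECT)


total_tosses = TRIALS_PER_OBJECT * len(results)
palm_contacts = sum(to_count(palm) for palm, _ in results.values())
catches = sum(to_count(catch) for _, catch in results.values())

palm_rate = round(100 * palm_contacts / total_tosses, 1)
catch_rate = round(100 * catches / total_tosses, 1)
```

This gives 25 palm contacts and 18 catches out of 44 tosses, matching the reported 56.8% and 40.9%.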

§05 Results — Failures

We include representative failure cases for transparency. Most failures arose from two sources. First, SAM2 lost track of the projectile during flight and failed to reacquire it, which degraded the object state estimates. This was most pronounced for objects with ambiguous visual features (multi-colored, dark-colored, or texturally complex objects such as the toy turtle). The yellow bear was never successfully detected and was excluded from the main evaluation.

Second, rapid arm extension induced compensatory whole-body displacement, causing the torso and camera frame to shift forward and rebound. This introduced a systematic error in the camera-relative position estimates that exceeded the 5 cm localization threshold. This is a known limitation of operating without a lower-body stabilization controller; future work should incorporate a standing balance controller to decouple arm dynamics from base motion. A third, less prevalent failure mode was system latency, which occasionally resulted in near misses.

Failure cases — real-world evaluation set

§06 Results — Adaptation

We highlight the policy's reactivity after imperfect initial contact. In one representative qualitative example, the Pink Star projectile hits the palm of the robot hand and rebounds to the torso, and the robot then catches it as it bounces off the body. This reactive behavior emerges from the closed-loop 50 Hz egocentric control pipeline rather than from any explicit rebound-recovery reward, and it suggests that the learned policy can generalize beyond clean first-contact catches.

The egocentric student policy operates from a 50-step history buffer of noisy 3D object positions (±5 cm noise; ±7.5 cm near the hand), robot proprioception for the 7-DoF arm and 20-DoF hand, and 3D bounding box dimensions. Perception latency is kept below 20 ms by running SAM2 Hiera Tiny on parallel threads on a single workstation (Ryzen 5 5600X + RTX 4090).
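The observation layout described above can be sketched as a fixed-length history plus the current proprioception and bounding box. This is a hedged illustration only: the uniform noise model, the near-hand radius, the zero padding, the use of joint positions alone as proprioception, and all names are assumptions that the text does not confirm.

```python
import math
import random
from collections import deque

HISTORY_STEPS = 50           # history length from the text
ARM_DOF, HAND_DOF = 7, 20    # arm and hand degrees of freedom from the text
NOISE_FAR_M = 0.05           # +/-5 cm position noise from the text
NOISE_NEAR_M = 0.075         # +/-7.5 cm noise near the hand from the text
NEAR_HAND_RADIUS_M = 0.3     # hypothetical: "near the hand" is not quantified


class ObservationBuffer:
    """Keeps the last 50 noisy object positions and builds a flat observation."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.history = deque(maxlen=HISTORY_STEPS)  # oldest entries drop out

    def push(self, object_pos, hand_pos):
        """Add one position sample with assumed uniform, distance-dependent noise."""
        near = math.dist(object_pos, hand_pos) < NEAR_HAND_RADIUS_M
        bound = NOISE_NEAR_M if near else NOISE_FAR_M
        noisy = tuple(c + self.rng.uniform(-bound, bound) for c in object_pos)
        self.history.append(noisy)
        return noisy

    def observation(self, joint_positions, bbox_dims):
        """Concatenate the history, joint positions, and 3D bounding box size."""
        if len(joint_positions) != ARM_DOF + HAND_DOF or len(bbox_dims) != 3:
            raise ValueError("expected 27 joint values and 3 box dimensions")
        # Zero-pad at the front until the history is full (an assumption).
        padding = [(0.0, 0.0, 0.0)] * (HISTORY_STEPS - len(self.history))
        flat = [c for p in padding + list(self.history) for c in p]
        return flat + list(joint_positions) + list(bbox_dims)


buffer = ObservationBuffer()
for _ in range(60):
    buffer.push(object_pos=(2.0, 0.0, 1.0), hand_pos=(0.0, 0.0, 0.0))
obs = buffer.observation([0.0] * 27, (0.1, 0.1, 0.1))
```

Under these assumptions the observation has 50 × 3 + 27 + 3 = 180 entries. If the buffer is filled once per control step at 50 Hz, it spans 1 s of object motion, and a perception latency under 20 ms fits within a single 20 ms control period.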

Adaptation — mid-flight correction & rebound recovery