KITE

Generalizing manipulation policies across robot embodiments remains difficult because standard policies entangle task reasoning with embodiment-specific motor control. We study zero-shot cross-embodiment manipulation, where a policy trained on source embodiments must be deployed on a structurally different target embodiment without additional task demonstrations. We introduce Kinematic Interaction Transfer across Embodiments (KITE), which decouples manipulation into embodiment-agnostic task reasoning and embodiment-specific motor control, connected through a learned latent representation of interaction intent based on contact patterns. Task reasoning is performed by a shared policy that predicts latent intents from source demonstrations, while motor control is performed by an intent-conditioned action decoder learned from each embodiment's kinematic model. With KITE, adaptation to a new embodiment requires only training a new action decoder using its kinematic model, without recollecting demonstration data. We evaluate KITE on three manipulation tasks spanning transfer between parallel grippers, dexterous hands, and composite embodiments. KITE consistently achieves zero-shot transfer to structurally different target embodiments, outperforming state-of-the-art baselines in transfer success and task-embodiment scope.

KITE decouples task reasoning and motor control through a learned, contact-based representation of interaction intent:

Embodiment-specific motor control. A generalist action decoder turns a latent intent into motor commands for a specific robot. It is trained solely from the robot's kinematic model, with no task demonstrations.
Embodiment-agnostic task reasoning. A shared policy learns what interaction should happen rather than how a particular body should move. It is trained from ordinary task demonstrations, converted into the latent-intent space.
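
To make kinematics-only supervision concrete, the toy sketch below generates decoder training data from a kinematic model alone, with no task demonstrations. Everything in it is an illustrative assumption rather than KITE's implementation: a planar two-link arm stands in for the robot, the fingertip position stands in for the latent intent, and a nearest-neighbor lookup stands in for the learned decoder.

```python
import numpy as np

LINK_LENGTHS = (1.0, 0.8)  # hypothetical link lengths of a planar two-link arm


def forward_kinematics(q):
    """Fingertip (x, y) positions for joint angles q of shape (N, 2)."""
    l1, l2 = LINK_LENGTHS
    x = l1 * np.cos(q[:, 0]) + l2 * np.cos(q[:, 0] + q[:, 1])
    y = l1 * np.sin(q[:, 0]) + l2 * np.sin(q[:, 0] + q[:, 1])
    return np.stack([x, y], axis=1)


def build_decoder_dataset(num_samples, seed=0):
    """Sample joint configurations and label them using only the kinematic model."""
    rng = np.random.default_rng(seed)
    q = rng.uniform(low=[0.0, 0.1], high=[np.pi, np.pi - 0.1], size=(num_samples, 2))
    return forward_kinematics(q), q  # (stand-in intents, joint targets)


def decode(intent, intents, joints):
    """Stand-in decoder: return the joints whose stored intent is nearest."""
    idx = np.argmin(np.linalg.norm(intents - intent, axis=1))
    return joints[idx]


intents, joints = build_decoder_dataset(20_000)
target = np.array([0.5, 1.2])  # a reachable fingertip target (stand-in intent)
q_hat = decode(target, intents, joints)
error = np.linalg.norm(forward_kinematics(q_hat[None])[0] - target)
print(f"fingertip error: {error:.3f}")
```

The point of the sketch is only that both the inputs and the labels come from sampling the robot's own kinematics. A real decoder would replace the lookup with a learned model and the fingertip position with a contact-based intent.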

At inference, KITE operates as a standard visuomotor policy. Deploying to a new robot requires only training a new action decoder from its URDF and plugging it in.
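
The plug-in structure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the released KITE code: `SharedPolicy`, `ActionDecoder`, `act`, the latent dimension, the joint counts, and the random linear maps are all hypothetical placeholders chosen to show the interface, not the models.

```python
from dataclasses import dataclass

import numpy as np

LATENT_DIM = 8  # hypothetical size of the latent interaction intent


@dataclass
class SharedPolicy:
    """Embodiment-agnostic stand-in: observation -> latent intent."""
    weights: np.ndarray  # shape (LATENT_DIM, obs_dim)

    def predict_intent(self, observation):
        return np.tanh(self.weights @ observation)


@dataclass
class ActionDecoder:
    """Embodiment-specific stand-in: latent intent -> joint command."""
    name: str
    weights: np.ndarray  # shape (num_joints, LATENT_DIM)

    def decode(self, intent):
        return self.weights @ intent


def act(policy, decoder, observation):
    """One control step: the policy never sees which decoder is attached."""
    return decoder.decode(policy.predict_intent(observation))


rng = np.random.default_rng(0)
obs_dim = 16
policy = SharedPolicy(rng.normal(size=(LATENT_DIM, obs_dim)))

# Hypothetical joint counts, chosen only to show different output sizes.
decoders = {
    "gripper": ActionDecoder("gripper", rng.normal(size=(1, LATENT_DIM))),
    "hand": ActionDecoder("hand", rng.normal(size=(16, LATENT_DIM))),
}

observation = rng.normal(size=obs_dim)
for name, decoder in decoders.items():
    command = act(policy, decoder, observation)
    print(name, command.shape)
```

In this arrangement, supporting a new robot means adding one entry to `decoders`; the shared policy and its training data stay untouched.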

We deploy KITE in the real world on a WujiHand mounted on a Tianji-Marvin arm. Every policy is trained in a matched simulation and then run on hardware, with no task demonstrations on the target embodiment and no real-world fine-tuning. KITE succeeds in 7 or 8 of 10 trials on each of the three tasks, demonstrating zero-shot transfer under real perception and physics. For keyboard pressing and bottle pumping, the source is a human hand, showing that KITE also extends to human-to-robot transfer.

Cube Picking

Robotiq → Wuji reaches 7/10 successes on physical hardware.

Source: Robotiq. Target: Wuji.

Keyboard Pressing

Human Hand → Wuji reaches 8/10 successes on physical hardware.

Source: Human Hand. Target: Wuji.

Bottle Pumping

Human Hand → Wuji reaches 7/10 successes on physical hardware.

Source: Human Hand. Target: Wuji.
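
As a reading aid for the trial counts above, the sketch below converts each count into a success rate and a Wilson score interval, a standard interval for binomial proportions with few trials. The interval is not reported in the source; it is added here only as an assumption-labeled illustration of how 10-trial counts can be summarized.

```python
import math

# Real-world success counts out of 10 trials, as reported above.
RESULTS = {
    "Cube Picking (Robotiq -> Wuji)": (7, 10),
    "Keyboard Pressing (Human Hand -> Wuji)": (8, 10),
    "Bottle Pumping (Human Hand -> Wuji)": (7, 10),
}


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


for task, (k, n) in RESULTS.items():
    low, high = wilson_interval(k, n)
    print(f"{task}: {k}/{n} = {k / n:.0%}, interval [{low:.2f}, {high:.2f}]")
```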

We evaluate zero-shot cross-embodiment transfer across three tasks and five embodiments in MuJoCo: a parallel gripper, three dexterous hands, and a composite of two robots. With no task demonstrations on the target embodiments, KITE achieves the highest success rate in every transfer setting.

Cube Picking

A Robotiq source demonstration transfers to three dexterous hands, testing simple-gripper to dexterous-hand zero-shot manipulation.

Source: Robotiq. Targets: Barrett, Allegro, Wuji.

Keyboard Pressing

Wuji demonstrations press a target key sequence and transfer to dexterous and composite target embodiments through the shared interaction-intent interface.

Source: Wuji. Targets: Barrett, Allegro, Robotiq+Stick.

Bottle Pumping

The source policy transfers a multi-stage interaction that must both lift the bottle and depress the pump, requiring task-relevant contacts to change over time.

Source: Wuji. Targets: Barrett, Allegro, Robotiq+Stick.

We verify the necessity of the latent intent interface, confirm that kinematics-only supervision is sufficient for the action decoder, and characterize the decoder's robustness to initialization perturbations.

Latent intent interface. Removing the latent intent and having the policy output raw contact sets instead lowers the average success rate by 12–24 percentage points across tasks, and the gap widens on more contact-rich tasks.

[Figure: Effect of adding target-task demonstrations]

Kinematics-only supervision. Adding up to 200 target-task demonstrations when training the action decoder barely improves the success rate, indicating that kinematics-only supervision is sufficient.

Initialization	Success rate (%)
Base	100
5 cm, 30°	100
20 cm, 60°	81
Flipped 180°	62
30 cm shift	37

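Initialization robustness. The action decoder handles moderate perturbations well (100% at 5 cm, 30° and 81% at 20 cm, 60°) and degrades substantially only when the initialization departs far from the local contact neighborhood (62% when flipped 180°, 37% with a 30 cm shift).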
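
The drop relative to the base initialization can be read off the table with a few lines of arithmetic. The sketch below only restates the reported numbers; the variable names are illustrative.

```python
# Success rates (%) copied from the initialization table above.
SUCCESS_BY_INIT = {
    "Base": 100,
    "5 cm, 30°": 100,
    "20 cm, 60°": 81,
    "Flipped 180°": 62,
    "30 cm shift": 37,
}

base = SUCCESS_BY_INIT["Base"]
# Percentage-point drop of each setting relative to the base initialization.
drops = {name: base - rate for name, rate in SUCCESS_BY_INIT.items()}
for name, drop in drops.items():
    print(f"{name}: {SUCCESS_BY_INIT[name]}% ({drop} points below base)")
```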

The action decoder in KITE can discover diverse execution strategies for the same task on a given robot, engaging different hand regions to achieve the desired contact in each case.

Wuji · Strategy 1

Wuji · Strategy 2

Wuji · Strategy 3

We compare KITE against SPIDER, a retargeting baseline that relies on hand-specified source-to-target part correspondences. On bottle pumping, SPIDER breaks down because its predefined correspondence does not align with the task-relevant contact region, whereas KITE adapts its contacts to each target embodiment.

KITE · Allegro

SPIDER · Allegro

KITE · Barrett

SPIDER · Barrett

We gratefully acknowledge use of the research computing resources of the Empire AI Consortium, Inc., with support from Empire State Development of the State of New York, the Simons Foundation, and the Secunda Family Foundation. This work was supported in part by the Amazon Research Awards and an NVIDIA Academic Grant. We thank Samuel Jin, Yunhao Cao, Jialiang Zhang, Chuanruo Ning, Xingyi He, Adhitya Polavaram, Zhenyu Wei, Qi Wu, Cory Fan, and Pranav Thakkar for constructive discussions and feedback. We thank Calvin Qiu, Adhitya Polavaram, Yunhao Cao, and Cory Fan for their generous help on real-world and simulation robot infrastructure.

@misc{wang2026kitedecouplingkinematicsinteraction,
      title={KITE: Decoupling Kinematics and Interaction for Zero-Shot Cross-Embodiment Manipulation},
      author={Qianxu Wang and Kuan Fang},
      year={2026},
      eprint={2606.22113},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.22113},
}