We consider the problem of zero-shot cross-embodiment transfer for dexterous manipulation, where task demonstrations are available only on a source embodiment. To address this, KITE decouples manipulation into embodiment-agnostic task reasoning (green) and embodiment-specific motor control (blue), connected by a learned latent representation of interaction intent.
Generalizing manipulation policies across robot embodiments remains difficult because standard policies entangle task reasoning with embodiment-specific motor control. We study zero-shot cross-embodiment manipulation, where a policy trained on source embodiments must be deployed on a structurally different target embodiment without additional task demonstrations. We introduce Kinematic Interaction Transfer across Embodiments (KITE), which decouples manipulation into embodiment-agnostic task reasoning and embodiment-specific motor control, connected through a learned latent representation of interaction intent based on contact patterns. Task reasoning is performed by a shared policy that predicts latent intents from source demonstrations, while motor control is performed by an intent-conditioned action decoder learned from each embodiment's kinematic model. With KITE, adaptation to a new embodiment requires only training a new action decoder using its kinematic model, without recollecting demonstration data. We evaluate KITE on three manipulation tasks spanning transfer between parallel grippers, dexterous hands, and composite embodiments. KITE consistently achieves zero-shot transfer to structurally different target embodiments, outperforming state-of-the-art baselines in transfer success and task-embodiment scope.
KITE decouples task reasoning and motor control through a learned, contact-based representation of interaction intent.
At inference, KITE operates as a standard visuomotor policy. Deploying to a new robot requires only training a new action decoder from its URDF and plugging it in.
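The decoupling can be illustrated with a toy sketch. It is not the KITE implementation: the "intent" here is simply a desired 2D contact point, each "embodiment" is a planar two-link finger with made-up link lengths standing in for a URDF, and a nearest-neighbour lookup is swapped in for the learned action decoder. All names (`make_fk`, `train_decoder`, `shared_policy`) are hypothetical. The sketch only shows the shape of the idea: one shared policy emits an intent, and each embodiment gets its own decoder built from its kinematics alone, with no task demonstrations.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Joints = Tuple[float, float]
Decoder = Callable[[Point], Joints]


def make_fk(l1: float, l2: float) -> Callable[[float, float], Point]:
    """Forward kinematics of a planar two-link finger (toy stand-in for a URDF)."""
    def fk(q1: float, q2: float) -> Point:
        x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
        y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
        return (x, y)
    return fk


def train_decoder(fk: Callable[[float, float], Point],
                  n_samples: int = 5000, seed: int = 0) -> Decoder:
    """Build a decoder from kinematics alone: sample joints, record fingertip positions."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_samples):
        q = (rng.uniform(0.0, math.pi / 2), rng.uniform(0.0, math.pi / 2))
        table.append((fk(*q), q))

    def decode(intent: Point) -> Joints:
        # Nearest-neighbour lookup, an assumption standing in for a learned decoder.
        _, q = min(table, key=lambda pair: math.dist(pair[0], intent))
        return q
    return decode


def shared_policy(observation: List[float]) -> Point:
    """Hypothetical embodiment-agnostic policy: the intent is a desired contact point."""
    return (observation[0], observation[1])


# Two toy embodiments with different link lengths (metres, made up for illustration).
kinematics = {"short_finger": make_fk(0.10, 0.08), "long_finger": make_fk(0.12, 0.07)}
decoders: Dict[str, Decoder] = {name: train_decoder(fk) for name, fk in kinematics.items()}

observation = [0.09, 0.12]  # pretend the desired contact point is observed directly
intent = shared_policy(observation)

solutions: Dict[str, Joints] = {}
errors: Dict[str, float] = {}
for name, decode in decoders.items():
    solutions[name] = decode(intent)
    reached = kinematics[name](*solutions[name])
    errors[name] = math.dist(reached, intent)
    print(f"{name}: joints=({solutions[name][0]:.2f}, {solutions[name][1]:.2f}) rad, "
          f"contact error={errors[name] * 1000:.1f} mm")
```

In this toy setting, supporting a third finger would only mean adding one more entry to `kinematics` and calling `train_decoder` on it; `shared_policy` stays untouched, which mirrors the "train a new decoder and plug it in" workflow described above.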
We deploy KITE on a WujiHand mounted on a Tianji-Marvin arm in the real world. Every policy is trained in matched simulation and run on hardware with no task demonstrations on the target embodiment and no real-world fine-tuning. KITE succeeds in 7–8 of 10 trials on all three tasks, confirming zero-shot transfer under real perception and physics. For keyboard pressing and bottle pumping, the source is a human hand, showing that KITE extends to human-to-robot transfer.
Cube picking: Robotiq → Wuji reaches 7/10 successes on physical hardware.
Human Hand → Wuji reaches 8/10 successes on physical hardware.
Human Hand → Wuji reaches 7/10 successes on physical hardware.
We evaluate zero-shot cross-embodiment transfer across three tasks and five embodiments in MuJoCo: a parallel gripper, three dexterous hands, and a composite of two robots. With no task demonstrations on the target embodiments, KITE achieves the highest success rate in every transfer setting.
A Robotiq source demonstration transfers to three dexterous hands, testing simple-gripper to dexterous-hand zero-shot manipulation.
Wuji demonstrations press a target key sequence and transfer to dexterous and composite target embodiments through the shared interaction-intent interface.
The source policy transfers a multi-stage interaction that must both lift the bottle and depress the pump, requiring task-relevant contacts to change over time.
We verify the necessity of the latent intent interface, confirm that kinematics-only supervision is sufficient for the action decoder, and characterize the decoder's robustness to initialization perturbations.
Latent intent interface. Removing the latent intent and having the policy output raw contact sets instead lowers the average success rate by 12–24 percentage points across tasks, with the gap widening on more contact-rich tasks.
Kinematics-only supervision. Adding up to 200 target-task demonstrations when training the action decoder barely improves the success rate, indicating that kinematics-only supervision is sufficient.
| Initialization | Success rate (%) |
|---|---|
| Base | 100 |
| 5 cm, 30° | 100 |
| 20 cm, 60° | 81 |
| Flipped 180° | 62 |
| 30 cm shift | 37 |
Initialization robustness. The action decoder handles moderate perturbations well and only degrades when initialization departs far from the local contact neighborhood.
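To make the degradation easier to read, the short snippet below recomputes each row of the table as a drop relative to the base setting. The numbers are copied from the table; the variable names are illustrative only.

```python
# Success rates (%) copied from the initialization-robustness table.
results = {
    "Base": 100,
    "5 cm, 30°": 100,
    "20 cm, 60°": 81,
    "Flipped 180°": 62,
    "30 cm shift": 37,
}

base = results["Base"]
# Drop in percentage points relative to the unperturbed initialization.
drops = {name: base - rate for name, rate in results.items() if name != "Base"}

for name, drop in drops.items():
    print(f"{name}: {drop} points below base")
```

The drops are 0, 19, 38, and 63 points respectively, so the 30 cm shift is by far the most damaging perturbation in the table.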
KITE's action decoder can discover diverse execution strategies for the same task on a given robot, engaging different hand regions to achieve the desired contact in each case.
We compare KITE against SPIDER, a retargeting baseline that relies on hand-specified source-to-target part correspondences. On bottle pumping, SPIDER breaks down as the predefined correspondence misaligns with the task-relevant contact region, while KITE adapts contacts to each target embodiment.
We gratefully acknowledge use of the research computing resources of the Empire AI Consortium, Inc., with support from Empire State Development of the State of New York, the Simons Foundation, and the Secunda Family Foundation. This work was supported in part by the Amazon Research Awards and an NVIDIA Academic Grant. We thank Samuel Jin, Yunhao Cao, Jialiang Zhang, Chuanruo Ning, Xingyi He, Adhitya Polavaram, Zhenyu Wei, Qi Wu, Cory Fan, and Pranav Thakkar for constructive discussions and feedback. We thank Calvin Qiu, Adhitya Polavaram, Yunhao Cao, and Cory Fan for their generous help on real-world and simulation robot infrastructure.
@misc{wang2026kitedecouplingkinematicsinteraction,
title={KITE: Decoupling Kinematics and Interaction for Zero-Shot Cross-Embodiment Manipulation},
author={Qianxu Wang and Kuan Fang},
year={2026},
eprint={2606.22113},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.22113},
}