Follow / arXiv • TenStep

arXiv Items

15 item(s)

2026-05-05 medium Score 1.7

Being-H0.7: A Latent World-Action Model from Egocentric Videos

Authors: Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu, Chaoyi Xu, Ziheng Xi, Yuhui Fu, Zongqing Lu

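One-Sentence Summary

Proposes a latent world-action model that needs no future-frame generation, improving the future awareness of VLA control policies.

Summary (translated from Chinese)

Being-H0.7 inserts learnable latent queries between perception and action and aligns the deployment-time current-observation branch with a training-time future-observation branch, giving the policy control-oriented reasoning about future structure. Because inference involves no video rollout, the method combines the benefits of world models with the efficiency of direct VLA policies. Its practical value for robot control is validated on multiple simulated and real-world tasks.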

English Abstract

Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.

Physical/Embodied Intelligence LLM/VLM

2026-05-05 medium Score 1.6

Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation

Authors: Jinkun Liu, Haohan Chi, Lingfeng Zhang, Yifan Xie, YuAn Wang, Long Chen, Hangjun Ye, Xiaoshuai Hao, Wenbo Ding

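One-Sentence Summary

Proposes reasoning traces that interleave textual subgoals with visual keyframes for long-horizon robot manipulation policies.

Summary (translated from Chinese)

The paper proposes IVLR, in which a multimodal Transformer first generates, at test time, an interleaved text-vision reasoning trace covering the whole task; a closed-loop action decoder then outputs actions conditioned on the trace, the instruction, and the current observation. Pseudo-supervision is constructed by segmenting demonstrations and annotating them with a vision-language model, and the method substantially improves long-horizon manipulation success rates on LIBERO and SimplerEnv-WidowX. It directly targets explicit semantic-geometric planning in VLA manipulation policies.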

English Abstract

Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision-Language Reasoning (IVLR), a policy framework built around an explicit intermediate representation, the reasoning trace, that alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation. Because standard robot datasets lack such traces, we construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model. Across simulated benchmarks for long-horizon manipulation and visual distribution shift, IVLR reaches 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX. Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7%; text-only and vision-only traces reach 62.0% and 68.4%, while the full interleaved trace reaches 92.4%. Stress tests with execution perturbations and masked trace content show moderate degradation, suggesting that the trace can tolerate local corruption and moderate execution drift, but remains limited under stale or incorrect global plans.

Physical/Embodied Intelligence LLM/VLM

2026-05-05 medium Score 1.6

Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances

Authors: Xianbo Cai, Hideyuki Ichiwara, Hyogo Hiruma, Masaki Yoshikawa, Hiroshi Ito, Tetsuya Ogata

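One-Sentence Summary

Proposes a stereo multistage spatial-attention predictive policy for real-time mobile manipulation, improving the robustness of closed-loop action generation under scale variation and visual disturbances.

Summary (translated from Chinese)

The paper addresses the variation in target objects' visual scale caused by changing camera viewpoints when mobile manipulators operate in open environments. It proposes a hierarchical recurrent predictive model that extracts task-relevant spatial attention points from stereo images and combines them with robot states. The method is used directly for closed-loop action prediction and is evaluated on real-world mobile manipulation tasks including rigid placement, articulated object manipulation, and deformable object interaction. Experiments show higher success rates than imitation-learning and VLA baselines under randomized initial positions and visual disturbances, which places the work within visuomotor policies for robot manipulation.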

English Abstract

Robots operating in open, unstructured real-world environments must rely on onboard visual perception while autonomously moving across different locations. Continuous changes in onboard camera viewpoints cause significant visual scale variations in target objects, affecting vision-based motion generation. In this work, we present a stereo multistage spatial attention-based deep predictive learning method for real-time mobile manipulation. The proposed method extracts task-relevant spatial attention points from stereo images and integrates them with robot states through a hierarchical recurrent architecture for closed-loop action prediction. We evaluate the system on four real-world mobile manipulation tasks using a mobile manipulator, including rigid placement, articulated object manipulation, and deformable object interaction. Experiments under randomized initial positions and visual disturbance conditions demonstrate improved robustness and task success rates compared to representative imitation learning and vision-language-action baselines under identical control settings. The results indicate that structured stereo spatial attention combined with predictive temporal modeling provides an effective solution within the evaluated mobile manipulation scenarios.

Physical/Embodied Intelligence

2026-05-05 low Score 1.52

Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

Authors: Abay Bektursun

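One-Sentence Summary

Explores cross-modal reuse of frozen text-pretrained Transformer weights and reports gains on a robotic manipulation benchmark using a small trainable interface.

Summary (translated from Chinese)

The paper studies transferring frozen Gemma weights, pretrained only on text tokens, to non-text tasks through a thin trainable interface. It reports surpassing the published GCIQL result on an OGBench robotic manipulation task and analyzes the mechanism by which internal heads of the frozen text model can be reused in cross-modal tasks. Although the robotics experiments are narrow in scope, the work bears directly on the cross-modal algorithmic question of using large-model weights as a substrate for robot manipulation policies.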

English Abstract

Frozen Gemma 4 31B weights pretrained exclusively on text tokens, unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: $+4.33$pt over published GCIQL at $n=3$ with std 0.74 -- a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision-Transformer parity ($76.2 \pm 0.8$, $n=3$) at $0.43\times$ DT's trainable count, with the frozen substrate compressing to a 5L slice ($+1.66$pt over the 6L baseline at $n=3$). (3) Associative recall as the cleanest pretraining-load-bearing case: the frozen slice + a 113K-parameter linear interface reaches L30 best-checkpoint per-bit error 0.0505 ($n=2$); a 6.36M-parameter from-scratch trained transformer at matched capacity ($1/\sqrt{d_k}$ scaling, two seeds, LR sweep) cannot solve the task at all under the protocol (best L30 = 0.4395), an $8.7\times$ advantage. Architecture-alone falsifications: a frozen random transformer with correct $1/\sqrt{d_k}$ scaling stays at random-chance loss for 50k steps; a random-init Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across $n=3$ where pretrained reaches 60%). A dual-measurement protocol -- text-activation probing on 95 English sentences plus task-ablation on a non-language target -- names individual heads independently identifiable on both protocols: head L26.28 scores $3.7\times$ the slice mean for English token-copying and is the #2 most-critical head for binary copy ablation ($\Delta$ L30 $= +0.221$); three further heads (L27.28, L27.2, L27.3) classify by the same protocol. The mechanism is single-model and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.

Physical/Embodied Intelligence LLM/VLM

2026-05-05 low Score 1.5

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

Authors: Hanxin Zhang, Mingshuo Xu, Abdulqader Dhafer, Shigang Yue, Hongbiao Dong, Zhou Daniel Hao

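One-Sentence Summary

Uses interventional visual attribution to diagnose causal misalignment in VLA policies under out-of-distribution generalization.

Summary (translated from Chinese)

The paper formalizes visual-action attribution in VLAs as an interventional estimation problem, proposing ISS to measure the causal influence of visual regions on action predictions and NMR to quantify a policy's reliance on irrelevant visual factors. Experiments show that these metrics predict generalization performance on manipulation tasks and are more faithful than conventional explanation methods. Although the work leans toward interpretability, it directly serves reliability analysis of embodied VLA policies.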

English Abstract

Vision-Language-Action (VLA) policies often fail under distribution shift, suggesting that decisions may depend on spurious visual correlations rather than task-relevant causes. We formulate visual-action attribution as an interventional estimation problem. Accordingly, we introduce the Interventional Significance Score (ISS), an interventional masking procedure for estimating the causal influence of visual regions on action predictions, and the Nuisance Mass Ratio (NMR), a scalar measure of attribution to task-irrelevant features. We analyze the statistical properties of ISS and show that it admits unbiased estimation, and we characterize conditions under which action prediction error provides a valid proxy for causal influence. Experiments across diverse manipulation tasks indicate that NMR predicts generalization behavior and that ISS yields more faithful explanations than existing interpretability methods. These results suggest that interventional attribution provides a simple diagnostic approach for identifying causal misalignment in embodied policies.
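
The interventional-masking idea in the abstract above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not the paper's definition of ISS or NMR: the toy `policy`, the three named regions, the zero-fill intervention, and the use of the L2 action shift as the influence score.

```python
import math


def policy(obs):
    """Hypothetical toy policy: leans on 'object', slightly on 'background'."""
    return [2.0 * obs["object"] + 0.5 * obs["background"], 1.0 * obs["gripper"]]


def action_shift(a, b):
    """L2 distance between two action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interventional_scores(policy, obs, baseline=0.0):
    """Mask one region at a time and record how far the action moves."""
    base_action = policy(obs)
    scores = {}
    for region in obs:
        masked = dict(obs)
        masked[region] = baseline  # the intervention: overwrite this region
        scores[region] = action_shift(base_action, policy(masked))
    return scores


def nuisance_mass_ratio(scores, nuisance_regions):
    """Share of total influence that falls on task-irrelevant regions."""
    total = sum(scores.values())
    if total == 0:
        return 0.0
    return sum(scores[r] for r in nuisance_regions) / total


obs = {"object": 1.0, "background": 1.0, "gripper": 1.0}
scores = interventional_scores(policy, obs)
nmr = nuisance_mass_ratio(scores, nuisance_regions={"background"})
```

In this toy setup a ratio near zero would indicate that the policy's actions barely react to the background, while a large ratio would flag reliance on a nuisance factor.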

Physical/Embodied Intelligence LLM/VLM

2026-05-05 low Score 1.5

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Authors: Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo

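One-Sentence Summary

Proposes an RL framework that continually collects experience during robot fleet deployment and post-trains a generalist VLA manipulation policy online.

Summary (translated from Chinese)

Starting from a pretrained VLA policy, LWD closes the loop among real-robot deployment, human intervention, shared experience, policy improvement, and redeployment. The method combines distributional implicit value learning with policy extraction for flow-based VLA action generators, improving success rates on 16 dual-arm robots across multiple real-world long-horizon manipulation tasks. It is an algorithmic framework for continually improving real-world manipulation policies.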

English Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3-5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.

Physical/Embodied Intelligence LLM/VLM

2026-05-05 low Score 1.5

MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

Authors: Xianbo Cai, Hideyuki Ichiwara, Masaki Yoshikawa, Tetsuya Ogata

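One-Sentence Summary

Adds multistage spatial attention and a future-alignment loss to ACT, improving the stability of low-latency fine manipulation.

Summary (translated from Chinese)

MSACT targets low-latency control and visual localization drift in real-world bimanual fine manipulation, introducing task-relevant 2D attention points into the ACT framework as a local spatial modality. It suppresses drift through self-supervised alignment of future attention sequences, requires no keypoint annotations, and is evaluated for success rate, latency, and robustness on simulated and real ALOHA tasks. The work is a direct improvement to visuomotor manipulation policies.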

English Abstract

Real-world fine manipulation, particularly in bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; and vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT with a pretrained ResNet visual prior, a multistage attention module extracts task-relevant 2D attention points as a local spatial modality for action prediction. To maintain consistent object tracking, we introduce a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.

Physical/Embodied Intelligence

2026-05-05 low Score 1.4

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

Authors: Sen Cui, Jingheng Ma

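One-Sentence Summary

Proposes, from a Hamiltonian-dynamics perspective, an approach to modeling controllable and physically consistent world models for robots.

Summary (translated from Chinese)

The paper proposes Hamiltonian World Models, which encode observations into a structured latent phase space and evolve it through dynamics with control, dissipation, and residual terms to produce predicted trajectories usable for planning. It focuses on how to make a world model's future predictions more physically reliable, controllable by actions, and suited to long-horizon decision making. Although closer to a position paper and framework proposal, its subject points directly at world models for robot control.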

English Abstract

World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video-generative models that emphasize visual future synthesis, 3D scene-centric models that emphasize spatial reconstruction, and JEPA-like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action-controllable, and long-horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose Hamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long-horizon stability, while also noting practical challenges in real-world robotic scenes involving friction, contact, non-conservative forces, and deformable objects.
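
To make "Hamiltonian-inspired dynamics with control, dissipation, and residual terms" concrete, here is a small Python sketch on a one-dimensional latent phase space (q, p). It is a generic textbook-style construction, not the paper's model: the quadratic potential, the linear damping term, the additive control `u`, the `residual` callback, and the semi-implicit Euler integrator are all assumptions chosen for illustration.

```python
def potential_grad(q):
    """dV/dq for the assumed potential V(q) = q^2 / 2."""
    return q


def hamiltonian(q, p):
    """Total energy H(q, p) = p^2 / 2 + V(q)."""
    return 0.5 * p * p + 0.5 * q * q


def step(q, p, u=0.0, damping=0.0, residual=lambda q, p: 0.0, dt=0.01):
    """One semi-implicit (symplectic) Euler step.

    dp/dt = -dV/dq - damping * p + u + residual(q, p)
    dq/dt = dH/dp = p
    """
    force = -potential_grad(q) - damping * p + u + residual(q, p)
    p_next = p + dt * force
    q_next = q + dt * p_next  # position update uses the new momentum
    return q_next, p_next


def rollout(q, p, steps, **kwargs):
    """Roll the latent state forward and return the trajectory."""
    trajectory = [(q, p)]
    for _ in range(steps):
        q, p = step(q, p, **kwargs)
        trajectory.append((q, p))
    return trajectory


conservative = rollout(1.0, 0.0, 1000)            # no damping, no control
damped = rollout(1.0, 0.0, 1000, damping=0.5)     # energy should decay
```

With the damping, control, and residual terms switched off, the energy along `conservative` stays close to its initial value, which is the kind of long-horizon stability the abstract associates with Hamiltonian structure; turning on `damping` drains energy as a dissipative system should.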

Physical/Embodied Intelligence

2026-05-05 low Score 1.36

Recovering Hidden Reward in Diffusion-Based Policies

Authors: Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie, Guodong Zhang, Qicheng He, Deyi Ji, Yue Ding, Hongtao Lu

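One-Sentence Summary

EnergyFlow unifies diffusion action policies with inverse reinforcement learning, using a conservative energy field to recover the expert's soft Q-function gradient and a reward signal from the denoising field.

Summary (translated from Chinese)

The paper proposes EnergyFlow, which combines generative action modeling with inverse reinforcement learning by parameterizing a diffusion policy's denoising field as the gradient of a scalar energy function. It proves that, under maximum-entropy optimality, the denoising score recovers the gradient of the expert's soft Q-function, and it analyzes reward identifiability and error propagation. Experiments show strong imitation performance on multiple robot manipulation tasks and an effective reward for downstream reinforcement learning, directly serving diffusion-based manipulation policy algorithms.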

English Abstract

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at this https URL.
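
The link between a score function and a soft Q-function follows from the standard maximum-entropy policy form, pi(a|s) proportional to exp(Q(s,a)/alpha), which gives grad_a log pi = grad_a Q / alpha. The Python sketch below checks this numerically on a toy problem and also shows why a field defined as the gradient of a scalar energy is conservative (zero curl). The quadratic `soft_q`, the temperature `ALPHA`, and the finite-difference helpers are illustrative assumptions, not EnergyFlow's implementation.

```python
ALPHA = 0.5  # assumed entropy temperature


def soft_q(a):
    """Hypothetical expert soft Q-function over a 2-D action."""
    x, y = a
    return -((x - 1.0) ** 2) - 2.0 * ((y + 1.0) ** 2) + 0.5 * x * y


def energy(a):
    """Scalar energy with pi(a) proportional to exp(-energy(a))."""
    return -soft_q(a) / ALPHA


def grad(f, a, h=1e-5):
    """Central finite-difference gradient of a scalar function."""
    x, y = a
    return (
        (f((x + h, y)) - f((x - h, y))) / (2 * h),
        (f((x, y + h)) - f((x, y - h))) / (2 * h),
    )


def score(a):
    """Score of the max-ent policy: grad log pi = -grad energy."""
    gx, gy = grad(energy, a)
    return (-gx, -gy)


def curl(field, a, h=1e-4):
    """2-D curl d(field_y)/dx - d(field_x)/dy via central differences."""
    x, y = a
    dfy_dx = (field((x + h, y))[1] - field((x - h, y))[1]) / (2 * h)
    dfx_dy = (field((x, y + h))[0] - field((x, y - h))[0]) / (2 * h)
    return dfy_dx - dfx_dy


def rotational(a):
    """A non-conservative field for contrast; it is not any scalar's gradient."""
    return (-a[1], a[0])


a = (0.3, 0.2)
sx, sy = score(a)
recovered_q_grad = (ALPHA * sx, ALPHA * sy)  # equals grad soft_q up to FD error
```

Here `curl(score, a)` is numerically zero while `curl(rotational, a)` is 2, which illustrates the structural constraint the abstract refers to: only a conservative field can be integrated back into a scalar function such as a Q-value or reward.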

Physical/Embodied Intelligence

2026-05-05 low Score 1.1

Affordance Agent Harness: Verification-Gated Skill Orchestration

Authors: Haojian Huang, Jiahao Shi, Yinchuan Li, Yingcong Chen

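One-Sentence Summary

Proposes a verification-gated skill orchestration framework that improves the reliability and inference-cost efficiency of affordance grounding in open scenes.

Summary (translated from Chinese)

The paper targets affordance grounding in open-world scenes, where small targets, occlusion, reflections, and visual ambiguity make interaction regions hard to identify. It places heterogeneous skills such as detection, segmentation, and interaction imagination in a closed-loop runtime, where episodic memory, a router, and an affordance verifier invoke skills adaptively according to per-instance difficulty and gate commitments. Experiments show a better accuracy-cost trade-off on multiple affordance benchmarks; the work relates directly to visual affordance recognition and skill selection before robot interaction.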

English Abstract

Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to reuse experience from recurring objects. These failures expose a systems problem: test-time grounding must acquire the right evidence, decide whether that evidence is reliable enough to commit, and do so under bounded inference cost without access to labels. We propose Affordance Agent Harness, a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control, retrieves episodic memories to provide priors for recurring categories, and employs a Router to adaptively select and parameterize skills. An affordance-specific Verifier then gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, triggering targeted retries before a final judge fuses accumulated evidence and trajectories into the prediction. Experiments on multiple affordance benchmarks and difficulty-controlled subsets show a stronger accuracy-cost Pareto frontier than fixed-pipeline baselines, improving grounding quality while reducing average skill calls and latency. Project page: this https URL.
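
The control flow described above (route to a skill, collect evidence, let a verifier decide whether to commit, stop when the budget runs out) can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical loop: the `Skill` dataclass, the cheapest-first `route` rule, the single confidence threshold in `verify`, and the highest-confidence fallback standing in for the final judge are all assumptions, and none of them is taken from the paper's system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Skill:
    name: str
    cost: float
    run: Callable[[str], Tuple[str, float]]  # query -> (prediction, confidence)


def route(skills, tried):
    """Hypothetical router: pick the cheapest skill not yet tried."""
    remaining = [s for s in skills if s.name not in tried]
    return min(remaining, key=lambda s: s.cost) if remaining else None


def verify(confidence, threshold=0.8):
    """Hypothetical verifier: commit only above a confidence threshold."""
    return confidence >= threshold


def ground(query, skills, budget):
    """Call skills until the verifier accepts or the budget is exhausted."""
    evidence, spent, tried = [], 0.0, set()
    while True:
        skill = route(skills, tried)
        if skill is None or spent + skill.cost > budget:
            break
        tried.add(skill.name)
        spent += skill.cost
        prediction, confidence = skill.run(query)
        evidence.append((skill.name, prediction, confidence))
        if verify(confidence):
            return prediction, spent, evidence
    if evidence:
        # Stand-in for a final judge: fall back to the most confident evidence.
        best = max(evidence, key=lambda item: item[2])
        return best[1], spent, evidence
    return None, spent, evidence


skills = [
    Skill("detector", 1.0, lambda q: ("handle-bbox", 0.6)),
    Skill("segmenter", 3.0, lambda q: ("handle-mask", 0.9)),
    Skill("imagination", 8.0, lambda q: ("handle-mask-refined", 0.95)),
]
prediction, cost, trace = ground("open the drawer", skills, budget=10.0)
```

In this example the cheap detector is rejected by the verifier, the segmenter is accepted, and the expensive third skill is never called, which is the accuracy-cost behavior a gated pipeline is meant to produce on easy instances.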

Physical/Embodied Intelligence Agent

2026-05-05 low Score 0.92

Predictive Spatio-Temporal Scene Graphs for Semi-Static Scenes

Authors: Miguel Saavedra-Ruiz, Charlie Gauthier, Kumaraditya Gupta, Shima Shahfar, Kirsty Ellis, Steven Parkison, Liam Paull

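One-Sentence Summary

Proposes spatio-temporal scene graphs that predict changes in semi-static environments, letting robots infer future environment states from past observations.

Summary (translated from Chinese)

The paper addresses spatio-temporal semantic reasoning for robots in repeatedly observed environments, embedding the Bayesian filter Perpetua* in a 3D scene graph to form PredictiveGraphs, which model regular changes in objects and relations over time. The method is validated on simulated and real-world dynamic navigation tasks, where it predicts future environment states and copes with distribution shift. Although not a manipulation policy algorithm itself, it is a world/environment representation for robot decision making and control, relevant to world models in embodied intelligence.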

English Abstract

We have seen tremendous recent progress in our ability to build "spatio-semantic" representations that enable robots to perform complex reasoning across geometry and semantics. However, the vast majority of these methods lack any ability to perform reasoning across time. This is a desirable property in situations where a robot repeatedly observes an environment where instances may change in between observations, but in a structured way. Consider as an example a home environment where the location of a mug typically moves from the cupboard to a countertop to the sink and then back to the cupboard on a daily basis. We should be able to learn this cyclic behavior and use it to predict the state of the mug in the future. In this work, we propose a method that is able to perform this type of tempo-spatio-semantic reasoning. Underpinning the method is a filter, Perpetua$^*$, that performs Bayesian reasoning on the states of the environment that are observed over time. This filter is integrated within a 3D scene graph structure that we call PredictiveGraphs, where nodes represent objects and edges function as Perpetua$^*$ filters encoding spatio-semantic relationships. We validate the method in both simulation and real-world dynamic navigation tasks, where our real world experiments consist of an environment that is undergoing semi-static changes at a bi-hourly frequency over a period of three weeks. In both settings, we demonstrate that our method outperforms baselines in predicting future environment states, even in the presence of distributional shifts.
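
The mug example in the abstract maps naturally onto a discrete Bayes filter with a predict step and an update step. The Python sketch below is a generic filter of that kind and is not Perpetua*: the three states, the transition matrix `T`, and the observation hit rate are made-up values chosen only to mimic the cupboard, countertop, sink cycle.

```python
STATES = ["cupboard", "countertop", "sink"]

# T[i][j] = P(next state = j | current state = i); hypothetical cyclic pattern.
T = [
    [0.2, 0.8, 0.0],  # cupboard   -> mostly countertop
    [0.0, 0.2, 0.8],  # countertop -> mostly sink
    [0.8, 0.0, 0.2],  # sink       -> mostly back to cupboard
]


def predict(belief, steps=1):
    """Propagate the belief forward through the transition model."""
    n = len(STATES)
    for _ in range(steps):
        belief = [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]
    return belief


def update(belief, observed, hit=0.9):
    """Bayes update with a noisy observation of the mug's location."""
    miss = (1.0 - hit) / (len(STATES) - 1)
    likelihood = [hit if s == observed else miss for s in STATES]
    posterior = [l * b for l, b in zip(likelihood, belief)]
    norm = sum(posterior)
    return [p / norm for p in posterior]


belief = [1.0 / 3.0] * 3                 # uniform prior
belief = update(belief, "cupboard")      # robot sees the mug in the cupboard
forecast = predict(belief, steps=1)      # where will it be one period later?
most_likely = STATES[forecast.index(max(forecast))]
```

After one observation in the cupboard, the one-step forecast puts most of the probability mass on the countertop, which is the kind of prediction about an unobserved future state that the abstract describes.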

Physical/Embodied Intelligence

2026-05-05 low Score 0.9

Task-Conditioned Uncertainty Costmaps for Legged Locomotion

Authors: Kartikeya Singh, Christo Aluckal, Romeo Orsolino, Karthik Dantu

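One-Sentence Summary

Proposes task-conditioned uncertainty costmaps for legged robots, enabling more reliable off-road path planning.

Summary (translated from Chinese)

The paper addresses feasible-foothold prediction and path selection for legged robots on unstructured terrain by modeling epistemic uncertainty conditioned on terrain observations and motion commands. This uncertainty identifies regions outside the training distribution and is integrated into a unified costmap-generation framework to support uncertainty-aware planning. Simulated and real-world experiments show improved OOD detection, lower feasibility error, and more reliable planning.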

English Abstract

Legged robots maintain dynamic feasibility through multicontact interactions with terrain. Learned foothold prediction can provide feasibility-aware costs for motion planning and path selection, but accurately predicting future contacts from perceptual inputs such as height scans remains challenging on highly unstructured terrain, even with a repetitive gait cycle. In this work, we show that modeling epistemic uncertainty in predicted footholds, conditioned on terrain observations and commanded motion, distinguishes in-distribution from out-of-distribution operating regimes in simulation and real-world settings. This allows a single learned model, trained on limited data distributions, to express uncertainty caused by missing training coverage. We use this learned uncertainty to detect OOD regions and incorporate them into a unified costmap-generation framework for uncertainty-aware path planning. Using these uncertainty-aware costmaps, we evaluate feasibility error across in-distribution and OOD terrains in simulation and real-world settings. The results show improved OOD detection, up to a 37% reduction in simulation feasibility error, and more reliable planning behavior than geometry-only baselines.
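
One simple way to picture an uncertainty-aware costmap is to add a penalty proportional to model disagreement on top of the predicted cost. The Python sketch below uses ensemble standard deviation as the epistemic-uncertainty estimate; that choice, the toy `models`, the scalar roughness value per cell, and the `risk_weight` are all assumptions for illustration. The abstract does not say how the paper estimates uncertainty, and this sketch omits the conditioning on commanded motion.

```python
import statistics


def ensemble_predict(models, cell):
    """Mean predicted cost and ensemble disagreement for one terrain cell."""
    predictions = [model(cell) for model in models]
    return statistics.fmean(predictions), statistics.pstdev(predictions)


def build_costmap(models, grid, risk_weight=2.0):
    """Cost = mean feasibility cost + risk_weight * epistemic uncertainty."""
    costmap = []
    for row in grid:
        cost_row = []
        for cell in row:
            mean_cost, epistemic = ensemble_predict(models, cell)
            cost_row.append(mean_cost + risk_weight * epistemic)
        costmap.append(cost_row)
    return costmap


# Hypothetical ensemble: members agree for roughness <= 1.0 (the "training
# range") and diverge beyond it, mimicking missing training coverage.
models = [
    lambda r: r,
    lambda r: r + 0.5 * max(0.0, r - 1.0),
    lambda r: r - 0.5 * max(0.0, r - 1.0),
]

grid = [[0.2, 0.8], [1.0, 3.0]]  # terrain roughness per cell (made-up values)
costmap = build_costmap(models, grid)
```

Cells inside the range where the ensemble agrees keep their predicted cost, while the out-of-range cell is penalized further, so a planner minimizing this cost is steered away from terrain the model has not seen.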

Physical/Embodied Intelligence

2026-05-04 high Heat 2 Score 2.55

E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

Authors: Kaiyan Zhao, Borong Zhang, Yiming Wang, Xingyu Liu, Xuetao Li, Yuyang Chen, Xiaoguang Niu

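One-Sentence Summary

Uses experience-aware sampling to improve the sample efficiency and exploration quality of the Decision Transformer in long-horizon robot manipulation.

Summary (translated from Chinese)

The paper proposes E$^2$DT, which combines the Decision Transformer with k-DPP experience sampling to account for both trajectory quality and diversity in reinforcement learning for robot manipulation. The method uses DT's latent representations to measure differences between windows, then combines return, predictive uncertainty, and stage coverage into a joint quality-diversity kernel that prioritizes more informative experience. Experiments show it outperforms prior methods on simulated and real-robot long-horizon manipulation tasks; it fits under robot manipulation policy learning.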

English Abstract

In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT's performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E$^2$DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E$^2$DT to be both efficient, by prioritizing sampling quality, such as high-return, high-uncertainty, and underrepresented trajectories, and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT's internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage based on inverse frequency. These two dimensions are integrated into a novel quality-diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E$^2$DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.
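
A quality-diversity kernel of the kind described above is commonly written as L[i][j] = q_i * S[i][j] * q_j, where q is a per-item quality score and S a similarity between embeddings; subsets with a large determinant of the corresponding submatrix are both high-quality and diverse. The Python sketch below builds such a kernel and selects k items greedily by determinant. Greedy selection is a simple stand-in for true k-DPP sampling, and the scalar qualities, the 2-D embeddings, and cosine similarity are illustrative assumptions, not the paper's composite metric or DT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_kernel(quality, embeddings):
    """L[i][j] = q_i * similarity(i, j) * q_j."""
    n = len(quality)
    return [
        [quality[i] * cosine(embeddings[i], embeddings[j]) * quality[j] for j in range(n)]
        for i in range(n)
    ]


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, result = len(m), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return result


def greedy_select(L, k):
    """Greedily add the item that maximizes det of the selected submatrix."""
    chosen = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(L)):
            if i in chosen:
                continue
            idx = chosen + [i]
            value = det([[L[a][b] for b in idx] for a in idx])
            if value > best_det:
                best, best_det = i, value
        chosen.append(best)
    return chosen


quality = [1.0, 0.95, 0.6]                          # made-up window qualities
embeddings = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]  # windows 0 and 1 are near-duplicates
picked = greedy_select(build_kernel(quality, embeddings), k=2)
```

In this toy case the second pick is the lower-quality but dissimilar window rather than the near-duplicate of the first, which is the trade-off between quality and diversity that the kernel is meant to encode.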

uncategorized

2026-05-04 high Heat 2 Score 2.33

Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

Authors: Yajvan Ravan, Adam Rashid, Alan Yu, Kai McClennen, Gio Huh, Kevin Yang, Zhutian Yang, Qinxi Yu, Xiaolong Wang, Phillip Isola, Ge Yang

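One-Sentence Summary

Lucid-XR builds a data engine for robotic manipulation from XR interaction, physics simulation, and language-controllable video generation, training visual policies on synthetic data that transfer zero-shot.

Summary (translated from Chinese)

The paper proposes Lucid-XR, a multimodal generative data engine for training real-world robotic manipulation that combines on-headset physics simulation, human-to-robot pose retargeting, and physics-guided video generation. The system can augment data through natural-language specifications and trains robot visual policies on purely synthetic data. Experiments show zero-shot transfer to unseen, cluttered, and low-light environments, covering dexterous manipulation, soft materials, and contact-rich tasks; the work aligns directly with data and training for robot visual policies.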

English Abstract

We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking multi-modal data to train real-world robotic systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with human-to-robot pose retargeting. Data collected is further amplified by a physics-guided video generation pipeline steerable via natural language specifications. We demonstrate zero-shot transfer of robot visual policies to unseen, cluttered, and badly lit evaluation environments, after training entirely on Lucid-XR's synthetic data. We include examples across dexterous manipulation tasks that involve soft materials, loosely bound particles, and rigid body contact. Project website: https://lucidxr.github.io

uncategorized

2026-05-04 low Heat 2 Score 1.21

World Model for Robot Learning: A Comprehensive Survey

Authors: Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, Jianfei Yang

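One-Sentence Summary

A systematic survey of world-model paradigms, functional roles, and evaluation practice in robot learning.

Summary (translated from Chinese)

This survey systematically reviews world models from a robot-learning perspective, covering their roles in policy learning, planning, simulation, evaluation, and data generation, and summarizing the progression from imagination-based video generation to controllable, structured, and foundation-scale world models. It also connects to navigation and autonomous driving and summarizes representative datasets, benchmarks, and evaluation protocols. It is a useful reference for organizing knowledge on the robot policy, world model, and embodied intelligence topics you currently follow.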

English Abstract

World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, data generation, and have advanced rapidly with the rise of foundation models and large-scale video generation. However, the literature remains fragmented across architectures, functional roles, and embodied application domains. To address this gap, we present a comprehensive review of world models from a robot-learning perspective. We examine how world models are coupled with robot policies, how they serve as learned simulators for reinforcement learning and evaluation, and how robotic video world models have progressed from imagination-based generation to controllable, structured, and foundation-scale formulations. We further connect these ideas to navigation and autonomous driving, and summarize representative datasets, benchmarks, and evaluation protocols. Overall, this survey systematically reviews the rapidly growing literature on world models for robot learning, clarifies key paradigms and applications, and highlights major challenges and future directions for predictive modeling in embodied agents. To facilitate continued access to newly emerging works, benchmarks, and resources, we will maintain and regularly update the accompanying GitHub repository alongside this survey.

uncategorized
