GigaBrain-0.5M*: A VLA That Learns From World Model-Based Reinforcement Learning

Abstract

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure.

Pre-training Data Distribution

GigaBrain-0.5 is pretrained on 10,931 hours of diverse visual experience, with 61% (6,653 hours) synthesized by our world model GigaWorld to enable scalable coverage of novel textures, viewpoints, and object configurations. The remaining 39% (4,278 hours) comes from real-robot data collected from our proprietary fleet and public benchmarks. Together, this forms a balanced curriculum: imagination at scale, grounded in sensor truth.

dataset samples
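As a concrete illustration, the mixture above can be expressed as per-source sampling weights proportional to hours. This is a minimal sketch only; the dataset names and loader interface below are illustrative assumptions, not the actual training pipeline.

```python
import random

# Hours taken from the text above; source names are hypothetical labels.
MIXTURE_HOURS = {
    "gigaworld_synthetic": 6_653,  # 61% world-model-generated experience
    "real_robot": 4_278,           # 39% fleet + public-benchmark data
}

total = sum(MIXTURE_HOURS.values())
weights = {name: hours / total for name, hours in MIXTURE_HOURS.items()}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training batch, proportional to hours."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
print(weights)             # {'gigaworld_synthetic': ~0.609, 'real_robot': ~0.391}
print(sample_source(rng))  # e.g. 'gigaworld_synthetic'
```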

Model Architecture

GigaBrain-0.5M* is a world model-conditioned VLA trained via world model-based reinforcement learning. Pretrained on multimodal, robot manipulation, and web video data, it supports self-improvement through human-in-the-loop (HIL) rollouts that generate diverse training data for continual training.

VLA architecture
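To make the conditioning concrete, here is a minimal PyTorch-style sketch of an action head that consumes the current observation features together with world-model outputs (a predicted future-state latent and a value estimate). All module names, dimensions, and the fusion scheme are assumptions for illustration; this is not a reproduction of the actual GigaBrain-0.5M* architecture.

```python
import torch
import torch.nn as nn

class WorldModelConditionedPolicy(nn.Module):
    """Hypothetical sketch of an action head conditioned on world-model outputs."""

    def __init__(self, obs_dim=1024, wm_latent_dim=256, action_dim=14, chunk=16):
        super().__init__()
        # Fuse observation features, predicted future-state latent, and value estimate.
        self.fuse = nn.Sequential(
            nn.Linear(obs_dim + wm_latent_dim + 1, 512), nn.GELU(),
            nn.Linear(512, 512), nn.GELU(),
        )
        self.action_head = nn.Linear(512, action_dim * chunk)
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, obs_feat, future_latent, value_pred):
        # obs_feat:      (B, obs_dim) features from the VLA backbone
        # future_latent: (B, wm_latent_dim) world-model prediction of the future state
        # value_pred:    (B, 1) world-model value estimate
        h = self.fuse(torch.cat([obs_feat, future_latent, value_pred], dim=-1))
        return self.action_head(h).view(-1, self.chunk, self.action_dim)
```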

Reinforcement Learning via World Model-Conditioned Policy

The RAMP (Reinforcement leArning via world Model-conditioned Policy) framework follows an iterative four-stage training paradigm:

  1. The world model is pretrained on large-scale robot manipulation data to forecast future states and their associated values.
  2. The policy is fine-tuned by conditioning action selection on the world model's predicted futures and value estimates.
  3. The conditioned policy is deployed in physical environments to collect rollout trajectories under human-in-the-loop intervention.
  4. Both the world model and the policy are jointly refined using the curated rollout dataset.

This iterative training paradigm enables continual learning and self-improvement; a minimal sketch of the outer loop is shown below.
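The following sketch restates the four stages as a training loop. The function names and signatures are placeholders assumed for illustration, not the released training code.

```python
from typing import Any, Callable

def ramp_loop(
    pretrain_world_model: Callable[[], Any],
    finetune_policy: Callable[[Any], Any],
    collect_hil_rollouts: Callable[[Any, Any], list],
    refine: Callable[[Any, list], Any],
    num_iterations: int = 3,
):
    """Outer loop of the four-stage RAMP paradigm (all callables are placeholders)."""
    world_model = pretrain_world_model()               # 1. pretrain world model (futures + value)
    policy = finetune_policy(world_model)              # 2. world-model-conditioned policy fine-tuning
    for _ in range(num_iterations):
        rollouts = collect_hil_rollouts(policy, world_model)  # 3. HIL rollouts on real robots
        world_model = refine(world_model, rollouts)           # 4. joint refinement on curated data
        policy = refine(policy, rollouts)
    return policy, world_model
```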

RAMP is inspired by RECAP in $\pi^*_{0.6}$: both approaches condition the VLA model on additional information. However, RECAP uses only sparse advantages (0 or 1) as input, which provides limited information gain. In contrast, RAMP leverages future states predicted by a well-pretrained world model, yielding substantially richer conditioning signals. We further provide a theoretical analysis showing that RECAP is a special case of RAMP.
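One informal way to see the special-case relationship: if RAMP's conditioning signal (future-state latent plus value) is collapsed by dropping the latent and thresholding the value into a single 0/1 bit, what remains is an advantage-conditioned policy in the spirit of RECAP. The sketch below only illustrates this intuition; tensor shapes and the thresholding rule are assumptions, not the paper's formal analysis.

```python
import torch

def ramp_condition(future_latent: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # RAMP-style signal: (B, D) future-state latent concatenated with (B, 1) value.
    return torch.cat([future_latent, value], dim=-1)

def recap_condition(value: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    # RECAP-style signal: the future latent is dropped and the value is
    # thresholded against a baseline, leaving a sparse 0/1 advantage.
    return (value > baseline).float()

B, D = 4, 256
c_ramp = ramp_condition(torch.randn(B, D), torch.randn(B, 1))    # shape (4, 257)
c_recap = recap_condition(torch.randn(B, 1), torch.zeros(B, 1))  # shape (4, 1), values in {0, 1}
```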

Benchmarking Pre-training

GigaBrain-0.5M* Performance

GigaBrain-0.5M* achieves near-perfect success rates on Box Packing, Espresso Preparation, and Laundry Folding, consistently completing these tasks across consecutive runs.


GigaBrain-0.5 Performance

To evaluate performance on physical robots, we collect task-specific demonstrations on the target robot platform and post-train the model for each task. We evaluate on eight internal tasks and observe consistent improvements over prior policies (GigaBrain-0, $\pi_0$, $\pi_{0.5}$).

Internal task evaluation results

On the public RoboChallenge benchmark, an intermediate model (GigaBrain-0.1) ranks first on the leaderboard as of February 9, 2026, achieving an average success rate of 51.67% (9 percentage points higher than $\pi_{0.5}$ at 42.67%).

RoboChallenge leaderboard snapshot

Value Prediction Performance

Our analysis highlights three takeaways; a sketch of how the accuracy metrics are computed follows the list:

  • VLM-based: highest latency (0.32 s/frame on an A800 GPU), dominated by the SigLIP visual encoder.
  • WM-based (value only): fastest inference (0.11 s/frame) but lower accuracy (MAE = 0.0838, Kendall = 0.7288).
  • WM-based (state+value): best accuracy (MAE = 0.0621, Kendall = 0.8018) with competitive speed (0.25 s/frame).
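For reference, MAE and Kendall's tau between predicted and ground-truth values can be computed as below. The arrays here are illustrative placeholders, not our evaluation data.

```python
import numpy as np
from scipy.stats import kendalltau

# Illustrative values; in practice these would be predicted vs. ground-truth
# task-progress values over evaluation frames.
value_true = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
value_pred = np.array([0.05, 0.18, 0.45, 0.55, 0.82, 0.97])

mae = np.mean(np.abs(value_pred - value_true))        # mean absolute error
tau, _ = kendalltau(value_pred, value_true)            # rank correlation, higher is better
print(f"MAE = {mae:.4f}, Kendall tau = {tau:.4f}")
```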

Qualitative value prediction visualizations are shown below.

Qualitative value prediction visualization

World Model Conditioning for Policy Learning

Incorporating the world model yields substantial performance gains across tasks, with improvements consistently observed throughout training from 5,000 to 20,000 steps. The benefit is particularly pronounced in the multi-task setting, where the success-rate gap widens progressively and reaches about 30% on tasks such as Box Packing at 20,000 steps. This suggests that world model conditioning facilitates knowledge transfer across tasks while preserving strong single-task performance.

Success rates with and without world model conditioning

Comparison with RL Baselines

We benchmark RAMP against state-of-the-art RL baselines:

  • GigaBrain-0.5 + AWR: online fine-tuning with weighted imitation learning on policy rollouts (a sketch of the AWR objective follows this list).
  • GigaBrain-0.5 + RECAP: an advantage-conditioned offline RL baseline without future-state prediction.
  • GigaBrain-0.5 + RAMP (GigaBrain-0.5M*): conditions the policy on predicted values and future-state latents for long-horizon tasks.
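For context on the first baseline, advantage-weighted regression (AWR) performs imitation on rollout actions weighted by exponentiated advantages. This is a minimal generic sketch of that objective; the temperature, clipping, and policy interface are assumptions, not our baseline's exact configuration.

```python
import torch

def awr_loss(log_prob_actions: torch.Tensor, advantages: torch.Tensor, beta: float = 1.0):
    # log_prob_actions: (B,) log pi(a_t | s_t) for actions taken in the rollouts
    # advantages:       (B,) estimated advantages of those actions
    weights = torch.exp(advantages / beta).clamp(max=20.0)  # exponential advantage weights
    return -(weights.detach() * log_prob_actions).mean()

# Usage: minimize awr_loss(policy.log_prob(obs, act), advantage_estimate) with SGD.
```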

RAMP achieves near-perfect success on Box Packing, Espresso Preparation, and Laundry Folding, outperforming all baselines; its gains on Box Packing and Espresso Preparation exceed RECAP by about 30 percentage points. GigaBrain-0.5M* also transfers reliably to real-world deployment, as shown in the first videos on this page.

Task success rates of RAMP and RL baselines