

Lumo-1 is a large-scale VLA model that connects vision, language, and physical action. It can generalize to new objects, environments, and instructions - even those involving abstract or indirect descriptions - and can be efficiently adapted to new tasks, including those requiring extended reasoning or precise manipulation.
These capabilities are enabled by first inheriting strong multi-modal reasoning from existing vision-language models (VLMs), and then progressively extending this competence toward grounded reasoning about the physical world and real-world action execution. The training procedure follows a structured three-stage process:
We introduce a spatial action tokenizer that represents motion sequences in a compact form, generalizing across different robot embodiments and spatial contexts while being more efficient than FAST or binning-based tokenization methods. This representation maintains the spatial meaning of actions while reducing irrelevant variations introduced during data collection.

(a) Robot trajectories are decomposed into the shortest subsequence of states (waypoints) that stays within an acceptable reconstruction error budget. (b) The motion token library is constructed by clustering delta actions from a large-scale, diverse dataset, with rotation and translation processed independently. During training, at each timestep one of the top-3 tokens closest to the next waypoint is randomly selected from the motion token library; the selected token then serves as the reference for determining the subsequent token. (c) shows the probability densities of delta actions derived from a diverse robot trajectory dataset, projected onto 2D planes.
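The decomposition and quantization steps in (a) and (b) can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the linear-interpolation error criterion, and the flat token library are hypothetical, since the report does not specify these details (and in Lumo-1 rotation and translation are clustered separately).

```python
import numpy as np

def decompose_waypoints(states, err_budget=0.01):
    """Greedily keep the shortest subsequence of states (waypoints) whose
    linear interpolation reconstructs the trajectory within the error budget.
    `states`: (T, D) array of robot states. Criterion is an assumption."""
    waypoints = [0]
    start = 0
    for end in range(2, len(states)):
        seg = states[start:end + 1]
        t = np.linspace(0.0, 1.0, len(seg))[:, None]
        interp = (1 - t) * seg[0] + t * seg[-1]
        if np.abs(seg - interp).max() > err_budget:
            waypoints.append(end - 1)  # last index that still fit the budget
            start = end - 1
    waypoints.append(len(states) - 1)
    return states[waypoints]

def tokenize_deltas(waypoints, token_library, top_k=3, rng=None):
    """Map delta actions between consecutive waypoints to motion tokens.
    At each step one of the top-k nearest library tokens is sampled, and the
    chosen token becomes the reference for the next step (chained reference)."""
    rng = rng or np.random.default_rng(0)
    ref = waypoints[0]
    tokens = []
    for wp in waypoints[1:]:
        delta = wp - ref
        dists = np.linalg.norm(token_library - delta, axis=1)
        candidates = np.argsort(dists)[:top_k]
        tok = int(rng.choice(candidates))
        tokens.append(tok)
        ref = ref + token_library[tok]
    return tokens
```

Sampling among the top-k nearest tokens (rather than always the nearest) adds stochasticity that can help smooth out idiosyncrasies introduced during data collection.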
Rather than relying on trajectory memorization, we foster a structured reasoning process that enables purposeful and context-aware action generation. The model engages in multiple forms of embodied textual reasoning:
The model further performs visual reasoning for perception-grounded inference and motion estimation. Finally, action trajectory generation is formulated as waypoint prediction over the action horizon, aligning 2D visual understanding with continuous downstream control in physical space.
We apply Reinforcement Learning (RL) to further refine embodied reasoning and enhance the coordination between high-level reasoning and low-level motor actions. The optimization is guided by multiple reward signals, including visual reward, reasoning–action consistency reward, action execution reward, and reasoning format reward, collectively encouraging coherent and physically grounded decision-making.
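How the four reward signals are combined is not specified in the text; a common choice is a weighted sum, sketched below. The dictionary keys, weights, and values here are purely illustrative assumptions.

```python
def combine_rewards(rewards, weights):
    """Hypothetical weighted sum of per-component reward signals.
    `rewards` and `weights` are dicts keyed by reward name; the
    weighting scheme is an assumption, not Lumo-1's documented method."""
    return sum(weights[k] * rewards[k] for k in rewards)

# Illustrative values only: one sample's component rewards and weights.
sample = {"visual": 0.8, "consistency": 1.0, "action": 0.6, "format": 1.0}
w = {"visual": 1.0, "consistency": 1.0, "action": 2.0, "format": 0.5}
total = combine_rewards(sample, w)  # 0.8 + 1.0 + 1.2 + 0.5 = 3.5
```

Separating the reward into named components makes it straightforward to re-balance, e.g., reasoning–action consistency against raw action execution during RL tuning.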
We evaluate Lumo-1 on the generalizable pick-and-place task under four settings consistent with GR3: (1) Basic, (2) Unseen Environments, (3) Unseen Instructions, and (4) Unseen Objects. Across all settings, Lumo-1 consistently outperforms baseline models, with particularly strong gains on challenging cases involving novel instructions and previously unseen objects.

Following Stage 1 pre-training, Lumo-1 outperforms its backbone model (Qwen2.5-VL-7B-Instruct) on 6 out of 7 benchmarks and surpasses the specialized embodied models RoboBrain-7B-2.0 and Robix-7B on most tasks. These results highlight Lumo-1's strong capabilities in object localization, spatial reasoning, and fine-grained visual understanding. Moreover, these competencies remain largely intact after Stage 2 co-training on diverse robot trajectories, demonstrating that incorporating action learning does not compromise the model's core multimodal perception and reasoning abilities.

As shown in the table below, the RL-trained model consistently achieves higher reward values than the Stage 3 model across nearly all evaluation metrics and reasoning modes. In particular, under the full reasoning mode, the RL model shows notable improvements in localizing key areas, reflected in the bounding-box, waypoint, and action rewards.

We also introduce the Net Superiority Rate (NSR) metric, defined as the number of instances where the RL-trained model outperforms the Stage 3 model minus the number of instances where the Stage 3 model outperforms the RL-trained model, normalized by the total number of comparable instances. The table below shows that NSR values are consistently positive, indicating overall superiority of the RL-trained model, with the largest gains observed in waypoint and action rewards. These results demonstrate that the RL training phase effectively enhances model performance, particularly in trajectory planning and action execution.
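The NSR definition above translates directly into code. A minimal sketch, assuming per-instance scalar scores for the two models (the function and argument names are illustrative):

```python
def net_superiority_rate(rl_scores, base_scores):
    """NSR = (#(RL wins) - #(baseline wins)) / #(comparable instances).
    Ties contribute to the denominator but to neither win count."""
    assert len(rl_scores) == len(base_scores), "scores must be paired"
    wins = sum(r > b for r, b in zip(rl_scores, base_scores))
    losses = sum(b > r for r, b in zip(rl_scores, base_scores))
    return (wins - losses) / len(rl_scores)

# Illustrative values: 2 wins, 1 loss, 1 tie over 4 instances -> 0.25
nsr = net_superiority_rate([0.9, 0.8, 0.5, 0.7], [0.6, 0.8, 0.6, 0.5])
```

NSR ranges over [-1, 1]; positive values mean the RL-trained model wins more comparisons than it loses.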

We adopt the Data-Constrained Scaling Law (Muennighoff et al., 2023), which models the effective contributions of data and parameters using an exponential decay formulation, where the value of a data token diminishes by roughly \(\left( 1 - e^{-R_D / R_D^*} \right)\) per repetition. Under the assumption of a fixed model size, the scaling law can be further simplified as:
$$L(D) = \frac{B}{D'^\beta} + E;$$
$$D' = U_D + U_D R_D^* \left(1 - e^{-R_D / R_D^*}\right)$$
where E is the asymptotic lower bound of the loss, B controls the initial loss magnitude, and β is the scaling exponent. \(U_D\) denotes the amount of unique data, \(R_D\) is the number of data repetitions, and \(R_D^*\) is a learned decay constant that characterizes the diminishing marginal utility of repeated data.
Our key observations include: (1) Scaling Law Validity: loss predictions from the Data-Constrained Scaling Law closely match observed values, confirming its applicability to data-constrained robotic learning. (2) Data Diversity Necessity: policies trained without augmentations perform poorly under real-world variations, highlighting the importance of diverse training data. Training with broad diversity, including prompt and image augmentations, improves resilience to validation perturbations and reduces loss on fully out-of-domain data (e.g., novel backgrounds, scenes, and objects).