RotVLA

Abstract

We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.

Method

RotVLA combines a pretrained VLM backbone, a latent action model, and a unified action expert. The key idea is to represent actions using continuous rotations, preserving geometric structure and continuity while enabling learning across diverse embodiments and data sources.

Stage I · Continuous Rotational Latent Action

Rather than the discrete latent codes used by prior LAMs, RotVLA represents each latent action as an element of the rotation group SO(n). This preserves continuity (using SoftVQ rather than hard quantization) and makes action composition meaningful: the product of two rotation matrices is again a rotation, mirroring how real-world motions compose.
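The group properties this relies on can be checked numerically. The sketch below is only an illustration: the helper names (random_rotation, is_rotation), the dimension n = 4, and the QR-based sampling are assumptions made for the example, not the parameterization RotVLA uses.

```python
import numpy as np

def random_rotation(n, rng):
    """Sample an n x n rotation matrix (orthogonal, determinant +1)."""
    # QR decomposition of a Gaussian matrix yields an orthogonal factor q.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Normalize column signs so the factorization is unique.
    q = q * np.sign(np.diag(r))
    # Flip one column if needed so det(q) = +1 (rotation, not reflection).
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def is_rotation(m, tol=1e-8):
    """Check membership in SO(n): m^T m = I and det(m) = 1."""
    n = m.shape[0]
    orthogonal = np.allclose(m.T @ m, np.eye(n), atol=tol)
    return bool(orthogonal and np.isclose(np.linalg.det(m), 1.0, atol=tol))

rng = np.random.default_rng(0)
r1 = random_rotation(4, rng)   # hypothetical latent action for step 1
r2 = random_rotation(4, rng)   # hypothetical latent action for step 2

composed = r2 @ r1             # apply r1 first, then r2 (ordering is an assumption)
identity = np.eye(4)           # the "no motion" element
inverse = r1.T                 # the inverse of a rotation is its transpose
```

Composition, identity, and inverse all stay inside SO(n), which is what lets a product of single-step latents act as a multi-step latent.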

To prevent the LAM from degenerating into trivial frame reconstruction, we introduce a triplet learning objective. Given three consecutive frames I_t, I_{t+1}, I_{t+2}, the model extracts two single-step latent actions and composes them via matrix multiplication to predict the two-step transition. This compositional supervision forces the encoder to capture true motion dynamics rather than copying frame appearance. The resulting latent actions generalize zero-shot to unseen datasets and embodiments.
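A toy version of the triplet objective can be sketched as follows. Everything here is an assumption made for illustration: frames are stand-in feature vectors, the decoder simply rotates a vector, the three terms are weighted equally, and rotations are built with a Cayley transform. The actual LAM encoder, decoder, and loss weighting are not specified above.

```python
import numpy as np

def cayley_rotation(a):
    """Map a square matrix to SO(n) via the Cayley transform of its skew part."""
    skew = a - a.T
    eye = np.eye(a.shape[0])
    # (I - S)^{-1} (I + S) is a rotation for any skew-symmetric S.
    return np.linalg.solve(eye - skew, eye + skew)

def decode(frame, rotation):
    """Toy decoder (hypothetical): a latent action rotates a frame feature vector."""
    return rotation @ frame

def triplet_loss(f_t, f_t1, f_t2, r_a, r_b):
    """Two single-step terms plus one composed two-step term."""
    step_1 = np.mean((decode(f_t, r_a) - f_t1) ** 2)
    step_2 = np.mean((decode(f_t1, r_b) - f_t2) ** 2)
    # The composed latent r_b @ r_a must carry f_t directly to f_t2.
    composed = np.mean((decode(f_t, r_b @ r_a) - f_t2) ** 2)
    return float(step_1 + step_2 + composed)

rng = np.random.default_rng(0)
n = 6
true_a = cayley_rotation(rng.standard_normal((n, n)))
true_b = cayley_rotation(rng.standard_normal((n, n)))
f_t = rng.standard_normal(n)
f_t1 = true_a @ f_t
f_t2 = true_b @ f_t1

loss_true = triplet_loss(f_t, f_t1, f_t2, true_a, true_b)
loss_trivial = triplet_loss(f_t, f_t1, f_t2, np.eye(n), np.eye(n))
```

In this toy setup the true single-step rotations drive all three terms to numerical zero, while an identity ("no motion") latent is penalized whenever the frames differ.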

Stage II · RotVLA Pretraining

RotVLA pairs a pretrained InternVL3.5-1B VLM backbone with a flow-matching Diffusion Transformer (DiT) action expert. Given consecutive frames and a language instruction, RotVLA is pretrained to predict the latent action. Pretraining uses over 1,700 hours of cross-embodiment robot data (Open X-Embodiment, AGIBOT, RoboMIND, RoboCOIN) and egocentric human videos (Ego4D).
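For readers unfamiliar with flow matching, a minimal sketch of the training target and sampler is given below. It assumes the common linear (rectified-flow style) interpolation path and a plain Euler sampler; the path, the sampler, the function names, and the 8-dimensional placeholder target are all assumptions, since the text above only states that a flow-matching head is used.

```python
import numpy as np

def flow_matching_pair(target, noise, t):
    """Point on the straight path from noise (t=0) to target (t=1), and its velocity."""
    x_t = (1.0 - t) * noise + t * target
    velocity = target - noise
    return x_t, velocity

def flow_matching_loss(predict_velocity, target, cond, rng):
    """Mean squared error between predicted and target velocity at a random time."""
    noise = rng.standard_normal(target.shape)
    t = rng.uniform(0.0, 1.0)
    x_t, velocity = flow_matching_pair(target, noise, t)
    return float(np.mean((predict_velocity(x_t, t, cond) - velocity) ** 2))

def euler_sample(predict_velocity, noise, cond, steps=10):
    """Integrate the learned velocity field from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * predict_velocity(x, i * dt, cond)
    return x

rng = np.random.default_rng(0)
target = rng.standard_normal(8)   # placeholder for a latent-action target
noise = rng.standard_normal(8)

def oracle(x, t, cond):
    # Exact velocity for a single known target; stands in for the trained head.
    return (target - x) / (1.0 - t)

sample = euler_sample(oracle, noise, cond=None, steps=10)
```

In the real model the oracle would be replaced by the DiT head, with cond carrying the vision-language features from the backbone.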

Stage III · RotVLA Finetuning

For downstream manipulation, the flow-matching head is extended into a unified action expert that jointly denoises latent actions and robot actions within a single diffusion process. A structured attention mechanism ensures that latent action tokens attend only to vision-language tokens, while robot action tokens attend to both, allowing latent actions to serve as a latent planner that conditions embodiment-specific control.
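The attention pattern described above can be written as a boolean mask. This is a sketch under stated assumptions: the token ordering [vision-language | latent action | robot action], the function name, and the choices that tokens attend within their own group and that vision-language tokens attend only to each other are not given in the text.

```python
import numpy as np

def structured_attention_mask(num_vl, num_latent, num_robot):
    """Boolean mask where mask[q, k] is True if query token q may attend to key token k."""
    total = num_vl + num_latent + num_robot
    vl = slice(0, num_vl)
    latent = slice(num_vl, num_vl + num_latent)
    robot = slice(num_vl + num_latent, total)

    mask = np.zeros((total, total), dtype=bool)
    mask[vl, vl] = True          # assumption: VL tokens attend only to VL tokens
    mask[latent, vl] = True      # latent actions read vision-language context
    mask[latent, latent] = True  # assumption: latent tokens also attend to each other
    mask[robot, vl] = True       # robot actions read vision-language context
    mask[robot, latent] = True   # robot actions read the latent plan
    mask[robot, robot] = True    # assumption: robot tokens attend to each other
    return mask

mask = structured_attention_mask(num_vl=3, num_latent=2, num_robot=2)
```

The key property is the zero block from latent queries to robot keys: information flows from the latent plan to the robot actions but not back, so the latent tokens can act as a planner.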

Figure. Illustration of existing LAMs (a) and RotVLA (b).

Results

LIBERO: 98.2%

RoboTwin2.0 (Clean): 89.6%

RoboTwin2.0 (Randomized): 88.5%

Model Parameters: 1.7B

Simulation benchmark and qualitative results.

Table. Evaluation on simulation benchmarks (LIBERO and RoboTwin2.0).

Real-World Experiments

Task 1: Pick up the block and place it on the plate (1x speed).

Task 2: Put the block into the drawer and close the drawer (1x speed).

Task 3: Put the yellow cup into the red cup, then place the green cup into the yellow cup (1x speed).

Latent Action Analysis

We analyze the learned latent actions from three complementary perspectives: reconstruction and compositional consistency, cross-domain generalizability, and representation expressiveness.

Reconstruction and compositional consistency. We visualize the LAM's reconstruction quality and compositional behavior across three datasets: AGIBOT, RoboMIND-UR, and BC-Z. Given three consecutive frames I_t, I_{t+1}, I_{t+2}, the model produces single-step reconstructions Î_{t+1} and Î_{t+2}, as well as a composed prediction Î^comp_{t+2} obtained by applying the composed latent action directly to I_t. The composed prediction closely matches I_{t+2}, demonstrating that the SO(n) latent space supports meaningful action composition via matrix multiplication.

Cross-domain generalization. A key desideratum for latent action representations is that motion semantics should transfer across embodiments and datasets. We extract latent actions from one domain and directly apply them to reconstruct frames in other domains, including seen datasets (RoboCOIN, RoboMIND) and the unseen LIBERO benchmark not used during training. Latent actions such as move right, move left, and stay still consistently produce plausible reconstructions across all domains.

Representation expressiveness (LARY benchmark). We evaluate the quality of latent action representations via linear probing on the LARY benchmark, which tests both regression of low-level robot actions and classification of high-level semantic actions. RotVLA substantially outperforms prior LAMs across all four robot datasets, achieving the lowest regression MSE and highest classification accuracy. Even RotVLA*, whose LAM pretraining excludes overlapping datasets (AGIBOT, RoboCOIN, Ego4D), significantly surpasses competing methods, confirming that the continuous rotational representation captures richer dynamic structure than discrete latent formulations.
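Linear probing in general means fitting only a linear map on top of frozen features. The sketch below shows the regression variant with a ridge-regularized least-squares probe. The function names, the ridge term, the train/test split, and the synthetic placeholder data are assumptions for illustration and do not reproduce the LARY protocol or any reported number.

```python
import numpy as np

def add_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_probe(features, targets, ridge=1e-6):
    """Closed-form ridge regression on frozen features."""
    x = add_bias(features)
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def probe_mse(weights, features, targets):
    """Mean squared error of the linear probe on held-out data."""
    return float(np.mean((add_bias(features) @ weights - targets) ** 2))

# Synthetic placeholder data: "latent features" whose targets are exactly linear in them.
rng = np.random.default_rng(0)
latents = rng.standard_normal((200, 16))      # stand-in for frozen latent actions
true_map = rng.standard_normal((16, 7))       # stand-in for a 7-DoF action readout
actions = latents @ true_map + 0.5

train, test = slice(0, 150), slice(150, 200)
weights = fit_linear_probe(latents[train], actions[train])
mse = probe_mse(weights, latents[test], actions[test])
```

A low held-out MSE indicates that the target actions are linearly decodable from the frozen representation, which is the sense of "expressiveness" measured by probing.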

BibTeX

@article{rotvla2026,
  title     = {RotVLA: Rotational Latent Action for Vision-Language-Action Model},
  author    = {Qiwei Li and Xicheng Gong and Xinghang Li and Peiyan Li and Quanyun Zhou and Hangjun Ye and Jiahuan Zhou and Yadong Mu},
  journal   = {arXiv preprint arXiv:2605.13403},
  year      = {2026}
}