If you didn’t realise that Xiaomi is also involved in the robotics space, you haven’t been paying attention. The company announced its CyberOne humanoid robot back in 2022, but it’s been fairly quiet on that front until now.
The Chinese company has just unveiled a new Vision-Language-Action (VLA) model for training robots, one it reckons is “optimized for high performance and fast and smooth real-time execution.” The really cool part is that it’s making said model open-source.
Xiaomi: how to train a robot
The VLA model has an actual name, of course. The training package is called Xiaomi-Robotics-0, and it’s designed to decouple a robot’s movement and ‘thought’ systems. The idea is that this allows for better, more natural movement, since slower cognitive processing isn’t getting in the way of the motor loop. How well that works… well, you can download the model and test it for yourself.
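If you want a rough picture of what that decoupling looks like in practice, here’s a minimal Python sketch of the general pattern: a slow reasoning loop feeding a fast control loop. To be clear, this is our illustration of the concept, not Xiaomi’s actual code; every class and parameter here (SlowReasoner, FastController, the loop rates) is hypothetical.

```python
import time

# Hypothetical stand-ins -- Xiaomi hasn't published its API in this form.
class SlowReasoner:
    """The 'thought' system: a large vision-language model that runs at a
    low rate and turns camera frames plus an instruction into a plan."""
    def plan(self, image, instruction):
        # e.g. "grip the red 2x4 brick, twist, lift"
        return {"goal": "detach_brick", "target": "red_2x4"}

class FastController:
    """The movement system: a small action policy that runs at a high rate
    and tracks the latest plan without waiting on the slow model."""
    def step(self, plan, joint_state):
        # Compute the next motor command toward the current plan.
        return [0.0] * 7  # placeholder command for a 7-joint arm

reasoner, controller = SlowReasoner(), FastController()
plan, last_plan_time = None, 0.0

for _ in range(500):  # run the control loop for ~5 seconds
    now = time.monotonic()
    # Slow loop: refresh the plan roughly twice a second.
    if now - last_plan_time > 0.5:
        plan = reasoner.plan(image=None, instruction="disassemble the Lego")
        last_plan_time = now
    # Fast loop: issue a motor command every 10 ms, even mid-'thought'.
    if plan is not None:
        command = controller.step(plan, joint_state=None)
        # a real system would send `command` to the arm here
    time.sleep(0.01)
```

The point of the split is that the arm never freezes while the big model is still mulling things over, which is where the “smooth real-time execution” claim comes from.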
Don’t get too excited just yet, though. The company has detailed what its training data consists of, so hopefully your humanoid robot project (why is that in your basement?) dearly needs the ability to disassemble Lego or fold towels. Those make up two large chunks of the dataset, which includes “200M timesteps of robot trajectories and over 80M samples of general vision-language data.”
Lego disassembly constitutes 338 hours of training data, while towel manipulation accounts for another 400 hours. We wouldn’t be too dismissive of the time Xiaomi-Robotics-0 has spent on these, though. Taking Lego apart and folding towels neatly both demand substantial dexterity, and those skills should transfer. Eventually.
If you’re only keen on the results rather than Xiaomi’s very technical robotics software breakdown, you’ll find those here. While the company only uses a pair of robot arms (and not an entire humanoid), their ability to take Lego apart (and clean up the counter at the same time) is actually fairly impressive.