How are humanoid robots trained?

The rise of AI models is transforming the way humanoid robots are trained. Data collection, simulation and new learning models pave the way for machines capable of generalizing their knowledge to new situations.

Robots have long been programmed to execute each of their movements, coded in advance. This approach worked in highly controlled environments, such as factories or logistics warehouses, but proved too limited in more unpredictable situations.

Recent advances in artificial intelligence and the emergence of foundation models have changed the game. Rather than only following predefined rules, humanoid robots are now trained on data. By observing human actions, they can reproduce gestures, identify recurring patterns and thus attempt to generalize their knowledge to new situations.

“We moved from a logic where we programmed behaviors to an approach where these behaviors are learned from data. This is the only possible way to scale up,” summarizes Deepak Pathak, co-founder and CEO of Skild AI, an American start-up that is developing a model presented as a “general brain for robots”.

Collect real-world data

Robots learn mainly from three types of data: robotic data (very precise but difficult to collect on a large scale), video (abundant but poorer in information on physical interactions, such as forces or contacts between objects) and data generated in simulated environments, which suffers from a mismatch with the real world (the “sim-to-real gap”).

There are several methods for collecting this data. The simplest is observational learning: the robot observes a human performing certain tasks. Thanks to its cameras and sensors, it records movements and gestures, in order to reproduce them later. AI models will then be able to identify recurring patterns. For example, if hundreds of demonstrations show how to grab a cup from different locations, angles and lighting, the robot can generalize to learn how to grab a cylindrical object.
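The idea of generalizing from demonstrations can be illustrated with a deliberately tiny sketch of imitation learning. It assumes each demonstration reduces to a single pair (cup position, gripper target), which is far simpler than real camera and sensor data; the function name `fit_linear_policy` and the numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean

def fit_linear_policy(demos):
    """Fit action = slope * observation + intercept by least squares
    over (observation, action) pairs taken from demonstrations."""
    xs = [obs for obs, _ in demos]
    ys = [act for _, act in demos]
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = cov / var
    intercept = y_bar - slope * x_bar
    return lambda obs: slope * obs + intercept

# Hypothetical demonstrations: (cup position in cm, gripper target in cm)
demos = [(10.0, 12.0), (20.0, 22.0), (30.0, 32.0), (40.0, 42.0)]
policy = fit_linear_policy(demos)

# Query a cup position that never appeared in the demonstrations
predicted_target = policy(25.0)
```

The fitted policy returns a sensible target for a position it never saw, which is the toy equivalent of a robot grasping a cup placed somewhere new. Real systems replace the straight line with a neural network and the single number with images and joint states.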

But the most widespread method is teleoperation. A human equipped with a remote control or a VR headset controls the robot’s gestures, allowing it to memorize them. The teleoperator can also wear haptic gloves and motion sensors in order to collect more data. This method captures detailed information such as joint angles and applied forces.
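As a rough picture of what a teleoperation session might leave behind, the sketch below stores each instant as a record of joint angles and applied force. The class names, field names and units are assumptions made for this example; they do not describe the logging format of any particular robot maker.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TeleopFrame:
    """One instant of a teleoperated demonstration (hypothetical schema)."""
    timestamp_s: float
    joint_angles_rad: List[float]
    grip_force_n: float

@dataclass
class TeleopEpisode:
    """A full demonstration of one task, as an ordered list of frames."""
    task: str
    frames: List[TeleopFrame] = field(default_factory=list)

    def record(self, timestamp_s, joint_angles_rad, grip_force_n):
        # Copy the angles so later changes by the caller do not alter the log
        self.frames.append(
            TeleopFrame(timestamp_s, list(joint_angles_rad), grip_force_n)
        )

    def peak_force(self):
        # Largest grip force applied during the demonstration
        return max(frame.grip_force_n for frame in self.frames)

episode = TeleopEpisode(task="pick up cup")
episode.record(0.0, [0.10, 0.50, -0.20], 0.0)
episode.record(0.1, [0.12, 0.55, -0.18], 2.0)
episode.record(0.2, [0.15, 0.60, -0.15], 4.5)
```

Thousands of such episodes, each pairing what the robot sensed with what the operator made it do, are what learning models are then trained on.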

The main humanoid builders use this type of training. 1X, which is beginning to market the NEO domestic robot, will even offer a remote teleoperation service. An employee will be able to take control of the humanoid in order to teach it to perform certain household tasks in the customer’s home.

Data collection via teleoperation has become a real industry, particularly in China where specialized centers employ operators responsible for carrying out repetitive tasks in order to feed learning models intended for robots.

These approaches, although more effective than programming, nevertheless have significant limitations: they are particularly time-consuming and require a great deal of human labor.

Simulate before acting

To try to circumvent these limitations, new methods have emerged. They aggregate several types of data, notably video, and have been designed to allow AI models to understand the laws of physics.

Vision-Language-Action (VLA) models, for example, take images and text instructions as input and produce as output a sequence of motor actions that a robot can execute. Several major players are developing their own VLA models, such as GR00T N1 at NVIDIA, Gemini Robotics at DeepMind or Helix at Figure AI.

Skild AI, for its part, applies to robotics a logic already used in large language models: pre-training on immense volumes of data, followed by refinement with more specific data from the real world. “This combination, with lots of general data on one side and then specific, high-quality data on the other, is one of the key principles of current AI,” explains Deepak Pathak.
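The two-stage recipe of pre-training followed by refinement can be shown on a toy problem. The sketch below is not Skild AI's method: it assumes a model with a single parameter `w`, a large synthetic "general" dataset and three "specific" samples, all invented for illustration.

```python
def train(w, data, lr, steps):
    """Gradient descent on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Stage 1 data: large, generic dataset (100 synthetic samples, slope 2.0)
general = [(k / 10, 2.0 * k / 10) for k in range(1, 101)]

# Stage 2 data: small, high-quality dataset from the target setting (slope 2.5)
specific = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]

# Pre-train from scratch, then fine-tune starting from the pre-trained weight
w_pretrained = train(0.0, general, lr=0.01, steps=200)
w_finetuned = train(w_pretrained, specific, lr=0.01, steps=200)
```

Pre-training brings the parameter close to a value that fits the broad data, and fine-tuning then nudges it toward the behavior seen in the small, specific set. In real models the starting point learned from general data is what makes a handful of high-quality examples sufficient.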

The start-up Rhoda AI decided to pursue another path with its “Direct Video-Action” (DVA) model. This allows robots to learn directly from a high-performance video model, in order to improve their ability to act in real environments.

Another popular approach in the world of robotics is that of so-called world models. They allow robots to gain an understanding of how the physical world works and to anticipate the consequences of their actions. Coupled with simulated environments, they allow robots to carry out millions of trials before being put into a real situation.

Whereas LLMs such as ChatGPT predict the next word, world models predict the consequences of an action. For example, a robot can learn that a glass risks falling if pushed too hard, or that an object hidden behind another still exists. Major players in the sector include AMI Labs, co-founded by Yann LeCun, and World Labs, founded by Fei-Fei Li.
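The glass example can be turned into a minimal sketch of the principle: learn from simulated trials how the world responds to an action, then use that prediction before acting. Everything here is an assumption made for illustration, including the one-dimensional table, the simulator `simulate_push` and its 4 cm per newton rule; real world models learn from video and far richer state.

```python
def simulate_push(position_cm, force_n):
    """Hypothetical simulator: the glass slides 4 cm per newton of push."""
    return position_cm + 4.0 * force_n

# Collect (state, action, next state) transitions from simulated trials
transitions = [
    (pos, force, simulate_push(pos, force))
    for pos in (0.0, 10.0, 20.0)
    for force in (0.5, 1.0, 2.0)
]

# Learn the model next = position + k * force (least-squares estimate of k)
k = (
    sum((nxt - pos) * force for pos, force, nxt in transitions)
    / sum(force * force for _, force, _ in transitions)
)

TABLE_EDGE_CM = 30.0  # assumed position of the table edge

def predict_next(position_cm, force_n):
    """Predicted position of the glass after a push, using the learned k."""
    return position_cm + k * force_n

def glass_would_fall(position_cm, force_n):
    """Check the predicted consequence of a push before executing it."""
    return predict_next(position_cm, force_n) > TABLE_EDGE_CM

gentle_push_is_risky = glass_would_fall(20.0, 2.0)
hard_push_is_risky = glass_would_fall(20.0, 3.0)
```

The robot never has to break a real glass: the learned model predicts that the harder push would carry it past the edge, so that action can be rejected in advance.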

Learning methods aggregating various types of data, coupled with simulation training, thus appear to be a solution to help robots understand the world around them. Players in the humanoid robotics sector hope that this will remove one of the main obstacles to the massive deployment of their models.
