Valued at $14 billion, Skild AI trains its artificial intelligence models using simulation and human videos. The company aims to build a “general brain” capable of powering many types of robots.
American start-up Skild AI raised $1.4 billion earlier this year and is now valued at around $14 billion. Its co-founder, Deepak Pathak, explains why robotics could enable the reindustrialization of Europe and the United States, and why he considers physical AI the key to AGI.
JDN. Can you introduce Skild AI?
Deepak Pathak. We are building a general AI model capable of operating on a wide variety of robots: humanoids, quadrupeds, robotic arms, drones and many other robotic systems. The idea is that the same underlying intelligence can function across different types of embodiments, tasks and environments, indoors, outdoors, in factories, hospitals or university campuses. We call this “omnibody intelligence”: any robot, any task, one brain. Our goal is to create a model capable of adapting to all these contexts.
Why do you think a general AI brain is more powerful than specialized robotic agents?
Robotics has traditionally been viewed as a hardware problem. When people think of ChatGPT, they think of intelligence. But when they think of robotics, they first think of machines. Yet, despite decades of impressive demonstrations, we still do not see robots everywhere in our daily lives.
The reason is not the absence of capable hardware; it is the absence of a real brain. What robotics lacks is general intelligence capable of reliably operating different hardware systems in real-world environments.
Do you also emphasize versatility?
Our ambition is to build a universal brain capable of operating on all of these robotic systems. In return, each action performed by one of these robots generates data that helps improve the underlying foundation model. Each deployment thus produces new data, feeding what we call a “continuous data flywheel”.
Where does the data used to train this universal brain come from?
Unlike language models, robotics does not have an Internet-scale dataset. This creates what I call the “chicken and egg problem” of robotics: to collect robotic data, you need robots already deployed; but to deploy these robots, they must already be functioning properly; and for them to work properly, you need data.
The most valuable data comes from real interactions between robots and their environment. It can be collected using teleoperation or virtual reality systems. However, even if this data is of excellent quality, it remains extremely limited in quantity.
We therefore bootstrap the system from multiple data sources, including videos of humans. We analyze how humans manipulate objects through egocentric footage or videos available online, and then use these observations to guide the robots’ behavior.
But observation alone is not enough. Watching Roger Federer play tennis won’t allow me to play like him: I would need thousands of hours of training. Robots need practice too. As real-world experiments are expensive, simulation becomes essential. Robots first learn to move by observing human demonstrations, then train through millions of simulated interactions.
How do you adapt the same foundation model to different robotic applications?
The approach is similar to that of language models. We first build a general, omnibody foundation model, the Skild Brain, and then specialize it for applications like industry, warehouses, or assembly tasks. Once deployed, the robots generate operational data that continually improves the model. This creates the data flywheel. One of the main challenges of robotics, however, remains the “sim-to-real gap”, that is, the gap between the performance observed in simulation and the performance obtained in real environments. Our approach involves training adaptive systems exposed to a wide variety of simulated conditions. The goal is not simple memorization of scenarios, but the ability to continually adapt to new and unpredictable situations.
What is your view of world models or VLA (Vision-Language-Action) approaches?
World models are often misunderstood. Their goal is not necessarily to generate perfectly realistic simulations of the future at the pixel level. The human brain doesn’t work that way either.
When you pick up a glass or pour water into a container, you do not mentally produce a photorealistic image of every movement of every drop of water. Your brain instead thinks in an abstract space. You intuitively understand concepts like force, timing, balance and friction. You know, for example, that too sudden a movement could cause the glass to fall. It is precisely this type of abstract reasoning that allows a system to generalize its skills to different environments and different forms of robots. Continuously generating fully realistic internal visualizations would be extremely computationally expensive.
This is why tomorrow’s world model will not necessarily be image-based. It will rely more on reasoning carried out in an abstract and efficient latent space, capable of capturing the essential properties of the physical world without having to reconstruct every visual detail.
Your research seems strongly inspired by the way humans learn. Is that intentional?
Yes, a lot of research in AI and robotics, and much of my own academic work, has been inspired by the way humans learn, especially babies. Traditional robotics relied heavily on pre-programmed behaviors. What we are building is much more inspired by cognitive psychology and learning through interaction. An important concept is that of observational learning. Children observe outcomes, intentions and interactions, then experiment on their own. This is very close to what we want robots to do. We use videos and demonstrations to help the robots understand behaviors and outcomes, then the robots train massively through simulation. A robotic foundation model can learn simultaneously across many embodiments, environments, and types of robots. This type of large-scale shared learning could allow these systems to become better than humans at certain abilities.
Do you think physical AI and humanoid robots are the path to AGI?
I believe something even stronger: physical AI is the only path to AGI. When we talk about general intelligence, we mean an intelligence capable of handling very different tasks with the same brain. If you look at evolution, intelligence emerged primarily through physical interaction with the world. Animals have developed locomotion, coordination, balance and vision over hundreds of millions of years of evolution. Language, in comparison, is extremely recent: it has only existed for a few tens of thousands of years. This is why I don’t think language alone is enough to achieve AGI (artificial general intelligence). In my opinion, language is not the foundation of intelligence, but one of its manifestations. The deep source of intelligence is our ability to understand, anticipate, and interact with the physical world.
Skild AI recently raised $1.4 billion. What are the company’s priorities now?
The first priority is compute: training these models requires immense computational resources. The second priority is deployment. The key to robotics is data from deployment. But we can’t focus on a single application or a single type of robot, otherwise we lose diversity, and diversity is the key to building a general model. This is why we deploy our systems in industry, warehouses, mobility and industrial automation. Each deployment generates operational data that improves future deployments.
Why are partnerships and acquisitions so important to Skild AI?
Robotics is a slow-moving industry, and it would be extremely difficult to break into every sector from scratch. This is why partnerships and acquisitions play a vital role.
For example, Zebra Technologies, through its logistics robotics division Fetch Robotics, already had several years of experience in deploying autonomous mobile robots in warehouses and industrial environments. This is one of the reasons behind our acquisition of that robotics business.
We are now also collaborating with ABB and Universal Robots, two of the world’s leading manufacturers of industrial robots. By integrating their systems and working with these manufacturer partners, we can deploy the Skild Brain to hundreds of sites much faster than if we acted alone.
Our model is primarily about working with end customers, while collaborating with robot manufacturers and system integrators to deliver deployments. The overall strategy is simple: deploy our technology in as many different environments as possible, as quickly as possible.
You often talk about the “reindustrialization revolution”. What do you mean by that?
Many countries today want to rebuild their domestic industrial capacity. But labor costs are much higher in countries like the United States, and much industrial expertise has disappeared over time. If you want to produce advanced technologies (chips, electronics, robotic systems) locally while keeping them at the same price, automation becomes essential. This is why robotics and automation are becoming strategic priorities in many countries.