Alexandre Défossez is co-founder and chief exploration officer of Kyutai. At VivaTech, he takes stock of the laboratory’s progress since its creation in 2023.
JDN. Kyutai grew from 6 people initially to a team of 22, and from a focus on voice to several areas of research. How has the laboratory evolved in three years?
Alexandre Défossez. We actually started three years ago with six people; there are 22 of us today, from postdocs to interns. And we are very satisfied with the research environment that we have built. Our laboratory is a bridge between industry and academia, and the part of Kyutai’s mission that will undoubtedly have the most lasting impact is training talent. With us, researchers have direct access to cutting-edge subjects that are linked to concrete applications (voice, vision, world models, robotics, autonomous driving), with the associated computing resources. We are sowing seeds that will eventually germinate: a generation that will innovate, both in research and in entrepreneurship.
On the research side, we started with voice, specifically full-duplex speech-to-speech: a conversation like ours, where anyone can speak at any time, interrupt, and backchannel. It didn’t exist, and it still doesn’t really exist. Two years ago, we released Moshi, a world first, but a research prototype: no tool calling, no reasoning. We continue to work on it. At the same time, we have broadened our scope to demonstrate that we can tackle other themes and deliver cutting-edge results within one or two years. There are two axes. Vision first: with MoshiVis, Moshi can see still images, and we are now tackling video. The most recent axis is world models.
How do you measure the impact of Kyutai, a laboratory that is by definition open-science and non-profit?
Impact is necessarily a complex subject, because it is mainly measured over the long term. But there are immediate metrics: engagement on social networks, number of stars on GitHub, downloads on Hugging Face. This gives a first measure, but it is not always meaningful: to generate volume, you have to produce something aimed primarily at the general public. We do that, with demos. For longer-term work, on the other hand, it is academic and industrial circles that take over. There, the signal comes through feedback and requests for collaboration. We have had several: I can’t name names, but companies come to us to build on our ecosystem. And this is primarily due to a unique positioning: to my knowledge, we are the only cutting-edge research laboratory that is entirely open.
Are your models already used in products in production?
Yes. With CMA CGM and La Provence, we deployed text-to-speech for their press articles, built on our technology. On the collaboration side, there is CMA CGM, and with Iliad we work mainly with Scaleway on access to computing, plus discussions with some of the group’s applications, but nothing public for the moment. Beyond our own products, our models are used by other AI laboratories and software publishers. Nvidia’s PersonaPlex is built on Moshi. On the speech side, and to mention only those that I can name: Qwen TTS builds on our TTS model, and the Mistral ASR model builds on our ASR.
And then there is our Mimi codec, quite unique when it was released: 12.5 Hz, so a very low frame rate, ideal for use with a transformer. As soon as a frame is generated, the audio is output instantly, which existing codecs did not do. It is still downloaded millions of times every month, and many companies use it, often without our knowing. In the research literature, Moshi has become a reference, a baseline that is now a little dated, but which has opened up a field of research: hundreds of papers seek to extend it.
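(Editor’s note: to make the frame-rate figure concrete, here is a minimal sketch of the arithmetic behind a 12.5 Hz codec. Only the 12.5 Hz value comes from the interview; the function names are hypothetical and are not part of Mimi’s API.)

```python
import math

# Frame rate quoted in the interview for the Mimi codec.
MIMI_FRAME_RATE_HZ = 12.5


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of audio covered by one codec frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds: float, frame_rate_hz: float) -> int:
    """Number of frames a model must generate to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


print(frame_duration_ms(MIMI_FRAME_RATE_HZ))  # 80.0
print(frames_for(10, MIMI_FRAME_RATE_HZ))     # 125
```

At this rate each frame covers 80 ms of audio, so ten seconds of speech is a sequence of only 125 steps, which is why a low frame rate suits a transformer.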
World models are one of your major, fairly recent areas of research. Why did you position yourself in this niche? How is your approach different from Yann LeCun’s AMI?
On world models, we have a collaboration with General Intuition (Geneva AI start-up, editor’s note). Concretely, the company builds a dataset made up of video game sessions. For each player, we know exactly what they are doing with the keyboard and mouse. It is this data that we use to train the model. A world model for video games may seem abstract, but it is an ideal testing ground: abundant data and manageable variability.
The objective is to train agents, AIs capable of acting and making decisions toward a given goal, such as winning the game or cooperating in a multiplayer game. The goal is to generalize. We won’t go directly from video games to reality, but once it works in the game, we know, from a purely algorithmic point of view, what works and what doesn’t. That algorithm can then be applied much more easily to reality. And that opens up applications in autonomous driving and robotics, because the idea, ultimately, is to build a world model of reality and simulate what happens in it.
On the difference with AMI: they have a fairly clear point of view on architecture. We don’t have a fixed position on this; we will do what works best. For the rest, we have few details on the exact problem they intend to attack.
So the objective is not a single world model, but several world models specialized according to use cases?
Yes, there will be several world models depending on the use case: the one for robotics will be nothing like the one for autonomous driving. But what we hope is that the know-how and the algorithm will transfer, so that we can train a new model much more quickly. An example: with Moshi, we started with speech-to-speech. The architecture that we developed was then used for a whole range of problems, in transcription and text-to-speech. Basically, an AI model is inputs, outputs and a training dataset. To move from robotics to autonomous driving, you change the inputs and outputs, and everything in the middle remains the same.
How do you manage to continue funding the laboratory?
The initial funding allows us to last for several more years. This is also the advantage of not having thrown ourselves into an all-out war over recruitment and computing: we are much more efficient than a lot of players. Beyond that, we are exploring industrial collaborations, and it’s an interesting model. If a company wants to explore a subject and it fits into our research areas, we go for it. We develop models that will be open source, funded by this company, simply because it has an interest in seeing the field develop. It reckons that this will pay off. This is how we work to sustain the laboratory.