Scaling laws for native multimodal models

Published: Apr 11, 2025

Authors: Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby

Read time: 4 min read

Paper overview

What did the authors set out to do?

Imagine you have a toy box full of different toys—some are blocks, some are dolls, and others are cars. Each toy is different, but they all fit together in the box. Now, imagine trying to understand how each toy works and how they can work together. That’s kind of what the authors of this research paper were trying to do, but instead of toys, they were working with something called multimodal models.

Multimodal models are like super-smart computer programs that can understand and work with different types of data, like text, images, and even sounds. This paper focuses on native multimodal models, which are trained on all of those data types together from the very beginning rather than assembled from separate, pre-trained pieces. The goal of this research was to figure out the best way to build these models so they can work efficiently and understand the world around them better.

The authors wanted to answer a big question: Is it better to build these models by combining different parts (like vision and language) from the start, or should we keep those parts separate and combine them later? They also wanted to see how these models could be scaled up to handle more data and become even smarter.

How did they do their research?

To answer their questions, the authors ran a lot of experiments. They built many different models: some processed text and images together in a single network right from the first layer (early-fusion models), while others first ran the images through a separate vision encoder and only combined the result with the language model later (late-fusion models). They also tried something called Mixture of Experts (MoEs), a technique that makes models more efficient by letting different parts of the model (the experts) specialize in different kinds of input.
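To make the difference between the two designs more concrete, here is a rough sketch in Python using PyTorch. It only illustrates the general idea; it is not the authors' code, and the class names, layer counts, and sizes are made up for the example.

```python
# Rough sketch only: illustrative class names and sizes, not the paper's implementation.
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    """One transformer sees text tokens and image patches together from the first layer."""
    def __init__(self, vocab_size=1000, patch_dim=768, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(patch_dim, d_model)   # map patches into the token space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, image_patches):
        tokens = torch.cat([self.text_embed(text_ids),
                            self.image_proj(image_patches)], dim=1)
        return self.backbone(tokens)                      # every layer sees both modalities

class LateFusionModel(nn.Module):
    """A separate vision encoder runs first; its output is handed to the language model."""
    def __init__(self, vocab_size=1000, patch_dim=768, d_model=512):
        super().__init__()
        vision_layer = nn.TransformerEncoderLayer(patch_dim, nhead=8, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(vision_layer, num_layers=4)
        self.connector = nn.Linear(patch_dim, d_model)    # bridge vision features to the LM
        self.text_embed = nn.Embedding(vocab_size, d_model)
        lm_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.language_model = nn.TransformerEncoder(lm_layer, num_layers=4)

    def forward(self, text_ids, image_patches):
        vision_feats = self.connector(self.vision_encoder(image_patches))
        tokens = torch.cat([self.text_embed(text_ids), vision_feats], dim=1)
        return self.language_model(tokens)                # fusion only after the vision tower

text = torch.randint(0, 1000, (2, 8))            # 2 sequences of 8 text tokens
patches = torch.randn(2, 16, 768)                # 2 images, each split into 16 patch vectors
print(EarlyFusionModel()(text, patches).shape)   # torch.Size([2, 24, 512])
print(LateFusionModel()(text, patches).shape)    # torch.Size([2, 24, 512])
```

The key difference is where the image information meets the text: in the early-fusion sketch every layer processes both together, while in the late-fusion sketch the images first pass through their own encoder.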

They trained these models on a lot of data, including text, images, and combinations of both. They used something called scaling laws to understand how the models’ performance changed as they made the models bigger or gave them more data to learn from. Scaling laws are like rules that predict how well a model will perform based on how many parameters it has, how much data it sees, and how much computing power goes into training it.
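To give a sense of what such a rule looks like, here is the general form of a widely used scaling law (the form popularized by Hoffmann et al., 2022). The paper fits laws of this general flavor to its multimodal models; the symbols below are generic placeholders, and the actual constants come from the authors' experiments.

```latex
% Predicted loss L as a function of model size N (parameters) and training data D (tokens).
% E, A, B, \alpha and \beta are constants fitted to the experimental runs.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

In words: the error floors out at some irreducible level E, and shrinks in a predictable way as the model gets more parameters (N) or more training data (D).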

The authors also compared their models to see which ones worked better and why. They looked at things like how much computing power was needed, how many parameters (the adjustable knobs a model learns during training) the models had, and how well they could understand and generate text and images.

What did they find out?

One of the biggest findings was that early-fusion models worked just as well as late-fusion models, and sometimes even better, especially when the models were smaller. This means that combining data from the start might be a better approach for some tasks. They also found that early-fusion models were more efficient and easier to train, which is important for making AI systems that can be used in the real world.

The authors also discovered that using MoEs made the models work better, especially when the models were smaller. This is because MoEs let different experts inside the model specialize in different kinds of input, a bit like how people call on different specialists for different jobs. They also found that these multimodal models followed scaling rules similar to those of large language models, with some differences because they were handling multiple types of data.
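For readers curious what an MoE layer actually looks like, here is a rough sketch, again in PyTorch and again purely illustrative: the routing scheme, sizes, and names are simplified assumptions, not the configuration used in the paper.

```python
# Rough sketch of a Mixture-of-Experts feed-forward layer with top-1 routing.
# Illustrative only; not the paper's MoE setup.
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)         # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                     # x: (batch, seq, d_model)
        probs = self.router(x).softmax(dim=-1)                # routing probabilities
        chosen = probs.argmax(dim=-1)                         # each token picks its top expert
        gate = probs.gather(-1, chosen.unsqueeze(-1))         # probability of the chosen expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = chosen == i                                # tokens routed to expert i
            if mask.any():
                out[mask] = gate[mask] * expert(x[mask])      # only those tokens use this expert
        return out

moe = TopOneMoE()
tokens = torch.randn(2, 16, 512)     # a small fake batch: 2 sequences of 16 tokens
print(moe(tokens).shape)             # torch.Size([2, 16, 512])
```

Because each token only runs through one expert, the model can hold many more parameters overall without each token costing much more compute.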

Another important finding was that the way the data was mixed during training mattered. Models trained on a blend of text-only data, image-and-caption data, and documents that mix the two worked better than models trained on just one type of data. This suggests that using a diverse mix of data is important for building good multimodal models.
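As a tiny, purely hypothetical illustration of what a data mixture means in practice, here is a sketch of sampling training examples from different sources; the source names and proportions are made up and are not the mixture used in the paper.

```python
# Hypothetical training mixture: the names and weights below are made up for illustration.
import random

MIXTURE = {
    "text_only": 0.4,        # plain text documents
    "image_caption": 0.3,    # an image paired with a short caption
    "interleaved": 0.3,      # documents that mix images and text
}

def sample_source(rng: random.Random) -> str:
    """Pick which kind of data the next training example comes from."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])   # e.g. ['interleaved', 'text_only', ...]
```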

Why does this research matter?

This research is important because it helps us understand how to build better multimodal models. These models are used in things like self-driving cars, personal assistants like Siri or Alexa, and systems that can understand and describe images. By figuring out the best way to combine different types of data and scale these models, the authors are helping to make AI systems that are more efficient, easier to train, and better at understanding the world.

The findings challenge a common recipe for building multimodal models: taking a pre-trained vision encoder and a pre-trained language model and wiring them together. Instead, the authors suggest that training a single model from scratch, with the different data types combined from the start, might be a better approach. This could lead to new ways of building AI systems that are more flexible and can handle a wider range of tasks.

Overall, this research is a step forward in making AI systems that can understand and work with multiple types of data, which is an important part of building general-purpose AI that can help people in many different ways.

About Anara

Anara helps academics and research teams understand, organize, and write scientific documents. We're building tools to help researchers think and work better. Our AI-powered platform enables teams to quickly comprehend complex research, maintain organized knowledge bases, and produce well-cited documents, accelerating the path from discovery to publication. Experience how Anara can transform your research workflow today.
