
Six months ago, running GPT-4-class responses at any real volume was very expensive. Now the numbers look very different.

The reason frontier models have become cheaper to run is not only that hardware has gotten better or that labs have become more efficient at training.

Much of it comes down to an architectural change that most PMs have never heard of: Mixture of Experts, or MoE.

Understanding it won’t make you an ML engineer. It will, however, make you more credible in the room when the team talks about model selection, latency, and cost.

Let’s dig in!

The Problem with Big Models

The AI industry had a problem. Bigger models are generally smarter. More parameters mean more capability: better reasoning, more knowledge, stronger performance across tasks.

That’s been the case since the early days of large language models.

But bigger models are also expensive, and not just to train. Every time you call the model for inference (every API request, chat turn, generated token), it has to process your input through every one of its parameters.

A 70-billion-parameter model uses all 70 billion parameters on every token it generates. That’s what researchers call a dense model: all parameters are active for all inputs, all the time.

Imagine hiring a company of 1,000 specialists and requiring every single one of them to come and weigh in on every decision, even the unimportant ones.

That’s a dense model at inference time, and it is a lot of waste.

MoE is the answer to that waste.

What Mixture of Experts actually is

The core idea has been around since the early 1990s.

Researchers at MIT and the University of Toronto proposed it in a 1991 Neural Computation paper, but the modern version came into focus with Google’s Switch Transformer research in 2021.

It then became mainstream when Mistral released Mixtral 8x7B in December 2023.

Inside a Mixture of Experts transformer: at each layer, a router selects 2 of 8 experts to process each token. All other experts sit idle. (Source)

The concept is simple. Instead of one massive neural network that handles everything, an MoE model will have multiple smaller neural networks (the “experts”).

Then, a router decides which experts get called for each piece of input. Not all experts fire on every token. Only a small subset, often two, activates for any given token.

Think of it like a hospital. A general hospital has specialists in cardiology, neurology, orthopaedics, oncology, and dozens of other fields.

When a patient arrives, they don’t see all 400 doctors.

A triage system routes them to the right department. The orthopaedic surgeon handles the broken ankle. The cardiologist handles the chest pain.

The hospital’s total capacity is enormous.

But any given patient only activates a small fraction of it. That’s sparse activation. And it’s what makes MoE architecturally different from a dense model.

How the Router Works

The router is a small, learned neural network layer that runs first, before any experts. It takes the incoming token (a piece of your prompt) and scores it for each expert.

Then it chooses the top-scoring experts (two, in most cases) and routes the token to them. Those experts process the token independently and return their outputs.

The router combines those outputs into a weighted sum, and the result flows to the next layer of the model. This happens at every layer, for every token.
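The routing steps described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any production model is implemented: the function name `top_k_route`, the dot-product scoring, and the four "experts" (which simply scale the token) are all hypothetical stand-ins.

```python
import math

def top_k_route(token, router_weights, experts, k=2):
    """Send one token vector through its k highest-scoring experts."""
    # Router: one score (logit) per expert, here a plain dot product.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the k best-scoring experts; the others are never called.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the chosen logits so the mixing weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    gates = [e / sum(exps) for e in exps]
    # Weighted sum of the chosen experts' outputs.
    output = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        expert_out = experts[i](token)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, chosen, gates

# Hypothetical toy setup: 4 experts that each scale the token by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

output, chosen, gates = top_k_route([0.5, 2.0], router_weights, experts)
print("experts used:", chosen)
print("mixing weights:", [round(g, 3) for g in gates])
```

In this toy run only two of the four expert functions are ever called for the token, which is the whole point: the other two contribute no compute.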

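The routing decisions are not fixed. They are learned during training. Across billions of training examples, the model learns which experts perform well on which inputs.

In the popular picture, some experts specialise in code, some in factual recall, some in reasoning. The reality appears to be messier: Mistral’s own analysis of Mixtral found routing that tracks syntax and token patterns more than topics. Either way, the model develops internal specialisation that you can’t see directly, but that shows up in performance.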

How the router works: each token is scored against every expert, and only the top-k highest-scoring experts activate. Most experts do nothing for any given token. (Source)

The key point is that the experts which don’t get selected do nothing for that token. Their parameters are idle. They are in the building, but not in the room.

The Numbers That Make This Interesting

Mistral’s Mixtral 8x7B is the clearest public example of how this plays out. The name suggests the shape: 8 experts, each sized like part of a 7-billion-parameter model.

But it’s not 56 billion parameters total, because the attention layers and embedding layers are shared across all experts. The total is approximately 47 billion parameters.

At inference time, only about 13 billion of those parameters activate per token.

So you are running a ~47-billion-parameter model, with the knowledge and capability that implies, at the compute cost of roughly a 13-billion-parameter dense model.

That’s less than 30% of the total parameter count doing work on any given token.
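The arithmetic behind those figures can be reproduced with a rough parameter count. The sketch below assumes the shape values from Mixtral 8x7B’s publicly released configuration (hidden size 4096, 32 layers, expert feed-forward size 14336, 8 experts with 2 active, vocabulary 32000, grouped-query attention with 32 query heads and 8 key/value heads). It ignores the tiny normalisation weights, so treat the result as an estimate.

```python
# Shape values: assumptions taken from Mixtral 8x7B's public configuration.
DIM = 4096                  # hidden size
LAYERS = 32
FFN_DIM = 14336             # inner size of each expert's feed-forward block
EXPERTS = 8
TOP_K = 2                   # experts active per token
VOCAB = 32000
HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128

def mixtral_param_counts():
    # Shared per layer: attention projections (Q and output, then K and V).
    attention = 2 * DIM * HEADS * HEAD_DIM + 2 * DIM * KV_HEADS * HEAD_DIM
    # Shared per layer: the router, one score per expert.
    router = DIM * EXPERTS
    # Per expert, per layer: three weight matrices in the gated feed-forward block.
    expert = 3 * DIM * FFN_DIM
    # Shared once: input embeddings and the output projection.
    embeddings = 2 * VOCAB * DIM
    shared = LAYERS * (attention + router) + embeddings
    total = shared + LAYERS * EXPERTS * expert
    active = shared + LAYERS * TOP_K * expert
    return total, active

total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B ({active / total:.1%} of total)")
```

Under these assumptions the count comes out at roughly 46.7 billion parameters in total and 12.9 billion active per token. Almost all of the total sits in the expert feed-forward blocks, which is why picking 2 of 8 experts cuts the active count so sharply.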

The results bear this out.

Mixtral 8x7B outperforms or matches Llama 2 70B across most benchmarks, while activating only about 13 billion parameters per token versus Llama 2 70B’s 70 billion.
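A back-of-the-envelope comparison makes the gap concrete. The sketch below assumes a common rule of thumb, roughly 2 floating-point operations per active parameter per generated token, and the helper name `forward_flops_per_token` is made up for illustration. It ignores attention over long contexts, memory bandwidth, and batching, so it only indicates the order of magnitude.

```python
def forward_flops_per_token(active_params):
    # Rule-of-thumb assumption: one multiply and one add per active parameter.
    return 2 * active_params

mixtral = forward_flops_per_token(13e9)      # ~13B active parameters
llama2_70b = forward_flops_per_token(70e9)   # dense: all 70B are active

ratio = llama2_70b / mixtral
print(f"Llama 2 70B needs about {ratio:.1f}x the compute per token")
```

On this estimate the dense model does a little over five times the arithmetic per token.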

You are getting 70B-class performance at something closer to 13B-class compute cost per token.

Google’s Switch Transformer research demonstrated the training efficiency side. By routing each token to a single expert (the simplest possible MoE setup), they reported up to 7x faster pretraining than a dense baseline given the same computational resources. That efficiency at training time compounds, too.

Labs can train much larger models, reaching into the hundreds of billions or trillions of parameters, without proportionally increasing their compute spend.

Why This Connects to GPT-4 and The Cost Story

OpenAI has never officially disclosed its architecture.

However, unconfirmed reports from SemiAnalysis and others describe GPT-4 as an MoE model. The accounts differ on the number of experts (8 and 16 have both been reported) and put the total parameter count at well over a trillion.

If accurate, it follows the same logic. A GPT-4 class model with MoE only activates a fraction of its parameters per token. That means lower compute per inference call.

Lower compute per inference call means lower cost per API request.

When GPT-4’s pricing dropped after its initial launch, and again when GPT-4o came with better pricing, that wasn’t just OpenAI being generous.

It likely reflected real reductions in the compute cost of serving the model, and if the reports are right, the MoE architecture is part of that story.

The same pattern shows up across the industry.

For example, Google describes Gemini 1.5 Pro, announced in February 2024, as built on a mixture-of-experts architecture.

Again, Google hasn’t published full architectural details.

But the model’s ability to handle context windows of up to one million tokens at competitive pricing is consistent with sparse activation, which helps keep long-context inference affordable at scale.


The Tradeoffs Most People Don’t Mention

MoE is not a free lunch. There are tradeoffs worth knowing.

Memory requirements don’t shrink with compute. Even when only 13 billion parameters are active per token in Mixtral 8x7B, all 47 billion parameters must be loaded into memory.

You need to hold all the experts in VRAM, even the ones sitting idle. That makes serving MoE models memory-intensive, particularly at smaller deployment scales.
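A quick estimate shows the size of the gap. The sketch below counts weight storage only (no KV cache or activations), and the bytes-per-parameter table and the helper name `weight_memory_gb` are illustrative assumptions.

```python
# Assumed storage cost per parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params, precision):
    """Memory needed just to hold the weights, in gigabytes."""
    return params * BYTES_PER_PARAM[precision] / 1e9

TOTAL_PARAMS = 47e9    # every expert must be resident in memory
ACTIVE_PARAMS = 13e9   # parameters actually used per token

for precision in BYTES_PER_PARAM:
    resident = weight_memory_gb(TOTAL_PARAMS, precision)
    used = weight_memory_gb(ACTIVE_PARAMS, precision)
    print(f"{precision}: {resident:.1f} GB resident, {used:.1f} GB touched per token")
```

At 16-bit precision the weights alone come to about 94 GB, well beyond the roughly 24 GB found on a high-end consumer card (an assumed figure for comparison), even though only about 26 GB of them are touched per token.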

This is why you can’t straightforwardly run Mixtral 8x7B on a single consumer GPU, even though its active compute is that of a 13B model. The memory footprint is still that of a 47B model.

Load balancing is a real engineering problem. The router must distribute tokens roughly evenly across the experts.

If it doesn’t, if certain experts get overwhelmed while others sit idle, you lose the efficiency benefits and create throughput bottlenecks.

Training an MoE model typically requires adding an auxiliary loss function specifically designed to encourage balanced expert utilisation. It’s not automatic.
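One well-known form of such a loss comes from the Switch Transformer work: multiply, for each expert, the fraction of tokens sent to it by the average probability the router gives it, sum over experts, and scale. The sketch below is a minimal plain-Python version of that idea for top-1 routing. The function name and the default `alpha=0.01` are assumptions for illustration, and real training code computes this on tensors inside the training loop.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs: one list of per-expert probabilities for each token."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose top choice is expert i.
    f = [0.0] * num_experts
    for probs in router_probs:
        f[probs.index(max(probs))] += 1 / num_tokens
    # p[i]: average probability the router assigns to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    # Lowest when tokens and probability mass are spread evenly.
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts: each token prefers a different expert.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
# Four tokens that all prefer expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss < collapsed_loss)
```

The collapsed router, which sends every token to the same expert, is penalised more heavily than the balanced one, so gradient descent is nudged toward spreading the load.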

Fine-tuning is harder. Dense models are relatively stable to fine-tune.

MoE models are prone to overfitting on small datasets, partly because the routing patterns can shift when you update weights on limited data.

For teams building fine-tuned applications, this is a real consideration.

But why does this matter when you are buying tokens?

Most PMs do not choose model architectures. They are choosing between offerings on a pricing page, but the architecture underneath shapes what you see.

When comparing models, you will notice that models with similar benchmark performance can have quite different prices.

Some of that is competition between labs. Some of it is hardware efficiency.

And some of it is the architecture, specifically, whether the model can achieve its capability level through sparse activation or requires dense computation on every token.

The models that can do more with fewer active parameters per token are cheaper to serve at scale. That cost eventually flows through to API pricing.

So when you are in the budget talks, and someone asks why costs came down, or why one model is cheaper than another despite similar performance, this is the answer.

The Deeper Implication

There’s something worth sitting with here. For years, the working assumption in AI was that parameter count, capability, and cost all rose together.

You had to pay for intelligence. The computational costs of frontier models were a barrier to building on them. MoE breaks that assumed relationship.

A model can be large enough to hold knowledge across many domains while drawing on only the relevant slice for any given input.

Parameter count and inference cost are no longer the same thing.

That’s a genuine architectural shift, and it’s still playing out.

Researchers are exploring finer-grained MoE designs, improved routing algorithms, and methods to reduce memory overhead.

The efficiency gap between MoE and dense models at inference is likely to keep widening.

The practical upshot is that the economics of AI products are much better than they were even a few years ago.

They will only continue to improve, not just because GPUs get faster, but because the models themselves are being designed to do more with less.

Next time the API bill comes up, you will know what’s actually driving it.

