Meta's Three-Llama Monte

Meta’s Three-Llama Monte

Meta just released Llama 4, its new family of AI models, and while the technical specifications are genuinely interesting, the entire launch feels like watching someone try to simultaneously drive a car, solve a Rubik’s cube, and explain quantum mechanics. Let’s try to disentangle what these models actually do before exploring why the release seems so scattered.

The Models: A Technical Triptych

Llama 4 consists of three distinctly different models, each with its own architectural approach:

Llama 4 Scout: A 17 billion active parameter model (109 billion total parameters) with 16 experts that fits on a single NVIDIA H100 GPU. It supports a massive 10 million token context window and multimodal capabilities, making it the “Honda Civic” of the lineup – practical, efficient, and surprisingly capable.
Llama 4 Maverick: Also 17 billion active parameters but with 128 experts (400 billion total parameters). Think of this as the “sports car” version – same driver position, but much more power under the hood. It supports a 1 million token context window.
Llama 4 Behemoth: The heavyweight with 288 billion active parameters (2 trillion total). This is Meta’s “monster truck” – not yet available but allegedly competitive with the most advanced models on the market.

The key innovation across all three models is the Mixture of Experts (MoE) architecture. Unlike traditional “dense” models where every parameter participates in processing every input, MoE models activate only relevant “expert” neural pathways for specific inputs.

This is analogous to how a large corporation might handle varied requests: when a legal question comes in, you don’t need input from the marketing department, the cafeteria staff, and the logistics team – you just route it to the legal experts. Similarly, these models dynamically route different types of queries to specialized neural pathways, potentially delivering more computational efficiency per parameter.

Long Context: The Elephant (or 10 Million Tokens) in the Room

One of Scout’s most striking features is its 10 million token context window – approximately 7,500 pages of text. For perspective, that’s enough to process the entire “A Song of Ice and Fire” series (so far), plus “The Lord of the Rings” trilogy, plus “War and Peace,” all in a single conversation.

While Meta provided basic validation using the “needle in a haystack” benchmark (essentially hiding information in a vast text and seeing if the model can find it), they’ve skipped more rigorous evaluations that would prove the model genuinely understands content across that entire context. It’s a bit like boasting your car can travel 1,000 miles without demonstrating it has enough fuel to get there.

Multimodality: Now With Added Vision

Unlike Llama 3’s awkward two-step release (text first, vision months later), Llama 4 arrives with image understanding capabilities built in from day one. This lets the models analyze images, charts, and visual content alongside text – a necessary feature to compete in today’s AI landscape where purely text-based models are increasingly viewed as vintage collectibles rather than cutting-edge technology.

Performance: A Tale of Two Benchmarks

Meta’s headline benchmark for Llama 4 Maverick shows it achieving impressive scores on LMArena. But there’s a footnote-sized catch: what they benchmarked was an “experimental chat version” that isn’t actually being released. It’s like advertising the fuel efficiency of a concept car that never makes it to production.

The models that are available appear to perform well across a range of evaluations, though reviews from the AI community have been inconsistent. Some find them competitive with other leading open-weight models, while others report performance that varies dramatically depending on where the model is hosted – suggesting these models might be particularly sensitive to implementation details.

Financial Implications: The Cost of the AI Arms Race

The Mixture of Experts architecture theoretically allows Meta to build more capable models with less training compute, which matters enormously for the economics. Training frontier AI models has become extraordinarily expensive, with costs for the largest models reportedly reaching hundreds of millions of dollars.

For Meta’s shareholders, this architectural choice represents a potential efficiency gain in the company’s massive AI investment. While competitors like OpenAI have clear revenue models tied directly to their AI capabilities, Meta’s AI investments remain primarily a means to enhance their advertising platforms and future AR/VR ambitions rather than direct revenue generators.

The Strategic Confusion

Now, about that scattered release strategy. Traditionally, Meta’s Llama models followed a clear pattern: same architecture at different sizes (7B, 13B, 70B, etc.). This made them particularly valuable for researchers who could study scaling laws and for the open-source community who could choose models based on their compute constraints.

With Llama 4, Meta has abandoned this approach in favor of highly differentiated models with varied architectures. This could be a response to the specialized models emerging from competitors: efficiency-focused models like Gemini Flash versus capability-focused models like Claude Opus or GPT-4.

But Meta’s community hasn’t been asking for this type of diversification. The “LocalLlama” community (named after Meta’s models) consists largely of enthusiasts running models on consumer hardware and researchers comparing model performance across sizes. Both groups benefit from the traditional “same architecture, different sizes” approach and may find these new MoE models less accessible due to their specialized inference requirements.

The licensing also continues to grow more restrictive with each release. Llama 4 requires tagging all artifacts with “Llama-“ branding, the license itself, “Built with Llama” for commercial uses, plus various use-case restrictions. Meanwhile, competitors like DeepSeek are releasing models under the MIT license, essentially saying “do whatever you want with this.”

The Saturday Special

The timing of the release – a Saturday drop when the AI world expected an announcement at the upcoming LlamaCon – suggests some form of internal pressure or competitive concern. It’s reminiscent of how public companies sometimes release disappointing earnings after market close on Fridays, hoping fewer people will notice.

The AI model ecosystem has matured significantly since Llama’s first release. What worked in 2023 may not work in 2025, and Meta appears to be adapting their strategy - perhaps somewhat frantically - to a rapidly changing competitive landscape.

Models in Search of a Market

Technically speaking, these new Llama models represent genuine innovation, particularly in efficiently scaling capabilities through MoE architectures and providing impressive context length. The multimodal capabilities on day one also show Meta learning from past release missteps.

But technology doesn’t exist in a vacuum. For all their technical merit, these models seem caught between competing visions: Are they research tools? Commercial platforms? Open resources for the AI community? The lack of clarity on this fundamental question creates confusion even among Meta’s most enthusiastic supporters.

For users and developers, the most significant question is whether these models offer enough technical advantages to outweigh the increasingly restrictive licensing terms, especially when alternatives like Qwen offer comprehensive model size ranges under more permissive licenses.

In the AI model landscape of 2025, technical capability may no longer be enough. The market increasingly demands clear positioning, consistent strategy, and licensing that aligns with intended use cases. While Llama 4 offers impressive technical specifications, Meta’s execution suggests a company still searching for its place in the rapidly evolving AI ecosystem.

Maybe the next iteration will bring both technical excellence and strategic clarity. Until then, we have some genuinely capable new models to experiment with – even if they arrived in the digital equivalent of showing up to a formal dinner wearing mismatched socks and slightly breathless from running.