DALL-E/Every illustration.

What Actually Matters (And What Doesn’t) For DeepSeek

Allow us to explain why your 401(k) is down




News of DeepSeek’s R1 model, released last week, has sent shockwaves through the tech world. Like many of you, we at Every have been captivated by the Chinese startup’s inexpensive, high-performing model, and the innovations that were necessary to achieve it. 

As for the implications? There’s a lot to reckon with, and we’re still only just figuring out what this new model can do. Investors mostly felt R1’s arrival on the scene wasn’t positive news for AI’s U.S.-based incumbents, and shares of Nvidia and other chip makers were hit particularly hard. Builders, meanwhile—including some of us here at Every—are pretty excited.

Because there’s so much to unpack, we’ve pulled together three of our writers to each tackle one aspect of the news that struck them, and where they see things going. Alex Duffy breaks down the innovations that let R1 achieve a roughly 90 percent cost reduction relative to OpenAI’s o1 model at comparable performance. Entrepreneur in residence Edmar Ferreira discusses the immediate implications for people looking to build AI-based applications. Finally, Evan Armstrong talks about the markets’ (over)reactions.

Let’s dive in.—Michael Reilly, managing editor

DeepSeek R1 is a shift from ‘sounding good’ to ‘thinking better’

Most large language models (LLMs) rely on reinforcement learning (RL) to refine how “helpful and harmless” they sound. Notoriously, OpenAI has used cheap labor in Kenya to label and filter toxic outputs, fine-tuning its models to produce more acceptable language. 

DeepSeek R1 took a different path: Instead of focusing on sounding right, it zeroes in on being right—especially in math, coding, and logic. Rather than learning from subjective human preferences, R1 follows reasoning-oriented RL that rewards the model only if its code compiles and passes tests or if its math solutions are indisputably correct. Because “correctness” is easier to define for these tasks, R1 can scale its training without needing armies of human data labelers. Surprisingly, even for tasks that are more subjective—like creative writing—this emphasis on logical consistency tends to deliver better results, too.
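The key property of this kind of reward is that it is a deterministic check rather than a human preference score. A toy sketch (purely illustrative — not DeepSeek’s actual training pipeline) of a verifiable reward for code generation might look like this:

```python
# Hypothetical verifiable reward for reasoning-oriented RL: the model
# gets reward 1.0 only if its code runs AND passes every unit test.
# No human labeler is involved — "correctness" is defined by the tests.

def code_reward(candidate_src: str, tests: list[str]) -> float:
    """Return 1.0 only if the candidate code runs and passes all tests."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # does it even run?
        for test in tests:
            exec(test, namespace)        # does it pass the tests?
    except Exception:
        return 0.0
    return 1.0

# Example: two model-generated attempts at an add() function
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
print(code_reward(good, ["assert add(2, 3) == 5"]))  # 1.0
print(code_reward(bad, ["assert add(2, 3) == 5"]))   # 0.0
```

Because the reward signal is cheap and objective, it can be computed millions of times during training — which is exactly why this approach scales without armies of labelers.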

R1’s leap in capability and efficiency wouldn’t be possible without its foundation model, DeepSeek-V3, released in December 2024. V3 itself is big—671 billion parameters (by comparison, GPT-4o is rumored to be 1.8 trillion, nearly three times as big)—yet it’s surprisingly cost-effective to run. That’s because V3 uses a mixture of experts (MoE) approach, in which the model is divided into specialized sections, each functioning as an “expert” in a certain domain. When a query comes in, only the relevant experts “light up”—around 5 percent of the model, or 37 billion parameters—which significantly reduces the compute power needed. MoE gained traction in 2024 thanks to teams at companies like Mistral, xAI, and Databricks, which showed it can be easily integrated, scales well, and brings major efficiency gains.
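The routing idea is simple enough to sketch in plain Python — a toy illustration, not V3’s actual architecture: a gate scores all experts, and only the top-k ever run, so most of the model’s parameters sit idle on any given token.

```python
import math

# Toy mixture-of-experts routing (illustrative only). Each "expert" is a
# tiny function; a gate picks the top-k, so only a fraction of the model
# computes per token. V3 activates ~37B of 671B params (~5 percent);
# here we activate 2 of 8 experts for simplicity.

NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, i=i: x * (i + 1) for i in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, gate_logits):
    probs = softmax(gate_logits)
    # Select only the TOP_K highest-scoring experts...
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    # ...and combine just their outputs, renormalizing the gate weights.
    total_w = sum(probs[i] for i in top)
    return sum(probs[i] / total_w * experts[i](x) for i in top)

# If the gate strongly favors expert 7, the output is dominated by it.
print(moe_forward(1.0, [0, 0, 0, 0, 0, 0, 0, 10.0]))
```

The efficiency win is that the six unselected experts here never execute at all — at V3’s scale, that is the difference between running 671 billion parameters and 37 billion.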

On top of that, V3 embraced multi-token prediction (MTP). Inspired by ideas from Meta’s FAIR (Fundamental AI Research) team’s paper “Better & Faster Large Language Models via Multi-token Prediction,” it predicts multiple tokens simultaneously rather than one word at a time. Finally, a trick called FP8 training helps V3 run even faster and cheaper by using “rounded” (lower-precision) numbers. This approach slashes compute costs, memory usage, and reliance on huge GPU clusters—an especially big deal in an era of hardware export controls.
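To see what “rounded” numbers buy you, here is a toy simulation of the low-precision idea using int8-style symmetric quantization — not DeepSeek’s actual FP8 kernels, just the same trade: each value is stored in 1 byte instead of 4, at the cost of a small rounding error.

```python
# Toy low-precision storage (illustrative of the FP8 idea, not the real
# FP8 format): scale values into an 8-bit signed-integer range, store
# 1 byte per value instead of 4, dequantize on use.

def quantize(values, bits=8):
    # Symmetric quantization: the largest magnitude maps to the int range edge.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.127, -0.942, 0.015, 0.638]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)        # small integers: 1 byte each instead of 4
print(max_err)  # rounding error bounded by half a quantization step
```

The memory (and bandwidth) drops 4x relative to 32-bit floats, and the error stays within half a quantization step — the same bet FP8 training makes at data-center scale.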

Crucially, thanks to R1’s new distillation approach, which maintains performance at smaller model sizes, these advanced reasoning skills don’t require Google-sized infrastructure. DeepSeek’s distillation techniques let R1’s capabilities trickle down into smaller, more budget-friendly versions of the model. You can even run a distilled variant locally on your MacBook Pro with just one line of code. In conjunction with its open-source license, this efficiency has led many cloud providers, like Groq, to offer their own hosted versions of the R1 model. Having options lets consumers choose based on factors like speed, reliability, price, and privacy.
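One common way to do that single-line local run — an assumption on our part, since the tooling isn’t named above — is via Ollama, which hosts the distilled R1 checkpoints:

```shell
# Assumes Ollama is installed (ollama.com); this downloads the 8B
# distilled R1 model on first run, then opens an interactive chat.
ollama run deepseek-r1:8b
```

Smaller and larger distilled variants are available under other tags, so you can pick a size that fits your machine’s memory.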

Perhaps R1’s biggest breakthrough is the confirmation that you no longer need enormous data centers or thousands of labelers to push the limits of LLMs. If you can define what “correctness” means in your domain—whether it’s coding, finance, medical diagnostics, or creative writing—you can apply reasoning-oriented RL to train or fine-tune your own model. You pick the benchmarks; you control the objective “good.” Meanwhile, V3’s underlying architecture and cost-saving optimizations ensure you won’t break the bank. By decoupling “performance” from raw scale and shifting it toward well-defined standards of correctness, and being willing to share its innovations, DeepSeek R1 hands more power to researchers, entrepreneurs, and even hobbyists—anyone willing to experiment on how we train and evaluate AI.—Alex Duffy

Welcome to the post-training era for startups

Training LLMs can be divided into two major phases: pre-training and post-training. Pre-training is an extremely expensive process that involves training a general model on a large corpus of data. Even in DeepSeek’s case, a single training run cost $6 million, while Meta’s Llama 3 model is estimated to have cost $120 million to train. DeepSeek’s decreased costs are a huge breakthrough, but they’re still too expensive for most organizations.

Comments

Vinay Korrapati 10 months ago

Terrific analysis, thank you. The DS ‘event’ has hastened the AI value pool accretion tilt towards the application layer. While their 6M training cost is suspect, their reasoning and architectural approach, which you have so beautifully highlighted, will hopefully lower the cost of intelligence for vertical-specific use cases.