DeepSeek vs. building an LLM from scratch: when the toy transformer becomes infrastructure

Published 2026-05-12 · Updated 2026-05-12 · v1 · #ai #llms #deepseek #transformers #machine-learning #ai-architecture #training #inference

There are two useful ways to understand large language models.

One is to build a small one yourself. Sebastian Raschka’s “Build a Large Language Model (From Scratch)” is the clean-room path: tokenize text, implement a GPT-style decoder-only transformer, train it to predict the next token, add instruction fine-tuning, and watch the machine become less mysterious.

The other is to study a frontier-scale system like DeepSeek. At that scale, the model is no longer just a stack of transformer blocks. It is a factory: expert routing, cache engineering, distributed training, FP8 precision, reinforcement learning, data curation, inference economics, and hardware topology.

A caveat: I could not verify a public “DeepSeek V4 Pro” technical paper. The most grounded public references are DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1. So when I say “DeepSeek V4 Pro” here, read it as the likely continuation of the public DeepSeek technical direction, not as a claim about an unpublished architecture.

The interesting lesson is that the educational LLM and the frontier LLM share the same skeleton. Almost everything around that skeleton changes.

the shared core

A basic from-scratch LLM usually starts with a decoder-only transformer.

Text becomes tokens. Tokens become embeddings. A stack of transformer blocks applies causal self-attention and feed-forward layers. The model predicts the next token. Training minimizes cross-entropy loss.
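To make the training objective concrete, here is a minimal sketch of next-token cross-entropy in plain Python. The function names and the toy four-token vocabulary are my own placeholders, not code from Raschka's book or from any DeepSeek release.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds one score per vocabulary entry, produced
    after the model has seen token_ids[0..t].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy setup: vocabulary of 4 tokens, sequence of 3 tokens.
tokens = [0, 2, 1]

# A model that knows nothing assigns equal scores to every token.
uniform_logits = [[0.0] * 4 for _ in tokens]
loss_uniform = next_token_loss(uniform_logits, tokens)

# A model that strongly favors the correct next token at each position.
confident_logits = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0] * 4]
loss_confident = next_token_loss(confident_logits, tokens)
```

The uniform guess scores exactly log(4), the loss of knowing nothing about a four-token vocabulary. Training is the process of pushing that number down.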

That same lineage still matters at the frontier. DeepSeek is not alien technology. It still learns from token sequences. It still uses transformer-family computation. It still autoregressively generates tokens.

That is why Raschka’s book is valuable. A tiny GPT teaches the grammar of the machine: embeddings, attention, residual streams, layer norm, logits, sampling, loss curves, instruction tuning. Once you understand those, terms like KV cache, MoE, MLA, and reasoning RL are no longer random acronyms. They are modifications to parts you can name.

But scale changes the bottlenecks.

from dense models to mixture of experts

In a simple educational LLM, every token usually passes through every parameter in every layer. This is dense computation.

DeepSeek-V3 is different. The paper reports a 671B-parameter Mixture-of-Experts model with roughly 37B active parameters per token. That is the key trick: huge total capacity, sparse per-token compute.
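The ratio is worth computing once. This uses only the two figures quoted above; the variable names are mine.

```python
# Figures as quoted from the DeepSeek-V3 paper in the text above.
total_params = 671e9
active_params = 37e9

# Share of the model's parameters that a single token actually touches.
active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters are active per token")
```

So each token exercises only about one parameter in eighteen.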

Instead of one feed-forward network doing all the work, the model has many expert networks. A router chooses which experts should process each token. Different tokens activate different parts of the model.
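A rough sketch of the routing idea follows. It is generic top-k softmax routing, not DeepSeek's exact gating scheme, which has further details this omits. The experts here are hypothetical scaling functions standing in for real feed-forward networks.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, k):
        y = experts[idx](x)  # experts that were not selected never run
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Four hypothetical experts, each a trivial scaling function.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]

# In a real model the router logits come from a learned projection of x.
router_logits = [0.1, 2.0, -1.0, 1.5]
chosen = route_token(router_logits, k=2)
y = moe_layer([1.0, 1.0], router_logits, experts, k=2)
```

With these logits, experts 1 and 3 are selected and the other two do no work for this token. That skipped work is the entire saving.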

This is conceptually simple but operationally hard.

The router has to balance load. Tokens have to move across devices. Experts cannot become overloaded. Communication cannot eat all the speedup. Training cannot collapse into a few overused experts.
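One simple way to see the load problem is to count how many tokens each expert receives. The imbalance ratio below (busiest expert divided by average load) is an illustrative metric of my own choosing, not the balancing objective from any DeepSeek paper.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count tokens per expert; assignments[t] lists the experts token t uses."""
    counts = Counter(e for token_experts in assignments for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(load):
    """Busiest expert's load relative to the mean; 1.0 means perfectly even."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean > 0 else 0.0

# Hypothetical batch of four tokens, each routed to two of four experts.
# Expert 0 is chosen by every token.
assignments = [[0, 1], [0, 2], [0, 1], [0, 3]]
load = expert_load(assignments, num_experts=4)
ratio = imbalance(load)
```

Here expert 0 handles four tokens while the average is two. If each expert lives on a different device, every other device waits for expert 0 to finish.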

None of this appears in a beginner model. A toy transformer asks: what computation happens? A frontier MoE asks: which computation should happen for this token, where is it located, and how do we move data there cheaply?

That is the first big jump: the architecture becomes a distributed-systems problem.

attention becomes a memory problem

In a textbook implementation, attention is beautiful. Compute queries, keys, and values. Take scaled dot products. Apply a causal mask. Softmax. Multiply by values.
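Those five steps fit in a few lines. This is a single-head sketch in plain Python with made-up inputs. The causal mask is implemented by only scoring positions at or before the current one, which gives the same result as masking future positions with negative infinity before the softmax.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention; position t attends only to positions <= t."""
    d = len(Q[0])
    outputs = []
    for t, q in enumerate(Q):
        # Scaled dot products against the visible prefix only.
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        outputs.append([
            sum(w * V[s][j] for s, w in enumerate(weights))
            for j in range(len(V[0]))
        ])
    return outputs

# Hypothetical 3-token sequence with 2-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(Q, K, V)
```

The first position can only see itself, so its output is exactly its own value vector.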

At frontier scale, attention is also a memory problem.

During generation, the model should not recompute the full prefix every time it emits a new token. The KV cache stores previous keys and values so each new token can attend to the cached context. For small models, this is an optimization. For production models, cache size can dominate inference cost.
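A back-of-envelope calculation shows why. The formula is the standard one for plain multi-head attention. The configuration is hypothetical and does not describe any DeepSeek model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Memory for cached keys and values with standard multi-head attention."""
    # Factor 2: one tensor for keys and one for values, in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical mid-sized model, 16-bit cache entries, a single 4096-token request.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gib = size / 2**30

# The cache grows linearly with both context length and concurrent requests.
size_long = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           seq_len=32768, batch_size=16)
gib_long = size_long / 2**30
```

One request costs 2 GiB in this setup. Sixteen concurrent requests at 32K context cost 256 GiB, which is more than the weights of many models.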

DeepSeek-V2 introduced Multi-head Latent Attention, or MLA, and DeepSeek-V3 kept it, partly to reduce KV cache pressure. The idea is to compress the representation needed for attention so long-context inference becomes cheaper.
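The compression idea can be sketched as a low-rank bottleneck: cache one small latent vector per token and rebuild keys and values from it when attention needs them. This is a deliberately simplified picture. The published design has additional pieces (for example, how positional information is handled) that this omits, and every matrix and dimension below is a made-up placeholder.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def compress(hidden, W_down):
    """What gets cached per token: a small latent vector."""
    return matvec(W_down, hidden)

def expand(latent, W_up_k, W_up_v):
    """Keys and values rebuilt from the cached latent."""
    return matvec(W_up_k, latent), matvec(W_up_v, latent)

def cached_elements_per_token(num_heads, head_dim, latent_dim=None):
    """Per-layer cache entries for one token, with or without a latent."""
    if latent_dim is None:
        return 2 * num_heads * head_dim  # full keys plus full values
    return latent_dim                    # one shared latent vector

# Toy shapes: hidden size 4, latent size 2, key/value size 4.
W_down = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
W_up_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_up_v = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 1.0]]

latent = compress([1.0, 2.0, 3.0, 4.0], W_down)
key, value = expand(latent, W_up_k, W_up_v)

# Hypothetical sizes: 32 heads of dimension 128 versus a 512-wide latent.
full = cached_elements_per_token(32, 128)
small = cached_elements_per_token(32, 128, latent_dim=512)
```

In this made-up configuration the cache holds 512 numbers per token per layer instead of 8192, a 16x reduction, at the price of extra matrix multiplies when the keys and values are needed.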

This is the kind of detail that separates classroom LLMs from deployed LLMs. A small model can waste memory and still teach the lesson. A frontier model serving millions of tokens cannot.

training becomes an industrial operation

When you build a small LLM from scratch, the goal is clarity. You can train on a modest corpus, maybe on one machine. You watch loss go down. You learn why batching matters, why tokenization matters, and why sampling settings change outputs.

DeepSeek-scale training is different. The public DeepSeek-V3 paper emphasizes FP8 mixed-precision training, large-scale parallelism, infrastructure optimizations, and cost-aware scaling. Those are not implementation footnotes. They are part of the model.

At scale, training means data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, checkpointing, communication overlap, hardware failures, numerical stability, and cluster utilization. If GPUs wait on each other, theoretical FLOPs do not matter. If the precision format is unstable, cost savings vanish. If the data pipeline stalls, the cluster burns money while doing nothing.
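The waiting problem reduces to simple arithmetic. This sketch treats a training step as compute time plus whatever communication and stall time is not hidden behind compute. The timings are invented for illustration, and the model optimistically assumes the compute portion itself runs at peak rate.

```python
def effective_utilization(compute_s, exposed_comm_s, stall_s=0.0):
    """Fraction of each step that the accelerators spend computing."""
    step = compute_s + exposed_comm_s + stall_s
    return compute_s / step

def effective_tflops(peak_tflops, compute_s, exposed_comm_s, stall_s=0.0):
    """Peak throughput scaled by the fraction of time spent computing."""
    return peak_tflops * effective_utilization(compute_s, exposed_comm_s, stall_s)

# Hypothetical step: 0.8 s of math, 0.2 s waiting on communication.
naive = effective_utilization(compute_s=0.8, exposed_comm_s=0.2)

# Same step with most communication overlapped behind compute.
overlapped = effective_utilization(compute_s=0.8, exposed_comm_s=0.05)

# Same step again, but the data loader stalls for 0.4 s.
starved = effective_utilization(compute_s=0.8, exposed_comm_s=0.05, stall_s=0.4)
```

Overlap lifts utilization from 80% to about 94% in this example, and a stalled data pipeline drags it down to 64% regardless of how fast the hardware is.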

A small LLM teaches the algorithm. A frontier model teaches that the algorithm must survive contact with hardware.

reasoning is post-training, not just pretraining

A basic LLM learns next-token prediction. Instruction tuning can make it more helpful, but the core training story is usually pretraining plus supervised fine-tuning.

DeepSeek-R1 made the post-training story impossible to ignore. The R1 work emphasized reinforcement learning for reasoning, including long reasoning trajectories and stronger problem solving. Whether a model exposes its chain of thought or summarizes it, the training objective has changed. The system is no longer trained only to continue text. It is shaped to solve problems.
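One ingredient of this style of training can be shown in a few lines: sample several answers to the same prompt, score them, and rate each answer relative to its group. The sketch below covers only that scoring step, in the spirit of the group-relative approach described in the DeepSeek papers. It is not the full algorithm, and the rewards are invented.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four sampled answers to one math problem,
# rewarded 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers get a positive advantage and incorrect ones a negative advantage, so the update pushes the model toward whatever its better samples did. No per-token label says what the right reasoning was. The signal comes from outcomes.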

Modern frontier models are built in phases:

pretraining on huge token corpora
supervised fine-tuning
reinforcement learning or preference optimization
rejection sampling and distillation
tool-use and instruction-following tuning
safety and behavior shaping

The from-scratch LLM teaches you how language modeling works. DeepSeek-style systems show that capability is also a post-training product.

hardware shapes the model

Beginners compare parameter counts. Practitioners compare constraints.

MoE addresses compute. MLA addresses cache memory. FP8 addresses throughput and storage. Parallelism addresses cluster utilization. Quantization addresses inference economics. Context length drives memory bandwidth and serving cost.
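The storage side of that trade is plain arithmetic. This sketch counts weight bytes only, using the 671B total quoted earlier. It ignores activations, optimizer state, the KV cache, and any scaling metadata that low-precision formats carry.

```python
def weight_storage_gb(num_params, bits_per_param):
    """Gigabytes (decimal) needed to hold the weights alone."""
    return num_params * bits_per_param / 8 / 1e9

total_params = 671e9  # total parameter count quoted for DeepSeek-V3

storage = {
    bits: weight_storage_gb(total_params, bits)
    for bits in (16, 8, 4)
}

for bits, gb in storage.items():
    print(f"{bits:>2}-bit weights: {gb:,.1f} GB")
```

Halving the bits halves the bytes that must be stored and moved, which is why precision formats show up in the same list as architectural choices.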

A toy GPT can ignore most of this. If it is slow, that is fine. If it wastes memory, reduce the batch size. If it only runs on a small dataset, the point is still educational.

A frontier model cannot be that casual. Inefficiency becomes millions of dollars. An architecture that is 5% cheaper to train or 20% cheaper to serve can be more important than a small benchmark bump.

This is why DeepSeek attracted attention. The public papers are not only about model quality. They are about the economics of intelligence.

what the toy model still teaches best

It would be wrong to dismiss the small from-scratch model as obsolete.

The toy model is the microscope. It shows the cells.

You learn why causal masks exist. You learn why positional embeddings matter. You learn why attention cost grows quadratically with sequence length. You learn how logits become probabilities. You learn why temperature changes sampling. You learn why instruction fine-tuning changes the interface without replacing the base model.
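Two of those lessons, logits becoming probabilities and temperature changing sampling, fit in a short sketch. The logits are invented and the function names are mine.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=None):
    """Draw one token index according to the tempered distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
token = sample_token(logits, temperature=0.8, rng=random.Random(0))
```

Low temperature concentrates probability on the top-scoring token. High temperature spreads it out, so sampling becomes more varied and less predictable.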

Those ideas remain useful when reading frontier papers. MoE modifies the feed-forward block. KV cache optimizes autoregressive decoding. MLA compresses attention state. Reasoning RL changes the post-training objective. FP8 changes the numerical substrate.

Without the small model, frontier systems look like a wall of acronyms. With the small model, each acronym has a place to attach.

what DeepSeek teaches beyond the textbook

DeepSeek teaches the parts of LLMs that do not fit inside a clean educational implementation.

First, scale changes the objective. The question is not merely “can we train it?” It is “can we train it efficiently enough to matter?”

Second, sparsity matters. You can increase total capacity without activating every parameter for every token.

Third, inference is part of architecture. KV cache size, batching, context length, quantization, and serving economics are first-class design constraints.

Fourth, post-training creates product behavior. Reasoning models are shaped by objectives beyond next-token prediction.

Fifth, infrastructure is research. Distributed training, precision formats, kernels, communication schedules, and failure handling are not boring engineering afterthoughts. They decide what models can exist.

the real comparison

DeepSeek versus an LLM from scratch is not “advanced versus simple.” It is a ladder.

The educational model answers: what is an LLM made of?

Scaling laws answer: what happens when we make it bigger, train it longer, and feed it more data?

DeepSeek-style systems answer: how do we make frontier capability sparse, cacheable, trainable, routable, and economically deployable?

That is the useful mental model. Raschka’s book gives you the clean room. DeepSeek gives you the power plant.

Both matter. If you only study the power plant, you may miss the physics. If you only study the clean room, you may miss the economics. Frontier AI is the combination: transformer mechanics plus industrial systems engineering.