
The Transformer Architecture
Introduced by Vaswani et al. in the 2017 paper 'Attention Is All You Need,' the transformer is the architectural foundation of nearly every frontier AI system today — GPT, Claude, Gemini, Llama, AlphaFold, and Stable Diffusion's text encoder all rely on it.
Key facts
- Introduced in 'Attention Is All You Need' (Vaswani et al., NeurIPS 2017).
- Self-attention connects any two positions with an O(1) path length, which helps capture long-range dependencies.
- Highly parallelizable on modern GPU/TPU hardware.
- Attention cost that grows quadratically with sequence length (compute, and memory in naive implementations) is the primary scaling bottleneck.
- Now standard across text, vision, audio, biology, and robotics.
Self-Attention: The Core Mechanism
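Self-attention lets each position in a sequence directly relate to every other position via learned query, key, and value projections. The attention score between two positions is the dot product of one position's query vector and the other's key vector, scaled by 1/sqrt(d_k), where d_k is the key dimension, and normalized with a softmax over all keys. The resulting weights are used to take a weighted average of the value vectors.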
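To make the mechanism concrete, here is a minimal single-head sketch in plain Python. It assumes the query, key, and value vectors have already been produced by the learned projections; the toy numbers and the function names are illustrative, not taken from any particular library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys: lists of n vectors of length d_k
    values:        list of n vectors of length d_v
    Returns (outputs, weights), one row per query position.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Score this query against every key, then normalize over keys.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy example: 3 positions, d_k = d_v = 2 (arbitrary illustrative values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, and the third position puts the most weight on the key that best matches its query. Production implementations compute the same quantities as batched matrix multiplications rather than Python loops.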
Multi-head attention runs many attention operations in parallel (typically 8 to 128 heads). Each head works on its own lower-dimensional projection of the input and can learn a different kind of relationship, such as syntactic dependencies, coreference, or long-range topical links.
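The head-splitting idea can be sketched as follows. This is a simplified illustration under a stated assumption: the inputs are treated as already projected, and the learned per-head input projections and the final output projection are omitted. The helper names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def single_head(q, k, v):
    """Scaled dot-product attention over lists of per-position vectors."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for qi in q:
        w = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def split_heads(x, num_heads):
    """Slice each d_model-wide vector into num_heads chunks of equal width."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x]
            for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    """Run attention independently per head, then concatenate per position.

    Assumption: q, k, v are already projected; learned projections are omitted.
    """
    per_head = zip(split_heads(q, num_heads),
                   split_heads(k, num_heads),
                   split_heads(v, num_heads))
    heads = [single_head(qh, kh, vh) for qh, kh, vh in per_head]
    return [[val for head in heads for val in head[pos]]
            for pos in range(len(q))]

# Toy input: 3 positions, d_model = 4, split into 2 heads of width 2.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
y = multi_head_attention(x, x, x, num_heads=2)
```

Because each head sees only its own slice, the heads produce independent attention patterns; the output keeps the original width of 4.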
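Unlike recurrent networks, where a signal between two positions must pass through every intermediate step (O(n) path length), attention connects any two positions in a single step (O(1) path length). This makes long-range dependencies, which RNNs and LSTMs struggled with, much easier to learn.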
The Transformer Block
A standard transformer block applies multi-head self-attention followed by a position-wise feedforward network, with each sublayer wrapped in a residual connection and layer normalization. Stacking dozens to hundreds of these blocks produces a deep model.
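The wiring of one block can be sketched as below. This sketch assumes the post-norm ordering of the original paper, LayerNorm(x + Sublayer(x)); many later models instead normalize before each sublayer. The two sublayers are passed in as functions, and the stand-ins used in the example (`uniform_attention`, `relu_ffn`) are hypothetical placeholders with no learned weights.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(xi, si)])
            for xi, si in zip(x, sublayer_out)]

def transformer_block(x, attention_fn, feedforward_fn):
    """Post-norm block: h = LN(x + Attn(x)); y = LN(h + FFN(h))."""
    h = add_and_norm(x, attention_fn(x))
    # The feedforward network is applied to each position independently.
    return add_and_norm(h, [feedforward_fn(pos) for pos in h])

# Hypothetical stand-in sublayers, for illustration only.
def uniform_attention(x):
    """Every position attends equally to all positions (plain average)."""
    n, d = len(x), len(x[0])
    avg = [sum(row[c] for row in x) / n for c in range(d)]
    return [avg[:] for _ in range(n)]

def relu_ffn(vec):
    """Position-wise stand-in: a bare ReLU."""
    return [max(0.0, v) for v in vec]

x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
y = transformer_block(x, uniform_attention, relu_ffn)
```

The block preserves the input shape, which is what allows many of them to be stacked.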
Positional information is injected via sinusoidal embeddings (original paper), learned embeddings, or rotary positional embeddings (RoPE) used in Llama, GPT-NeoX, and most modern open-weight models.
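As an illustration of the first option, the sinusoidal scheme from the original paper sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch, with a hypothetical function name and toy sizes:

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        # even_index plays the role of 2i in the formula.
        for even_index in range(0, d_model, 2):
            angle = pos / (10000 ** (even_index / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row[:d_model])  # trim if d_model is odd
    return table

pe = sinusoidal_encoding(num_positions=3, d_model=4)
```

Each pair of dimensions oscillates at a different frequency, so every position gets a distinct vector that is added to the token embedding. RoPE takes a different approach, rotating query and key vectors by position-dependent angles inside attention; it is not shown here.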
Why It Scaled
Transformers process all sequence positions in parallel during training, whereas RNNs must step through the sequence one position at a time. This maps well onto modern GPU and TPU hardware and was a key enabler of training at unprecedented scale.
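The cost of attention, which grows quadratically with sequence length, remains the architecture's main limitation. Active responses include FlashAttention (Dao et al., 2022), which avoids materializing the full attention matrix in memory; grouped-query attention, which shrinks the key-value cache; sliding-window attention, which restricts each position to a local neighborhood; and state-space models and hybrids (Mamba).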
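A back-of-the-envelope calculation shows why the quadratic term matters. The sketch below assumes a naive implementation that stores the full n x n score matrix for every head, in 16-bit precision; the head count and sequence lengths are illustrative assumptions, not figures for any specific model.

```python
def attention_matrix_bytes(seq_len, num_heads, bytes_per_element=2):
    """Memory for the full score matrices of one layer and one sequence.

    Assumes a naive implementation: one seq_len x seq_len matrix per head.
    """
    return seq_len * seq_len * num_heads * bytes_per_element

GIB = 2 ** 30
for n in (4096, 8192, 16384):
    size = attention_matrix_bytes(n, num_heads=32)
    print(f"seq_len={n:>6}: {size / GIB:.1f} GiB per layer")
# seq_len=  4096: 1.0 GiB per layer
# seq_len=  8192: 4.0 GiB per layer
# seq_len= 16384: 16.0 GiB per layer
```

Doubling the sequence length quadruples this term, which is why approaches that never store the full matrix, or that limit which positions attend to each other, are attractive.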
Beyond Language
Vision Transformers (ViT, Dosovitskiy et al., 2020) treat image patches as tokens and now match or exceed CNNs on many benchmarks, particularly when pre-trained on large datasets. AlphaFold 2 uses a transformer-based architecture (Evoformer) to predict protein structures. Audio (Whisper), video (Sora-class models), and robotic control (RT-2, π0) all use transformer backbones.
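The "patches as tokens" step can be sketched in a few lines. This is a simplified single-channel illustration with a hypothetical helper name; in a real ViT each flattened patch is then linearly projected and combined with a position embedding before entering the transformer.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

def num_patches(height, width, patch_size):
    """Number of tokens produced for an image of the given size."""
    return (height // patch_size) * (width // patch_size)

# Toy 4x4 "image" whose pixel values are 0..15, cut into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(tokens[0])                  # [0, 1, 4, 5]
print(num_patches(224, 224, 16))  # 196
```

A 224 x 224 image with 16 x 16 patches therefore becomes a sequence of 196 tokens, which the transformer processes much as it would a sentence.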
The transformer has become a domain-general computational substrate — the closest thing modern ML has to a universal architecture.
Frequently asked
Why are transformers so powerful?
They combine expressive attention-based token mixing with hardware-friendly parallelism, scaling more reliably with data and compute than prior architectures.
What's next after transformers?
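State-space models (Mamba, Mamba-2), linear attention variants, and hybrid architectures are active research directions. Mixture-of-experts models (Mixtral, DeepSeek-V3) are another, although they modify the feedforward layers inside a transformer rather than replace the architecture. Transformers remain dominant in 2026.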
What is the context window?
The maximum number of tokens a transformer can attend to at once. Frontier models in 2025–2026 support 200K to 2M+ tokens via techniques like sliding-window attention, position interpolation, and ring attention.
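As one example of these techniques, sliding-window attention can be expressed as a mask over query-key pairs. The sketch below assumes a causal variant in which each position sees itself and the positions immediately before it; the window size and function names are illustrative.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Position j must not be in the future (j <= i) and must fall within
    the last `window` positions (j > i - window).
    """
    return [[(i - window < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]

def allowed_pairs(mask):
    """Count the query-key pairs the mask permits."""
    return sum(sum(row) for row in mask)

mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# .xx..
# ..xx.
# ...xx
```

The number of permitted pairs grows roughly as seq_len times window rather than seq_len squared, which is what makes longer contexts cheaper to process.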
