Attention Is All You Need

Vaswani et al. · 2017 · NeurIPS

Introduced the Transformer, replacing recurrence with self-attention and reshaping modern AI.

Research objective

Demonstrate that a purely attention-based architecture can outperform recurrent and convolutional sequence models on translation tasks while being more parallelizable.

Methodology

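The authors proposed an encoder-decoder model built entirely from multi-head scaled dot-product attention and position-wise feed-forward layers, with no recurrence or convolution. Because attention by itself ignores token order, sinusoidal positional encodings are added to the input embeddings to supply position information. The model was trained with standard supervised learning on the WMT 2014 English-German and English-French translation benchmarks.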
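The two core ingredients named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it uses a single attention head, skips the learned query/key/value projections, and the function names and toy inputs are assumptions made for the example. The formulas themselves, softmax(QK^T / sqrt(d_k))V and the sine/cosine encoding with base 10000, follow the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-list matrices.

    Returns (outputs, weights); weights[i][j] is how much query i attends to key j.
    """
    scale = math.sqrt(len(keys[0]))  # sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


def sinusoidal_positional_encoding(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos of the same angle."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Toy example: three hypothetical 4-dimensional token embeddings.
tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
pe = sinusoidal_positional_encoding(len(tokens), 4)
x = [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, pe)]

# Self-attention: queries, keys, and values all come from the same sequence.
# (The real model first applies separate learned projections per head.)
out, attn = scaled_dot_product_attention(x, x, x)
```

Every row of `attn` sums to 1, and every output row is computed independently of the others, which is what allows all positions to be processed in parallel.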

Key findings

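Achieved state-of-the-art BLEU scores on the WMT 2014 translation tasks, including 28.4 on English-to-German, at a fraction of the training cost of earlier models.
Self-attention connects any two positions in a constant number of steps, so long-range dependencies can be modeled without the sequential bottleneck of recurrence.
Parallelism across positions makes the architecture well suited to GPU/TPU training at scale.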

Strengths

Conceptually simple and elegant: no recurrence or convolution required.
Highly parallelizable, which makes it practical to scale up model size, data, and compute.
Generalized far beyond translation, becoming the substrate of LLMs, vision transformers, and multimodal models.

Limitations

Self-attention's compute and memory costs grow quadratically with sequence length.
Positional encoding choices are heuristic and remain an active research area.
Pure attention lacks the inductive bias toward locality that is built into convolutions.
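To make the quadratic limitation concrete, the following back-of-the-envelope sketch counts the attention-weight entries for a stack of self-attention layers. The function names are hypothetical. The defaults of 8 heads and 6 layers match the paper's base encoder, while the 4-byte value size is an assumption, and the estimate deliberately ignores decoder attention, activations, and any memory-saving implementation tricks.

```python
def attention_score_elements(seq_len, num_heads=8, num_layers=6):
    """Count attention-weight entries: one seq_len x seq_len matrix per head per layer."""
    return seq_len * seq_len * num_heads * num_layers


def attention_score_bytes(seq_len, num_heads=8, num_layers=6, bytes_per_value=4):
    """Rough storage for those matrices, assuming 4-byte floats (an assumption)."""
    return attention_score_elements(seq_len, num_heads, num_layers) * bytes_per_value


for n in (512, 1024, 2048):
    mib = attention_score_bytes(n) / (1024 ** 2)
    print(f"seq_len={n}: {attention_score_elements(n)} entries, {mib:.0f} MiB")
```

Doubling the sequence length quadruples the count, which is why long-context variants of the architecture focus on reducing or avoiding the full score matrix.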

Practical implications

The Transformer is the architectural foundation of GPT, Claude, Gemini, LLaMA, and most frontier models.
Its parallelism made large-scale training practical, catalyzing the scaling-law era of AI research.
