
Machine Learning: The Foundations
Machine learning is the engine behind most modern AI: algorithms that learn patterns from data rather than following hand-coded rules. The term was coined by Arthur Samuel in 1959, and the field now powers search, recommendation, translation, vision, speech, and the generative AI stack.
Key facts
- Term 'machine learning' coined by Arthur Samuel at IBM in 1959.
- Three paradigms: supervised, unsupervised (incl. self-supervised), reinforcement.
- Generalization — not training accuracy — is the goal.
- Bias-variance tradeoff governs model complexity choices.
- Gradient-boosted trees (XGBoost, LightGBM) remain competitive with, and often better than, deep learning on many tabular benchmarks.
- Most ML production failures are data and deployment problems, not modeling problems.
Three Learning Paradigms
Supervised learning maps inputs to labeled outputs and accounts for the majority of deployed ML — image classification, spam detection, credit scoring, medical triage. It depends on large, accurately labeled datasets and well-defined targets.
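As a rough illustration of mapping inputs to labeled outputs, the sketch below trains a one-feature logistic regression by gradient descent on a made-up, linearly separable dataset. The data, the learning rate, and the step count are assumptions chosen for the example, not values from any real system.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, steps=200):
    """Fit weight w and bias b by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every labeled example under the current model.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(x, w, b):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Toy labeled data (hypothetical): negative inputs are class 0, positive are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
```

Calling `predict(1.5, w, b)` and `predict(-1.5, w, b)` classifies inputs the model never saw during training, which is the property the next section is concerned with.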
Unsupervised learning discovers structure without labels — clustering customers, learning embeddings, compressing data. Self-supervised learning, a special case where the data supplies its own labels, is the regime that produced modern foundation models like GPT and CLIP.
Reinforcement learning optimizes behavior through reward feedback in sequential decision problems. It underpins game-playing systems (AlphaGo, AlphaZero), robotics control, recommender ranking, and the RLHF stage of frontier LLM training.
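To make "reward feedback in sequential decision problems" concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-cell corridor. The environment, the discount factor, the learning rate, and the purely random exploration policy are all assumptions for illustration; systems such as AlphaGo use far more elaborate methods.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # action 0 steps left, action 1 steps right
GAMMA = 0.9           # discount factor (assumed value)
ALPHA = 0.5           # learning rate (assumed value)

def step(state, action_index):
    """Deterministic toy environment: reward 1.0 only on reaching the last cell."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore uniformly at random; Q-learning is off-policy, so it can
            # still learn the values of the greedy policy from this behavior.
            action = rng.randrange(2)
            next_state, reward, done = step(state, action)
            # No bootstrapping from the terminal state.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy action per non-terminal cell (1 means "step right").
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

The agent is never told that moving right is correct; it infers this from the single reward at the end of the corridor, propagated backward through the value estimates.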
Generalization, Overfitting, and the Bias-Variance Tradeoff
The central scientific challenge of ML is generalization — performing well on data not seen during training. A model that perfectly memorizes the training set but fails on new inputs is useless.
The bias-variance tradeoff frames this: high-bias models underfit (too simple to capture the pattern); high-variance models overfit (capturing noise as if it were signal). Regularization (L1/L2, dropout, early stopping, data augmentation) and held-out validation are the standard tools for navigating it.
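- A training/validation/test split guards against optimistically biased performance estimates: tune on the validation set and report on a test set used only once.
- Cross-validation is the standard way to estimate performance on small datasets, where a single held-out split is too noisy.
- Modern over-parameterized deep networks can generalize well even when they fit the training data exactly; test error can fall again as capacity grows past the interpolation point, the 'double descent' phenomenon documented by Belkin et al. (2019).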
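The interaction between held-out evaluation and regularization can be sketched in a few lines. The example below is a minimal illustration, assuming a one-parameter linear model through the origin and a hypothetical noise-free dataset; the helper names (`k_fold_splits`, `fit_slope`, `cross_val_mse`) are invented for this sketch and are not from any library.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def fit_slope(xs, ys, l2=0.0):
    """Least-squares slope through the origin with an optional L2 penalty."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

def cross_val_mse(xs, ys, k, l2=0.0):
    """Average validation mean squared error over k folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        w = fit_slope([xs[i] for i in train_idx], [ys[i] for i in train_idx], l2)
        errs = [(ys[i] - w * xs[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / len(fold_errors)

# Hypothetical noise-free data with true slope 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
unregularized = cross_val_mse(xs, ys, k=3)
over_regularized = cross_val_mse(xs, ys, k=3, l2=10.0)
```

On this clean data the L2 penalty only adds bias, so the regularized model scores worse on the validation folds; on noisy data with a flexible model the comparison can go the other way, which is why the penalty strength is usually chosen by validation. In practice the samples would normally be shuffled before splitting.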
The Practical Pipeline
Production ML systems are infrastructure, not just models. They combine data collection, labeling, feature engineering or representation learning, model selection, training, evaluation, monitoring, and continuous retraining.
Google's widely cited 'Hidden Technical Debt in Machine Learning Systems' (Sculley et al., 2015) argued that the ML model code is typically a small fraction of a production system's total code. Most production failures arise in the data and deployment stages (distribution shift, label drift, broken pipelines) rather than in the model itself.
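One simple form of monitoring for distribution shift is to compare a feature's live values against its training-time statistics. The sketch below is an assumption-laden illustration rather than a recommended production check: the function names, the sample values, and the threshold of 1.0 training standard deviations are all hypothetical.

```python
import statistics

def mean_shift_score(train_values, live_values):
    """Distance of the live mean from the training mean, in training standard deviations."""
    train_mean = statistics.fmean(train_values)
    train_std = statistics.stdev(train_values)
    return abs(statistics.fmean(live_values) - train_mean) / train_std

def has_drifted(train_values, live_values, threshold=1.0):
    """Flag a feature whose live mean moved more than `threshold` (assumed value)."""
    return mean_shift_score(train_values, live_values) > threshold

# Hypothetical feature values seen at training time and in two live batches.
train_feature = [10, 12, 11, 13, 9, 11, 10, 12]
stable_batch = [11, 10, 12, 11]
shifted_batch = [15, 16, 14, 15]

stable_flag = has_drifted(train_feature, stable_batch)
shifted_flag = has_drifted(train_feature, shifted_batch)
```

A check like this only catches shifts in a single feature's mean; real monitoring typically also tracks variance, missing-value rates, and label distributions.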
Classical ML vs Deep Learning
Classical methods — linear/logistic regression, decision trees, gradient-boosted trees (XGBoost, LightGBM), SVMs, random forests — still dominate tabular data and many enterprise problems where datasets are small and interpretability matters.
Deep learning dominates perception (vision, speech) and language because it learns representations end-to-end from raw, high-dimensional inputs. The two paradigms increasingly coexist in hybrid stacks.
Frequently asked
Is ML the same as AI?
ML is the dominant subfield of AI today, but AI also includes symbolic reasoning, search, planning, and knowledge representation. All ML is AI; not all AI is ML.
Do we need huge datasets?
Deep learning trained from scratch typically does. Classical ML can work with hundreds or thousands of examples, and transfer learning and pretrained foundation models dramatically lower data requirements for new tasks.
What is self-supervised learning?
A training regime where the model creates its own labels from the data structure — for example, predicting the next word in a sentence. It powers modern LLMs and vision models like DINO.
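The next-word example can be sketched directly: the labels come from the text itself, with no human annotation. The function below is a hypothetical helper written for this illustration, and it uses whitespace-separated words and a fixed context window as simplifying assumptions (real LLMs operate on subword tokens and much longer contexts).

```python
def next_word_pairs(sentence, context_size=2):
    """Turn raw text into (context, target) training pairs for next-word prediction."""
    words = sentence.split()
    return [
        (tuple(words[i - context_size:i]), words[i])
        for i in range(context_size, len(words))
    ]

pairs = next_word_pairs("the cat sat on the mat")
# pairs[0] is (("the", "cat"), "sat"): the context is the input, the next word is the label.
```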
How is ML different from statistics?
Statistics emphasizes inference and explanation under explicit models; ML emphasizes prediction and generalization, often with flexible non-parametric models. The fields overlap heavily and increasingly converge.
