This post is a working map of machine learning: the four training paradigms (supervised, unsupervised, reinforcement, self-supervised), where deep learning fits in, and how modern systems like LLMs and recommenders are assembled from those pieces. It closes with a concrete path for learning the field in practice.

1. What machine learning is

Machine learning inverts the relationship between programs and data. In ordinary programming you write rules and run them over data. In machine learning you supply the data and the desired behavior, and the system produces the rules. Everything else is mechanics.

A toy contrast. Suppose you want to decide which incoming emails are spam. The traditional approach is to list features and thresholds yourself: contains the word “viagra,” sender domain less than 30 days old, more than three links per kilobyte of text. You score each email against those rules and flag anything over a threshold. You wrote the rules. They work until spammers adapt and you write more rules, and the rule list grows without bound.
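To make the hand-written approach concrete, here is a minimal sketch of such a scorer. The `Email` fields, the one-point-per-rule scoring, and the threshold of 2 are assumptions made for illustration; only the three rules themselves come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Email:
    # Hypothetical fields; a real mail pipeline would expose far more.
    body: str
    sender_domain_age_days: int
    link_count: int


def rule_based_spam_score(email: Email) -> int:
    """Add one point for each hand-written rule that fires."""
    score = 0
    if "viagra" in email.body.lower():
        score += 1
    if email.sender_domain_age_days < 30:
        score += 1
    # Link density: links per kilobyte of body text (guarding against empty bodies).
    kilobytes = max(len(email.body.encode("utf-8")) / 1024, 1e-9)
    if email.link_count / kilobytes > 3:
        score += 1
    return score


def is_spam(email: Email, threshold: int = 2) -> bool:
    """Flag the email once enough rules have fired (threshold is an assumption)."""
    return rule_based_spam_score(email) >= threshold
```

Every new spam tactic means another `if` in `rule_based_spam_score`, which is the maintenance problem the ML approach tries to avoid.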

The ML approach is to collect a few hundred thousand emails that are already labeled spam-or-not, hand them to an algorithm with no preconceptions, and let it search for whichever rule fits the data best. You don’t write “viagra appears” or “domain age matters.” The model finds whichever combination of word frequencies, header patterns, and stylistic signals best predicts the label, including signals you wouldn’t have thought of.

The artifact you build is called a model. It’s a parametric function: given an input vector, it produces an output. Training is the process of fitting the parameters to the data. Inference is running the trained model on new inputs.
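The smallest possible instance of those three terms is a straight line with two parameters. The sketch below is an illustration, not a recipe: it assumes a one-dimensional input rather than a vector, and it fits the parameters with the closed-form least-squares solution, which is only one of many ways to train. The function names are made up for this example.

```python
def fit_line(xs, ys):
    """Training: pick slope w and intercept b that minimize squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


def predict(params, x):
    """Inference: run the fitted function on an input it has not seen."""
    w, b = params
    return w * x + b


# The model is the pair (w, b); the data decides their values.
params = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(params)
print(predict(params, 10.0))
```

A large neural network follows the same pattern; it just has billions of parameters instead of two and no closed-form solution.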

Philosophically it’s curve-fitting in very high dimensions. The reason it works in practice now and not in 1990 comes down to three things happening at once: vastly more data (the internet exists), vastly more compute (GPUs, TPUs), and a handful of algorithmic ideas like backpropagation, attention, and contrastive losses that turned out to scale further than anyone expected.

The rest of this post walks through the paradigms that organize how data and signal get fed into those algorithms, the architectures that represent the models, and the recipes modern systems combine them into.

2. Supervised learning: data with answer keys

Supervised learning is the most familiar paradigm because it looks the most like normal engineering. You collect examples where you know the right answer, show them to the system, and ask it to learn the mapping from input to answer.

Labeled data means each example is a pair (input, correct_output). Twenty thousand emails each marked spam or not. A hundred thousand house listings each with a sold price. Ten million images each tagged with which of a thousand categories they belong to. The labels are the answer key.

There are two flavors based on what kind of output you want:

Classification: the output is one of a fixed set of categories. Spam or not. Cat, dog, or horse. Approve, reject, or refer for manual review.
Regression: the output is a continuous value. House price, expected revenue, time-to-failure for a part.

The same general shape applies to both. You hand a library a feature matrix X and a label vector y, you call fit, and you get a model you can call predict on:

from sklearn.linear_model import LogisticRegression

# X_train: feature vectors for past emails
# y_train: 0/1 labels (not spam / spam)
# X_test: feature vectors for emails the model has not seen
model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

That three-line shape (construct the model, fit(X, y), then predict(X_new)) is the de facto supervised-learning interface; almost every library you’ll meet exposes some version of it. Most of the work happens before and after: cleaning data into X and y, choosing a model class, tuning hyperparameters, and evaluating the result.

The single most important habit in supervised learning is the train/test split. You hold some labeled examples back from training and only use them to check that the model generalizes (that it learned the pattern, not the specific examples). A model that scores 99% on the training set and 60% on the held-out set has memorized rather than learned. The technical name for this is overfitting, and it’s the failure mode you’ll spend most of your time guarding against.
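The gap between training and held-out scores is easy to reproduce. The sketch below is a toy under stated assumptions: a synthetic one-dimensional dataset whose labels are flipped 20% of the time, and a deliberately memorizing model (1-nearest-neighbor) written by hand. None of the names here come from a library.

```python
import random


def make_dataset(n, noise, rng):
    """Points in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def split_train_test(data, test_fraction, rng):
    """Shuffle, then hold back a fraction of the examples for evaluation only."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """A memorizing model: copy the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, examples):
    correct = sum(predict_1nn(train, x) == y for x, y in examples)
    return correct / len(examples)


rng = random.Random(0)
data = make_dataset(500, noise=0.2, rng=rng)
train, test = split_train_test(data, test_fraction=0.2, rng=rng)

train_acc = accuracy(train, train)  # each point is its own nearest neighbor
test_acc = accuracy(train, test)    # held-out points expose the memorization
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The model scores perfectly on the data it memorized, noise included, and worse on the held-out set. Without the split you would only ever see the first number.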

Where it fits: anywhere you can collect labels at sufficient volume. Spam, credit decisions, click prediction, image tagging, medical screening, structured information extraction. If you can describe what the right answer looks like and you have history, supervised learning is the default starting point.

3. Unsupervised learning: structure without labels

Unsupervised learning is what you do when you don’t have labels and either can’t get them or don’t need them. You hand the algorithm a pile of data and ask it to find structure.

Three common shapes:

Clustering. Group similar items together. K-means is the canonical example: you tell the algorithm how many groups to find, it assigns each point to a cluster, and you inspect what the clusters mean afterwards. Customer segmentation, document organization, topic discovery in support tickets.
Dimensionality reduction. Take vectors with a thousand features and produce a faithful representation in two or fifty dimensions. PCA (principal component analysis) is the linear version: it keeps the directions along which the data varies most. t-SNE and UMAP are nonlinear methods more often used for visualization; they aim to keep points that are neighbors in the original space close together in the reduced one. All three compress while trying to preserve the structure you care about.
Anomaly detection. Most points cluster together, outliers stand apart. Fraud detection, network intrusion, manufacturing defect screening. The model learns “normal” and flags deviations.
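Two of those shapes fit in one short sketch: k-means clustering, followed by a crude anomaly check that flags points far from every cluster center. This is a plain-Python illustration with assumptions worth naming: the six points are invented, the initial centroids are picked by hand rather than at random, and the `radius` cutoff is arbitrary.

```python
import math


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


def mean_point(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    assignments = [nearest(p, centroids) for p in points]
    return centroids, assignments


def is_anomaly(point, centroids, radius):
    """Flag a point that is farther than `radius` from its nearest centroid."""
    return distance(point, centroids[nearest(point, centroids)]) > radius


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignments)
print(is_anomaly((5.0, 5.0), centroids, radius=2.0))
```

No labels appear anywhere in the code. What the two clusters mean is something you decide afterwards by looking at their members.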

The reasonable question is: how do you know it’s right if there are no labels? Two answers.

First, sometimes the structure itself is the deliverable. The clusters are what the marketing team wanted. The anomalies are what the fraud team will investigate. There is no separate “ground truth”; the segments are the answer.

Second, labels are often expensive. You may have ten million events and zero appetite for hand-labeling. Unsupervised methods give you something to look at, which is often the starting point for a labeling pipeline that then enables supervised work downstream.

There’s no firm boundary between unsupervised and supervised, by the way. Modern systems mostly do something in between, which is the subject of §5 on self-supervised learning.

4. Reinforcement learning: learning from consequences

Reinforcement learning is structurally different from the two paradigms above. There’s no fixed dataset. An agent acts in an environment, receives a numeric reward signal, and adjusts its behavior over time to maximize total reward.

The vocabulary:

Agent is the thing taking actions.
Environment is everything outside the agent.
State is what the agent observes at a given moment.
Action is what the agent does next.
Reward is a scalar signal the environment returns.
Policy is the agent’s rule for picking actions given states.

The loop repeats for as long as training runs. The agent observes a state, picks an action according to its current policy, receives a reward and a new state, updates its policy, and repeats. Over millions of these steps the policy either improves on the task or it stalls, with rewards plateauing or oscillating in ways that take real effort to diagnose.
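Here is that loop written out for a toy problem. The sketch uses tabular Q-learning, which is one specific RL algorithm chosen for brevity and not the only way to update a policy. The five-cell corridor environment, the reward of 1 at the right-hand end, and all hyperparameters are assumptions invented for the example.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)  # step left, step right


def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q, state, epsilon, rng):
    """Policy: mostly greedy on current estimates, occasionally random."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    best = max(q[state])
    return rng.choice([a for a, value in enumerate(q[state]) if value == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # value estimate per (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, ACTIONS[action])
            # Q-learning update: move the estimate toward reward plus discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(greedy_policy)
```

The agent is never told that "right" is correct. It only sees a reward at the end of the corridor, and the discount factor `gamma` is what spreads that reward back to the earlier steps.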

Two things make this hard in ways the supervised case isn’t. The data distribution is non-stationary because it depends on the policy, which is itself changing as training proceeds. And credit assignment is delayed: the agent wins a game after eighty moves, but which of those moves actually mattered? Most of RL theory exists to address some version of those two problems.

Where it shows up: AlphaGo and game-playing systems, robotics, ad bidding, recommender systems that optimize over a session rather than a single click, and most recently, the RLHF (reinforcement learning from human feedback) stage of training large language models. RLHF uses the same machinery not to learn to play a game but to nudge a pre-trained language model toward outputs humans prefer.

5. Self-supervised learning: how LLMs got their training data

Self-supervised learning generates its own labels from the structure of the data, which is why the internet became a usable training corpus. It’s also, arguably, the most consequential recent idea in the field, and the one most worth understanding if you want to know how modern AI works.

The setup looks like supervised learning: you train on (input, label) pairs and minimize a prediction loss. The difference is where the labels come from. You generate them from the data itself.

The canonical example is next-token prediction in language modeling, which is also the backbone of every modern LLM (see Introduction to NLP: How Computers Understand Human Language for how this developed inside NLP). Take a chunk of text. The first hundred tokens are the input; the hundred-and-first is the label. Slide the window forward and repeat. Every sentence on the internet is now a stack of (context, next-token) pairs. The labels weren’t created by humans; they were already there, implicit in the structure of the text. Nobody had to be paid to make them.
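The sliding window is only a few lines of code. This sketch assumes a whitespace tokenizer and a context of three tokens to keep the output readable; real language models use subword tokenizers and much longer contexts. The function name is made up for the example.

```python
def next_token_pairs(tokens, context_size):
    """Turn a token sequence into (context, next_token) training pairs."""
    pairs = []
    for end in range(1, len(tokens)):
        start = max(0, end - context_size)
        pairs.append((tokens[start:end], tokens[end]))
    return pairs


# Whitespace splitting stands in for a real tokenizer (an assumption).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, label in pairs:
    print(context, "->", label)
```

Six tokens yield five labeled examples, and none of them needed an annotator.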

Other variants follow the same pattern. Masked language modeling (the BERT family) blanks out random tokens in a sentence and asks the model to fill them in. Contrastive image-text pretraining (CLIP and friends) pairs an image with its caption from the web and teaches the model to recognize matching pairs against mismatched ones. In each case the supervision signal comes from a structural property of the data, not from human annotation.
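Masked language modeling builds its labels the same way. The sketch below is a simplified illustration: the `[MASK]` placeholder string, the masking probability, and the whitespace tokenizer are assumptions, and production recipes add refinements this version leaves out.

```python
import random

MASK = "[MASK]"  # placeholder string, assumed for illustration


def mask_tokens(tokens, mask_prob, rng):
    """Hide random tokens; the hidden originals become the labels."""
    masked = list(tokens)
    labels = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = token  # what the model must recover at position i
            masked[i] = MASK
    return masked, labels


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=rng)
print(masked)
print(labels)

# Putting the labels back reproduces the original sentence.
restored = [labels.get(i, token) for i, token in enumerate(masked)]
```

The model sees `masked` as input and is scored only on the positions stored in `labels`.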

The reason this mattered is the labeling bottleneck. In classical supervised learning, you could only train on as much labeled data as you could afford to produce. Self-supervised pretraining moved the bottleneck to compute and raw data scale, both of which are easier to throw money at than human labelers. The public web, which holds an enormous amount of human-written text, became a usable training corpus.

Everything you currently know as “modern AI” (GPT, Claude, Gemini, Llama, the open-weights ecosystem, image-text models like CLIP and DALL-E) exists because self-supervised pretraining works at scale.

6. Where deep learning fits

Deep learning is a family of models, not a fifth learning paradigm. It’s neural networks with many layers, trainable under any of the four paradigms above. The phrase “AI, ML, deep learning” sometimes gets presented as a hierarchy of three increasingly fancy things, and that framing is misleading. The right mental model is paradigm on one axis, architecture on the other.

A neural network is a parametric function. You stack alternating linear transformations (matrix multiplies) and nonlinearities (functions like ReLU that bend the output curve). Training adjusts the weights of those matrices using gradient descent on a loss function. With enough layers and the right architectural choices, the network learns to extract useful features automatically. The old ML pipeline of “hand-craft features, then train a small model on them” collapses into “let the network find the features.”
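Stripped of any framework, "alternating linear transformations and nonlinearities" looks like the sketch below. The weights are hand-picked numbers chosen so the arithmetic can be checked by eye; in a real network they would start random and be set by training. The helper names are made up for this example.

```python
def linear(x, weights, bias):
    """One linear layer: a matrix multiply plus a bias, y = W x + b."""
    return [
        sum(w * x_j for w, x_j in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    """Nonlinearity: pass positive values through, clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def forward(x, params):
    """Two-layer network: linear, ReLU, linear."""
    hidden = relu(linear(x, params["W1"], params["b1"]))
    return linear(hidden, params["W2"], params["b2"])


# Hand-picked parameters (assumed for illustration): 2 inputs, 2 hidden units, 1 output.
params = {
    "W1": [[1.0, -1.0], [0.5, 0.5]],
    "b1": [0.0, -1.0],
    "W2": [[2.0, -3.0]],
    "b2": [0.5],
}

print(forward([2.0, 1.0], params))
print(forward([1.0, 2.0], params))
```

For the second input the first hidden unit goes negative and ReLU zeroes it out, so that unit contributes nothing to the output. That input-dependent switching is what the nonlinearity buys you over a single matrix multiply.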

In PyTorch, a basic feedforward network looks like this:

import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )

    def forward(self, x):
        return self.layers(x)

Three linear layers, two ReLU nonlinearities between them. The input is a 784-dim vector (a flattened 28x28 image), the output is 10 numbers (one per digit class, 0 through 9). Training calls model(x), compares the output against the label, computes gradients with backpropagation, nudges the weights, and repeats, typically for thousands of steps or more. That’s the whole loop; the practical version adds batching, learning-rate scheduling, and validation passes on top.
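"Compares the output against the label" deserves one concrete instance. The post doesn't name a loss function, so the sketch below assumes softmax cross-entropy, a common choice for a 10-way classifier and the quantity PyTorch's `nn.CrossEntropyLoss` computes for a single example. The example logits are invented.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Loss for one example: negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# Ten outputs, one per digit class.
undecided = [0.0] * 10
confident = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

uniform_loss = cross_entropy(undecided, label=3)    # model has no opinion
confident_loss = cross_entropy(confident, label=3)  # confident and right
wrong_loss = cross_entropy(confident, label=7)      # confident and wrong
print(uniform_loss, confident_loss, wrong_loss)
```

The loss is small when the network puts its weight on the right class and large when it is confidently wrong. Backpropagation then computes how each weight should move to shrink it.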

The architectures you’ll hear named most often:

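CNNs (convolutional neural networks) apply the same small filters across an input grid, so a feature is detected the same way wherever it appears (translation-equivariant feature extraction, which pooling layers turn into approximate translation invariance). Built for images, also useful for audio spectrograms and any data with grid structure.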
RNNs (recurrent neural networks) process sequences one element at a time, carrying internal state forward. LSTMs and GRUs are the practical variants. Mostly displaced by Transformers for sequence work, still alive in streaming and embedded contexts where they’re cheaper.
Transformers process sequences using attention, a mechanism where every position can directly attend to every other position. Parallelizable, scales well with model size and data, and now the backbone of essentially every state-of-the-art language model and most modern vision models too.
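Of the three, attention is the one most easily reduced to a formula. The sketch below shows scaled dot-product attention for a single head in plain Python. It is a simplification: it omits the learned query, key, and value projections, the multiple heads, and the masking that a real Transformer layer includes, and the tiny example vectors are invented.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of `values`.

    Weights come from how well the query matches each key
    (dot product, scaled by sqrt of the key dimension, then softmax).
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Identical keys: the query cannot tell positions apart, so it averages the values.
averaged = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]])

# Distinct keys: the query matches the first key and mostly copies its value.
focused = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
print(averaged, focused)
```

Each query is scored against every key independently, which is why the computation parallelizes across positions in a way a recurrent network's step-by-step state does not.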

The thing to internalize is that these architectures are orthogonal to the paradigm. You can train a Transformer with supervised data, with self-supervised next-token prediction, with reinforcement learning, or in an unsupervised contrastive setup. The architecture is how the model represents the function. The paradigm is what data and signal you use to train it.

7. How modern AI systems combine the paradigms

Once you have the paradigms and the architectures, modern systems stop looking magical and start looking like recipes. Each one is some combination of training paradigms applied to some neural-network architecture at scale.

Large language models. Stage one is self-supervised pretraining: predict the next token on a trillion-token corpus from the web. Stage two is supervised fine-tuning on a smaller set of human-written instruction-response examples. Stage three is RLHF, which nudges outputs toward responses humans rated higher. The architecture is the Transformer at every stage. Three paradigms, one model. The behavior of the resulting system at inference time is heavily shaped by how you prompt it (see A Guide to Effective LLM Prompting).

Vision models. The 2010s version was supervised CNNs trained on ImageNet. The current version is some flavor of self-supervised or weakly-supervised pretraining (CLIP-style image-text contrastive learning is one widely-used recipe) followed by supervised fine-tuning for the specific task. Architectures have largely shifted from CNNs to Vision Transformers, though CNNs are still common in production for efficiency reasons.

Recommender systems. They get less attention than LLMs and are probably more economically important. They’re hybrids: supervised learning predicts the probability you’ll click, unsupervised learning produces user and item embeddings for retrieval, and reinforcement learning sometimes optimizes the whole session rather than a single click. From the outside they look simpler than an LLM. From the inside the production pipeline has more moving parts than most LLM serving stacks.

8. A learning path: Python, micrograd, then one domain library

The shortest useful path into ML is Python plus NumPy, then PyTorch, then a from-scratch tiny neural network, then one high-level library in a single domain. Math comes last, not first.

Python is the lingua franca. NumPy first, for the array operations everything else assumes you understand. Then PyTorch, which has become the default deep-learning framework in research and most of industry. TensorFlow still exists and is still useful, especially around mobile and on-device deployment, but PyTorch is where the bulk of new work happens.

Build a tiny neural network from scratch once, backpropagation included. Andrej Karpathy’s “micrograd” walkthrough does this in a couple of hours and demystifies neural networks more than any framework tutorial. Once you can derive backprop on a graph of scalar operations, the rest of the field stops feeling like incantations.
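For a taste of what that exercise involves, here is a compressed sketch of scalar automatic differentiation. It is written for this post in the spirit of that kind of walkthrough and is not micrograd's actual code or API: the class and attribute names are illustrative, and only addition and multiplication are supported.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        """Backpropagation: apply the chain rule from the output back to the inputs."""
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# One-parameter-pair model: prediction = w * x + b, loss = (prediction - y)^2
w, b = Value(2.0), Value(1.0)
x, y = 3.0, 10.0


def squared_error(w, b):
    diff = w * x + b + (-y)
    return diff * diff


loss = squared_error(w, b)
loss.backward()
grad_w, grad_b = w.grad, b.grad

# One gradient-descent step: move each parameter against its gradient.
learning_rate = 0.01
w.data -= learning_rate * w.grad
b.data -= learning_rate * b.grad
new_loss = squared_error(w, b)
print(loss.data, new_loss.data)
```

The gradients match what you get by hand: with a prediction of 7 against a target of 10, the derivative of the loss is 2 × (−3) × 3 = −18 with respect to `w` and 2 × (−3) = −6 with respect to `b`. A framework does the same bookkeeping over tensors instead of scalars.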

After that, pick one domain and use a high-level library. HuggingFace Transformers for NLP. fastai or torchvision for computer vision. Stable-Baselines3 or CleanRL for reinforcement learning. The library will hide the parts you don’t yet need to understand and let you build something working in an afternoon.

Math, eventually. Linear algebra for what a layer is doing in terms of matrix operations, calculus for what gradient descent is doing, probability for what loss functions mean. You don’t need any of this upfront; pick it up as it bites. The wrong order is to spend three months on linear algebra before you train anything.

One thing to skip for now: the genre of AI doom or AI futurism content that floats around online. That conversation has substance, but it’s not where you want to spend your first hundred hours. Fluency in the tools comes first. Opinions about the trajectory of the technology are much more useful once you’ve actually trained something and watched it fail in detail.