1. What NLP is, and why it’s hard

Natural language processing (NLP) is the field that gets computers to do useful work with human language: reading it, writing it, translating, pulling structured facts out of it, answering questions about it. The definition is the easy part. Human language was never designed for a computer to handle, and almost every assumption that makes other kinds of data tractable falls apart when you point a program at a paragraph of English.

To see why, contrast a sentence with a row in a database. The row has labeled columns. You know which value is the customer name, which is the price, which is the timestamp. A sentence carries the same information without any of the labels. “Maria paid forty dollars on Tuesday” is parseable by a human in a fraction of a second, but for a computer, “Maria” is just a token with no inherent indication that it’s the subject, that it refers to a person, or that “forty dollars” is a monetary amount and not a phone number.
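The contrast is easy to see in a few lines of Python. This is only a minimal sketch; the field names in `row` are made up for illustration.

```python
# A database row: every value arrives with a label attached.
row = {"customer": "Maria", "amount_usd": 40.0, "day": "Tuesday"}

# The same facts as a sentence: the program sees one string.
sentence = "Maria paid forty dollars on Tuesday"

# Splitting on spaces is about all a program gets for free.
tokens = sentence.split()

print(row["amount_usd"])  # the amount is one lookup away
print(tokens)             # nothing marks which token is a name, an amount, or a day
```

Everything the row gives you by key has to be inferred from the sentence.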

Language also leans heavily on context that isn’t written down. The sentence “I saw her duck” can mean two completely different things depending on whether “duck” is a noun (the bird she owns) or a verb (what she did), and the only way to tell is to look at the surrounding text or the situation it was said in. “The trophy didn’t fit in the suitcase because it was too big” requires you to know that trophies and suitcases exist, that one is being put inside the other, and that a thing has to be smaller than its container to fit. Nothing in the syntax tells you whether “it” is the trophy or the suitcase. You resolve it with knowledge about the world.

Ambiguity, idiom, sarcasm, figurative language, missing context, the sheer variety of valid ways to say the same thing: it all stacks. Every NLP task is, at some level, an attempt to get useful work out of a computer despite that mess, which never fully goes away.

2. The task landscape: what NLP actually does

NLP is a family of tasks, each with a clear input and output. Most of the everyday tools you already use lean on at least one of them.

Classification. Take a piece of text, decide which bucket it goes in. Spam filters classify emails as spam or not spam. Support routers classify tickets by department. Sentiment analysis is a special case where the buckets are positive, negative, or neutral.
Named entity recognition (NER). Pull out the people, places, organizations, dates, and amounts mentioned in a piece of text. Useful anywhere you want to extract structured facts from unstructured prose, like turning a news article into a list of the companies it mentions.
Translation. Take text in one language, produce text in another. The classic NLP task, and the one whose quality jump over the last decade is the most obvious to anyone who has used Google Translate or DeepL.
Summarization. Take a long piece of text, produce a shorter one that captures the main points. The “TL;DR” feature on news apps is summarization, as is the meeting-notes feature in conferencing tools.
Question answering. Take a question, optionally with a source document, and produce an answer. Search engines and chat assistants both lean heavily on this.
Generation. Produce open-ended text from a prompt. Autocomplete in your email client, the replies suggested by your phone, and chatbots are all sitting on top of generation.
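
To make "clear input and output" concrete, here is a toy sketch of the simplest task in the list, classification. It counts hand-picked keywords, which is far cruder than any real sentiment system; the word lists and the function name `classify_sentiment` are invented for illustration.

```python
# Hypothetical keyword lists; a real system learns these signals from data.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "broken"}

def classify_sentiment(text: str) -> str:
    """Input: a piece of text. Output: one of three buckets."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this, it works great!"))
print(classify_sentiment("The update is terrible and the app is broken."))
print(classify_sentiment("The package arrived on Tuesday."))
```

The other tasks have the same shape, text in and something structured out. Only the output type changes: a list of entities, a translation, a summary, an answer.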

Each of these used to be its own subfield with its own techniques. The modern shift is that one architecture now handles most of them at once. Section 4 gets to that.

3. From words to numbers

Computers don’t understand words; they crunch numbers. So every NLP system starts by turning text into numbers in a way that preserves something useful about meaning. Two steps do most of that work: tokenization and embedding.

Tokenization splits text into pieces. The pieces might be whole words, but modern systems usually use subword units, so an unfamiliar word like “tokenization” can still be broken into known fragments like “token” and “ization” instead of being treated as a single unknown word. Each token is then mapped to a numeric ID, its index in the model’s vocabulary.
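
A minimal sketch of subword tokenization, assuming a tiny hand-written vocabulary. Real tokenizers learn a vocabulary of tens of thousands of pieces from data; the `VOCAB` table and the greedy longest-match strategy here are simplifications chosen for illustration, not any particular library's algorithm.

```python
# Hypothetical vocabulary: piece -> numeric ID.
VOCAB = {"<unk>": 0, "token": 1, "ization": 2, "un": 3, "believ": 4, "able": 5}

def tokenize(word, vocab):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No known piece starts here: emit the unknown marker for one character.
            pieces.append("<unk>")
            start += 1
    return pieces

def encode(word, vocab):
    """Map each piece to its numeric ID."""
    return [vocab[p] for p in tokenize(word, vocab)]

print(tokenize("tokenization", VOCAB))  # ['token', 'ization']
print(encode("tokenization", VOCAB))    # [1, 2]
print(tokenize("unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

The list of IDs is what the model actually receives. The original string is gone by that point.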

Embedding is the more interesting step. Each token gets represented not as a single number but as a vector: a list of several hundred (or several thousand) numbers, arranged so that tokens with related meanings end up close to each other in the space those vectors define. Imagine plotting every word as a dot in a giant high-dimensional map. Synonyms cluster, words about the same topic form neighborhoods, and the dot for “king” lands near “queen.” The direction from “man” to “king” turns out to be roughly the same as the direction from “woman” to “queen.” You don’t have to tell the model any of this. The embedding model learns it from how words are used in text.
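
The geometry can be sketched with toy numbers. The four-dimensional vectors below are hand-made so that the arithmetic is easy to follow; real embeddings are learned, have hundreds of dimensions, and their individual dimensions rarely have a readable meaning. "Close" is measured here with cosine similarity, one common choice.

```python
import math

# Hand-made vectors, for illustration only.
# Dimensions are loosely (royalty, male, female, food).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.8, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.8, 0.0],
    "apple": [0.0, 0.0, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: near 1 for the same direction, 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

# "king" is closer to "queen" than to "apple".
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"]))

# man is to king as woman is to ...
print(analogy("man", "king", "woman"))  # queen
```

With learned embeddings the analogy holds only approximately, which is why the lookup takes the nearest neighbor instead of checking for an exact match.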

That single move, turning meaning into geometry, is what almost every modern NLP system is built on. Once words live in a space where “similar meaning” maps to “close together,” every downstream task becomes more tractable, from finding documents that match a query to checking whether two sentences are paraphrases.

4. A short history: from rules to LLMs

The history of NLP is the story of giving up on telling computers the rules of language, and letting them learn the patterns from data instead. Each major shift handed more decisions to the data and fewer to the human engineer.

The first era, roughly the 1960s through the 1980s, was rule-based. Linguists and engineers hand-wrote grammars and dictionaries: lists of words, rules for how they could combine, exceptions for the cases that didn’t fit. It worked for narrow demos. It fell apart on real text, because real text is full of cases the rules didn’t cover, and adding rules to handle them tended to break the rules that were already there.

The statistical era, from the 1990s into the 2000s, traded rules for counting. Instead of telling the computer how language works, you fed it enough text and let it tally up which words tended to appear next to which others, which sequences were common, which translations of a phrase came up most often in parallel corpora. Not elegant, but it scaled with data in a way the rules never did, and it powered the first generation of usable machine translation.
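
The counting idea fits in a few lines. This is a toy sketch of a bigram tally over a three-sentence corpus invented for the example; systems of that era counted over millions of sentences and added smoothing for word pairs they had never seen.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# following[w] tallies which words were seen immediately after w.
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1

def most_likely_next(word):
    """Return the most frequent follower of a word, or None if it was never seen."""
    counts = following.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(most_likely_next("the"))    # cat
print(most_likely_next("sat"))    # on
print(most_likely_next("zebra"))  # None
```

No grammar is written down anywhere. Whatever the program "knows" is a table of counts, which is why more text made it better.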

The neural era, from the early 2010s, kept the data-driven idea but used neural networks to learn richer representations. Word embeddings landed in this period and captured patterns that earlier statistical methods could not, including the analogy that “king minus man plus woman” is close to “queen.” That kind of structure used to be hand-coded into knowledge graphs; now it was falling out of a model that just read raw text.

The current era, transformer-based large language models, collapses most of the older task-specific architectures into one. The same model handles translation, summarization, classification, and question answering, often just by being given the task as a prompt. The old tasks haven’t gone away, but the way you build a system for them is now much more uniform than it used to be.
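
"Given the task as a prompt" can be sketched without calling any model. The templates and the `build_prompt` helper below are hypothetical, and the wording of real prompts varies widely. The point is only that the task becomes part of the input text and does not need a separate system.

```python
# Hypothetical prompt templates: one model, three tasks.
PROMPT_TEMPLATES = {
    "translate": "Translate the following text into French:\n\n{text}",
    "summarize": "Summarize the following text in one sentence:\n\n{text}",
    "classify": "Is the sentiment of the following text positive, negative, or neutral?\n\n{text}",
}

def build_prompt(task: str, text: str) -> str:
    """Wrap the same input text in a task-specific instruction."""
    return PROMPT_TEMPLATES[task].format(text=text)

review = "The battery lasts two days and the screen is sharp."
for task in PROMPT_TEMPLATES:
    # Each prompt would be sent to the same model; only the instruction differs.
    print(build_prompt(task, review))
    print("---")
```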

5. Where NLP still falls short

NLP today is better than it has ever been, and it still gets a lot wrong in predictable ways. Knowing the failure modes is the difference between using these tools well and being surprised by them.

The deepest issue is that current models match patterns rather than understand meaning. They have seen enormous amounts of text that pair questions with correct answers, statements with continuations, prompts with appropriate responses, and they have learned the shapes those pairings tend to take. That’s powerful, and it’s not the same as understanding. A model can produce a confident, fluent answer to a question it has no actual grasp of, and the output alone won’t tell you which is which.

A close cousin is hallucination in generative systems. A model asked about something it doesn’t really know will often produce a plausible-sounding answer with invented details: a fake citation, a quote that was never said, or a fact that almost but not quite matches reality. The fluency hides that the substance is wrong. Retrieval-augmented systems try to reduce this by feeding the model real source text at query time; see the post on RAG for what that actually looks like in practice.
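
The retrieval idea in miniature, as a rough sketch. The passages are invented, the ranking is plain word overlap (real systems typically compare embeddings), and no model is called: the code only shows how source text ends up in the prompt. All function names are hypothetical.

```python
import re

# Made-up source passages, for illustration only.
PASSAGES = [
    "The return window for orders is 30 days.",
    "Shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]

def words(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=1):
    """Rank passages by how many words they share with the question."""
    q = words(question)
    ranked = sorted(passages, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:k]

def build_grounded_prompt(question, passages, k=1):
    """Put the retrieved text in front of the question."""
    sources = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("How many days is the return window?", PASSAGES))
```

The model still generates the answer, so this narrows the room for invention without eliminating it.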

Bias is another structural issue. A model trained on a giant slice of the web inherits the patterns in that text, including the ones nobody would endorse on inspection. Stereotypes about jobs and gender, skewed representation of which voices count as authoritative, and lopsided coverage of which cultures get written about, and how, all land in the training data and show up later in outputs.

Coverage across languages is uneven for the same reason. English and a handful of other languages dominate the available text online, and NLP quality reflects that. Languages with less written digital presence get systems that work noticeably worse, and the gap is slow to close even with deliberate effort.

None of this is fixed by making the models bigger. Scaling has solved many problems. It has not solved the fundamental one: a model can only learn what it has seen, and a lot of what humans actually mean is never written down.

6. The bigger picture

NLP sits inside the search bar, the spam filter, the autocomplete, the translator, the support chatbot, and most of the AI features being added to every product you use. Once you have the rough shape (text gets tokenized, tokens get embedded, models learn from huge piles of text, and modern systems collapse the old task-specific approaches into one) you can read most NLP news with a working sense of what’s actually happening and what’s marketing on top.

The natural next steps from here, if you want to go deeper without committing to a course, are the posts on this site about how to prompt these models effectively and how to extend them with retrieval. Both pick up where this one leaves off and assume the basics covered here.