A Guide to Effective LLM Prompting
1. Why prompting still matters
Prompt engineering did not fade as models improved. It became more load-bearing. The common assumption ran the other way, that prompting was a transitional skill, something that would matter less as models got smarter, but the opposite happened. Models got dramatically better at following instructions, which means the instructions themselves now carry more weight. A vague prompt to a weak model produced vague output you could blame on the model. A vague prompt to a strong model produces a confident, fluent answer to a question you didn’t quite ask.
What changed is the nature of the work. Prompting used to be partly about coaxing, with tricks and magic phrases and formatting hacks that nudged a reluctant model toward competence. Most of that is dead now. Modern models don’t need to be tricked into reasoning, and they don’t need to be persuaded to follow a format. What they need is an unambiguous specification of what you actually want, which is a different skill than coaxing capability out of something reluctant.
The single most useful test for any prompt is model-agnostic: show your prompt to a colleague with minimal context on the task and ask them to follow it. If they would be confused, the model will be too. The model has no access to the meeting where you decided what “summary” means for this project, or which of three plausible output formats you had in mind. It sees the words on the page and nothing else.
This guide is organized around that one idea. Every technique that follows is, at bottom, a way to remove ambiguity. Clear instructions remove it about the task, examples remove it about the format, and a visible structure removes ambiguity about which part of the prompt is which. Get the ambiguity out and most prompting problems disappear on their own.
One caveat before the techniques. Prompt engineering solves problems that are really about communication. It does not solve every problem. If your issue is latency or per-call cost, switching to a smaller or faster model is usually the better lever. If the model genuinely lacks the knowledge or the capability for the task, no prompt recovers it. Reach for prompting when the model could do the task but doesn’t yet know precisely what you’re asking for. That is the gap these techniques close.
2. Be clear and direct
The highest-leverage change you can make to a prompt is to make it more specific. Most underperforming prompts are not wrong, they are vague, and the model closes the gap with a reasonable guess that happens not to match your intent. The fix is detail, not cleverness.
A useful mental model: treat the model as a brilliant new hire on their first day. It has read an enormous amount and reasons well, but it knows nothing about your project, your conventions, or the unstated assumptions in your head. You would not tell a new colleague “build the dashboard” and expect the result to match your mental picture. The same instruction to a model produces the same mismatch.
Compare these two prompts:
Create an analytics dashboard.Create an analytics dashboard for a SaaS product's weekly active users.
Include a line chart of WAU over the last 12 weeks, a cohort retention
grid, and a top-five-features table. Use a muted palette, no animations.
Target a desktop layout at 1440px.The second prompt does not make the model smarter. It removes the four or five decisions the model would otherwise have made for you, each of which is a coin flip against your intent.
Specificity is half of it. The other half is explaining why. When you give the reason behind an instruction, the model can generalize it to cases you didn’t explicitly cover. The contrast is sharp even on a tiny instruction:
Never use ellipses.Your response will be read aloud by a text-to-speech engine, so never
use ellipses, since the engine does not know how to pronounce them.The first version is a rule the model applies literally. The second tells the model what problem the rule is solving, so it will also avoid other constructs the engine would mangle (emoji, ASCII art, unusual punctuation) without you having to enumerate them. A short clause of motivation often does more than a paragraph of rules.
Two practical habits follow from this. State the desired output format and constraints explicitly rather than hoping they’re inferred. And when the order or completeness of steps matters, give them as a numbered list, not a prose paragraph, so nothing is silently dropped. Modern models interpret instructions quite literally, which works in your favor here: if you ask for exactly what you want, that is generally what you get. The corollary is that they will not invent requirements you left out. If you want above-and-beyond effort, you have to ask for it.
3. Give the prompt a structure
Once a prompt grows past a few sentences, wrap each distinct kind of content in a labeled delimiter so the model can tell the parts apart. A long prompt is several things at once: instructions, background context, the actual input data and possibly examples. If you don’t visibly separate those parts, the model has to infer where one ends and the next begins, and it sometimes infers wrong, reading your input data as further instructions or vice versa.
Claude’s documentation specifically recommends XML-style tags for this, and Claude is tuned to respect them:
<instructions>
Summarize the support ticket below for an engineering audience.
Focus on reproduction steps and affected versions.
</instructions>
<ticket>
{{TICKET_TEXT}}
</ticket>The exact syntax matters less than the consistency. XML tags are the documented preference for Claude; other models respond just as well to Markdown headers, fenced blocks, or triple-quoted sections. What you should not do is rely on a blank line and hope. Pick one delimiter convention, use descriptive names (<ticket>, not <data2>), and apply it the same way every time. When content has a natural hierarchy, nest the tags: multiple documents inside a <documents> wrapper, each in its own <document> tag.
3.1. Order matters
Where you place content inside the prompt changes the result. The rule for large inputs: put the long material at the top and your actual question at the bottom, after the data. For multi-document prompts, Anthropic’s testing reports this ordering can improve response quality by up to 30%.
The reason is straightforward. If the question comes first and a 20,000-token document follows, the model reads the question, then a wall of text, and by the end has to recall what was asked. If the document comes first and the question comes last, the question lands while the material is freshest. Instructions and examples generally belong near the question at the bottom; only the bulk reference data goes on top.
3.2. The system prompt and the role
Most APIs split a prompt into a system prompt and a user message. Use the split. The system prompt is for stable, request-independent framing: who the model is acting as, the rules that hold for every call, the output contract. The user message carries the specific task and its input. Keeping the two separate means you can cache and reuse the system prompt and reason about each layer independently.
Role assignment belongs in the system prompt, and even one sentence shifts behavior measurably:
You are a senior security engineer reviewing code for vulnerabilities.A role is not decoration. It sets the vocabulary, the depth, and the priorities the model brings to the task. “You are a security engineer” and “you are a coding tutor for beginners” will produce genuinely different reviews of the same function, and one of them is the one you wanted. Pick the role that matches the audience and stakes of the output.
4. Show examples instead of describing them
When you want a specific output format, tone, or structure, showing the model one or more examples steers it far more reliably than describing what you want in prose. This is called few-shot or multishot prompting, and for anything with a non-obvious shape it is the single most effective technique in this guide.
The reason examples win is that some things are genuinely hard to specify in words. Describing a tone (“friendly but concise, technical without being dry”) leaves enormous room for interpretation. Two examples of that tone pin it down completely. The same is true of formatting quirks, edge-case handling, and the exact level of detail you expect. Prose tells the model the rule; examples show it the rule already applied.
Three properties make examples work:
- Relevant. Mirror your real inputs. An example built from a toy case teaches the model to handle toy cases.
- Diverse. Cover the edge cases and vary the examples enough that the model doesn’t latch onto an accidental pattern. If all three of your examples happen to have two-sentence answers, the model will conclude every answer is two sentences, whether or not you meant that.
- Structured. Wrap them so they’re unmistakably examples and not instructions or input. A common convention is one
<example>tag per case, all inside an<examples>wrapper.
Three to five examples is the usual sweet spot. Below three, the model may not have enough to generalize from. Far above five, you’re spending context budget for diminishing returns, and if your examples are even slightly inconsistent with each other, more of them just amplifies the noise. The diversity warning is the one people most often get wrong: an unintended pattern shared across all your examples becomes a rule the model follows invisibly.
5. Let the model reason before it answers
For any task involving multi-step logic, math, comparison, or analysis, giving the model room to reason before it commits to an answer measurably improves accuracy. A model that produces its conclusion as the very first token has no opportunity to catch its own mistake. A model that works through the problem first does.
Most current models expose this directly through a thinking or reasoning mode, where the model produces internal reasoning before its visible answer. When that mode is available, prefer it. When it isn’t, you can get most of the benefit manually by asking for step-by-step reasoning and separating it from the final answer with tags:
Work through this problem inside <reasoning> tags, considering each
constraint in turn. Then give your final answer inside <answer> tags.The tag separation matters for a practical reason: it lets you parse and discard the reasoning, keeping only the answer for the user, while still having gained the accuracy.
Two things are worth knowing about how to prompt for reasoning. First, general instructions usually beat prescriptive ones. “Think carefully through this problem” tends to produce better reasoning than a hand-written, numbered procedure, because the model’s own decomposition is often better than the one you would script, and a rigid script can stop it from noticing something you didn’t anticipate. Prescribe steps only when the procedure genuinely is fixed. Second, a self-check pass is cheap and effective. Appending “before finishing, verify your answer satisfies every constraint listed above” catches a real fraction of errors, especially in code and math.
The cost of reasoning is latency and tokens, so it is not free. Don’t force a deliberate reasoning pass on simple lookups or formatting tasks where the answer is immediate. Reserve it for the problems where a wrong first instinct is plausible. That is exactly where the extra tokens pay for themselves.
6. Steer the output
The most reliable way to control a model’s output is to describe what you do want. Negative instructions force the model to hold a forbidden thing in mind while avoiding it, which is both fragile and easy to violate. A positive instruction just describes the target.
The difference is concrete:
Do not use markdown in your response.Write your response as flowing prose in complete paragraphs.The first leaves “what should I do instead?” unanswered, and a stray bullet list slips through. The second describes the destination, so there’s nothing to slip. The same principle applies to length, tone, and structure: name the thing you want.
A few more levers for output control:
- Format indicators. Asking the model to write a section “inside
<summary>tags” is a strong, easily parsed format signal. Tags you put in the prompt tend to come back in the output. - Match your prompt’s style to the output’s style. If you want terse, unformatted prose back, write your prompt as terse, unformatted prose. Heavy Markdown in the prompt tends to produce heavy Markdown in the response. The model picks up on the register you’re writing in.
- Be explicit about verbosity. Recent models calibrate response length to how hard they judge the task to be, so the same prompt can yield a one-line answer or three paragraphs. If your application needs a consistent length, state it. And when you do, a positive example of the right length works better than an instruction not to ramble.
One older technique worth retiring: prefilling the assistant’s response to force a format, for example starting the reply with { to force JSON. Claude’s newer models have dropped support for prefilling the final assistant turn, and across providers the technique was always brittle. If you need structured output, use the structured-outputs or JSON-schema feature your provider offers, or define a tool whose parameters are the schema you want. Both constrain the output far more reliably than a prefix ever did.
7. Working with long context
When a prompt includes a large body of reference material, two techniques beyond ordering (section 3) keep the answer grounded in the documents instead of drifting: quote-grounding, and forcing the model to investigate before it asserts.
The first is quote-grounding. Before asking the model to reason over a long document, ask it to first extract the passages relevant to the question, verbatim, into a tagged block, and only then answer using those passages:
First, find every passage in the documents above relevant to the
customer's billing question and copy them verbatim into <quotes> tags.
Then answer the question using only those quotes.This works because it forces the model to locate its evidence before committing to a conclusion, rather than answering from a general impression of the document and backfilling. It also gives you something auditable: you can see which passages the answer rests on, and catch it when the model quotes something that doesn’t actually support its claim.
The second is forcing investigation before assertion, which matters most for agentic and code tasks where the model can read files or call tools. Left alone, a model will sometimes answer a question about a specific file from memory or inference rather than opening it. An explicit instruction closes that gap:
Never describe or make claims about a file you have not opened. If a
question references a specific file, read it before answering.Both techniques share a shape: they insert a grounding step between the question and the answer. The model must point at its evidence before it is allowed to conclude. That ordering is what keeps long-context answers honest.
8. Decompose hard tasks into chained prompts
When a task has several genuinely distinct stages, splitting it into a chain of separate prompts often beats cramming everything into one. Each call does one job, and the output of one becomes the input to the next. A single mega-prompt asking the model to research, then outline, then draft, then fact-check in one shot tends to do every stage at medium quality, because attention and effort are spread across all of them at once.
The most useful chaining pattern is self-correction: generate, then review, then refine.
- Generate. One prompt produces a first draft.
- Review. A second prompt is handed that draft and asked to critique it against explicit criteria (accuracy, completeness, tone) with no obligation to be kind.
- Refine. A third prompt takes the draft and the critique and produces the improved version.
Each step is a separate API call, and that separation is the point. You can inspect the intermediate output, log it, branch on it, or run the review step several times. The review prompt also benefits from not being the author: a model asked to critique a draft it is seeing fresh is more honestly critical than the same model asked to critique its own answer in a single continuous response.
The caveat: do not chain reflexively. Modern models already do a great deal of multi-step reasoning internally within a single call, and for most tasks one well-specified prompt is enough. Reach for an explicit chain when you specifically need to inspect or reuse an intermediate result, when you need to enforce a fixed pipeline, or when one stage’s prompt is large enough that combining it with the others would bury it. If you can’t name a concrete reason to split, don’t.
9. Over-prompting: the failure mode you’ll actually hit
The most common prompting mistake today is not under-specifying. It is over-prompting: piling on emphatic instructions, all-caps imperatives, and defensive rules that were tuned for older, less obedient models and are now actively counterproductive.
Older models needed to be shouted at. Instructions like CRITICAL: You MUST ALWAYS use the search tool existed because, without the volume, the model would skip the tool half the time. Current models follow instructions far more faithfully, and they read that emphasis literally. The result is over-triggering: a model that now reaches for the search tool on every trivial question, or treats a minor formatting preference as a hard constraint that overrides more important goals. A hedge like “if in doubt, use the tool” used to correct under-use; on a current model it causes constant, unnecessary use.
The fix is to dial the language back to normal. “Use the search tool when the question depends on current information” says exactly as much as the all-caps version and gets followed just as reliably, without the collateral over-triggering. Treat CRITICAL, MUST, ALWAYS, and NEVER as a budget you spend only on instructions that genuinely are critical. When everything is emphasized, nothing is.
This connects to a habit worth building. Prompts accrete cruft. A rule gets added to fix one bad output, the underlying model gets upgraded, and the rule is never removed, so it quietly distorts behavior for months. When you change models, or every few months regardless, reread your prompt and delete the instructions you can no longer justify. Many of them were workarounds for a model that no longer exists.
Which points at the last and most important habit: iterate against evidence, not vibes. Every technique in this guide is a hypothesis about what will improve your output, and the only way to know is to measure. Keep a set of representative test inputs, change one thing at a time, and compare. “This prompt feels better” is how cruft accumulates and how superstitions get encoded as rules. A small eval set, even ten or twenty hand-checked cases, turns prompting from guesswork into something you can improve on purpose. The people who get the most out of these models tend to be the ones who measure, not the ones with the cleverest phrasing.