In-Context Learning

Intro

In-context learning is the ability of an LLM to adapt to a task from the prompt context itself, without updating model weights. Mechanically, the model is still doing next-token prediction at inference time; the examples in the prompt change what token sequences are most probable next, so behavior changes without training. The key control is shot count: zero-shot (no examples), one-shot (one example), or few-shot (multiple examples). More shots can improve task steering, but they also consume context window budget.
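The shot-count control above can be sketched as a small prompt builder. This is a minimal illustration, not a real API call: the instruction, demonstrations, and the `build_prompt` helper are made-up placeholders showing how the same task becomes zero-, one-, or few-shot depending on how many demonstrations are prepended.

```python
# Minimal sketch: the same task as zero-, one-, or few-shot, controlled
# by how many demonstrations are prepended. All names and examples here
# are illustrative placeholders.

INSTRUCTION = "Classify sentiment as Positive, Neutral, or Negative."

DEMOS = [
    ("The new UI is fantastic.", "Positive"),
    ("It works, nothing special.", "Neutral"),
    ("Constant crashes, avoid.", "Negative"),
]

def build_prompt(query: str, shots: int = 0) -> str:
    """Assemble a prompt using the first `shots` demonstrations."""
    parts = [INSTRUCTION, ""]
    for text, label in DEMOS[:shots]:
        parts.append(f'Text: "{text}"')
        parts.append(f"Answer: {label}")
        parts.append("")
    parts.append(f'Text: "{query}"')
    parts.append("Answer:")
    return "\n".join(parts)

zero_shot = build_prompt("Battery life is acceptable.")           # no examples
few_shot = build_prompt("Battery life is acceptable.", shots=3)   # three examples
```

Each added shot lengthens the prompt, which is the token-budget tradeoff mentioned above.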

Zero-Shot Prompting

Zero-shot prompting asks the model to perform a task with no demonstrations, relying on instruction quality and prior training.

When it works well:

Instruction-tuned models (for example, ChatGPT- and Claude-style assistants) usually perform much better in zero-shot settings than base next-token models because they were fine-tuned to follow instructions.

Classify sentiment as Positive, Neutral, or Negative.

Text: "The battery life is acceptable, but the camera is disappointing."
Answer:

Typical output:

Neutral

One-Shot Prompting

One-shot prompting is the minimal demonstration setting: one complete input-output example plus a new input to solve. Use it when zero-shot mostly works but output format or decision boundaries are still inconsistent.

Extract entities from support messages.
Return JSON with keys: customer, issue, severity.

Input: "Tom reports typo in footer link on pricing page."
Output: {"customer":"Tom","issue":"footer link typo on pricing page","severity":"low"}

Input: "Ava cannot reset password after SSO migration. She is blocked from login."
Output:

Possible output:

{"customer":"Ava","issue":"password reset fails after SSO migration","severity":"high"}

Few-Shot Prompting

Few-shot prompting provides multiple demonstrations so the model can copy task structure, label space, and output format. In practice, a small number of examples often improves stability on noisier inputs compared with one-shot.

Why it works:

Min et al. (2022) showed that demonstration format, label space, and input distribution can matter more than per-example label correctness. They report that randomly replacing labels in demonstrations often hurts less than expected, which suggests the model is strongly using structural cues from demonstrations.
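The label-randomization probe from that study can be sketched as follows. This is only the data-corruption half of the experiment: it keeps inputs and format fixed while replacing gold labels with random draws from the label space; the model calls and accuracy comparison are omitted. The label set and examples are illustrative.

```python
# Sketch of a label-randomization probe: keep demonstration inputs and
# format fixed, but replace each gold label with a random label. Comparing
# model accuracy with and without this corruption isolates how much the
# model relies on structural cues versus label correctness.
import random

LABELS = ["Positive", "Neutral", "Negative"]

def randomize_labels(demos, rng):
    """Return demos with each gold label replaced by a random label."""
    return [(text, rng.choice(LABELS)) for text, _ in demos]

rng = random.Random(0)
demos = [("Great product.", "Positive"), ("Broken on arrival.", "Negative")]
corrupted = randomize_labels(demos, rng)
# Inputs and format are unchanged; only the labels may now be wrong.
```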

Extract entities from support messages.
Return JSON with keys: customer, issue, severity.

Input: "Maria says checkout crashes on payment step. Impact is high for all EU users."
Output: {"customer":"Maria","issue":"checkout crash on payment step","severity":"high"}

Input: "Tom reports typo in footer link on pricing page."
Output: {"customer":"Tom","issue":"footer link typo on pricing page","severity":"low"}

Input: "Ava cannot reset password after SSO migration. She is blocked from login."
Output:

Possible output:

{"customer":"Ava","issue":"password reset fails after SSO migration","severity":"high"}

Design Principles

Limitations

When this pattern is not enough for reasoning-heavy tasks, continue with Reasoning Techniques.

Pitfalls

Recency Bias in Example Ordering

What goes wrong: the last example in a few-shot prompt has disproportionate influence on the output. If the last example is a rare edge case, the model over-applies that pattern to normal inputs.

Mitigation: test multiple orderings of your demonstration set. Place the most representative examples last, or randomize order across requests to average out the bias.
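The randomization mitigation can be sketched as a per-request shuffle. Seeding with a request identifier (a hypothetical `request_id` here) keeps each request reproducible while varying order across requests, so no single example sits last every time.

```python
# Sketch: randomize demonstration order per request to average out
# recency bias. Seeding the shuffle with a request id makes each
# request's ordering deterministic and reproducible.
import random

def ordered_demos(demos, request_id: str):
    rng = random.Random(request_id)  # deterministic per request
    shuffled = list(demos)
    rng.shuffle(shuffled)
    return shuffled
```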

Adding More Shots Instead of Fixing the Root Cause

What goes wrong: the model produces inconsistent output, so the team adds more examples. The real problem is ambiguous instructions or inconsistent example formatting. More shots amplify the inconsistency rather than fixing it.

Mitigation: before increasing shot count, audit example consistency (same separators, same field order, same casing). Fix formatting first. Add shots only when the schema is clean and failures are about coverage, not consistency.
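A consistency audit like the one described can be automated. A minimal sketch for JSON-output examples: it checks that every example's output parses as JSON and uses the same keys in the same order, returning a list of problems (empty means clean). The `audit_examples` helper is illustrative.

```python
# Sketch: audit few-shot example outputs for formatting consistency
# before adding more shots. Flags outputs that fail to parse as JSON
# or whose keys differ (in name or order) from the first example.
import json

def audit_examples(outputs):
    """Return a list of inconsistency messages (empty means clean)."""
    problems, reference_keys = [], None
    for i, raw in enumerate(outputs):
        try:
            keys = list(json.loads(raw).keys())
        except json.JSONDecodeError:
            problems.append(f"example {i}: output is not valid JSON")
            continue
        if reference_keys is None:
            reference_keys = keys
        elif keys != reference_keys:
            problems.append(f"example {i}: keys {keys} != {reference_keys}")
    return problems
```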

Context Window Pressure

What goes wrong: a few-shot prompt with 10 long examples consumes 3,000 tokens, leaving little room for the actual user input. For long documents or multi-turn conversations, the demonstrations crowd out the content.

Mitigation: keep examples short and representative. For long-context tasks, use one-shot or zero-shot with precise instructions. Measure token cost of the demonstration block and set a budget.
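The budgeting step can be sketched as follows. Real token counts should come from the model's own tokenizer (for OpenAI models, the tiktoken library); the crude chars-divided-by-4 heuristic below is a stand-in so the sketch has no dependencies, and `fit_demos` is an illustrative helper.

```python
# Sketch: enforce a token budget on the demonstration block. The chars/4
# heuristic is a rough stand-in for a real tokenizer and only loosely
# approximates English text.
def rough_token_count(text: str) -> int:
    return max(1, len(text) // 4)  # crude approximation

def fit_demos(demos, budget_tokens: int):
    """Keep demonstrations, in order, until the token budget is spent."""
    kept, used = [], 0
    for demo in demos:
        cost = rough_token_count(demo)
        if used + cost > budget_tokens:
            break
        kept.append(demo)
        used += cost
    return kept, used
```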

Tradeoffs

Approach        | Token cost         | Format control | Knowledge injection  | Use when
Zero-shot       | Minimal            | Low            | None                 | Simple, well-specified tasks; instruction-tuned models
One-shot        | Low                | Medium         | None                 | Format is inconsistent; one example clarifies the schema
Few-shot (3-5)  | Medium             | High           | None                 | Ambiguous class boundaries; complex output structure
Fine-tuning     | High (training)    | Very high      | Yes (new knowledge)  | Consistent task at scale; examples cannot fit in context
RAG + zero-shot | Medium (retrieval) | Low            | Yes (external docs)  | Task requires external knowledge not in model weights

Decision rule: start zero-shot. Move to one-shot when output format is inconsistent. Move to few-shot when class boundaries are ambiguous. Move to fine-tuning only when few-shot is too expensive at scale or the task requires knowledge injection. Use RAG when the task requires external facts.
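The decision rule above can be encoded as a simple lookup from observed failure symptom to next approach. The symptom names are made up for illustration; the point is that escalation should be triggered by a specific, named failure mode, not by vague dissatisfaction.

```python
# Sketch: the escalation decision rule as a lookup from failure symptom
# to the next approach to try. Symptom names are illustrative.
def next_approach(symptom: str) -> str:
    rules = {
        "inconsistent_format": "one-shot",
        "ambiguous_boundaries": "few-shot",
        "needs_external_facts": "RAG + zero-shot",
        "too_expensive_at_scale": "fine-tuning",
    }
    return rules.get(symptom, "zero-shot")  # default: start simple
```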

Questions

References


What's next