Large Language Models, Explained

A large language model — an LLM — is a program trained to predict the next chunk of text. That sounds almost too simple to explain ChatGPT, Claude or Gemini, yet prediction at enormous scale turns out to produce something that reads, reasons and writes. This guide covers how that works, the words you’ll keep hearing, and where these systems quietly break.

What the model is actually doing

At its core, an LLM reads everything before a point and estimates how likely each possible continuation is. Train that objective on a vast slice of the internet, books and code, and the model has to internalise grammar, facts, tone and reasoning patterns just to predict well. Nobody hand-codes those rules; they emerge from the data.
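To make "predict what comes next" concrete, here is a deliberately tiny sketch. It is not how an LLM is built: it only counts which word follows which in a made-up one-line corpus, whereas a real model learns from vast data and looks at far more than the previous word. The function names and the corpus are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Made-up training text, purely for illustration.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
probs = next_word_probs(model, "the")

# Pick the single most likely continuation of "the".
best = max(probs, key=probs.get)
print(best, probs[best])
```

Even this toy has picked up a "fact" about its corpus (cats come up more often than mats) without anyone writing that rule down. Scale the same idea up enormously and the things a model must pick up to predict well become far richer.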

The text is broken into tokens, roughly word fragments. “Unbelievable” might be three tokens. The model works in tokens, prices are quoted in tokens, and the amount of text it can handle at once is measured in tokens too.
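A rough sketch of what splitting a word into tokens can look like. The vocabulary below is hypothetical and hand-written; real tokenizers learn theirs from data (byte-pair encoding is one common approach), so the actual split of any given word depends on the tokenizer in use.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Falls back to a single character when no longer piece matches.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabulary, chosen only to illustrate the idea.
TOY_VOCAB = {"un", "believ", "able", "token", "s"}

pieces = tokenize("unbelievable", TOY_VOCAB)
print(pieces, len(pieces))
```

With this toy vocabulary the word comes out as three pieces, which is the kind of count that billing and length limits are based on.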

The transformer, in plain terms

Modern LLMs are built on an architecture called the transformer. Its key trick is attention: for every token, the model weighs how relevant every other token is. When it processes “the bank of the river,” attention lets it connect “bank” to “river” and rule out the financial meaning.

You don’t need the maths. The practical takeaway is that transformers handle long-range context well and, because attention can process all the tokens of a training sequence in parallel, scale efficiently on modern hardware such as GPUs. That is why they won.
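For readers who do want a glimpse of the mechanics, the attention step can be sketched in a few lines. This is a minimal, single-query version of scaled dot-product attention in plain Python. The two-dimensional vectors are hand-picked for illustration; in a real model they are learned and have hundreds or thousands of dimensions.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query vector against every key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    """Blend the value vectors, weighted by their relevance to the query."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hand-picked vectors (an assumption for illustration, not learned values).
tokens = ["the", "bank", "of", "the", "river"]
keys = [[0.1, 0.0], [1.0, 0.2], [0.0, 0.1], [0.1, 0.0], [0.9, 1.0]]
query_for_bank = [1.0, 1.0]

weights = attention_weights(query_for_bank, keys)
for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.2f}")
```

With these numbers, the query for “bank” puts its largest weight on “river”. That is the mechanism behind the disambiguation described above, reduced to arithmetic.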

Training happens in stages

Pre-training

The model reads a massive corpus and learns to predict text. This is where raw capability comes from, and it costs the most: public estimates for a frontier model run to tens of millions of dollars in compute, and sometimes far more.

Fine-tuning and alignment

A pre-trained model is knowledgeable but unruly. A second stage, typically supervised fine-tuning on example conversations and often followed by reinforcement learning from human feedback (RLHF), teaches it to follow instructions, be helpful, and refuse harmful requests. This is the difference between a model that completes text and an assistant that answers you.

Context windows and why they matter

The context window is how much text the model can consider at once: the prompt plus its own reply. Early models held a few thousand tokens; current ones hold hundreds of thousands, and some over a million. A bigger window means you can paste a whole contract or codebase and ask questions about it. But everything outside the window is invisible: the model has no memory of past chats unless that text is fed back in.
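Here is one way an application might keep a conversation inside the window, as a sketch under simplifying assumptions. It counts one token per word, which is cruder than any real tokenizer, and it simply drops the oldest messages; real products may summarise or retrieve old text instead. The function names and the 12-token budget are invented for the example.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined size fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    "My name is Ada.",
    "Please summarise this contract.",
    "Here is the summary you asked for.",
    "What is my name?",
]

# A deliberately tiny budget so the effect is visible.
visible = fit_to_window(history, max_tokens=12)
for message in visible:
    print(message)
```

Only the last two messages survive, so the model is asked for a name it can no longer see. Nothing was forgotten in any human sense; the text simply was not in the input.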

What LLMs are good at

Drafting and rewriting — emails, summaries, code, translations.
Explaining — turning dense material into plain language.
Pattern work — extracting structure from messy text, classifying, reformatting.
First-draft reasoning — walking through a problem step by step.

Where they fail

This is the part marketing tends to skip.

Hallucination. Because the model predicts plausible text, it will state false things with total confidence: fake citations, invented APIs, wrong dates. It has no reliable built-in sense of “I don’t know.”
No live knowledge. A model knows only what was in its training data up to a cutoff, unless it’s connected to search or tools.
Brittle arithmetic and logic. It can reason, but it can also slip on a calculation a calculator would nail.
Sensitivity to wording. Rephrasing a prompt can change the answer. This is why “prompt engineering” exists.

If you can’t verify an LLM’s answer, treat it as a confident first draft, not a source.

The vocabulary that actually matters

Parameters: the numeric weights the model adjusts during training, typically numbering in the billions. More isn’t automatically better, but it correlates with capability.
Inference: running the model to get an answer (as opposed to training it).
Temperature: a setting that controls randomness. Low values make output more predictable, which suits factual tasks; higher values add variety, which suits creative ones.
RAG (retrieval-augmented generation): feeding the model relevant documents at query time so it answers from real sources instead of memory.
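Of these terms, temperature is the easiest to show in code. The sketch below assumes the usual formulation: the model's raw scores (often called logits) are divided by the temperature before being turned into probabilities. The candidate words and scores are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Scale scores by 1/temperature, then normalise into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(candidates, logits, temperature, rng=random):
    """Draw one candidate at random according to the adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Made-up candidates and scores for the next word after "The capital of France is".
candidates = ["Paris", "Lyon", "pizza"]
logits = [4.0, 2.0, 0.5]

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
print("low temperature: ", [round(p, 3) for p in cold])
print("high temperature:", [round(p, 3) for p in hot])
print("one sample at high temperature:", sample(candidates, logits, 2.0))
```

At the low setting almost all the probability sits on the top candidate. At the high setting the less likely options get a real chance of being picked, which is useful for variety and risky for facts.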

How to think about them

An LLM is a powerful, fluent, unreliable assistant. The skill is using it where fluency helps and verification is cheap — and staying sceptical where it can confidently mislead you. Understand the prediction at the centre of it, and most of the behaviour stops being mysterious.