PaperPad

How Large Language Models (LLMs) Actually Work

Illustration of how large language models work, showing text input breaking into tokens, passing through transformer attention layers, and generating an output response.

Introduction

You type a question. A few seconds later, a wall of fluent, surprisingly coherent text comes back. It can write a cover letter, explain quantum physics, debug your code, or invent a bedtime story on demand.

But how does it actually do that?

“It’s just autocomplete” is the dismissive answer you’ll often hear. It’s not wrong — but it’s a bit like saying a symphony is just vibrating air. Technically true. Wildly undersells what’s happening.

Large language models (LLMs) are the technology powering tools like ChatGPT, Claude, and Gemini. They are, by any reasonable measure, one of the most extraordinary engineering achievements of the last decade.

This post breaks down how they actually work — from raw text data all the way to the moment a response appears on your screen. No maths degree required. Just curiosity.

By the end, you’ll understand what an LLM is, how it learns, what happens when you send it a prompt, and why it sometimes gets things spectacularly wrong.

What Is a Large Language Model?

A large language model is a type of artificial intelligence trained on enormous quantities of text. Its core job is to predict what word — or more precisely, what token — comes next in a sequence.

That might sound simple. It isn’t.

To predict text well, a model has to develop a surprisingly deep understanding of language, facts, logic, and context. It can’t fake its way through millions of training examples without internalising a tremendous amount of structure about how language and the world actually work.

“Large” refers to two things: the amount of training data (hundreds of billions to trillions of words) and the number of parameters inside the model (often tens or hundreds of billions; more on what parameters are later in this post).

The “language model” part means it works with language specifically — text in, text out. That’s different from image-generating AI or audio AI, though modern multimodal models increasingly handle all three.

If you want to understand the broader landscape before going deeper here, What Artificial Intelligence Really Means (Beyond the Hype) is a good starting point, as is Machine Learning vs Deep Learning: The Layers of AI.


Step 1 — Feeding the Model: Training Data

Everything starts with data. A lot of it.

Modern LLMs are trained on a corpus — a collected body of text — that typically includes:

  • Books (hundreds of thousands of them)
  • Web pages crawled from across the internet
  • Wikipedia and other encyclopaedias
  • Code repositories like GitHub
  • Academic papers and journals
  • Forums, news articles, and discussions

The total size of training data for frontier models (the most capable, cutting-edge models, such as OpenAI's GPT-4, Anthropic's Claude 3 Opus, and Google's Gemini Ultra) is typically measured in trillions of tokens, the small chunks of text explained in Step 2. That is not every word ever written, but it’s a meaningful slice of human text production over the last few decades.

The model doesn’t “read” this data the way you’re reading this post. It processes it statistically, adjusting internal settings billions of times to get better and better at prediction.

Important limitation: the training data has a cutoff date. The model only knows what was in its training data. Anything that happened after that date (news events, new research, price changes) is invisible to it unless you tell it directly in the conversation.

For a deeper look at why data quality matters so much in AI, see How Data Powers AI: Understanding AI’s Fuel.


Step 2 — Tokenisation: Breaking Text Into Pieces

Before the model can process text, that text has to be converted into a form the model can work with mathematically.

The first step is tokenisation: breaking text into tokens. The model does not read raw sentences as humans do; it first turns them into token IDs, then uses those IDs to predict the next token. In simpler terms, a token is usually a word, part of a word, a punctuation mark, or sometimes even a space or byte fragment, depending on the tokenizer.

For example, a common word like “the” is usually a single token. Longer or rarer words might be split into two or three tokens: the word “tokenisation” might become “token” + “isation”.

Tokenisation workflow:

  • The text is normalized or prepared.
  • A tokenizer splits it into tokens.
  • Each token is mapped to a numeric ID from the model’s vocabulary.
  • Those IDs are fed into the LLM for prediction.

Spaces, punctuation, and capitalization all affect how tokens are counted.
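The workflow above can be sketched in a few lines of Python. This is a toy illustration, not how any production tokenizer is built: the vocabulary, the splitting rule, and the function names (`normalise`, `split_tokens`, `to_ids`, `encode`) are all assumptions made up for this example.

```python
import re
import unicodedata

# Hypothetical toy vocabulary for this sketch; real vocabularies hold tens of
# thousands of entries learned from data. "The" and "the" get different IDs
# because capitalisation matters.
VOCAB = {"<unk>": 0, "The": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6, ".": 7}


def normalise(text: str) -> str:
    """Step 1: prepare the text (Unicode normalisation, trimmed whitespace)."""
    return unicodedata.normalize("NFC", text).strip()


def split_tokens(text: str) -> list[str]:
    """Step 2: split into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_ids(tokens: list[str]) -> list[int]:
    """Step 3: map each token to its ID; unknown tokens map to <unk>."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]


def encode(text: str) -> list[int]:
    """Steps 1 to 3 chained; the resulting IDs are what the model receives."""
    return to_ids(split_tokens(normalise(text)))


print(encode("The cat sat on the mat."))  # [1, 3, 4, 5, 2, 6, 7]
print(encode("the dog sat."))             # [2, 0, 4, 7]
```

Notice that “dog” falls back to the unknown-token ID here. Real tokenizers avoid that by splitting unfamiliar words into smaller known pieces.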

Why tokens and not just words? Because this approach handles new words, foreign languages, code, and unusual strings more gracefully. A word the model has never seen before can still be broken into recognisable sub-pieces.
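To see how an unfamiliar word can still be broken into recognisable pieces, here is a minimal sketch using greedy longest-match splitting. Real tokenizers typically learn their pieces from data with algorithms such as byte-pair encoding; the tiny `SUBWORDS` set and the `split_subwords` function below are hypothetical and exist only for illustration.

```python
# Hypothetical subword vocabulary, chosen only for this illustration.
SUBWORDS = {"token", "isation", "un", "believ", "able", "s"}


def split_subwords(word: str) -> list[str]:
    """Split a word into known pieces, preferring the longest match at each step."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a known piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit one character so we always progress.
            pieces.append(word[i])
            i += 1
    return pieces


print(split_subwords("tokenisation"))  # ['token', 'isation']
print(split_subwords("unbelievable"))  # ['un', 'believ', 'able']
print(split_subwords("xyz"))           # ['x', 'y', 'z']
```

The single-character fallback is the point: however strange the input, the splitter always produces something the model has an ID for.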

Once the text is tokenised, each token is converted into a numerical vector — a list of numbers that represents that token’s position and relationships in a high-dimensional mathematical space. This is called an embedding.

Embeddings are how meaning gets encoded mathematically. Words with similar meanings end up numerically close to each other. “King” and “queen” are closer together in this space than “king” and “bicycle.” The model learns these relationships during training.
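A common way to measure “closeness” between embeddings is cosine similarity. The sketch below uses made-up three-dimensional vectors purely to show the idea; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Made-up vectors for illustration only; these are not real model embeddings.
EMBEDDINGS = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "bicycle": [0.10, 0.05, 0.95],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
odd_pair = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["bicycle"])
print(f"king vs queen:   {royal:.3f}")
print(f"king vs bicycle: {odd_pair:.3f}")
```

With these toy numbers, the king and queen score comes out much higher than the king and bicycle score, which is the pattern a trained model ends up with.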

For a more detailed look at how embeddings work, keep an eye out for the upcoming deep dive on that topic.


Step 3 — The Transformer Architecture

The dominant architecture powering almost every modern LLM is called the Transformer.

Introduced in a landmark 2017 paper titled “Attention Is All You Need” by researchers at Google, the Transformer replaced earlier architectures that processed text sequentially — word by word, left to right. That approach was slow and struggled with long sequences because early context got “forgotten” by the time the model reached the end.

The Transformer processes all tokens in a sequence simultaneously, not one at a time. This makes it dramatically faster to train and far better at maintaining context over long passages of text.

Inside a Transformer, there are layers — stacked blocks of mathematical operations. Each layer takes the representations from the previous layer and refines them. A modern frontier model might have 96 or more of these layers, each doing millions of calculations.

The key innovation in the Transformer is the attention mechanism, covered in the next step.


Step 4 — Attention: What the Model Pays Attention To

Attention is the mechanism that allows a Transformer to relate any word in a sequence to any other word in that sequence, regardless of how far apart they are.

Here’s a simple example. Consider the sentence:

“The trophy wouldn’t fit in the bag because it was too large.”

What does “it” refer to — the trophy or the bag? You probably worked it out instantly. But for a model processing text, this kind of pronoun resolution requires understanding the relationship between “it,” “trophy,” “bag,” and “large” — words spread across the sentence.

Attention lets the model do exactly this. For every token, the model calculates how relevant every other token in the sequence is to understanding it. These relevance scores are called attention weights.
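Here is a minimal sketch of how attention weights could be computed, using the scaled dot-product form from the “Attention Is All You Need” paper. It is simplified on purpose: a real Transformer first passes each token through learned projections to get separate query, key, and value vectors, whereas this example reuses the same toy vectors for all three.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to this one: dot product, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # New representation: a weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (assumed values, not from a real model).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights over all three tokens, and each row sums to 1.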

There’s a more sophisticated version called multi-head attention, which runs several attention calculations in parallel. Each “head” can learn to attend to different types of relationships — one might focus on grammatical dependencies, another on semantic meaning, another on positional patterns. The results are combined to produce a richer representation.

This is why Transformers handle context so well. They don’t just look at the neighbouring word — they can draw connections across an entire document.

The attention mechanism is one of those concepts that genuinely rewards a deeper look. (A full illustrated post on attention is coming soon.)


Step 5 — Prediction: One Token at a Time

Once training is complete, you have a model that’s very good at one thing: given all the tokens that came before, what token is most likely to come next?

When you type a prompt, the model processes your entire input, runs it through all its layers and attention heads, and outputs a probability distribution over its entire vocabulary for the next token. It then samples from that distribution to pick a token, adds it to the sequence, and repeats.

This is called autoregressive generation — the model generates text one token at a time, feeding each new token back in as part of the input for the next prediction.
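The generation loop itself is short. In the sketch below, a hand-written lookup table stands in for the trained model, and the loop always picks the most probable token (greedy decoding) so the result is repeatable. Both are simplifications: a real LLM conditions on the whole sequence and usually samples. The table and function names are hypothetical.

```python
# Hypothetical next-token table standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "sat": {"<end>": 1.0},
    "mat": {"<end>": 1.0},
}


def predict_next(sequence: list[str]) -> dict[str, float]:
    """Return a probability distribution over the next token.

    This toy only looks at the last token; a real model uses the full sequence.
    """
    return NEXT_TOKEN_PROBS[sequence[-1]]


def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    """Generate tokens one at a time, feeding each one back in as input."""
    sequence = list(prompt)
    for _ in range(max_tokens):
        probs = predict_next(sequence)
        token = max(probs, key=probs.get)  # greedy: take the most likely token
        if token == "<end>":               # stop token reached
            break
        sequence.append(token)             # the new token becomes part of the input
    return sequence


print(generate(["<start>"]))  # ['<start>', 'the', 'cat', 'sat']
```

The `max_tokens` argument plays the role of the length limit: the loop ends either when the stop token is predicted or when the limit is hit.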

This is why LLMs can feel like they’re “thinking as they write.” In a literal computational sense, they are — each token influences what comes next.

The model doesn’t write a full response in one go and then send it. It generates one token, then another, then another, until it predicts a stop sequence or hits a length limit. The streaming effect you see in Perplexity, ChatGPT, and Claude, with text appearing a few words at a time, is the model generating in real time.

There’s a parameter called temperature (many APIs accept values from 0 to 2) that controls how adventurous or conservative the output is.

For example:

A low temperature (e.g. 0.1 to 0.5) makes the model pick the most probable token almost every time: more predictable, less creative.

A high temperature (e.g. 0.8 to 1.5) introduces more randomness: more creative, but also more likely to go off the rails.
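Under the hood, temperature is usually applied by dividing the model's raw scores (called logits) by the temperature before converting them to probabilities. The sketch below assumes that standard approach; the three logit values are invented for the example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Dividing by zero is undefined; in practice 0 is treated as
        # "always pick the top token".
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 1.5)
print("low temperature: ", [round(p, 3) for p in cold])
print("high temperature:", [round(p, 3) for p in hot])
```

At the low setting nearly all the probability lands on the top token; at the high setting it spreads out, which is where the extra variety comes from.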


Step 6 — Fine-Tuning and RLHF: Making It Useful

Once a base model has finished its initial training, it’s capable but lacks direction. If you ask it a question, it might just give you more questions back instead of an answer. It doesn’t know how to be a “helpful assistant” yet, and it could even say things that are harmful. Think of it like a powerful car engine that’s running: it has plenty of power, but it needs a steering wheel so you can actually drive it safely.

This is where fine-tuning comes in. The model is trained further on a much smaller, curated dataset of high-quality examples. These examples demonstrate the kind of behaviour you want: helpful, clear, honest, safe.

Beyond that, most frontier models go through a process called Reinforcement Learning from Human Feedback (RLHF). Here’s how it works:

  1. The model generates several different responses to a prompt (the task you give the AI, along with any supporting data and instructions).
  2. Human raters rank those responses from best to worst.
  3. A separate model (the “reward model”) learns to predict those human preferences.
  4. The model is then fine-tuned using reinforcement learning, where it earns ‘rewards’ for giving better answers. This teaches it to consistently produce high-quality responses.
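Steps 2 and 3 can be sketched in code. One common formulation, which this post does not spell out and which is assumed here, turns each ranking into “preferred vs rejected” pairs and trains the reward model with a pairwise loss: the loss is small when the preferred response scores higher. The function names and reward values are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_best_to_worst: list[str]) -> list[tuple[str, str]]:
    """Turn a human ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_best_to_worst, 2))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(preferred - rejected))."""
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))


pairs = ranking_to_pairs(["response A", "response B", "response C"])
print(pairs)

# Reward model agrees with the human ranking: low loss.
print(round(pairwise_loss(2.0, 0.0), 3))
# Reward model disagrees with the human ranking: high loss.
print(round(pairwise_loss(0.0, 2.0), 3))
```

Training the reward model means adjusting it until this loss is low across many human-ranked examples, after which it can score new responses without a human in the loop.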

RLHF is a large part of why modern AI assistants feel so helpful. Instead of only predicting the next word from patterns in its training text, the model has been shaped by real human feedback to match what we actually want.

The tradeoff is that RLHF can also make models overly cautious or prone to telling you what you want to hear. Getting this balance right is an active area of research.

To understand the basics of how AI learns from feedback, see Supervised, Unsupervised, Reinforcement: The 3 Ways AI Learns, which covers the foundational concepts clearly.


Why Do LLMs Make Mistakes?

So if models are trained on so much data and refined so carefully, why do LLMs get things wrong?

A few key reasons:

They predict, not retrieve: AI models don’t “look things up” like a search engine or a library. Instead, they guess what comes next.

Think of it like an advanced version of autocomplete on your phone. Because the model is following patterns to create sentences, it can sometimes say things that sound perfectly plausible but are completely made up. The technical term for this is hallucination.

Training data has gaps and biases: AI learns by reading a massive amount of information. However, this information isn’t always perfect. If the “lessons” the AI learns from are flawed, the AI will be flawed too.

Let’s look at these two problems:

  1. Missing Information (Gaps): If the AI wasn’t taught much about a specific topic or culture, it won’t know how to talk about it accurately. It’s like a student trying to pass a test after reading a textbook with missing chapters.
  2. Unfair Ideas (Biases): AI learns from things written by humans. If that writing contains unfair opinions or stereotypes, the AI will “catch” those bad habits. It ends up repeating the same unfair views it saw in its training.

Let’s take an example to understand it better. Imagine teaching a child about animals using only a book about cats. If you then ask that child about a dog, they will either be very confused or try to tell you that a dog is just a “weird-looking cat.” The child isn’t broken; they just had limited information.

The context window has a limit: There’s a finite amount of text the model can “see” at one time during a conversation. Very long conversations or documents may exceed this window, causing the model to lose track of earlier context. [The Brains Behind the Machine: How Neural Networks Work] covers how this relates to neural network memory.
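One simple way an application might cope with the limit is to keep only the most recent tokens. The sketch below assumes that strategy; real products may instead summarise or selectively drop earlier text, and the function name is made up for this example.

```python
def fit_to_context(token_ids: list[int], context_window: int) -> list[int]:
    """Keep only the most recent tokens that fit inside the context window."""
    if context_window <= 0:
        return []
    # Negative slicing keeps the tail; earlier tokens are simply dropped.
    return token_ids[-context_window:]


conversation = list(range(10))          # pretend these are 10 token IDs
print(fit_to_context(conversation, 4))  # [6, 7, 8, 9]
print(fit_to_context([1, 2], 4))        # [1, 2]
```

Anything trimmed this way is invisible to the model, which is why details from early in a very long conversation can go missing.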

Stochastic outputs: AI doesn’t have a “set” answer for every question. Instead, it makes choices based on probability, like picking names out of a hat where some names are on more slips than others. Because of this, you can ask the exact same question twice and get two different responses.

Why the responses differ:

  1. The “Random” Factor: When the AI builds a sentence, it sees several words that could work. Sometimes it picks the most likely word, and sometimes it picks a less likely one to keep things interesting.
  2. The “Temperature” Setting: Think of this like a “creativity dial.” If the temperature is low, the AI plays it safe and usually gives the same, standard answer. If the temperature is high, the AI takes more risks, leading to more diverse and creative answers (but sometimes less predictable).

Here’s a simple example. Imagine asking a friend for a restaurant recommendation on two different days:

  • One day, they might suggest their absolute favorite spot.
  • The next day, they might remember a new place they’ve been meaning to try.

The friend hasn’t changed their mind about what’s good; they are just sampling from different ideas they have at that moment. AI works the same way—it chooses from a list of “likely” options every time it speaks.
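The names-in-a-hat idea maps directly onto weighted random sampling. In this sketch the candidate words and their probabilities are invented, and a seeded random generator is used only so that runs are reproducible.

```python
import random


def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Pick one token at random, with likelier tokens picked more often."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up probabilities for three candidate next words.
probs = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.1}
rng = random.Random(0)

draws = [sample_token(probs, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in probs}
print(counts)
```

Over many draws the counts roughly track the probabilities, but any single draw can land on a less likely option, which is exactly why the same prompt can give different answers.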

Understanding these limitations doesn’t make AI less amazing; it just helps you use it like a pro.


What “Parameters” Actually Means

You’ll often see LLMs described by their parameter count: GPT-4 is widely reported (though not confirmed by OpenAI) to have around 1.8 trillion parameters, Llama 3 comes in sizes from 8 billion to 70 billion, and so on.

Parameters are the numerical values inside the model that get adjusted during training. Think of them as the knobs and dials that the training process tunes until the model gets good at prediction.
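To make the “knobs and dials” picture concrete, here is a toy model with exactly one parameter, tuned by gradient descent. It is a deliberately tiny stand-in: an LLM does the same kind of nudging across billions of parameters at once. The data and function name are made up for illustration.

```python
def train_single_parameter(data, learning_rate=0.01, steps=200):
    """Tune one parameter w so that w * x matches y as closely as possible."""
    w = 0.0  # the single "knob", starting at an arbitrary value
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # Nudge the knob in the direction that reduces the error.
        w -= learning_rate * grad
    return w


# Toy data that follows y = 3 * x, so the ideal value of w is 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train_single_parameter(data)
print(round(w, 4))  # 3.0
```

Nobody told the code that the answer was 3. It got there by repeatedly measuring its error and adjusting, which is the essence of training.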

More parameters generally means the model can capture more complex patterns and relationships in language. But more isn’t always better — a well-trained smaller model can outperform a poorly trained larger one. And larger models are more expensive to run.

The relationship between parameter count, training data, training compute, and final performance is an active area of research. One influential result is the Chinchilla scaling laws. In brief, they say that as your computing budget (compute) grows, you should increase the model size (parameters) and the training data (tokens) in roughly equal proportion to get the best performance from that budget. Before this, the industry was focused almost exclusively on making models larger (more parameters) while keeping training data relatively small.
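For a feel of the numbers, the sketch below uses two widely quoted rules of thumb that are not stated in this post and should be treated as rough approximations: training compute is about 6 × parameters × tokens (in floating-point operations), and a compute-optimal model sees about 20 tokens per parameter. The function name is hypothetical.

```python
import math


def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Estimate compute-optimal parameter and token counts for a compute budget.

    Assumes C ~ 6 * N * D and D ~ tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


budget = 5.88e23  # an example compute budget in floating-point operations
params, tokens = compute_optimal_split(budget)
print(f"parameters: {params:.2e}")
print(f"tokens:     {tokens:.2e}")

# Quadrupling the budget doubles both the model size and the data.
bigger_params, bigger_tokens = compute_optimal_split(4 * budget)
print(round(bigger_params / params, 3), round(bigger_tokens / tokens, 3))
```

The example budget works out to roughly 70 billion parameters and 1.4 trillion tokens, which matches the size of the Chinchilla model itself.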


LLMs in the Real World: What They’re Used For

The same underlying technology powers a remarkably wide range of applications:

  • Writing assistants — drafting emails, blog posts, marketing copy
  • Customer support bots — handling queries at scale without human agents
  • Code generation — tools like GitHub Copilot write and debug code
  • Search and summarisation — pulling answers from large documents
  • Education and tutoring — personalised explanations on demand
  • Translation — contextual, nuanced, not just word-for-word
  • Healthcare — drafting clinical notes, summarizing research
  • Legal — contract review, case research



Conclusion

Large language models aren’t magic — but understanding how they work makes them feel even more remarkable.

They learn by processing an almost incomprehensible quantity of human text. They break that text into tokens, convert those tokens into mathematical representations, and use the Transformer architecture — powered by the attention mechanism — to understand relationships across language. They generate responses one token at a time, shaped by fine-tuning and human feedback to be helpful and safe.

And they still get things wrong. That’s not a flaw to dismiss — it’s a property to understand and work with.

The three things worth remembering: LLMs predict, they don’t retrieve. Their knowledge has a cutoff date. And their outputs are probabilistic, not deterministic — the same prompt can produce different results.

If you want to go deeper, [The Brains Behind the Machine: How Neural Networks Work] is the natural next step, covering the neural network foundations that underpin everything described here.


FAQ

What is a large language model in simple terms?

A large language model is an AI system trained on enormous amounts of text to predict what word — or token — comes next in a sequence. By getting very good at this prediction task, it develops a working understanding of language, facts, and reasoning that lets it write, answer questions, and hold conversations.

How is an LLM different from a search engine?

A search engine retrieves existing web pages that match your query. An LLM generates new text based on patterns learned during training. It doesn’t look anything up in real time (unless it’s been given a tool to do so). This is why LLMs can answer questions about things not on the web — and also why they can confidently state things that are wrong.

What does “training” an LLM actually involve?

Training involves feeding the model vast quantities of text and repeatedly adjusting its internal parameters to improve its predictions. The model makes a prediction, compares it to the actual next token, calculates how wrong it was, and adjusts accordingly — millions of times per second, over weeks, on thousands of specialised chips.
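“How wrong it was” is typically measured with cross-entropy loss: the negative log of the probability the model gave to the token that actually came next. The sketch below assumes that standard choice; the example probabilities are invented.

```python
import math


def next_token_loss(predicted_probs: dict[str, float], actual_token: str) -> float:
    """Cross-entropy loss for one prediction: low when the right token got high probability."""
    return -math.log(predicted_probs[actual_token])


# Two made-up predictions for the word after "The cat sat on the ...".
confident = {"mat": 0.9, "moon": 0.1}
unsure = {"mat": 0.2, "moon": 0.8}

print(round(next_token_loss(confident, "mat"), 3))
print(round(next_token_loss(unsure, "mat"), 3))
```

The second prediction gets a much larger loss because it put most of its probability on the wrong word. Training adjusts the parameters to shrink this number across the whole dataset.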

Why do LLMs sometimes make things up?

Because they’re optimised to produce plausible-sounding text, not verified facts. When a model doesn’t have strong training signal for a particular fact, it may generate a confident-sounding answer that’s statistically plausible but factually wrong. This is called hallucination. Cross-checking important facts from AI outputs is always a good habit.

What’s the difference between GPT-4, Claude, and Gemini?

All three are frontier LLMs built on the Transformer architecture, but trained by different companies (OpenAI, Anthropic, and Google respectively) on different datasets, with different fine-tuning approaches and safety philosophies. They have meaningfully different strengths — a full comparison post is planned for this site. (See Notes below.)

How large is “large” in a large language model?

Modern LLMs range from a few billion parameters to, reportedly, over a trillion, depending on the model. The training datasets run into the trillions of tokens. The compute required to train the largest models costs tens of millions of dollars or more. “Large” is doing a lot of work in that label.
