LLM Fundamentals

What is an LLM, anyway?

A large language model is a staggeringly well-read pattern matcher that generates text one token at a time. The trick is simple. The scale is what makes it feel uncanny.

If you're talking about "AI," you're probably talking about an LLM: a large language model. It's not entirely wrong to think of them as incredibly good text predictors, similar to the autocomplete you see when typing a message on your phone.

But “incredibly good” is doing a lot of work in that sentence. Your phone’s autocomplete can guess the next word in a text to your mom. An LLM can write the whole text, guess how your mom will reply, and then draft three versions of your response depending on what kind of day you want to have. Same basic idea, wildly different scale.

Here’s the short version of what’s actually going on.

So what is it, really?

An LLM is a giant mathematical model that has been fed a staggering amount of text — books, websites, forum posts, code, transcripts, the instruction manual for a 2003 breadmaker, probably — and has gotten very good at one specific trick: given some text, predict what text should come next.

That’s it. That’s the whole job.

The surprising part is that this one trick, done well enough, looks a lot like thinking. If you ask a model “What’s the capital of France?” it predicts that the next words should be something like “The capital of France is Paris.” It’s not looking anything up. It’s not consulting a database. It’s generating the most plausible continuation of the text you gave it, based on patterns it learned from everything it read.

This is why LLMs can feel weirdly smart and weirdly dumb at the same time. They can write a decent poem about your cat, but they might also confidently tell you that cats have seven legs if the training patterns push them that way. The model doesn’t “know” things the way you know things. It knows what text tends to follow other text.

How does it learn all that?

Training an LLM happens in roughly two phases, and they’re pretty different from each other.

The first phase is called pretraining, and it’s the big, expensive, headline-grabbing one. The model reads through an enormous pile of text — we’re talking trillions of words — and plays a game with itself. The game goes like this: hide a word, try to guess it, check the answer, adjust. Do this billions of times. Over the course of that process, the model’s internal settings (called parameters, which the next article covers) slowly shift to make better and better predictions.
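The hide-a-word game can be sketched in miniature. The toy "model" below just counts which word tends to follow which (real pretraining adjusts billions of parameters with gradient descent, and the tiny corpus here is invented for illustration), but the loop is the same: guess, check, adjust.

```python
from collections import defaultdict, Counter

# Toy sketch of the pretraining game: read text, repeatedly guess the
# next word, check the answer, and adjust. A real model adjusts billions
# of parameters via gradient descent; this sketch just tallies counts.

corpus = "the cat sat on the mat because the cat was tired".split()

counts = defaultdict(Counter)  # counts[word][next_word] -> times seen
hits = 0

for prev, nxt in zip(corpus, corpus[1:]):
    # 1. Guess: predict the most common follower seen so far (if any).
    guess = counts[prev].most_common(1)[0][0] if counts[prev] else None
    # 2. Check the answer.
    if guess == nxt:
        hits += 1
    # 3. Adjust: update the statistics so future guesses improve.
    counts[prev][nxt] += 1

# After "training," the model knows what tends to follow "the".
print(counts["the"].most_common(1)[0][0])  # -> cat
```

The same three-step cycle, run billions of times over trillions of words, is what slowly shapes the model's parameters.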

Nobody is sitting there telling the model “this is a noun” or “France is in Europe.” It just figures out the patterns from raw exposure. After enough reading, the patterns it picks up start to include things that look an awful lot like grammar, facts, reasoning, and style. Whether that counts as “understanding” is a philosophical question I’m going to cheerfully dodge.

The second phase is fine-tuning, and it’s where the model learns to be useful instead of just predictive. A pretrained model, left alone, isn’t really a chatbot. Ask it a question and it might just generate more questions, because that’s what the surrounding text in its training data often looked like. Fine-tuning teaches it to answer helpfully, refuse certain requests, follow instructions, and generally behave like an assistant. This phase involves humans giving feedback on the model’s outputs, which steers it toward responses people actually want.
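The feedback idea can be sketched too. This is an assumption-heavy stand-in for techniques like RLHF: real fine-tuning updates the model's parameters, while this toy just nudges per-response scores in the direction people rated, with responses and ratings invented for illustration.

```python
# Toy sketch of the feedback phase: steer toward responses people liked.
# Real fine-tuning (e.g. RLHF) updates model parameters; here we just
# adjust a score per candidate response based on human ratings.

scores = {
    "Paris is the capital of France.": 0.0,  # helpful answer
    "Why do you ask?": 0.0,                  # just more questions
}

feedback = [
    ("Paris is the capital of France.", +1),
    ("Why do you ask?", -1),
    ("Paris is the capital of France.", +1),
]

for response, rating in feedback:
    scores[response] += rating  # nudge toward liked, away from disliked

best = max(scores, key=scores.get)
print(best)  # -> Paris is the capital of France.
```

A pretrained model would happily produce either response; the feedback loop is what tilts it toward the helpful one.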

So: pretraining gives it the raw language ability. Fine-tuning gives it manners.

Okay, but how does it actually write?

Here’s where it gets fun. When you send a message to an LLM, it doesn’t compose a full response and then hand it to you. It writes one piece at a time.

First, your message gets chopped up into tokens. A token is roughly a word, but not quite — sometimes it’s a whole word (“cat”), sometimes a chunk of one (“ing”), sometimes just a piece of punctuation. The model sees your message as a sequence of these tokens, not as letters or words the way you do.
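A sketch of that chopping, using a greedy longest-match rule over a five-entry vocabulary I made up for illustration (real tokenizers, such as BPE, learn vocabularies of tens of thousands of entries from data):

```python
# A toy tokenizer with a tiny invented vocabulary, to show how text
# becomes tokens: some entries are whole words, some are chunks.

VOCAB = ["cat", "walk", "ing", "s", " "]

def tokenize(text: str) -> list[str]:
    """Greedy longest-match: at each position, take the longest
    vocabulary entry that fits, so 'walking' splits into 'walk' + 'ing'."""
    tokens = []
    i = 0
    while i < len(text):
        match = max(
            (v for v in VOCAB if text.startswith(v, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no vocabulary entry covers position {i}")
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("cats walking"))  # -> ['cat', 's', ' ', 'walk', 'ing']
```

Notice that "cats" and "walking" each become two tokens: the model never sees those words as single units, only as these pieces.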

Then the model does its prediction trick. Based on your tokens, it calculates probabilities for what token should come next. Not just one guess — it assigns a probability to every possible next token in its vocabulary. “The” might get 12%, “A” might get 8%, “Paris” might get 3%, and so on across tens of thousands of options.
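That scoring step can be sketched with a softmax, the standard way raw model scores get turned into probabilities that sum to one. The scores and the four-token vocabulary below are invented; a real model produces one score per token across its whole vocabulary.

```python
import math

# Turning raw per-token scores ("logits") into a probability
# distribution over every token in the (tiny, made-up) vocabulary.

logits = {"The": 2.0, "A": 1.6, "Paris": 0.7, "banana": -1.0}

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Exponentiate each score and normalize so they sum to 1."""
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
print(probs)  # every token gets some probability; they sum to 1.0
```

Even "banana" gets a sliver of probability; nothing in the vocabulary is ever truly at zero.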

It picks one. Then it adds that token to the end, looks at the whole thing again, and predicts the next one. And the next. And the next. Word by word, or rather token by token, the response gets built in real time. That’s why you see responses stream in from left to right — you’re literally watching the model decide what to say as it says it.

The picking part has some interesting knobs. The model doesn’t always take the highest-probability token, because that would make it boring and repetitive. There’s usually some controlled randomness in the mix, which is why asking the same question twice can give you slightly different answers. It’s also why LLMs can surprise you, for better and for worse.
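One of those knobs is usually called temperature. A sketch, with made-up scores: dividing the scores by a low temperature sharpens the distribution toward the top pick, while a high temperature flattens it so underdogs get chosen more often.

```python
import math
import random

# Temperature sampling: divide scores by a temperature before
# normalizing, then draw a token at random according to the weights.
# Scores are invented for illustration.

logits = {"Paris": 3.0, "Lyon": 1.0, "banana": -2.0}

def sample(scores: dict[str, float], temperature: float = 1.0) -> str:
    exps = {t: math.exp(s / temperature) for t, s in scores.items()}
    total = sum(exps.values())
    tokens, weights = zip(*((t, e / total) for t, e in exps.items()))
    return random.choices(tokens, weights=weights)[0]

print(sample(logits, temperature=0.5))  # almost always "Paris"
print(sample(logits, temperature=2.0))  # "Lyon" shows up much more often
```

This one knob is most of the reason the same question can produce different answers on different runs.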

And that’s the loop. Read the context, predict the next token, pick one, add it, repeat. Keep going until the model predicts a special “I’m done” token or hits a length limit. The result is a response that feels like it was thought through, but was really assembled one small piece at a time.
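The whole loop fits in a few lines. The "model" here is a hand-written probability table that only looks at the most recent token (a real model re-reads the entire context and computes its probabilities from billions of parameters), but the read-predict-pick-append cycle is the same.

```python
import random

# The generation loop in miniature: predict, pick, append, repeat
# until a special end token. The probability table is invented for
# illustration; a real model computes these numbers on the fly.

MODEL = {
    "<start>": {"The": 1.0},
    "The":     {"capital": 0.8, "cat": 0.2},
    "capital": {"is": 1.0},
    "cat":     {"sat": 1.0},
    "is":      {"Paris": 0.9, "Lyon": 0.1},
    "sat":     {"<end>": 1.0},
    "Paris":   {"<end>": 1.0},
    "Lyon":    {"<end>": 1.0},
}

def generate(max_tokens: int = 10) -> str:
    tokens = ["<start>"]
    while len(tokens) < max_tokens:
        dist = MODEL[tokens[-1]]       # predict next-token probabilities
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "<end>":             # the model says "I'm done"
            break
        tokens.append(nxt)             # add the pick and go around again
    return " ".join(tokens[1:])

print(generate())  # e.g. "The capital is Paris"
```

Run it a few times and you'll occasionally get "The cat sat" or "The capital is Lyon" instead: same loop, different rolls of the dice.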

What this doesn’t cover

I’ve glossed over a lot. I haven’t touched the architecture (transformers), what parameters actually are, why some models run on your phone and others need a data center, or how image and audio get into the mix. Those are coming in the next couple of posts.

For now, the mental model is this: an LLM is a very large, very well-read pattern matcher that writes by guessing the next word over and over. Everything else is detail on top of that.