Introduction to the Math Behind Transformers and LLMs

A Workshop on the Mathematics of Transformers and Large Language Models

An accessible introduction to the mathematics behind transformers and large language models, covering key concepts such as attention mechanisms, self-attention, multi-headed attention, positional encoding, and the training objective of next-word prediction.
odsc
workshop
transformers
llms
Author

Oren Bochman

Published

Monday, April 27, 2026

Modified

Monday, May 18, 2026

Keywords

Transformers, Large Language Models, LLMs, Attention, Self-Attention, Multi-Headed Attention, Positional Encoding, Next-Word Prediction, Cross-Entropy Loss

Notes
  • Purpose of the talk
    • David Hall explains the mathematics behind transformers and large language models in an accessible way.
    • The goal is not to train a full large language model, build chat bots, or teach prompting, but to reduce fear of the mathematical structure behind these systems.
    • The central modeling task is next-word prediction: given a partial sequence of words, predict the most likely next word.
  • Next-word prediction as the core idea
    • A language model begins with a context, such as “The cat sat on the…”
    • It predicts the next word, appends that word to the sequence, and then repeats the process.
    • This recursive procedure can generate longer completions, poems, answers, or dialogue.
    • Systems such as ChatGPT and Claude can be understood, at a high level, as sophisticated next-token prediction systems.
  • Naive Markov model baseline
    • Hoyle introduces a first-order Markov model using the phrase: “The more you know, the more you realize you don’t know.”
    • A Markov model predicts the next word using only the immediately preceding word.
    • Its transition probabilities can be represented as a matrix.
    • The model exposes two major problems:
      • It ignores most of the previous context.
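      • It becomes impractical with large vocabularies: the transition matrix has one entry per pair of words, so it grows with the square of the vocabulary size, and conditioning on more than one previous word makes it grow exponentially with the context length.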
    • It can also generate nonsensical sentences, illustrating why probabilistic language generation needs better structure.
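The first-order Markov baseline described above can be sketched in plain Python. This is a minimal illustration rather than the speaker's code: the helper names (`tokenize`, `build_transitions`, `generate`) and the crude punctuation handling are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip trailing commas and periods."""
    return [w.strip(".,").lower() for w in text.split()]

def build_transitions(words):
    """Estimate P(next word | current word) from bigram counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {w: c / total for w, c in followers.items()}
    return probs

def generate(probs, start, length, seed=0):
    """Sample a word sequence; each step looks only at the previous word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = probs.get(out[-1])
        if not followers:  # no observed continuation for this word
            break
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

phrase = "The more you know, the more you realize you don't know."
words = tokenize(phrase)
transitions = build_transitions(words)

print(transitions["you"])  # three equally likely continuations
print(" ".join(generate(transitions, "the", 8)))
```

The phrase contains only 6 distinct words, so its full transition matrix would have 6 × 6 = 36 entries. A 50,000-word vocabulary would need 2.5 billion entries for the same one-word context.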
  • Why embeddings are needed
    • One-hot representations are too large and sparse for realistic vocabularies.
    • Large language models instead map tokens into lower-dimensional embedding vectors.
    • These vectors numerically encode semantic relationships between words.
    • Word2vec examples show that embeddings support meaningful operations, such as similarity comparison and analogy-like vector arithmetic.
    • Hoyle demonstrates cosine similarity and examples such as comparing related words or testing vector analogies.
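Cosine similarity and analogy-style vector arithmetic can be sketched as follows. The vectors are hypothetical 3-dimensional values picked by hand so that the arithmetic works out; they are not real word2vec embeddings. The king/queen example is a commonly cited analogy and may not be the one shown in the talk.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-picked for illustration only.
# Real embeddings have hundreds of dimensions and are learned from text.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"]))
print(analogy("king", "man", "woman", toy_embeddings))  # queen
```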
  • Main intuition of transformers
    • Transformers map tokens to embedding vectors, then transform those vectors so they become context-aware.
    • A word’s final vector should not merely represent the word itself; it should also encode relevant information from surrounding or preceding tokens.
    • Once context-aware vectors are obtained, a relatively standard classifier can predict the next token.
  • Attention mechanism
    • Attention constructs a new vector for each token by taking a weighted combination of other token vectors.
    • The weights indicate how much each token should “pay attention” to other tokens.
    • Instead of averaging the original embeddings directly, the model first applies learned linear transformations to produce value vectors.
  • Self-attention
    • Self-attention computes attention weights from the tokens themselves.
    • Tokens are transformed into query and key vectors.
    • The similarity between a query and a key, usually via an inner product, determines how much attention one token pays to another.
    • A softmax function converts these similarity scores into normalized attention weights.
    • Learned matrices produce the query, key, and value vectors.
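The query, key, and value steps above can be sketched with plain Python lists. This is a minimal single-head sketch with made-up inputs: the projection matrices `w_q`, `w_k`, `w_v` would normally be learned, and identity matrices are used here only to keep the numbers easy to follow. The scores are scaled by the square root of the vector dimension, as in the talk's code example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Turn a list of scores into positive weights that sum to one."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return (attention weights, context-aware vectors) for token matrix x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # query-key inner products
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, v)  # weighted combination of value vectors

# Three tokens with 2-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
weights, context = self_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token attends to every token in the sequence, including itself.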
  • Masking
    • For next-token prediction, the model must not look ahead at future tokens.
    • Masking prevents later tokens from influencing the representation of earlier positions.
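    • This is done by setting the unwanted attention scores to negative infinity (or a very large negative number) before the softmax, so that the corresponding attention weights are effectively zero afterwards.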
    • Decoder-style language models use masking so that each position only attends to previous tokens.
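A causal mask can be sketched as below, assuming a square score matrix in which row i holds the scores of token i against every token j. The function name `causal_mask` and the score values are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Replace scores for future positions (column j > row i) with -inf."""
    n = len(scores)
    return [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]

# Made-up raw attention scores for a 3-token sequence.
scores = [
    [0.5, 1.2, -0.3],
    [0.1, 0.4, 2.0],
    [1.0, 0.0, 0.7],
]
weights = [softmax(row) for row in causal_mask(scores)]

for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`. Every row still sums to one because the softmax renormalizes over the positions that remain visible.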
  • Multi-headed attention
    • A single attention mechanism may not capture all relevant relationships between tokens.
    • Multi-headed attention uses several attention heads in parallel.
    • Different heads can specialize in different kinds of relationships or different parts of the embedding space.
    • Their outputs are combined, often by concatenation, to form a richer context-aware representation.
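Running several heads in parallel and concatenating their outputs might look like the sketch below. It makes two simplifying assumptions: each head gets its own (query, key, value) projection triple, and the learned output projection that many implementations apply after concatenation is left out. The projection matrices are hand-written stand-ins that simply select half of the embedding dimensions.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """One unmasked attention head with its own projections."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads):
    """heads: list of (w_q, w_k, w_v) triples, one per head."""
    outputs = [attention_head(x, *head) for head in heads]
    # Concatenate the per-head vectors for each token position.
    return [sum((out[t] for out in outputs), []) for t in range(len(x))]

# Stand-in projections from 4 dimensions down to 2 per head.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
output = multi_head_attention(x, heads)
print(len(output), len(output[0]))  # 3 tokens, 2 heads x 2 dimensions each
```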
  • Transformer block structure
    • A transformer block contains:
      • Multi-headed self-attention.
      • A feed-forward neural network layer for nonlinear mixing of information.
      • Normalization layers for stable training.
      • Residual connections to preserve and stabilize information flow.
    • These blocks transform input embeddings into context-aware embeddings suitable for prediction.
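The four ingredients listed above can be wired together as in the following sketch. Several choices here are assumptions, not details from the talk: a single attention head for brevity, normalization applied after each residual addition, a ReLU nonlinearity, layer normalization without learned gain or bias, and random weights in place of trained ones.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where token i only sees tokens 0..i."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    weights = []
    for i in range(len(x)):
        row = [
            sum(a * b for a, b in zip(q[i], k[j])) / scale if j <= i else float("-inf")
            for j in range(len(x))
        ]
        weights.append(softmax(row))
    return matmul(weights, v)

def layer_norm(vec, eps=1e-5):
    """Rescale one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add(a, b):
    """Element-wise sum of two matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU in between, applied to each token separately."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def transformer_block(x, p):
    attended = causal_self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    x = [layer_norm(row) for row in add(x, attended)]  # residual + normalization
    mixed = feed_forward(x, p["w1"], p["w2"])
    return [layer_norm(row) for row in add(x, mixed)]  # residual + normalization

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden = 4, 8
params = {
    "w_q": random_matrix(d_model, d_model, rng),
    "w_k": random_matrix(d_model, d_model, rng),
    "w_v": random_matrix(d_model, d_model, rng),
    "w1": random_matrix(d_model, d_hidden, rng),
    "w2": random_matrix(d_hidden, d_model, rng),
}
tokens = random_matrix(3, d_model, rng)  # stand-in for 3 token embeddings
output = transformer_block(tokens, params)
print(len(output), len(output[0]))  # same shape as the input
```

Because the output has the same shape as the input, blocks like this can be stacked one after another.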
  • Predicting the next word
    • The model takes the final context-aware embedding vector in the sequence.
    • A softmax classifier maps that vector to a probability distribution over the vocabulary.
    • The next token can be chosen as the most probable word or sampled from the distribution for more variation.
    • The process repeats by appending the selected token and predicting again.
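The final classification step can be sketched as a linear map followed by a softmax. The three-word vocabulary, the context vector, and the output matrix `w_out` are all made-up values for illustration; in a real model the vocabulary has tens of thousands of entries and the weights are learned.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(context_vec, w_out, vocab):
    """Map the last context-aware vector to a probability for each vocabulary word."""
    logits = [sum(c * w for c, w in zip(context_vec, col)) for col in zip(*w_out)]
    return dict(zip(vocab, softmax(logits)))

def choose_greedy(dist):
    """Pick the single most probable word."""
    return max(dist, key=dist.get)

def choose_sample(dist, rng):
    """Draw a word at random in proportion to its probability."""
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs)[0]

vocab = ["mat", "dog", "moon"]
context_vec = [1.0, 0.5]       # hypothetical final vector for "The cat sat on the"
w_out = [[2.0, 0.5, -1.0],     # hypothetical output weights, one column per word
         [1.0, 0.0, 0.5]]

dist = next_token_distribution(context_vec, w_out, vocab)
print({w: round(p, 3) for w, p in dist.items()})
print(choose_greedy(dist))  # mat
print(choose_sample(dist, random.Random(0)))
```

Greedy selection always returns the same word for a given context, while sampling occasionally picks a less probable word, which is the source of the extra variation.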
  • Training objective
    • The model learns its parameters from large text corpora.
    • Training minimizes cross-entropy loss, which compares predicted token probabilities to the observed next token.
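    • The target distribution is one-hot: the observed next word has probability one and all alternatives have probability zero. The cross-entropy therefore reduces to the negative log-probability that the model assigns to the observed word, so minimizing it is equivalent to maximizing the likelihood of the training data.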
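The link between cross-entropy and likelihood can be written out explicitly. The notation is an assumption introduced for this note: V is the vocabulary, y is the one-hot target for the observed word w*, and p_θ is the model's predicted distribution.

```latex
\begin{align*}
% Cross-entropy at one position: only the observed word w^* has y_w = 1
H(y, p_\theta) &= -\sum_{w \in V} y_w \log p_\theta(w \mid \text{context})
               = -\log p_\theta(w^* \mid \text{context}) \\
% Summed over all positions t of the training text
\mathcal{L}(\theta) &= -\sum_{t} \log p_\theta(w_t \mid w_1, \dots, w_{t-1})
                    = -\log \prod_{t} p_\theta(w_t \mid w_1, \dots, w_{t-1})
\end{align*}
```

The product on the right is the likelihood of the training text under the model, so minimizing the summed loss maximizes that likelihood. For example, if the model assigns probability 0.25 to the observed word, the loss at that position is -ln 0.25 ≈ 1.386; a perfect prediction with probability 1 gives a loss of 0.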
    • Large models need very large datasets because they contain many parameters.
  • Positional information
    • Basic self-attention alone does not know word order.
    • Without masking, rearranging the input tokens simply rearranges the attention outputs in the same way, so attention based only on token identities cannot tell one word order from another.
    • Transformers therefore add positional information to embeddings.
    • Positional encodings allow the model to represent not only which words occur, but where they occur in the sequence.
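The effect of adding positional information can be seen in a small sketch. It uses the sinusoidal encoding from the original transformer paper as one common choice; the talk may have presented a different scheme, and the one-hot style word vectors are made up for the example.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sine/cosine position vector: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding[:d_model]

def embed(words, table, use_positions):
    """Look up each word vector and optionally add its position vector."""
    vectors = []
    for pos, word in enumerate(words):
        vec = list(table[word])
        if use_positions:
            vec = [v + p for v, p in zip(vec, sinusoidal_encoding(pos, len(vec)))]
        vectors.append(vec)
    return vectors

# Hypothetical 4-dimensional word embeddings.
toy_table = {
    "dog":   [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 0.0],
}
sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

# Compare the two sentences as unordered collections of vectors.
same_without = sorted(embed(sentence_a, toy_table, False)) == sorted(embed(sentence_b, toy_table, False))
same_with = sorted(embed(sentence_a, toy_table, True)) == sorted(embed(sentence_b, toy_table, True))

print(same_without)  # True: the same vectors, only in a different order
print(same_with)     # False: position changes each vector
```

Without positions, "dog bites man" and "man bites dog" hand the model exactly the same collection of vectors. Once position vectors are added, "dog" at position 0 and "dog" at position 2 are different inputs.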
  • Types of transformer models
    • Decoder-only models
      • Map context-aware vectors to predicted next tokens.
      • Used in systems such as ChatGPT and Claude.
      • Best suited for generative language modeling.
    • Encoder-only models
      • Map full token sequences into context-aware vectors without next-token generation.
      • Useful for representation tasks.
    • Encoder-decoder models
      • Use an encoder to represent an input sequence and a decoder to generate an output sequence.
      • Useful for tasks such as machine translation.
  • Code example
    • Hoyle briefly shows how transformer operations can be implemented in PyTorch.
    • Key operations include matrix multiplication for query-key similarity, scaling by the square root of the vector dimension, masking, softmax, and multiplying attention weights by value vectors.
    • The point is that once high-level tensor operations are available, the mathematical structure of a transformer can be expressed compactly in code.
  • Q&A
    • A Markov model cannot generate words outside its fixed vocabulary.
    • Transformer models also predict only from a fixed vocabulary, but modern vocabularies are large enough to cover most practical cases.
    • Infrequent or domain-specific words can be handled better through domain-specific fine-tuning.
    • Older or smaller embedding datasets may produce weaker similarity results than modern embeddings.
    • The main structural difference between encoder and decoder blocks is masking: decoders mask future tokens, while encoders generally do not.
  • Overall takeaway
    • A large language model can be understood as a system that:
      • Converts tokens into vectors.
      • Uses attention to make those vectors context-aware.
      • Uses a classifier to predict the next token.
      • Repeats this process to generate text.
    • The mathematics is built from familiar components: vectors, matrices, inner products, softmax, probability, and loss minimization.

Citation

BibTeX citation:
@online{bochman2026,
  author = {Bochman, Oren},
  title = {Introduction to the {Math} {Behind} {Transformers} and
    {LLMs}},
  date = {2026-04-27},
  url = {https://orenbochman.github.io/posts/2026/04-27-ODSC-AI-2026-Day-0/talk2.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2026. “Introduction to the Math Behind Transformers and LLMs.” April 27. https://orenbochman.github.io/posts/2026/04-27-ODSC-AI-2026-Day-0/talk2.html.