How Large Language Models Like GPT Generate Text: A Deep Dive into Stochastic Decoding, Information Theory, and Randomness

Charan H U · Oct 18, 2024 · 12 min read

Ever wondered how models like GPT-4 or ChatGPT generate responses that seem so natural, almost as if they were written by a human? While large language models are trained to predict the most probable next word based on previous context, the actual text generation process is far more nuanced and sophisticated. The way these models select words isn’t just about predicting the most likely option — it involves a balance of randomness (or stochasticity) and predictability. This fine-tuned balance is what makes their responses feel dynamic and less robotic.

In this blog post, we’ll take a detailed look into how large language models (LLMs) generate text, breaking down key mathematical formulations, decoding strategies, and the role of randomness. We’ll also explore how techniques from information theory are being used to optimize text generation further, ensuring models can produce expressive and human-like responses. Along the way, we’ll cover common techniques like top-k, top-p (nucleus sampling), temperature sampling, and the emerging idea of typical sampling.

1. The Basics of Text Generation: How Do LLMs Work?

Predicting the Next Word

At the core of every large language model is the ability to predict the next word in a sequence. Imagine a sentence being built word by word. For each new word, the model must assign a probability to each possible word in its vocabulary and select the most appropriate one based on the prior context.

Mathematically, let’s represent a sentence as a sequence of words ( S = {w_1, w_2, \dots, w_n} ). The model’s task is to predict the probability of the next word ( w_{n+1} ), given the context of the words ( {w_1, w_2, \dots, w_n} ). This probability is denoted as ( P(w_{n+1} | w_1, w_2, \dots, w_n) ).

To compute this probability, the model uses a softmax function, which takes in the scores (or logits) from the neural network and converts them into probabilities:

[ P(w_{n+1} = w_i | S) = \frac{\exp(z_i)}{\sum_j \exp(z_j)} ]

Here, ( z_i ) is the score assigned by the model to the ( i )-th token, and the softmax function normalizes these scores into a probability distribution.
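To make this concrete, here is a minimal sketch of the softmax step using NumPy and a toy five-token vocabulary (the logit values are made up purely for illustration):

```python
import numpy as np

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    shifted = logits - np.max(logits)   # subtract the max logit for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Toy logits for a five-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.5, -1.0, -2.0])
probs = softmax(logits)
print(probs)        # roughly [0.60, 0.22, 0.13, 0.03, 0.01]: higher logits get more mass
print(probs.sum())  # 1.0
```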

Limitations of Greedy Search

A straightforward approach would be to always pick the word with the highest probability at each step — this is known as greedy search. However, this leads to a significant limitation: the generated text becomes predictable and repetitive. In conversations, this would feel robotic and rigid, as the model is stuck in a loop of selecting the most likely word without exploring alternative options.

For example, if you ask an LLM for facts about cats, using greedy search might always return the same fact over and over because it doesn’t introduce any variation in the output.
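As a quick illustration (with a made-up four-token distribution), greedy search reduces to an argmax, so repeated calls always yield the same token:

```python
import numpy as np

def greedy_next_token(probs):
    """Greedy search: always pick the index of the most probable token."""
    return int(np.argmax(probs))

probs = np.array([0.45, 0.30, 0.15, 0.10])            # illustrative next-token distribution
print([greedy_next_token(probs) for _ in range(5)])   # [0, 0, 0, 0, 0] every time
```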

This is where stochastic (randomized) decoding strategies come into play.

2. Introducing Randomness: Why is It Necessary?

The Need for Stochasticity

Language is inherently creative, and human conversations are full of subtle variations. To mimic this, LLMs must not only be accurate but also inject a certain level of unpredictability into their responses. This allows them to produce more engaging and natural output. In essence, we want the model to “take risks” and sometimes choose less likely words to make the conversation feel more human-like.

Without any randomness, LLMs would:

  • Repeat the same information: Always selecting the highest probability word leads to repetitive, robotic output.
  • Lack variation: Conversations would feel monotonous, without the variety and creativity that make human interactions interesting.

However, too much randomness can lead to incoherence and nonsense. So, the key question becomes: how can we introduce the right amount of randomness to balance between predictability and creativity?

Let’s explore three major stochastic decoding strategies — top-k sampling, top-p sampling, and temperature scaling — that are widely used to introduce randomness in LLMs.

3. Top-k Sampling: Truncating the Probability Space

How Top-k Sampling Works

One of the simplest stochastic methods is top-k sampling. Instead of considering the full vocabulary, the model selects the top ( k ) most probable tokens and samples the next word from this limited set. By restricting the choice to only the top ( k ) tokens, we ensure that the generated text remains somewhat diverse but still guided by high-probability words.

The steps in top-k sampling can be mathematically described as:

  1. From the probability distribution ( P(w_{n+1} | S) ), keep only the ( k ) tokens with the highest probabilities.
  2. Renormalize the probabilities of these ( k ) tokens so they sum to 1:

  [ P_k(w_{n+1} = w_i | S) = \frac{P(w_i | S)}{\sum_{j=1}^{k} P(w_j | S)} ]

  3. Randomly sample one token from this truncated distribution and append it to the current sequence.
  4. Repeat this process until a termination condition is met (e.g., generating an end-of-sequence token or reaching a maximum length). A code sketch of these steps follows below.
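Here is a minimal NumPy sketch of these steps; the probabilities and the choice of ( k ) are made up for illustration, and real implementations work directly on the model's logits:

```python
import numpy as np

def top_k_sample(probs, k, rng=None):
    """Sample the next token id from the k most probable tokens only."""
    if rng is None:
        rng = np.random.default_rng()
    top_indices = np.argsort(probs)[-k:]       # indices of the k largest probabilities
    top_probs = probs[top_indices]
    top_probs = top_probs / top_probs.sum()    # renormalize the truncated distribution to sum to 1
    return int(rng.choice(top_indices, p=top_probs))

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(top_k_sample(probs, k=3))  # returns 0, 1, or 2, chosen at random on each call
```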

Challenges with Top-k Sampling

While top-k introduces randomness, it has some limitations:

  • Flat Distributions: If the probability distribution is relatively uniform (i.e., many words have similar probabilities), limiting the choices to the top ( k ) tokens might cut off potentially interesting and relevant words. This can reduce the diversity of the generated text.
  • Peaky Distributions: On the other hand, if the probability distribution is highly concentrated (i.e., one or two words have very high probabilities), a large ( k ) pads the sampling pool with irrelevant low-probability tokens, adding noise rather than useful diversity.

Thus, finding an optimal ( k ) is a balancing act, and a fixed ( k ) may not always work well in all contexts.

4. Top-p (Nucleus) Sampling: A Dynamic Approach

How Top-p Sampling Works

To address some of the limitations of top-k, top-p sampling (also called nucleus sampling) provides a more adaptive approach. Instead of fixing the number of tokens to consider, top-p selects the smallest number of tokens whose cumulative probability exceeds a certain threshold ( p ). For instance, you might set ( p = 0.7 ), meaning the model will choose from a subset of tokens that together account for 70% of the total probability mass.

The process for top-p sampling works as follows:

  1. Sort the tokens by probability in descending order.
  2. Select the smallest subset of tokens ( {w_1, w_2, \dots, w_m} ) such that:

  [ \sum_{i=1}^{m} P(w_i | S) \geq p ]

  3. Renormalize the probabilities of this subset to sum to 1, just as in top-k sampling:

  [ P_p(w_{n+1} = w_i | S) = \frac{P(w_i | S)}{\sum_{j=1}^{m} P(w_j | S)} ]

  4. Randomly sample one token from this truncated distribution (see the sketch below).
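A minimal sketch of nucleus sampling, again with an illustrative distribution and threshold rather than values from any particular model:

```python
import numpy as np

def top_p_sample(probs, p, rng=None):
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability is at least p."""
    if rng is None:
        rng = np.random.default_rng()
    sorted_ids = np.argsort(probs)[::-1]                         # token ids, highest probability first
    sorted_probs = probs[sorted_ids]
    cutoff = int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1
    nucleus_ids = sorted_ids[:cutoff]                            # smallest set covering mass >= p
    nucleus_probs = sorted_probs[:cutoff]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()          # renormalize the nucleus
    return int(rng.choice(nucleus_ids, p=nucleus_probs))

probs = np.array([0.45, 0.20, 0.15, 0.10, 0.06, 0.04])
print(top_p_sample(probs, p=0.7))  # samples from tokens {0, 1, 2}, since 0.45 + 0.20 + 0.15 >= 0.7
```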

How Top-p Sampling Solves Top-k’s Problems

Top-p sampling dynamically adjusts the number of tokens considered based on the shape of the probability distribution, making it more versatile than top-k.

  • Flat Distributions: In cases where many tokens have similar probabilities, top-p will include more tokens in the sampling pool, preserving diversity.
  • Peaky Distributions: When the distribution is highly concentrated, top-p will focus on the most likely tokens, reducing the chances of selecting irrelevant words.

By adapting to the distribution, top-p sampling can better balance between diversity and relevance, making it a popular choice in modern LLMs.

5. Temperature Sampling: Controlling the Randomness

What is Temperature Sampling?

Temperature sampling provides an additional way to control the level of randomness in the generated text. The idea is to adjust the “sharpness” of the probability distribution by scaling the logits (scores) before applying the softmax function. The temperature parameter ( T ) controls this scaling:

[ P_T(w_{n+1} = w_i | S) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} ]

Here, ( T ) can take any value greater than 0, and the effect of temperature on the probability distribution can be summarized as follows:

  • ( T = 1 ): The original probability distribution remains unchanged.
  • ( T < 1 ): The distribution becomes sharper, concentrating more probability mass on the most likely tokens. As ( T ) approaches 0, the model approximates greedy search, where only the most likely token is selected.
  • ( T > 1 ): The distribution becomes flatter, increasing the probability of selecting less probable tokens. As ( T ) increases, the model samples from a wider range of tokens, introducing more randomness.
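The sketch below applies this scaling to a small, made-up set of logits so the effect of ( T ) is visible directly in the resulting probabilities:

```python
import numpy as np

def temperature_softmax(logits, T):
    """Scale logits by 1/T before the softmax to sharpen or flatten the distribution."""
    scaled = logits / T
    scaled = scaled - scaled.max()        # numerical stability
    exp_scores = np.exp(scaled)
    return exp_scores / exp_scores.sum()

logits = np.array([2.0, 1.0, 0.0, -1.0])   # illustrative values
print(temperature_softmax(logits, T=1.0))  # the unmodified softmax distribution
print(temperature_softmax(logits, T=0.5))  # sharper: most of the mass moves to token 0
print(temperature_softmax(logits, T=2.0))  # flatter: mass spreads across all tokens
```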

The Entropy Perspective

Temperature directly influences the entropy of the probability distribution.

Entropy is a measure of uncertainty or randomness in a distribution, defined as:

[ H(P) = -\sum_i P(w_i | S) \log P(w_i | S) ]

  • Low Temperature (T < 1): Lower temperature results in lower entropy, meaning the model becomes more predictable and conservative in its choices.
  • High Temperature (T > 1): Higher temperature increases entropy, allowing the model to explore more diverse and less probable options.
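Reusing the toy logits from the previous sketch, the two regimes above can be checked numerically; the exact numbers are illustrative, but the trend of entropy rising with temperature always holds:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(P) = -sum_i P_i log P_i, in nats."""
    probs = probs[probs > 0]              # ignore zero-probability tokens to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))

logits = np.array([2.0, 1.0, 0.0, -1.0])
for T in (0.5, 1.0, 2.0):
    scaled = np.exp(logits / T - np.max(logits / T))
    probs = scaled / scaled.sum()
    print(f"T={T}: entropy = {entropy(probs):.3f} nats")
# Lower T gives lower entropy (more predictable); higher T gives higher entropy (more exploratory).
```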

The choice of temperature depends on the desired behavior of the model. For example, in creative writing tasks, a higher temperature might be used to encourage more imaginative and varied outputs. In contrast, for tasks requiring high accuracy, such as summarization, a lower temperature might be preferred to focus on the most relevant words.

6. Typical Sampling: An Information-Theoretic Approach

Building on these ideas, typical sampling introduces a novel approach rooted in information theory. The goal here is to generate text with an optimal information density — meaning the output should convey enough information to be interesting without overwhelming the reader with overly complex or unexpected words.

The underlying assumption is that human language operates at a balance between predictability and surprise. Too much predictability leads to boring and repetitive text, while too much surprise can lead to incoherence or confusion. Typical sampling attempts to model this balance mathematically by focusing on the entropy of the predicted word distribution.

Entropy and Typicality in Language Generation

In information theory, entropy is a measure of uncertainty or randomness in a probability distribution. Mathematically, it is defined as:

[ H(P) = -\sum_i P(w_i | S) \log P(w_i | S) ]

Here, ( P(w_i | S) ) is the probability of token ( w_i ) given the context ( S ), and entropy ( H(P) ) quantifies the average amount of information or surprise carried by the tokens in the distribution.

Typical sampling works by sampling tokens whose information content, or surprisal ( -\log P(w_i | S) ), is close to the entropy of the distribution, which is the expected information content of the next token. It avoids selecting tokens that are either far more or far less surprising than average, both of which can lead to poor-quality text generation.

  • High-Surprisal Tokens: Tokens with very high information content are surprising or unlikely given the context. While they might introduce creativity, they also risk producing nonsensical or overly complex sequences that are difficult to process.
  • Low-Surprisal Tokens: Tokens with very low information content are highly predictable and can lead to overly conservative or repetitive text. They add little new information to the conversation and can make the interaction feel stagnant.

By focusing on tokens with typical information content, typical sampling strikes a balance between predictability and surprise, generating text that is both engaging and coherent.

How Typical Sampling Works

In practice, typical sampling proceeds as follows:

  1. Calculate the entropy of the probability distribution over the next tokens given the context ( S ).
  2. Select the tokens whose surprisal ( -\log P(w_i | S) ) is closest to that entropy, so that each candidate contributes neither too much nor too little information.
  3. Normalize the probabilities of these tokens to form a new distribution from which one token is sampled.
  4. Repeat this process for each subsequent token until a stopping condition (e.g., end of sentence) is met.

This approach ensures that the generated text is both informative and understandable, avoiding overly simplistic or overly complex sequences.
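As a rough NumPy sketch of this idea (the `mass` threshold and the toy distribution below are assumptions made here for illustration, not values from the original paper), one can rank tokens by how close their surprisal is to the entropy and keep just enough of them to cover a chosen probability mass:

```python
import numpy as np

def typical_sample(probs, mass=0.9, rng=None):
    """Sketch of typical sampling: keep the tokens whose surprisal -log P(w | S)
    is closest to the entropy of the distribution, until they cover `mass`
    of the probability, then sample from that renormalized set."""
    if rng is None:
        rng = np.random.default_rng()
    surprisal = -np.log(probs)                     # information content of each token
    H = float(np.sum(probs * surprisal))           # entropy = expected surprisal
    order = np.argsort(np.abs(surprisal - H))      # tokens closest to "typical" first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), mass)) + 1
    keep = order[:cutoff]                          # smallest typical set covering `mass`
    keep_probs = probs[keep] / probs[keep].sum()   # renormalize the kept tokens
    return int(rng.choice(keep, p=keep_probs))

probs = np.array([0.50, 0.20, 0.15, 0.10, 0.05])   # illustrative next-token distribution
print(typical_sample(probs, mass=0.9))
```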

Why Typical Sampling Matters

The primary advantage of typical sampling is that it offers a more natural balance between creativity and coherence than traditional methods like top-k or top-p sampling. By targeting tokens with average information content, it encourages the generation of text that feels more aligned with human communication patterns.

In research, typical sampling has shown promising results, especially in tasks like story generation and abstractive summarization, where it’s crucial to maintain a balance between providing relevant details and avoiding redundancy. When compared to other methods, typical sampling reduces issues like repetition and improves the overall quality of the generated text.

Typical Sampling vs. Top-k and Top-p Sampling

To understand the practical difference between typical sampling and the more traditional top-k and top-p methods, let’s recap the key points:

  • Top-k Sampling: Limits the choice to a fixed number ( k ) of high-probability tokens, which can sometimes lead to either too much or too little diversity, depending on the distribution.
  • Top-p (Nucleus) Sampling: Dynamically adjusts the number of tokens based on cumulative probabilities but doesn’t directly consider the information content of the tokens.
  • Typical Sampling: Unlike these methods, typical sampling directly targets the information density (entropy) of the tokens, ensuring that the output contains just the right amount of unpredictability, balancing creativity and clarity.

The result is a more sophisticated, human-like approach to generating text that prioritizes maintaining an optimal level of engagement without veering into chaos or monotony.

7. Beyond Token-Level Sampling: Future Directions in Text Generation

While the current generation of LLMs primarily focuses on token-level sampling techniques like top-k, top-p, and typical sampling, there’s growing interest in exploring more complex strategies for generating text. These might include:

7.1 Sentence-Level Planning

Instead of focusing solely on individual tokens, future models could incorporate sentence-level or paragraph-level planning to structure their responses. This could involve generating multiple sentences with specific objectives in mind, such as answering a question, providing an example, or drawing a conclusion.

By combining token-level and sentence-level generation strategies, LLMs could become even better at producing coherent and well-organized text, particularly for long-form content.

7.2 Multi-Objective Text Generation

Another avenue of research is multi-objective text generation, where models simultaneously optimize for several factors — such as factual accuracy, creativity, and engagement — during the decoding process. This approach could improve the quality of responses in specialized applications, such as academic writing, storytelling, or technical explanations.

In this context, sampling techniques like top-k, top-p, and typical sampling might be adapted to handle multiple objectives, each with its own set of constraints and priorities.

7.3 Hierarchical Language Models

Current LLMs process text at a relatively low level, predicting one token at a time. However, hierarchical language models — which can generate entire phrases, sentences, or even paragraphs in a single step — are a potential future development.

These models would be able to plan text generation at multiple levels of abstraction, incorporating both local coherence (within a sentence) and global coherence (across an entire paragraph or document).

8. Practical Applications of Stochastic Decoding Strategies

The stochastic decoding strategies discussed here are not just theoretical concepts — they have real-world applications across a wide range of tasks. Let’s look at how these techniques impact different use cases.

8.1 Creative Writing and Storytelling

For tasks like creative writing or storytelling, where variety and unpredictability are crucial, methods like top-p sampling and high-temperature sampling allow the model to generate more diverse and imaginative content. By introducing more randomness, the model is free to explore less probable sequences, leading to unexpected but engaging narratives.

In contrast, typical sampling can be particularly useful in ensuring that the story maintains coherence while still introducing creative elements. This makes it well-suited for generating long-form content, where maintaining narrative structure is important.

8.2 Abstractive Summarization

In tasks like abstractive summarization, where the goal is to create a concise summary of a larger text, typical sampling shines. By focusing on generating text with the right amount of information density, it helps ensure that the summary is both informative and easy to understand. It also reduces the risk of generating repetitive or redundant information, which can be an issue with other decoding strategies.

8.3 Conversational Agents

For conversational agents like chatbots, a balance between predictability and creativity is essential. In this context, temperature sampling is often used to control how creative or conservative the model’s responses are. Low temperatures are preferred when the goal is to provide accurate, fact-based responses, while higher temperatures can be used in more casual, creative conversations.

Top-k and top-p sampling can also be useful, depending on the conversational goals. Top-p is generally preferred in conversational settings because it dynamically adjusts to the context, providing more flexibility in the generated responses.

9. Conclusion: The Art of Balancing Randomness and Coherence

The generation of text by large language models like GPT is an intricate dance between randomness and probability. By introducing controlled randomness through techniques like top-k, top-p, temperature sampling, and typical sampling, these models can produce text that feels dynamic, creative, and human-like.

Each of these methods has its strengths and weaknesses, and the choice of which to use depends on the specific task at hand. Top-k and top-p sampling introduce varying degrees of randomness, while temperature sampling provides finer control over the level of creativity in the output. Typical sampling, on the other hand, leverages principles from information theory to generate text that is neither too predictable nor too surprising.

As language models continue to evolve, we will likely see even more sophisticated decoding strategies that incorporate multi-objective optimization, sentence-level planning, and hierarchical generation. These developments promise to make AI-generated text even more coherent, engaging, and useful in a wide range of applications.

Ultimately, the success of LLMs hinges on their ability to balance predictability with creativity, making them capable of generating not only accurate but also compelling text. The stochastic decoding strategies discussed in this blog are a critical part of this process, and their refinement will continue to shape the future of AI-driven text generation.

Further Reading:

  • If you want to dive deeper into the mathematical foundations of language models, check out our article on transformer architectures and how they revolutionized NLP.
  • To learn more about typical sampling and its applications, explore the original research paper that introduced the concept.

Stay tuned for more updates on how large language models are reshaping the landscape of AI and natural language processing!
