Constrained Decoding

How LLMs are guided to produce valid, structured outputs through token-level filtering.

Published December 31, 2025

What is Constrained Decoding?

Constrained decoding is a technique that forces language models to produce outputs conforming to a specific structure—like valid JSON, a regex pattern, or a formal grammar. Instead of hoping the model follows your formatting instructions, constrained decoding makes invalid outputs impossible at the token level.

The Problem It Solves

When you ask an LLM for JSON, you might get:

Here's the JSON you requested:

{
  "name": "Alice",
  "age": 30
}

Let me know if you need anything else!

The model wrapped your JSON in conversational text. Even with strong prompts, models can:

  • Add explanatory text around structured data
  • Use incorrect quote styles or trailing commas
  • Miss required fields or add extra ones
  • Produce syntactically invalid output

For programmatic use, "almost valid" is the same as broken.

How It Works

At each step of text generation, a language model produces a probability distribution over its entire vocabulary—typically 50,000+ tokens. Normally, you'd sample from this full distribution. Constrained decoding intervenes by:

  1. Filtering: Determine which tokens would keep the output on a valid path according to your constraint (schema, regex, grammar)
  2. Masking: Set the probability of invalid tokens to zero
  3. Renormalizing: Scale up the remaining probabilities so they sum to 1
  4. Sampling: Choose from only the valid tokens

The key insight: we're filtering, not overriding. If the model prefers token A over token B (3:1 odds), and both are valid, that preference is preserved. The model still guides fluent generation—we just prevent structural violations.
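The four steps can be sketched in a few lines. This is a minimal illustration, not any library's API: `constrained_sample_probs` and its dict-based logits are hypothetical stand-ins for real vocabulary-sized tensors.

```python
import math

def constrained_sample_probs(logits, valid_ids):
    """Mask invalid tokens, then renormalize the survivors.

    `logits` maps token id -> raw score; `valid_ids` is the set of
    tokens the constraint allows next (both hypothetical inputs).
    """
    # Softmax over the full vocabulary.
    m = max(logits.values())
    exp = {t: math.exp(s - m) for t, s in logits.items()}
    total = sum(exp.values())
    probs = {t: e / total for t, e in exp.items()}

    # Masking: zero out everything the constraint forbids.
    masked = {t: (p if t in valid_ids else 0.0) for t, p in probs.items()}

    # Renormalizing: the surviving probabilities sum to 1 again.
    z = sum(masked.values())
    return {t: p / z for t, p in masked.items()}

# Token A is preferred 3:1 over token B; token C violates the constraint.
logits = {"A": math.log(3), "B": math.log(1), "C": math.log(12)}
out = constrained_sample_probs(logits, valid_ids={"A", "B"})
# The 3:1 preference between A and B survives masking: A=0.75, B=0.25
```

Note that the relative odds between the valid tokens are untouched; only the forbidden token's mass is redistributed.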

Optimization: Skipping Deterministic Tokens

Modern implementations skip generation entirely for tokens that are uniquely determined by the constraint. If your schema has a single required string field "name", the opening characters {"name": " are forced—there is only one valid path. No need to run inference for them; just emit them directly. This can make constrained decoding faster than unconstrained generation, not slower.
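A toy sketch of this fast-forwarding idea, assuming a hypothetical `valid_next` constraint oracle and a stand-in `model_pick` in place of a real forward pass:

```python
def generate_with_fast_forward(valid_next, model_pick, max_len=32):
    """Skip inference whenever the constraint leaves only one option.

    `valid_next(prefix)` returns the set of characters the constraint
    allows after `prefix`; `model_pick(prefix, options)` stands in for
    a model forward pass (both are hypothetical callables).
    """
    out, calls = [], 0
    while len(out) < max_len:
        options = valid_next(out)
        if not options:
            break                               # constraint says: output complete
        if len(options) == 1:
            out.append(next(iter(options)))     # forced token: no inference
        else:
            out.append(model_pick(out, options))
            calls += 1
    return "".join(out), calls

TEMPLATE = '{"name": "?"}'   # '?' marks the one model-chosen slot

def valid_next(prefix):
    i = len(prefix)
    if i >= len(TEMPLATE):
        return set()
    if TEMPLATE[i] == "?":
        return {"A", "B", "C"}    # model is free to choose here
    return {TEMPLATE[i]}          # deterministic: forced by the structure

result, calls = generate_with_fast_forward(valid_next, lambda p, opts: "A")
# result == '{"name": "A"}' after a single "forward pass" (calls == 1)
```

Every structural character was emitted without touching the model; only the value itself cost an inference step.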

Types of Constraints

Regex patterns — Simple but limited. Good for emails, phone numbers, dates, or a constrained set of choices like (yes|no).
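For a choice constraint like (yes|no), the filtering step reduces to asking which characters keep the prefix on a path to a full match. A naive sketch (real libraries compile the pattern to an automaton instead of scanning candidate strings):

```python
def allowed_next(choices, prefix):
    """Characters that keep `prefix` on a path to one of `choices`.

    Illustrative sketch of a (yes|no)-style choice constraint; the
    name and interface are not taken from any particular library.
    """
    return {c[len(prefix)] for c in choices
            if c.startswith(prefix) and len(c) > len(prefix)}

CHOICES = {"yes", "no"}
# allowed_next(CHOICES, "")    -> {"y", "n"}   both answers still reachable
# allowed_next(CHOICES, "ye")  -> {"s"}        forced: only one continuation
# allowed_next(CHOICES, "yes") -> set()        complete: nothing may follow
```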

JSON Schema — The most common use case. Specify required fields, types, enums, nested objects. OpenAI and other providers convert schemas to grammars internally.

Context-Free Grammars (CFGs) — The most powerful option. CFGs can express nested and recursive structures that regex cannot, like balanced parentheses or nested JSON. Frameworks like llama.cpp use GBNF (GGML BNF) format.
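For instance, balanced parentheses—which no plain regex can express—take only two recursive rules in GBNF (an illustrative sketch, not taken from any shipped grammar file):

```
# Zero or more balanced parenthesis groups, nested arbitrarily deep
root  ::= paren*
paren ::= "(" paren* ")"
```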

Clarification: Structured ≠ Deterministic

Here's a common misconception: constrained decoding does not guarantee identical outputs from identical inputs.

Constrained decoding ensures structural validity—your JSON will always parse. But given the same prompt twice, you might get different valid completions:

{"status": "success", "count": 42}
{"status": "success", "count": 42, "message": "Done"}

Both are valid against a permissive schema. The constraint doesn't pick which valid output you get—it only guarantees that whatever you get is valid.
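To make that concrete: both completions pass a minimal stand-in for schema validation (a hypothetical `satisfies_schema` that only requires a string-valued "status" field) while still differing from each other.

```python
import json

# Two different completions from the same prompt; both satisfy a
# permissive schema that only requires a string "status" field.
outputs = [
    '{"status": "success", "count": 42}',
    '{"status": "success", "count": 42, "message": "Done"}',
]

def satisfies_schema(text):
    """Minimal stand-in for real JSON Schema validation (hypothetical):
    the output must parse and contain a string-valued "status" field."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("status"), str)

# Both parse and both validate -- yet they are not identical.
assert all(satisfies_schema(o) for o in outputs)
assert outputs[0] != outputs[1]
```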

True determinism is a separate problem; even with constraints in place, run-to-run variation can come from:

  • Dynamic batching: Your request gets batched with others, affecting numerical precision
  • Hardware variations: Different GPUs may produce slightly different floating-point results
  • Non-deterministic operations: Some CUDA kernels are non-deterministic by default

OpenAI's seed parameter improves reproducibility but doesn't guarantee it. Research projects like batch-invariant-ops achieve true bitwise-identical outputs, but with a 10-40% performance cost.

Practical Applications

  • API responses: Guarantee your LLM always returns parseable JSON matching your schema
  • Code generation: Ensure syntactically valid SQL, regex, or configuration files
  • Data extraction: Pull structured records from unstructured text
  • Form filling: Map natural language to specific field values
  • Tool use: Force outputs that match function signatures exactly
