Structured Output: Getting Reliable JSON, Tables, and Code from LLMs

Hero image: A factory assembly line where raw text enters one side and perfectly formatted JSON objects, tables, and code blocks emerge from the other, glowing with validation checkmarks

You've crafted the perfect system prompt. Your Chain-of-Thought reasoning produces brilliant analysis. Your few-shot examples nail the logic. Then you deploy to production, and everything breaks — because the model returned {analysis: "good"} instead of {"analysis": "good"}, or wrapped the JSON in a markdown code fence, or added a friendly "Here's the JSON you requested:" before the actual data.

Welcome to the structured output problem — the gap between getting the right answer and getting the right answer in the right format. In production AI applications, format reliability matters as much as content accuracy. A brilliant analysis that can't be parsed is worthless.

This is Part 4 of our Prompt Engineering Deep-Dive series. We've covered how system prompts set behavior (Part 1), Chain-of-Thought improves reasoning (Part 2), and few-shot examples teach patterns (Part 3). Now we tackle the engineering challenge of getting LLMs to produce machine-readable output consistently.

Why Structured Output Is Hard

LLMs are trained on natural language. Their default mode is to produce conversational, human-readable text. Asking them to produce strict JSON, SQL, or formatted tables goes against this default — and the model will constantly try to "help" by adding natural language around the structured data.

Common failure modes:

// Model adds conversational wrapper
"Sure! Here's the JSON you requested:
{\"result\": \"value\"}"

// Model uses single quotes instead of double
{'result': 'value'}

// Model adds trailing comma (invalid JSON)
{"items": ["a", "b", "c",]}

// Model wraps in markdown code fence
```json
{"result": "value"}
```


// Model adds comments in JSON (invalid)
{
  "result": "value"  // this is the main result
}

Each of these produces valid-looking output that fails JSON.parse(). At 100 API calls per minute, even a 2% format failure rate means 2 broken responses every minute.
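A quick sanity check with the standard library confirms that every variant above trips a strict parser:

```python
import json

# Each "almost JSON" failure mode from above
bad_outputs = [
    "Sure! Here's the JSON you requested:\n{\"result\": \"value\"}",  # conversational wrapper
    "{'result': 'value'}",                 # single quotes
    '{"items": ["a", "b", "c",]}',         # trailing comma
    '```json\n{"result": "value"}\n```',   # markdown code fence
    '{"result": "value" // comment\n}',    # JSON comment
]

for raw in bad_outputs:
    try:
        json.loads(raw)
        parsed = True
    except json.JSONDecodeError:
        parsed = False
    print(parsed)  # False for every variant
```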

Architecture diagram showing the structured output pipeline: Prompt → LLM → Raw Output → Parser → Validation → Retry Loop

flowchart TB
    subgraph PROBLEM ["The Format Reliability Problem"]
        direction TB
        P1["LLM generates text"]
        P2{"Is output valid?"}
        P3["Parse & use"]
        P4["Retry / fallback"]
        P5["Log failure"]
        P1 --> P2
        P2 -->|"Yes (85-95%)"| P3
        P2 -->|"No (5-15%)"| P4
        P4 -->|"Still fails"| P5
        P4 -->|"Fixed"| P3
    end
    subgraph SOLUTION ["Solutions (by reliability)"]
        direction TB
        S1["Prompt engineering<br/>85-95% reliable"]
        S2["Few-shot examples<br/>90-97% reliable"]
        S3["API-level constraints<br/>99-100% reliable"]
    end
    style P1 fill:#3498db,stroke:#2980b9,color:#fff
    style P2 fill:#f39c12,stroke:#e67e22,color:#fff
    style P3 fill:#2ecc71,stroke:#27ae60,color:#fff
    style P4 fill:#e74c3c,stroke:#c0392b,color:#fff
    style P5 fill:#e74c3c,stroke:#c0392b,color:#fff
    style S1 fill:#f39c12,stroke:#e67e22,color:#fff
    style S2 fill:#3498db,stroke:#2980b9,color:#fff
    style S3 fill:#2ecc71,stroke:#27ae60,color:#fff
    style PROBLEM fill:#1a1a2e,stroke:#e74c3c,color:#fff
    style SOLUTION fill:#1a1a2e,stroke:#2ecc71,color:#fff

Level 1: Prompt-Based Structured Output

The simplest approach — use your prompt to request a specific format.

Technique 1: Explicit Format Instructions

System: You are a data extraction API. Return ONLY valid JSON.
Do not include any text before or after the JSON object.
Do not wrap the JSON in markdown code fences.
Do not include comments in the JSON.

The JSON schema is:
{
  "name": string,
  "email": string,
  "company": string,
  "role": string,
  "sentiment": "positive" | "negative" | "neutral"
}

Reliability: ~85-90%. Works most of the time, but the model will occasionally add wrapper text or deviate from the schema.

Technique 2: JSON Mode Trigger Words

Certain phrases in your prompt significantly improve JSON compliance:

// These phrases help:
"Respond with ONLY a JSON object"
"Output raw JSON with no explanation"
"Return valid JSON matching this exact schema"
"Your entire response must be parseable by JSON.parse()"

// This phrase hurts:
"Return the result as JSON"  // Too vague, model adds explanation

Technique 3: Start the Response

Pre-fill the beginning of the model's response to force the format:

# Anthropic's API supports pre-filling
response = client.messages.create(
    model="claude-sonnet-4-6",
    system="Extract contact info as JSON.",
    messages=[
        {"role": "user", "content": "John Smith, john@acme.com, CTO at Acme Corp. Very positive about our product."},
        {"role": "assistant", "content": "{"}  # Pre-fill forces JSON start
    ]
)
# Model continues from "{" — guaranteed to start as JSON

This is one of the most effective prompt-level techniques. By pre-filling {, you eliminate the "Sure, here's the JSON:" problem entirely. The model has no choice but to continue with JSON.
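One caveat: the pre-filled `{` belongs to the assistant turn, not the completion, so you must prepend it before parsing. A minimal sketch, with a hypothetical completion string standing in for the API response:

```python
import json

prefill = "{"
# Hypothetical text the model returns after continuing from the pre-fill
completion = '"name": "John Smith", "email": "john@acme.com", "sentiment": "positive"}'

# Reassemble the full JSON before parsing
data = json.loads(prefill + completion)
print(data["name"])  # John Smith
```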

Technique 4: Few-Shot with Exact Format

Show examples with the exact format you need:

Extract contact information from the text.

Text: "Jane Doe is a VP of Engineering at TechCorp (jane@techcorp.io). She seemed interested."
Output: {"name":"Jane Doe","email":"jane@techcorp.io","company":"TechCorp","role":"VP of Engineering","sentiment":"positive"}

Text: "Met Bob from StartupXYZ. No email shared. He was skeptical about pricing."
Output: {"name":"Bob","email":null,"company":"StartupXYZ","role":null,"sentiment":"negative"}

Text: "Sarah Chen, sarah.chen@bigco.com, Data Science Lead at BigCo."
Output:

Notice: the examples use compact JSON (no pretty-printing). This matters — the model will mimic whatever format your examples use.

Level 2: API-Level Structured Output

Modern LLM APIs offer built-in features that guarantee valid structured output.

OpenAI: Structured Outputs (response_format)

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ContactInfo(BaseModel):
    name: str
    email: str | None
    company: str
    role: str | None
    sentiment: str  # "positive", "negative", "neutral"

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract contact info."},
        {"role": "user", "content": "John Smith, CTO at Acme..."}
    ],
    response_format=ContactInfo
)

contact = response.choices[0].message.parsed
# contact.name == "John Smith" — guaranteed valid

Reliability: effectively 100%. The API constrains token generation to only produce valid JSON matching your schema; the documented exception is an explicit model refusal, which arrives as a refusal message rather than malformed JSON. This isn't post-processing — it happens during generation.

OpenAI: JSON Mode (simpler)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    response_format={"type": "json_object"}
)
# Guaranteed valid JSON, but no schema enforcement

Note: JSON mode requires the word "JSON" to appear somewhere in your messages (system or user), or the API rejects the request.

Anthropic: Tool Use for Structured Output

Claude doesn't have a dedicated JSON mode, but you can use tool definitions to enforce schemas:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{
        "name": "extract_contact",
        "description": "Extract contact information from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": ["string", "null"]},
                "company": {"type": "string"},
                "role": {"type": ["string", "null"]},
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"]
                }
            },
            "required": ["name", "company", "sentiment"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_contact"},
    messages=[
        {"role": "user", "content": "John Smith, CTO at Acme Corp. Very positive."}
    ]
)

# Tool input is guaranteed valid JSON matching the schema
result = response.content[0].input

Reliability: 99.9%+. Tool use constrains the model to produce valid JSON matching your tool's input schema.

flowchart TB
    subgraph LEVELS ["Structured Output Approaches"]
        direction TB
        subgraph L1 ["Level 1: Prompt Engineering"]
            direction LR
            L1A["Format instructions"]
            L1B["Few-shot examples"]
            L1C["Response pre-fill"]
        end
        subgraph L2 ["Level 2: API Constraints"]
            direction LR
            L2A["JSON mode"]
            L2B["Structured outputs"]
            L2C["Tool use"]
        end
        subgraph L3 ["Level 3: Post-Processing"]
            direction LR
            L3A["Regex extraction"]
            L3B["Schema validation"]
            L3C["Auto-retry"]
        end
    end
    L1 -->|"85-95%"| RESULT["Production-Ready Output"]
    L2 -->|"99-100%"| RESULT
    L3 -->|"Catches remaining"| RESULT
    style L1A fill:#f39c12,stroke:#e67e22,color:#fff
    style L1B fill:#f39c12,stroke:#e67e22,color:#fff
    style L1C fill:#f39c12,stroke:#e67e22,color:#fff
    style L2A fill:#2ecc71,stroke:#27ae60,color:#fff
    style L2B fill:#2ecc71,stroke:#27ae60,color:#fff
    style L2C fill:#2ecc71,stroke:#27ae60,color:#fff
    style L3A fill:#3498db,stroke:#2980b9,color:#fff
    style L3B fill:#3498db,stroke:#2980b9,color:#fff
    style L3C fill:#3498db,stroke:#2980b9,color:#fff
    style RESULT fill:#6C63FF,stroke:#8B83FF,color:#fff
    style L1 fill:#16213e,stroke:#f39c12,color:#fff
    style L2 fill:#16213e,stroke:#2ecc71,color:#fff
    style L3 fill:#16213e,stroke:#3498db,color:#fff
    style LEVELS fill:#1a1a2e,stroke:#6C63FF,color:#fff

Level 3: Post-Processing and Validation

Even with API-level constraints, you need a validation layer for defense in depth.

Pattern: Extract, Validate, Retry

import json
import re
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
    },
    "required": ["name", "sentiment"]
}

def extract_json(text: str) -> dict | None:
    """Extract JSON from model output, handling common issues."""
    # Try direct parse first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    
    # Strip markdown code fences
    text = re.sub(r'^```(?:json)?\s*', '', text, flags=re.MULTILINE)
    text = re.sub(r'```\s*$', '', text, flags=re.MULTILINE)
    
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass
    
    # Extract first JSON object from mixed text
    match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', text)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    
    return None

def get_structured_output(
    prompt: str,
    schema: dict,
    max_retries: int = 2
) -> dict:
    """Get validated structured output with retry logic."""
    for attempt in range(max_retries + 1):
        response = call_llm(prompt)
        
        parsed = extract_json(response)
        if parsed is None:
            if attempt < max_retries:
                prompt += "\n\nYour previous response was not valid JSON. Please respond with ONLY a valid JSON object."
                continue
            raise ValueError(f"Failed to parse JSON after {max_retries + 1} attempts")
        
        try:
            validate(instance=parsed, schema=schema)
            return parsed
        except ValidationError as e:
            if attempt < max_retries:
                prompt += f"\n\nYour JSON was valid but didn't match the schema: {e.message}. Please fix and respond with ONLY the corrected JSON."
                continue
            raise

Pattern: Type-Safe Output with Pydantic

from pydantic import BaseModel, Field, field_validator

class AnalysisResult(BaseModel):
    summary: str = Field(min_length=10, max_length=500)
    risk_level: str = Field(pattern=r'^(low|medium|high|critical)$')
    confidence: float = Field(ge=0.0, le=1.0)
    findings: list[str] = Field(min_length=1)
    
    @field_validator('findings')
    @classmethod
    def findings_not_empty(cls, v):
        return [f for f in v if f.strip()]

def parse_analysis(raw: str) -> AnalysisResult:
    data = extract_json(raw)
    return AnalysisResult(**data)  # Raises ValidationError if invalid

Format-Specific Techniques

Getting Reliable Tables

For markdown tables, structure your prompt to specify columns explicitly:

Format your response as a markdown table with exactly these columns:
| Feature | React | Vue | Angular |
Include a header row and separator row. Every cell must have content (use "N/A" if not applicable).

Better approach — use structured JSON and render the table yourself:

system = "Compare frameworks. Return JSON array of objects with keys: feature, react, vue, angular"

# Then render in your application
data = get_structured_output(prompt, schema)
table = "| Feature | React | Vue | Angular |\n|---|---|---|---|\n"
for row in data:
    table += f"| {row['feature']} | {row['react']} | {row['vue']} | {row['angular']} |\n"
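One wrinkle worth handling in that render loop: a literal `|` or newline inside a cell value breaks the markdown table. A small escaping helper (hypothetical, stdlib-only) keeps rows intact:

```python
def md_cell(value) -> str:
    """Escape characters that would break a markdown table cell."""
    text = "N/A" if value is None else str(value)
    return text.replace("|", "\\|").replace("\n", " ")

# Hypothetical extracted row containing a pipe character
rows = [{"feature": "Reactivity | signals", "react": "Hooks", "vue": "refs", "angular": "Signals"}]

table = "| Feature | React | Vue | Angular |\n|---|---|---|---|\n"
for row in rows:
    table += f"| {md_cell(row['feature'])} | {md_cell(row['react'])} | {md_cell(row['vue'])} | {md_cell(row['angular'])} |\n"
print(table)
```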

Getting Reliable Code

For code output, the model tends to add explanations. To get pure code:

Write a Python function that [task].
Return ONLY the code. No explanation, no markdown, no comments explaining what the code does.
Start directly with "def" or "import".

Or use response pre-filling:

messages = [
    {"role": "user", "content": "Write a Python function to validate email addresses"},
    {"role": "assistant", "content": "```python\n"}
]
# Model continues from the code fence opening
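Because the pre-fill opens a code fence, the model will usually close it at the end, so strip both fences before using the code. A minimal sketch, with a hypothetical completion standing in for the API response:

```python
import re

prefill = "```python\n"
# Hypothetical text the model returns after the pre-filled fence
completion = "def is_valid_email(s):\n    return '@' in s\n```"

raw = prefill + completion
# Pull out the body between the opening and closing fences
match = re.search(r"```(?:python)?\n(.*?)```", raw, flags=re.DOTALL)
code = match.group(1).strip() if match else raw.strip()
print(code)
```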

Getting Reliable Enums

When you need the model to choose from a fixed set of options:

❌ BAD: "Classify the sentiment"
# Model might return: "Positive", "positive", "POSITIVE", "mostly positive", "pos"

✅ GOOD: "Classify the sentiment. Return exactly one of: positive, negative, neutral"
# Or use tool_choice with enum constraint (guaranteed)

The Structured Output Decision Framework

| Requirement | Best Approach | Reliability |
|------------|---------------|-------------|
| Quick prototype | Prompt instructions + JSON.parse | 85-90% |
| Internal tool | Few-shot + response pre-fill | 90-95% |
| Customer-facing app | API structured outputs (OpenAI) or tool use (Anthropic) | 99%+ |
| High-volume pipeline | API constraints + Pydantic validation + retry | 99.9%+ |
| Safety-critical | API constraints + validation + human review | 100% |

flowchart TB
    Q1{"How critical is<br/>format reliability?"}
    Q1 -->|"Prototype / internal"| A1["Prompt + few-shot<br/>+ basic parsing"]
    Q1 -->|"Production app"| Q2{"Which API?"}
    Q1 -->|"Safety-critical"| A3["API constraints<br/>+ validation<br/>+ human review"]
    Q2 -->|"OpenAI"| A2A["response_format<br/>with Pydantic model"]
    Q2 -->|"Anthropic"| A2B["tool_use with<br/>input_schema"]
    Q2 -->|"Open-source"| A2C["Outlines / jsonformer<br/>constrained generation"]
    style Q1 fill:#6C63FF,stroke:#8B83FF,color:#fff
    style Q2 fill:#6C63FF,stroke:#8B83FF,color:#fff
    style A1 fill:#f39c12,stroke:#e67e22,color:#fff
    style A2A fill:#2ecc71,stroke:#27ae60,color:#fff
    style A2B fill:#2ecc71,stroke:#27ae60,color:#fff
    style A2C fill:#2ecc71,stroke:#27ae60,color:#fff
    style A3 fill:#e74c3c,stroke:#c0392b,color:#fff

Open-Source: Constrained Generation

For self-hosted models, libraries like Outlines and jsonformer modify the token sampling process to guarantee valid output:

import outlines

model = outlines.models.transformers("meta-llama/Meta-Llama-3-8B-Instruct")

schema = '''{
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
    },
    "required": ["name", "age", "sentiment"]
}'''

generator = outlines.generate.json(model, schema)
result = generator("Extract info: John Smith, 35, loves the product")
# result is guaranteed to match the schema

This works by masking invalid tokens at each generation step. If the model is in the middle of generating {"age":, only digit tokens are allowed next. This is the gold standard for format reliability with open-source models.
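The masking idea can be caricatured in a few lines. This is a toy, nothing like the real finite-state machinery in Outlines, but it shows the principle: given the partial output, only some next characters are legal.

```python
import string

def allowed_next_chars(partial: str) -> set[str]:
    """Toy illustration of constrained decoding: after '"age":' only
    digits (or whitespace) may follow, so all other tokens are masked."""
    if partial.rstrip().endswith('"age":'):
        return set(string.digits) | {" "}
    return set(string.printable)  # unconstrained elsewhere (toy)

print("7" in allowed_next_chars('{"age":'))  # True
print("x" in allowed_next_chars('{"age":'))  # False
```

A real constrained-generation library compiles the full JSON schema into a state machine and applies this mask to the model's logits at every decoding step.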

Production Patterns

Pattern: Schema Registry

Centralize your output schemas for consistency and reuse:

from enum import Enum
from pydantic import BaseModel

class OutputSchemas:
    """Central registry of all structured output schemas."""
    
    class Sentiment(BaseModel):
        text: str
        label: str  # positive, negative, neutral
        confidence: float
        reasoning: str
    
    class CodeReview(BaseModel):
        file: str
        line: int
        severity: str  # low, medium, high, critical
        category: str
        description: str
        suggestion: str
    
    class EntityExtraction(BaseModel):
        entities: list[dict]
        relationships: list[dict]
        confidence: float

# Usage
schema = OutputSchemas.Sentiment
response = get_structured_output(prompt, schema.model_json_schema())
result = schema(**response)

Pattern: Graceful Degradation

When structured output fails, don't crash — degrade gracefully:

def analyze_with_fallback(text: str) -> dict:
    """Try structured output, fall back to unstructured."""
    try:
        # Try API-level structured output
        return get_structured_output_api(text)
    except Exception:
        pass
    
    try:
        # Try prompt-level with parsing
        raw = call_llm(f"Analyze this text and return JSON: {text}")
        return extract_json(raw)
    except Exception:
        pass
    
    # Final fallback: return raw text in a wrapper
    raw = call_llm(f"Analyze this text: {text}")
    return {
        "raw_analysis": raw,
        "structured": False,
        "error": "Could not produce structured output"
    }

Pattern: Output Monitoring

Track format compliance in production:

import time
from collections import defaultdict

class OutputMonitor:
    def __init__(self):
        self.stats = defaultdict(lambda: {"total": 0, "valid": 0, "retries": 0})
    
    def record(self, schema_name: str, valid: bool, retries: int = 0):
        self.stats[schema_name]["total"] += 1
        if valid:
            self.stats[schema_name]["valid"] += 1
        self.stats[schema_name]["retries"] += retries
    
    def report(self) -> dict:
        return {
            name: {
                "compliance_rate": s["valid"] / s["total"] if s["total"] > 0 else 0,
                "avg_retries": s["retries"] / s["total"] if s["total"] > 0 else 0,
                "total_calls": s["total"]
            }
            for name, s in self.stats.items()
        }

Common Mistakes

Mistake 1: Pretty-Printing in Production

# DON'T request pretty-printed JSON for machine consumption
"Return well-formatted, indented JSON"  # Wastes tokens

# DO use compact JSON
"Return compact JSON on a single line"  # Saves tokens, fewer parsing issues

Pretty-printed JSON uses 2-3x more tokens than compact JSON. At scale, this is significant cost.
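The gap is easy to measure. Character counts are only a rough proxy for tokens, but the ratio is representative:

```python
import json

# Sample payload with modest nesting
data = {"items": [{"name": f"item-{i}", "score": i / 10} for i in range(20)]}

compact = json.dumps(data, separators=(",", ":"))
pretty = json.dumps(data, indent=2)

print(len(compact), len(pretty))
print(round(len(pretty) / len(compact), 2))  # pretty is substantially larger
```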

Mistake 2: Not Handling Partial Output

Models have token limits. Long structured outputs can get truncated mid-JSON:

def safe_parse(text: str) -> dict | None:
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        if "Unterminated string" in str(e) or "Expecting" in str(e):
            # Likely truncated — try to repair
            repaired = text.rstrip()
            # Close open strings, arrays, objects
            # (naive heuristic: ignores escaped quotes and mixed nesting order)
            open_braces = repaired.count('{') - repaired.count('}')
            open_brackets = repaired.count('[') - repaired.count(']')
            repaired += '"' if repaired.count('"') % 2 != 0 else ''
            repaired += ']' * open_brackets + '}' * open_braces
            try:
                return json.loads(repaired)
            except json.JSONDecodeError:
                return None
        return None

Mistake 3: Over-Complex Schemas

❌ BAD: Deeply nested JSON with 15+ fields
# Model accuracy drops significantly with complex schemas

✅ BETTER: Flat or shallow schemas with <10 fields
# Break complex extractions into multiple calls if needed

Conclusion

Structured output is the bridge between AI analysis and software systems. The key principles:

1. Use API-level constraints when available — OpenAI's structured outputs and Anthropic's tool use provide near-100% format reliability

2. Always validate — Even with API constraints, validate with Pydantic or JSON Schema before processing

3. Build retry logic — Format failures will happen; handle them gracefully

4. Keep schemas simple — Flat, focused schemas produce more reliable output than complex nested ones

5. Monitor compliance — Track format success rates in production to catch regressions

In the next post, we'll explore Advanced Prompt Patterns — Tree-of-Thought, ReAct, Self-Consistency, and meta-prompting techniques that push the boundaries of what's possible with LLMs.

*This is Part 4 of the Prompt Engineering Deep-Dive series. Previous: [Few-Shot Prompting](#). Next: [Advanced Patterns — Tree-of-Thought, ReAct, and Beyond](#).*


Enjoyed this post? Follow AmtocSoft for AI tutorials from beginner to professional.

Buy Me a Coffee | 🔔 YouTube | 💼 LinkedIn | 🐦 X/Twitter
