Database Query Optimization: EXPLAIN ANALYZE, Indexes, and the Queries That Kill Production

A database query that runs in 3ms with 1,000 rows will often take 4,000ms with 1,000,000 rows. The query didn't change. The data changed. Most production database incidents aren't caused by code bugs — they're caused by queries that worked fine in staging (with 5,000 rows) and collapsed in production (with 50,000,000 rows).
This guide covers the tools and patterns for diagnosing and fixing slow queries in PostgreSQL and MySQL: reading query plans, choosing the right indexes, eliminating N+1 queries, and avoiding the schema patterns that guarantee future pain.
The Problem: Queries That Look Fine But Aren't
Consider this query:
SELECT u.name, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name, u.email
ORDER BY order_count DESC
LIMIT 20;
In development with 500 users and 2,000 orders: 4ms. In production with 2M users and 80M orders: 45 seconds. A sequential scan of 80M rows, a hash join, and a sort of 2M groups — all because there's no index on orders.user_id or users.created_at.
The tools to find and fix this exist in every database. Most developers don't know how to read them.
How It Works: Reading EXPLAIN ANALYZE
EXPLAIN ANALYZE is the most important debugging tool for slow queries. It shows the query plan the database chose *and* the actual execution statistics.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name, u.email
ORDER BY order_count DESC
LIMIT 20;
Sample output (annotated):
Limit  (cost=248542.33..248542.38 rows=20 width=48) (actual time=44812.221..44812.226 rows=20 loops=1)
  ->  Sort  (cost=248542.33..253447.89 rows=1962224 width=48) (actual time=44812.219..44812.221 rows=20 loops=1)
        Sort Key: (count(o.id)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  HashAggregate  (cost=185899.07..205521.31 rows=1962224 width=48) (actual time=38211.442..43187.553 rows=1962224 loops=1)
              Group Key: u.id
              Batches: 5  Memory Usage: 4145kB  Disk Usage: 36608kB   ← SPILLING TO DISK
              ->  Hash Left Join  (cost=44729.61..163355.95 rows=4508624 width=24) (actual time=1203.442..28847.221 rows=80124532 loops=1)
                    Hash Cond: (o.user_id = u.id)
                    ->  Seq Scan on orders o  (cost=0.00..82441.32 rows=4508624 width=8) (actual time=0.021..8442.112 rows=80124532 loops=1)
                          ↑ SEQUENTIAL SCAN — 80M rows
                    ->  Hash  (cost=29972.87..29972.87 rows=1182939 width=24) (actual time=1190.221..1190.221 rows=1962224 loops=1)
                          Buckets: 2097152  Batches: 2  Memory Usage: 50421kB
                          ->  Seq Scan on users u  (cost=0.00..29972.87 rows=1182939 width=24) (actual time=0.012..892.442 rows=1962224 loops=1)
                                Filter: (created_at > '2025-01-01 00:00:00'::timestamp)
                                Rows Removed by Filter: 37776   ← Only removes 37K rows from 2M scan
Planning Time: 2.442 ms
Execution Time: 44814.221 ms   ← 44 seconds
The key numbers to read:
| What to look for | Where it is | What it means |
|-----------------|-------------|---------------|
| Seq Scan on large table | Node type | Missing index — table scanned row by row |
| High actual time vs cost | Each node | Planner estimate was wrong |
| Disk Usage in aggregation | HashAggregate | Query spilling to disk — work_mem too low |
| Rows Removed by Filter | Seq Scan | Filter applied after reading all rows |
| loops=N where N > 1 | Nested Loop | Inner relation scanned N times |
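A subtlety worth automating: the `actual time` figures in each node are *averages per loop*, so a node's real cost is its upper time bound multiplied by `loops`. A small regex sketch can pull these numbers out of a saved plan line (the helper names `NODE_RE` and `node_stats` are hypothetical, not a PostgreSQL API):

```python
import re

# Matches the "(actual time=a..b rows=N loops=L)" suffix on a plan node.
NODE_RE = re.compile(
    r"actual time=(?P<start>[\d.]+)\.\.(?P<end>[\d.]+) "
    r"rows=(?P<rows>\d+) loops=(?P<loops>\d+)"
)

def node_stats(plan_line):
    """Return total rows and approximate total time for one plan node."""
    m = NODE_RE.search(plan_line)
    if m is None:
        return None
    end = float(m.group("end"))
    loops = int(m.group("loops"))
    return {
        "rows": int(m.group("rows")) * loops,  # rows summed across all loops
        "total_ms": end * loops,               # per-loop time x loop count
    }

line = ("->  Index Scan using idx_orders_user_id on orders "
        "(cost=0.43..8.45 rows=10 width=8) "
        "(actual time=0.011..0.013 rows=9 loops=2000)")
stats = node_stats(line)
# 9 rows x 2000 loops = 18000 rows; 0.013 ms x 2000 loops ≈ 26 ms total
```

This is why a Nested Loop node that looks cheap (0.013 ms) can dominate a query: multiply by loops first, then compare nodes.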
flowchart TD
A[EXPLAIN ANALYZE output] --> B{Seq Scan?}
B -- Yes on large table --> C[Add index on scan columns]
B -- No --> D{Nested Loop with loops > 100?}
D -- Yes --> E[N+1 query or missing join index]
D -- No --> F{Disk Usage in aggregation?}
F -- Yes --> G[Increase work_mem or optimize grouping]
F -- No --> H{High actual vs estimated rows?}
H -- Yes --> I[Run ANALYZE to update statistics]
H -- No --> J[Query is probably fine]
style C fill:#ef4444,color:#fff
style E fill:#ef4444,color:#fff
style G fill:#f59e0b,color:#fff
style I fill:#f59e0b,color:#fff
style J fill:#22c55e,color:#fff
Implementation: Fixing the Common Culprits
Fix 1: Add the Right Indexes
The query above needs two indexes:
-- Index for the WHERE clause on users
CREATE INDEX idx_users_created_at ON users(created_at);

-- Index for the JOIN condition on orders
CREATE INDEX idx_orders_user_id ON orders(user_id);
After adding these, EXPLAIN ANALYZE shows:
Execution Time: 312 ms ← Down from 44,814ms
Choosing index types:
-- B-tree (default): range queries, equality, ORDER BY
CREATE INDEX idx_orders_created_at ON orders(created_at); -- WHERE created_at > '...'
-- Hash: equality only (faster for = than B-tree, no range support)
-- Crash-safe only since PostgreSQL 10; rarely needed in practice
CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);
-- Partial index: index only rows that match a condition
-- Smaller index, faster for filtered queries
CREATE INDEX idx_orders_pending ON orders(user_id)
WHERE status = 'pending'; -- Only indexes pending orders
-- Composite index: column order matters
-- This index helps: WHERE user_id = ? AND created_at > ?
-- Also helps: WHERE user_id = ? (leading column)
-- Does NOT help: WHERE created_at > ? (non-leading column)
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
-- Covering index (PostgreSQL 11+): INCLUDE non-key columns so the query
-- can be answered from the index alone, avoiding heap access entirely
CREATE INDEX idx_orders_covering ON orders(user_id, created_at)
INCLUDE (status, total_amount);
The most common mistake: creating an index on a low-cardinality column used in a WHERE clause. An index on status where status is one of three values ('pending', 'active', 'closed') helps little — when roughly a third of the table matches, the planner will still choose a sequential scan, because reading the whole table in order is cheaper than driving tens of millions of random heap fetches through the index.
-- WRONG: low cardinality, index will be ignored
CREATE INDEX idx_users_is_active ON users(is_active); -- Only 2 values

-- RIGHT: high cardinality, or restrict with a partial index
CREATE INDEX idx_users_active ON users(id)
WHERE is_active = true; -- Now a small index covering only active users
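The planner's seq-scan-versus-index decision can be approximated with a toy cost model. The constants below are illustrative stand-ins, not PostgreSQL's real cost parameters, but the crossover behavior they produce is the same:

```python
# Toy cost model: a seq scan pays one sequential read per page; an index
# scan pays a random heap fetch per matching row. Low selectivity (many
# matches) makes the index path more expensive, not less.

SEQ_PAGE_COST = 1.0      # relative cost of one sequential page read
RANDOM_FETCH_COST = 4.0  # relative cost of one random heap fetch

def cheaper_plan(total_rows, rows_per_page, selectivity):
    seq_cost = (total_rows / rows_per_page) * SEQ_PAGE_COST
    index_cost = total_rows * selectivity * RANDOM_FETCH_COST
    return "index scan" if index_cost < seq_cost else "seq scan"

# status = 'pending' matching ~33% of 80M rows: the seq scan wins
print(cheaper_plan(80_000_000, 100, 0.33))       # seq scan
# a single user_id matching a handful of rows: the index wins easily
print(cheaper_plan(80_000_000, 100, 0.0000005))  # index scan
```

The real planner uses measured statistics and tunable costs (`seq_page_cost`, `random_page_cost`), but the shape of the trade-off is exactly this.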
Fix 2: The N+1 Query Destroyer
N+1 is the single most common database performance problem in ORMs. The pattern:
# ORM code that looks innocent
users = User.objects.filter(created_at__gt='2025-01-01')
for user in users:
    # THIS IS N+1 — one query per user
    print(user.orders.count())
What happens: 1 query to get N users, then N queries to get each user's order count. With 10,000 users: 10,001 database round trips. At 2ms per round trip: 20 seconds.
-- What the ORM is running (10,001 queries):
SELECT * FROM users WHERE created_at > '2025-01-01';
SELECT COUNT(*) FROM orders WHERE user_id = 1;
SELECT COUNT(*) FROM orders WHERE user_id = 2;
-- ... × 9,999 more
The fix: annotate or JOIN to get everything in one query:
# Django: annotate with COUNT in a single query
from django.db.models import Count
users = (
    User.objects
    .filter(created_at__gt='2025-01-01')
    .annotate(order_count=Count('orders'))
    .order_by('-order_count')[:20]
)
# 1 query total: the JOIN + COUNT SQL from earlier
# SQLAlchemy: same idea
from sqlalchemy import desc, func
result = (
    session.query(User, func.count(Order.id).label('order_count'))
    .outerjoin(Order, Order.user_id == User.id)
    .filter(User.created_at > '2025-01-01')
    .group_by(User.id)
    .order_by(desc('order_count'))
    .limit(20)
    .all()
)
Detecting N+1 in production:
# Django Debug Toolbar shows query count per request
# For production monitoring, track queries-per-request in your APM
# Or log slow query patterns manually:
import django.db.backends.utils as db_utils
import logging
original = db_utils.CursorWrapper.execute
def patched_execute(self, sql, params=None):
    logging.debug(f"QUERY: {sql[:200]}")
    return original(self, sql, params)

db_utils.CursorWrapper.execute = patched_execute
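A framework-agnostic variant of the same idea: normalize captured statements so that queries differing only in literal values collapse to one shape, then flag shapes repeated many times in a single request. The helper names here are hypothetical:

```python
import re
from collections import Counter

def normalize(sql):
    """Collapse literals so queries differing only in values share a shape."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return sql.strip()

def n_plus_one_suspects(queries, threshold=10):
    """Return query shapes repeated more than `threshold` times."""
    counts = Counter(normalize(q) for q in queries)
    return {shape: n for shape, n in counts.items() if n > threshold}

# Simulated request log: one list query plus 500 per-row lookups
queries = ["SELECT COUNT(*) FROM orders WHERE user_id = %d" % i
           for i in range(500)]
queries.append("SELECT * FROM users WHERE created_at > '2025-01-01'")
suspects = n_plus_one_suspects(queries)
# one suspect shape, repeated 500 times
```

Running this against per-request query logs in CI is a cheap way to catch N+1 regressions before they reach production traffic.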
Fix 3: JSONB Performance (PostgreSQL)
Storing semi-structured data in JSONB columns is convenient but has performance traps:
-- SLOW: no index on JSONB field, full table scan
SELECT * FROM events WHERE metadata->>'user_id' = '12345';
-- FIX 1: GIN index on entire JSONB column (handles any key lookup)
CREATE INDEX idx_events_metadata_gin ON events USING GIN (metadata);
-- FIX 2: Expression index on specific key (smaller, faster for single key)
CREATE INDEX idx_events_user_id ON events((metadata->>'user_id'));
-- FIX 3: If you always query the same keys, use a generated column
ALTER TABLE events
ADD COLUMN user_id_extracted TEXT
GENERATED ALWAYS AS (metadata->>'user_id') STORED;
CREATE INDEX idx_events_user_id_col ON events(user_id_extracted);
-- Now it's just a regular column with a regular index
Fix 4: The Pagination Trap
OFFSET-based pagination is fine with small datasets. With large ones, it's a silent killer:
-- SLOW: scans and discards 999,980 rows to return rows 999,981-1,000,000
SELECT * FROM events ORDER BY created_at DESC LIMIT 20 OFFSET 999980;
-- FIX: cursor-based pagination using the last seen ID
-- First page:
SELECT * FROM events ORDER BY created_at DESC, id DESC LIMIT 20;
-- Subsequent pages (pass last_created_at and last_id from previous response):
SELECT * FROM events
WHERE (created_at, id) < ('2025-06-15 10:23:44', 98765)
ORDER BY created_at DESC, id DESC
LIMIT 20;
The cursor approach scans exactly 20 rows regardless of which page you're on. The OFFSET approach scans N+20 rows every time.
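In application code, the cursor query above can be built with a small helper. This is a hypothetical sketch for a psycopg-style driver (`%s` placeholders); the function name and `events` table come from the example above:

```python
# Keyset pagination: the (created_at, id) tuple comparison matches the
# composite ORDER BY, so ties on created_at are broken deterministically
# by id and no row is skipped or duplicated between pages.

def keyset_page_query(last_created_at=None, last_id=None, page_size=20):
    if last_created_at is None:  # first page: no cursor yet
        return ("SELECT * FROM events "
                "ORDER BY created_at DESC, id DESC LIMIT %s",
                (page_size,))
    return ("SELECT * FROM events "
            "WHERE (created_at, id) < (%s, %s) "
            "ORDER BY created_at DESC, id DESC LIMIT %s",
            (last_created_at, last_id, page_size))

sql, params = keyset_page_query('2025-06-15 10:23:44', 98765)
# params: ('2025-06-15 10:23:44', 98765, 20)
```

The caller stores the `created_at` and `id` of the last row it rendered and passes them back for the next page; no page number or offset ever appears.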
Query Patterns to Avoid
flowchart LR
A[Dangerous Patterns]
A --> B["SELECT * (fetches all columns)"]
A --> C["WHERE LOWER(email) = ... (function prevents index use)"]
A --> D["WHERE id::text = '123' (cast prevents index use)"]
A --> E["LIKE '%search%' (leading wildcard = seq scan)"]
A --> F["OR conditions on different columns (index skipped)"]
B --> B2["SELECT only needed columns"]
C --> C2["Use citext column type or expression index"]
D --> D2["Match column type in WHERE clause"]
E --> E2["Full-text search (tsvector/tsquery)"]
F --> F2["Separate queries with UNION ALL"]
style A fill:#ef4444,color:#fff
-- FUNCTION IN WHERE: prevents index use
-- SLOW:
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
-- FIX: use citext extension for case-insensitive storage
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ALTER COLUMN email TYPE citext;
-- Now: WHERE email = 'USER@EXAMPLE.COM' uses the index
-- OR: use an expression index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'user@example.com'; -- Now uses index
-- LEADING WILDCARD: full table scan
-- SLOW:
SELECT * FROM products WHERE name LIKE '%widget%';
-- FIX: full-text search with GIN index
-- coalesce() so a NULL description doesn't NULL out the whole vector
ALTER TABLE products ADD COLUMN search_vec tsvector
GENERATED ALWAYS AS (to_tsvector('english', name || ' ' || coalesce(description, ''))) STORED;
CREATE INDEX idx_products_search ON products USING GIN(search_vec);
SELECT * FROM products WHERE search_vec @@ plainto_tsquery('english', 'widget');
VACUUM, Bloat, and Table Maintenance
PostgreSQL doesn't immediately reclaim storage when rows are updated or deleted. Old row versions are kept for MVCC (Multi-Version Concurrency Control) — concurrent readers might still need them. VACUUM reclaims this dead space. When it's not running aggressively enough, tables bloat, queries slow down, and eventually the transaction ID wraparound problem causes the database to refuse writes entirely.
-- Check table bloat
SELECT
    schemaname,
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(
        pg_total_relation_size(relid) - pg_relation_size(relid)
    ) AS index_and_toast_size,
    n_dead_tup,
    n_live_tup,
    round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
High dead_pct means bloat. The fix: adjust autovacuum settings for high-traffic tables.
-- Per-table autovacuum tuning (override global settings for busy tables)
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- Vacuum when 1% of rows are dead (default: 20%)
autovacuum_analyze_scale_factor = 0.005, -- Analyze when 0.5% of rows change
autovacuum_vacuum_cost_delay = 2, -- Less I/O throttling for this table
autovacuum_vacuum_threshold = 50 -- Minimum dead rows before vacuum runs
);
-- For tables with extremely high write volume, trigger manual vacuum:
VACUUM ANALYZE orders; -- Reclaims dead rows, updates statistics
-- VACUUM FULL reclaims more space but locks the table (avoid on production unless necessary)
-- Use pg_repack extension instead for zero-downtime table compaction
Connection Pooling and Query Performance
Database connections are expensive: each one holds memory on the server, maintains state, and requires a network handshake to establish. Without connection pooling, a spike to 500 concurrent web workers means 500 simultaneous database connections. PostgreSQL can handle ~100-200 connections efficiently; beyond that, performance degrades sharply.
PgBouncer is the standard connection pooler for PostgreSQL:
# pgbouncer.ini
[databases]
myapp = host=localhost port=5432 dbname=myapp

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Transaction mode: connection returned to pool after each transaction
# Best for stateless web apps; doesn't support session-level SET, advisory locks
pool_mode = transaction

# Max simultaneous app connections
max_client_conn = 2000
# Actual connections to PostgreSQL
default_pool_size = 25
# Hard limit per database
max_db_connections = 100
With PgBouncer in transaction mode: 2,000 application workers share 25 actual PostgreSQL connections. The database sees a steady 25 connections regardless of web traffic spikes.
The query performance implication: connection overhead is removed from the hot path. Queries that were spending 2-3ms establishing connections now start immediately. At high throughput, this compounds: 3ms × 1,000 req/s = 3 seconds of connection overhead per second, eliminated.
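That compounding can be checked with back-of-envelope arithmetic. The numbers below are the illustrative figures from the paragraph above, not a benchmark:

```python
# Toy model: per-request latency with and without a connection pool.
# A pooled connection is already established, so the handshake cost
# drops out of the hot path entirely.

def request_latency_ms(query_ms, connect_ms, pooled):
    return query_ms + (0 if pooled else connect_ms)

unpooled = request_latency_ms(query_ms=5, connect_ms=3, pooled=False)  # 8 ms
pooled = request_latency_ms(query_ms=5, connect_ms=3, pooled=True)     # 5 ms

# At 1,000 req/s, pooling removes 3 ms x 1,000 = 3,000 ms of overhead/second
saved_ms_per_second = (unpooled - pooled) * 1000
```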
Production Monitoring: Finding Slow Queries
Don't wait for incidents. Enable pg_stat_statements to track query performance continuously:
-- Enable in postgresql.conf:
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.max = 10000
-- pg_stat_statements.track = all
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find the 10 slowest queries by total time
SELECT
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS avg_ms,
calls,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
left(query, 120) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_exec_time DESC
LIMIT 10;
-- Find queries with high variance (occasional spikes)
SELECT
left(query, 120) AS query,
round(mean_exec_time::numeric, 2) AS avg_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
round(max_exec_time::numeric, 2) AS max_ms
FROM pg_stat_statements
WHERE calls > 50 AND stddev_exec_time > mean_exec_time
ORDER BY stddev_exec_time DESC
LIMIT 10;
Set up automated alerting when mean_exec_time for high-volume queries exceeds your SLO. A 50ms query that runs 10,000 times per minute contributes 500 seconds of total database time per minute — before any other queries.
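One way to automate that check is a small script over pg_stat_statements output. The dict row format, field names, and budget here are assumptions for illustration, not a standard API — in practice the rows would come from the query above via your driver:

```python
# Flag queries whose total database time exceeds a per-minute budget.
# mean_exec_time is in milliseconds, matching pg_stat_statements.

def over_budget(stats_rows, budget_ms_per_minute=60_000):
    flagged = []
    for row in stats_rows:
        total_ms = row["mean_exec_time"] * row["calls_per_minute"]
        if total_ms > budget_ms_per_minute:
            flagged.append((row["query"], total_ms))
    return flagged

rows = [
    {"query": "SELECT ... FROM orders ...", "mean_exec_time": 50,
     "calls_per_minute": 10_000},   # 500,000 ms = 500 s of DB time/min
    {"query": "SELECT ... FROM users ...", "mean_exec_time": 2,
     "calls_per_minute": 1_000},    # 2,000 ms/min: well under budget
]
flagged = over_budget(rows)
```

Feed the result into whatever alerting channel you already use; the point is that the 50ms query, not the occasional 2-second one, is usually the real budget consumer.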
Production Considerations
Index Maintenance
Indexes aren't free. Every write pays an index update cost. Over-indexing a write-heavy table hurts throughput. Audit unused indexes regularly:
-- Find indexes that are never used
-- (Careful: unique/PK indexes can show idx_scan = 0 yet still enforce constraints)
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE 'pg_%'
ORDER BY schemaname, relname;
-- Also check index bloat (indexes grow from updates/deletes)
SELECT
    relname AS tablename,
    indexrelname AS indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
Creating Indexes Without Downtime
CREATE INDEX in PostgreSQL locks the table for writes. For production tables, always use CONCURRENTLY:
-- This blocks writes until the index is built (avoid on production)
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- This builds without blocking writes (takes 2-3× longer but safe for production)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
CREATE INDEX CONCURRENTLY can fail mid-build. If it does, it leaves an invalid index that must be dropped:
-- Check for invalid indexes after CONCURRENT builds
SELECT indexname FROM pg_indexes
WHERE tablename = 'orders'
AND indexname IN (
SELECT indexrelid::regclass::text FROM pg_index WHERE NOT indisvalid
);
-- Drop and recreate if invalid
DROP INDEX CONCURRENTLY idx_orders_user_id;
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
Query Planner Statistics and the ANALYZE Command
PostgreSQL's query planner makes decisions based on statistics about your data: how many distinct values a column has, how values are distributed, table row counts. These statistics go stale when large amounts of data are inserted or deleted. Stale statistics cause the planner to make wrong decisions — like choosing a sequential scan when an index would be faster.
-- View current statistics for a table
SELECT
attname AS column,
n_distinct,  -- Distinct values: >0 = count, <0 = negated fraction of rows (-1 = all distinct)
correlation, -- Physical ordering correlation (1.0 = perfectly ordered, 0 = random)
null_frac, -- Fraction of NULL values
avg_width -- Average column value width in bytes
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY attname;
-- Manually update statistics (fast, non-blocking)
ANALYZE orders;
-- Increase the statistics target for columns with complex distributions
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500; -- Default target is 100
ANALYZE orders (status);
When a query plan suddenly degrades after a large data load, run ANALYZE tablename first. It takes seconds and often fixes the plan without any other changes.
The `SET LOCAL` Pattern for Query Tuning
PostgreSQL allows overriding planner settings per-query for debugging. This is useful for testing whether a different strategy would be faster:
BEGIN;
-- Force sequential scan (disable index usage) to compare
SET LOCAL enable_indexscan = OFF;
SET LOCAL enable_bitmapscan = OFF;
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
ROLLBACK; -- Revert settings

-- Or force a specific join strategy
BEGIN;
SET LOCAL enable_hashjoin = OFF;
SET LOCAL enable_mergejoin = OFF;
-- Forces nested loop join — useful when the planner chooses hash join incorrectly
EXPLAIN ANALYZE
SELECT u.*, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 123;
ROLLBACK;
These are diagnostic tools, not production settings. If a plan only works with enable_indexscan = OFF, the fix is updating statistics or hints via pg_hint_plan, not disabling index scans globally.
Conclusion
Query optimization is a skill built on a small number of core techniques used repeatedly:
1. Read EXPLAIN ANALYZE before making any changes — optimize what's actually slow, not what you think is slow
2. Index selectively — high-cardinality columns used in WHERE, JOIN, and ORDER BY; partial indexes for filtered queries
3. Kill N+1 at the ORM level — annotate, eager load, or write the JOIN yourself
4. Avoid index-defeating patterns — functions in WHERE, leading wildcards, low-cardinality indexes
5. Monitor continuously — pg_stat_statements catches regressions before they become incidents
The 44-second query at the start of this guide becomes 312ms with two indexes. That's a 143× improvement. Most database performance problems aren't hard to fix — they're hard to find. The tools above make them findable.
One more habit that separates strong backend engineers from average ones: run EXPLAIN ANALYZE on your application's 10 highest-frequency queries before going to production, not after the first incident. Query plans change as data grows — a plan that looks fine at launch can degrade badly at 100× the initial data. Capturing the baseline plan lets you detect regressions before users do.
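A naive sketch of that regression check: save each query's EXPLAIN output at launch, then diff the node types later. This is a deliberately crude heuristic (substring matching on plan text), not a full plan comparator:

```python
# Flag plan regressions: node types that weren't in the baseline plan
# but appear now, e.g. a Seq Scan replacing an Index Scan as data grows.

RISKY_NODES = ("Seq Scan", "Nested Loop")

def plan_regressions(baseline_plan: str, current_plan: str):
    return [node for node in RISKY_NODES
            if node in current_plan and node not in baseline_plan]

baseline = "Index Scan using idx_orders_user_id on orders ..."
current = "Seq Scan on orders  (cost=0.00..82441.32 ...)"
print(plan_regressions(baseline, current))  # ['Seq Scan']
```

Run it from a nightly job that re-EXPLAINs the saved baseline queries; a non-empty result is worth a human look long before it becomes an incident.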