PostgreSQL Advanced: Partitioning, Replication, WAL, and High-Availability Patterns


Introduction

PostgreSQL handles a billion-row table differently than it handles a million-row table. An indexing strategy that holds up at 10,000 writes per second falls over at 100,000. The simple primary/replica setup that's fine for read-heavy workloads becomes a liability when you need fast, automated failover after a primary crash. The VACUUM settings that keep an idle development database healthy leave a high-write production database bloated.

This post covers PostgreSQL at the scale where the defaults stop working: declarative table partitioning for billion-row tables, streaming replication architecture and failover with Patroni, Write-Ahead Log (WAL) configuration for performance and durability, and the VACUUM and autovacuum tuning that prevents table bloat from killing query performance. These are the PostgreSQL topics that most engineers learn reactively — after their first production incident.

Declarative Table Partitioning

Declarative partitioning (PostgreSQL 10+) splits a logical table into physical child tables called partitions. The query planner routes queries to only the relevant partitions — a process called partition pruning — dramatically reducing I/O for time-series and tenant-scoped data.

Range Partitioning (Time-Series Data)

-- Create parent table (no data stored here)
CREATE TABLE events (
    id          BIGSERIAL,
    tenant_id   UUID NOT NULL,
    event_type  TEXT NOT NULL,
    payload     JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);

-- Create partitions by month
CREATE TABLE events_2026_01 
    PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

CREATE TABLE events_2026_02 
    PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

CREATE TABLE events_2026_03 
    PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Indexes on partitions (each partition gets its own index)
CREATE INDEX idx_events_2026_01_tenant ON events_2026_01 (tenant_id, created_at DESC);
CREATE INDEX idx_events_2026_02_tenant ON events_2026_02 (tenant_id, created_at DESC);

-- Insert routes to the correct partition automatically
INSERT INTO events (tenant_id, event_type, payload, created_at)
VALUES ('abc-123', 'page_view', '{"url": "/home"}', '2026-01-15');

-- Query uses partition pruning — only scans events_2026_01
EXPLAIN SELECT * FROM events 
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'
AND tenant_id = 'abc-123';
-- "Append ... Seq Scan on events_2026_01 ..."  -- only January partition scanned

Old partitions can be detached (without data loss) and dropped when data exceeds retention:

-- Detach old partition: fast, no data movement
ALTER TABLE events DETACH PARTITION events_2026_01;
-- events_2026_01 is now a standalone table

-- Drop when ready
DROP TABLE events_2026_01;
-- O(1) operation — deletes files, no row-by-row DELETE

-- Automate with the pg_partman extension: pre-creates partitions and,
-- with retention configured, drops old ones via run_maintenance()
SELECT partman.create_parent(
    p_parent_table => 'public.events',
    p_control => 'created_at',
    p_type => 'range',
    p_interval => '1 month'        -- pg_partman 5.x takes a real interval
);

-- Retention is configured in partman.part_config
UPDATE partman.part_config
SET retention = '6 months',        -- auto-drop partitions older than 6 months
    retention_keep_table = false
WHERE parent_table = 'public.events';
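The monthly DDL above follows a mechanical pattern, which is exactly why tools like pg_partman exist. As an illustration (a hypothetical helper, not part of pg_partman), the same partition boundaries can be generated like this:

```python
from datetime import date

def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """Generate the monthly range-partition DDL shown above for one month."""
    start = date(year, month, 1)
    # Exclusive upper bound: the first day of the following month
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{year:04d}_{month:02d}"
    return (
        f"CREATE TABLE {name}\n"
        f"    PARTITION OF {parent}\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(monthly_partition_ddl("events", 2026, 1))
```

The half-open range (`FROM` inclusive, `TO` exclusive) is what lets adjacent monthly partitions tile without gaps or overlaps.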

List Partitioning (Multi-Tenant Data)

-- Partition by tenant for tenant-isolated queries
CREATE TABLE orders (
    id          BIGSERIAL,
    tenant_id   TEXT NOT NULL,
    user_id     UUID NOT NULL,
    total_cents INTEGER NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY LIST (tenant_id);

CREATE TABLE orders_tenant_acme    PARTITION OF orders FOR VALUES IN ('acme');
CREATE TABLE orders_tenant_globex  PARTITION OF orders FOR VALUES IN ('globex');
CREATE TABLE orders_default        PARTITION OF orders DEFAULT;  -- catch-all

-- All queries for tenant 'acme' scan only orders_tenant_acme
SELECT * FROM orders WHERE tenant_id = 'acme' AND created_at > NOW() - INTERVAL '30 days';

Partition pruning reduces query scope. For a 2-year events table with monthly partitions (24 partitions), a query for one month scans 1/24th of the data. Without partitioning, the query scans the full table.


pg_stat_statements: Finding the Queries That Matter

Before tuning partitions, replication, or VACUUM, identify the queries consuming the most time. pg_stat_statements tracks cumulative statistics for every normalized query pattern — the single most valuable diagnostic view in PostgreSQL.

-- Enable (requires server restart after adding to shared_preload_libraries)
-- postgresql.conf: shared_preload_libraries = 'pg_stat_statements'

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by total time
SELECT 
    left(query, 100) AS query_snippet,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2)  AS avg_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
    rows
FROM pg_stat_statements
WHERE query NOT ILIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 10;

-- Queries with high variability (stddev >> mean = sporadic slowness)
SELECT 
    left(query, 100) AS query_snippet,
    calls,
    round(mean_exec_time::numeric, 2) AS avg_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    round((stddev_exec_time / nullif(mean_exec_time, 0))::numeric, 2) AS cv  -- coefficient of variation
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time / nullif(mean_exec_time, 0) DESC
LIMIT 10;

-- Reset statistics (do periodically; old data clouds current analysis)
SELECT pg_stat_statements_reset();

The workflow: identify the top 10 queries by total time → EXPLAIN ANALYZE each → add missing indexes or rewrite inefficient queries → verify improvement. Repeat weekly. In most production databases, 20% of queries account for 80% of query time.
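That 80/20 shape is easy to check once you have the per-query totals from pg_stat_statements. A sketch (assuming you've already fetched `(query, total_ms)` pairs out of the view):

```python
def queries_for_time_share(stats, target_pct=80.0):
    """Return the smallest set of queries accounting for target_pct of
    total execution time, mirroring the pct_total column above."""
    total = sum(ms for _, ms in stats)
    top, acc = [], 0.0
    for query, ms in sorted(stats, key=lambda x: -x[1]):
        top.append(query)
        acc += ms
        if 100.0 * acc / total >= target_pct:
            break
    return top

# One query dominating total time: it alone crosses the 80% threshold
stats = [("q1", 900.0), ("q2", 50.0), ("q3", 30.0), ("q4", 20.0)]
print(queries_for_time_share(stats))  # ['q1']
```

If the returned set is small, a handful of EXPLAIN ANALYZE sessions covers most of your database's load.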

Write-Ahead Log: Performance and Durability Configuration

The Write-Ahead Log (WAL) is the foundation of PostgreSQL's durability. Every change is written to the WAL before being applied to data pages. On crash, PostgreSQL replays WAL from the last checkpoint — recovering to a consistent state.

WAL configuration controls the durability-performance trade-off:

-- postgresql.conf: WAL settings
-- Full sync to disk on every transaction commit (safest, slowest)
synchronous_commit = on          -- default; safest

-- Async WAL: allow OS to delay sync (10-100x faster writes, <1s data loss window)
synchronous_commit = off         -- lose last ~1s of committed transactions on crash
                                 -- NOT appropriate for financial data

-- Checkpoint settings: how often WAL is flushed to data pages
checkpoint_completion_target = 0.9   -- spread checkpoint over 90% of interval
max_wal_size = 4GB              -- allow more WAL before forcing checkpoint
min_wal_size = 1GB

-- WAL level: required for logical replication
wal_level = logical             -- enables logical decoding (for CDC, replication)
                                -- 'replica' for physical replication only

-- WAL compression of full-page writes: reduces WAL volume, saves disk/network
-- ('lz4' and 'zstd' require PostgreSQL 15+; 'on' uses pglz and works earlier)
wal_compression = lz4

WAL size monitoring:

-- Current WAL write location
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());

-- WAL generation rate: a rate needs two samples — note the LSN, wait a
-- known interval, then diff
SELECT pg_current_wal_lsn();   -- note this LSN, wait 60 seconds, then:
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '<LSN from 60s ago>') / 60
    AS wal_bytes_per_second;

-- Checkpoint frequency (should not checkpoint more than once every few minutes)
-- Note: PostgreSQL 17+ moved these counters to pg_stat_checkpointer
SELECT checkpoints_req, checkpoints_timed, 
       buffers_checkpoint, buffers_clean, buffers_backend,
       round(100.0 * checkpoints_req / nullif(checkpoints_req + checkpoints_timed, 0), 1) 
           AS pct_forced_checkpoints
FROM pg_stat_bgwriter;
-- High pct_forced_checkpoints: increase max_wal_size

A high rate of forced checkpoints (triggered by WAL reaching max_wal_size) indicates that max_wal_size is too small for the write rate. Forced checkpoints cause I/O spikes. Increase max_wal_size to allow PostgreSQL to spread checkpoint work over time.
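The alerting rule reduces to a ratio and a threshold. A sketch of the check (the 10% cutoff is a common rule of thumb, not a PostgreSQL default):

```python
def forced_checkpoint_pct(checkpoints_req: int, checkpoints_timed: int) -> float:
    """Fraction of checkpoints forced by WAL volume rather than the timer,
    mirroring the pg_stat_bgwriter query above."""
    total = checkpoints_req + checkpoints_timed
    return 0.0 if total == 0 else round(100.0 * checkpoints_req / total, 1)

def recommend(pct: float) -> str:
    # Rule of thumb: the vast majority of checkpoints should be timed
    return "increase max_wal_size" if pct > 10.0 else "ok"

pct = forced_checkpoint_pct(45, 55)
print(pct, recommend(pct))  # 45.0 increase max_wal_size
```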

Streaming Replication Architecture

PostgreSQL streaming replication sends WAL records from the primary to standbys in real time. Standbys apply WAL records and stay within seconds of the primary. Reads can be served from standbys (with hot_standby = on).

-- On primary: postgresql.conf
wal_level = replica
max_wal_senders = 10           -- max concurrent replication connections
wal_keep_size = 1GB            -- keep enough WAL for standbys to catch up after lag

-- pg_hba.conf: allow replication connections
-- TYPE  DATABASE    USER        ADDRESS         METHOD
host    replication replication 10.0.1.2/32     scram-sha-256

-- Create replication user
CREATE USER replication REPLICATION LOGIN PASSWORD 'strong-password';

-- On standby (PostgreSQL 12+): create an empty standby.signal file in the
-- data directory, then set these in postgresql.conf (pre-12: recovery.conf)
primary_conninfo = 'host=primary-host user=replication password=strong-password'
hot_standby = on               -- allow reads from standby
recovery_target_timeline = latest

Replication lag monitoring:

-- On primary: view connected standbys and their lag
SELECT client_addr, 
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag        -- wall clock time behind primary
FROM pg_stat_replication;

-- On standby: lag from the standby's perspective (returns an interval)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

Alert if replay_lag > 30s — standbys this far behind risk serving stale data and may not promote quickly in a failover.
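The byte-lag arithmetic in that query is worth understanding: an LSN's text form is two hex numbers, the high and low 32 bits of a 64-bit WAL byte position. A sketch of what pg_wal_lsn_diff computes:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte
    position: high 32 bits / low 32 bits, both hex."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Equivalent of pg_wal_lsn_diff(sent_lsn, replay_lsn) for a standby."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

print(replication_lag_bytes("16/B374D848", "16/B3600000"))
```

This is handy when a monitoring agent scrapes LSNs as strings and needs to compute lag outside the database.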

Automated Failover with Patroni

Manual failover (promoting a standby with pg_ctl promote) is too slow and error-prone for production SLAs. Patroni is the standard automated HA solution for PostgreSQL in 2026.

Patroni uses a distributed consensus store (etcd, ZooKeeper, or Consul) to elect a primary and coordinate failover. When the primary becomes unavailable, Patroni automatically promotes the most up-to-date standby:

# patroni.yml
scope: postgres-prod
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008

etcd3:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576   # 1MB: don't promote standby >1MB behind
    postgresql:
      use_pg_rewind: true              # fast resync of old primary as new standby
      parameters:
        max_connections: 200
        shared_buffers: 8GB
        wal_level: replica
        max_wal_senders: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replication
      password: strong-password
    superuser:
      username: postgres
      password: strong-password

tags:
  nofailover: false
  noloadbalance: false

Patroni failover timeline:
1. Primary becomes unresponsive (crash or network partition)
2. Patroni's TTL expires (30 seconds in config above)
3. Patroni elects the standby with the most recent WAL position
4. Standby is promoted: pg_ctl promote
5. pg_rewind resynchronizes the old primary as a new standby
6. Total failover time: typically 30-60 seconds
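The "30-60 seconds" figure follows from the config: detection is bounded by the leader key's TTL plus the cadence at which members re-check the cluster state. A rough budget sketch (a simplification of Patroni's actual state machine, not its exact logic):

```python
def worst_case_detection_seconds(ttl: int, loop_wait: int, retry_timeout: int) -> int:
    """Rough upper bound on how long a dead primary can go unnoticed:
    the leader key must expire (ttl), and members act on their next
    loop_wait tick, after exhausting retry_timeout on DCS/PG operations."""
    return ttl + loop_wait + retry_timeout

# With the patroni.yml values above (ttl=30, loop_wait=10, retry_timeout=10)
print(worst_case_detection_seconds(30, 10, 10))  # 50 — plus promotion time
```

Lowering ttl speeds up failover but raises the risk of spurious failovers during transient network blips; Patroni's documentation recommends keeping ttl comfortably larger than loop_wait plus retry timeouts.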

HAProxy for connection routing: applications connect to HAProxy, which routes reads and writes to the correct node based on Patroni health checks:

# haproxy.cfg
frontend postgres_write
  bind *:5432
  default_backend postgres_primary

frontend postgres_read
  bind *:5433
  default_backend postgres_standbys

backend postgres_primary
  option httpchk GET /master   # Patroni returns 200 on primary, 503 on standby
  server pg-1 10.0.1.1:5432 check port 8008
  server pg-2 10.0.1.2:5432 check port 8008 backup
  server pg-3 10.0.1.3:5432 check port 8008 backup

backend postgres_standbys
  balance roundrobin
  option httpchk GET /replica  # Patroni returns 200 on standbys
  server pg-2 10.0.1.2:5432 check port 8008
  server pg-3 10.0.1.3:5432 check port 8008

VACUUM and Autovacuum Tuning

PostgreSQL's MVCC (Multi-Version Concurrency Control) never updates rows in place. An UPDATE writes a new version of the row. The old version remains until VACUUM reclaims it. Without regular VACUUM, dead rows accumulate — bloating tables and indexes, degrading query performance, and eventually triggering transaction ID wraparound (a hard crash if not addressed).

Autovacuum monitoring:

-- Tables with high dead tuple counts (need vacuuming)
SELECT schemaname, relname AS tablename,
       n_live_tup, 
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum, 
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;

-- Autovacuum currently running
SELECT pid, relid::regclass as table, phase, 
       heap_blks_scanned, heap_blks_total,
       round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1) AS pct_done
FROM pg_stat_progress_vacuum;

Autovacuum tuning for high-write tables:

-- Default autovacuum triggers when dead tuples exceed
-- autovacuum_vacuum_threshold (50) + autovacuum_vacuum_scale_factor (0.2) × rows
-- A 100M-row table accumulates ~20M dead tuples before autovacuum triggers!

-- Table-specific autovacuum settings: trigger more aggressively
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,    -- trigger at ~1% dead rows
    autovacuum_vacuum_threshold = 1000,       -- base count added to scale_factor × rows
    autovacuum_vacuum_cost_delay = 2,         -- 2ms between vacuum I/O bursts
                                              -- (default since PG 12; 20ms before)
    autovacuum_vacuum_cost_limit = 400,       -- more I/O budget for vacuum
    autovacuum_analyze_scale_factor = 0.01,   -- analyze at ~1% too
    toast.autovacuum_vacuum_scale_factor = 0.01
);

-- Emergency manual VACUUM ANALYZE (does not block reads/writes)
VACUUM ANALYZE events;

-- VACUUM FULL: rewrites the table entirely (LOCKS the table, reclaims space)
-- Only for extreme bloat recovery — blocks all access
VACUUM FULL events;  -- use only during maintenance window

Transaction ID wraparound prevention:

-- Check tables at risk of transaction ID wraparound
SELECT oid::regclass as table,
       age(relfrozenxid) as xid_age,
       2000000000 - age(relfrozenxid) as xids_remaining  -- warn at 1B remaining
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;
-- If age > 1,500,000,000: VACUUM FREEZE immediately

Transaction ID wraparound is a hard deadline: if a table reaches 2 billion transactions old without a VACUUM FREEZE, PostgreSQL enters read-only "wraparound protection" mode. It will not accept new transactions until the table is vacuumed. This is a high-severity production incident. Monitor XID age; alert at 1.5 billion.
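The tiered alert is simple enough to encode directly in a monitoring check. A sketch over the thresholds above (the function name and message strings are illustrative):

```python
XID_WRAPAROUND_LIMIT = 2_000_000_000  # approximate hard limit (~2^31)

def xid_alert(xid_age: int) -> str:
    """Tiered alert on age(relfrozenxid), mirroring the thresholds above."""
    if xid_age >= 1_500_000_000:
        return "critical: VACUUM FREEZE now"
    if xid_age >= 1_000_000_000:
        return "warning: schedule aggressive vacuum"
    return "ok"

print(xid_alert(1_600_000_000))  # critical: VACUUM FREEZE now
```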

Connection Pooling: PgBouncer at Scale

PostgreSQL creates a new OS process per connection. Each process uses 5-10MB of memory. At 500 connections, that's 2.5-5GB of RAM consumed just by connection overhead, before any query work. PgBouncer is the standard solution: a lightweight connection pooler that maintains a pool of persistent connections to PostgreSQL and maps many client connections onto fewer server connections.

# pgbouncer.ini
[databases]
mydb = host=postgres-primary port=5432 dbname=mydb
mydb_read = host=postgres-standby port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432

# Transaction pooling: most efficient, but breaks session-level features
# (SET LOCAL, LISTEN/NOTIFY, advisory locks, prepared statements)
pool_mode = transaction

max_client_conn = 1000       # clients → PgBouncer: allow many
default_pool_size = 20       # PgBouncer → PostgreSQL: keep small
max_db_connections = 100     # total connections to PostgreSQL

auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Performance
server_lifetime = 3600       # recycle connections after 1 hour
server_idle_timeout = 600    # close idle server connections after 10min
client_idle_timeout = 0      # don't close idle clients

PgBouncer's three pooling modes:
- Session pooling: one server connection per client session. Minimal compatibility issues. Reduces max connections to pool size.
- Transaction pooling: server connection released back to pool after each transaction. Most efficient. Breaks SET LOCAL, LISTEN/NOTIFY, session-level prepared statements.
- Statement pooling: server connection released after each statement. Only for auto-commit workloads.

Transaction pooling is the production default. If your application uses session-level features that break under transaction pooling, use Pgpool-II (supports more PostgreSQL features) or ensure those features are limited to dedicated connections outside PgBouncer.

-- PgBouncer statistics (connect to pgbouncer admin database)
SHOW POOLS;
-- database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | maxwait

SHOW STATS;
-- Shows requests/second, latency percentiles per database

-- Alert if: cl_waiting > 0 sustained (pool exhausted, clients queuing)
-- Alert if: maxwait > 100ms (queries waiting for connection > 100ms)

When cl_waiting is consistently above zero, the pool is exhausted. Options: increase default_pool_size (puts more load on PostgreSQL), add read replicas and route read traffic to them, or optimize queries to reduce transaction duration.
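Before reflexively raising default_pool_size, it helps to know where it should land. A widely cited starting-point heuristic (from the PostgreSQL wiki, not a PgBouncer setting) is `(cores × 2) + effective_spindle_count`. A sketch of that sizing plus the SHOW POOLS alert conditions above:

```python
def suggested_pool_size(cpu_cores: int, spindles: int = 1) -> int:
    """Starting-point heuristic for PostgreSQL-side pool size:
    (cores * 2) + effective spindle count. SSDs count roughly as one
    spindle. A number to load-test from, not a law."""
    return cpu_cores * 2 + spindles

def pool_exhausted(cl_waiting: int, maxwait_ms: float) -> bool:
    """Mirrors the alert conditions above: clients queuing, or queries
    waiting more than 100ms for a server connection."""
    return cl_waiting > 0 or maxwait_ms > 100.0

print(suggested_pool_size(16), pool_exhausted(cl_waiting=3, maxwait_ms=250.0))
```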

PostgreSQL Performance Tuning: Memory Settings

The two most impactful memory settings:

-- shared_buffers: PostgreSQL's shared memory cache (data pages)
-- Rule: 25% of total RAM, up to 8GB (OS page cache handles the rest)
-- 32GB RAM → shared_buffers = 8GB
shared_buffers = 8GB

-- work_mem: per sort/hash join operation memory
-- CAUTION: applies per-operation, not per-connection
-- 100 connections × 10 concurrent sorts × 256MB = 256GB RAM usage
-- Rule: start at 16MB, increase if explain shows disk sorts
work_mem = 16MB   -- increase per-query with: SET work_mem = '256MB'

-- effective_cache_size: hint to planner about total available RAM (shared_buffers + OS cache)
-- 75% of total RAM
effective_cache_size = 24GB  -- for a 32GB server

-- random_page_cost: cost estimate for random disk reads
-- SSD: 1.1 (nearly as fast as sequential); HDD: 4.0 (default)
random_page_cost = 1.1   -- for SSD-backed storage

-- parallel query workers
max_parallel_workers_per_gather = 4     -- parallel query workers per query
max_parallel_workers = 8               -- total parallel workers
max_worker_processes = 16              -- total background workers

The planner uses random_page_cost and effective_cache_size to decide between sequential scans and index scans. With the default random_page_cost = 4 on an SSD, the planner over-prefers sequential scans. Setting random_page_cost = 1.1 corrects this bias and dramatically improves index utilization.
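The sizing rules above (25% of RAM for shared_buffers, capped at 8GB; 75% for effective_cache_size) reduce to a tiny calculation. A sketch, treating both as the heuristics they are rather than absolutes:

```python
def memory_settings(total_ram_gb: int) -> dict:
    """Rules of thumb from above: shared_buffers = 25% of RAM up to 8GB;
    effective_cache_size = 75% of RAM. Starting points, not absolutes."""
    return {
        "shared_buffers_gb": min(total_ram_gb // 4, 8),
        "effective_cache_size_gb": total_ram_gb * 3 // 4,
    }

print(memory_settings(32))  # {'shared_buffers_gb': 8, 'effective_cache_size_gb': 24}
```

Note that effective_cache_size allocates nothing; it only tells the planner how much of the working set is likely cached between shared_buffers and the OS page cache.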

Index bloat and REINDEX CONCURRENTLY:

-- Find large, rarely used indexes (candidates for removal; measuring true
-- bloat requires pgstattuple or a bloat-estimation query)
SELECT schemaname, relname AS tablename, indexrelname AS indexname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used,
       idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan < 100  -- indexes rarely used
ORDER BY pg_relation_size(indexrelid) DESC;

-- Rebuild bloated index without locking (long-running operation)
REINDEX INDEX CONCURRENTLY idx_events_tenant_2026_01;
-- CONCURRENTLY: doesn't lock reads/writes, but takes longer

-- Create replacement index, then swap (zero downtime)
CREATE INDEX CONCURRENTLY idx_events_new ON events(tenant_id, created_at DESC)
    WHERE created_at > NOW() - INTERVAL '90 days';  -- partial index
-- Verify it's used, then DROP INDEX CONCURRENTLY the old one

Logical Replication and CDC

Logical replication decodes WAL changes into a logical row-change stream (INSERT/UPDATE/DELETE). Unlike physical streaming replication (which copies raw bytes), logical replication can replicate to different PostgreSQL versions, different schemas, or external systems.

Use cases:
- Zero-downtime major version upgrades (replicate to new version, cut over)
- CDC to Kafka or Debezium for event-driven architectures
- Real-time data warehouse loading
- Selective table replication

-- Publisher (primary): enable logical replication
wal_level = logical  -- required in postgresql.conf

-- Create a publication (which tables to replicate)
CREATE PUBLICATION orders_pub
FOR TABLE orders, order_items, payments;

-- Subscriber (replica):
CREATE SUBSCRIPTION orders_sub
CONNECTION 'host=primary-host dbname=mydb user=replication password=...'
PUBLICATION orders_pub;

-- Monitor replication slot lag (CRITICAL: don't let this grow large)
SELECT slot_name, 
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes,
       active
FROM pg_replication_slots;
-- Alert if lag_bytes > 1GB: slot is preventing WAL cleanup → disk fills

Replication slot danger: slots prevent WAL files from being deleted until the subscriber has consumed them. A disconnected subscriber means WAL accumulates indefinitely — filling your disk and crashing PostgreSQL. Always monitor replication slot lag; drop slots that are no longer in use.

Conclusion

PostgreSQL's advanced capabilities — partitioning, streaming replication, Patroni failover, WAL tuning, logical decoding — are production necessities at scale, not optional enhancements. The engineers who understand these features prevent the incidents that others debug reactively.

The operational priorities: partition large tables before they become large (adding partitioning to an existing table requires a data migration, not a DDL change). Implement Patroni before you need it (configuring HA during an outage is too late). Monitor replication lag and XID age as tier-1 alerts — these have hard operational deadlines. Tune autovacuum for high-write tables proactively — waiting until bloat is visible means you're already paying the performance cost.

PostgreSQL continues to be the most capable open-source relational database in 2026. Understanding its internals — WAL, MVCC, VACUUM, the replication architecture — is what separates teams that run PostgreSQL reliably at scale from teams that migrate to managed databases hoping the problems go away. The managed database route (RDS, Cloud SQL, AlloyDB) abstracts the operational complexity at a cost premium; understanding the internals makes the abstractions navigable regardless of deployment model.
