PostgreSQL Advanced: Partitioning, Replication, WAL, and High-Availability Patterns

Introduction
PostgreSQL handles a billion-row table differently than it handles a million-row table. The indexes that work at 10,000 rows per second write throughput fail under 100,000. The simple primary/replica setup that's fine for read-heavy workloads becomes a liability when you need zero-downtime failover during a primary crash. The VACUUM settings that keep an idle development database healthy leave a high-write production database bloated.
This post covers PostgreSQL at the scale where the defaults stop working: declarative table partitioning for billion-row tables, streaming replication architecture and failover with Patroni, Write-Ahead Log (WAL) configuration for performance and durability, and the VACUUM and autovacuum tuning that prevents table bloat from killing query performance. These are the PostgreSQL topics that most engineers learn reactively — after their first production incident.
Declarative Table Partitioning
Declarative partitioning (PostgreSQL 10+) splits a logical table into physical child tables called partitions. The query planner routes queries to only the relevant partitions — a process called partition pruning — dramatically reducing I/O for time-series and tenant-scoped data.
Range Partitioning (Time-Series Data)
-- Create parent table (no data stored here)
CREATE TABLE events (
    id BIGSERIAL,
    tenant_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
-- Create partitions by month
CREATE TABLE events_2026_01
    PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02
    PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE events_2026_03
    PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- Indexes: PostgreSQL 11+ cascades a parent-table index to every partition;
-- creating them per partition also works (and permits CONCURRENTLY)
CREATE INDEX idx_events_2026_01_tenant ON events_2026_01 (tenant_id, created_at DESC);
CREATE INDEX idx_events_2026_02_tenant ON events_2026_02 (tenant_id, created_at DESC);
-- Insert routes to the correct partition automatically
INSERT INTO events (tenant_id, event_type, payload, created_at)
VALUES ('abc-123', 'page_view', '{"url": "/home"}', '2026-01-15');
-- Query uses partition pruning — only scans events_2026_01
EXPLAIN SELECT * FROM events
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'
AND tenant_id = 'abc-123';
-- "Append ... Seq Scan on events_2026_01 ..." -- only January partition scanned
Old partitions can be detached (without data loss) and dropped when data exceeds retention:
-- Detach old partition: fast, no data movement
ALTER TABLE events DETACH PARTITION events_2026_01;
-- events_2026_01 is now a standalone table
-- Drop when ready
DROP TABLE events_2026_01;
-- O(1) operation — deletes files, no row-by-row DELETE
-- Automate with the pg_partman extension
-- Creates new partitions on a schedule (parameter names vary by pg_partman version)
SELECT partman.create_parent(
    p_parent_table => 'public.events',
    p_control => 'created_at',
    p_type => 'range',
    p_interval => '1 month'
);
-- Retention is configured on the table's part_config row, not in create_parent
UPDATE partman.part_config
SET retention = '6 months', -- auto-drop partitions older than 6 months
    retention_keep_table = false
WHERE parent_table = 'public.events';
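Without pg_partman, the month-by-month DDL above is easy to generate from application code. A minimal sketch (the helper name and output format are illustrative, not part of any library):

```python
from datetime import date

def month_partitions(table: str, start: date, months: int) -> list[str]:
    """Generate CREATE TABLE ... PARTITION OF statements for monthly ranges."""
    ddl = []
    y, m = start.year, start.month
    for _ in range(months):
        lo = date(y, m, 1)
        # advance one month to get the exclusive upper bound
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        hi = date(y, m, 1)
        ddl.append(
            f"CREATE TABLE {table}_{lo:%Y_%m} PARTITION OF {table} "
            f"FOR VALUES FROM ('{lo}') TO ('{hi}');"
        )
    return ddl

for stmt in month_partitions("events", date(2026, 1, 1), 3):
    print(stmt)
```

A cron job or migration step running this a few months ahead of the current date keeps inserts from ever hitting a missing partition.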
List Partitioning (Multi-Tenant Data)
-- Partition by tenant for tenant-isolated queries
CREATE TABLE orders (
    id BIGSERIAL,
    tenant_id TEXT NOT NULL,
    user_id UUID NOT NULL,
    total_cents INTEGER NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY LIST (tenant_id);
CREATE TABLE orders_tenant_acme PARTITION OF orders FOR VALUES IN ('acme');
CREATE TABLE orders_tenant_globex PARTITION OF orders FOR VALUES IN ('globex');
CREATE TABLE orders_default PARTITION OF orders DEFAULT; -- catch-all
-- All queries for tenant 'acme' scan only orders_tenant_acme
SELECT * FROM orders WHERE tenant_id = 'acme' AND created_at > NOW() - INTERVAL '30 days';
Partition pruning reduces query scope. For a 2-year events table with monthly partitions (24 partitions), a query for one month scans 1/24th of the data. Without partitioning, the query scans the full table.

pg_stat_statements: Finding the Queries That Matter
Before tuning partitions, replication, or VACUUM, identify the queries consuming the most time. pg_stat_statements tracks cumulative statistics for every normalized query pattern — the single most valuable diagnostic view in PostgreSQL.
-- Enable (requires server restart after adding to shared_preload_libraries)
-- postgresql.conf: shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by total time
SELECT
    left(query, 100) AS query_snippet,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2) AS avg_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
    rows
FROM pg_stat_statements
WHERE query NOT ILIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 10;
-- Queries with high variability (stddev >> mean = sporadic slowness)
SELECT
    left(query, 100) AS query_snippet,
    calls,
    round(mean_exec_time::numeric, 2) AS avg_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    round((stddev_exec_time / nullif(mean_exec_time, 0))::numeric, 2) AS cv -- coefficient of variation
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time / nullif(mean_exec_time, 0) DESC
LIMIT 10;
-- Reset statistics (do periodically; old data clouds current analysis)
SELECT pg_stat_statements_reset();
The workflow: identify the top 10 queries by total time → EXPLAIN ANALYZE each → add missing indexes or rewrite inefficient queries → verify improvement. Repeat weekly. In most production databases, 20% of queries account for 80% of query time.
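The "top queries by total time" and coefficient-of-variation rankings reduce to simple arithmetic over the view's columns. A sketch of that analysis over exported stats (the row shape mirrors pg_stat_statements columns; nothing here is a PostgreSQL API):

```python
from dataclasses import dataclass

@dataclass
class QueryStat:
    query: str
    calls: int
    total_ms: float
    stddev_ms: float = 0.0

def top_by_total_time(stats: list[QueryStat], n: int = 10) -> list[tuple[str, float]]:
    """Rank queries by total time, attaching each one's share of the grand total."""
    grand = sum(s.total_ms for s in stats) or 1.0
    ranked = sorted(stats, key=lambda s: s.total_ms, reverse=True)[:n]
    return [(s.query, round(100 * s.total_ms / grand, 1)) for s in ranked]

def coefficient_of_variation(s: QueryStat) -> float:
    """stddev/mean; high values flag queries that are only sometimes slow."""
    mean = s.total_ms / s.calls if s.calls else 0.0
    return s.stddev_ms / mean if mean else 0.0

stats = [QueryStat("SELECT a", 100, 8000.0, 10.0),
         QueryStat("SELECT b", 10, 2000.0, 400.0)]
print(top_by_total_time(stats, 2))  # "SELECT a" carries 80% of total time
```

Exporting snapshots like this over time also shows whether an index change actually moved a query's share of the total.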
Write-Ahead Log: Performance and Durability Configuration
The Write-Ahead Log (WAL) is the foundation of PostgreSQL's durability. Every change is written to the WAL before being applied to data pages. On crash, PostgreSQL replays WAL from the last checkpoint — recovering to a consistent state.
WAL configuration controls the durability-performance trade-off:
-- postgresql.conf: WAL settings
-- Full sync to disk on every transaction commit (safest, slowest)
synchronous_commit = on -- default; safest
-- Async WAL: allow OS to delay sync (10-100x faster writes, <1s data loss window)
synchronous_commit = off -- lose last ~1s of committed transactions on crash
-- NOT appropriate for financial data
-- Checkpoint settings: how often WAL is flushed to data pages
checkpoint_completion_target = 0.9 -- spread checkpoint over 90% of interval
max_wal_size = 4GB -- allow more WAL before forcing checkpoint
min_wal_size = 1GB
-- WAL level: required for logical replication
wal_level = logical -- enables logical decoding (for CDC, replication)
-- 'replica' for physical replication only
-- WAL compression: reduces WAL volume, saves disk/network
wal_compression = lz4 -- lz4/zstd require PostgreSQL 15+; older versions accept only on/off
WAL size monitoring:
-- Current WAL write location
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());
-- WAL generation rate: sample pg_current_wal_lsn() twice, N seconds apart
SELECT pg_current_wal_lsn() AS lsn_sample; -- record, wait 60s, sample again, then:
-- SELECT pg_wal_lsn_diff('<later_lsn>', '<earlier_lsn>') / 60 AS wal_bytes_per_second;
-- Checkpoint frequency (should not checkpoint more than once every few minutes)
-- (PostgreSQL 17+ moves these counters to pg_stat_checkpointer)
SELECT checkpoints_req, checkpoints_timed,
       buffers_checkpoint, buffers_clean, buffers_backend,
       round(100.0 * checkpoints_req / nullif(checkpoints_req + checkpoints_timed, 0), 1)
           AS pct_forced_checkpoints
FROM pg_stat_bgwriter;
-- High pct_forced_checkpoints: increase max_wal_size
A high rate of forced checkpoints (triggered by WAL reaching max_wal_size) indicates that max_wal_size is too small for the write rate. Forced checkpoints cause I/O spikes. Increase max_wal_size to allow PostgreSQL to spread checkpoint work over time.
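An LSN like 0/3000000 is just a 64-bit byte offset printed as two hex halves, so pg_wal_lsn_diff-style arithmetic can be reproduced client-side for dashboards. A sketch, assuming the standard X/Y text format (function names are illustrative):

```python
def parse_lsn(lsn: str) -> int:
    """Convert an 'X/Y' LSN (two hex halves) into a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_rate_bytes_per_sec(lsn_earlier: str, lsn_later: str, seconds: float) -> float:
    """WAL generation rate from two samples of pg_current_wal_lsn()."""
    return (parse_lsn(lsn_later) - parse_lsn(lsn_earlier)) / seconds

# 32MB of WAL written in 60 seconds
print(wal_rate_bytes_per_sec("0/1000000", "0/3000000", 60))
```

The same parse function works for computing replication lag in bytes from the LSN columns of pg_stat_replication.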
Streaming Replication Architecture
PostgreSQL streaming replication sends WAL records from the primary to standbys in real time. Standbys apply WAL records and stay within seconds of the primary. Reads can be served from standbys (with hot_standby = on).
-- On primary: postgresql.conf
wal_level = replica
max_wal_senders = 10 -- max concurrent replication connections
wal_keep_size = 1GB -- keep enough WAL for standbys to catch up after lag
-- pg_hba.conf: allow replication connections
-- TYPE  DATABASE     USER         ADDRESS       METHOD
host     replication  replication  10.0.1.2/32   scram-sha-256
-- (a CIDR mask requires an IP address; a bare hostname may be used instead, without /32)
-- Create replication user
CREATE USER replication REPLICATION LOGIN PASSWORD 'strong-password';
-- On standby (PostgreSQL 12+): create an empty standby.signal file in the data
-- directory, then set these in postgresql.conf (pre-12 used recovery.conf)
primary_conninfo = 'host=primary-host user=replication password=strong-password'
hot_standby = on -- allow reads from standby
recovery_target_timeline = latest
Replication lag monitoring:
-- On primary: view connected standbys and their lag
SELECT client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag -- wall-clock time behind primary
FROM pg_stat_replication;
-- On standby: check lag from the standby's perspective (returns an interval, not a number)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
Alert if replay_lag > 30s — standbys this far behind risk serving stale data and may not promote quickly in a failover.
Automated Failover with Patroni
Manual failover (promoting a standby with pg_ctl promote) is too slow and error-prone for production SLAs. Patroni is the standard automated HA solution for PostgreSQL in 2026.
Patroni uses a distributed consensus store (etcd, ZooKeeper, or Consul) to elect a primary and coordinate failover. When the primary becomes unavailable, Patroni automatically promotes the most up-to-date standby:
# patroni.yml
scope: postgres-prod
name: pg-node-1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008
etcd3:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB: don't promote a standby >1MB behind
    postgresql:
      use_pg_rewind: true  # fast resync of old primary as new standby
      parameters:
        max_connections: 200
        shared_buffers: 8GB
        wal_level: replica
        max_wal_senders: 10
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replication
      password: strong-password
    superuser:
      username: postgres
      password: strong-password
tags:
  nofailover: false
  noloadbalance: false
Patroni failover timeline:
1. Primary becomes unresponsive (crash or network partition)
2. Patroni's TTL expires (30 seconds in config above)
3. Patroni elects the standby with the most recent WAL position
4. Standby is promoted: pg_ctl promote
5. pg_rewind resynchronizes the old primary as a new standby
6. Total failover time: typically 30-60 seconds
HAProxy for connection routing: applications connect to HAProxy, which routes reads and writes to the correct node based on Patroni health checks:
# haproxy.cfg
frontend postgres_write
    bind *:5432
    default_backend postgres_primary

frontend postgres_read
    bind *:5433
    default_backend postgres_standbys

backend postgres_primary
    option httpchk GET /master  # Patroni returns 200 on the primary, 503 on standbys
    server pg-1 10.0.1.1:5432 check port 8008
    server pg-2 10.0.1.2:5432 check port 8008 backup
    server pg-3 10.0.1.3:5432 check port 8008 backup

backend postgres_standbys
    balance roundrobin
    option httpchk GET /replica  # Patroni returns 200 on standbys
    server pg-2 10.0.1.2:5432 check port 8008
    server pg-3 10.0.1.3:5432 check port 8008

VACUUM and Autovacuum Tuning
PostgreSQL's MVCC (Multi-Version Concurrency Control) never updates rows in place. An UPDATE writes a new version of the row. The old version remains until VACUUM reclaims it. Without regular VACUUM, dead rows accumulate — bloating tables and indexes, degrading query performance, and eventually triggering transaction ID wraparound (a hard crash if not addressed).
Autovacuum monitoring:
-- Tables with high dead tuple counts (need vacuuming)
SELECT schemaname, relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;
-- Autovacuum currently running
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total,
       round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1) AS pct_done
FROM pg_stat_progress_vacuum;
Autovacuum tuning for high-write tables:
-- Default autovacuum: triggers at 20% dead tuples, too slow for large tables
-- A 100M-row table needs 20M dead tuples before autovacuum triggers!
-- Table-specific autovacuum settings: trigger more aggressively
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01, -- trigger at 1% dead rows
    autovacuum_vacuum_threshold = 1000, -- base count added to scale_factor * live rows
    autovacuum_vacuum_cost_delay = 2, -- 2ms between vacuum I/O bursts (the default since PG 12; 20ms before)
    autovacuum_vacuum_cost_limit = 400, -- more I/O budget for vacuum
    autovacuum_analyze_scale_factor = 0.01, -- analyze at 1% too
    toast.autovacuum_vacuum_scale_factor = 0.01
);
-- Emergency manual VACUUM ANALYZE (does not block reads/writes)
VACUUM ANALYZE events;
-- VACUUM FULL: rewrites the table entirely (LOCKS the table, reclaims space)
-- Only for extreme bloat recovery — blocks all access
VACUUM FULL events; -- use only during maintenance window
Transaction ID wraparound prevention:
-- Check tables at risk of transaction ID wraparound
SELECT oid::regclass AS table_name,
       age(relfrozenxid) AS xid_age,
       2000000000 - age(relfrozenxid) AS xids_remaining -- warn well before the ~2B limit
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;
-- If age > 1,500,000,000: VACUUM FREEZE immediately
Transaction ID wraparound is a hard deadline: if a table reaches 2 billion transactions old without a VACUUM FREEZE, PostgreSQL enters read-only "wraparound protection" mode. It will not accept new transactions until the table is vacuumed. This is a high-severity production incident. Monitor XID age; alert at 1.5 billion.
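The monitoring rule reduces to a threshold check on XID age. A sketch (the limits approximate PostgreSQL's hard stop just short of 2^31 transactions; the function name and message strings are illustrative):

```python
def xid_alert(xid_age: int, warn_at: int = 1_500_000_000) -> str:
    """Map a table's relfrozenxid age to an alert level."""
    if xid_age >= warn_at:
        return "critical: run VACUUM FREEZE now"
    if xid_age >= warn_at // 2:
        return "warning: schedule aggressive vacuum"
    return "ok"

print(xid_alert(1_600_000_000))  # critical: run VACUUM FREEZE now
```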
Connection Pooling: PgBouncer at Scale
PostgreSQL creates a new OS process per connection. Each process uses 5-10MB of memory. At 500 connections, that's 2.5-5GB of RAM consumed just by connection overhead, before any query work. PgBouncer is the standard solution: a lightweight connection pooler that maintains a pool of persistent connections to PostgreSQL and maps many client connections onto fewer server connections.
# pgbouncer.ini
[databases]
mydb = host=postgres-primary port=5432 dbname=mydb
mydb_read = host=postgres-standby port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
# Transaction pooling: most efficient, but breaks session-level features
# (SET LOCAL, LISTEN/NOTIFY, advisory locks, prepared statements)
pool_mode = transaction
max_client_conn = 1000 # clients → PgBouncer: allow many
default_pool_size = 20 # PgBouncer → PostgreSQL: keep small
max_db_connections = 100 # total connections to PostgreSQL
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
# Performance
server_lifetime = 3600 # recycle connections after 1 hour
server_idle_timeout = 600 # close idle server connections after 10min
client_idle_timeout = 0 # don't close idle clients
PgBouncer's three pooling modes:
- Session pooling: one server connection per client session. Minimal compatibility issues. Reduces max connections to pool size.
- Transaction pooling: server connection released back to pool after each transaction. Most efficient. Breaks SET LOCAL, LISTEN/NOTIFY, session-level prepared statements.
- Statement pooling: server connection released after each statement. Only for auto-commit workloads.
Transaction pooling is the production default. If your application relies on session-level features that break under transaction pooling, use Pgpool-II (supports more PostgreSQL features), confine those features to dedicated connections outside PgBouncer, or — for prepared statements specifically — use PgBouncer 1.21+, which can track protocol-level prepared statements in transaction mode (max_prepared_statements).
-- PgBouncer statistics (connect to pgbouncer admin database)
SHOW POOLS;
-- database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | maxwait
SHOW STATS;
-- Shows requests/second, latency percentiles per database
-- Alert if: cl_waiting > 0 sustained (pool exhausted, clients queuing)
-- Alert if: maxwait > 100ms (queries waiting for connection > 100ms)
When cl_waiting is consistently above zero, the pool is exhausted. Options: increase default_pool_size (puts more load on PostgreSQL), add read replicas and route read traffic to them, or optimize queries to reduce transaction duration.
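The two SHOW POOLS alert rules can be encoded directly in a monitoring script. A sketch over rows shaped like SHOW POOLS output (field names mirror PgBouncer's columns, except maxwait_ms, which is assumed to be pre-converted to milliseconds by the collector):

```python
def pool_alerts(rows: list[dict], maxwait_ms_limit: int = 100) -> list[str]:
    """Flag exhausted pools (clients queuing) and slow waits for a connection."""
    alerts = []
    for r in rows:
        if r["cl_waiting"] > 0:
            alerts.append(f"{r['database']}: pool exhausted, "
                          f"{r['cl_waiting']} clients waiting")
        if r["maxwait_ms"] > maxwait_ms_limit:
            alerts.append(f"{r['database']}: maxwait {r['maxwait_ms']}ms "
                          f"exceeds {maxwait_ms_limit}ms")
    return alerts

rows = [{"database": "mydb", "cl_waiting": 3, "maxwait_ms": 250},
        {"database": "mydb_read", "cl_waiting": 0, "maxwait_ms": 0}]
for alert in pool_alerts(rows):
    print(alert)
```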
PostgreSQL Performance Tuning: Memory Settings
The two most impactful memory settings:
-- shared_buffers: PostgreSQL's shared memory cache (data pages)
-- Rule: 25% of total RAM, up to 8GB (OS page cache handles the rest)
-- 32GB RAM → shared_buffers = 8GB
shared_buffers = 8GB
-- work_mem: per sort/hash join operation memory
-- CAUTION: applies per-operation, not per-connection
-- 100 connections × 10 concurrent sorts × 256MB = 256GB RAM usage
-- Rule: start at 16MB, increase if explain shows disk sorts
work_mem = 16MB -- increase per-query with: SET work_mem = '256MB'
-- effective_cache_size: hint to planner about total available RAM (shared_buffers + OS cache)
-- 75% of total RAM
effective_cache_size = 24GB -- for a 32GB server
-- random_page_cost: cost estimate for random disk reads
-- SSD: 1.1 (nearly as fast as sequential); HDD: 4.0 (default)
random_page_cost = 1.1 -- for SSD-backed storage
-- parallel query workers
max_parallel_workers_per_gather = 4 -- parallel query workers per query
max_parallel_workers = 8 -- total parallel workers
max_worker_processes = 16 -- total background workers
The planner uses random_page_cost and effective_cache_size to decide between sequential scans and index scans. With the default random_page_cost = 4 on an SSD, the planner over-prefers sequential scans. Setting random_page_cost = 1.1 corrects this bias and dramatically improves index utilization.
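The effect of random_page_cost on plan choice can be seen with a toy version of the planner's arithmetic — grossly simplified (the real planner also models caching, correlation, and CPU cost), but it shows why the default biases an SSD system toward sequential scans:

```python
def cheaper_plan(table_pages: int, matching_pages: int, random_page_cost: float) -> str:
    """Compare a full sequential scan against per-page random heap fetches.
    Toy model: ignores CPU cost, caching, and the index traversal itself."""
    seq_cost = table_pages * 1.0                 # seq_page_cost default
    idx_cost = matching_pages * random_page_cost
    return "index scan" if idx_cost < seq_cost else "seq scan"

# 100,000-page table; predicate rows spread over 30,000 pages
print(cheaper_plan(100_000, 30_000, 4.0))  # HDD default -> seq scan
print(cheaper_plan(100_000, 30_000, 1.1))  # SSD setting -> index scan
```

Same table, same query: lowering random_page_cost from 4.0 to 1.1 flips the estimate in favor of the index.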
Index bloat and REINDEX CONCURRENTLY:
-- Find large, rarely-used indexes (bloat and removal candidates)
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used,
       idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan < 100 -- indexes rarely used (candidates for removal)
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild bloated index without locking (long-running operation)
REINDEX INDEX CONCURRENTLY idx_events_tenant_2026_01;
-- CONCURRENTLY: doesn't lock reads/writes, but takes longer
-- Create replacement index, then swap (zero downtime)
CREATE INDEX CONCURRENTLY idx_events_new ON events(tenant_id, created_at DESC)
WHERE created_at > NOW() - INTERVAL '90 days'; -- partial index
-- Verify it's used, then DROP INDEX CONCURRENTLY the old one
Logical Replication and CDC
Logical replication decodes WAL changes into a logical row-change stream (INSERT/UPDATE/DELETE). Unlike physical streaming replication (which copies raw bytes), logical replication can replicate to different PostgreSQL versions, different schemas, or external systems.
Use cases:
- Zero-downtime major version upgrades (replicate to new version, cut over)
- CDC to Kafka or Debezium for event-driven architectures
- Real-time data warehouse loading
- Selective table replication
-- Publisher (primary): enable logical replication
wal_level = logical -- required in postgresql.conf
-- Create a publication (which tables to replicate)
CREATE PUBLICATION orders_pub
FOR TABLE orders, order_items, payments;
-- Subscriber (replica):
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary-host dbname=mydb user=replication password=...'
    PUBLICATION orders_pub;
-- Monitor replication slot lag (CRITICAL: don't let this grow large)
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes,
       active
FROM pg_replication_slots;
-- Alert if lag_bytes > 1GB: slot is preventing WAL cleanup → disk fills
Replication slot danger: slots prevent WAL files from being deleted until the subscriber has consumed them. A disconnected subscriber means WAL accumulates indefinitely — filling your disk and crashing PostgreSQL. Always monitor replication slot lag; drop slots that are no longer in use.
Conclusion
PostgreSQL's advanced capabilities — partitioning, streaming replication, Patroni failover, WAL tuning, logical decoding — are production necessities at scale, not optional enhancements. The engineers who understand these features prevent the incidents that others debug reactively.
The operational priorities: partition large tables before they become large (adding partitioning to an existing table requires a data migration, not a DDL change). Implement Patroni before you need it (configuring HA during an outage is too late). Monitor replication lag and XID age as tier-1 alerts — these have hard operational deadlines. Tune autovacuum for high-write tables proactively — waiting until bloat is visible means you're already paying the performance cost.
PostgreSQL continues to be the most capable open-source relational database in 2026. Understanding its internals — WAL, MVCC, VACUUM, the replication architecture — is what separates teams that run PostgreSQL reliably at scale from teams that migrate to managed databases hoping the problems go away. The managed database route (RDS, Cloud SQL, AlloyDB) abstracts the operational complexity at a cost premium; understanding the internals makes the abstractions navigable regardless of deployment model.
Sources
- PostgreSQL Documentation — Table Partitioning
- Patroni GitHub
- PostgreSQL Documentation — Logical Replication
- pg_partman Extension