Terraform Advanced Patterns: Modules, Remote State, and Managing Infrastructure Drift

Hero: Terraform state dependency graph with modules and resources

Most teams start Terraform with a single main.tf file and a handful of resources. Six months later, that file is 2,000 lines. Three environments (dev, staging, prod) have separate copies that have gradually diverged. The state file lives on someone's laptop. Applying requires knowing the "correct order" of operations. This is infrastructure as code in name only.

Terraform's advanced patterns — modular architecture, remote state, workspaces, drift detection, and policy-as-code — transform it from a "run once and pray" tool to a reliable, auditable infrastructure platform. This guide covers all of them with production-grade examples.

The Problem: Flat Terraform That Doesn't Scale

The anti-patterns that compound over time:

project/
├── main.tf          # 2,000+ lines of everything
├── variables.tf     # 50 variables, minimal documentation
├── outputs.tf       # What outputs exist? Nobody knows.
└── terraform.tfstate  # State on someone's laptop — NOT in git

When another team member runs terraform apply, they don't have the state file. When you add a new environment, you duplicate the entire directory. When you want to share the VPC configuration with another project, you copy-paste it.

The structured alternative:

infrastructure/
├── modules/
│   ├── vpc/              # Reusable VPC module
│   ├── eks/              # Reusable EKS cluster module
│   ├── rds/              # Reusable RDS module
│   └── service/          # Reusable service-level module
├── environments/
│   ├── dev/              # Dev environment composition
│   ├── staging/          # Staging environment composition
│   └── prod/             # Prod environment composition
└── global/               # Cross-environment resources (DNS, IAM)

How It Works: Module Architecture

A Terraform module is any directory with .tf files. Modules encapsulate resources, expose a clean interface (variables/outputs), and hide implementation details. The calling configuration provides inputs; the module returns outputs.

Writing a Reusable Module

# modules/rds/variables.tf
variable "identifier" {
  description = "Unique identifier for this RDS instance"
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,60}$", var.identifier))
    error_message = "Identifier must be 2-61 characters: lowercase letters, digits, or hyphens, starting with a letter."
  }
}

variable "engine_version" {
  description = "PostgreSQL engine version"
  type        = string
  default     = "16.2"
}

variable "instance_class" {
  description = "RDS instance type"
  type        = string
  default     = "db.t3.medium"
}

variable "storage_gb" {
  description = "Allocated storage in gigabytes"
  type        = number
  default     = 20

  validation {
    condition     = var.storage_gb >= 20 && var.storage_gb <= 65536
    error_message = "Storage must be between 20 and 65536 GB."
  }
}

variable "subnet_ids" {
  description = "List of subnet IDs for the DB subnet group"
  type        = list(string)
}

variable "vpc_id" {
  description = "VPC ID for security group placement"
  type        = string
}

variable "allowed_security_group_ids" {
  description = "Security group IDs allowed to connect to this RDS instance"
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Additional tags to apply to resources"
  type        = map(string)
  default     = {}
}
# modules/rds/main.tf
locals {
  common_tags = merge(var.tags, {
    Module  = "rds"
    Managed = "terraform"
  })
}

resource "aws_db_subnet_group" "this" {
  name       = "${var.identifier}-subnet-group"
  subnet_ids = var.subnet_ids
  tags       = local.common_tags
}

resource "aws_security_group" "rds" {
  name        = "${var.identifier}-rds-sg"
  vpc_id      = var.vpc_id
  description = "Security group for ${var.identifier} RDS instance"
  tags        = local.common_tags

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.allowed_security_group_ids
    description     = "PostgreSQL from allowed security groups"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "random_password" "master" {
  length           = 32
  special          = true
  override_special = "!#$%^&*()-_=+[]{}|"
}

resource "aws_secretsmanager_secret" "rds_password" {
  name                    = "/${var.identifier}/rds/master-password"
  recovery_window_in_days = 7
  tags                    = local.common_tags
}

resource "aws_secretsmanager_secret_version" "rds_password" {
  secret_id     = aws_secretsmanager_secret.rds_password.id
  secret_string = random_password.master.result
}

resource "aws_db_instance" "this" {
  identifier     = var.identifier
  engine         = "postgres"
  engine_version = var.engine_version
  instance_class = var.instance_class

  allocated_storage     = var.storage_gb
  max_allocated_storage = var.storage_gb * 2  # Auto-scaling up to 2×
  storage_encrypted     = true                 # Always encrypt

  db_subnet_group_name   = aws_db_subnet_group.this.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  username = "postgres"
  password = random_password.master.result

  multi_az               = var.instance_class != "db.t3.micro"  # No multi-AZ for dev
  deletion_protection    = true
  skip_final_snapshot    = false
  final_snapshot_identifier = "${var.identifier}-final"

  backup_retention_period = 14
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  performance_insights_enabled = true
  monitoring_interval          = 60

  tags = local.common_tags
}
# modules/rds/outputs.tf
output "endpoint" {
  description = "RDS instance endpoint"
  value       = aws_db_instance.this.endpoint
}

output "port" {
  description = "RDS instance port"
  value       = aws_db_instance.this.port
}

output "security_group_id" {
  description = "Security group ID attached to this RDS instance"
  value       = aws_security_group.rds.id
}

output "secret_arn" {
  description = "ARN of the Secrets Manager secret containing the master password"
  value       = aws_secretsmanager_secret.rds_password.arn
  sensitive   = true
}

Calling the Module from an Environment

# environments/prod/main.tf
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Accept 5.x minor/patch updates; block 6.0
    }
  }

  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"  # Prevents concurrent applies
  }
}

# Reference the VPC from a separate state
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "myorg-terraform-state"
    key    = "prod/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

module "payments_db" {
  source = "../../modules/rds"

  identifier   = "payments-prod"
  instance_class = "db.r6g.xlarge"
  storage_gb   = 100

  subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids
  vpc_id     = data.terraform_remote_state.vpc.outputs.vpc_id

  allowed_security_group_ids = [
    module.payments_service.security_group_id
  ]

  tags = {
    Environment = "prod"
    Team        = "payments"
    CostCenter  = "engineering"
  }
}
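The terraform_remote_state lookup above only works if the producing configuration declares root-level outputs with matching names. A sketch of what the VPC state is assumed to expose (the module output names are placeholders; match them to your VPC module):

```hcl
# environments/prod/vpc/outputs.tf (assumed producer side)
# Only root-level outputs are visible through terraform_remote_state,
# so module outputs must be re-exported here.
output "vpc_id" {
  description = "ID of the prod VPC"
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "Private subnet IDs consumed by downstream states"
  value       = module.vpc.private_subnet_ids
}
```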

Terragrunt: DRY Terraform Configurations

Repeating the same backend configuration and provider setup across dozens of module instantiations violates DRY. Terragrunt wraps Terraform to eliminate this repetition:

# terragrunt.hcl (at repository root)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    bucket         = "myorg-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Repository  = "myorg/infrastructure"
    }
  }
}
EOF
}
# environments/prod/payments-db/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()  # Inherits root terragrunt.hcl
}

terraform {
  source = "../../../modules//rds"  # Double // = module root
}

# Pass inputs to the module
inputs = {
  identifier     = "payments-prod"
  instance_class = "db.r6g.xlarge"
  storage_gb     = 100

  subnet_ids = dependency.vpc.outputs.private_subnet_ids
  vpc_id     = dependency.vpc.outputs.vpc_id
}

# Declare dependency on VPC module output
dependency "vpc" {
  config_path = "../vpc"
  mock_outputs = {
    private_subnet_ids = ["subnet-mock"]
    vpc_id             = "vpc-mock"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

With Terragrunt, terragrunt run-all plan plans every module in the directory tree, running independent modules concurrently while respecting declared dependencies. terragrunt run-all apply applies them in dependency order.

The directory structure mirrors the account/region/environment hierarchy naturally:

infrastructure/
├── terragrunt.hcl               ← root config, shared by all
├── prod/
│   ├── us-east-1/
│   │   ├── vpc/
│   │   │   └── terragrunt.hcl
│   │   ├── eks/
│   │   │   └── terragrunt.hcl
│   │   └── payments-db/
│   │       └── terragrunt.hcl  ← uses root config, declares vpc dep

Remote State and State Locking

State stored locally is a collaboration and reliability problem. Remote state solves both:

# Bootstrap: create the S3 bucket and DynamoDB lock table first
# (typically in a separate "bootstrap" Terraform config or done manually)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"  # Keep all state file versions — essential for recovery
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

With remote state, terraform apply acquires the DynamoDB lock first. If another apply is already holding it, the second run fails with a lock error instead of corrupting the state file.
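Recent Terraform versions (1.10+) also support S3-native state locking via conditional writes, which makes the DynamoDB table optional. A sketch, assuming the same bucket as above:

```hcl
# S3-native locking (Terraform 1.10+); no DynamoDB table required
terraform {
  backend "s3" {
    bucket       = "myorg-terraform-state"
    key          = "prod/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true  # Lock via an S3 conditional write next to the state object
  }
}
```

If your team is on an older Terraform version, keep the DynamoDB table; the two mechanisms should not be mixed across collaborators on the same state.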

Testing Terraform with Terratest

Infrastructure code should be tested like application code. Terratest deploys real infrastructure in a test AWS account, runs assertions, and tears it down:

// test/rds_module_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/stretchr/testify/assert"
)

func TestRDSModule(t *testing.T) {
    t.Parallel()

    // Compute the identifier once so the Terraform vars and the
    // assertions below reference the same instance
    dbID := "test-db-" + uniqueID()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/rds",
        Vars: map[string]interface{}{
            "identifier":     dbID,
            "instance_class": "db.t3.micro", // Cheapest for tests
            "storage_gb":     20,
            "subnet_ids":     getTestSubnetIDs(t),
            "vpc_id":         getTestVPCID(t),
        },
    }

    // Clean up after the test regardless of pass/fail
    defer terraform.Destroy(t, terraformOptions)

    // Apply the module
    terraform.InitAndApply(t, terraformOptions)

    // Assert outputs
    endpoint := terraform.Output(t, terraformOptions, "endpoint")
    assert.NotEmpty(t, endpoint)

    // Assert against the actual AWS resource
    db := aws.GetRdsInstanceById(t, dbID, "us-east-1")

    assert.True(t, *db.StorageEncrypted, "Storage should be encrypted")
    assert.True(t, *db.DeletionProtection, "Deletion protection should be enabled")
    assert.Equal(t, int64(14), *db.BackupRetentionPeriod, "Backup retention should be 14 days")
}

Terratest tests are integration tests — they deploy real infrastructure and take 5-15 minutes. Run them on PR to main only, not every commit. Keep a dedicated test AWS account with limited quotas and automated cleanup of orphaned resources.

Drift Detection and Remediation

Infrastructure drift: the actual cloud state diverges from Terraform state. This happens when someone makes a manual change in the AWS console, or a resource is modified by another process.

# Detect drift — shows what changed outside Terraform
terraform plan -refresh-only

# Example output showing drift:
# ~ aws_security_group.rds
#     ingress {
#         + cidr_blocks = ["10.0.1.0/24"]  # Someone added this manually
#     }

Remediate drift in CI with a scheduled plan:

# .github/workflows/drift-detection.yml
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  drift-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]

    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.7"

      - name: Terraform Init
        run: terraform init
        working-directory: environments/${{ matrix.environment }}

      - name: Check for Drift
        id: plan
        run: terraform plan -refresh-only -detailed-exitcode -out=drift.plan
        working-directory: environments/${{ matrix.environment }}
        continue-on-error: true  # Exit code 2 = drift detected

      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Infrastructure drift detected in ${{ matrix.environment }}. Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

The for_each and count Patterns

Creating multiple similar resources without copy-paste:

# for_each: create one resource per map entry (use over count when possible)
variable "environments" {
  default = {
    dev     = { instance_class = "db.t3.micro",  storage_gb = 20 }
    staging = { instance_class = "db.t3.medium", storage_gb = 50 }
    prod    = { instance_class = "db.r6g.xlarge", storage_gb = 100 }
  }
}

resource "aws_db_instance" "this" {
  for_each = var.environments

  identifier     = "myapp-${each.key}"
  instance_class = each.value.instance_class
  allocated_storage = each.value.storage_gb
  # ... other config
}

# Reference specific instances
output "prod_endpoint" {
  value = aws_db_instance.this["prod"].endpoint
}

# Why for_each over count: with count, removing middle item renumbers all later items
# With for_each, removing "staging" only destroys the staging instance
# count is fine for identical resources; for_each for distinct resources

# Dynamic blocks: create nested config blocks programmatically
resource "aws_security_group" "this" {
  name   = "app-sg"
  vpc_id = var.vpc_id

  dynamic "ingress" {
    for_each = var.allowed_ports
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = var.allowed_cidrs
    }
  }
}

Terraform Policy as Code with Sentinel / OPA

Before terraform apply touches production, validate that the plan meets organizational policies — no public S3 buckets, no unencrypted databases, required tags on all resources:

# Sentinel policy (HCP Terraform)
# Prevents any S3 bucket from being publicly accessible
import "tfplan/v2" as tfplan

main = rule {
    all tfplan.resource_changes as _, changes {
        // Non-bucket resources pass trivially; buckets must be private
        changes.type is not "aws_s3_bucket" or
        changes.change.after.acl in ["private", null]
    }
}

For open-source (no HCP Terraform), use conftest with OPA Rego:

# policies/required_tags.rego
package terraform

import future.keywords.in

required_tags := {"Environment", "Team", "CostCenter"}

deny[msg] {
    resource := input.resource_changes[_]
    some action in resource.change.actions
    action in {"create", "update"}
    resource_tags := {tag | resource.change.after.tags[tag]}
    missing := required_tags - resource_tags
    count(missing) > 0
    msg := sprintf("Resource %s is missing required tags: %v", [resource.address, missing])
}
# CI check
- name: Generate Terraform Plan JSON
  run: terraform show -json tfplan.binary > plan.json

- name: Policy Check
  run: conftest test plan.json --policy ./policies/
  # Fails the pipeline if any deny rules trigger

CI/CD Pipeline for Terraform

Infrastructure changes need the same review process as application code — but with extra care because mistakes can be irreversible. The standard CI/CD pipeline for Terraform:

# .github/workflows/terraform.yml
name: Terraform Plan / Apply

on:
  pull_request:
    paths: ['environments/**', 'modules/**']
  push:
    branches: [main]
    paths: ['environments/**', 'modules/**']

jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]

    permissions:
      id-token: write  # For OIDC authentication to AWS (no static credentials)
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/terraform-ci-${{ matrix.environment }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.7"

      - name: Terraform Init
        run: terraform init
        working-directory: environments/${{ matrix.environment }}

      - name: Terraform Validate
        run: terraform validate
        working-directory: environments/${{ matrix.environment }}

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan -no-color 2>&1 | tee plan-output.txt
        working-directory: environments/${{ matrix.environment }}

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const planOutput = require('fs').readFileSync('environments/${{ matrix.environment }}/plan-output.txt', 'utf8')
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan: ${{ matrix.environment }}\n\`\`\`\n${planOutput.slice(0, 65000)}\n\`\`\``
            })

  terraform-apply:
    needs: terraform-plan
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: prod  # GitHub Environment with required reviewers
    runs-on: ubuntu-latest

    steps:
      # ... same checkout, credentials, and init steps as the plan job.
      # The tfplan file must be uploaded as an artifact in the plan job
      # and downloaded here — a saved plan does not cross jobs otherwise.
      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan
        working-directory: environments/prod

Key practices:
- OIDC instead of static credentials: AWS IAM roles assumed via OIDC federation — no AWS keys stored in GitHub secrets
- Plan as PR comment: reviewers see exactly what will change before approving
- GitHub Environments with required reviewers: human approval before production apply
- Separate roles per environment: CI role for dev has fewer permissions than prod apply role
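The AWS side of the OIDC federation can itself live in Terraform (typically in the global/ directory). A sketch, with the repository name and role name as placeholders; the subject condition is what scopes the role to a single repo:

```hcl
# GitHub Actions OIDC trust for the CI role (account ID and repo are placeholders)
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/infrastructure:*"]  # Only workflows in this repo may assume the role
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci-prod"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

For production, tighten the sub condition further (e.g., restrict to the main branch or a specific GitHub Environment) rather than using the repo-wide wildcard.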

Production Considerations

Module Versioning

Once modules are shared across teams, pin versions to prevent unexpected changes:

# Pin to a specific tagged version in a private registry or Git tag
module "rds" {
  source  = "git::https://github.com/myorg/terraform-modules.git//rds?ref=v2.1.0"
  # ... inputs
}

# Or from Terraform Registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"  # Accept 5.x but not 6.x
}

Module updates are PRs. Teams subscribe to module changelog. Breaking changes increment the major version.

Import Existing Resources

Migrating existing manually-provisioned infrastructure to Terraform requires importing the existing state without recreating resources. The import block (Terraform 1.5+) makes this declarative:

# Import an existing RDS instance into Terraform management
import {
  to = module.payments_db.aws_db_instance.this
  id = "payments-prod"  # The RDS identifier
}

# Terraform generates the config to match the existing resource
# terraform plan -generate-config-out=generated.tf
# Review generated.tf, clean it up, then add to your config

Before import blocks, the workflow was the imperative terraform import command plus hand-writing the matching configuration, which was error-prone. The declarative approach is safer: the plan shows what would change before anything is applied.

Organizing Large Configurations with moved Blocks

Renaming or moving resources without destroying and recreating them:

# When you rename a resource (e.g., refactoring module structure),
# use moved blocks to update state without destroying infrastructure
moved {
  from = aws_db_instance.rds
  to   = module.payments_db.aws_db_instance.this
}

Without moved, Terraform destroys the old resource and creates a new one — catastrophic for databases. With moved, it updates the state reference only.
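The inverse operation, removing a resource from Terraform management without destroying it, got the same declarative treatment in Terraform 1.7 with removed blocks. A sketch:

```hcl
# Stop managing a resource while leaving it running (Terraform 1.7+)
removed {
  from = aws_db_instance.legacy

  lifecycle {
    destroy = false  # Forget it from state; do not destroy the actual instance
  }
}
```

This is the safe replacement for the older terraform state rm command: the plan shows the "forget" action before it happens.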

Conclusion

Terraform's advanced patterns solve the problems that flat configurations create as teams and infrastructure grow:
- Module architecture enables code reuse and consistent standards across teams
- Remote state with locking makes collaboration safe and auditable
- Drift detection in CI catches manual changes before they become incidents
- Policy as code prevents compliance violations from reaching production
- Module versioning gives consuming teams stability with an upgrade path

The investment in structure pays for itself the first time a new team member can provision a production database by calling a module with five lines of HCL — without understanding every networking and security detail behind it.

The pattern that prevents the most incidents: terraform plan in CI before every merge to the main branch, terraform apply only after the plan output is reviewed and approved. Infrastructure changes that skip review — "I'll just apply this small tweak manually" — are how configuration drift and outages start. The CI pipeline enforces the discipline when humans are in a hurry.

Terraform's adoption trajectory now includes a split between OpenTofu (the Linux Foundation fork created after HashiCorp changed the license) and Terraform itself (now under HashiCorp's Business Source License). OpenTofu aims to be a drop-in replacement and runs existing Terraform configurations unchanged. The choice between them comes down to licensing requirements and vendor support contracts, not technical capability; for the patterns in this guide they are essentially equivalent.

The universal advice: store state remotely from day one. Version-pin providers. Build modules before you have four copies of the same resource block. Drift detection in CI before drift becomes an incident. These habits are cheap to establish early and expensive to retrofit.

One pattern that accelerates module adoption: provide working examples alongside each module. An examples/basic/ directory with a minimal working instantiation reduces the time to a first successful terraform apply from hours to minutes. Teams adopting a new module don't want to read variable documentation; they want to copy a working example and modify it. Modules with examples get adopted; modules without examples get copy-pasted around instead.
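For the rds module above, the minimal example might look like this (the VPC and subnet IDs are placeholders a reader would swap for their own):

```hcl
# modules/rds/examples/basic/main.tf: smallest working instantiation
module "db" {
  source = "../.."

  identifier = "example-dev"
  vpc_id     = "vpc-0123456789abcdef0"            # Replace with your VPC ID
  subnet_ids = ["subnet-aaa111", "subnet-bbb222"] # Replace with your private subnets

  # Everything else falls back to module defaults
}

output "endpoint" {
  value = module.db.endpoint
}
```

Keeping the example inside the module's repository also means CI can run terraform validate (or a Terratest run) against it, so the example can never silently drift from the module's interface.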


Sources

  • HashiCorp Terraform documentation: modules, backends, remote state
  • Gruntwork "Terraform: Up & Running" (Brikman)
  • Terragrunt documentation: gruntwork-io.github.io/terragrunt
  • Terratest documentation: github.com/gruntwork-io/terratest
  • Open Policy Agent: conftest for policy-as-code
  • Terraform Sentinel documentation (HCP Terraform)
  • OpenTofu: opentofu.org
  • AWS Security blog: Terraform security best practices
