Infrastructure as Code in 2026: Terraform Modules, Terragrunt, State Management, and Testing

Introduction

Infrastructure as Code matured from "scripts that provision things" to a disciplined engineering practice with version control, peer review, automated testing, and deployment pipelines. That maturity was hard-won. The ecosystem earned its scars — teams that lost an afternoon to a corrupted state file, engineers who discovered a three-month-old manual console change during an incident, organizations that started with one Terraform monolith and spent six months carving it apart.

By 2026, Terraform is the standard IaC tool for AWS infrastructure, Terragrunt is the standard DRY wrapper around it, and the teams operating at scale have developed clear opinions on state file organization, module design, drift prevention, and testing. The tutorials still show you how to provision an EC2 instance. This post covers what happens after that — when you have five engineers, three environments, and twenty services, and you need infrastructure changes to be as reliable and reviewable as application code changes.

The problems that don't appear in tutorials are the ones that matter: state file contention when two engineers apply simultaneously, module sprawl when every team re-implements the same ECS service pattern, environment drift when prod silently diverges from staging over three months, and untestable Terraform that nobody dares touch because it might break something.

This post takes positions. One state file per workload. Terragrunt over workspace-based multi-env management. Terratest over manual verification. These opinions are grounded in operational experience, not framework loyalty. Where alternatives are genuinely reasonable, you'll see them called out. Where one approach is clearly better, the post says so.

The target is an advanced engineer comfortable with Terraform fundamentals who needs to scale an IaC practice across a team — not someone learning to write their first resource block.


1. Terraform Module Design

Modules are Terraform's unit of reuse. Done well, they reduce duplication and encode institutional knowledge about how your organization provisions infrastructure. Done poorly, they become wrappers with sixty required inputs and no sensible defaults — worse than no module at all.

Interface Design: Minimal Required Inputs, Sensible Defaults, Escape Hatches

A well-designed module interface follows three principles. Required inputs are the minimum set that cannot have a sensible default: the service name, the container image URI, the environment tag. Optional inputs with defaults cover the 80% case: port 8080, memory 512, CPU 256. Escape hatches let callers override anything the module doesn't parameterize directly, typically via a tags merge or a raw aws_ecs_task_definition override block.

Every input that callers have to specify because you were too lazy to provide a default is friction. Every required input that could be derived from other inputs is a design smell.
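As a sketch, a call against an interface like this might look as follows (the module path, image URI, and tag keys are illustrative, not a canonical interface):

```hcl
module "api" {
  source = "../../modules/ecs-service"

  # Required: no sensible default exists for these
  service_name    = "api-service"
  container_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.4.2"

  # Optional inputs (port 8080, cpu 256, memory 512) are accepted implicitly
  # by omission; the defaults cover the 80% case.

  # Escape hatch: merged onto everything the module creates
  tags = {
    Team = "platform"
  }
}
```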

Versioned Modules: Private Registry vs Git Tags

Use a private Terraform Registry (Terraform Cloud or a self-hosted registry) when you have a platform team responsible for module maintenance and a consuming team that should not need to understand the underlying implementation. Registry versioning enforces explicit upgrades and gives you a module changelog.

Use Git tags (git::https://github.com/org/terraform-modules.git//modules/ecs-service?ref=v1.4.2) when your org is small, module consumers are also contributors, and you want transparency into what changed without an additional system. Git tag references work identically to registry references in Terraform.

Never use ref=main in production. Pin to a tag. Floating references mean your infrastructure can change on the next terraform init.
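For example (the repository URL follows the pattern above; the module label is illustrative):

```hcl
# Pinned: infrastructure changes only when the ref is bumped in a reviewed PR.
module "ecs_service" {
  source = "git::https://github.com/org/terraform-modules.git//modules/ecs-service?ref=v1.4.2"
  # ...
}

# Floating: do NOT do this in production. The next `terraform init -upgrade`
# pulls whatever main currently points at.
# source = "git::https://github.com/org/terraform-modules.git//modules/ecs-service?ref=main"
```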

Module Composition

Root modules are the entry points — they call child modules and wire outputs between them. Child modules are the reusable units. Provider resources live inside child modules or occasionally directly in root modules when they're environment-specific one-offs.

The dependency graph should be a DAG, not a web. Networking outputs feed the app module. App module outputs feed the database module. Database module outputs feed the monitoring module. Circular dependencies between modules are a signal that your service boundary is wrong.
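A root module wiring that DAG might look like this sketch (module paths and output names are assumptions based on the flow described above):

```hcl
# env/prod/main.tf: the root module composes child modules and wires outputs
module "networking" {
  source = "../../modules/networking"
}

module "app" {
  source             = "../../modules/ecs-service"
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnet_ids
}

module "database" {
  source                    = "../../modules/rds"
  ingress_security_group_id = module.app.security_group_id
}

module "monitoring" {
  source         = "../../modules/monitoring"
  log_group_name = module.app.log_group_name
  db_endpoint    = module.database.db_endpoint
}
```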

The Wrapper Module Anti-Pattern

A wrapper module that does nothing but pass inputs through to an upstream module — adding no validation, no defaults, no composition — is technical debt. It adds a layer of indirection without adding value. The one justified exception: a wrapper that enforces your organization's tagging policy or naming convention that the upstream module doesn't enforce. Even then, consider whether a validation block in a shared variables.tf convention achieves the same goal without an extra module layer.
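A sketch of the validation-block alternative, enforcing a tagging policy in a shared variables.tf convention rather than a wrapper module (the required tag keys here are hypothetical):

```hcl
variable "tags" {
  type        = map(string)
  description = "Resource tags. Organization policy requires team and cost-center keys."

  validation {
    # Reject any call that omits the organization's mandatory tag keys.
    condition = alltrue([
      for required in ["team", "cost-center"] : contains(keys(var.tags), required)
    ])
    error_message = "tags must include the keys: team, cost-center."
  }
}
```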

Input Validation with validation Blocks

Fail at plan time, not apply time. validation blocks run during plan and produce clear error messages without making any API calls.

# modules/ecs-service/variables.tf

variable "service_name" {
  type        = string
  description = "Name of the ECS service. Used in resource naming and tagging."

  validation {
    # Enforce kebab-case naming: lowercase letters, digits, hyphens only.
    # Prevents CloudWatch metric dimension mismatches and IAM path errors.
    condition     = can(regex("^[a-z0-9][a-z0-9-]{1,48}[a-z0-9]$", var.service_name))
    error_message = "service_name must be 3-50 chars, lowercase alphanumeric and hyphens, no leading/trailing hyphens."
  }
}

variable "container_port" {
  type        = number
  description = "Port the container listens on. ALB target group health check uses this port."
  default     = 8080

  validation {
    condition     = var.container_port >= 1024 && var.container_port <= 65535
    error_message = "container_port must be a non-privileged port (1024-65535)."
  }
}

variable "desired_count" {
  type        = number
  description = "Desired number of ECS tasks. Production should be >= 2 for HA."
  default     = 2

  validation {
    condition     = var.desired_count >= 1 && var.desired_count <= 100
    error_message = "desired_count must be between 1 and 100."
  }
}

variable "cpu" {
  type        = number
  description = "CPU units for the ECS task (256, 512, 1024, 2048, 4096). See Fargate task size table."
  default     = 256

  validation {
    # Fargate only allows specific CPU values. Catching this at plan time avoids
    # a confusing AWS API error during apply.
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "cpu must be one of: 256, 512, 1024, 2048, 4096 (Fargate task CPU values)."
  }
}

variable "memory" {
  type        = number
  description = "Memory (MB) for the ECS task. Must match valid Fargate cpu/memory combinations."
  default     = 512

  validation {
    # Cross-variable references in validation conditions require Terraform 1.9+.
    # Partial table: combinations for cpu up to 1024; larger sizes are not checked here.
    condition = var.cpu > 1024 || contains(lookup({
      "256"  = [512, 1024, 2048]
      "512"  = [1024, 2048, 3072, 4096]
      "1024" = [2048, 3072, 4096, 5120, 6144, 7168, 8192]
    }, tostring(var.cpu), []), var.memory)
    error_message = "memory must be a valid Fargate memory value for the selected cpu (see the Fargate task size table)."
  }
}

variable "container_image" {
  type        = string
  description = "Docker image URI including tag or digest. Use digest for deterministic deployments."

  validation {
    # Require a tag or digest — bare image names without tags default to :latest,
    # which makes deployments non-deterministic.
    condition     = can(regex(":.+$", var.container_image))
    error_message = "container_image must include a tag or digest (e.g. myrepo/myimage:v1.2.3 or myrepo/myimage@sha256:...)."
  }
}

variable "environment_variables" {
  type        = map(string)
  description = "Non-secret environment variables. Secrets should use secrets_arns instead."
  default     = {}
}

variable "secrets_arns" {
  type        = map(string)
  description = "Map of env var name to Secrets Manager ARN. Injected as ECS secrets (not plain env vars)."
  default     = {}
}

variable "extra_security_group_ids" {
  type        = list(string)
  description = "Additional security group IDs to attach to the ECS service ENI. Escape hatch for VPC endpoint access."
  default     = []
}

variable "tags" {
  type        = map(string)
  description = "Tags merged onto all resources. Common tags (team, env) should come from the root module."
  default     = {}
}
# modules/ecs-service/main.tf
# The IAM roles (aws_iam_role.execution, aws_iam_role.task), the service security
# group (aws_security_group.service), and their variables (ecs_cluster_id,
# private_subnet_ids) are defined elsewhere in the module and omitted here for brevity.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}

# Region data source referenced by the awslogs configuration below.
data "aws_region" "current" {}

locals {
  # Merge caller-provided tags with module-generated tags.
  # Module-generated tags are the minimum required for cost allocation and incident response.
  base_tags = {
    ManagedBy = "terraform"
    Module    = "ecs-service"
  }
  merged_tags = merge(local.base_tags, var.tags)
}

resource "aws_cloudwatch_log_group" "service" {
  # One log group per service. Retention prevents unbounded CloudWatch costs.
  name              = "/ecs/${var.service_name}"
  retention_in_days = 30
  tags              = local.merged_tags
}

resource "aws_ecs_task_definition" "service" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([
    {
      name      = var.service_name
      image     = var.container_image
      essential = true

      portMappings = [
        {
          containerPort = var.container_port
          protocol      = "tcp"
        }
      ]

      # Separate environment (plain text) from secrets (Secrets Manager injection).
      # This distinction matters for audit logs and prevents accidental secret exposure in task definitions.
      environment = [
        for k, v in var.environment_variables : { name = k, value = v }
      ]

      secrets = [
        for k, arn in var.secrets_arns : { name = k, valueFrom = arn }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.service.name
          "awslogs-region"        = data.aws_region.current.name
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])

  tags = local.merged_tags
}

resource "aws_ecs_service" "service" {
  name            = var.service_name
  cluster         = var.ecs_cluster_id
  task_definition = aws_ecs_task_definition.service.arn
  desired_count   = var.desired_count

  launch_type = "FARGATE"

  network_configuration {
    subnets = var.private_subnet_ids
    security_groups = concat(
      [aws_security_group.service.id],
      var.extra_security_group_ids  # escape hatch for VPC endpoint SGs
    )
    assign_public_ip = false
  }

  # Ignore desired_count changes in Terraform state — auto-scaling manages this at runtime.
  # Without this, every terraform apply resets the count to the Terraform value,
  # undoing auto-scaling decisions.
  lifecycle {
    ignore_changes = [desired_count]
  }

  tags = local.merged_tags
}
# modules/ecs-service/outputs.tf

# Output everything a downstream module might need.
# It's cheap to output; it's expensive to add outputs later when a consumer needs them.

output "service_name" {
  value       = aws_ecs_service.service.name
  description = "ECS service name. Used by deployment scripts and monitoring dashboards."
}

output "service_arn" {
  value       = aws_ecs_service.service.id
  description = "ECS service ARN. Required for CodeDeploy deployment group configuration."
}

output "task_role_arn" {
  value       = aws_iam_role.task.arn
  description = "IAM role ARN for the ECS task. Attach additional policies here for S3/DynamoDB access."
}

output "security_group_id" {
  value       = aws_security_group.service.id
  description = "Security group ID for the ECS service ENI. Reference from RDS or ElastiCache ingress rules."
}

output "log_group_name" {
  value       = aws_cloudwatch_log_group.service.name
  description = "CloudWatch log group name. Used in CloudWatch Insights queries and alarms."
}

flowchart TD
    ROOT["Root Module\n(env/prod/main.tf)"] --> NET["networking module\noutputs: vpc_id, subnet_ids, sg_ids"]
    ROOT --> APP["app module (ecs-service)\ninputs: vpc_id, subnet_ids from networking\noutputs: service_sg_id, task_role_arn"]
    ROOT --> DB["database module (rds)\ninputs: service_sg_id from app\noutputs: db_endpoint, db_secret_arn"]
    ROOT --> MON["monitoring module\ninputs: log_group_name from app\n db_endpoint from database"]
    NET -->|vpc_id, private_subnet_ids| APP
    APP -->|security_group_id| DB
    APP -->|log_group_name| MON
    DB -->|db_secret_arn| APP


2. Terragrunt for DRY Multi-Environment Configurations

The Terraform multi-environment problem is well-documented and poorly solved by workspaces. Workspaces share a single backend configuration and differ only by a state key, so environment isolation is shallow, and Terraform has no native mechanism for workspace-specific variable files to inherit shared values. The result is either duplication — three copies of identical main.tf files — or fragile variable injection through TF_VAR_ environment variables in CI.

Terragrunt is an HCL wrapper around Terraform that solves this with a simple inheritance model: environment-specific configuration inherits from a shared _envcommon directory, overriding only what differs.

Directory Structure

infrastructure/
├── _envcommon/                    # Shared config inherited by all environments
│   ├── ecs-service.hcl            # Shared ECS service inputs
│   └── rds.hcl                    # Shared RDS inputs
├── terragrunt.hcl                 # Root config: remote state, provider generation
├── dev/
│   ├── env.hcl                    # Environment-specific vars (env = "dev", region = "us-east-1")
│   ├── ecs-service/
│   │   └── terragrunt.hcl         # Inherits _envcommon/ecs-service.hcl, overrides desired_count
│   └── rds/
│       └── terragrunt.hcl
├── staging/
│   ├── env.hcl
│   ├── ecs-service/
│   │   └── terragrunt.hcl
│   └── rds/
│       └── terragrunt.hcl
└── prod/
    ├── env.hcl
    ├── ecs-service/
    │   └── terragrunt.hcl         # Overrides: desired_count = 4, cpu = 1024
    └── rds/
        └── terragrunt.hcl         # Overrides: instance_class = "db.r6g.large"

Root terragrunt.hcl — Remote State and Provider Generation

# infrastructure/terragrunt.hcl
# Root config inherited by every child terragrunt.hcl via find_in_parent_folders()

locals {
  # Parse the environment from the directory path.
  # infrastructure/prod/ecs-service → env = "prod"
  path_components = split("/", path_relative_to_include())
  env             = local.path_components[0]

  # Load environment-specific variables from env.hcl
  env_vars   = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  aws_region = local.env_vars.locals.aws_region
  account_id = local.env_vars.locals.account_id
}

# Generate provider.tf in each module directory at plan/apply time.
# This avoids repeating the provider block in every module and ensures
# the assume_role ARN is always environment-specific.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"

  assume_role {
    # Each environment deploys into a separate AWS account.
    # This prevents a prod-targeted apply from hitting dev resources.
    role_arn = "arn:aws:iam::${local.account_id}:role/TerraformDeployRole"
  }

  default_tags {
    tags = {
      Environment = "${local.env}"
      ManagedBy   = "terragrunt"
    }
  }
}
EOF
}

# Remote state configuration.
# State files are isolated per module: s3://bucket/env/module-name/terraform.tfstate
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "myorg-terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # S3 bucket versioning must be enabled separately (see state management section).
    # Versioning allows state rollback after a botched apply.
  }
}

_envcommon/ecs-service.hcl — Shared Defaults

# infrastructure/_envcommon/ecs-service.hcl
# Inputs that are identical across dev/staging/prod.
# Environment-specific overrides happen in each env's terragrunt.hcl.

locals {
  env_vars  = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  env       = local.env_vars.locals.env
}

inputs = {
  service_name   = "api-service"
  container_port = 8080
  cpu            = 256    # Override in prod to 1024
  memory         = 512    # Override in prod to 2048
  desired_count  = 1      # Override in prod to 4
}

prod/ecs-service/terragrunt.hcl — Environment Override

# infrastructure/prod/ecs-service/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

include "envcommon" {
  # Pull in shared defaults. expose = true makes the included config available
  # as include.envcommon, so the inputs block below can selectively override it.
  path           = "${dirname(find_in_parent_folders())}/_envcommon/ecs-service.hcl"
  expose         = true
  merge_strategy = "deep"
}

# dependency block wires cross-module outputs without hardcoding ARNs.
# Terragrunt runs a targeted plan/output on the dependency before applying this module.
dependency "networking" {
  config_path = "../networking"

  # mock_outputs are used during `plan` when the dependency hasn't been applied yet.
  # This enables plan-on-PR without requiring a live networking stack.
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}

dependency "rds" {
  config_path  = "../rds"
  mock_outputs = {
    db_secret_arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:mock-db-secret"
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}

terraform {
  source = "git::https://github.com/myorg/terraform-modules.git//modules/ecs-service?ref=v2.1.0"
}

# Deep merge with envcommon — only override what differs in prod.
inputs = merge(
  include.envcommon.inputs,
  {
    # Production capacity — override shared defaults
    cpu           = 1024
    memory        = 2048
    desired_count = 4

    # Wire in dependency outputs — no hardcoded ARNs
    vpc_id             = dependency.networking.outputs.vpc_id
    private_subnet_ids = dependency.networking.outputs.private_subnet_ids
    ecs_cluster_id     = dependency.networking.outputs.ecs_cluster_id

    secrets_arns = {
      DATABASE_URL = dependency.rds.outputs.db_secret_arn
    }

    tags = {
      Team        = "platform"
      CostCenter  = "engineering"
    }
  }
)

flowchart TD
    ROOT["infrastructure/terragrunt.hcl\nRemote state config\nProvider generation\nAccount ID, region locals"]
    ENVCOMMON["_envcommon/ecs-service.hcl\ncpu=256, memory=512\ndesired_count=1\ncontainer_port=8080"]
    ENVHCL["prod/env.hcl\nenv=prod\naws_region=us-east-1\naccount_id=111122223333"]
    ROOT -->|"find_in_parent_folders()"| DEV["dev/ecs-service/terragrunt.hcl\ninherits envcommon\nno overrides"]
    ROOT -->|"find_in_parent_folders()"| STG["staging/ecs-service/terragrunt.hcl\ninherits envcommon\ndesired_count=2"]
    ROOT -->|"find_in_parent_folders()"| PROD["prod/ecs-service/terragrunt.hcl\nmerge(envcommon.inputs, {...})\ncpu=1024, desired_count=4"]
    ENVCOMMON -->|"include envcommon"| DEV
    ENVCOMMON -->|"include envcommon"| STG
    ENVCOMMON -->|"include envcommon"| PROD
    ENVHCL -->|"read_terragrunt_config"| ROOT


3. State Management at Scale

State files are the source of truth for what Terraform believes exists in the world. Treating them carelessly — one file for everything, no encryption, no locking — is the fastest path to a disaster that takes hours to recover from.

The Fundamental Rule: One State File Per Workload

Not one per environment, not one per region, not one monolith. One per workload — the unit of infrastructure that gets deployed, scaled, and destroyed together.

"Workload" is a judgment call, but a useful heuristic: if two resources are never applied in the same operation, they belong in different state files. Networking (VPCs, subnets, route tables) is deployed once and rarely changed. Application infrastructure (ECS services, RDS instances) changes frequently. Monitoring and alerting changes on its own cadence. Keep them separate.

A monolithic state file has two failure modes. The first is blast radius: a bug in one resource's configuration can corrupt the entire state. The second is velocity: every change requires a full plan across all resources, even unrelated ones, which is slow and increases the chance of accidental drift.

S3 Backend Configuration

# This is a partial backend configuration.
# The bucket and key are injected at `terraform init` time via -backend-config flags
# or the Terragrunt remote_state block — never hardcoded in version control.
# This avoids exposing account-specific details in the public module source.

terraform {
  backend "s3" {
    # bucket and key are injected by Terragrunt's remote_state block.
    # Do not specify them here if using Terragrunt.

    region = "us-east-1"

    # Encrypt state at rest. State files contain plaintext secrets (database passwords,
    # API keys) because Terraform stores all resource attributes — including sensitive ones.
    encrypt = true

    # KMS key for state encryption. Default SSE-S3 is acceptable; KMS gives you
    # rotation, audit logs, and cross-account access control.
    kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"

    # DynamoDB table for state locking.
    # Lock prevents concurrent applies from corrupting state.
    dynamodb_table = "terraform-state-lock"
  }
}
# S3 bucket for state storage — bootstrapped manually or via a separate "bootstrap" module.
# This is the one piece of infrastructure that cannot manage itself.

resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state-${data.aws_caller_identity.current.account_id}"

  # Prevent accidental deletion of the state bucket.
  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Purpose   = "terraform-state"
    ManagedBy = "bootstrap"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
  # Versioning is the rollback mechanism for state files.
  # After a botched apply, you can restore the previous state version from S3
  # and run terraform apply again to converge.
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"  # On-demand billing; lock table traffic is spiky
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Purpose   = "terraform-state-lock"
    ManagedBy = "bootstrap"
  }
}

State File Refactoring and Drift Recovery

terraform state mv moves resources between state files without destroying and recreating them. Use it when splitting a monolith or renaming a resource within a module refactor. Always take a state backup before any state mv operation.

terraform import brings existing resources under Terraform management. Use it when a resource was created manually and you want IaC ownership going forward.
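Since Terraform 1.5 there is also a declarative alternative: an import block, which makes the import itself reviewable in a PR and can generate matching configuration via terraform plan -generate-config-out. A sketch (the resource address and ID are illustrative):

```hcl
import {
  # Adopt a manually created log group into this module's state on the next apply.
  to = aws_cloudwatch_log_group.service
  id = "/ecs/api-service"
}
```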

Broken DynamoDB locks (from a killed apply process) show up as "Error acquiring the state lock." Verify the lock is actually stale by checking the LockID in DynamoDB and comparing the timestamp. If it's more than a few hours old and no apply is running, use terraform force-unlock <LOCK_ID>. Never force-unlock a live apply.

flowchart LR
    subgraph BAD["Monolithic State — High Risk"]
        M["terraform.tfstate\n(single file)\nVPC + ECS + RDS + IAM\n+ CloudWatch + Route53"]
    end
    subgraph OK["Per-Environment State — Better"]
        D["dev/terraform.tfstate\nAll dev resources"]
        S["staging/terraform.tfstate\nAll staging resources"]
        P["prod/terraform.tfstate\nAll prod resources"]
    end
    subgraph GOOD["Per-Workload State — Recommended"]
        N1["prod/networking\n.tfstate"]
        A1["prod/ecs-service\n.tfstate"]
        R1["prod/rds\n.tfstate"]
        M1["prod/monitoring\n.tfstate"]
    end
    style BAD fill:#ffeaea,stroke:#cc0000
    style OK fill:#fff8e1,stroke:#f9a825
    style GOOD fill:#e8f5e9,stroke:#388e3c
    BAD -->|"Any change plans entire infra\nOne corruption = everything broken"| OK
    OK -->|"Still couples networking+app+db\nFull-env plan for single service change"| GOOD


4. Drift Detection and Remediation

Drift is the delta between what Terraform's state believes exists and what actually exists in AWS. It accumulates through three vectors: manual console changes by engineers under pressure, auto-scaling modifying desired counts, and external automation (Lambda functions, AWS Config remediations, third-party tools) creating or modifying resources.

Undetected drift is the most dangerous state your infrastructure can be in. You think you have IaC. You don't. You have IaC plus a shadow layer of undocumented manual changes that will survive until the next terraform destroy or a major refactor wipes them out.

Drift Detection in CI

Run terraform plan on a schedule — not just on pull requests. A plan that runs only when engineers make changes will never catch drift from external sources.

# .github/workflows/drift-detection.yml

name: Drift Detection

on:
  # Run at 6 AM UTC on weekdays — before the engineering day starts, so drift
  # is visible in Slack before anyone starts making infrastructure changes.
  schedule:
    - cron: "0 6 * * 1-5"
  # Also allow manual trigger for on-demand drift checks.
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Run drift detection for all environments in parallel.
        environment: [dev, staging, prod]
        module: [networking, ecs-service, rds, monitoring]
    permissions:
      id-token: write  # Required for OIDC authentication to AWS
      contents: read

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars[format('{0}_ACCOUNT_ID', matrix.environment)] }}:role/GitHubActionsRole
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"

      - name: Setup Terragrunt
        run: |
          wget -qO terragrunt "https://github.com/gruntwork-io/terragrunt/releases/download/v0.67.0/terragrunt_linux_amd64"
          chmod +x terragrunt
          sudo mv terragrunt /usr/local/bin/

      - name: Terragrunt Plan (Drift Detection)
        id: plan
        working-directory: infrastructure/${{ matrix.environment }}/${{ matrix.module }}
        run: |
          # -detailed-exitcode: exit 0 = no changes, exit 1 = error, exit 2 = changes detected.
          # GitHub Actions runs bash with `-e -o pipefail`, which would kill the step
          # on exit code 2 before we record it, so disable errexit around the plan.
          set +e
          terragrunt plan -detailed-exitcode -out=plan.tfplan 2>&1 | tee plan_output.txt
          plan_exit=${PIPESTATUS[0]}
          set -e
          echo "exitcode=${plan_exit}" >> "$GITHUB_OUTPUT"
        continue-on-error: true

      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": ":rotating_light: *Terraform Drift Detected*",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": ":rotating_light: *Terraform Drift Detected*\n*Environment:* ${{ matrix.environment }}\n*Module:* ${{ matrix.module }}\n*Workflow:* <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Details>"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_DRIFT_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

      - name: Fail on Error (not on drift)
        # Exit code 2 means drift, which we alert on but don't fail the workflow.
        # Any other nonzero code means a real error (auth failure, provider issue),
        # which should fail the job.
        if: steps.plan.outputs.exitcode != '0' && steps.plan.outputs.exitcode != '2'
        run: exit 1

Handling ignore_changes for Intentional Drift

Some drift is intentional. Auto-scaling modifies desired_count at runtime. Terraform should not reset it on every apply. Use ignore_changes for this, but document why.

resource "aws_ecs_service" "service" {
  # ... other config ...

  lifecycle {
    # desired_count is managed by Application Auto Scaling at runtime.
    # Without this ignore, terraform apply would reset the count to the Terraform value,
    # overriding auto-scaling decisions. This is intentional drift we accept.
    ignore_changes = [desired_count]
  }
}

Preventing Drift: Break-Glass Procedures

The goal is not zero manual console access — emergencies happen. The goal is zero undocumented manual console access. Implement a break-glass procedure: an IAM role that grants console write access, requires MFA, logs all API calls via CloudTrail, and triggers a PagerDuty alert when assumed. After every break-glass event, the engineer responsible must open a Terraform PR capturing the manual change before the end of the sprint.
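A sketch of the trust-policy side of such a role; names and the account ID are illustrative, and the PagerDuty hookup (an EventBridge rule matching the CloudTrail AssumeRole event) is omitted:

```hcl
data "aws_iam_policy_document" "break_glass_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::123456789012:root"] # engineers in this account
    }

    # Require MFA on the assuming session. CloudTrail logs the AssumeRole call,
    # which an EventBridge rule can match to page the on-call.
    condition {
      test     = "Bool"
      variable = "aws:MultiFactorAuthPresent"
      values   = ["true"]
    }
  }
}

resource "aws_iam_role" "break_glass" {
  name                 = "BreakGlassConsoleWrite"
  assume_role_policy   = data.aws_iam_policy_document.break_glass_trust.json
  max_session_duration = 3600 # one hour keeps the emergency window short
}
```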


5. Testing Infrastructure Code

"We can't test infrastructure" is a belief, not a fact. Terraform can be tested at multiple levels — unit, contract, integration, policy, and security — and each level catches different classes of bugs.

Terratest: Real Resources, Real Assertions

Terratest runs actual Terraform, provisions real AWS resources in a test account, runs assertions against them, then destroys everything. It's slow (5-10 minutes per test), it costs money (fractions of a cent per test run), and it catches things static analysis never will.

// modules/ecs-service/test/ecs_service_test.go

package test

import (
    "fmt"
    "testing"
    "time"

    awssdk "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestECSServiceModule(t *testing.T) {
    t.Parallel()

    // Use a unique suffix to avoid name conflicts when tests run concurrently.
    uniqueID := fmt.Sprintf("test-%d", time.Now().UnixMilli()%10000)
    serviceName := fmt.Sprintf("test-svc-%s", uniqueID)
    awsRegion := "us-east-1"

    terraformOptions := &terraform.Options{
        // The examples/ directory contains a minimal, self-contained instantiation
        // of the module for testing. It provisions its own VPC and ECS cluster.
        TerraformDir: "../examples/basic",

        Vars: map[string]interface{}{
            "service_name":    serviceName,
            "container_image": "nginx:1.25.3", // pinned tag — no :latest in tests
            "container_port":  8080,
            "desired_count":   1,
            "aws_region":      awsRegion,
        },

        // Retry on transient AWS API errors. ECS service creation can take 30-60 seconds.
        RetryableTerraformErrors: map[string]string{
            "Error creating ECS Service": "ECS service creation is eventually consistent",
            "ResourceInUseException":     "Resource not yet available",
        },
        MaxRetries:         3,
        TimeBetweenRetries: 15 * time.Second,
    }

    // Always destroy resources after the test — even if it fails.
    defer terraform.Destroy(t, terraformOptions)

    terraform.InitAndApply(t, terraformOptions)

    // --- Assertions ---

    serviceArn := terraform.Output(t, terraformOptions, "service_arn")
    require.NotEmpty(t, serviceArn, "service_arn output must not be empty")

    // Verify the ECS service exists and is ACTIVE with the requested task count.
    // Terratest releases built against AWS SDK v1 return pointer fields, hence
    // the awssdk.StringValue / awssdk.Int64Value dereferences.
    clusterArn := terraform.Output(t, terraformOptions, "cluster_arn")
    service := aws.GetEcsService(t, awsRegion, clusterArn, serviceName)
    assert.Equal(t, "ACTIVE", awssdk.StringValue(service.Status), "ECS service should be ACTIVE")
    assert.Equal(t, int64(1), awssdk.Int64Value(service.DesiredCount), "desired count should match input")

    // Verify the CloudWatch log group was created.
    logGroupName := terraform.Output(t, terraformOptions, "log_group_name")
    assert.Equal(t, fmt.Sprintf("/ecs/%s", serviceName), logGroupName)

    // Verify the task role ARN follows the expected naming convention.
    taskRoleArn := terraform.Output(t, terraformOptions, "task_role_arn")
    assert.Contains(t, taskRoleArn, serviceName, "task role ARN should contain service name")

    // Verify the security group has no ingress rule open to 0.0.0.0/0.
    sgID := terraform.Output(t, terraformOptions, "security_group_id")
    ec2Client := aws.NewEc2Client(t, awsRegion)
    out, err := ec2Client.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
        GroupIds: []*string{awssdk.String(sgID)},
    })
    require.NoError(t, err)
    require.Len(t, out.SecurityGroups, 1)
    for _, perm := range out.SecurityGroups[0].IpPermissions {
        for _, ipRange := range perm.IpRanges {
            assert.NotEqual(t, "0.0.0.0/0", awssdk.StringValue(ipRange.CidrIp),
                "ECS service security group must not allow ingress from 0.0.0.0/0")
        }
    }
}
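Terratest suites run under the standard Go test runner. The one flag that trips people up is the timeout: Go kills any test after 10 minutes by default, and apply-plus-assert-plus-destroy for an ECS service routinely takes longer. A typical invocation from the module's test/ directory (the test name matches the example above):

```shell
# -timeout must cover apply, assertions, AND the deferred destroy —
# Go's default 10-minute test timeout is not enough for real infrastructure.
go test -v -timeout 45m -run TestECSServiceModule ./...
```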

Checkov: Static Security Analysis

Checkov runs without any AWS credentials — it analyzes Terraform plan output or raw HCL for security misconfigurations. Add it to the PR pipeline, before apply.

# In your GitHub Actions plan workflow, after terraform plan:

- name: Run Checkov
  uses: bridgecrewio/checkov-action@v12
  with:
    directory: .
    framework: terraform
    # Fail the build on HIGH and CRITICAL findings.
    # MEDIUM findings are reported but don't block merge — review weekly.
    # (Severity-based soft-fail needs an API key for severity data; without
    # one, soft-fail on explicit check IDs instead.)
    soft_fail_on: MEDIUM,LOW,INFO
    output_format: github_failed_only
    # Skip checks that don't apply to your environment, and document why:
    # CKV_AWS_116 (Lambda DLQ) — no Lambda functions in this stack.
    # CKV_AWS_338 (1-year log retention) — retention is set by org-wide policy.
    skip_check: CKV_AWS_116,CKV_AWS_338
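The same scan runs locally before a PR is even opened, which shortens the feedback loop from minutes to seconds. A sketch of a pre-push invocation — the module path here is illustrative:

```shell
# Scan raw HCL directly — no AWS credentials or plan file required.
# --compact hides passed checks; --skip-check mirrors the CI configuration.
checkov -d modules/ecs-service \
  --framework terraform \
  --compact \
  --skip-check CKV_AWS_116,CKV_AWS_338
```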

Infracost: Cost Estimation in CI

# .github/workflows/infracost.yml
# Runs on every PR that modifies Terraform files.
# Posts a cost diff comment showing the monthly cost change.

- name: Setup Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

# Generate the baseline from the PR's base branch first — the diff step
# below needs something to compare against.
- name: Checkout base branch
  uses: actions/checkout@v4
  with:
    ref: ${{ github.event.pull_request.base.ref }}

- name: Generate Infracost baseline
  run: |
    infracost breakdown \
      --path=. \
      --format=json \
      --out-file=infracost-base.json

- name: Checkout PR branch
  uses: actions/checkout@v4
  with:
    clean: false  # keep infracost-base.json from the previous step

- name: Generate Infracost diff
  run: |
    # Cost estimate for the proposed changes, relative to the base branch
    infracost diff \
      --path=. \
      --format=json \
      --compare-to=infracost-base.json \
      --out-file=infracost-diff.json

- name: Post Infracost comment
  run: |
    # Engineers reviewing the PR see "this change adds $47/month" before approving.
    # --behavior=update keeps a single comment per PR instead of stacking new ones.
    infracost comment github \
      --path=infracost-diff.json \
      --repo=$GITHUB_REPOSITORY \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ github.token }} \
      --behavior=update

6. CI/CD for Infrastructure

Infrastructure CI/CD has different requirements than application CI/CD. The blast radius of a bad deploy is higher. Rollback is harder. The feedback loop from plan to verify is slower. The pipeline design has to account for all three.

Plan on PR: Show Everything Before Merge

# .github/workflows/terraform-pr.yml

name: Terraform Plan

on:
  pull_request:
    paths:
      - "infrastructure/**"
      - "modules/**"

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: write

    strategy:
      matrix:
        environment: [dev, staging, prod]

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars[format('{0}_PLAN_ROLE', matrix.environment)] }}
          aws-region: us-east-1

      - name: Setup Terraform and Terragrunt
        run: |
          # unzip can't read from stdin, so download the archive to disk first.
          curl -fsSLo terraform.zip https://releases.hashicorp.com/terraform/1.9.0/terraform_1.9.0_linux_amd64.zip
          unzip -o terraform.zip terraform
          sudo mv terraform /usr/local/bin/
          wget -qO terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.67.0/terragrunt_linux_amd64
          chmod +x terragrunt && sudo mv terragrunt /usr/local/bin/

      - name: Terragrunt Plan
        id: plan
        working-directory: infrastructure/${{ matrix.environment }}
        run: |
          terragrunt run-all plan \
            --terragrunt-non-interactive \
            -out=tfplan 2>&1 | tee plan_output.txt

      - name: Run Checkov Policy Check
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infrastructure/${{ matrix.environment }}
          framework: terraform
          soft_fail_on: MEDIUM,LOW,INFO

      - name: Setup Infracost
        if: matrix.environment == 'prod'  # Cost estimate only gates prod changes
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Infracost Cost Diff
        if: matrix.environment == 'prod'
        run: |
          infracost diff \
            --path=infrastructure/prod \
            --format=json \
            --out-file=infracost.json

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const planOutput = fs.readFileSync('infrastructure/${{ matrix.environment }}/plan_output.txt', 'utf8');
            // GitHub comments cap at 65,536 characters — keep the tail of the plan,
            // which is where the resource change summary lives.
            const body = `## Terraform Plan — \`${{ matrix.environment }}\`\n\`\`\`\n${planOutput.slice(-30000)}\n\`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });

Apply on Merge: Automated Dev, Gated Prod

# .github/workflows/terraform-apply.yml

name: Terraform Apply

on:
  push:
    branches: [main]
    paths:
      - "infrastructure/**"

jobs:
  apply-dev:
    runs-on: ubuntu-latest
    environment: dev  # No approval required for dev
    permissions:
      id-token: write
      contents: read

    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.DEV_APPLY_ROLE }}
          aws-region: us-east-1
      - name: Apply to Dev
        working-directory: infrastructure/dev
        run: |
          # Apply one module at a time — --terragrunt-parallelism 1 limits blast
          # radius and avoids race conditions between dependent modules.
          # (Comments can't sit inside a backslash-continued command.)
          terragrunt run-all apply \
            --terragrunt-non-interactive \
            --terragrunt-parallelism 1

  apply-staging:
    needs: apply-dev  # Staging applies only after dev succeeds
    runs-on: ubuntu-latest
    environment: staging  # Requires approval from staging-approvers team
    permissions:
      id-token: write  # Required for OIDC role assumption
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.STAGING_APPLY_ROLE }}
          aws-region: us-east-1
      - name: Apply to Staging
        working-directory: infrastructure/staging
        run: terragrunt run-all apply --terragrunt-non-interactive --terragrunt-parallelism 1

  apply-prod:
    needs: apply-staging  # Prod applies only after staging succeeds
    runs-on: ubuntu-latest
    environment: production  # Requires approval from senior-engineers team — configured in GitHub Environments
    permissions:
      id-token: write  # Required for OIDC role assumption
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.PROD_APPLY_ROLE }}
          aws-region: us-east-1
      - name: Apply to Prod
        working-directory: infrastructure/prod
        run: terragrunt run-all apply --terragrunt-non-interactive --terragrunt-parallelism 1

Rollback Strategy

Terraform has no native rollback. The rollback mechanism is manual: retrieve the previous state file version from S3 (which requires S3 versioning on the state bucket), push it back as the current state, then plan and apply to converge reality with it. This only works if the underlying infrastructure hasn't been destroyed — which is why prevent_destroy = true on critical resources matters.
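With versioning enabled on the state bucket, the restore itself is a handful of AWS CLI calls. A sketch — the bucket and key names are placeholders, and note that terraform state push refuses a state with an older serial unless forced:

```shell
# 1. Find the last known-good version of the state object.
aws s3api list-object-versions \
  --bucket my-tf-state \
  --prefix prod/networking/terraform.tfstate \
  --query 'Versions[*].[VersionId,LastModified]' --output table

# 2. Download that version locally.
aws s3api get-object \
  --bucket my-tf-state \
  --key prod/networking/terraform.tfstate \
  --version-id <PREVIOUS_VERSION_ID> \
  terraform.tfstate

# 3. Push it back as the current state (-force overrides the serial check),
#    then plan to see what converging on the old state would actually change.
terraform state push -force terraform.tfstate
terraform plan
```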

For destructive changes (renaming a resource, changing an attribute that forces replacement), the safer path is usually expand-and-contract: apply the new resource alongside the old one, migrate traffic, then destroy the old one — rather than attempting a rollback.
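For the specific case of a rename, Terraform 1.1+ offers a third option: a moved block records the address change in configuration, so the next plan updates the state address in place instead of destroying and recreating the resource. The resource names below are illustrative:

```hcl
# Handles a rename without destroy/recreate: Terraform rewrites the state
# address from the old name to the new one on the next apply.
moved {
  from = aws_security_group.svc
  to   = aws_security_group.service
}
```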

Atlantis vs Terraform Cloud vs custom GitHub Actions: Atlantis is the right choice when you want plan/apply workflows in your existing GitHub PRs without a SaaS dependency, and you're willing to operate the server. Terraform Cloud is the right choice when you want audit logs, Sentinel policies, and a managed execution environment. Custom GitHub Actions (like the examples above) are the right choice when your team already understands GitHub Actions and the overhead of operating another system isn't justified.


Conclusion

Infrastructure as Code at scale is not harder than application engineering — it requires the same disciplines applied to a different problem domain. Module design with explicit interfaces and validation blocks makes configurations reviewable. Terragrunt's inheritance model makes multi-environment management maintainable without copying files. Per-workload state isolation limits blast radius. Scheduled drift detection makes the gap between declared and actual state visible before it becomes an incident. Terratest and Checkov bring the same test-before-merge hygiene to infrastructure that unit tests bring to application code.

The teams that get IaC right treat it exactly like application code: PRs, reviews, tests, CI/CD, and a culture of fixing broken state the same day it's detected rather than letting it accumulate. The teams that don't are the ones debugging 2 AM manual console changes with no audit trail, a corrupted state file, and no clear picture of what was actually running before the incident.

Start with the module design principles and state isolation strategy — those compound over time. Add Terragrunt when copy-paste between environment directories starts causing divergence. Add testing when you have enough modules to justify the investment. Add drift detection the day you catch someone making a console change without a follow-up Terraform PR.

The patterns in this post are not theoretical. They're the patterns that survive contact with production.

