Infrastructure as Code in 2026: Terraform Modules, Terragrunt, State Management, and Testing

Introduction
Infrastructure as Code matured from "scripts that provision things" to a disciplined engineering practice with version control, peer review, automated testing, and deployment pipelines. That maturity was hard-won. The ecosystem earned its scars — teams that lost an afternoon to a corrupted state file, engineers who discovered a three-month-old manual console change during an incident, organizations that started with one Terraform monolith and spent six months carving it apart.
By 2026, Terraform is the standard IaC tool for AWS infrastructure, Terragrunt is the standard DRY wrapper around it, and the teams operating at scale have developed clear opinions on state file organization, module design, drift prevention, and testing. The tutorials still show you how to provision an EC2 instance. This post covers what happens after that — when you have five engineers, three environments, and twenty services, and you need infrastructure changes to be as reliable and reviewable as application code changes.
The problems that don't appear in tutorials are the ones that matter: state file contention when two engineers apply simultaneously, module sprawl when every team re-implements the same ECS service pattern, environment drift when prod silently diverges from staging over three months, and untestable Terraform that nobody dares touch because it might break something.
This post takes positions. One state file per workload. Terragrunt over workspace-based multi-env management. Terratest over manual verification. These opinions are grounded in operational experience, not framework loyalty. Where alternatives are genuinely reasonable, you'll see them called out. Where one approach is clearly better, the post says so.
The target reader is an advanced engineer comfortable with Terraform fundamentals who needs to scale an IaC practice across a team — not someone learning to write their first resource block.
1. Terraform Module Design
Modules are Terraform's unit of reuse. Done well, they reduce duplication and encode institutional knowledge about how your organization provisions infrastructure. Done poorly, they become wrappers with sixty required inputs and no sensible defaults — worse than no module at all.
Interface Design: Minimal Required Inputs, Sensible Defaults, Escape Hatches
A well-designed module interface follows three principles. Required inputs are the minimum set that cannot have a sensible default: the service name, the container image URI, the environment tag. Optional inputs with defaults cover the 80% case: port 8080, memory 512, CPU 256. Escape hatches let callers override anything the module doesn't parameterize directly, typically via a tags merge or a raw aws_ecs_task_definition override block.
Every input that callers have to specify because you were too lazy to provide a default is friction. Every required input that could be derived from other inputs is a design smell.
Versioned Modules: Private Registry vs Git Tags
Use a private Terraform Registry (Terraform Cloud or a self-hosted registry) when you have a platform team responsible for module maintenance and a consuming team that should not need to understand the underlying implementation. Registry versioning enforces explicit upgrades and gives you a module changelog.
Use Git tags (git::https://github.com/org/terraform-modules.git//modules/ecs-service?ref=v1.4.2) when your org is small, module consumers are also contributors, and you want transparency into what changed without an additional system. Git tag references work identically to registry references in Terraform.
Never use ref=main in production. Pin to a tag. Floating references mean your infrastructure can change on the next terraform init.
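A pinned module call looks like this (the repository URL, module path, and version are illustrative):

```hcl
module "ecs_service" {
  # Pinned to an immutable tag: upgrades are explicit, reviewable diffs,
  # not surprises at the next terraform init.
  source = "git::https://github.com/org/terraform-modules.git//modules/ecs-service?ref=v1.4.2"

  service_name    = "api-service"
  container_image = "myrepo/api:v3.2.1"
}
```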
Module Composition
Root modules are the entry points — they call child modules and wire outputs between them. Child modules are the reusable units. Provider resources live inside child modules or occasionally directly in root modules when they're environment-specific one-offs.
The dependency graph should be a DAG, not a web. Networking outputs feed the app module. App module outputs feed the database module. Database module outputs feed the monitoring module. Circular dependencies between modules are a signal that your service boundary is wrong.
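A minimal sketch of that wiring in a root module (module names and paths are illustrative):

```hcl
# Root module: outputs flow one way, networking -> app -> monitoring.
module "networking" {
  source = "./modules/networking"
}

module "app" {
  source             = "./modules/ecs-service"
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnet_ids
}

module "monitoring" {
  source       = "./modules/monitoring"
  service_name = module.app.service_name
}
```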
The Wrapper Module Anti-Pattern
A wrapper module that does nothing but pass inputs through to an upstream module — adding no validation, no defaults, no composition — is technical debt. It adds a layer of indirection without adding value. The one justified exception: a wrapper that enforces your organization's tagging policy or naming convention that the upstream module doesn't enforce. Even then, consider whether a validation block in a shared variables.tf convention achieves the same goal without an extra module layer.
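As a sketch of that alternative, a shared tags variable with a validation block can enforce the policy without a wrapper layer (the required keys here are hypothetical; substitute your org's mandatory tags):

```hcl
variable "tags" {
  type        = map(string)
  description = "Resource tags. Must include the org-mandated cost allocation keys."

  validation {
    # Reject any caller that omits the mandatory tag keys.
    condition     = alltrue([for k in ["Team", "CostCenter"] : contains(keys(var.tags), k)])
    error_message = "tags must include the Team and CostCenter keys."
  }
}
```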
Input Validation with validation Blocks
Fail at plan time, not apply time. validation blocks run during plan and produce clear error messages without making any API calls.
```hcl
# modules/ecs-service/variables.tf

variable "service_name" {
  type        = string
  description = "Name of the ECS service. Used in resource naming and tagging."

  validation {
    # Enforce kebab-case naming: lowercase letters, digits, hyphens only.
    # Prevents CloudWatch metric dimension mismatches and IAM path errors.
    condition     = can(regex("^[a-z0-9][a-z0-9-]{1,48}[a-z0-9]$", var.service_name))
    error_message = "service_name must be 3-50 chars, lowercase alphanumeric and hyphens, no leading/trailing hyphens."
  }
}

variable "container_port" {
  type        = number
  description = "Port the container listens on. ALB target group health check uses this port."
  default     = 8080

  validation {
    condition     = var.container_port >= 1024 && var.container_port <= 65535
    error_message = "container_port must be a non-privileged port (1024-65535)."
  }
}

variable "desired_count" {
  type        = number
  description = "Desired number of ECS tasks. Production should be >= 2 for HA."
  default     = 2

  validation {
    condition     = var.desired_count >= 1 && var.desired_count <= 100
    error_message = "desired_count must be between 1 and 100."
  }
}

variable "cpu" {
  type        = number
  description = "CPU units for the ECS task (256, 512, 1024, 2048, 4096). See Fargate task size table."
  default     = 256

  validation {
    # Fargate only allows specific CPU values. Catching this at plan time avoids
    # a confusing AWS API error during apply.
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "cpu must be one of: 256, 512, 1024, 2048, 4096 (Fargate task CPU values)."
  }
}

variable "memory" {
  type        = number
  description = "Memory (MB) for the ECS task. Must match valid Fargate cpu/memory combinations."
  default     = 512
}

variable "container_image" {
  type        = string
  description = "Docker image URI including tag or digest. Use digest for deterministic deployments."

  validation {
    # Require a tag or digest — bare image names without tags default to :latest,
    # which makes deployments non-deterministic.
    condition     = can(regex(":.+$", var.container_image))
    error_message = "container_image must include a tag or digest (e.g. myrepo/myimage:v1.2.3 or myrepo/myimage@sha256:...)."
  }
}

variable "environment_variables" {
  type        = map(string)
  description = "Non-secret environment variables. Secrets should use secrets_arns instead."
  default     = {}
}

variable "secrets_arns" {
  type        = map(string)
  description = "Map of env var name to Secrets Manager ARN. Injected as ECS secrets (not plain env vars)."
  default     = {}
}

variable "extra_security_group_ids" {
  type        = list(string)
  description = "Additional security group IDs to attach to the ECS service ENI. Escape hatch for VPC endpoint access."
  default     = []
}

variable "tags" {
  type        = map(string)
  description = "Tags merged onto all resources. Common tags (team, env) should come from the root module."
  default     = {}
}
```
```hcl
# modules/ecs-service/main.tf
# (IAM roles, the service security group, and the cluster/subnet variables
# are defined elsewhere in the module and omitted here for brevity.)

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}

# Used by the log configuration below to resolve the current region.
data "aws_region" "current" {}

locals {
  # Merge caller-provided tags with module-generated tags.
  # Module-generated tags are the minimum required for cost allocation and incident response.
  base_tags = {
    ManagedBy = "terraform"
    Module    = "ecs-service"
  }

  merged_tags = merge(local.base_tags, var.tags)
}
```
```hcl
resource "aws_cloudwatch_log_group" "service" {
  # One log group per service. Retention prevents unbounded CloudWatch costs.
  name              = "/ecs/${var.service_name}"
  retention_in_days = 30
  tags              = local.merged_tags
}

resource "aws_ecs_task_definition" "service" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([
    {
      name      = var.service_name
      image     = var.container_image
      essential = true

      portMappings = [
        {
          containerPort = var.container_port
          protocol      = "tcp"
        }
      ]

      # Separate environment (plain text) from secrets (Secrets Manager injection).
      # This distinction matters for audit logs and prevents accidental secret
      # exposure in task definitions.
      environment = [
        for k, v in var.environment_variables : { name = k, value = v }
      ]
      secrets = [
        for k, arn in var.secrets_arns : { name = k, valueFrom = arn }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.service.name
          "awslogs-region"        = data.aws_region.current.name
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])

  tags = local.merged_tags
}
```
```hcl
resource "aws_ecs_service" "service" {
  name            = var.service_name
  cluster         = var.ecs_cluster_id
  task_definition = aws_ecs_task_definition.service.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets = var.private_subnet_ids
    security_groups = concat(
      [aws_security_group.service.id],
      var.extra_security_group_ids # escape hatch for VPC endpoint SGs
    )
    assign_public_ip = false
  }

  # Ignore desired_count changes in Terraform state — auto-scaling manages this at runtime.
  # Without this, every terraform apply resets the count to the Terraform value,
  # undoing auto-scaling decisions.
  lifecycle {
    ignore_changes = [desired_count]
  }

  tags = local.merged_tags
}
```
```hcl
# modules/ecs-service/outputs.tf
# Output everything a downstream module might need.
# It's cheap to output; it's expensive to add outputs later when a consumer needs them.

output "service_name" {
  value       = aws_ecs_service.service.name
  description = "ECS service name. Used by deployment scripts and monitoring dashboards."
}

output "service_arn" {
  value       = aws_ecs_service.service.id
  description = "ECS service ARN. Required for CodeDeploy deployment group configuration."
}

output "task_role_arn" {
  value       = aws_iam_role.task.arn
  description = "IAM role ARN for the ECS task. Attach additional policies here for S3/DynamoDB access."
}

output "security_group_id" {
  value       = aws_security_group.service.id
  description = "Security group ID for the ECS service ENI. Reference from RDS or ElastiCache ingress rules."
}

output "log_group_name" {
  value       = aws_cloudwatch_log_group.service.name
  description = "CloudWatch log group name. Used in CloudWatch Insights queries and alarms."
}
```
2. Terragrunt for DRY Multi-Environment Configurations
The Terraform multi-environment problem is well-documented and poorly solved by workspaces. Workspaces share a single backend and a single configuration — their state files live side by side under one key prefix — and Terraform has no native mechanism for workspace-specific variable files to inherit shared values. The result is either duplication — three copies of identical main.tf files — or fragile variable injection through TF_VAR_ environment variables in CI.
Terragrunt is an HCL wrapper around Terraform that solves this with a simple inheritance model: environment-specific configuration inherits from a shared _envcommon directory, overriding only what differs.
Directory Structure
```
infrastructure/
├── _envcommon/                # Shared config inherited by all environments
│   ├── ecs-service.hcl        # Shared ECS service inputs
│   └── rds.hcl                # Shared RDS inputs
├── terragrunt.hcl             # Root config: remote state, provider generation
├── dev/
│   ├── env.hcl                # Environment-specific vars (env = "dev", region = "us-east-1")
│   ├── ecs-service/
│   │   └── terragrunt.hcl     # Inherits _envcommon/ecs-service.hcl, overrides desired_count
│   └── rds/
│       └── terragrunt.hcl
├── staging/
│   ├── env.hcl
│   ├── ecs-service/
│   │   └── terragrunt.hcl
│   └── rds/
│       └── terragrunt.hcl
└── prod/
    ├── env.hcl
    ├── ecs-service/
    │   └── terragrunt.hcl     # Overrides: desired_count = 4, cpu = 1024
    └── rds/
        └── terragrunt.hcl     # Overrides: instance_class = "db.r6g.large"
```
Root terragrunt.hcl — Remote State and Provider Generation
```hcl
# infrastructure/terragrunt.hcl
# Root config inherited by every child terragrunt.hcl via find_in_parent_folders()

locals {
  # Parse the environment from the directory path.
  # infrastructure/prod/ecs-service → env = "prod"
  path_components = split("/", path_relative_to_include())
  env             = local.path_components[0]

  # Load environment-specific variables from env.hcl
  env_vars   = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  aws_region = local.env_vars.locals.aws_region
  account_id = local.env_vars.locals.account_id
}

# Generate provider.tf in each module directory at plan/apply time.
# This avoids repeating the provider block in every module and ensures
# the assume_role ARN is always environment-specific.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"

  assume_role {
    # Each environment deploys into a separate AWS account.
    # This prevents a prod-targeted apply from hitting dev resources.
    role_arn = "arn:aws:iam::${local.account_id}:role/TerraformDeployRole"
  }

  default_tags {
    tags = {
      Environment = "${local.env}"
      ManagedBy   = "terragrunt"
    }
  }
}
EOF
}

# Remote state configuration.
# State files are isolated per module: s3://bucket/env/module-name/terraform.tfstate
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "myorg-terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # S3 bucket versioning must be enabled separately (see state management section).
    # Versioning allows state rollback after a botched apply.
  }
}
```
_envcommon/ecs-service.hcl — Shared Defaults
```hcl
# infrastructure/_envcommon/ecs-service.hcl
# Inputs that are identical across dev/staging/prod.
# Environment-specific overrides happen in each env's terragrunt.hcl.

locals {
  env_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  env      = local.env_vars.locals.env
}

inputs = {
  service_name   = "api-service"
  container_port = 8080
  cpu            = 256 # Override in prod to 1024
  memory         = 512 # Override in prod to 2048
  desired_count  = 1   # Override in prod to 4
}
```
prod/ecs-service/terragrunt.hcl — Environment Override
```hcl
# infrastructure/prod/ecs-service/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

include "envcommon" {
  # Pull in shared defaults. The merge strategy means prod inputs override envcommon inputs.
  path           = "${dirname(find_in_parent_folders())}/_envcommon/ecs-service.hcl"
  expose         = true
  merge_strategy = "deep"
}

# dependency blocks wire cross-module outputs without hardcoding ARNs.
# Terragrunt runs a targeted plan/output on the dependency before applying this module.
dependency "networking" {
  config_path = "../networking"

  # mock_outputs are used during `plan` when the dependency hasn't been applied yet.
  # This enables plan-on-PR without requiring a live networking stack.
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}

dependency "rds" {
  config_path = "../rds"

  mock_outputs = {
    db_secret_arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:mock-db-secret"
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}

terraform {
  source = "git::https://github.com/myorg/terraform-modules.git//modules/ecs-service?ref=v2.1.0"
}

# Merge with envcommon — only override what differs in prod.
inputs = merge(
  include.envcommon.inputs,
  {
    # Production capacity — override shared defaults
    cpu           = 1024
    memory        = 2048
    desired_count = 4

    # Wire in dependency outputs — no hardcoded ARNs
    vpc_id             = dependency.networking.outputs.vpc_id
    private_subnet_ids = dependency.networking.outputs.private_subnet_ids
    ecs_cluster_id     = dependency.networking.outputs.ecs_cluster_id

    secrets_arns = {
      DATABASE_URL = dependency.rds.outputs.db_secret_arn
    }

    tags = {
      Team       = "platform"
      CostCenter = "engineering"
    }
  }
)
```
3. State Management at Scale
State files are the source of truth for what Terraform believes exists in the world. Treating them carelessly — one file for everything, no encryption, no locking — is the fastest path to a disaster that takes hours to recover from.
The Fundamental Rule: One State File Per Workload
Not one per environment, not one per region, not one monolith. One per workload — the unit of infrastructure that gets deployed, scaled, and destroyed together.
"Workload" is a judgment call, but a useful heuristic: if two resources are never applied in the same operation, they belong in different state files. Networking (VPCs, subnets, route tables) is deployed once and rarely changed. Application infrastructure (ECS services, RDS instances) changes frequently. Monitoring and alerting changes on its own cadence. Keep them separate.
A monolithic state file has two failure modes. The first is blast radius: a bug in one resource's configuration can corrupt the entire state. The second is velocity: every change requires a full plan across all resources, even unrelated ones, which is slow and increases the chance of accidental drift.
S3 Backend Configuration
```hcl
# This is a partial backend configuration.
# The bucket name and key are injected at `terraform init` time via -backend-config flags
# or the Terragrunt remote_state block — never hardcoded in version control.
# This avoids exposing account-specific details in the public module source.
terraform {
  backend "s3" {
    # bucket and key are injected by Terragrunt's remote_state block.
    # Do not specify them here if using Terragrunt.
    region = "us-east-1"

    # Encrypt state at rest. State files contain plaintext secrets (database passwords,
    # API keys) because Terraform stores all resource attributes — including sensitive ones.
    encrypt = true

    # KMS key for state encryption. Default SSE-S3 is acceptable; KMS gives you
    # rotation, audit logs, and cross-account access control.
    kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"

    # DynamoDB table for state locking.
    # Locking prevents concurrent applies from corrupting state.
    dynamodb_table = "terraform-state-lock"
  }
}
```
```hcl
# S3 bucket for state storage — bootstrapped manually or via a separate "bootstrap" module.
# This is the one piece of infrastructure that cannot manage itself.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state-${data.aws_caller_identity.current.account_id}"

  # Prevent accidental deletion of the state bucket.
  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Purpose   = "terraform-state"
    ManagedBy = "bootstrap"
  }
}

# Versioning is the rollback mechanism for state files.
# After a botched apply, you can restore the previous state version from S3
# and run terraform apply again to converge.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST" # On-demand billing; lock table traffic is spiky
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Purpose   = "terraform-state-lock"
    ManagedBy = "bootstrap"
  }
}
```
State File Refactoring and Drift Recovery
terraform state mv moves resources between state files without destroying and recreating them. Use it when splitting a monolith or renaming a resource within a module refactor. Always take a state backup before any state mv operation.
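For pure renames within one state file, Terraform 1.1+ offers a declarative alternative: a moved block checked into the module, which performs the same state surgery at the next plan/apply and is reviewable in a PR (the addresses below are illustrative):

```hcl
# Records that the resource formerly addressed as aws_ecs_service.svc
# now lives at aws_ecs_service.service. Terraform rewrites the state
# address instead of destroying and recreating the resource.
moved {
  from = aws_ecs_service.svc
  to   = aws_ecs_service.service
}
```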
terraform import brings existing resources under Terraform management. Use it when a resource was created manually and you want IaC ownership going forward.
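Terraform 1.5+ also supports declarative import blocks, which make the import itself reviewable rather than a one-off CLI command (the resource address and ID below are illustrative):

```hcl
# The plan shows the import before it happens. Pair with
# `terraform plan -generate-config-out=generated.tf` to draft matching HCL.
import {
  to = aws_cloudwatch_log_group.service
  id = "/ecs/api-service"
}
```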
Broken DynamoDB locks (from a killed apply process) show up as "Error acquiring the state lock." Verify the lock is actually stale by checking the LockID in DynamoDB and comparing the timestamp. If it's more than a few hours old and no apply is running, use terraform force-unlock <LOCK_ID>. Never force-unlock a live apply.
4. Drift Detection and Remediation
Drift is the delta between what Terraform's state believes exists and what actually exists in AWS. It accumulates through three vectors: manual console changes by engineers under pressure, auto-scaling modifying desired counts, and external automation (Lambda functions, AWS Config remediations, third-party tools) creating or modifying resources.
Undetected drift is the most dangerous state your infrastructure can be in. You think you have IaC. You don't. You have IaC plus a shadow layer of undocumented manual changes that will survive until the next terraform destroy or a major refactor wipes them out.
Drift Detection in CI
Run terraform plan on a schedule — not just on pull requests. A plan that runs only when engineers make changes will never catch drift from external sources.
```yaml
# .github/workflows/drift-detection.yml
name: Drift Detection

on:
  # Run at 6 AM UTC on weekdays — before the engineering day starts, so drift
  # is visible in Slack before anyone starts making infrastructure changes.
  schedule:
    - cron: "0 6 * * 1-5"
  # Also allow manual trigger for on-demand drift checks.
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Run drift detection for all environments in parallel.
        environment: [dev, staging, prod]
        module: [networking, ecs-service, rds, monitoring]
    permissions:
      id-token: write # Required for OIDC authentication to AWS
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars[format('{0}_ACCOUNT_ID', matrix.environment)] }}:role/GitHubActionsRole
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"

      - name: Setup Terragrunt
        run: |
          wget -qO terragrunt "https://github.com/gruntwork-io/terragrunt/releases/download/v0.67.0/terragrunt_linux_amd64"
          chmod +x terragrunt
          sudo mv terragrunt /usr/local/bin/

      - name: Terragrunt Plan (Drift Detection)
        id: plan
        working-directory: infrastructure/${{ matrix.environment }}/${{ matrix.module }}
        run: |
          # -detailed-exitcode: exit 0 = no changes, exit 1 = error, exit 2 = changes detected.
          # GitHub's default shell runs bash with -e, which would abort on exit code 2
          # before we capture it; disable errexit so the code reaches $GITHUB_OUTPUT.
          set +e
          terragrunt plan -detailed-exitcode -out=plan.tfplan 2>&1 | tee plan_output.txt
          echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true

      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": ":rotating_light: *Terraform Drift Detected*",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": ":rotating_light: *Terraform Drift Detected*\n*Environment:* ${{ matrix.environment }}\n*Module:* ${{ matrix.module }}\n*Workflow:* <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Details>"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_DRIFT_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

      - name: Fail on Error (not on drift)
        # Exit code 2 means drift, which we alert on but don't fail the workflow.
        # Exit code 1 means a real error (auth failure, provider issue), which should fail.
        if: steps.plan.outputs.exitcode == '1'
        run: exit 1
```
Handling ignore_changes for Intentional Drift
Some drift is intentional. Auto-scaling modifies desired_count at runtime. Terraform should not reset it on every apply. Use ignore_changes for this, but document why.
```hcl
resource "aws_ecs_service" "service" {
  # ... other config ...

  lifecycle {
    # desired_count is managed by Application Auto Scaling at runtime.
    # Without this ignore, terraform apply would reset the count to the Terraform value,
    # overriding auto-scaling decisions. This is intentional drift we accept.
    ignore_changes = [desired_count]
  }
}
```
Preventing Drift: Break-Glass Procedures
The goal is not zero manual console access — emergencies happen. The goal is zero undocumented manual console access. Implement a break-glass procedure: an IAM role that grants console write access, requires MFA, logs all API calls via CloudTrail, and triggers a PagerDuty alert when assumed. After every break-glass event, the engineer responsible must open a Terraform PR capturing the manual change before the end of the sprint.
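A sketch of the break-glass role's trust policy, assuming the account ID, role name, and session length shown here (the CloudTrail-to-PagerDuty alerting is wired separately, e.g. via an EventBridge rule on AssumeRole events):

```hcl
resource "aws_iam_role" "break_glass" {
  name                 = "break-glass-console-access"
  max_session_duration = 3600 # 1 hour: long enough for an incident, short enough to expire

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::123456789012:root" }
      # Refuse the role to any session without MFA.
      Condition = { Bool = { "aws:MultiFactorAuthPresent" = "true" } }
    }]
  })
}
```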
5. Testing Infrastructure Code
"We can't test infrastructure" is a belief, not a fact. Terraform can be tested at multiple levels — unit, contract, integration, policy, and security — and each level catches different classes of bugs.
Terratest: Real Resources, Real Assertions
Terratest runs actual Terraform, provisions real AWS resources in a test account, runs assertions against them, then destroys everything. It's slow (5-10 minutes per test), it costs money (fractions of a cent per test run), and it catches things static analysis never will.
```go
// modules/ecs-service/test/ecs_service_test.go
package test

import (
	"fmt"
	"testing"
	"time"

	awssdk "github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestECSServiceModule(t *testing.T) {
	t.Parallel()

	// Use a unique suffix to avoid conflicts when tests run concurrently.
	uniqueID := fmt.Sprintf("test-%d", time.Now().UnixMilli()%10000)
	serviceName := fmt.Sprintf("test-svc-%s", uniqueID)
	awsRegion := "us-east-1"

	terraformOptions := &terraform.Options{
		// The examples/ directory contains a minimal, self-contained instantiation
		// of the module for testing. It provisions its own VPC and ECS cluster.
		TerraformDir: "../examples/basic",
		Vars: map[string]interface{}{
			"service_name":    serviceName,
			"container_image": "nginx:1.25.3", // pinned tag — no :latest in tests
			"container_port":  8080,
			"desired_count":   1,
			"aws_region":      awsRegion,
		},
		// Retry on transient AWS API errors. ECS service creation can take 30-60 seconds.
		RetryableTerraformErrors: map[string]string{
			"Error creating ECS Service": "ECS service creation is eventually consistent",
			"ResourceInUseException":     "Resource not yet available",
		},
		MaxRetries:         3,
		TimeBetweenRetries: 15 * time.Second,
	}

	// Always destroy resources after the test — even if the test fails.
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// --- Assertions ---
	serviceArn := terraform.Output(t, terraformOptions, "service_arn")
	require.NotEmpty(t, serviceArn, "service_arn output must not be empty")

	// Verify the ECS service exists and is ACTIVE.
	clusterArn := terraform.Output(t, terraformOptions, "cluster_arn")
	service := aws.GetEcsService(t, awsRegion, clusterArn, serviceName)
	assert.Equal(t, "ACTIVE", awssdk.StringValue(service.Status), "ECS service should be ACTIVE")
	assert.Equal(t, int64(1), awssdk.Int64Value(service.DesiredCount), "desired count should match input")

	// Verify the CloudWatch log group was created.
	logGroupName := terraform.Output(t, terraformOptions, "log_group_name")
	assert.Equal(t, fmt.Sprintf("/ecs/%s", serviceName), logGroupName)

	// Verify the task role ARN follows the expected naming convention.
	taskRoleArn := terraform.Output(t, terraformOptions, "task_role_arn")
	assert.Contains(t, taskRoleArn, serviceName, "task role ARN should contain service name")

	// Verify the security group was created and has no ingress from 0.0.0.0/0.
	sgID := terraform.Output(t, terraformOptions, "security_group_id")
	ec2Client := aws.NewEc2Client(t, awsRegion)
	resp, err := ec2Client.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{awssdk.String(sgID)},
	})
	require.NoError(t, err)
	require.Len(t, resp.SecurityGroups, 1)
	for _, perm := range resp.SecurityGroups[0].IpPermissions {
		for _, ipRange := range perm.IpRanges {
			assert.NotEqual(t, "0.0.0.0/0", awssdk.StringValue(ipRange.CidrIp),
				"ECS service security group must not allow ingress from 0.0.0.0/0")
		}
	}
}
```
Checkov: Static Security Analysis
Checkov runs without any AWS credentials — it analyzes Terraform plan output or raw HCL for security misconfigurations. Add it to the PR pipeline, before apply.
```yaml
# In your GitHub Actions plan workflow, after terraform plan:
- name: Run Checkov
  uses: bridgecrewio/checkov-action@v12
  with:
    directory: .
    framework: terraform
    # Fail the build on HIGH and CRITICAL findings.
    # MEDIUM findings are reported but don't block merge — review weekly.
    soft_fail_on: MEDIUM,LOW,INFO
    output_format: github_failed_only
    # Skip checks that don't apply to your environment.
    # Document why each skip is justified.
    skip_check: CKV_AWS_116,CKV_AWS_338
```
Infracost: Cost Estimation in CI
```yaml
# .github/workflows/infracost.yml (steps excerpt)
# Runs on every PR that modifies Terraform files.
# Posts a cost diff comment showing the monthly cost change.
- name: Setup Infracost
  uses: infracost/actions/setup@v3
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Generate Infracost diff
  run: |
    # Generate cost estimate for the proposed changes
    infracost diff \
      --path=. \
      --format=json \
      --compare-to=infracost-base.json \
      --out-file=infracost-diff.json

- name: Post Infracost comment
  uses: infracost/actions/comment@v3
  with:
    path: infracost-diff.json
    # Show the monthly cost diff in the PR comment.
    # Engineers reviewing the PR can see "this change adds $47/month" before approving.
    behavior: update
```
6. CI/CD for Infrastructure
Infrastructure CI/CD has different requirements than application CI/CD. The blast radius of a bad deploy is higher. Rollback is harder. The feedback loop from plan to verify is slower. The pipeline design has to account for all three.
Plan on PR: Show Everything Before Merge
```yaml
# .github/workflows/terraform-pr.yml
name: Terraform Plan

on:
  pull_request:
    paths:
      - "infrastructure/**"
      - "modules/**"

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars[format('{0}_PLAN_ROLE', matrix.environment)] }}
          aws-region: us-east-1

      - name: Setup Terraform and Terragrunt
        run: |
          # unzip cannot read from a pipe, so download the archive to disk first.
          wget -q https://releases.hashicorp.com/terraform/1.9.0/terraform_1.9.0_linux_amd64.zip
          unzip -o terraform_1.9.0_linux_amd64.zip terraform
          sudo mv terraform /usr/local/bin/
          wget -qO terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.67.0/terragrunt_linux_amd64
          chmod +x terragrunt && sudo mv terragrunt /usr/local/bin/

      - name: Terragrunt Plan
        id: plan
        working-directory: infrastructure/${{ matrix.environment }}
        run: |
          terragrunt run-all plan \
            --terragrunt-non-interactive \
            -out=tfplan 2>&1 | tee plan_output.txt

      - name: Run Checkov Policy Check
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infrastructure/${{ matrix.environment }}
          framework: terraform
          soft_fail_on: MEDIUM,LOW,INFO

      - name: Infracost Cost Diff
        if: matrix.environment == 'prod' # Cost estimate only matters for prod changes
        run: |
          infracost diff \
            --path=infrastructure/prod \
            --format=json \
            --out-file=infracost.json

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const planOutput = fs.readFileSync('infrastructure/${{ matrix.environment }}/plan_output.txt', 'utf8');
            const body = `## Terraform Plan — \`${{ matrix.environment }}\`\n\`\`\`\n${planOutput.slice(-30000)}\n\`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```
Apply on Merge: Automated Dev, Gated Prod
# .github/workflows/terraform-apply.yml
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - "infrastructure/**"
jobs:
  apply-dev:
    runs-on: ubuntu-latest
    environment: dev # No approval required for dev
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.DEV_APPLY_ROLE }}
          aws-region: us-east-1
      - name: Apply to Dev
        working-directory: infrastructure/dev
        run: |
          # Apply one module at a time: --terragrunt-parallelism 1 limits blast radius.
          # Parallel applies can cause race conditions in resource dependencies.
          terragrunt run-all apply \
            --terragrunt-non-interactive \
            --terragrunt-parallelism 1
  apply-staging:
    needs: apply-dev # Staging applies only after dev succeeds
    runs-on: ubuntu-latest
    environment: staging # Requires approval from staging-approvers team
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.STAGING_APPLY_ROLE }}
          aws-region: us-east-1
      - name: Apply to Staging
        working-directory: infrastructure/staging
        run: terragrunt run-all apply --terragrunt-non-interactive --terragrunt-parallelism 1
  apply-prod:
    needs: apply-staging # Prod applies only after staging succeeds
    runs-on: ubuntu-latest
    environment: production # Requires approval from senior-engineers team — configured in GitHub Environments
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.PROD_APPLY_ROLE }}
          aws-region: us-east-1
      - name: Apply to Prod
        working-directory: infrastructure/prod
        run: terragrunt run-all apply --terragrunt-non-interactive --terragrunt-parallelism 1
Rollback Strategy
Terraform has no native rollback. The rollback mechanism is: retrieve the previous state file version from S3 (S3 versioning must be enabled), restore it locally, and re-apply. This requires that the underlying infrastructure hasn't been destroyed, which is why prevent_destroy = true on critical resources matters.
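That restore procedure can be sketched with the AWS CLI and `terraform state push`. The bucket name, key, and version ID below are hypothetical placeholders, not values from this post — substitute your own backend configuration:

```shell
# Sketch of a state rollback, assuming S3 versioning is enabled on the backend bucket.
# Bucket, key, and version ID are hypothetical placeholders.

# 1. List versions of the workload's state file, newest first.
aws s3api list-object-versions \
  --bucket example-tf-state \
  --prefix prod/payments-api/terraform.tfstate \
  --query 'Versions[].{Id:VersionId,Modified:LastModified}'

# 2. Download the last-known-good version.
aws s3api get-object \
  --bucket example-tf-state \
  --key prod/payments-api/terraform.tfstate \
  --version-id "<previous-version-id>" \
  rollback.tfstate

# 3. From the workload directory, overwrite the remote state with it.
#    -force bypasses the serial/lineage checks, so inspect rollback.tfstate first.
terraform state push -force rollback.tfstate

# 4. Plan before applying, to see what the restored state implies.
terraform plan
```

Treat step 3 with the same caution as any state surgery: `terraform state push -force` skips the safety checks that normally stop you from clobbering newer state.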
For destructive changes (renaming a resource, changing a unique constraint), the safer path is usually to apply the new resource alongside the old one, migrate traffic, then destroy the old one — rather than trying to roll back.
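For the pure-rename case specifically, Terraform 1.1+ can avoid the destroy/create cycle entirely with a `moved` block. A minimal sketch, with hypothetical resource names:

```hcl
# Hypothetical rename: aws_s3_bucket.logs becomes aws_s3_bucket.audit_logs.
# Without this block, plan shows a destroy of "logs" and a create of "audit_logs";
# with it, Terraform updates the state address in place and touches nothing in AWS.
moved {
  from = aws_s3_bucket.logs
  to   = aws_s3_bucket.audit_logs
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "example-audit-logs" # placeholder name
}
```

The `moved` block lives in the configuration and goes through PR review like any other change, which makes it preferable to an out-of-band `terraform state mv`.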
Atlantis vs Terraform Cloud vs custom GitHub Actions: Atlantis is the right choice when you want plan/apply workflows in your existing GitHub PRs without a SaaS dependency, and you're willing to operate the server. Terraform Cloud is the right choice when you want audit logs, Sentinel policies, and a managed execution environment. Custom GitHub Actions (like the examples above) are the right choice when your team already understands GitHub Actions and the overhead of running another system isn't justified.
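If you go the Atlantis route, the repo-level configuration is a single `atlantis.yaml` at the repository root. A minimal sketch — project names and directories here are hypothetical, and a Terragrunt repo would additionally need a custom workflow that shells out to `terragrunt` instead of `terraform`:

```yaml
# atlantis.yaml — minimal repo config sketch; project names and dirs are hypothetical.
version: 3
projects:
  - name: prod-network
    dir: infrastructure/prod/network
    autoplan:
      # Re-plan automatically when the workload's config or its module changes.
      when_modified: ["*.hcl", "../../../modules/network/**"]
    # Require an approved, mergeable PR before `atlantis apply` is allowed.
    apply_requirements: [approved, mergeable]
```

The `apply_requirements` line is the important one: it gives Atlantis the same "gated prod" property the GitHub Environments approach above gets from protected environments.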
Conclusion
Infrastructure as Code at scale is not harder than application engineering — it requires the same disciplines applied to a different problem domain. Module design with explicit interfaces and validation blocks makes configurations reviewable. Terragrunt's inheritance model makes multi-environment management maintainable without copying files. Per-workload state isolation limits blast radius. Scheduled drift detection makes the gap between declared and actual state visible before it becomes an incident. Terratest and Checkov bring the same test-before-merge hygiene to infrastructure that unit tests bring to application code.
The teams that get IaC right treat it exactly like application code: PRs, reviews, tests, CI/CD, and a culture of fixing broken state the same day it's detected rather than letting it accumulate. The teams that don't are the ones debugging 2 AM manual console changes with no audit trail, a corrupted state file, and no clear picture of what was actually running before the incident.
Start with the module design principles and state isolation strategy — those compound over time. Add Terragrunt when copy-paste between environment directories starts causing divergence. Add testing when you have enough modules to justify the investment. Add drift detection the day you catch someone making a console change without a follow-up Terraform PR.
The patterns in this post are not theoretical. They're the patterns that survive contact with production.