Temporal and Durable Execution: How to Build Workflows That Never Fail

Hero image showing a durable workflow engine persisting through infrastructure failures

You have a workflow that takes two hours. It involves twelve API calls — a payment processor, an inventory system, a fulfillment service, three downstream notification services, and a few internal state machines. You've written it as a clean async chain. At step nine, your Kubernetes pod gets evicted. A node fails. A deployment rolls out mid-execution. Where do you restart?

If you're using a traditional queue worker or a naive promise chain, the honest answer is: you don't know. You have partial state scattered across databases, some API calls that succeeded and some that didn't, and no reliable way to know which ones. You're going to spend the next four hours in incident review, writing custom reconciliation scripts, and hoping you didn't double-charge a customer.

This is the durable execution problem. And it's one of the most underappreciated challenges in distributed systems engineering — especially now that AI agents are running for minutes, hours, and in some cases days, making calls to external APIs, waiting on human approvals, and maintaining complex internal state across sessions.

Temporal.io has built an industrial-grade answer to this problem. It calls the pattern "durable execution," and once you understand how it works under the hood, you'll start seeing the problem it solves everywhere — in payment flows, order fulfillment, data pipelines, and increasingly in AI agent orchestration. This post is a deep dive into what durable execution is, how Temporal implements it, and when you actually need it versus a simpler queue-based approach.

> This post is part of the [AI Agent Engineering: Complete 2026 Guide](/2026/04/ai-agent-engineering-complete-2026-guide.html). Durable execution is the runtime layer for long-running agents — the guide explains how it connects to orchestration, tool use, and context management.

The Problem with Naive Long-Running Processes

Before we get into Temporal specifically, it's worth sitting with the problem in detail. Because the failure modes are subtle and they compound.

Consider a simplified order fulfillment workflow:

1. Validate the order

2. Reserve inventory

3. Charge the customer's payment method

4. Create the shipment record

5. Notify the warehouse

6. Send confirmation email to the customer

7. Update the order status to "fulfilled"

Written as a naive async function in TypeScript, this looks clean:

// Naive approach — looks fine, breaks badly
async function fulfillOrder(orderId: string) {
  const order = await validateOrder(orderId);
  await reserveInventory(order);
  await chargePayment(order);           // Step 3: if we crash here...
  await createShipmentRecord(order);   // Step 4 never runs
  await notifyWarehouse(order);
  await sendConfirmationEmail(order);
  await updateOrderStatus(orderId, 'fulfilled');
}

What happens when the process crashes between step three and step four? The customer has been charged. The shipment record doesn't exist. The warehouse has no idea. Your order status is stuck in an ambiguous intermediate state. Depending on your database transaction boundaries, you might not even know which steps completed.

The naive recovery strategies all have problems:

Retry the whole thing: You'll double-charge the customer. Unless every step is perfectly idempotent (and in practice, many payment processors and third-party APIs aren't), replaying from the start is dangerous.

Checkpoint to a database: You add manual checkpoint logic — save the current step to a table, reload on restart. Now your business logic is tangled with failure recovery logic. The code becomes hard to reason about, test, and maintain. And you've just reinvented a state machine, badly.

Use a message queue per step: Each step publishes to a queue, the next step consumes it. This handles individual step failures but creates a distributed state machine spread across twelve queue topics. Visibility is terrible. Debugging requires correlating logs across multiple consumers. Adding a new step means wiring up new queues and coordination logic.

Accept the inconsistency: Some teams do this. They write reconciliation jobs that run nightly to find orders stuck in intermediate states and repair them. This works until volume scales and until the corner cases multiply.

None of these approaches are wrong per se — teams ship production systems with all of them. But they all require the engineer to manage failure recovery manually, which is exactly the kind of accidental complexity that Temporal is designed to eliminate.
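To make the "checkpoint to a database" option concrete, here's roughly what the manual version looks like — a sketch with hypothetical names throughout (the in-memory `checkpoints` map stands in for a real table). Notice how recovery logic invades the business logic:

```typescript
// A sketch of manual checkpointing: a step list, a checkpoint "table",
// and resume logic woven directly into the business flow.
type StepFn = (orderId: string) => Promise<void>;

const checkpoints = new Map<string, number>(); // stands in for a DB checkpoint table
const sideEffects: string[] = [];              // records the real work performed

const steps: Array<[string, StepFn]> = [
  ['validateOrder',    async (id) => { sideEffects.push(`validate:${id}`); }],
  ['reserveInventory', async (id) => { sideEffects.push(`reserve:${id}`); }],
  ['chargePayment',    async (id) => { sideEffects.push(`charge:${id}`); }],
  ['createShipment',   async (id) => { sideEffects.push(`ship:${id}`); }],
];

async function fulfillOrder(orderId: string, crashAfter?: number): Promise<void> {
  // Recovery logic now lives inside the business flow: reload progress first.
  const done = checkpoints.get(orderId) ?? 0;
  for (let i = done; i < steps.length; i++) {
    const [name, fn] = steps[i];
    await fn(orderId);
    // NOT atomic with the side effect above — a crash between these two lines
    // replays the step on resume, which is why every step must be idempotent.
    checkpoints.set(orderId, i + 1);
    if (crashAfter === i) throw new Error(`simulated crash after ${name}`);
  }
}
```

Run it once with a simulated crash after chargePayment and once more to resume: the second call starts at createShipment, so the charge runs exactly once — but only because the checkpoint write happened to land before the crash. Flip the order of those two lines and you get a double charge. That fragility is exactly what durable execution is meant to remove.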

flowchart TD
    A[Start: fulfillOrder called] --> B[validateOrder ✓]
    B --> C[reserveInventory ✓]
    C --> D[chargePayment ✓]
    D --> E{Process Crashes}
    E --> F[createShipmentRecord ✗ — never ran]
    E --> G[notifyWarehouse ✗ — never ran]
    E --> H[sendConfirmationEmail ✗ — never ran]
    E --> I[updateOrderStatus ✗ — never ran]
    
    style E fill:#ff4444,color:#fff
    style F fill:#ffcccc
    style G fill:#ffcccc
    style H fill:#ffcccc
    style I fill:#ffcccc
    
    J[Result] --> K["Customer charged but order stuck<br/>No shipment record<br/>No email<br/>Warehouse unaware<br/>Manual reconciliation required"]
    F --> J
    G --> J
    H --> J
    I --> J

This diagram represents what production incidents look like without durable execution. Each red box is state you have to recover manually; as workflow complexity grows, this becomes untenable.

What Is Durable Execution?

Durable execution is the idea that the state of a running program — its call stack, local variables, the point of execution it has reached — can be persisted externally and reconstituted after a failure. From the code's perspective, the execution never stopped. From the infrastructure's perspective, the process can crash, restart, and resume seamlessly.

The key insight is that instead of persisting state, you persist events. Every action that a workflow takes is recorded as an append-only event log. When a process crashes and a new worker picks up the workflow, it doesn't load some serialized stack frame — it replays the event history, re-executing the workflow function from the beginning, but fast-forwarding through steps that already completed by returning their previously recorded results without actually re-executing the side effects.

This event sourcing approach is what makes durable execution both powerful and tricky. The workflow function executes multiple times — once originally, and then on every replay. This means workflow code must be deterministic: given the same sequence of events, it must produce the same sequence of decisions. No random numbers, no direct calls to Date.now(), no filesystem reads, no API calls inside workflow code. All non-deterministic operations must be delegated to Activities.

The mental model shift is significant. You stop thinking about your workflow as a process to be kept alive, and start thinking about it as a pure function over its event history. The event log is the source of truth. The process is just a compute unit that interprets it.
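The replay idea can be shown in miniature. This toy is not Temporal's API — just the concept: a workflow function runs against an event log, completed steps return their recorded results instantly, and only the first unrecorded step actually executes and gets appended to the log.

```typescript
// Toy durable executor: the event log, not the process, is the source of truth.
type EventLog = Array<{ step: string; result: unknown }>;

function makeContext(log: EventLog, executedForReal: string[]) {
  let cursor = 0;
  return {
    // run(step, fn): replay from the log if recorded, otherwise execute & record
    async run<T>(step: string, fn: () => Promise<T>): Promise<T> {
      if (cursor < log.length) {
        const recorded = log[cursor++];
        return recorded.result as T; // fast-forward: the side effect never re-runs
      }
      const result = await fn();     // first time through: do the real work
      log.push({ step, result });
      cursor++;
      executedForReal.push(step);
      return result;
    },
  };
}

// The "workflow": plain sequential code that may be executed many times.
async function workflow(ctx: ReturnType<typeof makeContext>) {
  const a = await ctx.run('charge', async () => 'txn-1');
  const b = await ctx.run('ship', async () => `shipment-for-${a}`);
  return b;
}
```

Running `workflow` a second time over the same log — simulating a crash and restart on a fresh worker — produces the identical result without re-executing either side effect.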

Temporal architecture diagram showing workers, event history, and the Temporal server

How Temporal Works Under the Hood

Temporal has three main components: the Temporal Server (the orchestration backend), Workflow Workers (processes that run your workflow code), and Activity Workers (processes that run your activity code). In many deployments, the same process hosts both a workflow worker and an activity worker, but they're logically distinct.

The Temporal Server maintains the event history for every running workflow instance. It acts as the state machine engine — deciding which workflow is ready to make progress, dispatching tasks to workers via long-polling queues, and recording every event (workflow started, activity scheduled, activity completed, timer fired, signal received, etc.) into a durable backend store (typically Cassandra or PostgreSQL).

Workflow Workers poll the server for workflow tasks. A workflow task is essentially a message saying "here is the event history for workflow X — figure out what to do next." The worker runs your workflow function against this history, with Temporal's SDK intercepting it so that proxied activity calls, sleep() timers, and condition() waits become state-machine commands rather than real operations. The worker never actually blocks — it runs the function to a "decision point" (the next async operation), records the commands, and returns control to the Temporal server. The workflow function may be called dozens of times as new events arrive, always replaying from the beginning but fast-forwarding through completed steps.

Activity Workers poll for activity tasks — requests to run a specific activity function with specific arguments. Activities are where the real work happens: the API calls, the database writes, the side effects. Activities run to completion (or fail and get retried according to their retry policy). When they complete, the result is recorded as an event and the workflow worker is given a new workflow task to make its next decision.

sequenceDiagram
    participant C as Client
    participant TS as Temporal Server
    participant WW as Workflow Worker
    participant AW as Activity Worker
    participant Ext as External API

    C->>TS: StartWorkflow(orderId)
    TS->>TS: Record: WorkflowExecutionStarted
    TS->>WW: WorkflowTask (history: [Started])
    WW->>WW: Execute workflow fn → schedules chargePayment activity
    WW->>TS: Decision: ScheduleActivity(chargePayment)
    TS->>TS: Record: ActivityTaskScheduled
    TS->>AW: ActivityTask(chargePayment, args)
    AW->>Ext: POST /charge
    Ext-->>AW: {success: true, txnId: "abc123"}
    AW->>TS: ActivityTaskCompleted(result)
    TS->>TS: Record: ActivityTaskCompleted

    Note over TS,WW: Worker crashes here

    TS->>WW: WorkflowTask (history: [Started, ActivityScheduled, ActivityCompleted])
    WW->>WW: Replay workflow fn → chargePayment returns cached result
    WW->>WW: Workflow fn continues → schedules createShipment activity
    WW->>TS: Decision: ScheduleActivity(createShipment)

The replay behavior is what makes the system resilient. When the worker crashed after the payment activity completed but before the shipment activity was scheduled, the Temporal server simply re-delivered the workflow task to a new worker with the full history. The new worker replayed the workflow function, saw that the payment activity had already completed (it's in the history), returned the recorded result without making a real API call, and then continued to schedule the next activity as if nothing had happened. Zero data loss, zero duplicate charges.

The tradeoff is determinism. Because the workflow function replays, any non-deterministic code inside it will produce different results on replay than on the original execution, causing the history to diverge and failing the workflow task with a non-determinism error. This is why you don't reach for arbitrary randomness or timestamps in workflow code (the TypeScript SDK's sandbox replaces Math.random() and Date with deterministic versions; other SDKs require explicit workflow-safe calls), why you use the SDK's sleep() instead of setTimeout(), and why you never make API calls inside workflow code directly. The SDK provides safe alternatives for everything you need.
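The non-determinism check itself is simple to sketch: on replay, compare the step the code asks for against the step recorded in history, and fail if they diverge. (Illustrative only — the real SDKs compare full command sequences, not single step names.)

```typescript
// Toy replay with divergence detection.
type HistoryEvent = { step: string; result: string };

function replay(
  history: HistoryEvent[],
  workflowFn: (run: (step: string) => string) => string
): string {
  let cursor = 0;
  const run = (step: string): string => {
    const recorded = history[cursor++];
    if (recorded && recorded.step !== step) {
      // The code made a different decision than the original execution did.
      throw new Error(
        `nondeterminism: history has '${recorded.step}', code asked for '${step}'`
      );
    }
    return recorded ? recorded.result : `executed-${step}`;
  };
  return workflowFn(run);
}

const history = [
  { step: 'chargePayment', result: 'txn-1' },
  { step: 'createShipment', result: 'ship-1' },
];

// Deterministic workflow: replays cleanly against the recorded history.
const ok = replay(history, (run) => run('chargePayment') + '/' + run('createShipment'));

// "Non-deterministic" workflow (think: a branch on Math.random()): replay fails.
let error = '';
try {
  replay(history, (run) =>
    Math.random() < 2 ? run('fraudCheck') : run('chargePayment') // always takes the new branch
  );
} catch (e) {
  error = (e as Error).message;
}
```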

Workflows vs Activities: The Critical Split

The Workflow/Activity split is the most important concept in Temporal, and getting it wrong is the most common source of bugs and confusion.

Workflow code is the orchestrator. It defines the sequence and logic of what happens, but it never directly causes side effects. It invokes proxied activities, calls sleep(), and waits on signals via condition(). All of these yield control to the Temporal server. The workflow function is effectively a coroutine — it pauses at every async boundary and the Temporal server schedules its resumption based on external events.

Because workflow code replays, it must be deterministic. Every time the same events arrive, the workflow function must make the same sequence of decisions. This means:

  • No ad-hoc wall-clock reads — the TypeScript SDK's sandbox replaces Date with a deterministic clock derived from the event history; other SDKs expose explicit equivalents (e.g. Go's workflow.Now())
  • No ad-hoc randomness — use uuid4() from @temporalio/workflow, or pass random values in via signals/activities
  • No direct network calls, file reads, or database queries
  • No global mutable state that persists across activations
  • No non-deterministic iteration over unordered data structures (e.g., iterating over object keys whose order isn't guaranteed)
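The last bullet trips people up in practice. A plain sketch of the fix — no Temporal APIs involved, just ordinary TypeScript:

```typescript
// Iterating over object keys in a workflow: JS preserves insertion order for
// string keys, but if the object is assembled from non-deterministic input
// (e.g. concurrent activity results), that order can differ between the
// original execution and replay. Sorting the keys makes iteration stable.
function stableEntries<T>(obj: Record<string, T>): Array<[string, T]> {
  return Object.keys(obj)
    .sort() // deterministic order regardless of construction order
    .map((k) => [k, obj[k]]);
}

// Two "runs" that built the same logical map in different orders:
const runA = { sku2: 1, sku1: 5 };
const runB = { sku1: 5, sku2: 1 };
```

Raw `Object.keys` gives different sequences for `runA` and `runB`; `stableEntries` gives the same sequence for both, so the workflow makes the same decisions on replay.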

Activity code is where the real work happens. Activities are regular functions — they can do anything. They can call APIs, write to databases, read files, wait on external events. They're isolated side-effectful operations with clear inputs and outputs. They run to completion or fail, and Temporal handles retry with configurable backoff.

Activity retry policies are one of Temporal's most useful features:

// Activity with retry policy
const { chargePayment } = proxyActivities<typeof activities>({
  startToCloseTimeout: '30 seconds',
  retry: {
    maximumAttempts: 5,
    initialInterval: '1 second',
    backoffCoefficient: 2,
    maximumInterval: '30 seconds',
    nonRetryableErrorTypes: ['PaymentDeclinedError', 'InvalidCardError'],
  },
});

This says: if chargePayment fails, make up to five attempts in total (the first try plus four retries) with exponential backoff, but don't retry if the failure is a PaymentDeclinedError (a business-logic failure, not an infrastructure failure). In the TypeScript SDK, the strings in nonRetryableErrorTypes are matched against the type field of a thrown ApplicationFailure. The distinction between retriable and non-retriable failures is something you'd have to implement manually in any queue-based system.
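Concretely, the policy above yields the following retry schedule — a quick sketch of the arithmetic (delay = min(initialInterval × backoffCoefficient^retry, maximumInterval), ignoring any server-side jitter):

```typescript
// Compute the delay before each retry for an exponential-backoff policy.
function retryDelaysSeconds(opts: {
  maximumAttempts: number;
  initialIntervalS: number;
  backoffCoefficient: number;
  maximumIntervalS: number;
}): number[] {
  const delays: number[] = [];
  // maximumAttempts counts the first attempt too, so there are
  // (maximumAttempts - 1) waits between attempts.
  for (let retry = 0; retry < opts.maximumAttempts - 1; retry++) {
    const raw = opts.initialIntervalS * Math.pow(opts.backoffCoefficient, retry);
    delays.push(Math.min(raw, opts.maximumIntervalS));
  }
  return delays;
}

// The chargePayment policy from above: 5 attempts → 4 waits of 1s, 2s, 4s, 8s.
const schedule = retryDelaysSeconds({
  maximumAttempts: 5,
  initialIntervalS: 1,
  backoffCoefficient: 2,
  maximumIntervalS: 30,
});
```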

Full Code Example: Order Processing Workflow

Let's build the order fulfillment workflow we described earlier, properly, with Temporal. This is a complete working example in TypeScript.

Step 1: Define the Activities

// activities.ts
// (`db`, `inventoryService`, `paymentProcessor`, `shippingService`,
// `warehouseService`, and `emailService` are assumed to be service
// clients imported from elsewhere in your codebase.)
import { ApplicationFailure } from '@temporalio/activity';

export interface Order {
  orderId: string;
  customerId: string;
  items: Array<{ productId: string; quantity: number; price: number }>;
  totalAmount: number;
  shippingAddress: string;
}

export interface FulfillmentResult {
  orderId: string;
  transactionId: string;
  shipmentId: string;
  estimatedDelivery: string;
}

export async function validateOrder(orderId: string): Promise<Order> {
  // Fetch order from database
  const order = await db.orders.findById(orderId);
  if (!order) {
    // Non-retryable: the order doesn't exist
    throw ApplicationFailure.create({
      message: `Order ${orderId} not found`,
      nonRetryable: true,
    });
  }
  if (order.status !== 'pending') {
    throw ApplicationFailure.create({
      message: `Order ${orderId} is not in pending state`,
      nonRetryable: true,
    });
  }
  return order;
}

export async function reserveInventory(order: Order): Promise<void> {
  // Call inventory service — retriable on transient failures
  const response = await inventoryService.reserve({
    orderId: order.orderId,
    items: order.items,
  });
  if (!response.success) {
    if (response.reason === 'OUT_OF_STOCK') {
      throw ApplicationFailure.create({
        message: `Inventory unavailable for order ${order.orderId}`,
        nonRetryable: true,
      });
    }
    throw new Error(`Inventory reservation failed: ${response.reason}`);
  }
}

// Compensation for reserveInventory — used when an order is cancelled after a
// reservation was made. (Assumes the inventory service exposes a release call.)
export async function releaseInventory(order: Order): Promise<void> {
  await inventoryService.release({
    orderId: order.orderId,
    items: order.items,
  });
}

export async function chargePayment(order: Order): Promise<string> {
  // Idempotency key ensures no double charges on retry
  const idempotencyKey = `order-${order.orderId}-charge`;
  const result = await paymentProcessor.charge({
    customerId: order.customerId,
    amount: order.totalAmount,
    idempotencyKey,
  });
  if (!result.success) {
    throw ApplicationFailure.create({
      message: `Payment failed: ${result.declineReason}`,
      nonRetryable: true,  // Don't retry declined payments
    });
  }
  return result.transactionId;
}

export async function createShipmentRecord(
  order: Order,
  transactionId: string
): Promise<string> {
  const shipment = await shippingService.create({
    orderId: order.orderId,
    address: order.shippingAddress,
    items: order.items,
    transactionId,
  });
  return shipment.shipmentId;
}

export async function notifyWarehouse(
  order: Order,
  shipmentId: string
): Promise<void> {
  await warehouseService.dispatch({
    shipmentId,
    items: order.items,
    pickupDeadline: new Date(Date.now() + 2 * 60 * 60 * 1000).toISOString(),
  });
}

export async function sendConfirmationEmail(
  order: Order,
  shipmentId: string,
  estimatedDelivery: string
): Promise<void> {
  await emailService.send({
    to: order.customerId,
    template: 'order-confirmation',
    data: { orderId: order.orderId, shipmentId, estimatedDelivery },
  });
}

export async function updateOrderStatus(
  orderId: string,
  status: string,
  metadata: Record<string, string>
): Promise<void> {
  await db.orders.update(orderId, { status, ...metadata });
}

Step 2: Define the Workflow

// workflows/orderFulfillment.ts
import { proxyActivities, defineSignal, defineQuery, setHandler } from '@temporalio/workflow';
import type * as activities from '../activities';
import type { FulfillmentResult } from '../activities';

// Import activities as proxies — the SDK intercepts these calls
const {
  validateOrder,
  reserveInventory,
  releaseInventory,
  chargePayment,
  createShipmentRecord,
  notifyWarehouse,
  sendConfirmationEmail,
  updateOrderStatus,
} = proxyActivities<typeof activities>({
  startToCloseTimeout: '2 minutes',
  retry: {
    maximumAttempts: 10,
    initialInterval: '1 second',
    backoffCoefficient: 2,
    maximumInterval: '5 minutes',
  },
});

// Signals allow external systems to communicate with running workflows
export const cancelOrderSignal = defineSignal<[{ reason: string }]>('cancelOrder');

// Queries allow external systems to read workflow state
export const getStatusQuery = defineQuery<string>('getStatus');

// Returned when the order is cancelled before payment is captured
export interface CancelledResult {
  orderId: string;
  status: 'cancelled';
  cancelReason: string;
}

export async function orderFulfillmentWorkflow(
  orderId: string
): Promise<FulfillmentResult | CancelledResult> {
  let status = 'starting';
  let cancelRequested = false;
  let cancelReason = '';

  // Register signal handler — workflow can receive signals at any point
  setHandler(cancelOrderSignal, ({ reason }) => {
    cancelRequested = true;
    cancelReason = reason;
  });

  // Register query handler — external systems can read current state
  setHandler(getStatusQuery, () => status);

  try {
    status = 'validating';
    const order = await validateOrder(orderId);

    // Check for cancellation between steps
    if (cancelRequested) {
      await updateOrderStatus(orderId, 'cancelled', { cancelReason });
      return { orderId, status: 'cancelled', cancelReason };
    }

    status = 'reserving-inventory';
    await reserveInventory(order);

    if (cancelRequested) {
      // Release the reservation we just made before cancelling
      await releaseInventory(order);
      await updateOrderStatus(orderId, 'cancelled', { cancelReason });
      return { orderId, status: 'cancelled', cancelReason };
    }

    status = 'charging-payment';
    const transactionId = await chargePayment(order);

    // Past this point, cancellation requires a refund flow — different workflow
    status = 'creating-shipment';
    const shipmentId = await createShipmentRecord(order, transactionId);

    status = 'notifying-warehouse';
    await notifyWarehouse(order, shipmentId);

    status = 'sending-confirmation';
    const estimatedDelivery = calculateEstimatedDelivery(order.shippingAddress);
    await sendConfirmationEmail(order, shipmentId, estimatedDelivery);

    status = 'fulfilled';
    await updateOrderStatus(orderId, 'fulfilled', {
      transactionId,
      shipmentId,
      estimatedDelivery,
    });

    return { orderId, transactionId, shipmentId, estimatedDelivery };

  } catch (error) {
    status = 'failed';
    // `error` is `unknown` under strict TS — narrow before reading .message
    const errorMessage = error instanceof Error ? error.message : String(error);
    await updateOrderStatus(orderId, 'failed', { errorMessage });
    throw error;
  }
}

// Safe to call in workflow code because it's pure given a deterministic clock
function calculateEstimatedDelivery(address: string): string {
  // This would need to be an activity if it made external calls.
  // Inside the workflow sandbox, Date.now() is replaced with a deterministic
  // clock derived from the event history, so this stays replay-safe.
  const deliveryDays = address.includes('CA') ? 3 : 5;
  const delivery = new Date(Date.now() + deliveryDays * 24 * 60 * 60 * 1000);
  return delivery.toISOString().split('T')[0];
}

Step 3: Worker Setup

// worker.ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows/orderFulfillment'),
    activities,
    taskQueue: 'order-fulfillment',
    // Configure worker-level concurrency
    maxConcurrentActivityTaskExecutions: 100,
    maxConcurrentWorkflowTaskExecutions: 40,
  });

  await worker.run();
}

run().catch(console.error);

Step 4: Starting a Workflow and Querying Its State

// client.ts
import { Client } from '@temporalio/client';
import { orderFulfillmentWorkflow, getStatusQuery, cancelOrderSignal } from './workflows/orderFulfillment';

const client = new Client();
const orderId = 'ord-1001'; // example — in a real service this comes from the order-creation path

// Start a workflow
const handle = await client.workflow.start(orderFulfillmentWorkflow, {
  taskQueue: 'order-fulfillment',
  workflowId: `order-${orderId}`,   // Deterministic ID enables deduplication
  args: [orderId],
});

console.log(`Started workflow: ${handle.workflowId}`);

// Query the status at any point — works even if the worker has been restarted
const status = await handle.query(getStatusQuery);
console.log(`Current status: ${status}`);

// Send a cancellation signal — the workflow will handle it at the next safe checkpoint
await handle.signal(cancelOrderSignal, { reason: 'Customer requested cancellation' });

// Wait for completion
const result = await handle.result();
console.log('Order fulfilled:', result);

flowchart TD
    A[Client: StartWorkflow orderId] --> B[Temporal Server]
    B --> C{Workflow Dispatcher}
    C --> D[validateOrder Activity]
    D --> E{Cancelled?}
    E -->|No| F[reserveInventory Activity]
    E -->|Yes| Z[updateOrderStatus: cancelled]
    F --> G{Cancelled?}
    G -->|No| H[chargePayment Activity]
    G -->|Yes| Y[releaseInventory + updateOrderStatus: cancelled]
    H --> I[createShipmentRecord Activity]
    I --> J[notifyWarehouse Activity]
    J --> K[sendConfirmationEmail Activity]
    K --> L[updateOrderStatus: fulfilled]
    L --> M[Return FulfillmentResult]

    style H fill:#ff9900
    style D fill:#4CAF50,color:#fff
    style F fill:#4CAF50,color:#fff
    style I fill:#4CAF50,color:#fff
    style J fill:#4CAF50,color:#fff
    style K fill:#4CAF50,color:#fff

    N[cancelOrderSignal] -.->|async signal| E
    N -.->|async signal| G

This is the power of the pattern in practice. The workflow is twelve steps of business logic written as sequential code. The failure recovery, the retry logic, the cancellation handling, the state visibility — all of it is handled by Temporal, not by manual checkpointing or complex queue topologies. If a worker dies between any two steps, the next worker picks up exactly where it left off.

Temporal for AI Agent Orchestration

The same characteristics that make Temporal ideal for order fulfillment make it increasingly compelling for AI agent orchestration — and the connection is more direct than it might appear.

Modern AI agents face the durable execution problem in an amplified form:

Long execution times: An agent tasked with researching a technical topic, drafting a report, and scheduling a review might run for two to four hours. It might spawn sub-agents, wait for their results, and synthesize them. This is not a job you want to lose mid-execution because a container was evicted.

Unreliable external calls: LLM API calls fail. Rate limits get hit. Network timeouts happen. A naive agent that doesn't retry LLM calls with appropriate backoff will fail far more often than necessary. Activity retry policies are exactly what you need here.

Human-in-the-loop checkpoints: Many production AI workflows require human approval at key decision points. An agent that drafts a customer email needs a human to review it before sending. Temporal's Signal mechanism is the cleanest way to implement this: the workflow pauses waiting for a humanApprovedSignal, and a human can approve via an API call that sends the signal at any point — even days later.

Audit trails: Temporal's event history is an automatic, immutable audit trail of every decision the agent made, every activity it executed, every signal it received. For AI agents making consequential decisions, this observability is invaluable.

Here's a pattern for a multi-step AI research agent:

// workflows/researchAgent.ts
import { proxyActivities, defineSignal, setHandler, condition, workflowInfo } from '@temporalio/workflow';
import type * as activities from '../activities'; // the research activities are assumed to be defined alongside

const {
  searchWeb,
  fetchPageContent,
  summarizeContent,
  draftReport,
  sendForHumanReview,
  publishReport,
} = proxyActivities<typeof activities>({
  startToCloseTimeout: '5 minutes',
  retry: {
    maximumAttempts: 3,
    initialInterval: '2 seconds',
    backoffCoefficient: 2,
  },
});

// Human approval signal
export const approveReportSignal = defineSignal<[{ approved: boolean; feedback?: string }]>('approveReport');

export async function researchAgentWorkflow(topic: string): Promise<string> {
  let humanApproval: { approved: boolean; feedback?: string } | null = null;

  setHandler(approveReportSignal, (approval) => {
    humanApproval = approval;
  });

  // Step 1: Search for relevant sources
  const searchResults = await searchWeb({ query: topic, maxResults: 20 });

  // Step 2: Fetch and summarize content from top sources (fan-out)
  const summaries = await Promise.all(
    searchResults.slice(0, 5).map(result =>
      fetchPageContent({ url: result.url })
        .then(content => summarizeContent({ content, topic }))
    )
  );

  // Step 3: Draft a consolidated report using an LLM
  const draft = await draftReport({
    topic,
    summaries,
    wordTarget: 2000,
  });

  // Step 4: Send for human review — workflow pauses here
  await sendForHumanReview({
    draft,
    reviewerEmail: 'editor@amtocsoft.com',
    workflowId: workflowInfo().workflowId,
  });

  // Wait up to 48 hours for human approval
  const approved = await condition(
    () => humanApproval !== null,
    '48 hours'
  );

  if (!approved || !humanApproval?.approved) {
    // Timeout or rejection — revise based on feedback
    if (humanApproval?.feedback) {
      const revisedDraft = await draftReport({
        topic,
        summaries,
        wordTarget: 2000,
        revisionNotes: humanApproval.feedback,
      });
      return revisedDraft;
    }
    throw new Error('Report not approved within deadline');
  }

  // Step 5: Publish the approved report
  const publishedUrl = await publishReport({ content: draft });
  return publishedUrl;
}

The condition() call is a Temporal primitive that blocks the workflow until a predicate becomes true or a timeout fires. The workflow is suspended in the Temporal server — no worker thread is blocked, no memory is consumed while waiting. When the human sends the approval signal (via a simple API call), the server wakes the workflow and delivers a new workflow task to a worker.
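The semantics of condition() are easy to pin down in plain TypeScript — a predicate racing against a timeout. This toy polls, whereas Temporal's version is event-driven and consumes no worker resources while the workflow is suspended; it's here only to make the return contract concrete (true if the predicate became true, false on timeout):

```typescript
// Toy condition(): resolves true when predicate() becomes true, false on timeout.
// Polling is NOT how Temporal implements this — it's just the simplest way to
// demonstrate the same semantics locally.
async function toyCondition(
  predicate: () => boolean,
  timeoutMs: number,
  pollMs = 5
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (predicate()) return true;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return predicate(); // one last check at the deadline
}

// Simulate a human approval arriving before the deadline:
let approval: { approved: boolean } | null = null;
setTimeout(() => { approval = { approved: true }; }, 20);
```

Waiting on `() => approval !== null` with a generous timeout resolves true once the "approval" lands; the same call against a predicate that never fires resolves false when the timeout elapses — exactly the shape the research-agent workflow above relies on.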

This pattern — long-running agent with human-in-the-loop checkpoints, parallel tool calls fanned out as activities, comprehensive retry on LLM calls — is becoming the de facto architecture for production AI workflows at companies that have learned from early failures.

Temporal vs Alternatives

There are several tools in this space, and Temporal is not always the right choice. Here's an honest comparison:

| Feature | Temporal | AWS Step Functions | Apache Airflow | BullMQ | Inngest |
|---|---|---|---|---|---|
| Execution model | Code-first, durable replay | Visual JSON state machine | DAG scheduler | Queue-based job runner | Event-driven, code-first |
| Language support | TypeScript, Go, Java, Python, PHP | Any (via Lambda) | Python | TypeScript/Node.js | TypeScript, Python |
| Durability | Full (event-sourced replay) | Full (managed AWS) | Partial (depends on config) | Limited (Redis-based) | Full (event-sourced) |
| Long-running workflows | Excellent (years, if needed) | Good (1 year max) | Good (depends on task) | Limited (Redis TTL) | Good (months) |
| Human-in-the-loop | First-class (Signals) | Supported via Wait states | Awkward (external trigger) | Manual (custom logic) | Supported (durable sleeps) |
| Replay/debugging | Excellent (full history UI) | Good (execution history) | Good (DAG view) | Limited | Good (event stream) |
| Local development | Local server (easy) | LocalStack (awkward) | Complex setup | Simple (Redis) | Dev server available |
| Scalability | Excellent (horizontal) | Excellent (serverless) | Good (with tuning) | Good (Redis cluster) | Good (serverless) |
| Operational complexity | High (self-host) / Low (Temporal Cloud) | Low (fully managed) | High | Low | Low (fully managed) |
| Cost model | OSS + Cloud fees | Per-state-transition | OSS + infra | OSS + Redis | Per-step pricing |
| Learning curve | Steep (determinism rules) | Moderate (JSON verbosity) | Moderate (DAG concepts) | Low | Low to moderate |
| Best for | Complex multi-step workflows, AI agents | AWS-native, Lambda-heavy workflows | Batch data pipelines | Simple job queues | Event-driven apps on serverless |

Step Functions is often the right choice if you're already deep in the AWS ecosystem and your workflows are expressible as the JSON state machine format — which is genuinely viable for many cases. The managed infrastructure and native IAM integration are real advantages.

Airflow is built for scheduled batch data pipelines and ETL, not for arbitrary workflows with complex branching and external signals. It's the wrong tool for real-time or event-driven workflows.

BullMQ is excellent for traditional background job processing — sending emails, resizing images, processing uploads. If your jobs are short, independent, and don't need to maintain state across multiple steps, BullMQ is faster to set up, easier to reason about, and plenty reliable.

Inngest is worth serious consideration if you want durable execution without the operational overhead of running Temporal. It's designed for serverless environments, has a clean developer experience, and handles the common patterns well. The tradeoff is less control and a pricing model that gets expensive at scale.

When NOT to Use Temporal

Temporal's overhead is real, and adding it to systems that don't need it creates unnecessary complexity.

Short, independent jobs: If you're sending a welcome email, resizing a profile picture, or running a quick data validation — BullMQ or a simple message queue is the right tool. Temporal has observable latency in the low hundreds of milliseconds for simple workflows due to the round trips to the server. That overhead is irrelevant for two-hour workflows; it's significant for sub-second jobs.

Sub-second latency requirements: Temporal is not for real-time processing. If your SLA requires P99 latency under 100ms, you're looking at stream processing (Kafka Streams, Flink), not workflow orchestration.

Simple linear pipelines: If your workflow is always "do step A, then step B, then step C" with no branching, no signals, no fan-out, and each step takes under a second — the durability guarantees probably aren't worth the operational overhead. A well-designed queue-based pipeline with idempotent steps handles this fine.

Small teams without operational capacity: Self-hosted Temporal requires running Cassandra or PostgreSQL for the event store, managing the Temporal server cluster, handling upgrades, monitoring health. This is non-trivial. Temporal Cloud removes most of this but comes with real costs. If your team doesn't have the capacity to maintain this infrastructure, you might want to start with a simpler tool and migrate when the complexity justifies it.

Systems where code agility matters more than durability: Every time you change workflow logic, you have to think about versioning (discussed below). If your workflow logic changes frequently and you value the ability to deploy quickly without worrying about in-flight workflows, this is a source of friction.

Production Considerations

Workflow versioning is the hardest operational problem with Temporal, and it's worth understanding before you commit to the platform.

Because the workflow function replays from history, changing the workflow code can break replay of in-flight workflows. If you deploy a new version of your workflow that inserts a new activity call between existing ones, any workflow instance that was at step three under the old code will be confused when it replays under the new code and encounters a different sequence of events.

Temporal provides the patched() API for safe versioning:

```typescript
import { patched, proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

// Activities are invoked through proxies; the timeout value is illustrative.
const { validateOrder, reserveInventory, runFraudCheck, chargePayment } =
  proxyActivities<typeof activities>({ startToCloseTimeout: '1 minute' });

export async function orderFulfillmentWorkflow(orderId: string) {
  const order = await validateOrder(orderId);
  await reserveInventory(order);

  // Added in v2: fraud check before payment
  if (patched('add-fraud-check')) {
    await runFraudCheck(order); // New step — only runs for workflows created after this deploy
  }

  const transactionId = await chargePayment(order);
  // ... rest of workflow
}
```

The patched() call returns true for new workflow instances and false for old ones replaying, letting both code paths coexist safely. Once every in-flight workflow started before the change has completed, you replace patched() with deprecatePatch('add-fraud-check') for one deploy cycle, and only then delete the branch entirely. This versioning discipline is the main operational overhead that teams underestimate.

Namespace management: In production, use separate Temporal namespaces for production and staging environments, and consider separate namespaces for different business domains if you have teams that need isolation. Namespaces provide fault isolation — a runaway workflow in the order processing namespace can't affect the payment namespace.

Worker scaling: Temporal workers are stateless and horizontally scalable. Standard Kubernetes HPA based on CPU/memory works well. For more precise scaling, Temporal Cloud exposes metrics on task queue backlog depth that you can autoscale against.

Temporal Cloud vs self-hosted: For most production use cases, Temporal Cloud is worth the cost. The managed service handles the operational complexity of running the persistence backend, the server cluster, TLS, backups, and upgrades. Self-hosting makes sense at very high scale or when you have specific data residency requirements.

Observability: The Temporal UI provides a per-workflow view of the event history, which is excellent for debugging individual workflow failures. For aggregate monitoring, export metrics to your existing observability stack (Prometheus/Grafana). Track: workflow execution latency, activity retry rates, task queue backlog depth, and workflow failure rates by type.

Conclusion

Durable execution is not a new idea — the database community has been persisting transactional state for decades, and event sourcing has been a known pattern in distributed systems for years. What Temporal has done is productionize this idea as a general-purpose execution substrate that works for arbitrary code, not just database operations. The result is a platform that shifts the mental model for building long-running distributed processes: instead of managing failure recovery manually, you write sequential code that assumes infinite reliability, and the platform provides the durability guarantees.

The convergence with AI agents is what makes this pattern increasingly important right now. The same problems that make multi-step business workflows hard — partial failures, non-idempotent external calls, long execution horizons, the need for human checkpoints — are amplified in AI agent workloads. An agent that runs for hours, makes dozens of LLM API calls, spins up sub-agents, and waits for human approval is exactly the kind of workload that durable execution was built for.

The honest assessment: Temporal has a real learning curve, meaningful operational overhead, and the workflow versioning story requires discipline. It's not the right tool for every use case. But for complex, multi-step workflows where reliability and observability are requirements rather than nice-to-haves — payment flows, order fulfillment, data pipelines, and increasingly AI agent orchestration — the investment pays back quickly. Every hour you don't spend writing reconciliation scripts for partial failures is an hour you can spend building features.

The infrastructure for reliable long-running computation is here. The question is whether your workflows are complex enough to justify it. If you're building AI agents that run for more than a few minutes, the answer is almost certainly yes.

Sources

  • [Temporal Documentation](https://docs.temporal.io)
  • [Temporal TypeScript SDK](https://github.com/temporalio/sdk-typescript)
  • [Temporal Blog: What is Durable Execution?](https://temporal.io/blog/what-is-durable-execution)
  • [Temporal Samples Repository](https://github.com/temporalio/samples-typescript)
  • [Inngest Documentation](https://www.inngest.com/docs)
  • [AWS Step Functions Developer Guide](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html)
  • [Apache Airflow Documentation](https://airflow.apache.org/docs/)
