Anyone working with Lambda knows the story: you have a process that needs to wait for something. A human approval, a response from a slow API, an operation that will take longer than the famous 15-minute timeout. The traditional solution? Step Functions. It works well, but it means more infrastructure to manage, more state to coordinate, and more money leaving your wallet.
Durable Functions arrived to change this. It's not a new concept if you come from the Azure world, where it has existed for years. But now it's available in AWS Lambda, and the pitch is simple: write sequential code that looks synchronous but can pause, sleep, and resume days later if necessary. All without paying for the wait time.
The most interesting scenario I see? Integrations with LLMs. Think of an agent that needs to call multiple tools, wait for human responses along the way, and maintain context for hours. With Durable Functions, you don’t need a complex queue architecture. It simply works.
How the Magic Happens: Checkpoint and Replay
Here’s the thing: Durable Functions doesn’t keep your Lambda running for hours. That would be too expensive. What it does is a checkpoint and replay mechanism.
When you call `context.step()`, Lambda records the result in a log. If the function needs to pause, it saves this checkpoint and simply stops. When it resumes, the function is invoked again from scratch. That's right, from the beginning. But here's the trick: the steps that already executed don't run again. The system retrieves their values from the checkpoint log and moves forward.
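The mechanism is easier to internalize with a toy model. The sketch below is not the real SDK, just a few lines of plain Python simulating a checkpoint log: step bodies run once and get recorded, and a replay runs from the top but reads recorded results instead of re-executing.

```python
# Toy model of checkpoint and replay (plain Python, not the real SDK).
checkpoint_log = {}   # step name -> recorded result
executed = []         # tracks which step bodies actually ran

def step(name, fn):
    if name in checkpoint_log:       # replay: result comes from the log
        return checkpoint_log[name]
    result = fn()                    # first run: execute and checkpoint
    executed.append(name)
    checkpoint_log[name] = result
    return result

def workflow():
    order = step("create_order", lambda: {"orderId": "o-1"})
    step("validate", lambda: True)
    return order

workflow()   # first invocation: both step bodies execute
workflow()   # simulated replay: runs from the top, but bodies are skipped
print(executed)   # → ['create_order', 'validate']
```

Even though `workflow()` runs twice from the beginning, each step body executes exactly once. That is the whole trick behind resuming "where it left off" without keeping anything running.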
It's elegant, but it has a crucial implication: your code needs to be deterministic. If you use `datetime.now()` outside a step to make a decision, you'll have problems. On replay, the value will be different, and the execution may diverge from the original path. The same goes for random numbers, UUIDs, anything that changes between executions.
In Practice: An Approval Workflow
I built a demo that illustrates the concept well. It’s an order approval system where the function literally stops and waits for a human to click a button.
The workflow has five stages: create the order in DynamoDB, validate the data, wait for approval, process or cancel according to the decision, and notify. The interesting part is the third stage.
```python
callback = context.create_callback(
    name="approval_callback",
    config=CallbackConfig(timeout=Duration.from_minutes(5)),
)

# Save the callback_id in the database for the approval handler to find
table.update_item(
    Key={"orderId": order["orderId"]},
    UpdateExpression="SET callbackId = :callbackId",
    ExpressionAttributeValues={":callbackId": callback.callback_id},
)

# Here the execution suspends. You pay nothing while waiting.
approval_result = callback.result()
```
When this `callback.result()` is called, the Lambda simply stops executing. There's no instance running, no charge. It can stay like this for minutes or hours. When someone approves the order via API, another function calls `send_durable_execution_callback_success()` with the result, and the original function resumes exactly where it left off.
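For completeness, the approver side might look like the sketch below. The payload builder is plain Python; the actual call is left commented out because, while `send_durable_execution_callback_success` is the operation named above, the exact client and parameter names shown are assumptions, not confirmed API.

```python
import json

def build_approval_result(order_id: str, approved: bool) -> str:
    """Serialize the decision that the waiting workflow will receive."""
    return json.dumps({"orderId": order_id, "approved": approved})

# In the approval handler (client and parameter names are assumptions):
# import boto3
# lambda_client = boto3.client("lambda")
# lambda_client.send_durable_execution_callback_success(
#     CallbackId=callback_id,  # read back from the DynamoDB item
#     Result=build_approval_result(order_id, approved=True),
# )
```

The important part is the contract: whatever the approver sends becomes the return value of `callback.result()` in the suspended workflow.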
Every operation with side effects is inside a `@durable_step`. Create order, validate, process, cancel, notify. This ensures that if something goes wrong in the middle and the function needs to restart, the already completed steps won't execute again.
Dissecting the Code
Let’s go through the complete workflow. It starts with the SDK imports:
```python
from aws_durable_execution_sdk_python import (
    DurableContext,
    durable_execution,
    durable_step,
    StepContext,
)
from aws_durable_execution_sdk_python.config import CallbackConfig, Duration
```
`DurableContext` is the object you receive in the handler and use for all durable operations. `durable_execution` is the decorator that turns your function into a durable workflow. `durable_step` marks functions that should be checkpointed. `StepContext` is automatically injected into steps.
Each step is a decorated function:
```python
@durable_step
def create_order(step_context: StepContext, event: dict) -> dict:
    order_id = str(uuid.uuid4())
    order = {
        "orderId": order_id,
        "customerName": event.get("customerName", "Unknown"),
        "status": "pending_approval",
        "createdAt": datetime.utcnow().isoformat(),
    }
    table.put_item(Item=order)
    return order
```
Notice that `uuid.uuid4()` and `datetime.utcnow()` are inside the step. This is intentional. If they were outside, they would have different values on replay. Inside the step, the entire result is persisted in the checkpoint, so on replay the SDK simply returns the order that was saved the first time.
The main handler orchestrates everything:
```python
@durable_execution
def handler(event: dict, context: DurableContext) -> dict:
    # Step 1: Create order
    order = context.step(create_order(event))

    # Step 2: Validate
    context.step(validate_order(order))

    # Step 3: Create callback and wait
    callback = context.create_callback(
        name="approval_callback",
        config=CallbackConfig(timeout=Duration.from_minutes(5)),
    )

    # Store the callback_id for another function to respond
    table.update_item(
        Key={"orderId": order["orderId"]},
        UpdateExpression="SET callbackId = :callbackId",
        ExpressionAttributeValues={":callbackId": callback.callback_id},
    )

    # Suspends here
    approval_result = callback.result()

    # Step 4: Process or cancel
    approved = approval_result.get("approved")
    if approved:
        context.step(process_order(order))
    else:
        context.step(cancel_order(order))

    # Step 5: Notify
    context.step(send_notification(order, approved))

    return {"statusCode": 200, "body": json.dumps({...})}
```
`context.step()` is how you execute a durable step. It calls the function, persists the result, and on replay retrieves it directly from the checkpoint without re-executing.
`context.create_callback()` creates a wait point, and `callback.result()` is where the function suspends. The SDK generates a unique `callback_id` that you need to store somewhere, so another function can send the response later.
The branching after the callback is interesting: the `if approval_result.get("approved")` check runs on replay too, but since `approval_result` comes from the checkpoint, the decision will be the same. Determinism preserved.
The Gotchas You’ll Encounter
Yan Cui wrote about five pitfalls of Durable Functions, and it's worth knowing them before using this in production.
The first is the non-deterministic code I mentioned. If you use timestamps or random values for branching decisions outside of steps, prepare for strange behavior. The solution is to capture these values inside steps, where they're recorded in the checkpoint.
The second is side effects outside of steps. Updating the database, calling an external API, sending an email: all of this needs to be in a step. Otherwise, it will execute again on replay. Imagine sending the same email three times because the function was replayed twice.
The third is more subtle: mutating closure variables inside a step. The SDK doesn't persist these mutations. On replay, the variable will have the original value, not the modified one. The code seems to work in tests, but in production it can behave bizarrely.
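A toy simulation of the replay model (plain Python, not the SDK) makes the failure visible: the mutation happens only on the first run, because on replay the step body never executes.

```python
# Toy replay model: closure mutations inside a step body are NOT persisted.
checkpoint = {}   # step name -> recorded result

def step(name, fn):
    if name not in checkpoint:   # body runs only on the first execution
        checkpoint[name] = fn()
    return checkpoint[name]

def run_workflow():
    counter = {"n": 0}   # closure variable, recreated on every (re)invocation

    def work():
        counter["n"] += 1    # mutation is not part of the checkpointed result
        return "done"

    step("work", work)
    return counter["n"]

first = run_workflow()    # body runs: counter ends at 1
replay = run_workflow()   # replay: body skipped, counter stays at 0
print(first, replay)      # → 1 0
```

The rule of thumb that follows: anything a step computes that later code depends on must be part of the step's return value, never a side mutation.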
The fourth is using dynamic names for steps. If a step's name changes between executions, the system can't find its result in the checkpoint. Always use static, predictable names.
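A toy replay model (plain Python, not the SDK) shows why: a name that changes between invocations never matches the checkpoint, so the body re-executes on every replay, while a static name is found and skipped.

```python
# Toy replay model: dynamic step names defeat checkpoint lookup.
checkpoint, executed = {}, []

def step(name, fn):
    if name not in checkpoint:   # unknown name: the body runs
        executed.append(name)
        checkpoint[name] = fn()
    return checkpoint[name]

# BAD: a name built from a changing value (here, a counter standing in
# for a timestamp). Each "replay" sees a new name, so the body runs again.
for invocation in range(2):
    step(f"notify_{invocation}", lambda: "sent")

# GOOD: a static name. The second run finds the checkpoint and skips the body.
for invocation in range(2):
    step("notify", lambda: "sent")

print(executed)   # → ['notify_0', 'notify_1', 'notify']
```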
The fifth is more technical: results larger than 256 KB in child contexts are not stored. If you use `parallel()` or `map()` and the result exceeds this limit, the context will re-execute on replay. If there was non-durable code inside, it will run again.
About Cost and When to Use
The strong point here is cost. You don’t pay for the 15 minutes, 1 hour, or 24 hours that the function sits waiting. The ExecutionTimeout can be configured up to one year. One year. Think of the possibilities.
But it’s not a silver bullet. If you have a complex workflow with many parallel branches and elaborate conditional logic, Step Functions might still make more sense for visualization and debugging. Durable Functions shines when you want simplicity: linear code that needs long pauses.
The LLM scenario I mentioned at the beginning is perfect. A conversational agent that needs human-in-the-loop, processes that mix automation with manual approvals, integrations with slow systems. All of this becomes simpler to write and cheaper to run.
Conclusion
AWS Lambda Durable Functions fills a gap that existed for years. It’s not revolutionary for those who know the Azure equivalent, but having this available natively in the AWS ecosystem, with SAM and CloudFormation, greatly facilitates adoption.
The demo I built shows the most common pattern: a workflow that needs human approval. Three functions, a DynamoDB table, and a simple frontend. The order is created, the function pauses waiting for callback, and resumes when someone decides. The code is linear, easy to understand, and you don’t pay for the waiting minutes.
If you’re hitting the limits of traditional Lambda or spending too much with Step Functions for simple cases, it’s definitely worth experimenting.
Insights & Takeaways
- Checkpoint and replay is the central mechanism: the function doesn't keep running; it stops and restarts from zero when it resumes, recovering already computed results from the checkpoint log.
- Determinism is mandatory: timestamps, random values, and anything that changes between executions must be captured inside steps, or you'll see inconsistent behavior on replay.
- Side effects only inside steps: API calls, database updates, sending notifications, everything that shouldn't repeat needs to be encapsulated in a `@durable_step`.
- You don't pay for wait time: when the function suspends waiting for a callback or a wait, there's no charge. This completely changes the economics of long-running processes.
- LLMs and human-in-the-loop are ideal use cases: the combination of long executions, pauses for human intervention, and simple linear code is exactly where Durable Functions shines.
