Retries & Error Handling
Configure retry strategies and handle failures gracefully
Every step.do() call has retry support built in. Network hiccup? API rate limit? Transient database error? Ablauf retries the step automatically using Durable Object alarms (non-blocking, of course).
Default Behavior
By default, every step gets 3 attempts with a 1 second base delay and exponential backoff:
```ts
// Uses defaults: { limit: 3, delay: "1s", backoff: "exponential" }
const data = await step.do('fetch-data', async () => {
  const res = await fetch('https://api.example.com/data');
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
});
```

If the step fails all 3 times, it throws StepRetryExhaustedError and the workflow stops.
Per-Step Overrides
Need more retries for a critical operation? Override the defaults:
```ts
const data = await step.do(
  'critical-operation',
  async () => {
    // This step gets 10 attempts with 5s delay
    return await somethingFragile();
  },
  {
    retries: {
      limit: 10,
      delay: '5s',
      backoff: 'exponential',
    },
  },
);
```

Workflow-Level Defaults
Set default retry config for all steps in a workflow:
```ts
import { defineWorkflow } from '@der-ablauf/workflows';

const MyWorkflow = defineWorkflow((t) => ({
  type: 'my-workflow',
  input: t.object({
    /* ... */
  }),
  defaults: {
    retries: { limit: 5, delay: '2s', backoff: 'linear' },
  },
  run: async (step, payload) => {
    // All steps in this workflow default to 5 attempts with linear backoff
    await step.do('step-1', async () => {
      /* ... */
    });
    await step.do('step-2', async () => {
      /* ... */
    });
  },
}));
```

Per-step overrides still work — they take precedence over workflow defaults.
Backoff Strategies
Ablauf supports three backoff strategies:
| Strategy | Formula | Example (1s base delay) |
|---|---|---|
| "fixed" | delay | 1s, 1s, 1s, 1s |
| "linear" | delay * attempt | 1s, 2s, 3s, 4s |
| "exponential" | delay * 2^(attempt-1) | 1s, 2s, 4s, 8s |
Exponential backoff is usually the right choice — it gives temporary issues time to resolve without hammering a struggling service.
Retry delays use duration strings like "500ms", "1s", "30s", "5m", or "1h".
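If it helps to see the table as code, here is a small sketch of how each strategy scales the base delay. It is illustrative only: it mirrors the formulas above, not Ablauf's internals.

```ts
// Illustrative only: mirrors the formulas in the table above, not Ablauf's internals.
type Backoff = 'fixed' | 'linear' | 'exponential';

function retryDelayMs(baseMs: number, attempt: number, backoff: Backoff): number {
  switch (backoff) {
    case 'fixed':
      return baseMs; // 1s, 1s, 1s, ...
    case 'linear':
      return baseMs * attempt; // 1s, 2s, 3s, ...
    case 'exponential':
      return baseMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
  }
}

// With a 1s base delay, attempt 3 waits:
// fixed → 1000 ms, linear → 3000 ms, exponential → 4000 ms
console.log(retryDelayMs(1000, 3, 'exponential')); // 4000
```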
When Retries Are Exhausted
When all retry attempts fail, the step throws StepRetryExhaustedError. This propagates to the workflow's run() function, marking the workflow as errored.
```ts
try {
  await step.do('flaky-operation', async () => {
    // This might fail...
  });
} catch (err) {
  if (err instanceof StepRetryExhaustedError) {
    // Log the failure, send an alert, etc.
    console.error(`Step failed after ${err.attempts} attempts`);
  }
  throw err;
}
```

For the complete list of error classes and their HTTP status codes, see the API Reference.
Skipping Retries with NonRetriableError
Sometimes retrying is pointless — the error is permanent and will never succeed. For these cases, throw NonRetriableError inside your step function to immediately fail the step without retrying:
```ts
import { defineWorkflow, NonRetriableError } from '@der-ablauf/workflows';

const order = defineWorkflow((t) => ({
  type: 'process-order',
  input: t.object({ userId: t.string() }),
  run: async (step, payload) => {
    const user = await step.do('validate-user', async () => {
      const user = await getUser(payload.userId);
      if (user.banned) {
        throw new NonRetriableError('User is banned');
      }
      return user;
    });
    // ...
  },
}));
```

When NonRetriableError is thrown:
- The step is immediately marked as failed — no retries are attempted, regardless of the retry configuration
- The error is recorded in the step's retry history (visible in the dashboard)
- The workflow transitions to errored
NonRetriableError extends plain Error, not WorkflowError. It's designed to be simple for user code — no error codes or HTTP statuses needed.
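Because it's a plain Error subclass, you can also define domain-specific errors on top of it. A minimal sketch, assuming Ablauf detects non-retriable failures via instanceof checks (the UserBannedError class is illustrative, not part of the library):

```ts
import { NonRetriableError } from '@der-ablauf/workflows';

// Illustrative subclass (assumes Ablauf treats any `instanceof NonRetriableError`
// as non-retriable, so subclasses behave the same way).
class UserBannedError extends NonRetriableError {
  constructor(public readonly userId: string) {
    super(`User ${userId} is banned`);
    this.name = 'UserBannedError';
  }
}

// Inside a step: throwing the subclass fails the step immediately, and the
// extra context (userId) is available when inspecting the error later.
// if (user.banned) throw new UserBannedError(user.id);
```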
When to Use NonRetriableError
Use it for errors where retrying would be wasteful (see the sketch after this list):
- Business rule violations — user is banned, account is suspended
- Authorization failures — invalid API key, insufficient permissions
- Invalid data — malformed input discovered mid-step
- Resource gone — the thing you need no longer exists
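For example, an HTTP-backed step can treat auth and not-found responses as permanent while letting server errors go through normal retries. A sketch that lives inside a workflow's run() function, with NonRetriableError imported as shown earlier (the endpoint and the exact status classification are illustrative):

```ts
const invoice = await step.do('fetch-invoice', async () => {
  // Illustrative endpoint
  const res = await fetch('https://api.example.com/invoices/123');

  if (res.status === 401 || res.status === 403 || res.status === 404) {
    // Permanent: bad credentials or a missing resource won't fix themselves
    throw new NonRetriableError(`Invoice fetch rejected with HTTP ${res.status}`);
  }
  if (!res.ok) {
    // Transient (5xx, 429, ...): a plain Error goes through normal retry logic
    throw new Error(`HTTP ${res.status}`);
  }
  return res.json();
});
```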
Crash & OOM Recovery
Cloudflare Durable Objects run in isolates with a 128 MB memory limit. If a step exceeds this limit (or the isolate crashes for any reason), the entire isolate is killed — your step's try/catch never executes, and the error is never recorded.
Ablauf handles this automatically using write-ahead step tracking. Before executing your step function, Ablauf persists the step as "running" in SQLite with an incremented attempt counter. If the isolate dies mid-execution:
- The step remains in "running" state in durable storage (SQLite survives isolate resets)
- A safety alarm (set before replay started) fires and triggers the alarm handler
- The alarm handler replays the workflow, detecting the orphaned "running" step
- The crash is recorded in the step's retry history
- Normal retry logic kicks in — backoff delay, then re-execution
- If retries are exhausted, the step fails permanently with StepRetryExhaustedError
This means OOM crashes are handled identically to normal step failures — no infinite loops, no zombie workflows. Ablauf sets a safety alarm before every replay so there's always a trigger for crash recovery, whether the OOM happened during initial execution, a resume, or an alarm-driven retry.
```
Attempt 1: step.do() → write-ahead (running, attempts=1) → fn() → OOM 💥
           ↓ isolate killed, safety alarm fires
Recovery:  alarm() → replay() → detects status="running" → schedule retry with backoff
           ↓ retry alarm fires
Attempt 2: step.do() → write-ahead (running, attempts=2) → fn() → success ✓
```

If a step deterministically exceeds 128 MB, retries won't help — it will fail on every attempt and eventually exhaust retries. Move memory-heavy work to a separate Worker via service binding RPC so the DO isolate stays safe. The separate Worker has its own 128 MB and its crash won't kill your workflow's state.
What Survives an Isolate Crash
| Data | Survives? | Why |
|---|---|---|
| Completed step results | Yes | Persisted in SQLite before the crash |
| Workflow metadata & payload | Yes | Written to durable storage on creation |
| In-flight step attempt counter | Yes | Write-ahead persists before fn() runs |
| The step function's return value | No | Isolate died before it could be saved |
| JavaScript variables in memory | No | Isolate memory is wiped |
Result Size Limits
Every workflow has a cumulative memory budget for step results. By default, the total serialized size of all completed step results cannot exceed 64 MB. This prevents workflows from accumulating enough data to trigger an OOM crash during replay.
After each step.do() execution, Ablauf measures the serialized result size and checks it against the remaining budget. If the new result would push the total over the limit, the step fails before the result is stored.
How It Works
```
Step 1: result = 2 MB  → total:  2 MB / 64 MB ✓
Step 2: result = 10 MB → total: 12 MB / 64 MB ✓
Step 3: result = 55 MB → total: 67 MB / 64 MB ✗ → fails
```

Default Behavior
By default, exceeding the budget throws a NonRetriableError — the step fails immediately without retrying (since retrying will produce the same oversized result):
```ts
const MyWorkflow = defineWorkflow((t) => ({
  type: 'my-workflow',
  input: t.object({ /* ... */ }),
  // Default: 64 MB budget, non-retryable on overflow
  run: async (step, payload) => { /* ... */ },
}));
```

Custom Configuration
Override the defaults with resultSizeLimit:
```ts
const HeavyWorkflow = defineWorkflow((t) => ({
  type: 'heavy-workflow',
  input: t.object({ /* ... */ }),
  resultSizeLimit: {
    maxSize: '128mb', // Increase the budget
    onOverflow: 'retry', // Use retry logic instead of immediate failure
  },
  run: async (step, payload) => { /* ... */ },
}));
```

| Option | Type | Default | Description |
|---|---|---|---|
| maxSize | string | "64mb" | Cumulative byte budget. Accepts: "512kb", "64mb", "1gb". |
| onOverflow | "fail" \| "retry" | "fail" | "fail" throws NonRetriableError. "retry" throws a retryable error. |
The 64 MB default leaves ~64 MB headroom for the engine, your workflow code, and deserialized objects within the 128 MB isolate limit. Raising it above 100 MB is risky — consider offloading heavy data to external storage instead.
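One way to stay under the budget is to keep step results small: write large payloads to external storage inside the step and return only a reference. A sketch that lives inside a workflow's run() function, assuming an R2 bucket binding named env.ARTIFACTS and illustrative helpers buildLargeReport and sendReportEmail:

```ts
const reportKey = await step.do('generate-report', async () => {
  const report = await buildLargeReport(payload); // potentially many MB
  const key = `reports/${payload.userId}.json`;

  // Store the heavy data in R2; only the small key becomes the step result
  await env.ARTIFACTS.put(key, JSON.stringify(report));
  return key;
});

await step.do('email-report', async () => {
  // Later steps re-read the data on demand instead of carrying it in result history
  const obj = await env.ARTIFACTS.get(reportKey);
  if (!obj) throw new Error(`Missing report object: ${reportKey}`);
  await sendReportEmail(payload.userId, await obj.text());
});
```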
Size Strings
Size strings follow the same pattern as duration strings:
| Format | Example | Description |
|---|---|---|
| Nb | "100b" | Bytes |
| Nkb | "512kb" | Kilobytes |
| Nmb | "64mb" | Megabytes |
| Ngb | "1gb" | Gigabytes |
Invalid size strings throw InvalidSizeError.
Best Practices
Set realistic retry limits. If an API is down, 100 retries won't help. Use retries for transient issues, not systemic failures.
Use exponential backoff for external services. Linear or fixed backoff can overwhelm a struggling service.
Don't retry non-idempotent operations blindly. If retrying a step could cause duplicate charges, emails, or data corruption, add idempotency checks inside the step.
```ts
await step.do(
  'send-email',
  async () => {
    // Check if email was already sent before retrying
    const alreadySent = await checkEmailLog(userId);
    if (alreadySent) return;
    await sendEmail(userId, 'Welcome!');
  },
  {
    retries: { limit: 5 },
  },
);
```

Offload memory-heavy steps. If a step does inference, image processing, or anything that might exceed 128 MB, call it via fetch() or service binding RPC so it runs in a separate isolate. Your DO stays safe and can retry if the external call fails.
```ts
await step.do('run-inference', async () => {
  // Runs in a separate Worker's isolate — OOM here won't kill the DO
  const result = await env.INFERENCE_WORKER.runModel(payload);
  return result;
}, { retries: { limit: 5, delay: '10s', backoff: 'exponential' } });
```