Retries & Error Handling
Configure retry strategies and handle failures gracefully
Every step.do() call has retry support built in. Network hiccup? API rate limit? Transient database error? Ablauf retries the step automatically using Durable Object alarms (non-blocking, of course).
Default Behavior
By default, every step gets 3 attempts with a 1 second base delay and exponential backoff:
```ts
// Uses defaults: { limit: 3, delay: "1s", backoff: "exponential" }
const data = await step.do('fetch-data', async () => {
  const res = await fetch('https://api.example.com/data');
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
});
```

If the step fails all 3 times, it throws StepRetryExhaustedError and the workflow stops.
Per-Step Overrides
Need more retries for a critical operation? Override the defaults:
```ts
const data = await step.do(
  'critical-operation',
  async () => {
    // This step gets 10 attempts with 5s delay
    return await somethingFragile();
  },
  {
    retries: {
      limit: 10,
      delay: '5s',
      backoff: 'exponential',
    },
  },
);
```

Workflow-Level Defaults
Set default retry config for all steps in a workflow:
```ts
import { defineWorkflow } from '@der-ablauf/workflows';

const MyWorkflow = defineWorkflow((t) => ({
  type: 'my-workflow',
  input: t.object({
    /* ... */
  }),
  defaults: {
    retries: { limit: 5, delay: '2s', backoff: 'linear' },
  },
  run: async (step, payload) => {
    // All steps in this workflow default to 5 attempts with linear backoff
    await step.do('step-1', async () => {
      /* ... */
    });
    await step.do('step-2', async () => {
      /* ... */
    });
  },
}));
```

Per-step overrides still work — they take precedence over workflow defaults.
Backoff Strategies
Ablauf supports three backoff strategies:
| Strategy | Formula | Example (1s base delay) |
|---|---|---|
| "fixed" | delay | 1s, 1s, 1s, 1s |
| "linear" | delay * attempt | 1s, 2s, 3s, 4s |
| "exponential" | delay * 2^(attempt-1) | 1s, 2s, 4s, 8s |
Exponential backoff is usually the right choice — it gives temporary issues time to resolve without hammering a struggling service.
Retry delays use duration strings like "500ms", "1s", "30s", "5m", or "1h".
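If it helps to see the table as code, here is a small sketch of how each strategy scales the base delay. It is illustrative only: it mirrors the formulas above, not Ablauf's internals.

```ts
// Illustrative only: mirrors the formulas in the table above, not Ablauf's internals.
type Backoff = 'fixed' | 'linear' | 'exponential';

function retryDelayMs(baseMs: number, attempt: number, backoff: Backoff): number {
  switch (backoff) {
    case 'fixed':
      return baseMs; // 1s, 1s, 1s, ...
    case 'linear':
      return baseMs * attempt; // 1s, 2s, 3s, ...
    case 'exponential':
      return baseMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
  }
}

// With a 1s base delay, attempt 3 waits:
// fixed → 1000 ms, linear → 3000 ms, exponential → 4000 ms
console.log(retryDelayMs(1000, 3, 'exponential')); // 4000
```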
When Retries Are Exhausted
When all retry attempts fail, the step throws StepRetryExhaustedError. This propagates to the workflow's run() function, marking the workflow as errored.
```ts
try {
  await step.do('flaky-operation', async () => {
    // This might fail...
  });
} catch (err) {
  if (err instanceof StepRetryExhaustedError) {
    // Log the failure, send an alert, etc.
    console.error(`Step failed after ${err.attempts} attempts`);
  }
  throw err;
}
```

For the complete list of error classes and their HTTP status codes, see the API Reference.
Skipping Retries with NonRetriableError
Sometimes retrying is pointless — the error is permanent and will never succeed. For these cases, throw NonRetriableError inside your step function to immediately fail the step without retrying:
```ts
import { defineWorkflow, NonRetriableError } from '@der-ablauf/workflows';

const order = defineWorkflow((t) => ({
  type: 'process-order',
  input: t.object({ userId: t.string() }),
  run: async (step, payload) => {
    const user = await step.do('validate-user', async () => {
      const user = await getUser(payload.userId);
      if (user.banned) {
        throw new NonRetriableError('User is banned');
      }
      return user;
    });
    // ...
  },
}));
```

When NonRetriableError is thrown:
- The step is immediately marked as failed — no retries are attempted, regardless of the retry configuration
- The error is recorded in the step's retry history (visible in the dashboard)
- The workflow transitions to errored
NonRetriableError extends plain Error, not WorkflowError. It's designed to be simple for user code — no error codes or HTTP statuses needed.
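Because it's a plain Error subclass, you can also define domain-specific errors on top of it. A minimal sketch, assuming Ablauf detects non-retriable failures via instanceof checks (the UserBannedError class is illustrative, not part of the library):

```ts
import { NonRetriableError } from '@der-ablauf/workflows';

// Illustrative subclass (assumes Ablauf treats any `instanceof NonRetriableError`
// as non-retriable, so subclasses behave the same way).
class UserBannedError extends NonRetriableError {
  constructor(public readonly userId: string) {
    super(`User ${userId} is banned`);
    this.name = 'UserBannedError';
  }
}

// Inside a step: throwing the subclass fails the step immediately, and the
// extra context (userId) is available when inspecting the error later.
// if (user.banned) throw new UserBannedError(user.id);
```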
When to Use NonRetriableError
Use it for errors where retrying would be wasteful (see the sketch after this list):
- Business rule violations — user is banned, account is suspended
- Authorization failures — invalid API key, insufficient permissions
- Invalid data — malformed input discovered mid-step
- Resource gone — the thing you need no longer exists
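For example, an HTTP-backed step can treat auth and not-found responses as permanent while letting server errors go through normal retries. A sketch that lives inside a workflow's run() function, with NonRetriableError imported as shown earlier (the endpoint and the exact status classification are illustrative):

```ts
const invoice = await step.do('fetch-invoice', async () => {
  // Illustrative endpoint
  const res = await fetch('https://api.example.com/invoices/123');

  if (res.status === 401 || res.status === 403 || res.status === 404) {
    // Permanent: bad credentials or a missing resource won't fix themselves
    throw new NonRetriableError(`Invoice fetch rejected with HTTP ${res.status}`);
  }
  if (!res.ok) {
    // Transient (5xx, 429, ...): a plain Error goes through normal retry logic
    throw new Error(`HTTP ${res.status}`);
  }
  return res.json();
});
```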
Crash & OOM Recovery
Cloudflare Durable Objects run in isolates with a 128 MB memory limit. If a step exceeds this limit (or the isolate crashes for any reason), the entire isolate is killed — your step's try/catch never executes, and the error is never recorded.
Ablauf handles this automatically using write-ahead step tracking. Before executing your step function, Ablauf persists the step as "running" in SQLite with an incremented attempt counter. If the isolate dies mid-execution:
- The step remains in "running" state in durable storage (SQLite survives isolate resets)
- A safety alarm (set before replay started) fires and triggers the alarm handler
- The alarm handler replays the workflow, detecting the orphaned "running" step
- The crash is recorded in the step's retry history
- Normal retry logic kicks in — backoff delay, then re-execution
- If retries are exhausted, the step fails permanently with StepRetryExhaustedError
This means OOM crashes are handled identically to normal step failures — no infinite loops, no zombie workflows. Ablauf sets a safety alarm before every replay so there's always a trigger for crash recovery, whether the OOM happened during initial execution, a resume, or an alarm-driven retry.
```
Attempt 1: step.do() → write-ahead (running, attempts=1) → fn() → OOM 💥
           ↓ isolate killed, safety alarm fires
Recovery:  alarm() → replay() → detects status="running" → schedule retry with backoff
           ↓ retry alarm fires
Attempt 2: step.do() → write-ahead (running, attempts=2) → fn() → success ✓
```

If a step deterministically exceeds 128 MB, retries won't help — it will fail on every attempt and eventually exhaust retries. Move memory-heavy work to a separate Worker via service binding RPC so the DO isolate stays safe. The separate Worker has its own 128 MB and its crash won't kill your workflow's state.
What Survives an Isolate Crash
| Data | Survives? | Why |
|---|---|---|
| Completed step results | Yes | Persisted in SQLite before the crash |
| Workflow metadata & payload | Yes | Written to durable storage on creation |
| In-flight step attempt counter | Yes | Write-ahead persists before fn() runs |
| The step function's return value | No | Isolate died before it could be saved |
| JavaScript variables in memory | No | Isolate memory is wiped |
Result Size Limits
Every workflow has a cumulative memory budget for step results. By default, the total serialized size of all completed step results cannot exceed 64 MB. This prevents workflows from accumulating enough data to trigger an OOM crash during replay.
After each step.do() execution, Ablauf measures the serialized result size and checks it against the remaining budget. If the new result would push the total over the limit, the step fails before the result is stored.
How It Works
```
Step 1: result = 2 MB  → total:  2 MB / 64 MB ✓
Step 2: result = 10 MB → total: 12 MB / 64 MB ✓
Step 3: result = 55 MB → total: 67 MB / 64 MB ✗ → fails
```

Default Behavior
By default, exceeding the budget throws a NonRetriableError — the step fails immediately without retrying (since retrying will produce the same oversized result):
```ts
const MyWorkflow = defineWorkflow((t) => ({
  type: 'my-workflow',
  input: t.object({ /* ... */ }),
  // Default: 64 MB budget, non-retryable on overflow
  run: async (step, payload) => { /* ... */ },
}));
```

Custom Configuration
Override the defaults with resultSizeLimit:
```ts
const HeavyWorkflow = defineWorkflow((t) => ({
  type: 'heavy-workflow',
  input: t.object({ /* ... */ }),
  resultSizeLimit: {
    maxSize: '128mb', // Increase the budget
    onOverflow: 'retry', // Use retry logic instead of immediate failure
  },
  run: async (step, payload) => { /* ... */ },
}));
```

| Option | Type | Default | Description |
|---|---|---|---|
| maxSize | string | "64mb" | Cumulative byte budget. Accepts: "512kb", "64mb", "1gb". |
| onOverflow | "fail" \| "retry" | "fail" | "fail" throws NonRetriableError. "retry" throws a retryable error. |
The 64 MB default leaves ~64 MB headroom for the engine, your workflow code, and deserialized objects within the 128 MB isolate limit. Raising it above 100 MB is risky — consider offloading heavy data to external storage instead.
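One way to stay under the budget is to keep step results small: write large payloads to external storage inside the step and return only a reference. A sketch that lives inside a workflow's run() function, assuming an R2 bucket binding named env.ARTIFACTS and illustrative helpers buildLargeReport and sendReportEmail:

```ts
const reportKey = await step.do('generate-report', async () => {
  const report = await buildLargeReport(payload); // potentially many MB
  const key = `reports/${payload.userId}.json`;

  // Store the heavy data in R2; only the small key becomes the step result
  await env.ARTIFACTS.put(key, JSON.stringify(report));
  return key;
});

await step.do('email-report', async () => {
  // Later steps re-read the data on demand instead of carrying it in result history
  const obj = await env.ARTIFACTS.get(reportKey);
  if (!obj) throw new Error(`Missing report object: ${reportKey}`);
  await sendReportEmail(payload.userId, await obj.text());
});
```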
Size Strings
Size strings follow the same pattern as duration strings:
| Format | Example | Description |
|---|---|---|
| Nb | "100b" | Bytes |
| Nkb | "512kb" | Kilobytes |
| Nmb | "64mb" | Megabytes |
| Ngb | "1gb" | Gigabytes |
Invalid size strings throw InvalidSizeError.
Best Practices
Set realistic retry limits. If an API is down, 100 retries won't help. Use retries for transient issues, not systemic failures.
Use exponential backoff for external services. Linear or fixed backoff can overwhelm a struggling service.
Don't retry non-idempotent operations blindly. If retrying a step could cause duplicate charges, emails, or data corruption, add idempotency checks inside the step.
```ts
await step.do(
  'send-email',
  async () => {
    // Check if email was already sent before retrying
    const alreadySent = await checkEmailLog(userId);
    if (alreadySent) return;
    await sendEmail(userId, 'Welcome!');
  },
  {
    retries: { limit: 5 },
  },
);
```

Offload memory-heavy steps. If a step does inference, image processing, or anything that might exceed 128 MB, call it via fetch() or service binding RPC so it runs in a separate isolate. Your DO stays safe and can retry if the external call fails.
```ts
await step.do('run-inference', async () => {
  // Runs in a separate Worker's isolate — OOM here won't kill the DO
  const result = await env.INFERENCE_WORKER.runModel(payload);
  return result;
}, { retries: { limit: 5, delay: '10s', backoff: 'exponential' } });
```