Skip to main content
Problem 30

Build a Background Job Processor with Retries

HARDBUILD
Queues / Background Jobs+4

You're building the worker that drains a job queue in a production backend. Jobs sometimes succeed, sometimes fail transiently (a flaky API, a timeout), and sometimes fail permanently (bad payload, expired credential).

Your job is to write the state machine that handles all three cases correctly: successes finalize, transient failures retry with a delay, permanent failures move to a dead-letter state immediately. Transient failures eventually exhaust if they hit the retry cap.

You're given two skeleton files: queue_store.py (in-memory job storage) and processor.py (the worker loop). The tests drive processor.process_next(now) to simulate time — no real sleeping required.

Requirements
  • QueueStore.enqueue(job_type, payload) creates a pending job with attempts=0 and next_attempt_at=0, and returns its id.
  • QueueStore.next_runnable(now) returns the oldest pending job whose next_attempt_at <= now, or None. It must not return jobs in succeeded or dead_letter state.
  • When a handler returns normally, the job transitions to succeeded, attempts is incremented once, and the handler is not invoked again on later process_next calls.
  • When a handler raises PermanentError, the job transitions to dead_letter immediately — no matter how many retries remain — and last_error records the exception message.
  • When a handler raises any other exception, attempts increments and the job is rescheduled at now + retry_delay_seconds. After max_retries attempts, the next failure puts the job in dead_letter.
  • processor.process_next(now) returns True if a job was processed (success, retry-scheduled, or dead-lettered) and False if no job was runnable at now.
Examples
Example 1
Input
store.enqueue('send_email', {'to': 'a@x.com'}); processor.register('send_email', lambda p: None); processor.process_next(now=0)
Output
Returns True. Job's state == 'succeeded', attempts == 1.
Example 2
Input
Handler raises TransientError on first call, succeeds on second. max_retries=3, retry_delay_seconds=5.
process_next(now=0) → fails, schedules retry at t=5
process_next(now=4) → returns False (not yet runnable)
process_next(now=5) → succeeds
Output
Final state == 'succeeded', attempts == 2.
Example 3
Input
Handler raises PermanentError on first call.
process_next(now=0)
Output
Returns True. State == 'dead_letter', attempts == 1, last_error contains the exception message.
Constraints
  • Do not use real time.sleep or wall-clock time inside the processor. The caller passes now explicitly so tests stay fast.
  • PermanentError must skip the retry cap entirely. It is not a transient exception with a flag — it's a different control path.
  • A job that is currently in dead_letter or succeeded must never be re-run by process_next.
Follow-up

How would you add exponential backoff so retry #2 waits longer than retry #1? Where would the formula live so it's testable without time-based flakes?

Hints
Related Practice
Track
Backend Basics

Keep moving through related backend basics problems and build a stronger search-friendly practice loop around this topic.

View track →
Console output will appear here...