Skip to main content
HardBackendBuild

Build a Background Job Processor with Retries

You're building the worker that drains a job queue in a production backend. Jobs sometimes succeed, sometimes fail transiently (a flaky API, a timeout), and sometimes fail permanently (bad payload, expired credential)....

What you will practice

PythonQueues / Background JobsAsync PatternsRetriesBackoffState Machines

Requirements

  • `QueueStore.enqueue(job_type, payload)` creates a `pending` job with `attempts=0` and `next_attempt_at=0`, and returns its id.
  • `QueueStore.next_runnable(now)` returns the oldest `pending` job whose `next_attempt_at <= now`, or `None`. It must not return jobs in `succeeded` or `dead_letter` state.
  • When a handler returns normally, the job transitions to `succeeded`, attempts is incremented once, and the handler is not invoked again on later `process_next` calls.
  • When a handler raises `PermanentError`, the job transitions to `dead_letter` immediately — no matter how many retries remain — and `last_error` records the exception message.
  • When a handler raises any other exception, attempts increments and the job is rescheduled at `now + retry_delay_seconds`. After `max_retries` attempts, the next failure puts the job in `dead_letter`.
  • `processor.process_next(now)` returns `True` if a job was processed (success, retry-scheduled, or dead-lettered) and `False` if no job was runnable at `now`.

Starter files

queue_store.pyEditable starter
processor.pyEditable starter

What the judge checks

  • Runs in the python environment with the python-pytest runner.
  • Uses a 8000ms judge budget.
  • Behavior rules include: Successful Jobs Processed Once, Transient Failure Retried With Delay, Permanent Failure Goes To Dead Letter Immediately, Max Retry Cap Enforced.

Constraints

  • Do not use real `time.sleep` or wall-clock time inside the processor. The caller passes `now` explicitly so tests stay fast.
  • `PermanentError` must skip the retry cap entirely. It is not a transient exception with a flag — it's a different control path.
  • A job that is currently in `dead_letter` or `succeeded` must never be re-run by `process_next`.

Example behavior

Input
store.enqueue('send_email', {'to': 'a@x.com'}); processor.register('send_email', lambda p: None); processor.process_next(now=0)
Output
Returns True. Job's state == 'succeeded', attempts == 1.
Input
Handler raises TransientError on first call, succeeds on second. max_retries=3, retry_delay_seconds=5.
process_next(now=0) → fails, schedules retry at t=5
process_next(now=4) → returns False (not yet runnable)
process_next(now=5) → succeeds
Output
Final state == 'succeeded', attempts == 2.

Follow-up

How would you add exponential backoff so retry #2 waits longer than retry #1? Where would the formula live so it's testable without time-based flakes?