Reliability

Transactional Email “Send Once” with Delivered Marker

Email sending should be idempotent: store a delivered marker (or unique key) so retries don’t spam users with duplicates. This pattern is especially useful for receipts and password reset flows.
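A minimal sketch of the marker, assuming a Rails app with a sent_emails table that has a unique index on delivery_key; the model, mailer, and key format are illustrative:

```ruby
# Assumes a sent_emails table with a delivery_key column and
# add_index :sent_emails, :delivery_key, unique: true
class ReceiptSender
  def self.deliver(order)
    delivery_key = "receipt-#{order.id}"

    # The unique index makes the "claim" atomic: a retried or concurrent
    # send raises RecordNotUnique instead of emailing the user twice.
    SentEmail.create!(delivery_key: delivery_key)
    ReceiptMailer.receipt(order).deliver_later
  rescue ActiveRecord::RecordNotUnique
    # Already sent (or another worker is sending it) -- skip quietly.
  end
end
```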

HTTP Timeouts + Retries Wrapper (Faraday)

I wrapped external HTTP calls once I realized most “flaky APIs” were actually my fault: no timeouts, unclear retries, and logs that didn’t tell a story. The wrapper centralizes one Faraday connection with explicit open_timeout and timeout values and a bounded retry policy, so every external call shares the same limits and logging.
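A sketch of that connection, assuming Faraday 2 with the faraday-retry gem; the URL, timeouts, and retry settings are placeholders to tune per API:

```ruby
require "faraday"
require "faraday/retry"

module ExternalApi
  # One shared connection: explicit timeouts, bounded retries, errors raised.
  def self.connection
    @connection ||= Faraday.new(
      url: "https://api.example.com",
      request: { open_timeout: 2, timeout: 5 } # seconds
    ) do |f|
      f.request :retry,
                max: 2,
                interval: 0.2,
                backoff_factor: 2,
                retry_statuses: [429, 502, 503]
      f.response :raise_error # turn 4xx/5xx into exceptions handled in one place
      f.adapter Faraday.default_adapter
    end
  end
end

# ExternalApi.connection.get("/v1/items")
```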

Graceful Degradation: Feature-Based Rescue

Not every failure should be a 500. If a non-critical dependency fails (e.g., recommendations), rescue narrowly, emit a metric/log, and serve a baseline response.
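A sketch of the narrow rescue, with RecommendationService standing in for the non-critical dependency and an empty list as the baseline response:

```ruby
# RecommendationService::Error and the StatsD call are illustrative;
# rescue the specific errors your client actually raises, never Exception.
def recommendations_for(user)
  RecommendationService.fetch(user.id)
rescue RecommendationService::Error, Faraday::TimeoutError => e
  Rails.logger.warn("recommendations degraded: #{e.class}: #{e.message}")
  StatsD.increment("recommendations.fallback") if defined?(StatsD)
  [] # baseline: an empty shelf, not a 500 for the whole page
end
```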

Safer Background Job Arguments (Serialize IDs only)

Jobs should accept simple primitives (IDs, strings), not full objects. It avoids serialization surprises and makes jobs resilient across deploys. This also reduces job payload size.
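A sketch with ActiveJob; the job, model, and mailer names are illustrative:

```ruby
class SendInvoiceJob < ApplicationJob
  queue_as :default

  def perform(invoice_id)
    invoice = Invoice.find_by(id: invoice_id)
    return if invoice.nil? # deleted between enqueue and perform: nothing to do

    InvoiceMailer.send_invoice(invoice).deliver_now
  end
end

# Enqueue with a primitive, not the record:
# SendInvoiceJob.perform_later(invoice.id)
```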

Schema-Backed Enums (DB Constraint + Rails enum)

Rails enums are nice, but the DB should enforce allowed values. Use a CHECK constraint (or native enum type) plus the Rails enum mapping. It prevents bad writes from console scripts and future migrations.
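A sketch for a string status column on orders (add_check_constraint needs Rails 6.1+, the enum syntax shown is Rails 7+; the allowed values are illustrative):

```ruby
# Migration: the database refuses anything outside the allowed set.
class AddStatusCheckToOrders < ActiveRecord::Migration[7.0]
  def change
    add_check_constraint :orders,
                         "status IN ('pending', 'paid', 'refunded')",
                         name: "orders_status_allowed"
  end
end

# Model: the Rails enum adds the predicates and scopes on top.
class Order < ApplicationRecord
  enum :status, { pending: "pending", paid: "paid", refunded: "refunded" }
end
```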

Counter Cache Repair Job (Consistency Tooling)

Counter caches drift (deleted records, backfills, manual SQL). A repair job that recomputes counts safely is invaluable. It’s the kind of operational code you’re glad you wrote the first time a dashboard is wrong.
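A sketch of the repair job, assuming the standard counter cache on Post#comments (comments_count); the batch size and queue name are illustrative:

```ruby
class RepairCommentsCountJob < ApplicationJob
  queue_as :maintenance

  def perform
    Post.in_batches(of: 1_000) do |batch|
      batch.ids.each do |post_id|
        # reset_counters recounts the association and writes the true value.
        Post.reset_counters(post_id, :comments)
      end
    end
  end
end
```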

Backend: normalize errors with a single Express handler

Without a centralized error handler, you end up with a mix of thrown errors, ad-hoc res.status(500) blocks, and inconsistent JSON shapes. I use one Express error middleware that maps known errors to stable codes and logs unknown errors with request context.
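A sketch of that middleware; AppError and the JSON shape are illustrative, and the handler must be registered after all routes:

```js
class AppError extends Error {
  constructor(status, code, message) {
    super(message);
    this.status = status;
    this.code = code;
  }
}

// Express recognizes error middleware by its four-argument signature.
function errorHandler(err, req, res, next) {
  if (err instanceof AppError) {
    return res.status(err.status).json({ error: { code: err.code, message: err.message } });
  }
  // Unknown error: log with request context, return one stable generic shape.
  console.error({ msg: "unhandled_error", method: req.method, path: req.path, err });
  return res.status(500).json({ error: { code: "internal_error", message: "Something went wrong" } });
}

module.exports = { AppError, errorHandler };

// app.use(errorHandler); // last, after every route
```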

Circuit breaker wrapper for flaky third-party APIs

When a dependency starts timing out, naive retries can amplify the outage by piling on more work. A circuit breaker gives the system a chance to breathe: after enough failures, it opens and returns a fast error, then it half-opens to probe recovery. I keep the wrapper small: count consecutive failures, fail fast while open, and let a single probe request decide whether to close again.
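A minimal in-memory breaker as a sketch; the threshold and reset timeout are illustrative and would normally come from config:

```js
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  async call(fn) {
    if (this.openedAt !== null && Date.now() - this.openedAt < this.resetTimeoutMs) {
      throw new Error("circuit_open"); // fail fast, no call made
    }
    // Closed, or past the reset timeout (half-open): let this request probe.
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = Date.now(); // open (or re-open after a failed probe)
      }
      throw err;
    }
  }
}

// const breaker = new CircuitBreaker();
// const data = await breaker.call(() => fetchFromVendor()); // fetchFromVendor is illustrative
```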

Health checks with readiness + liveness

One /health endpoint is ambiguous: is the process alive, or is it actually ready to serve traffic? I split them. Liveness answers ‘should the orchestrator restart me?’ and is usually just ‘the event loop is alive’. Readiness answers ‘can I accept traffic right now?’ and checks the dependencies the service needs to serve a real request.
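A sketch with Express; the endpoint names and db.ping() are illustrative stand-ins for whatever readiness actually has to verify:

```js
const express = require("express");
const db = require("./db"); // hypothetical module exposing ping()

const app = express();

// Liveness: can the process respond at all? Restart me if not.
app.get("/livez", (req, res) => res.status(200).send("ok"));

// Readiness: can I serve real traffic? Route it elsewhere if not.
app.get("/readyz", async (req, res) => {
  try {
    await db.ping();
    res.status(200).send("ready");
  } catch (err) {
    res.status(503).send("not ready");
  }
});

app.listen(3000);
```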

Background Job Backpressure with Queue Depth Guard

When downstream systems degrade, jobs pile up and amplify outages. Add a simple “queue depth guard” so non-critical jobs skip or reschedule instead of making the backlog worse.
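A sketch assuming Sidekiq; the queue name, depth threshold, delay, and domain call are illustrative:

```ruby
require "sidekiq"
require "sidekiq/api"

class RefreshRecommendationsJob
  include Sidekiq::Job
  sidekiq_options queue: "low"

  MAX_QUEUE_DEPTH = 5_000

  def perform(user_id)
    if Sidekiq::Queue.new("low").size > MAX_QUEUE_DEPTH
      # The backlog is already deep: push this non-critical work out
      # instead of making the pile bigger right now.
      self.class.perform_in(10 * 60, user_id) # try again in 10 minutes
      return
    end

    Recommendations.refresh!(user_id) # illustrative domain call
  end
end
```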

Safer Time-Based Deletes with “mark then sweep”

Direct deletes can be risky and slow. Mark records for deletion, then sweep in batches in a maintenance job. This gives you observability and a rollback window.
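A sketch using a marked_for_deletion_at column; the model, retention windows, and batch size are illustrative:

```ruby
# Step 1: mark (fast, and reversible by clearing the column)
Export.where("created_at < ?", 90.days.ago)
      .update_all(marked_for_deletion_at: Time.current)

# Step 2: sweep later, in small batches, from a maintenance job
class SweepMarkedExportsJob < ApplicationJob
  queue_as :maintenance

  def perform
    Export.where("marked_for_deletion_at < ?", 7.days.ago)
          .in_batches(of: 500) do |batch|
      batch.delete_all
      sleep(0.1) # keep lock pressure and replication lag low
    end
  end
end
```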

Idempotency keys for “create” endpoints

Retries are inevitable: mobile clients, flaky networks, and load balancers will resend POST requests. Without idempotency you end up double-charging or double-creating records. I store an Idempotency-Key with a sha256 hash of the request body and the response it produced, so a retried request returns the original result instead of creating a duplicate.
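A sketch of the flow as Express middleware with an in-memory Map; in practice the key, body hash, and saved response live in the database behind a unique index on the key:

```js
const crypto = require("crypto");

const store = new Map(); // key -> { bodyHash, response }

function idempotency(req, res, next) {
  const key = req.get("Idempotency-Key");
  if (!key) return next(); // no key: behave like a normal request

  const bodyHash = crypto
    .createHash("sha256")
    .update(JSON.stringify(req.body))
    .digest("hex");

  const existing = store.get(key);
  if (existing) {
    if (existing.bodyHash !== bodyHash) {
      return res.status(409).json({ error: "idempotency_key_reuse" });
    }
    return res.status(existing.response.status).json(existing.response.body); // replay
  }

  // Capture the JSON response so a retry with the same key replays it.
  const originalJson = res.json.bind(res);
  res.json = (body) => {
    store.set(key, { bodyHash, response: { status: res.statusCode, body } });
    return originalJson(body);
  };
  next();
}

// app.post("/charges", express.json(), idempotency, createCharge);
```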