Last quarter, a payment service I maintain started throwing 503s at 2 AM. The root cause was a catalog service three hops away that had run out of database connections. Without circuit breakers, every request to the payment service blocked for 30 seconds waiting on a dead dependency, the request pool filled up, and the entire checkout flow collapsed. A single unhealthy service took down five others in under a minute.
That is cascading failure, and it is the default behavior of most microservice architectures. You have to actively design against it. This post is the set of Go resilience patterns I apply to every service I build — circuit breakers, bulkheads, retries with bounded backoff, timeouts with context propagation, health checks, and graceful degradation — plus the failure modes each one is actually defending against and the ones it’s not.
I’ll show the shape of the code, not a drop-in library. Copy the patterns, understand the tradeoffs, and write your own. Copy-paste without understanding and you’ll ship a bug that looks like resilience until the day it isn’t.
Threat Model: What Actually Takes Services Down
Before any code, name the failure modes. Every pattern below defends against something specific — if you can’t name the threat, drop the pattern.
| Failure mode | What it looks like | Defense |
|---|---|---|
| Cascading failure | One slow dependency blocks callers; callers fill up and fail their callers | Circuit breaker + timeouts |
| Thundering herd | All clients retry simultaneously after an outage, overwhelming recovery | Exponential backoff with jitter |
| Resource exhaustion | One bad dependency starves a shared goroutine pool or connection pool | Bulkhead (bounded per-dependency pools) |
| Retry storm | Clients retry 3x against an already-overloaded service, tripling load | Circuit breaker gating retries, retry budgets |
| Partial failure | Non-critical dependency down, whole request fails when it could degrade | Graceful degradation, cached/default responses |
| Slow-loris hops | End-to-end deadline is 2s but each hop waits 5s | Context deadline propagation across RPC boundaries |
| Orchestrator flapping | Liveness probe restarts a service that’s merely slow, compounding the outage | Distinct liveness/readiness/startup probes |
I’ll come back to this table. The most common mistake I see is teams implementing retries without a circuit breaker. Retries without a breaker turn a struggling service into a dead one: every client retries 3x, tripling load on a service already at its limit. The breaker is what makes retries safe.
Circuit Breaker: Fail Fast When The Downstream Is Down
A circuit breaker is a state machine that sits between your code and a remote call. Closed = calls flow through. Open = calls fail instantly without touching the dependency. Half-open = one probe call at a time, to detect recovery. The breaker flips to Open after too many failures in a window, then flips back via Half-open after a cooldown.
The point is not to handle the error. The point is to stop waiting. When a downstream hangs, every caller blocks a goroutine and a connection for the full timeout. Multiply that by request rate and you’ve exhausted your pool in seconds. An open breaker returns immediately so the caller can free resources, return stale data, or fail gracefully.
When to open, when to close
There are two schools of thought on the threshold: consecutive failures vs. rolling error rate. Consecutive is simpler and usually enough — a handful of back-to-back failures is a strong signal. Rolling-rate (e.g. >50% errors over the last 20 requests) is more accurate under mixed traffic, but needs a ring buffer and tuning. Start with consecutive; move to rolling only when you see it misfire.
For Half-open, let exactly one request through at a time until you’ve accumulated enough successes to close. Let a hundred through and you’ll re-overload the recovering service. This matters more than people think.
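If you do graduate to rolling-rate, the mechanism is small — a ring buffer of recent outcomes. A hedged sketch under my own names (rollingWindow is not part of the breaker below, just the shape I'd reach for):

```go
package main

import "fmt"

// rollingWindow tracks the last N call outcomes in a ring buffer.
// true means the call failed.
type rollingWindow struct {
	outcomes []bool
	idx      int
	filled   bool
}

func newRollingWindow(n int) *rollingWindow {
	return &rollingWindow{outcomes: make([]bool, n)}
}

func (w *rollingWindow) record(failed bool) {
	w.outcomes[w.idx] = failed
	w.idx = (w.idx + 1) % len(w.outcomes)
	if w.idx == 0 {
		w.filled = true
	}
}

// errorRate reports the failure fraction over the window. Before the
// window fills it reports over the samples seen so far — gate on a
// minimum sample count before trusting it, or one early failure reads
// as a 100% error rate.
func (w *rollingWindow) errorRate() float64 {
	n := len(w.outcomes)
	if !w.filled {
		n = w.idx
	}
	if n == 0 {
		return 0
	}
	failures := 0
	for _, f := range w.outcomes[:n] {
		if f {
			failures++
		}
	}
	return float64(failures) / float64(n)
}

func main() {
	w := newRollingWindow(20)
	for i := 0; i < 20; i++ {
		w.record(i%2 == 0) // alternate failure/success
	}
	fmt.Println(w.errorRate()) // 0.5
}
```

Wiring this in means record feeds the window instead of the consecutive counters, and admit opens when errorRate crosses the threshold — everything else about the state machine stays the same.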
Why a full Lock, not RLock
Every call to allowRequest may need to transition state (Open -> Half-open when the cooldown elapses). RLock can’t upgrade to Lock in Go, and manual upgrade via unlock-then-relock creates a TOCTOU race. Just take the write lock. The contention cost is tiny next to a network RPC.
The state-change callback trap
This is the bug I want to highlight because I’ve seen it in production code: a lot of breaker implementations call the onStateChange observer callback while holding the lock. If that callback logs, emits a metric, or (worst case) calls back into the breaker to read state, you have a deadlock, a lock-inversion hazard, or — the subtler version — the callback runs before the state transition is visible to other goroutines. Equally bad: if you release the lock and then call the callback, a concurrent updateState can already have flipped the state a second time, so the callback observes a value that doesn’t match its own (from, to) arguments.
The fix is narrow: commit the state change under the lock, snapshot what the callback needs, release the lock, then dispatch the callback on a separate goroutine. The callback sees exactly the transition it was told about, and nothing it does can deadlock the breaker.
// circuitbreaker/circuitbreaker.go
package circuitbreaker
import (
"context"
"errors"
"fmt"
"sync"
"time"
)
type State int
const (
StateClosed State = iota
StateOpen
StateHalfOpen
)
func (s State) String() string {
switch s {
case StateClosed:
return "closed"
case StateOpen:
return "open"
case StateHalfOpen:
return "half-open"
default:
return "unknown"
}
}
// ErrOpen is returned when the circuit is open. Callers check with errors.Is.
var ErrOpen = errors.New("circuit breaker: open")
type Config struct {
Name string
FailureThreshold int // consecutive failures to open
SuccessThreshold int // consecutive successes in half-open to close
ResetTimeout time.Duration // how long to stay Open before probing
OnStateChange func(name string, from, to State)
}
type CircuitBreaker struct {
cfg Config
mu sync.Mutex
state State
failureCount int
successCount int
halfOpenInFlight bool // allow only one probe in half-open
lastStateChangeTime time.Time
}
func New(cfg Config) *CircuitBreaker {
if cfg.FailureThreshold <= 0 {
cfg.FailureThreshold = 5
}
if cfg.SuccessThreshold <= 0 {
cfg.SuccessThreshold = 2
}
if cfg.ResetTimeout <= 0 {
cfg.ResetTimeout = 10 * time.Second
}
return &CircuitBreaker{cfg: cfg, state: StateClosed, lastStateChangeTime: time.Now()}
}
The Execute path has two responsibilities: admission (should this call proceed?) and recording (what did this call’s outcome mean for state?). Both are mutex-protected. The fiddly bit is half-open: we must let exactly one probe through, and we must remember that we did, so a later record call can clear the halfOpenInFlight flag. One more subtle requirement: if fn() panics, we still have to record the outcome — otherwise a panic inside a half-open probe leaves halfOpenInFlight=true forever and the breaker is permanently stuck. defer + named return covers both the normal and panic paths.
func (cb *CircuitBreaker) Execute(fn func() error) (err error) {
if err := cb.admit(); err != nil {
return err
}
// Use defer + named return so a panic in fn() still records the
// outcome. Without this, a panic inside a half-open probe leaves
// halfOpenInFlight=true forever and the breaker never recovers.
defer func() {
if r := recover(); r != nil {
// Belt-and-suspenders: if record itself panics (e.g. a buggy
// OnStateChange wrapper), don't let it shadow the original
// panic value — recover the inner panic and re-raise the
// original so the stack trace points at the real bug.
func() {
defer func() { _ = recover() }()
cb.record(fmt.Errorf("panic: %v", r))
}()
panic(r) // re-raise after state is consistent
}
cb.record(err)
}()
err = fn()
return err
}
func (cb *CircuitBreaker) admit() error {
cb.mu.Lock()
defer cb.mu.Unlock()
switch cb.state {
case StateClosed:
return nil
case StateOpen:
if time.Since(cb.lastStateChangeTime) < cb.cfg.ResetTimeout {
return ErrOpen
}
// Cooldown elapsed. Transition to half-open and admit this as the probe.
cb.transitionLocked(StateHalfOpen)
cb.halfOpenInFlight = true
return nil
case StateHalfOpen:
if cb.halfOpenInFlight {
return ErrOpen // only one probe at a time
}
cb.halfOpenInFlight = true
return nil
}
return ErrOpen
}
func (cb *CircuitBreaker) record(err error) {
cb.mu.Lock()
if cb.state == StateHalfOpen {
cb.halfOpenInFlight = false
}
// Context cancellations/deadlines are the caller giving up — they
// are NOT a signal the dependency is unhealthy. Counting them would
// let a burst of client disconnects trip the breaker and block
// traffic the dependency is perfectly capable of serving.
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
cb.mu.Unlock()
return
}
if err != nil {
cb.failureCount++
cb.successCount = 0
if cb.state == StateHalfOpen ||
(cb.state == StateClosed && cb.failureCount >= cb.cfg.FailureThreshold) {
cb.transitionLocked(StateOpen)
}
} else {
cb.successCount++
cb.failureCount = 0
if cb.state == StateHalfOpen && cb.successCount >= cb.cfg.SuccessThreshold {
cb.transitionLocked(StateClosed)
}
}
cb.mu.Unlock()
}
transitionLocked is where the callback fix lives. It stages the callback invocation and dispatches it from a fresh goroutine after the lock is released. The callback runs exactly once per transition, sees consistent (from, to) values, and cannot deadlock us.
// transitionLocked must be called with cb.mu held. It mutates state, snapshots
// the (from, to) pair, and dispatches the observer callback on a goroutine
// so it runs AFTER the lock is released and cannot re-enter the breaker.
func (cb *CircuitBreaker) transitionLocked(to State) {
from := cb.state
if from == to {
return
}
cb.state = to
cb.lastStateChangeTime = time.Now()
cb.failureCount = 0
cb.successCount = 0
if cb.cfg.OnStateChange != nil {
name := cb.cfg.Name
cbFn := cb.cfg.OnStateChange
go cbFn(name, from, to)
}
}
func (cb *CircuitBreaker) State() State {
cb.mu.Lock()
defer cb.mu.Unlock()
return cb.state
}
A word on the alternative: some breakers let the callback run synchronously inside the lock. That’s defensible if — and only if — you document that callbacks must be non-blocking, non-logging, and never touch the breaker. I don’t trust that contract to survive a junior dev reaching for it in an incident. Dispatch-to-goroutine is safer by default.
One more decision record makes: context cancellation and deadline errors do not count as failures. If a client disconnects mid-request and your handler returns ctx.Err(), that is not a signal the downstream is sick — it’s a signal the caller gave up. Counting those toward the threshold lets a burst of browser closes or upstream timeouts trip the breaker on a perfectly healthy dependency. The test I use: the failure must be attributable to the downstream, not to anything happening on my side of the wire.
HTTP Client With A Breaker
The breaker by itself doesn’t know about HTTP. Wrap it in a thin client so call sites get a boring Get/Post API with the breaker applied automatically. A 5xx from the server counts as a failure; a 4xx does not (that’s a client bug, not a dependency outage).
// client/http.go
package client
import (
"context"
"errors"
"fmt"
"io"
"net/http"
"time"
"example.com/resilience/circuitbreaker"
)
type HTTPClient struct {
http *http.Client
cb *circuitbreaker.CircuitBreaker
}
func NewHTTPClient(timeout time.Duration, cb *circuitbreaker.CircuitBreaker) *HTTPClient {
return &HTTPClient{
http: &http.Client{Timeout: timeout},
cb: cb,
}
}
func (c *HTTPClient) Get(ctx context.Context, url string) ([]byte, error) {
var body []byte
err := c.cb.Execute(func() error {
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return err
}
resp, err := c.http.Do(req)
if err != nil {
return fmt.Errorf("request: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode >= 500 {
return fmt.Errorf("upstream %d: %s", resp.StatusCode, resp.Status)
}
body, err = io.ReadAll(resp.Body)
return err
})
if errors.Is(err, circuitbreaker.ErrOpen) {
// Caller distinguishes "we skipped the call" from "the call failed".
return nil, err
}
return body, err
}
Bulkheads: Bound The Blast Radius
A bulkhead is exactly what it sounds like on a ship: a wall that contains flooding to one compartment. In code, it’s a bounded resource pool per dependency — a semaphore of N concurrent calls, optionally with a bounded waiting queue. When the pool is full, calls fail fast instead of piling up.
Without bulkheads, all your calls share one Go scheduler, one HTTP connection pool, one request handler pool. A slow catalog service fills that shared pool with waiters, and suddenly your fast user service can’t get a slot. Bulkheads carve up the pool so the catalog outage stays in its own compartment.
Bounded queues, not unbounded
An unbounded queue is not a queue — it’s a memory leak with a latency chart. If the dependency is slow, the queue grows until you OOM or tail latency reaches “service is down” territory. Cap the queue. When it’s full, reject new work immediately with a clear error. The caller then decides: degrade, retry elsewhere, or propagate the error.
The goroutine leak waiting to bite you
Here’s the bug that used to live in this article and probably lives in most bulkhead implementations on the internet: you spawn N worker goroutines reading from a channel. When you shut down, unless you close the task channel, those workers block forever on the range loop. Worse, if you just close the channel you may panic any producer still trying to send. The clean shape is: a shutdown signal channel, plus a Close() method that’s safe to call once, plus a sync.WaitGroup so Close() blocks until workers are gone.
// bulkhead/bulkhead.go
package bulkhead
import (
"context"
"errors"
"fmt"
"sync"
)
var (
ErrQueueFull = errors.New("bulkhead: queue full")
ErrClosed = errors.New("bulkhead: closed")
)
type task struct {
ctx context.Context
fn func(context.Context) error
done chan error
}
type Bulkhead struct {
name string
queue chan task
stop chan struct{}
stopOnce sync.Once
wg sync.WaitGroup
}
type Config struct {
Name string
MaxConcurrency int // number of workers
MaxQueueSize int // bounded buffer in front of workers
}
func New(cfg Config) *Bulkhead {
if cfg.MaxConcurrency <= 0 {
cfg.MaxConcurrency = 10
}
if cfg.MaxQueueSize < 0 {
cfg.MaxQueueSize = 0
}
b := &Bulkhead{
name: cfg.Name,
queue: make(chan task, cfg.MaxQueueSize),
stop: make(chan struct{}),
}
b.wg.Add(cfg.MaxConcurrency)
for i := 0; i < cfg.MaxConcurrency; i++ {
go b.worker()
}
return b
}
The worker loop uses a select on both the task channel and the stop channel. That’s the shutdown signal: workers exit cleanly, drained or not. I intentionally drop in-flight queued tasks on shutdown rather than waiting them out — if you’re closing the bulkhead, you’re typically closing the service, and the caller’s context will cancel anyway.
One more thing the worker has to get right: if t.fn panics, the caller of Submit is blocked on <-t.done. A naked t.done <- t.fn(t.ctx) never writes on a panic path, so the caller hangs forever (or until its context deadline fires, if it bothered to set one — and a lot of callers don’t). The fix is a safeRun helper that puts a recover around the call and always writes an outcome to t.done. This mirrors the panic-safety pattern in the circuit breaker’s Execute: no panic exits without the state machine being told about it.
func (b *Bulkhead) worker() {
defer b.wg.Done()
for {
select {
case <-b.stop:
return
case t, ok := <-b.queue:
if !ok {
return
}
// Honor the caller's context: if it's already dead, skip the work.
if err := t.ctx.Err(); err != nil {
t.done <- err
continue
}
safeRun(t)
}
}
}
// safeRun guarantees t.done receives a value, even if t.fn panics.
// Without this, a panicking task strands Submit's caller on <-t.done.
func safeRun(t task) {
defer func() {
if r := recover(); r != nil {
t.done <- fmt.Errorf("bulkhead: panic: %v", r)
}
}()
t.done <- t.fn(t.ctx)
}
Submit tries to enqueue without blocking. If the queue is full we return ErrQueueFull immediately — that’s the whole point. If the bulkhead is closed, ErrClosed. If the caller’s context dies while we’re waiting, we propagate that. One shape I see bungled constantly: a select that races stop against the enqueue send may, when both are ready, pick the enqueue — Go’s select chooses among ready cases at random and promises no ordering. The clean shape is a non-blocking stop check before enqueueing, and a stop case on the wait leg too, so a Close() after a successful enqueue still unblocks the caller instead of hanging it forever when no context deadline is set.
func (b *Bulkhead) Submit(ctx context.Context, fn func(context.Context) error) error {
// Check Closed FIRST with a non-blocking read. A select that races
// stop against queue-send can pick the enqueue even after Close(),
// stranding the task with no worker to pick it up.
select {
case <-b.stop:
return ErrClosed
default:
}
t := task{ctx: ctx, fn: fn, done: make(chan error, 1)}
select {
case b.queue <- t:
default:
return ErrQueueFull
}
// Wait for worker, caller cancellation, or shutdown. Without the
// stop case, a Close() after enqueue would strand this goroutine
// when no ctx deadline is set.
select {
case err := <-t.done:
return err
case <-ctx.Done():
return ctx.Err()
case <-b.stop:
return ErrClosed
}
}
// Close stops the bulkhead and waits for workers to exit. Safe to call once.
// Subsequent Submit calls return ErrClosed.
func (b *Bulkhead) Close() {
b.stopOnce.Do(func() {
close(b.stop)
})
b.wg.Wait()
}
Sizing: bulkheads cap concurrency, not request rate, so apply Little’s Law — concurrency ≈ RPS × average latency. If you’re sending 200 RPS to catalog at a typical 50ms latency, that’s roughly 10 concurrent calls in flight across the whole fleet; divide by your replica count (say 10) and each replica’s bulkhead needs ~1 slot plus headroom. Push the arithmetic the other way when latency is high: 200 RPS at 500ms is ~100 concurrent, or ~10 per replica. Most teams I audit size bulkheads far too high because they eyeball “RPS / replicas” and skip the latency term. Queue depth should be small — I usually pick queue = concurrency, so at most you’re holding one wave of overflow.
Timeouts And Context Propagation
Timeouts are the resilience pattern that people think they have, but usually don’t. The test is: if your outermost request has a 2-second deadline, and it calls three downstream services, does each downstream hop inherit that 2-second budget, or does each hop have its own 5-second timeout?
If each hop has its own fixed timeout, you can spend 5 seconds on a call that the client already gave up on. That’s wasted work and — worse — work that fills your bulkhead while real, still-wanted traffic waits.
The pattern in Go is always use context deadlines, never bare timeouts. Set the end-to-end deadline at the edge (HTTP server, gRPC entry point). Every downstream call derives from that context. Per-hop timeouts are expressed as context.WithTimeout(parent, perHopBudget) — whichever fires first wins.
// at the edge, typically in middleware
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()
// inside handlers, per-hop budgets shrink the parent deadline
hopCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
defer cancel()
resp, err := downstream.DoSomething(hopCtx, req)
A subtle failure mode: http.Client.Timeout is independent of the request context. If you set both, whichever fires first wins, which is fine. But if you rely only on http.Client.Timeout with no request context, the caller cancelling their parent context doesn’t cancel the HTTP request. The fix: always construct requests with http.NewRequestWithContext and let the context carry cancellation.
Retries: Only Idempotent, Only With Budget, Only With Jitter
Retries are dangerous unless you bound them hard. Four rules:
- Only retry idempotent operations. GET, PUT, DELETE of specific IDs — safe. POST that creates a record — unsafe without an idempotency key, because you’ll create duplicates.
- Retry only on transient errors: network errors, timeouts, 502/503/504. Never retry on 400, 401, 403, 404, 409, 422 — the response will be identical, you’re just adding load.
- Exponential backoff with jitter. Fixed backoff synchronizes retries across a fleet, producing a thundering herd the moment the dependency recovers. Jitter spreads them.
- Budget the retries. Max 2-3 attempts, always check ctx.Done() between attempts, and never retry past the parent deadline.
Classifying “retryable” in modern Go (without deprecated APIs)
The old way was if netErr, ok := err.(net.Error); ok { return netErr.Temporary() || netErr.Timeout() }. Don’t do that anymore. net.Error.Temporary() was deprecated in Go 1.18 precisely because “temporary” had no consistent meaning — it flagged things that weren’t transient and missed things that were. The Go team advises: use errors.Is with specific sentinels, check net.Error.Timeout(), and check context errors explicitly.
Here is a classifier that works on current Go:
// retry/classify.go
package retry
import (
"context"
"errors"
"net"
"syscall"
)
// IsRetryable returns true for errors that are worth retrying.
// It does NOT call net.Error.Temporary() — that method is deprecated
// and semantically vague. Instead: timeouts, context cancellations for
// the caller's reasons, and specific transient syscall errors.
func IsRetryable(err error) bool {
if err == nil {
return false
}
// Caller gave up — do not retry, propagate cancellation.
if errors.Is(err, context.Canceled) {
return false
}
// Deadline from the caller — do not retry, we're out of budget.
if errors.Is(err, context.DeadlineExceeded) {
return false
}
// Connection refused / reset — dependency not ready. Retryable.
if errors.Is(err, syscall.ECONNREFUSED) ||
errors.Is(err, syscall.ECONNRESET) ||
errors.Is(err, syscall.EPIPE) {
return true
}
// Net timeout (dial/read/write). Retryable with backoff.
var netErr net.Error
if errors.As(err, &netErr) && netErr.Timeout() {
return true
}
// DNS failures: IsTimeout is the non-deprecated signal.
// net.DNSError.IsTemporary was deprecated in Go 1.18 alongside
// net.Error.Temporary() — same reason, same vagueness.
var dnsErr *net.DNSError
if errors.As(err, &dnsErr) && dnsErr.IsTimeout {
return true
}
return false
}
For HTTP status codes, wrap them in a typed error and decide at the retry layer:
// retry/status.go
package retry
import (
	"fmt"
	"net/http"
)
type StatusError struct {
	Code int
}
func (e *StatusError) Error() string {
	return fmt.Sprintf("upstream status %d %s", e.Code, http.StatusText(e.Code))
}
func IsRetryableStatus(code int) bool {
switch code {
case 408, 425, 429, 500, 502, 503, 504:
return true
}
return false
}
429 deserves a note: it’s the dependency telling you to slow down. If it includes Retry-After, honor it — don’t just run your own backoff over the top.
Backoff with jitter, clamped
The classic bug in jitter math: if RandomizationFactor > 1.0, the multiplier 1.0 + factor*(2*rand-1.0) can go negative, producing a negative sleep (which is then cast to a huge positive duration by time arithmetic — or zero, depending on the shape of your math). Clamp the factor to [0, 1] in the constructor. No exceptions.
// retry/retry.go
// Requires Go 1.22+ for math/rand/v2.
package retry
import (
"context"
"errors"
"math/rand/v2"
"time"
)
type Options struct {
MaxAttempts int // total attempts, including the first
InitialInterval time.Duration
MaxInterval time.Duration
Multiplier float64
RandomizationFactor float64 // 0.0 to 1.0
IsRetryable func(error) bool
}
func NewOptions() Options {
return Options{
MaxAttempts: 3,
InitialInterval: 100 * time.Millisecond,
MaxInterval: 5 * time.Second,
Multiplier: 2.0,
RandomizationFactor: 0.5,
IsRetryable: IsRetryable,
}
}
// validate clamps inputs to safe ranges. Negative or absurd values
// become sane defaults rather than runtime weirdness.
func (o *Options) validate() {
if o.MaxAttempts < 1 {
o.MaxAttempts = 1
}
if o.InitialInterval <= 0 {
o.InitialInterval = 100 * time.Millisecond
}
if o.MaxInterval < o.InitialInterval {
o.MaxInterval = o.InitialInterval
}
if o.Multiplier < 1.0 {
o.Multiplier = 1.0
}
// Clamp jitter to [0, 1]. Outside that range produces negative sleeps
// or absurd intervals. This is the bug people ship.
if o.RandomizationFactor < 0 {
o.RandomizationFactor = 0
}
if o.RandomizationFactor > 1 {
o.RandomizationFactor = 1
}
if o.IsRetryable == nil {
o.IsRetryable = IsRetryable
}
}
func Do(ctx context.Context, fn func(context.Context) error, opts Options) error {
opts.validate()
interval := opts.InitialInterval
var lastErr error
for attempt := 0; attempt < opts.MaxAttempts; attempt++ {
// Respect the parent deadline before every attempt.
if err := ctx.Err(); err != nil {
return err
}
err := fn(ctx)
if err == nil {
return nil
}
lastErr = err
if !opts.IsRetryable(err) {
return err
}
if attempt == opts.MaxAttempts-1 {
break
}
// Jitter multiplier in [1-factor, 1+factor], always > 0 after clamp.
jitter := 1.0 + opts.RandomizationFactor*(2.0*rand.Float64()-1.0)
next := time.Duration(float64(interval) * jitter)
if next > opts.MaxInterval {
next = opts.MaxInterval
}
interval = time.Duration(float64(interval) * opts.Multiplier)
if interval > opts.MaxInterval {
interval = opts.MaxInterval
}
timer := time.NewTimer(next)
select {
case <-timer.C:
case <-ctx.Done():
timer.Stop()
return ctx.Err()
}
}
return errors.Join(errors.New("retry: exhausted"), lastErr)
}
Breaker-gated retries
This is the pairing that actually works: retries inside the breaker, not around it. The breaker sees the final outcome of (call + retries), decides whether to open. The retry layer sees the classifier’s verdict and backs off. If the breaker is open, retries never happen — which is exactly what you want during an outage.
err := cb.Execute(func() error {
return retry.Do(ctx, func(ctx context.Context) error {
return doCall(ctx)
}, retryOpts)
})
Health Checks: Liveness, Readiness, Startup Are Not The Same
Kubernetes has three probe types and they are not interchangeable. Every team I’ve audited has gotten at least one of them wrong.
- Liveness: “is this process healthy, or should the orchestrator kill it?” Answer based on the process itself — deadlock detector, goroutine count, internal panic state. Never check downstream dependencies here. If your user service liveness probe fails when the database is down, Kubernetes will restart the user service, which doesn’t help, makes debugging harder, and can worsen the outage.
- Readiness: “should I receive traffic right now?” Answer based on dependencies that are strictly required for this service to serve requests. If the service is up but not yet warmed up, or a required dependency is down, readiness fails and the load balancer drains this pod until it recovers.
- Startup: “has the process finished initializing?” Used to give slow-starting services more time before liveness/readiness kick in, without loosening their thresholds.
The readiness probe is the interesting one. It must be cheap (called every few seconds), must not cascade (don’t call another service’s readiness, which calls another), and must reflect a binary answer. Load balancers — service meshes, Kubernetes services, cloud LBs — remove pods from rotation when readiness fails. That’s a controlled shedding tool; use it.
// health/health.go
package health
import (
"context"
"encoding/json"
"net/http"
"sync"
"time"
)
type CheckFunc func(ctx context.Context) error
type Check struct {
Name string
Fn CheckFunc
Timeout time.Duration
}
type Registry struct {
mu sync.RWMutex
liveness []Check
readiness []Check
}
func New() *Registry { return &Registry{} }
func (r *Registry) AddLiveness(c Check) { r.mu.Lock(); r.liveness = append(r.liveness, c); r.mu.Unlock() }
func (r *Registry) AddReadiness(c Check) { r.mu.Lock(); r.readiness = append(r.readiness, c); r.mu.Unlock() }
func (r *Registry) run(ctx context.Context, checks []Check) (bool, map[string]string) {
results := make(map[string]string, len(checks))
ok := true
for _, c := range checks {
	timeout := c.Timeout
	if timeout <= 0 {
		timeout = time.Second // a zero Timeout would create an already-expired context
	}
	cctx, cancel := context.WithTimeout(ctx, timeout)
	err := c.Fn(cctx)
	cancel()
if err != nil {
ok = false
results[c.Name] = err.Error()
} else {
results[c.Name] = "ok"
}
}
return ok, results
}
func (r *Registry) LivenessHandler() http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
r.mu.RLock()
checks := append([]Check(nil), r.liveness...)
r.mu.RUnlock()
r.write(w, req, checks)
})
}
func (r *Registry) ReadinessHandler() http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
r.mu.RLock()
checks := append([]Check(nil), r.readiness...)
r.mu.RUnlock()
r.write(w, req, checks)
})
}
func (r *Registry) write(w http.ResponseWriter, req *http.Request, checks []Check) {
ok, results := r.run(req.Context(), checks)
status := http.StatusOK
if !ok {
status = http.StatusServiceUnavailable
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(status)
_ = json.NewEncoder(w).Encode(results)
}
Liveness checks I actually add: “can this goroutine acquire a token from a per-process watchdog semaphore within 1 second?” That catches deadlocks but is blind to downstream outages. Readiness checks I actually add: DB ping with a tight timeout, one probe call to each strictly-required dependency, “am I past the warmup period?” flag.
Graceful Degradation: Partial Answers Beat No Answers
When a non-critical dependency is down, returning an error to the user is often the wrong answer. Stale data, a default value, or a partial response is usually better. The homepage with an empty “Recommended for you” section beats a 500.
The shape: wrap the call in a TryWithFallback(primary, fallback). Primary is the live dependency. Fallback is cache, default value, or skip-this-section. On success, update the cache. On failure, serve cache if fresh-enough, or the default.
A warning before the code: this cache is keyed by an application-chosen string (user ID, product ID, region). If any part of the key is attacker-influenced and the key space is unbounded, the map grows without bound — Get filters by expiry but never deletes, so expired entries accumulate forever. The shape below includes a background janitor that sweeps expired entries on an interval, plus a Close() to stop it. For high-cardinality production caches I reach for hashicorp/golang-lru/v2 instead: bounded size, O(1) eviction, no goroutine to babysit. Pick LRU the moment your key space is user-controlled or your working set is larger than you want to eyeball.
```go
// degrade/degrade.go
package degrade

import (
	"context"
	"sync"
	"time"
)

type entry struct {
	val any
	exp time.Time
}

type Cache struct {
	mu       sync.RWMutex
	data     map[string]entry
	ttl      time.Duration
	stop     chan struct{}
	stopOnce sync.Once
}

// NewCache returns a TTL cache with a background janitor. Call Close when
// done to stop the janitor goroutine.
func NewCache(ttl time.Duration) *Cache {
	c := &Cache{
		data: make(map[string]entry),
		ttl:  ttl,
		stop: make(chan struct{}),
	}
	go c.janitor(ttl)
	return c
}

func (c *Cache) Get(key string) (any, bool) {
	c.mu.RLock()
	e, ok := c.data[key]
	c.mu.RUnlock()
	if !ok || time.Now().After(e.exp) {
		return nil, false
	}
	return e.val, true
}

func (c *Cache) Set(key string, val any) {
	c.mu.Lock()
	c.data[key] = entry{val: val, exp: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}

// janitor evicts expired entries so the map cannot grow unbounded when
// Get-misses leave stale keys behind. Interval == ttl is a reasonable
// default: worst case an entry lingers for ~2*ttl after expiry.
func (c *Cache) janitor(interval time.Duration) {
	if interval <= 0 {
		return
	}
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-c.stop:
			return
		case now := <-t.C:
			c.mu.Lock()
			for k, e := range c.data {
				if now.After(e.exp) {
					delete(c.data, k)
				}
			}
			c.mu.Unlock()
		}
	}
}

func (c *Cache) Close() {
	c.stopOnce.Do(func() { close(c.stop) })
}

// Result carries whether the value is fresh or degraded, so callers can
// surface a banner.
type Result[T any] struct {
	Value    T
	Degraded bool
}

// TryWithFallback runs primary; on failure, serves cache if present,
// otherwise invokes fallback.
func TryWithFallback[T any](
	ctx context.Context,
	cache *Cache,
	key string,
	primary func(context.Context) (T, error),
	fallback func() T,
) (Result[T], error) {
	v, err := primary(ctx)
	if err == nil {
		cache.Set(key, v)
		return Result[T]{Value: v, Degraded: false}, nil
	}
	// Do not degrade on caller cancellation or deadline — that's the
	// caller saying "stop", not the dependency being unhealthy.
	// Propagate so timeouts and cancellations stay visible up the stack.
	if ctx.Err() != nil {
		var zero T
		return Result[T]{Value: zero}, ctx.Err()
	}
	if v, ok := cache.Get(key); ok {
		return Result[T]{Value: v.(T), Degraded: true}, nil
	}
	return Result[T]{Value: fallback(), Degraded: true}, nil
}
```
Two things to be honest about when you degrade:
- The user should know. Surface a banner (“showing cached results”) or a structural hint in the response. Silent degradation hides outages from everyone, including you.
- It doesn’t work for writes. Creating orders, taking payments, changing passwords — you can’t fallback-to-cache those. For write paths, degrade to “we’re having trouble, try again” and let the circuit breaker do its job.
Stale-while-revalidate is the next level: serve from cache immediately, trigger an async refresh, and return the fresh value on the next call. Good for read-heavy, high-traffic endpoints where latency matters more than being current to the millisecond.
Wiring It Together
The layering I use for every outbound dependency:
```text
context deadline (end-to-end budget)
  -> bulkhead (bounded concurrency per dependency)
  -> circuit breaker (fail fast if dependency is down)
  -> retry (bounded, jittered, idempotent only)
  -> http.Client with per-attempt timeout
```
Each layer has one job. Break any layer and the failure mode has a name: no bulkhead = resource exhaustion; no breaker = cascading failure; no retry = flap on transient errors; no deadline = slow-loris hops.
```go
func NewUserServiceClient() *UserClient {
	cb := circuitbreaker.New(circuitbreaker.Config{
		Name:             "user-service",
		FailureThreshold: 5,
		SuccessThreshold: 2,
		ResetTimeout:     10 * time.Second,
		OnStateChange: func(name string, from, to circuitbreaker.State) {
			metrics.CBStateChange.WithLabelValues(name, to.String()).Inc()
		},
	})
	bh := bulkhead.New(bulkhead.Config{
		Name:           "user-service",
		MaxConcurrency: 20,
		MaxQueueSize:   20,
	})
	return &UserClient{
		http:     &http.Client{Timeout: 500 * time.Millisecond},
		cb:       cb,
		bulkhead: bh,
		retry:    retry.NewOptions(),
	}
}

func (c *UserClient) GetUser(ctx context.Context, id string) (*User, error) {
	var u *User
	err := c.bulkhead.Submit(ctx, func(ctx context.Context) error {
		return c.cb.Execute(func() error {
			return retry.Do(ctx, func(ctx context.Context) error {
				var err error
				u, err = c.fetchUser(ctx, id)
				return err
			}, c.retry)
		})
	})
	return u, err
}
```
And on the shutdown side, honor the layers in reverse: stop accepting new requests, drain in-flight requests up to a shutdown budget, close bulkheads.
```go
func (s *Server) Shutdown(ctx context.Context) error {
	// Stop accepting new requests, then drain in-flight up to ctx's budget.
	err := s.http.Shutdown(ctx)
	// Close bulkheads even if the drain timed out; either way, no new
	// work is coming through them.
	s.userClient.bulkhead.Close()
	s.catalogClient.bulkhead.Close()
	return err
}
```
What I’d Actually Choose
For any new Go service with downstream dependencies, I start with this stack:
- Breaker: the one in this post, or `sony/gobreaker` if you want a library. Both are fine; `gobreaker` has a good API and is maintained.
- Bulkhead: roll your own with a bounded channel + worker pool as shown. It’s 60 lines and you understand every line.
- Retry: `cenkalti/backoff/v4` if you want a library with good jitter; otherwise the pattern above. Avoid `backoff/v3` — the API changed.
- Timeouts: exclusively via `context.WithDeadline` at the edge and `context.WithTimeout` per hop. Delete any `time.After` you find in request paths; they leak timers.
- Health: separate `/livez` and `/readyz` handlers. `/livez` checks nothing external. `/readyz` checks the immediately-required dependencies with short timeouts.
- Degradation: explicit `TryWithFallback` at the handler level, with a `Degraded` flag on responses that the frontend can render as a banner.
The biggest mistake I see teams make: layering resilience patterns without integration tests that actually break dependencies. A circuit breaker you haven’t seen open, a bulkhead you haven’t seen reject, a fallback you haven’t seen serve — none of those are real. Write tests that use httptest servers returning 503s, dropping connections mid-response, and sleeping past timeouts. Then run those tests on every change. The patterns only help if they’re wired correctly, and “wired correctly” is proved by tests, not by reading the code.
And the second biggest: resilience patterns as a substitute for capacity. If your dependency is at 100% CPU 24/7, no circuit breaker will save you — you need to scale. Resilience patterns are for handling incidents, not for masking chronic under-provisioning.