Resilient Microservices in Go

Circuit breakers, bulkheads, retries, timeouts, and degradation — the fault-tolerance patterns that keep distributed Go systems alive when dependencies fail.

Microservices · Golang · Architecture

Last quarter, a payment service I maintain started throwing 503s at 2 AM. The root cause was a catalog service three hops away that had run out of database connections. Without circuit breakers, every request to the payment service blocked for 30 seconds waiting on a dead dependency, the request pool filled up, and the entire checkout flow collapsed. A single unhealthy service took down five others in under a minute.

That is cascading failure, and it is the default behavior of most microservice architectures. You have to actively design against it. This post is the set of Go resilience patterns I apply to every service I build — circuit breakers, bulkheads, retries with bounded backoff, timeouts with context propagation, health checks, and graceful degradation — plus the failure modes each one is actually defending against and the ones it’s not.

I’ll show the shape of the code, not a drop-in library. Copy the patterns, understand the tradeoffs, and write your own. Copy-paste without understanding and you’ll ship a bug that looks like resilience until the day it isn’t.

Threat Model: What Actually Takes Services Down

Before any code, name the failure modes. Every pattern below defends against something specific — if you can’t name the threat, drop the pattern.

| Failure mode | What it looks like | Defense |
| --- | --- | --- |
| Cascading failure | One slow dependency blocks callers; callers fill up and fail their callers | Circuit breaker + timeouts |
| Thundering herd | All clients retry simultaneously after an outage, overwhelming recovery | Exponential backoff with jitter |
| Resource exhaustion | One bad dependency starves a shared goroutine pool or connection pool | Bulkhead (bounded per-dependency pools) |
| Retry storm | Clients retry 3x against an already-overloaded service, tripling load | Circuit breaker gating retries, retry budgets |
| Partial failure | Non-critical dependency down, whole request fails when it could degrade | Graceful degradation, cached/default responses |
| Slow-loris hops | End-to-end deadline is 2s but each hop waits 5s | Context deadline propagation across RPC boundaries |
| Orchestrator flapping | Liveness probe restarts a service that's merely slow, compounding the outage | Distinct liveness/readiness/startup probes |

I’ll come back to this table. The most common mistake I see is teams implementing retries without a circuit breaker. Retries without a breaker turn a struggling service into a dead one: every client retries 3x, tripling load on a service already at its limit. The breaker is what makes retries safe.

Circuit Breaker: Fail Fast When The Downstream Is Down

A circuit breaker is a state machine that sits between your code and a remote call. Closed = calls flow through. Open = calls fail instantly without touching the dependency. Half-open = one probe call at a time, to detect recovery. The breaker flips to Open after too many failures in a window, then flips back via Half-open after a cooldown.

The point is not to handle the error. The point is to stop waiting. When a downstream hangs, every caller blocks a goroutine and a connection for the full timeout. Multiply that by request rate and you’ve exhausted your pool in seconds. An open breaker returns immediately so the caller can free resources, return stale data, or fail gracefully.

When to open, when to close

There are two schools of thought on the threshold: consecutive failures vs. rolling error rate. Consecutive is simpler and usually enough — a handful of back-to-back failures is a strong signal. Rolling-rate (e.g. >50% errors over the last 20 requests) is more accurate under mixed traffic, but needs a ring buffer and tuning. Start with consecutive; move to rolling only when you see it misfire.
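
If you do outgrow consecutive counting, the rolling-rate version is a fixed-size ring buffer of outcomes. A minimal sketch — the names are illustrative, and this is deliberately not wired into the breaker implementation below:

```go
package main

import "fmt"

// rollingWindow tracks the last N call outcomes in a ring buffer and
// reports the error rate over that window.
type rollingWindow struct {
	outcomes []bool // true = failure
	next     int    // next slot to overwrite
	filled   int    // how many slots hold real data
}

func newRollingWindow(size int) *rollingWindow {
	return &rollingWindow{outcomes: make([]bool, size)}
}

func (w *rollingWindow) record(failed bool) {
	w.outcomes[w.next] = failed
	w.next = (w.next + 1) % len(w.outcomes)
	if w.filled < len(w.outcomes) {
		w.filled++
	}
}

// errorRate returns the failure fraction over observed outcomes.
func (w *rollingWindow) errorRate() float64 {
	if w.filled == 0 {
		return 0
	}
	fails := 0
	for i := 0; i < w.filled; i++ {
		if w.outcomes[i] {
			fails++
		}
	}
	return float64(fails) / float64(w.filled)
}

func main() {
	w := newRollingWindow(20)
	for i := 0; i < 10; i++ {
		w.record(i%2 == 0) // alternate failure/success
	}
	fmt.Printf("error rate: %.2f\n", w.errorRate()) // error rate: 0.50
}
```

Only trust the rate once filled covers the whole window — otherwise a single early failure reads as a 100% error rate and trips the breaker on startup.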

For Half-open, let exactly one request through at a time until you’ve accumulated enough successes to close. Let a hundred through and you’ll re-overload the recovering service. This matters more than people think.

Why a full Lock, not RLock

Every call to allowRequest may need to transition state (Open -> Half-open when the cooldown elapses). RLock can’t upgrade to Lock in Go, and manual upgrade via unlock-then-relock creates a TOCTOU race. Just take the write lock. The contention cost is tiny next to a network RPC.

The state-change callback trap

This is the bug I want to highlight because I’ve seen it in production code: a lot of breaker implementations call the onStateChange observer callback while holding the lock. If that callback logs, emits a metric, or (worst case) calls back into the breaker to read state, you have a deadlock, a lock-inversion hazard, or — the subtler version — the callback runs before the state transition is visible to other goroutines. Equally bad: if you release the lock and then call the callback, a concurrent updateState can already have flipped the state a second time, so the callback observes a value that doesn’t match its own (from, to) arguments.

The fix is narrow: commit the state change under the lock, snapshot what the callback needs, release the lock, then dispatch the callback on a separate goroutine. The callback sees exactly the transition it was told about, and nothing it does can deadlock the breaker.

// circuitbreaker/circuitbreaker.go
package circuitbreaker

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

func (s State) String() string {
	switch s {
	case StateClosed:
		return "closed"
	case StateOpen:
		return "open"
	case StateHalfOpen:
		return "half-open"
	default:
		return "unknown"
	}
}

// ErrOpen is returned when the circuit is open. Callers check with errors.Is.
var ErrOpen = errors.New("circuit breaker: open")

type Config struct {
	Name             string
	FailureThreshold int           // consecutive failures to open
	SuccessThreshold int           // consecutive successes in half-open to close
	ResetTimeout     time.Duration // how long to stay Open before probing
	OnStateChange    func(name string, from, to State)
}

type CircuitBreaker struct {
	cfg                 Config
	mu                  sync.Mutex
	state               State
	failureCount        int
	successCount        int
	halfOpenInFlight    bool // allow only one probe in half-open
	lastStateChangeTime time.Time
}

func New(cfg Config) *CircuitBreaker {
	if cfg.FailureThreshold <= 0 {
		cfg.FailureThreshold = 5
	}
	if cfg.SuccessThreshold <= 0 {
		cfg.SuccessThreshold = 2
	}
	if cfg.ResetTimeout <= 0 {
		cfg.ResetTimeout = 10 * time.Second
	}
	return &CircuitBreaker{cfg: cfg, state: StateClosed, lastStateChangeTime: time.Now()}
}

The Execute path has two responsibilities: admission (should this call proceed?) and recording (what did this call’s outcome mean for state?). Both are mutex-protected. The fiddly bit is half-open: we must let exactly one probe through, and we must remember we did, so a later record call can clear the halfOpenInFlight flag. One more subtle requirement: if fn() panics, we still have to record the outcome — otherwise a panic inside a half-open probe leaves halfOpenInFlight=true forever and the breaker is permanently stuck. defer + named return covers both the normal and panic paths.

func (cb *CircuitBreaker) Execute(fn func() error) (err error) {
	if err := cb.admit(); err != nil {
		return err
	}
	// Use defer + named return so a panic in fn() still records the
	// outcome. Without this, a panic inside a half-open probe leaves
	// halfOpenInFlight=true forever and the breaker never recovers.
	defer func() {
		if r := recover(); r != nil {
			// Belt-and-suspenders: if record itself panics (e.g. a buggy
			// OnStateChange wrapper), don't let it shadow the original
			// panic value — recover the inner panic and re-raise the
			// original so the stack trace points at the real bug.
			func() {
				defer func() { _ = recover() }()
				cb.record(fmt.Errorf("panic: %v", r))
			}()
			panic(r) // re-raise after state is consistent
		}
		cb.record(err)
	}()
	err = fn()
	return err
}

func (cb *CircuitBreaker) admit() error {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case StateClosed:
		return nil
	case StateOpen:
		if time.Since(cb.lastStateChangeTime) < cb.cfg.ResetTimeout {
			return ErrOpen
		}
		// Cooldown elapsed. Transition to half-open and admit this as the probe.
		cb.transitionLocked(StateHalfOpen)
		cb.halfOpenInFlight = true
		return nil
	case StateHalfOpen:
		if cb.halfOpenInFlight {
			return ErrOpen // only one probe at a time
		}
		cb.halfOpenInFlight = true
		return nil
	}
	return ErrOpen
}

func (cb *CircuitBreaker) record(err error) {
	cb.mu.Lock()

	if cb.state == StateHalfOpen {
		cb.halfOpenInFlight = false
	}

	// Context cancellations/deadlines are the caller giving up — they
	// are NOT a signal the dependency is unhealthy. Counting them would
	// let a burst of client disconnects trip the breaker and block
	// traffic the dependency is perfectly capable of serving.
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		cb.mu.Unlock()
		return
	}

	if err != nil {
		cb.failureCount++
		cb.successCount = 0
		if cb.state == StateHalfOpen ||
			(cb.state == StateClosed && cb.failureCount >= cb.cfg.FailureThreshold) {
			cb.transitionLocked(StateOpen)
		}
	} else {
		cb.successCount++
		cb.failureCount = 0
		if cb.state == StateHalfOpen && cb.successCount >= cb.cfg.SuccessThreshold {
			cb.transitionLocked(StateClosed)
		}
	}
	cb.mu.Unlock()
}

transitionLocked is where the callback fix lives. It stages the callback invocation and dispatches it from a fresh goroutine after the lock is released. The callback runs exactly once per transition, sees consistent (from, to) values, and cannot deadlock us.

// transitionLocked must be called with cb.mu held. It mutates state, snapshots
// the (from, to) pair, and dispatches the observer callback on a goroutine
// so it runs AFTER the lock is released and cannot re-enter the breaker.
func (cb *CircuitBreaker) transitionLocked(to State) {
	from := cb.state
	if from == to {
		return
	}
	cb.state = to
	cb.lastStateChangeTime = time.Now()
	cb.failureCount = 0
	cb.successCount = 0

	if cb.cfg.OnStateChange != nil {
		name := cb.cfg.Name
		cbFn := cb.cfg.OnStateChange
		go cbFn(name, from, to)
	}
}

func (cb *CircuitBreaker) State() State {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	return cb.state
}

A word on the alternative: some breakers let the callback run synchronously inside the lock. That’s defensible if — and only if — you document that callbacks must be non-blocking, non-logging, and never touch the breaker. I don’t trust that contract to survive a junior dev reaching for it in an incident. Dispatch-to-goroutine is safer by default.

One more decision record() makes: context cancellation and deadline errors do not count as failures. If a client disconnects mid-request and your handler returns ctx.Err(), that is not a signal the downstream is sick — it’s a signal the caller gave up. Counting those toward the threshold lets a burst of browser closes or upstream timeouts trip the breaker on a perfectly healthy dependency. The test I use: the failure must be attributable to the downstream, not to anything happening on my side of the wire.

HTTP Client With A Breaker

The breaker by itself doesn’t know about HTTP. Wrap it in a thin client so call sites get a boring Get/Post API with the breaker applied automatically. A 5xx from the server counts as a failure; a 4xx does not (that’s a client bug, not a dependency outage).

// client/http.go
package client

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"

	"example.com/resilience/circuitbreaker"
)

type HTTPClient struct {
	http *http.Client
	cb   *circuitbreaker.CircuitBreaker
}

func NewHTTPClient(timeout time.Duration, cb *circuitbreaker.CircuitBreaker) *HTTPClient {
	return &HTTPClient{
		http: &http.Client{Timeout: timeout},
		cb:   cb,
	}
}

func (c *HTTPClient) Get(ctx context.Context, url string) ([]byte, error) {
	var body []byte
	err := c.cb.Execute(func() error {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := c.http.Do(req)
		if err != nil {
			return fmt.Errorf("request: %w", err)
		}
		defer resp.Body.Close()

		if resp.StatusCode >= 500 {
			return fmt.Errorf("upstream %d: %s", resp.StatusCode, resp.Status)
		}
		body, err = io.ReadAll(resp.Body)
		return err
	})
	if errors.Is(err, circuitbreaker.ErrOpen) {
		// Caller distinguishes "we skipped the call" from "the call failed".
		return nil, err
	}
	return body, err
}

Bulkheads: Bound The Blast Radius

A bulkhead is exactly what it sounds like on a ship: a wall that contains flooding to one compartment. In code, it’s a bounded resource pool per dependency — a semaphore of N concurrent calls, optionally with a bounded waiting queue. When the pool is full, calls fail fast instead of piling up.

Without bulkheads, all your calls share one Go scheduler, one HTTP connection pool, one request handler pool. A slow catalog service fills that shared pool with waiters, and suddenly your fast user service can’t get a slot. Bulkheads carve up the pool so the catalog outage stays in its own compartment.

Bounded queues, not unbounded

An unbounded queue is not a queue — it’s a memory leak with a latency chart. If the dependency is slow, the queue grows until you OOM or tail latency reaches “service is down” territory. Cap the queue. When it’s full, reject new work immediately with a clear error. The caller then decides: degrade, retry elsewhere, or propagate the error.

The goroutine leak waiting to bite you

Here’s the bug that used to live in this article and probably lives in most bulkhead implementations on the internet: you spawn N worker goroutines reading from a channel. When you shut down, unless you close the task channel, those workers block forever on the range loop. Worse, if you just close the channel you may panic any producer still trying to send. The clean shape is: a shutdown signal channel, plus a Close() method that’s safe to call once, plus a sync.WaitGroup so Close() blocks until workers are gone.

// bulkhead/bulkhead.go
package bulkhead

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

var (
	ErrQueueFull = errors.New("bulkhead: queue full")
	ErrClosed    = errors.New("bulkhead: closed")
)

type task struct {
	ctx  context.Context
	fn   func(context.Context) error
	done chan error
}

type Bulkhead struct {
	name     string
	queue    chan task
	stop     chan struct{}
	stopOnce sync.Once
	wg       sync.WaitGroup
}

type Config struct {
	Name           string
	MaxConcurrency int // number of workers
	MaxQueueSize   int // bounded buffer in front of workers
}

func New(cfg Config) *Bulkhead {
	if cfg.MaxConcurrency <= 0 {
		cfg.MaxConcurrency = 10
	}
	if cfg.MaxQueueSize < 0 {
		cfg.MaxQueueSize = 0
	}
	b := &Bulkhead{
		name:  cfg.Name,
		queue: make(chan task, cfg.MaxQueueSize),
		stop:  make(chan struct{}),
	}
	b.wg.Add(cfg.MaxConcurrency)
	for i := 0; i < cfg.MaxConcurrency; i++ {
		go b.worker()
	}
	return b
}

The worker loop uses a select on both the task channel and the stop channel. That’s the shutdown signal: workers exit cleanly, drained or not. I intentionally drop in-flight queued tasks on shutdown rather than waiting them out — if you’re closing the bulkhead, you’re typically closing the service, and the caller’s context will cancel anyway.

One more thing the worker has to get right: if t.fn panics, the caller of Submit is blocked on <-t.done. A naked t.done <- t.fn(t.ctx) never writes on a panic path, so the caller hangs forever (or until its context deadline fires, if it bothered to set one — and a lot of callers don’t). The fix is a safeRun helper that puts a recover around the call and always writes an outcome to t.done. This mirrors the panic-safety pattern in the circuit breaker’s Execute: no panic exits without the state machine being told about it.

func (b *Bulkhead) worker() {
	defer b.wg.Done()
	for {
		select {
		case <-b.stop:
			return
		case t, ok := <-b.queue:
			if !ok {
				return
			}
			// Honor the caller's context: if it's already dead, skip the work.
			if err := t.ctx.Err(); err != nil {
				t.done <- err
				continue
			}
			safeRun(t)
		}
	}
}

// safeRun guarantees t.done receives a value, even if t.fn panics.
// Without this, a panicking task strands Submit's caller on <-t.done.
func safeRun(t task) {
	defer func() {
		if r := recover(); r != nil {
			t.done <- fmt.Errorf("bulkhead: panic: %v", r)
		}
	}()
	t.done <- t.fn(t.ctx)
}

Submit tries to enqueue without blocking. If the queue is full we return ErrQueueFull immediately — that’s the whole point. If the bulkhead is closed, ErrClosed. If the caller’s context dies while we’re waiting, we propagate that. One shape I see bungled constantly: a select that races stop against the enqueue send will, if both are ready, randomly pick the enqueue — and Go’s select doesn’t promise ordering. The clean shape is a non-blocking stop check before enqueueing, and a stop case on the wait leg too, so a Close() after a successful enqueue still unblocks the caller instead of hanging it forever when no context deadline is set.

func (b *Bulkhead) Submit(ctx context.Context, fn func(context.Context) error) error {
	// Check Closed FIRST with a non-blocking read. A select that races
	// stop against queue-send can pick the enqueue even after Close(),
	// stranding the task with no worker to pick it up.
	select {
	case <-b.stop:
		return ErrClosed
	default:
	}

	t := task{ctx: ctx, fn: fn, done: make(chan error, 1)}
	select {
	case b.queue <- t:
	default:
		return ErrQueueFull
	}

	// Wait for worker, caller cancellation, or shutdown. Without the
	// stop case, a Close() after enqueue would strand this goroutine
	// when no ctx deadline is set.
	select {
	case err := <-t.done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	case <-b.stop:
		return ErrClosed
	}
}

// Close stops the bulkhead and waits for workers to exit. Safe to call once.
// Subsequent Submit calls return ErrClosed.
func (b *Bulkhead) Close() {
	b.stopOnce.Do(func() {
		close(b.stop)
	})
	b.wg.Wait()
}

Sizing: bulkheads cap concurrency, not request rate, so apply Little’s Law — concurrency ≈ RPS × average latency. If you’re sending 200 RPS to catalog at a typical 50ms latency, that’s roughly 10 concurrent calls in flight across the whole fleet; divide by your replica count (say 10) and each replica’s bulkhead needs ~1 slot plus headroom. Push the arithmetic the other way when latency is high: 200 RPS at 500ms is ~100 concurrent, or ~10 per replica. Most teams I audit size bulkheads far too high because they eyeball “RPS / replicas” and skip the latency term. Queue depth should be small — I usually pick queue = concurrency, so at most you’re holding one wave of overflow.

Timeouts And Context Propagation

Timeouts are the resilience pattern that people think they have, but usually don’t. The test is: if your outermost request has a 2-second deadline, and it calls three downstream services, does each downstream hop inherit that 2-second budget, or does each hop have its own 5-second timeout?

If each hop has its own fixed timeout, you can spend 5 seconds on a call that the client already gave up on. That’s wasted work and — worse — work that fills your bulkhead while real, still-wanted traffic waits.

The pattern in Go is always use context deadlines, never bare timeouts. Set the end-to-end deadline at the edge (HTTP server, gRPC entry point). Every downstream call derives from that context. Per-hop timeouts are expressed as context.WithTimeout(parent, perHopBudget) — whichever fires first wins.

// at the edge, typically in middleware
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()

// inside handlers, per-hop budgets shrink the parent deadline
hopCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
defer cancel()
resp, err := downstream.DoSomething(hopCtx, req)

A subtle failure mode: http.Client.Timeout is independent of the request context. If you set both, whichever fires first wins, which is fine. But if you rely only on http.Client.Timeout with no request context, the caller cancelling their parent context doesn’t cancel the HTTP request. The fix: always construct requests with http.NewRequestWithContext and let the context carry cancellation.

Retries: Only Idempotent, Only With Budget, Only With Jitter

Retries are dangerous unless you bound them hard. Four rules:

  1. Only retry idempotent operations. GET, PUT, DELETE of specific IDs — safe. POST that creates a record — unsafe without an idempotency key, because you’ll create duplicates.
  2. Retry only on transient errors: network errors, timeouts, 502/503/504. Never retry on 400, 401, 403, 404, 409, 422 — the response will be identical, you’re just adding load.
  3. Exponential backoff with jitter. Fixed backoff synchronizes retries across a fleet, producing a thundering herd the moment the dependency recovers. Jitter spreads them.
  4. Budget the retries. Max 2-3 attempts, always check ctx.Done() between attempts, and never retry past the parent deadline.
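
Rule 1 in practice: if you must retry a POST, mint an idempotency key once per logical operation and send the same key on every attempt. A sketch — the Idempotency-Key header name is a common convention (Stripe popularized it), not an HTTP standard, the URL is made up, and the server has to implement the dedup for any of this to mean something:

```go
package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newIdempotencyKey returns a random 128-bit key, hex-encoded. Generate it
// once per logical operation, BEFORE the first attempt — never per retry.
func newIdempotencyKey() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func newPost(ctx context.Context, url string, body []byte, key string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Same key on every attempt: the server treats a duplicate key as
	// "already done" and returns the first result instead of creating twice.
	req.Header.Set("Idempotency-Key", key)
	return req, nil
}

func main() {
	key, err := newIdempotencyKey()
	if err != nil {
		panic(err)
	}
	req, err := newPost(context.Background(), "http://charges.internal/create", []byte(`{"amount":100}`), key)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key), req.Header.Get("Idempotency-Key") == key) // 32 true
}
```

The retry loop then rebuilds the request per attempt but reuses the key it was handed — the key identifies the operation, not the attempt.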

Classifying “retryable” in modern Go (without deprecated APIs)

The old way was if netErr, ok := err.(net.Error); ok { return netErr.Temporary() || netErr.Timeout() }. Don’t do that anymore. net.Error.Temporary() was deprecated in Go 1.18 precisely because “temporary” had no consistent meaning — it flagged things that weren’t transient and missed things that were. The Go team advises: use errors.Is with specific sentinels, check net.Error.Timeout(), and check context errors explicitly.

Here is a classifier that works on current Go:

// retry/classify.go
package retry

import (
	"context"
	"errors"
	"net"
	"syscall"
)

// IsRetryable returns true for errors that are worth retrying.
// It does NOT call net.Error.Temporary() — that method is deprecated
// and semantically vague. Instead: timeouts, context cancellations for
// the caller's reasons, and specific transient syscall errors.
func IsRetryable(err error) bool {
	if err == nil {
		return false
	}
	// Caller gave up — do not retry, propagate cancellation.
	if errors.Is(err, context.Canceled) {
		return false
	}
	// Deadline from the caller — do not retry, we're out of budget.
	if errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	// Connection refused / reset — dependency not ready. Retryable.
	if errors.Is(err, syscall.ECONNREFUSED) ||
		errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE) {
		return true
	}
	// Net timeout (dial/read/write). Retryable with backoff.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	// DNS failures: IsTimeout is the non-deprecated signal.
	// net.DNSError.IsTemporary was deprecated in Go 1.18 alongside
	// net.Error.Temporary() — same reason, same vagueness.
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsTimeout {
		return true
	}
	return false
}

For HTTP status codes, wrap them in a typed error and decide at the retry layer:

// retry/status.go
package retry

import (
	"fmt"
	"net/http"
)

type StatusError struct {
	Code int
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("upstream status %d %s", e.Code, http.StatusText(e.Code))
}

func IsRetryableStatus(code int) bool {
	switch code {
	case 408, 425, 429, 500, 502, 503, 504:
		return true
	}
	return false
}

429 deserves a note: it’s the dependency telling you to slow down. If it includes Retry-After, honor it — don’t just run your own backoff over the top.

Backoff with jitter, clamped

The classic bug in jitter math: if RandomizationFactor > 1.0, the multiplier 1.0 + factor*(2*rand-1.0) can go negative, producing a negative backoff interval — and a timer created with a negative duration fires immediately, which quietly turns your exponential backoff into a hot retry loop. Clamp the factor to [0, 1] in the constructor. No exceptions.

// retry/retry.go
// Requires Go 1.22+ for math/rand/v2.
package retry

import (
	"context"
	"errors"
	"math/rand/v2"
	"time"
)

type Options struct {
	MaxAttempts         int           // total attempts, including the first
	InitialInterval     time.Duration
	MaxInterval         time.Duration
	Multiplier          float64
	RandomizationFactor float64 // 0.0 to 1.0
	IsRetryable         func(error) bool
}

func NewOptions() Options {
	return Options{
		MaxAttempts:         3,
		InitialInterval:     100 * time.Millisecond,
		MaxInterval:         5 * time.Second,
		Multiplier:          2.0,
		RandomizationFactor: 0.5,
		IsRetryable:         IsRetryable,
	}
}

// validate clamps inputs to safe ranges. Negative or absurd values
// become sane defaults rather than runtime weirdness.
func (o *Options) validate() {
	if o.MaxAttempts < 1 {
		o.MaxAttempts = 1
	}
	if o.InitialInterval <= 0 {
		o.InitialInterval = 100 * time.Millisecond
	}
	if o.MaxInterval < o.InitialInterval {
		o.MaxInterval = o.InitialInterval
	}
	if o.Multiplier < 1.0 {
		o.Multiplier = 1.0
	}
	// Clamp jitter to [0, 1]. Outside that range produces negative sleeps
	// or absurd intervals. This is the bug people ship.
	if o.RandomizationFactor < 0 {
		o.RandomizationFactor = 0
	}
	if o.RandomizationFactor > 1 {
		o.RandomizationFactor = 1
	}
	if o.IsRetryable == nil {
		o.IsRetryable = IsRetryable
	}
}

func Do(ctx context.Context, fn func(context.Context) error, opts Options) error {
	opts.validate()
	interval := opts.InitialInterval
	var lastErr error

	for attempt := 0; attempt < opts.MaxAttempts; attempt++ {
		// Respect the parent deadline before every attempt.
		if err := ctx.Err(); err != nil {
			return err
		}
		err := fn(ctx)
		if err == nil {
			return nil
		}
		lastErr = err
		if !opts.IsRetryable(err) {
			return err
		}
		if attempt == opts.MaxAttempts-1 {
			break
		}
		// Jitter multiplier in [1-factor, 1+factor], always > 0 after clamp.
		jitter := 1.0 + opts.RandomizationFactor*(2.0*rand.Float64()-1.0)
		next := time.Duration(float64(interval) * jitter)
		if next > opts.MaxInterval {
			next = opts.MaxInterval
		}
		interval = time.Duration(float64(interval) * opts.Multiplier)
		if interval > opts.MaxInterval {
			interval = opts.MaxInterval
		}

		timer := time.NewTimer(next)
		select {
		case <-timer.C:
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		}
	}
	return errors.Join(errors.New("retry: exhausted"), lastErr)
}

Breaker-gated retries

This is the pairing that actually works: retries inside the breaker, not around it. The breaker sees the final outcome of (call + retries), decides whether to open. The retry layer sees the classifier’s verdict and backs off. If the breaker is open, retries never happen — which is exactly what you want during an outage.

err := cb.Execute(func() error {
    return retry.Do(ctx, func(ctx context.Context) error {
        return doCall(ctx)
    }, retryOpts)
})

Health Checks: Liveness, Readiness, Startup Are Not The Same

Kubernetes has three probe types and they are not interchangeable. Every team I’ve audited has gotten at least one of them wrong.

  • Liveness: “is this process healthy, or should the orchestrator kill it?” Answer based on the process itself — deadlock detector, goroutine count, internal panic state. Never check downstream dependencies here. If your user service liveness probe fails when the database is down, Kubernetes will restart the user service, which doesn’t help, makes debugging harder, and can worsen the outage.
  • Readiness: “should I receive traffic right now?” Answer based on dependencies that are strictly required for this service to serve requests. If the service is up but not yet warmed up, or a required dependency is down, readiness fails and the load balancer drains this pod until it recovers.
  • Startup: “has the process finished initializing?” Used to give slow-starting services more time before liveness/readiness kick in, without loosening their thresholds.

The readiness probe is the interesting one. It must be cheap (called every few seconds), must not cascade (don’t call another service’s readiness, which calls another), and must reflect a binary answer. Load balancers — service meshes, Kubernetes services, cloud LBs — remove pods from rotation when readiness fails. That’s a controlled shedding tool; use it.

// health/health.go
package health

import (
	"context"
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

type CheckFunc func(ctx context.Context) error

type Check struct {
	Name    string
	Fn      CheckFunc
	Timeout time.Duration
}

type Registry struct {
	mu        sync.RWMutex
	liveness  []Check
	readiness []Check
}

func New() *Registry { return &Registry{} }

func (r *Registry) AddLiveness(c Check)  { r.mu.Lock(); r.liveness = append(r.liveness, c); r.mu.Unlock() }
func (r *Registry) AddReadiness(c Check) { r.mu.Lock(); r.readiness = append(r.readiness, c); r.mu.Unlock() }

func (r *Registry) run(ctx context.Context, checks []Check) (bool, map[string]string) {
	results := make(map[string]string, len(checks))
	ok := true
	for _, c := range checks {
		timeout := c.Timeout
		if timeout <= 0 {
			timeout = time.Second // a zero Timeout must not mean "already expired"
		}
		cctx, cancel := context.WithTimeout(ctx, timeout)
		err := c.Fn(cctx)
		cancel()
		if err != nil {
			ok = false
			results[c.Name] = err.Error()
		} else {
			results[c.Name] = "ok"
		}
	}
	return ok, results
}

func (r *Registry) LivenessHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		r.mu.RLock()
		checks := append([]Check(nil), r.liveness...)
		r.mu.RUnlock()
		r.write(w, req, checks)
	})
}

func (r *Registry) ReadinessHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		r.mu.RLock()
		checks := append([]Check(nil), r.readiness...)
		r.mu.RUnlock()
		r.write(w, req, checks)
	})
}

func (r *Registry) write(w http.ResponseWriter, req *http.Request, checks []Check) {
	ok, results := r.run(req.Context(), checks)
	status := http.StatusOK
	if !ok {
		status = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(results)
}

Liveness checks I actually add: “can this goroutine acquire a token from a per-process watchdog semaphore within 1 second?” That catches deadlocks but is blind to downstream outages. Readiness checks I actually add: DB ping with a tight timeout, one probe call to each strictly-required dependency, “am I past the warmup period?” flag.

Graceful Degradation: Partial Answers Beat No Answers

When a non-critical dependency is down, returning an error to the user is often the wrong answer. Stale data, a default value, or a partial response is usually better. The homepage with an empty “Recommended for you” section beats a 500.

The shape: wrap the call in a TryWithFallback(primary, fallback). Primary is the live dependency. Fallback is cache, default value, or skip-this-section. On success, update the cache. On failure, serve cache if fresh-enough, or the default.

A warning before the code: this cache is keyed by an application-chosen string (user ID, product ID, region). If any part of the key is attacker-influenced and the key space is unbounded, the map grows without bound — Get filters by expiry but never deletes, so expired entries accumulate forever. The shape below includes a background janitor that sweeps expired entries on an interval, plus a Close() to stop it. For high-cardinality production caches I reach for hashicorp/golang-lru/v2 instead: bounded size, O(1) eviction, no goroutine to babysit. Pick LRU the moment your key space is user-controlled or your working set is larger than you want to eyeball.

// degrade/degrade.go
package degrade

import (
	"context"
	"sync"
	"time"
)

type entry struct {
	val any
	exp time.Time
}

type Cache struct {
	mu       sync.RWMutex
	data     map[string]entry
	ttl      time.Duration
	stop     chan struct{}
	stopOnce sync.Once
}

// NewCache returns a TTL cache with a background janitor. Call Close when
// done to stop the janitor goroutine.
func NewCache(ttl time.Duration) *Cache {
	c := &Cache{
		data: make(map[string]entry),
		ttl:  ttl,
		stop: make(chan struct{}),
	}
	go c.janitor(ttl)
	return c
}

func (c *Cache) Get(key string) (any, bool) {
	c.mu.RLock()
	e, ok := c.data[key]
	c.mu.RUnlock()
	if !ok || time.Now().After(e.exp) {
		return nil, false
	}
	return e.val, true
}

func (c *Cache) Set(key string, val any) {
	c.mu.Lock()
	c.data[key] = entry{val: val, exp: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}

// janitor evicts expired entries so the map cannot grow unbounded when
// Get-misses leave stale keys behind. Interval == ttl is a reasonable
// default: worst case an entry lingers ~ttl past its expiry (~2*ttl total).
func (c *Cache) janitor(interval time.Duration) {
	if interval <= 0 {
		return
	}
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-c.stop:
			return
		case now := <-t.C:
			c.mu.Lock()
			for k, e := range c.data {
				if now.After(e.exp) {
					delete(c.data, k)
				}
			}
			c.mu.Unlock()
		}
	}
}

func (c *Cache) Close() {
	c.stopOnce.Do(func() { close(c.stop) })
}

// TryWithFallback runs primary; on failure, serves cache if present,
// otherwise invokes fallback. The return value carries whether the
// result is fresh or degraded, so callers can surface a banner.
type Result[T any] struct {
	Value    T
	Degraded bool
}

func TryWithFallback[T any](
	ctx context.Context,
	cache *Cache,
	key string,
	primary func(context.Context) (T, error),
	fallback func() T,
) (Result[T], error) {
	v, err := primary(ctx)
	if err == nil {
		cache.Set(key, v)
		return Result[T]{Value: v, Degraded: false}, nil
	}
	// Do not degrade on caller cancellation or deadline — that's the
	// caller saying "stop", not the dependency being unhealthy.
	// Propagate so timeouts and cancellations stay visible up the stack.
	if ctx.Err() != nil {
		var zero T
		return Result[T]{Value: zero}, ctx.Err()
	}
	if cached, ok := cache.Get(key); ok {
		// Comma-ok assertion: a mismatched cached type falls through to
		// the default instead of panicking.
		if v, ok := cached.(T); ok {
			return Result[T]{Value: v, Degraded: true}, nil
		}
	}
	return Result[T]{Value: fallback(), Degraded: true}, nil
}

Two things to be honest about with degradation:

  1. The user should know. Surface a banner (“showing cached results”) or a structural hint in the response. Silent degradation hides outages from everyone, including you.
  2. It doesn’t work for writes. Creating orders, taking payments, changing passwords — you can’t fallback-to-cache those. For write paths, degrade to “we’re having trouble, try again” and let the circuit breaker do its job.

Stale-while-revalidate is the next level: serve from cache immediately, trigger an async refresh, and return the fresh value on the next call. Good for read-heavy, high-traffic endpoints where latency matters more than being current to the millisecond.

Wiring It Together

The layering I use for every outbound dependency:

context deadline (end-to-end budget)
  -> bulkhead (bounded concurrency per dependency)
    -> circuit breaker (fail fast if dependency is down)
      -> retry (bounded, jittered, idempotent only)
        -> http.Client with per-attempt timeout

Each layer has one job. Break any layer and the failure mode has a name: no bulkhead = resource exhaustion; no breaker = cascading failure; no retry = transient blips become user-visible errors; no deadline = slow-loris hops.

func NewUserServiceClient() *UserClient {
	cb := circuitbreaker.New(circuitbreaker.Config{
		Name:             "user-service",
		FailureThreshold: 5,
		SuccessThreshold: 2,
		ResetTimeout:     10 * time.Second,
		OnStateChange: func(name string, from, to circuitbreaker.State) {
			metrics.CBStateChange.WithLabelValues(name, to.String()).Inc()
		},
	})
	bh := bulkhead.New(bulkhead.Config{
		Name:           "user-service",
		MaxConcurrency: 20,
		MaxQueueSize:   20,
	})
	return &UserClient{
		http:     &http.Client{Timeout: 500 * time.Millisecond},
		cb:       cb,
		bulkhead: bh,
		retry:    retry.NewOptions(),
	}
}

func (c *UserClient) GetUser(ctx context.Context, id string) (*User, error) {
	var u *User
	err := c.bulkhead.Submit(ctx, func(ctx context.Context) error {
		return c.cb.Execute(func() error {
			return retry.Do(ctx, func(ctx context.Context) error {
				var err error
				u, err = c.fetchUser(ctx, id)
				return err
			}, c.retry)
		})
	})
	return u, err
}

And on the shutdown side, honor the layers in reverse: stop accepting new requests, drain in-flight requests up to a shutdown budget, close bulkheads.

func (s *Server) Shutdown(ctx context.Context) error {
	// Stop accepting new connections and drain in-flight requests up to
	// the ctx deadline.
	err := s.http.Shutdown(ctx)
	// Close bulkheads even if the drain timed out, so queued work is
	// rejected promptly instead of stranded.
	s.userClient.bulkhead.Close()
	s.catalogClient.bulkhead.Close()
	return err
}

What I’d Actually Choose

For any new Go service with downstream dependencies, I start with this stack:

  • Breaker: the one in this post, or sony/gobreaker if you want a library. Both are fine. gobreaker has a good API and is maintained.
  • Bulkhead: roll your own with a bounded channel + worker pool as shown. It’s 60 lines and you understand every line.
  • Retry: cenkalti/backoff/v4 if you want a library with good jitter; otherwise the pattern above. Avoid backoff/v3 — the API changed.
  • Timeouts: exclusively via context.WithDeadline at the edge and context.WithTimeout per hop. Delete any time.After you find in request paths; before Go 1.23 each one held its timer until it fired, and a context deadline expresses the intent better anyway.
  • Health: separate /livez and /readyz handlers. /livez checks nothing external. /readyz checks the immediately-required dependencies with short timeouts.
  • Degradation: explicit TryWithFallback at the handler level, with a Degraded flag on responses that the frontend can render as a banner.

The biggest mistake I see teams make: layering resilience patterns without integration tests that actually break dependencies. A circuit breaker you haven’t seen open, a bulkhead you haven’t seen reject, a fallback you haven’t seen serve — none of those are real. Write tests that use httptest servers returning 503s, dropping connections mid-response, and sleeping past timeouts. Then run those tests on every change. The patterns only help if they’re wired correctly, and “wired correctly” is proved by tests, not by reading the code.

And the second biggest: resilience patterns as a substitute for capacity. If your dependency is at 100% CPU 24/7, no circuit breaker will save you — you need to scale. Resilience patterns are for handling incidents, not for masking chronic under-provisioning.
