Authentication in a monolith is straightforward — one session store, one middleware, done. In a distributed system with dozens of services, it gets complicated fast. Which service validates the token? How do you propagate identity across service-to-service calls? What happens when two browser tabs race to refresh the same token? What happens when a refresh token gets stolen?
I’ve built auth systems for enough distributed architectures to have strong opinions. My current stack: short-lived asymmetric JWTs for stateless edge validation, OIDC via a dedicated IdP for user-facing flows, rotating refresh tokens with reuse detection, and mTLS for service-to-service. This post walks through the design decisions — the why before the how — with Go code you can adapt. I’ll point at TypeScript/jose equivalents where they differ.
One thing up front: I’m showing you the shape of the code, not a drop-in library. Build your own from these patterns and you’ll understand your auth. Copy-paste without understanding and you’ll ship an incident.
Threat Model First
Before any code, name the threats. Every control below defends against something specific — if you can’t name the threat, drop the control.
| Threat | Defense |
|---|---|
| Credential stuffing / brute force | Rate limit per IP and per username, lockouts, MFA |
| Stolen access token | Short TTL (5-15 min), bind to audience, revoke by rotating signing key |
| Stolen refresh token | Server-side storage, single-use rotation, reuse detection → kill family |
| Stolen signing key | Asymmetric keys (RS256/ES256), key rotation, private key never leaves issuer |
| Cross-audience token confusion | Validate iss and aud on every call |
| Token in URL / logs | httpOnly cookies or POST form, never query string |
| CSRF on login / state-changing routes | SameSite=Strict cookies, state bound to session, double-submit tokens |
| OAuth2 login CSRF | PKCE + state tied to user’s browser session |
| Account takeover via OAuth2 | Check email_verified, explicit account-linking flow, never auto-link by email |
| Service-to-service spoofing | mTLS with SPIFFE identities, not shared bearer secrets |
| User enumeration | Generic error messages, constant-time comparisons, equal latency paths |
| Token replay across services | Narrow audience per service, short TTL, consider DPoP |
I’ll reference this table as I go. If a control isn’t in the table, it probably isn’t worth adding.
Token Strategy: HS256 Is Wrong for Distributed Systems
The most common auth mistake I see in distributed systems: signing JWTs with HS256 and sharing the HMAC secret across every service that validates them. It’s superficially simpler — one key, symmetric signing — but it violates least privilege. Any service that validates tokens has the power to mint tokens. Compromise the sessions service and the attacker can forge admin tokens for the payments service.
Use asymmetric signing (RS256 or ES256) with a single issuer that holds the private key. Every other service fetches the public key via JWKS and validates only. Key rotation becomes a matter of publishing a new key in the JWKS with a new kid, rolling issuance to it, and retiring the old one after max token lifetime.
Here’s a JWT authenticator that does this properly:
// auth/jwt.go
package auth
import (
"context"
"crypto/rsa"
"errors"
"fmt"
"net/http"
"strings"
"time"
"github.com/golang-jwt/jwt/v5"
)
// Config holds JWT verification settings. Shared by all validating services.
// Only the issuer service holds PrivateKey.
type Config struct {
PrivateKey *rsa.PrivateKey // nil on validator-only services
PublicKeys map[string]*rsa.PublicKey // keyed by JWT `kid` header
ActiveKeyID string // which key to sign with (issuer only)
Issuer string
Audience string // this service's expected audience
TokenDuration time.Duration // keep short: 5-15 minutes
}
// Claims carries identity and authorization. Keep this list small —
// every field adds bytes to every request and every log line.
type Claims struct {
Username string `json:"username"`
Roles []string `json:"roles,omitempty"`
Permissions []string `json:"perms,omitempty"`
jwt.RegisteredClaims
}
type Authenticator struct {
cfg Config
}
func New(cfg Config) (*Authenticator, error) {
if cfg.Issuer == "" || cfg.Audience == "" {
return nil, errors.New("issuer and audience are required")
}
if len(cfg.PublicKeys) == 0 {
return nil, errors.New("at least one public key is required")
}
if cfg.TokenDuration <= 0 {
cfg.TokenDuration = 15 * time.Minute
}
return &Authenticator{cfg: cfg}, nil
}
// AccessTTL exposes the configured access token lifetime so callers can
// surface it in responses (e.g. expires_in) without duplicating the constant.
func (a *Authenticator) AccessTTL() time.Duration { return a.cfg.TokenDuration }
Notice what’s not in Claims: email, full name, profile data. JWTs travel on every request. Treat them like a TCP header — the minimum needed to authorize the call. Fetch the rest from a user service when you need it.
Issuing a token from the IdP service:
func (a *Authenticator) Issue(userID, username string, roles, perms []string) (string, error) {
if a.cfg.PrivateKey == nil {
return "", errors.New("cannot issue: no private key configured")
}
now := time.Now()
claims := Claims{
Username: username,
Roles: roles,
Permissions: perms,
RegisteredClaims: jwt.RegisteredClaims{
Subject: userID,
Issuer: a.cfg.Issuer,
Audience: jwt.ClaimStrings{a.cfg.Audience},
ExpiresAt: jwt.NewNumericDate(now.Add(a.cfg.TokenDuration)),
IssuedAt: jwt.NewNumericDate(now),
NotBefore: jwt.NewNumericDate(now),
ID: newJTI(), // unique per token, used for revocation list
},
}
tok := jwt.NewWithClaims(jwt.SigningMethodRS256, claims)
tok.Header["kid"] = a.cfg.ActiveKeyID
return tok.SignedString(a.cfg.PrivateKey)
}
Validation is where most bugs live. The parser callback must pin the signing method, select the right public key by kid, and verify issuer + audience explicitly. The library will check expiry, not-before, and signature — but issuer and audience are not checked unless you opt in.
func (a *Authenticator) Validate(tokenString string) (*Claims, error) {
token, err := jwt.ParseWithClaims(
tokenString,
&Claims{},
func(t *jwt.Token) (any, error) {
// Pin the exact algorithm. This defeats alg=none and HS256/RS256
// confusion. Pinning to SigningMethodRSA alone would accept
// RS256/RS384/RS512/PS256/... — if you issue RS256, require RS256.
if t.Method != jwt.SigningMethodRS256 {
return nil, fmt.Errorf("unexpected signing method: %v", t.Header["alg"])
}
kid, _ := t.Header["kid"].(string)
key, ok := a.cfg.PublicKeys[kid]
if !ok {
return nil, fmt.Errorf("unknown key id: %q", kid)
}
return key, nil
},
jwt.WithIssuer(a.cfg.Issuer),
jwt.WithAudience(a.cfg.Audience),
jwt.WithExpirationRequired(),
jwt.WithLeeway(30*time.Second), // clock skew tolerance — no more
)
if err != nil {
return nil, fmt.Errorf("parse token: %w", err)
}
claims, ok := token.Claims.(*Claims)
if !ok {
return nil, errors.New("invalid token claims")
}
// jwt/v5 returns a non-nil err when the token is invalid, so reaching here
// with ParseWithClaims having returned nil means token.Valid is true.
// We don't re-check it — the library is authoritative.
return claims, nil
}
Three things to internalize from this function:
- The algorithm check is defense-in-depth against the classic `alg: none` and HS256/RS256 confusion bugs. Even if your library claims to defend against them, pin it yourself. Libraries ship regressions.
- `jwt.WithIssuer` and `jwt.WithAudience` are opt-in. The library does not fail closed on them. I’ve reviewed production code that validated “successfully” against tokens issued for entirely different systems.
- Leeway is for clock skew, not for stretching token lifetime. 30 seconds is plenty.
A note on replay within the TTL. A stolen access token is valid until it expires, full stop. There’s no server-side check here that rejects a token on second use — the point of stateless validation is that every service can check a token without talking to a store. Accept that tradeoff: the mitigation is a short TTL (5–15 minutes), with signing-key rotation as the emergency brake (it revokes every outstanding token, not just the stolen one). If you need finer-grained revocation — one user logged out of one device — move the check to a JTI denylist in Redis, keyed by jti with a TTL matching ExpiresAt. You’ve just added a network hop to every authenticated request; make sure you actually need it.
The middleware is unremarkable — extract the Bearer token, validate, stash claims on context with a typed key:
type claimsKey struct{}
func (a *Authenticator) Middleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
authz := r.Header.Get("Authorization")
scheme, token, ok := strings.Cut(authz, " ")
// RFC 7235: the scheme name is case-insensitive. "Bearer", "bearer",
// and "BEARER" are all valid. Don't reject a compliant client over case.
if !ok || !strings.EqualFold(scheme, "Bearer") || token == "" {
// Generic error — do not tell the caller which check failed.
http.Error(w, "unauthorized", http.StatusUnauthorized)
return
}
claims, err := a.Validate(token)
if err != nil {
// Log the detail server-side only.
logAuthFailure(r, err)
http.Error(w, "unauthorized", http.StatusUnauthorized)
return
}
ctx := context.WithValue(r.Context(), claimsKey{}, claims)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
func ClaimsFrom(ctx context.Context) (*Claims, bool) {
c, ok := ctx.Value(claimsKey{}).(*Claims)
return c, ok
}
TypeScript equivalent: use jose with createRemoteJWKSet for the public keys and jwtVerify with issuer and audience options. jose is better-maintained than jsonwebtoken and its defaults are stricter.
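On the Go side, filling `Config.PublicKeys` from a JWKS document looks roughly like this. A hedged sketch: `ParseJWKS` and the struct names are mine, and the transport (HTTP GET, cache headers, refresh both on a timer and on sight of an unknown `kid`) is deliberately left out.

```go
package main

import (
	"crypto/rsa"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"math/big"
)

// jwk is the subset of RFC 7517 fields an RSA verifier needs.
type jwk struct {
	Kty string `json:"kty"`
	Kid string `json:"kid"`
	N   string `json:"n"` // modulus, base64url, unpadded
	E   string `json:"e"` // exponent, base64url, unpadded
}

type jwkSet struct {
	Keys []jwk `json:"keys"`
}

// ParseJWKS turns a JWKS JSON document into the kid -> public key map the
// Config above expects.
func ParseJWKS(doc []byte) (map[string]*rsa.PublicKey, error) {
	var set jwkSet
	if err := json.Unmarshal(doc, &set); err != nil {
		return nil, err
	}
	keys := make(map[string]*rsa.PublicKey)
	for _, k := range set.Keys {
		if k.Kty != "RSA" || k.Kid == "" {
			continue // skip EC/OKP keys and keys we can't select by kid
		}
		nb, err := base64.RawURLEncoding.DecodeString(k.N)
		if err != nil {
			return nil, fmt.Errorf("key %s: bad modulus: %w", k.Kid, err)
		}
		eb, err := base64.RawURLEncoding.DecodeString(k.E)
		if err != nil {
			return nil, fmt.Errorf("key %s: bad exponent: %w", k.Kid, err)
		}
		e := new(big.Int).SetBytes(eb)
		if !e.IsInt64() || e.Int64() > 1<<31 {
			return nil, fmt.Errorf("key %s: exponent out of range", k.Kid)
		}
		keys[k.Kid] = &rsa.PublicKey{N: new(big.Int).SetBytes(nb), E: int(e.Int64())}
	}
	if len(keys) == 0 {
		return nil, errors.New("no usable RSA keys in JWKS")
	}
	return keys, nil
}
```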
Login: Where Most Security Bugs Are Born
The login handler is short and full of traps. Here’s the version that gets it right:
// handlers/login.go
package handlers
import (
"crypto/rand"
"crypto/subtle"
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"net/http"
"strings"
"time"
"example.com/app/httpx"
"golang.org/x/crypto/argon2"
)
var errInvalidCreds = errors.New("invalid credentials")
const (
maxUsernameLen = 64
maxPasswordLen = 128 // cap before argon2id to bound CPU/memory cost
cookieClockSkewBuf = 5 * time.Minute
)
type loginReq struct {
Username string `json:"username"`
Password string `json:"password"`
}
func (h *Handlers) Login(w http.ResponseWriter, r *http.Request) {
// Rate limit by IP FIRST — before we parse anything. An attacker who can
// flood the decoder before hitting the limiter can DoS the service by
// making us do JSON work on garbage.
if !h.rl.Allow("ip:" + httpx.ClientIP(r)) {
http.Error(w, "too many requests", http.StatusTooManyRequests)
return
}
// Cap the request body. Login payloads are tiny; anything larger is abuse.
r.Body = http.MaxBytesReader(w, r.Body, 4096)
var req loginReq
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "invalid request", http.StatusBadRequest)
return
}
// Normalize and length-cap the username BEFORE using it as a rate-limit
// key. The username is attacker-authored: without a bound, an attacker
// can send unlimited unique usernames and blow up the limiter's keyspace
// (memory exhaustion). Your limiter also needs TTL eviction (Redis EXPIRE
// or an LRU) — unique keys are an unbounded set even with length caps.
req.Username = strings.ToLower(strings.TrimSpace(req.Username))
if req.Username == "" || len(req.Username) > maxUsernameLen {
// Generic error — don't reveal that the username was malformed.
http.Error(w, "invalid credentials", http.StatusUnauthorized)
return
}
// Cap the password BEFORE calling argon2id. The body is already capped at
// 4KB, but argon2id will happily spend its full tuned cost on whatever it
// receives. A 4KB password repeated across a botnet is an app-level DoS.
if len(req.Password) == 0 || len(req.Password) > maxPasswordLen {
http.Error(w, "invalid credentials", http.StatusUnauthorized)
return
}
if !h.rl.Allow("user:" + req.Username) {
http.Error(w, "too many requests", http.StatusTooManyRequests)
return
}
user, err := h.users.FindByUsername(r.Context(), req.Username)
// Whether or not the user exists, always do the argon2id work.
// This evens out timing and defeats user enumeration by latency.
hash := dummyHash
if err == nil && user != nil {
hash = user.PasswordHash
}
if verifyArgon2id(hash, req.Password) != nil || user == nil {
h.audit.LogFailedLogin(r, req.Username)
// One generic message. Never distinguish "no such user" from "wrong password".
http.Error(w, "invalid credentials", http.StatusUnauthorized)
return
}
// Adaptive rehash: if the stored hash was produced with weaker parameters
// than our current policy (hardware got faster, OWASP bumped the floor),
// recompute it now that we have the plaintext in hand. Best-effort —
// failure must not block login. The user won't notice; the next login
// verifies against the stronger hash.
if needsRehashArgon2id(user.PasswordHash) {
if fresh, err := hashArgon2id(req.Password); err == nil {
_ = h.users.UpdatePasswordHash(r.Context(), user.ID, fresh)
}
}
// Issue the pair.
access, err := h.jwt.Issue(user.ID, user.Username, user.Roles, user.Permissions)
if err != nil {
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
refresh, refreshExp, err := h.refresh.Mint(r.Context(), user.ID)
if err != nil {
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
// Deliver refresh token in an httpOnly, SameSite=Strict cookie.
// Access token goes in the JSON body for the SPA to hold in memory only.
// Cookie expiry tracks the server-side token expiry, plus a small buffer
// for client/server clock skew — the cookie should outlive the token
// slightly, never the other way around, or a valid token becomes
// un-sendable.
cookieMaxAge := int(time.Until(refreshExp.Add(cookieClockSkewBuf)) / time.Second)
if cookieMaxAge <= 0 {
// Refused a token minted in the past — config drift or clock jump.
// Never issue a cookie with MaxAge <= 0: the browser would treat that
// as "delete immediately" and the user would be in an auth loop.
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
http.SetCookie(w, &http.Cookie{
Name: "rt",
Value: refresh,
Path: "/auth/refresh",
HttpOnly: true,
Secure: true,
SameSite: http.SameSiteStrictMode,
// Set both Max-Age and Expires. Max-Age is the RFC 6265 standard and
// is what modern clients honor; Expires is a legacy fallback for
// older HTTP libraries. Without both, some clients hold the cookie
// longer than the token is valid.
MaxAge: cookieMaxAge,
Expires: refreshExp.Add(cookieClockSkewBuf),
})
h.audit.LogSuccessfulLogin(r, user.ID)
writeJSON(w, map[string]any{
"access_token": access,
"expires_in": int(h.jwt.AccessTTL() / time.Second),
})
}
// dummyHash is a precomputed argon2id PHC hash of a random string, computed
// at startup under the CURRENT argonParams so failed-login timing matches
// the timing of verifying a real user whose hash is also at current params.
//
// Drift warning: if you bump argonParams and you have users whose stored
// hashes are still at the old (weaker) params, their verify runs faster than
// dummy, which re-opens a timing oracle for user enumeration until all hashes
// are rehashed via the adaptive-rehash path. If you keep old and new users
// together long-term, pin dummyHash to the strongest params you've ever
// shipped so dummy is always the slowest path.
var dummyHash = mustHashArgon2id("not-a-real-password-0xDEADBEEF")
// verifyArgon2id parses a PHC-format argon2id hash ($argon2id$v=19$m=...$salt$hash)
// and verifies the password in constant time. Returns an error on mismatch.
//
// Why argon2id and not bcrypt: bcrypt silently truncates inputs at 72 bytes,
// so two passwords that share a 72-byte prefix hash to the same value. If you
// keep using bcrypt, HMAC-SHA256 the password first (with a server-side pepper)
// to fold it to a fixed length before calling bcrypt. Argon2id has no such
// gotcha and is memory-hard against GPU attackers.
func verifyArgon2id(encoded, password string) error {
parts := strings.Split(encoded, "$")
if len(parts) != 6 || parts[1] != "argon2id" {
return errors.New("bad hash format")
}
var m, t uint32
var p uint8
if _, err := fmt.Sscanf(parts[3], "m=%d,t=%d,p=%d", &m, &t, &p); err != nil {
return err
}
// Sanity-cap the parsed params. This is a second-line defense against a
// DB-write attacker: tampered rows with m=9999999 would burn 9+ GB of
// memory per login attempt. Caps should be roughly one order of magnitude
// above argonParams so legitimate bumps aren't blocked, but low enough
// that a bad row can't DoS the process. Tune to your actual policy.
// m is in KiB: 1<<18 = 262144 KiB ≈ 256 MiB, ~13x the 19 MiB default.
if m > 1<<18 || t > 10 || p > 16 {
return errors.New("hash parameters out of bounds")
}
salt, err := base64.RawStdEncoding.DecodeString(parts[4])
if err != nil {
return err
}
want, err := base64.RawStdEncoding.DecodeString(parts[5])
if err != nil {
return err
}
got := argon2.IDKey([]byte(password), salt, t, m, p, uint32(len(want)))
if subtle.ConstantTimeCompare(got, want) != 1 {
return errors.New("mismatch")
}
return nil
}
// argonParams is the current policy. Bump when hardware gets faster or when
// OWASP's minimums change. A successful login that sees a weaker hash
// triggers an in-place upgrade.
var argonParams = struct {
time, memory uint32
threads uint8
keyLen uint32
}{time: 2, memory: 19 * 1024, threads: 1, keyLen: 32}
// hashArgon2id produces a fresh PHC-encoded hash under the current policy.
func hashArgon2id(password string) (string, error) {
salt := make([]byte, 16)
if _, err := rand.Read(salt); err != nil {
return "", err
}
key := argon2.IDKey([]byte(password), salt, argonParams.time,
argonParams.memory, argonParams.threads, argonParams.keyLen)
return fmt.Sprintf("$argon2id$v=19$m=%d,t=%d,p=%d$%s$%s",
argonParams.memory, argonParams.time, argonParams.threads,
base64.RawStdEncoding.EncodeToString(salt),
base64.RawStdEncoding.EncodeToString(key)), nil
}
// needsRehashArgon2id returns true if the encoded hash used weaker parameters
// than the current argonParams policy. Called after a successful verify so
// we can transparently upgrade the user's stored hash.
func needsRehashArgon2id(encoded string) bool {
parts := strings.Split(encoded, "$")
if len(parts) != 6 || parts[1] != "argon2id" {
return true // malformed — treat as outdated
}
var m, t uint32
var p uint8
if _, err := fmt.Sscanf(parts[3], "m=%d,t=%d,p=%d", &m, &t, &p); err != nil {
return true
}
return m < argonParams.memory || t < argonParams.time || p < argonParams.threads
}
The critical moves:
- Rate limit by IP before anything else runs, then by username after you’ve normalized and length-capped it. The order matters. If you decode JSON first and rate-limit after, an attacker can flood the decoder to DoS you before the limiter fires. If you use the raw user-supplied username as a limiter key, an attacker sends unique strings forever and eats your limiter’s memory. IP-limit at the door, cap the body with `http.MaxBytesReader`, normalize, bound the key, then username-limit.
- Independent IP and username limits. A botnet defeats a per-IP limit alone; a targeted attack on a single username spread across many addresses defeats it too. You need both keys.
- Do the password-verify work even when the user doesn’t exist, with a precomputed dummy hash. Without this, missing users return in <1ms and real users take ~200ms of argon2id work. That latency gap is a username oracle.
- One error message. “Invalid credentials.” Not “user not found,” not “wrong password.”
- Argon2id, tuned to ~200-300ms per verify. Bcrypt works but has a 72-byte input truncation footgun — two passwords sharing a 72-byte prefix hash to the same value. If you keep using bcrypt, HMAC-SHA256 the password with a server-side pepper first to fold it to a fixed length, then bcrypt that.
- Refresh token in httpOnly cookie, scoped to `/auth/refresh`. The SPA never touches it. Access token goes in the JSON response body and lives in memory — not localStorage, which XSS can read.
- For browser clients, strongly consider a BFF pattern. In-memory access tokens in a SPA are still reachable from XSS (the attacker runs in your origin; they can wait for your fetch). A Backend-for-Frontend flips the model: the SPA authenticates against your own backend, which holds access tokens server-side and attaches them to downstream API calls. The browser never sees the access token at all — just an opaque session cookie. You trade one network hop per request for a material reduction in XSS blast radius. My default for a serious product: BFF. My default for a side project with no sensitive actions: in-memory access tokens.
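The bcrypt workaround mentioned above, folding the password to a fixed length before hashing, is a few lines. Sketch only: `foldPassword` is a name I made up, and the bcrypt call itself (via `golang.org/x/crypto/bcrypt`) is out of frame.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
)

// foldPassword collapses an arbitrary-length password to a fixed 44-byte
// string using HMAC-SHA256 keyed with a server-side pepper. Feed the RESULT
// to bcrypt, not the raw password: bcrypt truncates input at 72 bytes, and
// some implementations treat a NUL byte as a terminator, which is why we
// base64 the MAC instead of passing raw digest bytes.
func foldPassword(pepper []byte, password string) string {
	mac := hmac.New(sha256.New, pepper)
	mac.Write([]byte(password))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil)) // always 44 bytes
}
```

Then `bcrypt.GenerateFromPassword([]byte(foldPassword(pepper, pw)), cost)`. The pepper lives in your secret manager, not the database: it’s the one verification input a DB dump doesn’t contain.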
Refresh Token Rotation With Reuse Detection
A refresh token that can be used repeatedly until expiry is not much better than a long-lived access token. The pattern that actually limits blast radius is rotation with reuse detection:
- Each refresh token is single-use. When it’s redeemed, issue a new refresh token and invalidate the old one.
- Track tokens in a family — a chain rooted at the original login. If a token from a family is ever redeemed twice (reuse detected), invalidate the entire family. Force re-login.
The logic assumes a stolen refresh token will eventually be redeemed out-of-order with the legitimate session, which trips the reuse detector. It’s not foolproof, but it caps damage to one refresh cycle.
Two details matter more than the rotation flow itself. First: the refresh token you send to the client must never equal the value you store in the database. Treat the token like a password. If your DB leaks — SQL injection, backup on S3, log exposure — plaintext tokens are immediately usable against your API. Store only a SHA-256 hash of the token, look up by hash, and the plaintext never touches disk. Second: if crypto/rand fails, panic. A silent failure gives you all-zero tokens, which means every session shares the same predictable secret.
// auth/refresh.go
package auth
import (
"context"
"crypto/rand"
"crypto/sha256"
"encoding/base64"
"errors"
"time"
)
type RefreshRecord struct {
TokenHash string // SHA-256(plaintext token), the lookup key
FamilyID string // shared by all tokens rotated from the same login
UserID string
ExpiresAt time.Time
UsedAt *time.Time // nil until redeemed
}
// Store is the persistence interface. Use Redis or Postgres — must be durable.
// All lookups are by TokenHash, never by plaintext.
type Store interface {
Insert(ctx context.Context, r RefreshRecord) error
GetByHash(ctx context.Context, tokenHash string) (*RefreshRecord, error)
// MarkUsedIfUnused is the atomic primitive that prevents TOCTOU races.
// Returns true iff this call was the one that flipped the record from
// unused to used AND the record was still unexpired at flip time.
// SQL: UPDATE refresh SET used_at=$1 WHERE token_hash=$2
// AND used_at IS NULL AND expires_at > $1
// then check rows-affected.
// Redis: Lua CAS guarded on both unused and unexpired.
// Without the expiry in the atomic condition, a record that expired
// between GetByHash and MarkUsedIfUnused would still rotate — small
// window, but the token lifetime invariant deserves to be atomic too.
MarkUsedIfUnused(ctx context.Context, tokenHash string, at time.Time) (bool, error)
InvalidateFamily(ctx context.Context, familyID string) error
}
type RefreshManager struct {
store Store
ttl time.Duration
}
// NewRefreshManager fails fast on a zero/negative TTL. Without this guard,
// Mint would issue tokens with ExpiresAt == time.Now(), which the login
// handler's cookie-MaxAge check catches but only after the bad token is
// already in the store.
func NewRefreshManager(store Store, ttl time.Duration) (*RefreshManager, error) {
if ttl <= 0 {
return nil, errors.New("refresh TTL must be positive")
}
return &RefreshManager{store: store, ttl: ttl}, nil
}
// Mint is called at login. Creates a new family.
// Returns the plaintext token (sent to the client) and its expiry so the
// caller can align cookie lifetimes. Only the hash is stored.
func (m *RefreshManager) Mint(ctx context.Context, userID string) (string, time.Time, error) {
return m.issue(ctx, userID, randomID())
}
// Rotate is called on refresh. Hashes the presented token, looks it up,
// atomically marks it used, and issues a replacement. The critical step is
// MarkUsedIfUnused: if it returns false we lost the race, which is textbook
// reuse — another caller already rotated this token.
func (m *RefreshManager) Rotate(ctx context.Context, presented string) (userID, newToken string, expiresAt time.Time, err error) {
h := hashToken(presented)
rec, err := m.store.GetByHash(ctx, h)
if err != nil || rec == nil {
return "", "", time.Time{}, errors.New("invalid refresh token")
}
if time.Now().After(rec.ExpiresAt) {
return "", "", time.Time{}, errors.New("refresh token expired")
}
marked, err := m.store.MarkUsedIfUnused(ctx, h, time.Now())
if err != nil {
return "", "", time.Time{}, err
}
if !marked {
// The record existed and was unexpired at GetByHash time, but we
// could not flip it from unused → used. Someone else already did.
// That is reuse. Kill the family, force re-login.
_ = m.store.InvalidateFamily(ctx, rec.FamilyID)
return "", "", time.Time{}, errors.New("refresh token reuse detected")
}
newTok, exp, err := m.issue(ctx, rec.UserID, rec.FamilyID)
return rec.UserID, newTok, exp, err
}
// issue generates a new plaintext token, stores only its hash, and returns
// the plaintext to the caller. The plaintext is never persisted server-side.
func (m *RefreshManager) issue(ctx context.Context, userID, familyID string) (string, time.Time, error) {
token := randomID()
exp := time.Now().Add(m.ttl)
err := m.store.Insert(ctx, RefreshRecord{
TokenHash: hashToken(token),
FamilyID: familyID,
UserID: userID,
ExpiresAt: exp,
})
return token, exp, err
}
// hashToken produces a stable lookup key from the plaintext token.
// SHA-256 is fine here — the token itself already has 256 bits of entropy,
// so there's no need for a password-hashing KDF.
func hashToken(token string) string {
sum := sha256.Sum256([]byte(token))
return base64.RawURLEncoding.EncodeToString(sum[:])
}
// randomID returns 256 bits of cryptographically secure randomness,
// url-safe base64 encoded. It panics on entropy failure — a rare but
// catastrophic condition where silently returning zeros would produce
// predictable tokens.
func randomID() string {
b := make([]byte, 32)
if _, err := rand.Read(b); err != nil {
panic("crypto/rand failed: " + err.Error())
}
return base64.RawURLEncoding.EncodeToString(b)
}
The refresh endpoint is a thin wrapper: read the cookie, call Rotate, set the new cookie, return a fresh access token. Everything that matters lives in Rotate.
CSRF on the refresh endpoint. SameSite=Strict is a strong first line — modern browsers won’t attach the cookie on cross-site requests — but it’s not a complete defense. Older browsers, same-origin but attacker-controlled subdomains, and edge-case navigation flows can all bypass it. Add a defense-in-depth check: require the refresh endpoint to see an Origin header matching your domain, or issue a double-submit CSRF token (random value set in a non-httpOnly cookie AND echoed in a request header, compared server-side). The cost is one string compare; the payoff is not having to trust the browser alone.
This is also where in-memory stores stop working. Refresh records must be persisted — Redis is fine, Postgres is fine, a map protected by sync.Mutex is not fine because it evaporates on restart and doesn’t coordinate across instances. The whole point is durability.
OAuth2/OIDC: Use an IdP, But Know What It’s Doing
If you’re adding “Login with Google” to your app, implement the client side and delegate everything else to an identity provider (Auth0, Keycloak, Zitadel, Ory Hydra, whatever). The code below is for understanding what the IdP does on your behalf — not for production unless you really have reason to roll your own.
The authorization code flow with PKCE, stepped through:
1. User clicks “Log in.” Your server generates a random `state` and a PKCE `code_verifier`. It stores `code_verifier` in a session cookie (httpOnly) and redirects to the provider’s `/authorize` with `state`, `code_challenge = base64url(SHA256(code_verifier))`, and `code_challenge_method=S256`.
2. User authenticates at the provider. Provider redirects back to your callback with `code` and `state`.
3. Your callback verifies `state` matches the session cookie (defeats login CSRF), then POSTs `code` + `code_verifier` to the provider’s `/token`. The `code_verifier` proves you’re the same client that started the flow — PKCE defeats code interception.
4. Provider returns an ID token. You verify its signature (via the provider’s JWKS), `iss`, `aud`, `exp`, and `nonce`. You check `email_verified` is true before trusting the email claim.
5. You find or create a local user. If creating, never auto-link by email unless the provider explicitly guarantees verification. Otherwise: require the user to sign in to the existing account first and confirm linking.
6. You issue your own access + refresh tokens. The provider’s tokens are discarded.
The part people most often get wrong is step 1’s state handling:
// handlers/oauth_start.go
func (h *Handlers) OAuthStart(w http.ResponseWriter, r *http.Request) {
verifier := randomID() // 43-128 chars, url-safe
challenge := s256(verifier) // base64url(sha256(verifier))
state := randomID()
// Bind verifier AND state to this browser via a short-lived, signed cookie.
// This is what defeats login CSRF — state alone in a server map doesn't.
setFlashCookie(w, "oauth_flow", oauthFlow{
State: state,
Verifier: verifier,
Expires: time.Now().Add(10 * time.Minute),
})
u := h.provider.AuthURL(state, challenge)
http.Redirect(w, r, u, http.StatusFound)
}
// handlers/oauth_callback.go
func (h *Handlers) OAuthCallback(w http.ResponseWriter, r *http.Request) {
flow, err := readFlashCookie(r, "oauth_flow")
if err != nil || time.Now().After(flow.Expires) {
http.Error(w, "flow expired", http.StatusBadRequest)
return
}
clearFlashCookie(w, "oauth_flow")
// Constant-time comparison for state — it's a secret-adjacent value.
if subtle.ConstantTimeCompare([]byte(flow.State), []byte(r.URL.Query().Get("state"))) != 1 {
http.Error(w, "state mismatch", http.StatusBadRequest)
return
}
code := r.URL.Query().Get("code")
if code == "" {
http.Error(w, "missing code", http.StatusBadRequest)
return
}
idToken, err := h.provider.Exchange(r.Context(), code, flow.Verifier)
if err != nil {
http.Error(w, "exchange failed", http.StatusBadGateway)
return
}
claims, err := h.provider.VerifyIDToken(r.Context(), idToken)
if err != nil {
http.Error(w, "id token invalid", http.StatusBadRequest)
return
}
if !claims.EmailVerified {
http.Error(w, "email not verified", http.StatusBadRequest)
return
}
// Look the user up by (provider, sub) — NOT by email. Email comes along
// for user-profile display only. If no identity exists for this (provider, sub)
// pair, create a fresh local account; never attach to an existing local
// account via matching email. That path is the account-takeover vector.
userID, err := h.users.FindOrCreateByOIDCSub(r.Context(),
h.provider.Name(), claims.Sub, claims.Email /* display-only */)
if err != nil {
http.Error(w, "cannot create session", http.StatusConflict)
return
}
// Issue your own tokens and set cookies the same way as /login.
h.issueSessionCookies(w, r, userID)
}
// Linking an OIDC identity to a pre-existing local account is a SEPARATE
// flow. The user signs in with their existing local credentials, hits a
// "connect Google account" button, completes the OAuth flow, and only then
// is the (provider, sub) pair attached. Do not merge by email at callback
// time under any circumstances.
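One helper above is worth spelling out: `s256` is just the RFC 7636 S256 transform, the unpadded base64url encoding of the SHA-256 of the verifier.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
)

// s256 computes the PKCE code challenge per RFC 7636 §4.2:
// BASE64URL-ENCODE(SHA256(ASCII(code_verifier))), no padding.
func s256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}
```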
Three things to internalize:
- State must be bound to the browser’s session, not stored in a global server-side map. A map-backed state check accepts any state the server issued, from any caller. That’s a login CSRF hole.
- PKCE is non-negotiable. Even for confidential clients (server-side apps with a client secret), it costs nothing and closes the code-interception class.
- Verify email_verified before trusting the email. Otherwise users who sign up via an IdP with a fake email can take over existing accounts keyed by that email.
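The start half that pairs with the callback above is mostly random-byte generation plus one hash. A minimal sketch, assuming the verifier and state then go into the flash cookie described next; randToken and newPKCE are names I'm introducing, not library functions:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// randToken returns n random bytes, base64url-encoded. Used for both the
// state value and the PKCE code_verifier.
func randToken(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // CSPRNG failure is not recoverable
	}
	return base64.RawURLEncoding.EncodeToString(b)
}

// newPKCE returns a code_verifier and its S256 code_challenge per RFC 7636.
// The verifier stays with the browser session (flash cookie); only the
// challenge goes to the IdP in the authorization request.
func newPKCE() (verifier, challenge string) {
	verifier = randToken(32)
	sum := sha256.Sum256([]byte(verifier))
	challenge = base64.RawURLEncoding.EncodeToString(sum[:])
	return verifier, challenge
}

func main() {
	state := randToken(32)
	v, c := newPKCE()
	fmt.Println(len(state), len(v), len(c)) // 43 43 43
}
```

32 random bytes gives 256 bits of entropy for both state and verifier, comfortably above any guessing threshold within a ten-minute flow lifetime.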
The flash cookie that carries state and verifier is the weakest link if you implement it wrong — it’s what the callback trusts to match against the provider redirect. It must be HMAC-signed. Without signing, an attacker can forge a state they can match, which defeats the whole point of having state. Without a bound expiry inside the payload, they can replay stale flows. Encryption is optional here: state and verifier are ephemeral, single-use, and have no secret content from the browser’s perspective — signing for integrity is the requirement. The pattern:
// session/flash.go
package session
import (
"crypto/hmac"
"crypto/sha256"
"encoding/base64"
"encoding/json"
"errors"
"net/http"
"os"
"strconv"
"time"
)
// flashKey is a server-side secret loaded at startup. Rotate via overlapping
// keys — accept old, sign with new, retire old after max flow lifetime.
//
// FLASH_COOKIE_KEY must be the BASE64 encoding of at least 32 random bytes.
// Generate with: openssl rand -base64 32
// A typed 32-character passphrase would pass a naive raw-length check while
// carrying far less than 160 bits of entropy, and that is exactly what
// operators ship when base64 input isn't enforced. Fail the boot rather
// than accept a weak key.
var flashKey []byte
func init() {
raw := os.Getenv("FLASH_COOKIE_KEY")
if raw == "" {
panic("FLASH_COOKIE_KEY is not set (generate with: openssl rand -base64 32)")
}
// Accept standard, URL-safe, padded and unpadded — operators generate
// keys with whichever tool they have at hand, and rejecting a valid key
// because it used base64url is the kind of friction that makes people
// paste plaintext instead.
var key []byte
var err error
for _, enc := range []*base64.Encoding{
base64.StdEncoding, base64.RawStdEncoding,
base64.URLEncoding, base64.RawURLEncoding,
} {
key, err = enc.DecodeString(raw)
if err == nil {
break
}
}
if err != nil {
panic("FLASH_COOKIE_KEY must be base64 (std or url, padded or not)")
}
if len(key) < 32 {
panic("FLASH_COOKIE_KEY must decode to at least 32 bytes (got " +
strconv.Itoa(len(key)) + ") — generate with: openssl rand -base64 32")
}
flashKey = key
}
type oauthFlow struct {
State string
Verifier string
Expires time.Time
}
func setFlashCookie(w http.ResponseWriter, name string, v any) {
payload, _ := json.Marshal(v)
mac := hmac.New(sha256.New, flashKey)
mac.Write(payload)
sig := mac.Sum(nil)
// Encode payload + signature. No confidentiality here (json is visible to
// the browser), but we don't need it — state/verifier are ephemeral and
// used once. We only need integrity: the server must detect tampering.
value := base64.RawURLEncoding.EncodeToString(payload) + "." +
base64.RawURLEncoding.EncodeToString(sig)
http.SetCookie(w, &http.Cookie{
Name: name,
Value: value,
// Scope to the OAuth routes only. Broader paths (e.g. /auth) send
// this cookie to /auth/refresh and other unrelated handlers, where
// a log line or error reflector could surface the signed payload.
Path: "/auth/oauth",
HttpOnly: true,
Secure: true,
SameSite: http.SameSiteLaxMode, // Lax, not Strict — this cookie must
// survive a top-level redirect from the IdP back to us.
MaxAge: 600, // 10 minutes
})
}
func readFlashCookie(r *http.Request, name string) (oauthFlow, error) {
var zero oauthFlow
c, err := r.Cookie(name)
if err != nil {
return zero, err
}
parts := splitTwo(c.Value, '.')
if parts == nil {
return zero, errors.New("malformed")
}
payload, err := base64.RawURLEncoding.DecodeString(parts[0])
if err != nil {
return zero, err
}
sig, err := base64.RawURLEncoding.DecodeString(parts[1])
if err != nil {
return zero, err
}
mac := hmac.New(sha256.New, flashKey)
mac.Write(payload)
if !hmac.Equal(sig, mac.Sum(nil)) {
return zero, errors.New("bad signature")
}
var f oauthFlow
if err := json.Unmarshal(payload, &f); err != nil {
return zero, err
}
// Enforce the expiry bound here, in one place, rather than trusting every
// caller to remember: a stale flow is as dead as a forged one.
if time.Now().After(f.Expires) {
return zero, errors.New("flow expired")
}
return f, nil
}
// splitTwo splits s around the first occurrence of sep into exactly two
// parts, returning nil if sep is absent.
func splitTwo(s string, sep byte) []string {
for i := 0; i < len(s); i++ {
if s[i] == sep {
return []string{s[:i], s[i+1:]}
}
}
return nil
}
One coupling to be careful about: the cookie Path must be the common prefix of your OAuth start and callback routes. The snippet uses /auth/oauth as an example because both handlers live under it. If your routes are /auth/login/google/start and /auth/login/google/callback, the Path must be /auth/login/google — otherwise the browser won’t send the cookie to the callback and the flow silently breaks. Don’t widen the path back to /auth to “make it work”: that’s the original bug.
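One cheap way to keep that coupling honest is a boot-time assertion. A sketch with hypothetical route constants (pathMatches and mustCoverRoutes are my names; pathMatches follows the RFC 6265 path-match rules, which require a "/" boundary, not a bare prefix):

```go
package main

import (
	"fmt"
	"strings"
)

// pathMatches reports whether a cookie with the given Path attribute is sent
// on a request to reqPath, per RFC 6265 section 5.1.4.
func pathMatches(cookiePath, reqPath string) bool {
	if reqPath == cookiePath {
		return true
	}
	if !strings.HasPrefix(reqPath, cookiePath) {
		return false
	}
	// Prefix alone isn't enough: "/auth/oauth" must NOT match "/auth/oauthx".
	return strings.HasSuffix(cookiePath, "/") || reqPath[len(cookiePath)] == '/'
}

// mustCoverRoutes panics at boot if the flash-cookie path doesn't cover every
// OAuth route. Failing a deploy is cheaper than debugging a silent flow break.
func mustCoverRoutes(cookiePath string, routes ...string) {
	for _, rt := range routes {
		if !pathMatches(cookiePath, rt) {
			panic(fmt.Sprintf("cookie path %q does not cover route %q", cookiePath, rt))
		}
	}
}

func main() {
	// Hypothetical routes; substitute your own.
	mustCoverRoutes("/auth/oauth",
		"/auth/oauth/google/start",
		"/auth/oauth/google/callback")
	fmt.Println(pathMatches("/auth/oauth", "/auth/oauthx")) // false
}
```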
If you need confidentiality — say the payload contains a user ID you don’t want the browser to see — reach for crypto/aes + GCM, not custom. For OAuth state/verifier, signing is enough: the values are random and short-lived.
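If you do go the encrypted route, the stdlib pieces compose in a few lines. A sketch, not a drop-in (seal/open are my names; the demo key in main is zeroes for brevity, load a real one the way flashKey is loaded above):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// seal encrypts and authenticates payload with AES-256-GCM, prepending the
// random nonce to the ciphertext. key must be 32 bytes.
func seal(key, payload []byte) (string, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(gcm.Seal(nonce, nonce, payload, nil)), nil
}

// open reverses seal. GCM authenticates, so any tampering fails here; no
// separate HMAC step is needed.
func open(key []byte, value string) ([]byte, error) {
	raw, err := base64.RawURLEncoding.DecodeString(value)
	if err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(raw) < gcm.NonceSize() {
		return nil, errors.New("short ciphertext")
	}
	return gcm.Open(nil, raw[:gcm.NonceSize()], raw[gcm.NonceSize():], nil)
}

func main() {
	key := make([]byte, 32) // demo only: use a real random key in production
	ct, _ := seal(key, []byte(`{"uid":42}`))
	pt, err := open(key, ct)
	fmt.Println(string(pt), err) // {"uid":42} <nil>
}
```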
Client IP Extraction: Do Not Trust Headers You Haven’t Pinned
The httpx.ClientIP(r) call in the login handler is security-critical. It feeds the rate limiter, the audit log, and any geo-based risk signal you add later. Getting it wrong means an attacker rotates X-Forwarded-For values and your IP-based limits collapse to nothing.
The correct pattern depends on your deployment:
// httpx/clientip.go
package httpx
import (
"net"
"net/http"
"net/netip"
"strings"
)
// trustedProxies is the set of CIDRs your reverse proxies live in. Load from
// config. Anything outside this set is untrusted, which means its
// X-Forwarded-For header is also untrusted.
var trustedProxies []netip.Prefix
// ClientIP returns the originating client IP. If the immediate peer is a
// trusted proxy, it consults X-Forwarded-For and walks right-to-left,
// skipping entries contributed by additional trusted proxies, and returns
// the first untrusted entry (which is the real client). If the immediate
// peer is NOT a trusted proxy, X-Forwarded-For is ignored — anything it
// says is attacker-supplied.
func ClientIP(r *http.Request) string {
peer, _, _ := net.SplitHostPort(r.RemoteAddr)
peerAddr, err := netip.ParseAddr(peer)
if err != nil || !isTrusted(peerAddr) {
return peer
}
xff := r.Header.Get("X-Forwarded-For")
if xff == "" {
return peer
}
parts := strings.Split(xff, ",")
for i := len(parts) - 1; i >= 0; i-- {
candidate := strings.TrimSpace(parts[i])
addr, err := netip.ParseAddr(candidate)
if err != nil {
return peer // malformed — fall back to peer
}
if !isTrusted(addr) {
return addr.String() // first non-proxy entry from the right
}
}
return peer
}
func isTrusted(a netip.Addr) bool {
for _, p := range trustedProxies {
if p.Contains(a) {
return true
}
}
return false
}
Rule of thumb: never read X-Forwarded-For unless you know exactly how many trusted proxies your traffic crosses. If you have one ingress and one CDN, that’s two hops, and you walk back past both before you trust an address. If you don’t know, return r.RemoteAddr and accept that your per-IP limits are really per-proxy limits.
Service-to-Service: mTLS, Not Bearer Tokens
Passing JWTs between services works but has a subtle problem: the token in the Authorization header represents the user, not the caller. Service A calling Service B on behalf of a user has no cryptographic way to prove “I am Service A” — it just forwards a user token. If Service B trusts anything on the basis of “who called me,” it has no basis to.
The fix is a second identity layer: each service gets its own cryptographic identity, and calls are mutually authenticated over mTLS. The user’s JWT still flows through as context, but the transport proves the caller.
Running this by hand is painful. SPIFFE/SPIRE automates it: each workload gets an X.509 SVID (SPIFFE Verifiable Identity Document) with a URI SAN like spiffe://example.org/ns/prod/sa/orders-api. SVIDs are short-lived and rotation is automatic. Istio and Linkerd use SPIFFE under the hood for their mTLS modes.
In code, the server side of a service is just a TLS config that requires and verifies client certs, plus an authorization check against the peer’s SPIFFE ID:
// trustBundle is hot-reloadable. A background goroutine refreshes it from the
// SPIRE Workload API and stores the new pool atomically. Readers pull the
// current pool on every handshake — never a snapshot.
var trustBundle atomic.Pointer[x509.CertPool]
// baseTLSConfig is the common TLS baseline. ClientCAs is deliberately nil
// here — it's populated per-handshake via GetConfigForClient so rotation
// actually takes effect.
var baseTLSConfig = &tls.Config{
MinVersion: tls.VersionTLS13,
ClientAuth: tls.RequireAndVerifyClientCert,
VerifyPeerCertificate: verifySpiffeID("spiffe://example.org/ns/prod/sa/orders-api"),
}
// ServerTLSConfig returns the config you pass to http.Server. Every handshake
// calls GetConfigForClient, which clones the base and injects the CURRENT
// trust bundle. This is what actually makes rotation work — setting
// tlsCfg.ClientCAs once at startup captures a snapshot and revoked CAs stay
// trusted until the process restarts.
func ServerTLSConfig() *tls.Config {
return &tls.Config{
GetConfigForClient: func(*tls.ClientHelloInfo) (*tls.Config, error) {
cfg := baseTLSConfig.Clone()
cfg.ClientCAs = trustBundle.Load()
return cfg, nil
},
}
}
// verifySpiffeID returns a VerifyPeerCertificate hook that fails closed
// unless the peer cert's URI SAN EXACTLY matches the expected SPIFFE ID.
// Prefix matching here is the bug — "spiffe://example.org/ns/prod/sa/orders"
// would wrongly accept "...sa/orders-admin".
func verifySpiffeID(expected string) func([][]byte, [][]*x509.Certificate) error {
return func(_ [][]byte, chains [][]*x509.Certificate) error {
if len(chains) == 0 || len(chains[0]) == 0 {
return errors.New("no verified chain")
}
leaf := chains[0][0]
for _, uri := range leaf.URIs {
if uri.String() == expected { // exact, never HasPrefix
return nil
}
}
return fmt.Errorf("no matching SPIFFE ID on peer cert (wanted %q)", expected)
}
}
The shape here is what people miss. tls.Config.ClientCAs set once at startup is a snapshot; every handshake reads that snapshot. To actually hot-reload, you need a per-handshake hook — GetConfigForClient on the server side, GetClientCertificate on the client side. A background goroutine refreshes trustBundle from the SPIRE Workload API, and every handshake pulls the current pool.
One thing to skip: CRL/OCSP. SPIFFE SVIDs have short TTLs (typically one hour) — revocation happens naturally when a compromised workload can’t renew. If you need faster revocation, shorten the SVID TTL. Standing up an OCSP responder for workloads that rotate hourly is theater.
The point isn’t the code — it’s the pattern. Stop treating service-to-service auth like user auth with a different secret. Give each workload an identity, rotate it, and let the platform handle distribution.
Zero Trust: A Model, Not a Middleware
“Zero trust” gets sold as a magical middleware. It isn’t. It’s a design principle: every request is authenticated and authorized against current context, not against a past login event. No permanent trust, no implicit trust from network location.
The practical implications in an auth layer:
- Short token lifetimes with continuous refresh. 15 minutes max on access tokens. The session ends naturally if the user goes idle.
- Context evaluation on sensitive actions. For a payment, re-check recent signals: same IP range, same device fingerprint, no impossible travel (new login from Lisbon 10 minutes after one from Tokyo).
- Step-up auth on risk changes. New country? Re-challenge for MFA. Admin action from a browser that hasn’t done admin actions before? Step up.
- Revoke everywhere, instantly. Signing-key rotation on compromise. Session store with a kill switch. The entire point of short TTL is so revocation is cheap.
A concrete example: impossible-travel detection. Given a login history, flag if the user just logged in from a location too far from their previous session to have traveled.
// auth/risk/travel.go
package risk
import (
"math"
"time"
)
type Login struct {
At time.Time
Lat float64
Lon float64
}
// ImpossibleTravel returns a risk score 0.0 (normal) to 1.0 (definitely impossible).
// The threshold is implied speed. 900 km/h is commercial flight; >1200 is impossible
// for consumer travel. Intermediate values flag for step-up auth.
func ImpossibleTravel(prev, curr Login) float64 {
dt := curr.At.Sub(prev.At).Hours()
if dt <= 0 {
return 0 // clock problem, don't judge
}
km := haversineKM(prev.Lat, prev.Lon, curr.Lat, curr.Lon)
kmh := km / dt
switch {
case kmh < 900:
return 0.0
case kmh < 1200:
return 0.5 // step up to MFA
default:
return 1.0 // block, require full re-auth
}
}
func haversineKM(lat1, lon1, lat2, lon2 float64) float64 {
const R = 6371.0
toRad := func(d float64) float64 { return d * math.Pi / 180 }
dLat := toRad(lat2 - lat1)
dLon := toRad(lon2 - lon1)
a := math.Sin(dLat/2)*math.Sin(dLat/2) +
math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
c := 2 * math.Atan2(math.Sqrt(a), math.Sqrt(1-a))
return R * c
}
That’s one signal, end-to-end. A real risk engine combines several (device fingerprint drift, ASN change, time-of-day anomaly) and weights them by action sensitivity. But it’s the same shape: take context, compute a number, branch on it. Don’t hand-wave with “we’ll compute a score” — decide what goes in, what comes out, and what the system does with it.
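To make that concrete, here is the minimal shape of such a combiner. The signal names and weights are illustrative, not a calibrated model; the point is that inputs, output, and thresholds are explicit:

```go
package main

import "fmt"

// Signal is one risk input, already normalized to [0,1] by its own detector
// (e.g. the ImpossibleTravel score above).
type Signal struct {
	Name   string
	Score  float64
	Weight float64
}

// riskScore is a weighted average of the signals. The thresholds you branch
// on downstream (allow / step-up / block) should depend on action
// sensitivity: a payment tolerates less risk than a profile read.
func riskScore(signals []Signal) float64 {
	var sum, wsum float64
	for _, s := range signals {
		sum += s.Score * s.Weight
		wsum += s.Weight
	}
	if wsum == 0 {
		return 0
	}
	return sum / wsum
}

func main() {
	score := riskScore([]Signal{
		{"impossible_travel", 0.5, 3}, // strongest signal, weighted highest
		{"asn_change", 1.0, 1},
		{"device_drift", 0.0, 2},
	})
	fmt.Printf("%.2f\n", score) // 0.42 = (0.5*3 + 1.0*1 + 0.0*2) / 6
}
```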
Hardening Checklist
Everything in the threat model table distilled to action items:
- TLS 1.3 only. In Go: MinVersion: tls.VersionTLS13. That one line also drops every CBC cipher and every non-ECDHE suite, because TLS 1.3 removed them from the protocol. No negotiation.
- HSTS with long max-age + preload. Once you’re on HTTPS, make downgrade impossible.
- Rate limit at the edge. Use a real distributed rate limiter (Redis INCR with expiry, or a sidecar). Per-IP and per-username for auth endpoints; per-user for sensitive actions.
- Argon2id for password hashing, tuned to 200-300ms per verify on your hardware. Bcrypt works but truncates inputs at 72 bytes — if you use it, HMAC-SHA256 with a server pepper first.
- Short access tokens (≤15 min), long refresh tokens (≤7 days) with rotation. Family-based reuse detection.
- Signing-key rotation runbook. You will rotate a key in an incident. Make sure you’ve practiced. Keys published via JWKS with kid make this a config change.
- Audit log every auth event. Login (success/fail), logout, refresh, password change, MFA enrollment, permission change. Stream to an append-only store. Alert on velocity anomalies.
- Generic error messages on the wire, detailed logs on the server. “invalid credentials” for login failures; the specific reason in your logs only.
- SameSite=Strict for session cookies. Lax if you need cross-site top-level navigation; never None without a very specific reason.
- CSRF protection for cookie-based sessions. Double-submit token or origin header check on state-changing routes.
- Content-Security-Policy that blocks inline scripts. XSS that can’t execute can’t steal access tokens held in memory.
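The bcrypt pre-hash from the checklist is two stdlib calls. A sketch (prehash is my name; the pepper is a server-side secret loaded like any other key):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// prehash compresses an arbitrary-length password into a fixed 44-character
// string before bcrypt, sidestepping bcrypt's 72-byte truncation. Base64 the
// MAC rather than feeding raw bytes to bcrypt: raw MAC output can contain
// NUL bytes, which some bcrypt implementations treat as a terminator.
func prehash(pepper []byte, password string) string {
	mac := hmac.New(sha256.New, pepper)
	mac.Write([]byte(password))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	p := prehash([]byte("server-side-pepper"), "correct horse battery staple")
	fmt.Println(len(p)) // 44: always well under bcrypt's 72-byte limit
}
```

The pepper also buys you something bcrypt alone doesn't: a database dump without the pepper (which lives in your secret store, not the DB) yields hashes that can't even be brute-forced offline.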
What I’d Actually Choose
If I’m starting a new distributed system today:
User-facing auth: OIDC via a dedicated IdP. Auth0 if you’re fine with SaaS pricing; Keycloak if you want self-hosted and open source with weight behind it; Zitadel if you want self-hosted and modern. Don’t build your own identity provider. The code above is for understanding, not replacing these.
Service-to-service: mTLS with SPIFFE/SPIRE identities, run inside a service mesh if you have one (Istio/Linkerd do this automatically). JWTs for the user context still flow through as headers, but the transport identity comes from the cert.
Token strategy: RS256 or ES256 from a single issuer. 15-minute access tokens. 7-day refresh tokens with single-use rotation and family-based reuse detection. Access tokens in memory on the SPA, refresh tokens in httpOnly SameSite=Strict cookies scoped to the refresh endpoint.
Go libraries: github.com/golang-jwt/jwt/v5 for issuance and validation. github.com/lestrrat-go/jwx/v2 if you need heavy JWK/JWE work. Both are fine; v5 is enough for most cases.
TypeScript libraries: jose over jsonwebtoken. Better defaults, better JWKS support, maintained.
The biggest mistake I see teams make: building auth as an afterthought and bolting it on later. Auth is a cross-cutting concern that touches every service. Design it first, make it a shared library or sidecar, and get it right before you have 30 services each doing their own thing. Retrofitting auth across a mature service graph is painful in a way that few other migrations are.