Authorization Patterns for Go Microservices

Authorization in a service mesh — RBAC, ABAC, policy engines, SPIFFE service identity, and deny-by-default patterns for Go microservices without the panics and silent bypasses.

Security · Golang · Microservices

A customer reads another tenant’s invoice. A service account for the reporting job somehow issues refunds. A product manager toggles a feature flag in production because the admin UI “just worked” for them. None of these are authentication bugs. The tokens were valid. The identities were real. The system failed at the next step: deciding what a valid identity is actually allowed to do.

Authentication tells you who is on the other end of the wire. Authorization tells you what they can do to what. In a monolith the line between those two can blur inside a single middleware. In a service mesh they’re separate systems, usually owned by different teams, enforced in different languages, and failing in different ways. This post is about the second one — authorization — in Go microservices, and the patterns I actually ship.

If you want the AuthN side — JWT strategy, refresh rotation, OIDC — I covered that in Authentication Patterns for Distributed Systems. This post assumes you already have trustworthy claims arriving at the service boundary. The question here is what you do with them.

The Threats That Authorization Actually Defends Against

Before code, name the enemy. Authorization bugs have a distinctive flavor:

| Threat | What it looks like | Defense |
| --- | --- | --- |
| Broken object-level authorization (BOLA) | GET /invoices/42 returns any invoice if you know the ID | Per-object ownership check, not just “user is logged in” |
| Broken function-level authorization | Admin endpoint protected only by “the UI doesn’t link there” | Explicit verb check on every handler, deny-by-default routing |
| Privilege escalation via role mutation | User edits their own profile and sets role: admin | Never bind roles from the request body; server-assigned only |
| Confused deputy | Internal service X calls Y with its own identity, but on behalf of user Z with more permissions than Z has | Propagate user context and caller identity; authorize on both |
| Policy-engine fail-open | OPA sidecar times out, middleware logs “skipping” and allows the call | Fail closed. No policy answer = deny |
| Role explosion | 400 roles, nobody knows what ops_v2_readonly_legacy grants | Derive roles from attributes; review quarterly; kill unused |
| Scope leakage across services | User JWT has billing:* and payments service honors it as payments:* | Per-service scope namespacing; audience pinning |
| Silent parse failure → permit | Claims missing roles field, code treats empty slice as “no restrictions” | Parse failure maps to deny, never to wildcard |

Most production AuthZ incidents I’ve been pulled into are one of rows 1, 2, 3, or 8. They’re not exotic. They’re what happens when authorization is written handler-by-handler instead of as a system.

RBAC vs ABAC vs ReBAC: Pick With Intent

There are three models worth knowing. Every “policy engine” blog post pretends they’re interchangeable. They’re not.

RBAC (Role-Based Access Control) assigns users to roles, roles to permissions. “Admins can delete invoices.” Simple, auditable, easy to explain to a compliance team. Fails when permissions depend on context — who owns the object, what tenant it belongs to, whether it’s business hours. Role explosion is the classic failure: every new context dimension spawns a new role, and you end up with tenant_42_billing_readonly_eu.

ABAC (Attribute-Based Access Control) decides from attributes of the subject, action, resource, and environment. “A user can delete an invoice if they belong to the same tenant, have role billing-admin, and the invoice is in draft status.” More expressive, harder to audit, easy to write a policy nobody can reason about six months later. OPA and Cedar live here.

ReBAC (Relationship-Based Access Control) decides from the graph of relationships between subjects and resources. “This user can edit this document because they’re a member of a group that owns the folder it lives in.” Zanzibar/SpiceDB/OpenFGA live here. Great for collaboration products with nested sharing (think Google Docs, Notion). Overkill for most line-of-business systems.

My rule of thumb: start with RBAC for coarse-grained verbs, layer ABAC on top for contextual constraints. The role tells you what verbs exist; attributes tell you which specific resources those verbs apply to in context. ReBAC only if your product’s core mental model is a sharing graph.

A concrete example. In a billing system:

  • RBAC says: “role billing-admin has verbs invoice:read, invoice:write, invoice:void.”
  • ABAC says: “additionally, an admin can only act on invoices where invoice.tenant_id == user.tenant_id and invoice.status != 'finalized'.”

Neither alone is enough. Together they’re legible.
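The layering reduces to two predicates evaluated in sequence. Here is a minimal sketch of that shape as a pure function; the roleGrants table and invoice struct are hypothetical illustrations, not the evaluator built later in the post:

```go
package main

import "fmt"

// Hypothetical RBAC table: role → verbs it grants.
var roleGrants = map[string]map[string]bool{
	"billing-admin": {"invoice:read": true, "invoice:write": true, "invoice:void": true},
}

// Hypothetical resource shape carrying the ABAC attributes.
type invoice struct {
	TenantID string
	Status   string
}

// allowed layers the two models: RBAC answers "does any role grant the
// verb at all?", then ABAC answers "does context permit it on this invoice?".
func allowed(roles []string, verb, userTenant string, inv invoice) bool {
	hasVerb := false
	for _, r := range roles {
		if roleGrants[r]["invoice:"+verb] {
			hasVerb = true
			break
		}
	}
	if !hasVerb {
		return false // RBAC layer: no role grants the verb
	}
	// ABAC layer: tenant isolation, then the finalized-invoice constraint.
	if inv.TenantID != userTenant {
		return false
	}
	if inv.Status == "finalized" {
		return false
	}
	return true
}

func main() {
	inv := invoice{TenantID: "t1", Status: "draft"}
	fmt.Println(allowed([]string{"billing-admin"}, "void", "t1", inv)) // true: same tenant, draft
	fmt.Println(allowed([]string{"billing-admin"}, "void", "t2", inv)) // false: cross-tenant
}
```

Note the ordering: the cheap role lookup runs before any attribute comparison, and every branch that fails returns false rather than falling through.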

The Architecture: Admission at the Service Edge

The biggest single improvement I’ve made to microservice AuthZ in the last three years: stop sprinkling if user.HasRole(...) through handlers. Put one admission middleware at the entry of each service that evaluates every request against a policy. Handlers receive only requests that have already been permitted and can assume their caller is allowed to be there.

This has three benefits. First, you can audit all of a service’s authorization decisions by reading one file. Second, a missed check in a handler can’t fail open — the middleware already said yes. Third, you can turn the policy into data and change it without a deploy.

Here’s the shape. A request arrives. The middleware extracts (a) the user context from the JWT, (b) the caller service identity from the mTLS peer cert, (c) the resource and verb from the route. It builds a decision input, asks a policy evaluator, and acts on the answer.

// authz/middleware.go
package authz

import (
	"context"
	"fmt"
	"net/http"
)

// Decision is the result of a policy evaluation. Explicit allow/deny plus a
// reason string so audit logs have something useful to search for.
type Decision struct {
	Allow  bool
	Reason string
}

// Input is everything the policy needs to decide. Keep the fields small and
// stable — this struct is the contract between services and the policy.
type Input struct {
	User     UserContext     // end-user identity from the JWT
	Caller   ServiceIdentity // peer service from mTLS SVID
	Resource Resource        // "invoice", plus ID and tenant
	Verb     string          // "read", "write", "void"
	Env      Environment     // request time, source IP, etc.
}

type Evaluator interface {
	Evaluate(ctx context.Context, in Input) (Decision, error)
}

// RouteResolver turns an HTTP request into the resource+verb the policy
// reasons about. Each service implements its own — the billing service maps
// POST /invoices/:id/void to (resource=invoice, verb=void). One small
// interface keeps the middleware generic.
type RouteResolver interface {
	Resolve(r *http.Request) (resource Resource, verb string, err error)
}

// buildInput extracts user context from the request's JWT claims (parsed
// and attached to ctx by an earlier middleware — see the auth article),
// caller identity from the mTLS peer cert (attached by RequireMTLS above),
// and resource+verb from the route resolver. UserFrom and envFrom are the
// mirror of CallerFrom — trivial ctx accessors left as an exercise; the
// interesting bit is that every failure here becomes a deny, never a
// default value.
func buildInput(r *http.Request, route RouteResolver) (Input, error) {
	user, ok := UserFrom(r.Context())
	if !ok {
		return Input{}, fmt.Errorf("no user context on request")
	}
	caller, ok := CallerFrom(r.Context())
	if !ok {
		return Input{}, fmt.Errorf("no caller identity on request")
	}
	res, verb, err := route.Resolve(r)
	if err != nil {
		return Input{}, fmt.Errorf("route resolve: %w", err)
	}
	// Tenant isolation is a required dimension, not an optional one. An empty
	// Resource.TenantID used to bypass the ABAC tenant check downstream —
	// invert that default: missing tenant = deny at input build. A resource
	// that genuinely has no tenant (global admin surfaces) should carry an
	// explicit sentinel like "_global" that the policy whitelists, never "".
	if res.TenantID == "" {
		return Input{}, fmt.Errorf("resource tenant_id is required")
	}
	return Input{User: user, Caller: caller, Resource: res, Verb: verb, Env: envFrom(r)}, nil
}

// Middleware builds the input and defers to the evaluator. Deny-by-default:
// any error from the evaluator is an automatic deny, logged as such. A panic
// inside the evaluator is caught and treated as an evaluator error so a bad
// policy cannot crash the server nor slip a request past unaudited.
func Middleware(ev Evaluator, audit AuditSink, route RouteResolver) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			in, err := buildInput(r, route)
			if err != nil {
				audit.Write(r.Context(), AuditEvent{
					Allow: false, Reason: "input-build-failed", Error: err.Error(),
				})
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			dec, err := safeEvaluate(r.Context(), ev, in, audit)
			if err != nil {
				// Fail closed. Never treat evaluator errors as permit.
				audit.Write(r.Context(), AuditEvent{
					Input: in, Allow: false, Reason: "evaluator-error", Error: err.Error(),
				})
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			audit.Write(r.Context(), AuditEvent{
				Input: in, Allow: dec.Allow, Reason: dec.Reason,
			})
			if !dec.Allow {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), inputKey{}, in)))
		})
	}
}

// safeEvaluate invokes the evaluator with a recover guard. A panicking
// evaluator gets turned into a normal error + an "evaluator-panic" audit
// event, then denies. Without this, a nil-map deref in a rule leaks state:
// the request dies with a 500 and no audit line, which is exactly the hole
// attackers probe for.
func safeEvaluate(ctx context.Context, ev Evaluator, in Input, audit AuditSink) (dec Decision, err error) {
	defer func() {
		if p := recover(); p != nil {
			audit.Write(ctx, AuditEvent{
				Input: in, Allow: false, Reason: "evaluator-panic", Error: fmt.Sprintf("%v", p),
			})
			err = fmt.Errorf("evaluator panic: %v", p)
		}
	}()
	return ev.Evaluate(ctx, in)
}

type inputKey struct{}

// InputFrom returns the authorized Input that the middleware stashed on the
// context after the evaluator said yes. Handlers call this when they need to
// see the subject/tenant/resource the policy already validated, rather than
// re-parsing from the request. Mirrors CallerFrom/UserFrom.
func InputFrom(ctx context.Context) (Input, bool) {
	in, ok := ctx.Value(inputKey{}).(Input)
	return in, ok
}

Four things I want you to notice. The middleware has no opinion about how decisions are made — Evaluator is an interface, so you can swap OPA, Cedar, or in-code rules without touching the hot path. The evaluator error path denies, never permits. Every decision — allow or deny — is written to audit. A deny without an audit line is a debugging black hole waiting to happen. And the safeEvaluate wrapper turns a panic inside a rule into a deny with an audit event: the one shape of failure that otherwise leaks, because a 500 from a panicked handler skips the audit write the deny path relies on.
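The panic-to-deny property is easy to verify in isolation. A stripped-down sketch (a hypothetical safeEval using the post's Decision shape, with no audit sink) showing that a panicking rule surfaces as an error the caller maps to deny:

```go
package main

import "fmt"

type Decision struct {
	Allow  bool
	Reason string
}

// safeEval mirrors the recover guard: a panic inside the evaluator is
// converted into an ordinary error on the named return values, so the
// caller's fail-closed error path handles it like any other failure.
func safeEval(eval func() Decision) (dec Decision, err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("evaluator panic: %v", p)
		}
	}()
	return eval(), nil
}

func main() {
	dec, err := safeEval(func() Decision { panic("nil map deref in rule") })
	if err != nil {
		// Fail closed: an evaluator error is a deny, never a permit.
		dec = Decision{Allow: false, Reason: "evaluator-error"}
	}
	fmt.Println(dec.Allow, dec.Reason) // false evaluator-error
}
```

The named return values matter: when recover fires, dec is the zero value (Allow false) and err is non-nil, so even a caller that forgets to check err cannot observe an allow.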

Parsing Claims: Deny on Failure, Never Permit

The single most common critical bug I find in Go authorization code is a type assertion without the comma-ok check. It looks harmless:

// DO NOT DO THIS
userID := claims["sub"].(string)
roles := claims["roles"].([]string)

If the claim is missing or the wrong type, the service panics. In an HTTP handler, a panic becomes a 500, the middleware recovers, and the request is denied — but now you’ve taught any attacker how to DoS your service by sending a malformed JWT, and you’ve also taught yourself to ignore 500s in that middleware. Worse versions of this bug look like:

// DO NOT DO THIS EITHER
rolesRaw, _ := claims["roles"].([]any)
var roles []string
for _, r := range rolesRaw {
    s, _ := r.(string)
    roles = append(roles, s) // silently drops non-strings, silently empty on parse fail
}
// ... later ...
if len(roles) == 0 {
    // "user has no roles, must be a regular user" — wrong
}

Parse failures silently yield empty slices, and “no roles” gets treated as “default user” instead of “unauthenticated.” The fix is to be paranoid and explicit. Every claim extraction returns an error; any error on the AuthZ path is a deny.

// authz/claims.go
package authz

import (
	"errors"
	"fmt"
)

type UserContext struct {
	Subject  string
	TenantID string
	Roles    []string
	Scopes   []string
}

// FromClaims converts a jwt.MapClaims to a strongly-typed UserContext.
// Every field is checked with comma-ok. A missing required field is an error,
// never a default value.
func FromClaims(claims map[string]any) (UserContext, error) {
	sub, err := stringClaim(claims, "sub")
	if err != nil {
		return UserContext{}, fmt.Errorf("sub: %w", err)
	}
	tenant, err := stringClaim(claims, "tenant_id")
	if err != nil {
		return UserContext{}, fmt.Errorf("tenant_id: %w", err)
	}
	roles, err := stringSliceClaimNonEmpty(claims, "roles")
	if err != nil {
		return UserContext{}, fmt.Errorf("roles: %w", err)
	}
	scopes, err := stringSliceClaim(claims, "scopes")
	if err != nil {
		// scopes is optional; only error on wrong type, not missing
		if !errors.Is(err, errClaimMissing) {
			return UserContext{}, fmt.Errorf("scopes: %w", err)
		}
	}
	return UserContext{Subject: sub, TenantID: tenant, Roles: roles, Scopes: scopes}, nil
}

var (
	errClaimMissing = errors.New("claim missing")
	errEmptyClaim   = errors.New("claim is empty array")
)

func stringClaim(m map[string]any, key string) (string, error) {
	raw, present := m[key]
	if !present {
		return "", fmt.Errorf("%w: %s", errClaimMissing, key)
	}
	s, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf("claim %q is %T, want string", key, raw)
	}
	if s == "" {
		return "", fmt.Errorf("claim %q is empty", key)
	}
	return s, nil
}

func stringSliceClaim(m map[string]any, key string) ([]string, error) {
	raw, present := m[key]
	if !present {
		return nil, fmt.Errorf("%w: %s", errClaimMissing, key)
	}
	arr, ok := raw.([]any)
	if !ok {
		return nil, fmt.Errorf("claim %q is %T, want []any", key, raw)
	}
	out := make([]string, 0, len(arr))
	for i, v := range arr {
		s, ok := v.(string)
		if !ok {
			return nil, fmt.Errorf("claim %q[%d] is %T, want string", key, i, v)
		}
		out = append(out, s)
	}
	return out, nil
}

// stringSliceClaimNonEmpty is stringSliceClaim plus a len==0 guard. Required
// authority claims like "roles" must never arrive as []. An empty array is
// not the same as a missing claim — it's an authority-stripping signal we
// treat as adversarial: a mis-issued or tampered token that parses cleanly
// but carries no grants. Accepting "" here is how you ship a downgrade bug
// where the RBAC loop iterates zero roles and the ABAC layer carries the
// whole decision, or worse, a "default user" branch kicks in.
func stringSliceClaimNonEmpty(m map[string]any, key string) ([]string, error) {
	out, err := stringSliceClaim(m, key)
	if err != nil {
		return nil, err
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("%w: %s", errEmptyClaim, key)
	}
	return out, nil
}

Verbose? Yes. This is the kind of code I want boring and obvious. The worst place to be clever is the boundary where untrusted input becomes authorization context. Every branch has one job: turn an unusable input into an error. Callers map errors to deny.

One rule that earns its own line: empty arrays on required authority claims are a reject, not a zero value. A JWT that parses cleanly with "roles": [] is either mis-issued or tampered — in both cases the correct response is deny, not “user has no special permissions.” That’s what stringSliceClaimNonEmpty enforces. Use the plain stringSliceClaim for genuinely optional list claims (like scopes, where absence means “no narrowing”); use the non-empty variant for anything the access decision is going to loop over.

Policy Engines: OPA, Cedar, or In-Code

Once admission lives in middleware and claims are parsed safely, the real design question is where policy rules live. There are three reasonable answers.

In-code Go rules are the right default for small teams and simple domains. You write func(in Input) Decision in Go, it’s tested like any other Go code, it’s fast, and it’s grep-able. The tradeoff: policy changes require a deploy of every service, and a product manager cannot read the rules. Fine for ten services, painful at a hundred.

OPA (Open Policy Agent) with Rego centralizes policy as data. Policies live in their own repo, get versioned and rolled out independently of services, and can be distributed via a bundle server. The runtime is a sidecar or a Go library (opa). Rego is its own language and there’s a learning curve — I’ve seen teams write policies nobody on the team can confidently modify six months later. Worth it when policy has to be owned by a security team separate from service owners, or when you need to change rules without redeploying.

Cedar (from AWS, also open source) is a newer alternative with a more readable policy language than Rego and formal-verification tooling. I’ve used it on smaller projects and liked it. Ecosystem is thinner than OPA.

My take: in-code for the first ten services, OPA when policy ownership has to split from service ownership, Cedar if your domain fits its shape. Don’t reach for a policy engine because it’s trendy. The best policy engine is the one your team will actually edit correctly in six months.

Here’s what an in-code evaluator looks like, combining RBAC roles with ABAC constraints:

// authz/policy/inmem.go
package policy

import (
	"context"
	"slices"

	"myorg/authz"
)

// role → set of verbs. Kept tiny and reviewed quarterly.
var roleVerbs = map[string]map[string]bool{
	"billing-admin":    {"invoice:read": true, "invoice:write": true, "invoice:void": true},
	"billing-readonly": {"invoice:read": true},
	"support":          {"invoice:read": true},
}

type InMem struct {
	// ExpectedTrustDomain is the trust domain this service belongs to. The
	// service-caller table is keyed by full SPIFFE URI, but the URI string
	// alone is not enough — if the trust bundle ever drifts and a foreign CA
	// mints a matching URI under a different trust domain, the string key
	// still collides. Re-check the trust domain against this field.
	ExpectedTrustDomain string
}

func (e InMem) Evaluate(_ context.Context, in authz.Input) (authz.Decision, error) {
	verbKey := in.Resource.Type + ":" + in.Verb

	// Caller (service identity) check runs first. A request from a service
	// that isn't on the allowlist dies here, regardless of which user is on
	// whose behalf it's acting. This is the confused-deputy seam.
	if !callerAllowed(in.Caller, verbKey, e.ExpectedTrustDomain) {
		return authz.Decision{Allow: false, Reason: "caller " + in.Caller.SPIFFEID + " not allowed for " + verbKey}, nil
	}

	// RBAC: does any of the user's roles grant this verb?
	hasVerb := false
	for _, r := range in.User.Roles {
		if roleVerbs[r][verbKey] {
			hasVerb = true
			break
		}
	}
	if !hasVerb {
		return authz.Decision{Allow: false, Reason: "no role grants " + verbKey}, nil
	}

	// ABAC: tenant isolation. An admin in tenant A cannot act on tenant B.
	// Resource.TenantID is guaranteed non-empty by buildInput — a blank
	// tenant never reaches the evaluator, so the check is an equality
	// test, not a presence test. The old "if != '' && !=" shape let
	// missing-tenant-id silently bypass isolation.
	if in.Resource.TenantID != in.User.TenantID {
		return authz.Decision{Allow: false, Reason: "cross-tenant access denied"}, nil
	}

	// ABAC: finalized invoices can't be voided except by the ops role.
	if in.Verb == "void" && in.Resource.Attrs["status"] == "finalized" {
		if !slices.Contains(in.User.Roles, "ops") {
			return authz.Decision{Allow: false, Reason: "finalized invoice requires ops role to void"}, nil
		}
	}

	return authz.Decision{Allow: true, Reason: "role+tenant ok"}, nil
}

Two small things that matter disproportionately. First, the Reason string is not decorative — it goes to the audit log and is what you’ll search when a customer says “why did this fail?” Make it specific. “Cross-tenant access denied” is useful; “forbidden” is not. Second, notice deny paths return a nil error but Allow: false. Deny is a normal outcome, not an exception.

Service-to-Service Authorization Is a Separate Problem

This is where I see the most sophisticated engineers get confused. A user JWT answers “who is the human?” but in service-to-service calls there’s a second subject: the calling service. You need both.

Consider: the reports service needs to fetch invoices from the billing service on behalf of a user. The user has role billing-readonly. Who’s making the call? From the billing service’s perspective, it has to answer three questions:

  1. Is the calling service allowed to talk to me at all? (Mesh-level allowlist.)
  2. Is the calling service allowed to call this specific endpoint? (Per-endpoint service scope.)
  3. Is the end-user the call is on behalf of allowed to do this specific thing to this specific resource? (User-level authorization.)

All three have to pass. The classic confused-deputy vulnerability: the reports service has blanket access to billing endpoints because it’s an “internal service,” and forwards user calls without checking the user is allowed. A regular user then gets admin-level data just because they can tickle a reports endpoint that doesn’t check.

Service identity comes from the mTLS peer certificate, not from a header. Headers are forgeable; a valid TLS handshake against your trust bundle is not. SPIFFE gives every workload an X.509 SVID with a URI SAN like spiffe://corp.example/ns/prod/sa/reports-api, rotated automatically by SPIRE. You extract that SAN server-side and treat it as the caller identity.

There are two sharp edges in that extraction, and both are in-scope for any service that treats a SPIFFE ID as an authorization subject:

  1. Trust-domain pinning. The trust domain is the Host part of the URI (corp.example above). A SPIFFE ID is only meaningful relative to a trust domain — spiffe://evil.example/ns/prod/sa/reports-api is a different identity, not the same one. The threat this defends against is foreign-CA-in-bundle: if your trust bundle ever ingests a CA from another federation (an over-broad Istio MeshConfig, a SPIRE federation misconfigured to trust a partner’s domain, a cluster migration that left an old root in place), that CA can mint a valid cert carrying any trust domain in its URI SAN. A URI-only check accepts the forgery and maps it to a real identity in your service map. You must compare the trust domain against an expected allowlist on every connection, not once at startup.
  2. Exactly-one SPIFFE SAN. The SPIFFE X.509-SVID spec mandates exactly one spiffe:// URI SAN per certificate. Real-world code grabs the first URI SAN it finds — which silently accepts a forged cert that staples a legitimate SAN next to an attacker-chosen one, betting on iteration order. Reject any peer cert with zero or with more than one spiffe:// URI SAN.

// authz/spiffe.go
package authz

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

type ServiceIdentity struct {
	SPIFFEID string // e.g. spiffe://corp.example/ns/prod/sa/reports-api
	TrustDom string
}

// FromPeerCert pulls the SPIFFE URI SAN from the TLS peer certificate and
// verifies its trust domain against the expected one. Returns an error if
// there's no client cert, no SPIFFE SAN, more than one SPIFFE SAN, or the
// trust domain doesn't match. Callers treat any error as unauthenticated.
func FromPeerCert(tlsState *tls.ConnectionState, expectedTrustDomain string) (ServiceIdentity, error) {
	if tlsState == nil || len(tlsState.PeerCertificates) == 0 {
		return ServiceIdentity{}, errors.New("no peer certificate")
	}
	cert := tlsState.PeerCertificates[0]
	return spiffeIDFromCert(cert, expectedTrustDomain)
}

// spiffeIDFromCert enforces the SPIFFE X.509-SVID rule: exactly one
// spiffe:// URI SAN, and its trust domain must equal the expected one.
// Multiple SPIFFE SANs is a spec violation we treat as an attack — a
// forged cert may staple a legitimate SAN next to an attacker-chosen one.
// Trust-domain mismatch defends against a foreign CA in the trust bundle
// minting IDs in an unrelated domain.
func spiffeIDFromCert(cert *x509.Certificate, expectedTrustDomain string) (ServiceIdentity, error) {
	if expectedTrustDomain == "" {
		return ServiceIdentity{}, errors.New("expected trust domain must be configured")
	}
	var found *ServiceIdentity
	for _, u := range cert.URIs {
		if u.Scheme != "spiffe" {
			continue
		}
		if found != nil {
			return ServiceIdentity{}, errors.New("peer cert has more than one spiffe URI SAN; SPIFFE spec requires exactly one")
		}
		id := ServiceIdentity{SPIFFEID: u.String(), TrustDom: u.Host}
		found = &id
	}
	if found == nil {
		return ServiceIdentity{}, errors.New("no spiffe URI SAN in peer cert")
	}
	if found.TrustDom != expectedTrustDomain {
		return ServiceIdentity{}, fmt.Errorf("spiffe trust domain %q does not match expected %q", found.TrustDom, expectedTrustDomain)
	}
	return *found, nil
}

// RequireMTLS ensures every request on this server has a verified peer cert
// with a SPIFFE SAN in the expected trust domain. Apply before the authz
// middleware. The trust domain is configured per service — it never comes
// from the request.
func RequireMTLS(expectedTrustDomain string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			id, err := FromPeerCert(r.TLS, expectedTrustDomain)
			if err != nil {
				http.Error(w, "unauthenticated caller", http.StatusUnauthorized)
				return
			}
			r = r.WithContext(withCaller(r.Context(), id))
			next.ServeHTTP(w, r)
		})
	}
}

type callerKey struct{}

func withCaller(ctx context.Context, id ServiceIdentity) context.Context {
	return context.WithValue(ctx, callerKey{}, id)
}

func CallerFrom(ctx context.Context) (ServiceIdentity, bool) {
	id, ok := ctx.Value(callerKey{}).(ServiceIdentity)
	return id, ok
}

And an example service-to-service policy in the in-code evaluator — read as: “only reports-api and billing-ui can hit read endpoints, only billing-ui can hit write endpoints, nothing else gets in regardless of the user’s role”:

// authz/policy/service_caller.go
package policy

import "myorg/authz"

var serviceVerbs = map[string]map[string]bool{
	"spiffe://corp.example/ns/prod/sa/reports-api": {"invoice:read": true},
	"spiffe://corp.example/ns/prod/sa/billing-ui":  {"invoice:read": true, "invoice:write": true, "invoice:void": true},
	"spiffe://corp.example/ns/prod/sa/ops-console": {"invoice:read": true, "invoice:void": true},
}

// callerAllowed keys into the allowlist by full SPIFFE URI but also re-checks
// the trust domain against the one this service was configured with. The map
// lookup alone would accept a forged cert if a foreign CA slipped into the
// bundle and minted the same URI path under a different trust domain.
// Defense-in-depth against the exact failure mode that URI-SAN-only checks
// miss, paid for once per request.
func callerAllowed(caller authz.ServiceIdentity, verbKey, expectedTrustDomain string) bool {
	if expectedTrustDomain == "" || caller.TrustDom != expectedTrustDomain {
		return false
	}
	return serviceVerbs[caller.SPIFFEID][verbKey]
}

Both the caller check and the user check run on every request. If either says no, the request dies. That’s the shape of defense in depth that actually works: not redundant checks of the same thing, but different checks against different subjects.

Per-Service Scope Narrowing

A token that grants admin:* globally is a blast-radius grenade. When one service gets compromised, the attacker has admin on everything. The fix is namespace scopes per service: the billing service only ever understands billing:* scopes; the payments service only ever understands payments:*. A token with billing:write carries no meaning in the payments service and is ignored.

This is enforced by the validator, not by convention. Each service is configured with its own audience (aud claim) and only accepts tokens minted with that audience. Scopes are prefixed per service in the issuer, and the service strips or ignores scopes outside its namespace.

// authz/scopes.go
package authz

// HasScopeInNamespace asks a single question: does this token carry the
// given verb within the given service namespace? The old split API
// (FilterScopes that stripped the prefix, then HasScope that checked the
// un-trimmed form) invited a call-site bug where code filtered-then-
// compared-against-the-raw-scope and silently always returned false. One
// function couples the two steps so you can't hold them wrong.
//
// Scopes are compared by whole-string equality — no HasPrefix, no Contains.
// A token with "billing:write" matches HasScopeInNamespace(scopes, "billing",
// "write"); it does not match "writer" or "write-all".
func HasScopeInNamespace(scopes []string, namespace, verb string) bool {
	want := namespace + ":" + verb
	for _, s := range scopes {
		if s == want {
			return true
		}
	}
	return false
}

This pairs with audience pinning at the issuer: when the billing service calls the payments service on behalf of a user, the billing service requests a new token with aud: payments scoped narrowly to the specific verb it needs (the token-exchange pattern from RFC 8693). The user’s original JWT stays at the edge; downstream services never see it. That’s how you contain compromise.
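The RFC 8693 exchange itself is an ordinary form-encoded POST to the IdP's token endpoint. A sketch of the request body the billing service would send to obtain a narrowed payments token; the audience and scope values are illustrative, and the parameter names come from the RFC:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildExchangeForm assembles the RFC 8693 token-exchange parameters.
// subjectToken is the token the caller currently holds; audience and scope
// name the downstream service and the specific verb being requested.
func buildExchangeForm(subjectToken, audience, scope string) url.Values {
	v := url.Values{}
	v.Set("grant_type", "urn:ietf:params:oauth:grant-type:token-exchange")
	v.Set("subject_token", subjectToken)
	v.Set("subject_token_type", "urn:ietf:params:oauth:token-type:jwt")
	v.Set("audience", audience)
	v.Set("scope", scope)
	return v
}

func main() {
	form := buildExchangeForm("eyJ...", "payments", "payments:write")
	fmt.Println(form.Get("grant_type"))
	fmt.Println(form.Get("audience"))
}
```

The IdP enforces which (caller, audience, scope) exchanges are permitted, which is exactly where "reports can exchange for billing-read but not billing-write" lives. Cache the resulting short-lived token keyed by (subject, audience, scope) to amortize the round trip.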

Propagating Context Across Service Boundaries

The user’s identity and scopes need to travel with the call chain without becoming forgeable headers. Two patterns, in order of preference:

Token exchange per hop. At every service boundary, the caller swaps its current token for a new one with audience narrowed to the next service. The exchange is mediated by the IdP, which can enforce policy on what exchanges are allowed (the reports service can exchange for a billing-read token but not a billing-write token). Costs a round trip per hop, so cache short-lived exchanged tokens aggressively.

Signed propagation headers. For latency-sensitive call paths, propagate the user’s subject, tenant, and minimal scopes in headers signed with a service mesh-managed key. The receiving service verifies the signature and the upstream caller’s SPIFFE ID. Cheaper than token exchange, but the signing key becomes a critical asset. Istio supports this via its RequestAuthentication + AuthorizationPolicy primitives.

Whichever you pick, never trust plain headers without a signature. X-User-ID: admin from a compromised internal service is an instant domain admin. If your mesh isn’t enforcing mTLS end-to-end and signing propagated identity, treat every inter-service header as attacker-controlled.

Trace IDs are different. Those can flow as unsigned headers because they carry no authority. Don’t conflate observability context with authorization context.

Audit: Every Decision, Permit Or Deny

Audit is where most teams cut corners and regret it. You need a record of every authorization decision — permit and deny — with enough structure to answer: “who tried to do what, where, when, with what roles, and what did the policy decide and why?” Not some decisions. All of them.

// authz/audit.go
package authz

import (
	"context"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"time"
)

type AuditEvent struct {
	Time     time.Time `json:"time"`
	TraceID  string    `json:"trace_id"`
	Input    Input     `json:"input"` // serialized via Input.MarshalJSON (safe projection)
	Allow    bool      `json:"allow"`
	Reason   string    `json:"reason"`
	Error    string    `json:"error,omitempty"`     // populated only on deny paths with an underlying error
	PolicyID string    `json:"policy_id,omitempty"` // version of the policy that decided
}

// auditSubjectPepper is the HMAC key used to hash subject IDs on the audit
// stream. Must be loaded from the secret store at startup, never from a flag
// or a repo. Rotating this key rotates the correlation identity of every
// subject in the stream — treat that as a deliberate operation.
var auditSubjectPepper []byte

// SetAuditSubjectPepper wires the HMAC pepper at process start. Panics if
// called with empty bytes so a misconfigured deploy fails loudly.
func SetAuditSubjectPepper(key []byte) {
	if len(key) == 0 {
		panic("audit subject pepper must be non-empty")
	}
	auditSubjectPepper = key
}

// SafeInput is the allowlist projection of Input that is safe to emit to an
// audit stream. Audit streams outlive service DBs, ship to SIEMs, and often
// leave the primary trust domain — so we strip anything an attacker would
// pay for if the stream leaked. Explicit allowlist:
//   - SubjectHash: HMAC-SHA256(pepper, user.Subject). Bare-hash-of-subject
//     is dictionary-reversible; the pepper breaks that.
//   - TenantID: needed for per-tenant alerting and deny-rate dashboards.
//   - CallerSPIFFE: the calling service identity (not a secret).
//   - TrustDomain: companion to CallerSPIFFE.
//   - ResourceType, ResourceKey, Verb: the decision's subject in coarse form.
//
// Explicitly NOT emitted: raw Subject, Roles, Scopes (bearer-equivalent),
// Resource.Attrs (customer data), full Env.
type SafeInput struct {
	SubjectHash  string `json:"subject_hash"`
	TenantID     string `json:"tenant_id"`
	CallerSPIFFE string `json:"caller_spiffe"`
	TrustDomain  string `json:"trust_domain"`
	ResourceType string `json:"resource_type"`
	ResourceKey  string `json:"resource_key"`
	Verb         string `json:"verb"`
}

// MarshalJSON on Input is the single enforcement point: anywhere an Input
// is serialized (audit, structured logs, traces), it emits the safe
// projection. Code that needs the full Input reads fields directly in-
// process and never relies on json.Marshal. Making this a method on Input
// means a reviewer cannot forget to call a helper — you can't accidentally
// ship the unredacted form.
func (in Input) MarshalJSON() ([]byte, error) {
	return json.Marshal(SafeInput{
		SubjectHash:  hmacSubject(in.User.Subject),
		TenantID:     in.User.TenantID,
		CallerSPIFFE: in.Caller.SPIFFEID,
		TrustDomain:  in.Caller.TrustDom,
		ResourceType: in.Resource.Type,
		ResourceKey:  in.Resource.ID,
		Verb:         in.Verb,
	})
}

func hmacSubject(sub string) string {
	if len(auditSubjectPepper) == 0 || sub == "" {
		return ""
	}
	mac := hmac.New(sha256.New, auditSubjectPepper)
	mac.Write([]byte(sub))
	return hex.EncodeToString(mac.Sum(nil))
}

type AuditSink interface {
	Write(ctx context.Context, ev AuditEvent)
}

// AppendOnlyProducer is the narrow interface the audit sink needs: push
// bytes at an append-only backend (Kafka, Kinesis, an immutable logging
// bucket's batch API). AppendAsync returns an error so the sink surfaces
// backpressure. A naive producer that satisfies this interface without
// a durable queue — e.g., a bare `go func(){ ... }()` goroutine spray —
// is a silent-loss failure mode and must not ship. The minimum shape is
// a bounded channel, an overflow counter, and a non-nil error returned
// the moment the channel would block.
type AppendOnlyProducer interface {
	AppendAsync(payload []byte) error
}

// StreamSink writes audit events to an append-only stream. Never write to
// a store the service owner can rewrite — audit integrity matters more than
// ease of access. The Alarm callback fires on marshal failure AND on any
// non-nil error from the producer, so operators see producer backlog
// pressure, dropped events, and malformed events on the same alert path.
type StreamSink struct {
	Producer AppendOnlyProducer
	// Alarm MUST be non-blocking and panic-free. We recover around the call
	// anyway — a panicking pager must not take the audit path down with it.
	Alarm func(err error, ev AuditEvent)
}

func (s *StreamSink) alarm(err error, ev AuditEvent) {
	defer func() { _ = recover() }()
	if s.Alarm != nil {
		s.Alarm(err, ev)
	}
}

type fallbackAuditEvent struct {
	MarshalError bool   `json:"marshal_error"`
	Allow        bool   `json:"allow"`
	TraceID      string `json:"trace_id,omitempty"`
}

func (s *StreamSink) Write(ctx context.Context, ev AuditEvent) {
	ev.Time = time.Now().UTC()
	ev.TraceID = traceIDFrom(ctx)
	b, err := json.Marshal(ev)
	if err != nil {
		// A marshal failure here is a bug, not a routine error — the
		// event struct is ours. Page on it, and fall back to a minimal
		// safe payload so the decision is still traceable.
		s.alarm(err, ev)
		b, _ = json.Marshal(fallbackAuditEvent{MarshalError: true, Allow: ev.Allow, TraceID: ev.TraceID})
	}
	// Best-effort write, but producer backpressure must surface. A producer
	// that can't accept the payload returns an error here; we alarm so the
	// operator sees the pressure. Silent drop under load is the failure
	// mode audit logs exist to prevent.
	if pErr := s.Producer.AppendAsync(b); pErr != nil {
		s.alarm(pErr, ev)
	}
}

The traceIDFrom helper is a thin context accessor over whatever tracing library you’re using — OpenTelemetry, a custom correlation-ID middleware, whatever — and should return an empty string rather than panicking when there is no trace:

// authz/trace.go
package authz

import "context"

type traceIDKey struct{}

// WithTraceID is called by upstream tracing middleware; traceIDFrom reads
// what it stashed. Keep the type unexported so no other package can plant
// arbitrary strings under this key.
func WithTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, id)
}

func traceIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceIDKey{}).(string)
	return id
}
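
To make the producer contract concrete, here is one possible shape: a bounded channel drained by a single forwarder goroutine, an overflow counter, and an error the moment the channel would block. The forward callback (a Kafka client, a bucket batch API) is left abstract:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrBackpressure = errors.New("audit producer: buffer full, event dropped")

// BoundedProducer satisfies the AppendOnlyProducer interface without the
// silent-loss failure mode: a full buffer is reported to the caller and
// counted, never swallowed.
type BoundedProducer struct {
	ch      chan []byte
	dropped atomic.Int64 // exported as a metric in a real deployment
}

// NewBoundedProducer starts one forwarder goroutine that pushes payloads
// at the backend via forward.
func NewBoundedProducer(buf int, forward func([]byte)) *BoundedProducer {
	p := &BoundedProducer{ch: make(chan []byte, buf)}
	go func() {
		for b := range p.ch {
			forward(b)
		}
	}()
	return p
}

// AppendAsync enqueues without blocking. A would-block send returns
// ErrBackpressure immediately so the sink's Alarm path fires.
func (p *BoundedProducer) AppendAsync(payload []byte) error {
	select {
	case p.ch <- payload:
		return nil
	default:
		p.dropped.Add(1)
		return ErrBackpressure
	}
}

func main() {
	done := make(chan []byte, 1)
	p := NewBoundedProducer(4, func(b []byte) { done <- b })
	fmt.Println(p.AppendAsync([]byte(`{"allow":false}`))) // prints: <nil>
	fmt.Printf("%s\n", <-done)                            // prints: {"allow":false}
}
```

Sizing the buffer is an operational choice: large enough to ride out a broker hiccup, small enough that the alarm fires while the incident is still live.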

A concrete word on redaction. AuditEvent holds the Input by value, but Input.MarshalJSON emits SafeInput — a hard-coded allowlist of subject-hash, tenant, caller SPIFFE, resource type/key, and verb. Scopes, roles, raw subject, and Resource.Attrs never reach the wire. That matters because audit streams usually outlive the service they record, ship to SIEMs and third-party log platforms, and are read by people who don’t have a need-to-know for every claim on every token. Scopes in particular are close to bearer-equivalent — the stream carrying them is a credential store. If a downstream audit consumer genuinely needs an extra field, add it to the allowlist in one place, explicitly, with a name on the commit. Never serialize Input without going through MarshalJSON.

Two opinions I’ll die on. Audit storage must be append-only and outside the blast radius of the service it records. If the billing service’s audit log lives in the billing service’s database, a compromise of billing erases the evidence. Stream to Kafka with short retention-then-archive, or to an immutable cloud logging bucket. And audit the reasons, not just the outcome. A deny log without a reason is just a counter; a deny log with “cross-tenant access attempt, user tenant=A, resource tenant=B” is a security alert.

Alerts I configure on every AuthZ audit stream: deny-rate spike per user/tenant/service, any deny with reason cross-tenant, any deny with reason evaluator-error (that means the policy engine is flaking), and any permit for verbs in a small allowlist of sensitive actions (admin impersonation, key rotation, user deletion).

When the Policy Engine Is Down

The uncomfortable question. Your OPA sidecar crashloops. Your Cedar daemon wedges. Your in-code evaluator gets an unparseable policy bundle. What does the service do?

The only correct answer is deny. But “deny everything” means an outage. Teams under pressure reach for fail-open (“just this once”), ship it, and never go back. That is how you get an incident.

The right pattern is a small static allowlist of absolutely-must-work endpoints that fail to a safe baseline policy — typically read-only endpoints for the core flow — while everything else fails closed. Think of it as a read-only mode: when policy is unavailable, the service serves reads to authenticated users in their own tenant and nothing else. Writes, deletes, and admin actions hard-deny. This turns a policy outage into a degraded mode, not a security collapse.

// authz/fallback.go
package authz

import (
	"context"
	"fmt"
)

// FallbackEvaluator wraps a primary evaluator. If the primary errors or
// panics, it defers to a tiny in-code baseline. The baseline is deliberately
// restrictive and covers only the endpoints the business cannot run without.
type FallbackEvaluator struct {
	Primary  Evaluator
	Baseline Evaluator
	Alarm    func(err error)
}

func (f FallbackEvaluator) Evaluate(ctx context.Context, in Input) (Decision, error) {
	// A panic inside Primary must not escape as a 500 — it has to land in
	// Baseline the same as any other primary failure, otherwise policy
	// outages turn into availability incidents and the baseline is dead
	// code. The recover here is the shape that keeps invariants intact.
	primaryDec, primaryErr := func() (d Decision, e error) {
		defer func() {
			if p := recover(); p != nil {
				e = fmt.Errorf("primary evaluator panic: %v", p)
			}
		}()
		return f.Primary.Evaluate(ctx, in)
	}()
	if primaryErr == nil {
		return primaryDec, nil
	}
	// Alarm is wired to whatever the operator pager is. We do not trust
	// that implementation to be panic-free or non-blocking — a misbehaving
	// Alarm that panics would turn a policy-outage (already bad) into a
	// 500 cascade (worse). Contract: Alarm implementations MUST be
	// non-blocking and panic-free; this defer enforces the panic half.
	func() {
		defer func() { _ = recover() }()
		if f.Alarm != nil {
			f.Alarm(primaryErr) // page oncall; this should be rare and loud
		}
	}()
	return f.Baseline.Evaluate(ctx, in)
}

// StaticReadOnly is a baseline evaluator that permits read verbs on same-tenant
// resources only. Everything else denies.
type StaticReadOnly struct{}

func (StaticReadOnly) Evaluate(_ context.Context, in Input) (Decision, error) {
	if in.Verb != "read" {
		return Decision{Allow: false, Reason: "baseline: writes denied"}, nil
	}
	if in.Resource.TenantID != in.User.TenantID {
		return Decision{Allow: false, Reason: "baseline: cross-tenant denied"}, nil
	}
	return Decision{Allow: true, Reason: "baseline: same-tenant read"}, nil
}

The baseline evaluator must be boring Go code with its own tests, not a second OPA instance. The whole point is to be simple enough to be certainly correct during the exact incident where the complicated thing broke.

What I’d Actually Choose

For a new Go microservices system today:

Architecture: Admission middleware at every service’s entry, pulling user context from JWT (issued as covered in the auth article) and caller identity from mTLS SPIFFE SVIDs. Handlers receive only permitted requests. No if user.HasRole(...) in handlers, ever.

Policy model: RBAC for verbs, ABAC for contextual constraints (tenant isolation, ownership, resource state). Skip ReBAC unless a sharing graph is core to your product.

Policy engine: In-code Go rules for the first ten services. Switch to OPA when policy ownership needs to split between security and service-owner teams, or when you need to roll policy changes independently of service deploys. Cedar if it fits your domain and you want better tooling than Rego.

Service-to-service: mTLS with SPIFFE/SPIRE, either in a mesh (Istio, Linkerd) or via go-spiffe directly. Peer cert URI SAN is the caller identity; do not accept service names from headers. Token exchange or signed identity propagation for user context — never unsigned headers.

Scopes: Per-service namespacing, strict audience pinning, token exchange at service boundaries to narrow scope as calls propagate. No blanket admin:* tokens in circulation.

Audit: Every decision, permit and deny, with a specific reason, streamed to append-only storage outside the service’s blast radius. Alerts on deny-spikes, cross-tenant attempts, evaluator errors, and permits on sensitive verbs.

Fail-closed fallback: Baseline read-only same-tenant policy in-code, guarding the service against policy-engine outages. Pager-grade alerts when the baseline is engaged.

The mistake I see most: teams treat authorization as middleware to write once, wire up, and forget. It isn’t. Authorization is a living system — roles drift, policies rot, services change their resource models, attackers probe at the seams between services. Put someone on the hook for reviewing roles and policies quarterly. Kill unused roles on sight. Read your audit logs. The systems that stay secure are the ones whose owners keep looking at them; there is no configure-and-walk-away path that ends well.
