The kubectl-from-laptop Era
I used to deploy Kubernetes manifests by running kubectl apply from my laptop. It worked until it didn’t — configuration drift, no audit trail, and the inevitable “who changed that in production?” conversation. One Friday afternoon, someone ran a kubectl apply with a stale local copy and overwrote a config change that had been made directly on the cluster. We spent the weekend figuring out what the correct state should be.
Switching to GitOps with Flux solved all of it. Git became the single source of truth. Every change went through a PR. The cluster reconciled itself automatically. And when something broke, git log told us exactly who changed what and when.
This post covers how I set it up and the patterns that work at scale.
What GitOps Actually Means
GitOps is a specific pattern, not a buzzword:
- Declarative configuration — the entire desired state of your cluster lives in Git
- Automated sync — an operator in the cluster continuously reconciles actual state to match Git
- Drift detection — if someone runs kubectl edit directly, the operator reverts it
- Pull-based deployment — the cluster pulls its own config, rather than CI pushing to it
The practical result: infrastructure changes follow the exact same workflow as application code. PRs, reviews, approvals, merge, automatic deployment. No more SSH-ing into boxes or running kubectl from laptops.
Why Flux
I chose Flux over ArgoCD for a few reasons: it’s a CNCF graduated project, it uses a pull-based model (the cluster watches Git, not the other way around), and it composes well with Kustomize and Helm without requiring a UI or a separate server. ArgoCD is fine too — but Flux fits better when you want GitOps as infrastructure, not as an application you manage.
Setting Up Flux
You need a Kubernetes cluster (1.26+), kubectl configured, and the Flux CLI installed. I strongly recommend a dedicated Git repository for infrastructure config, separate from application code — different access controls, different review cadence.
Bootstrap
# Prompt for the token so it doesn't land in shell history
read -rs GITHUB_TOKEN && export GITHUB_TOKEN
export GITHUB_OWNER=<your-org-or-username>
export GITHUB_REPO=flux-infrastructure
# Bootstrap Flux (--personal=true for a user account, false for an org)
flux bootstrap github \
--owner=$GITHUB_OWNER \
--repository=$GITHUB_REPO \
--branch=main \
--path=./clusters/production \
--personal=false \
--private=true
A word on that GITHUB_TOKEN: never commit it, never let it land in shell history you sync, and never reuse a long-lived org-wide token. Use read -rs (as above) or gh auth token rather than putting the value on the command line. It should be a short-lived fine-grained PAT scoped to exactly the one repository, with the minimum permissions bootstrap needs (contents: read/write, administration: write for deploy keys). Treat it as CI-only — generate it, run bootstrap, revoke it. The moment flux bootstrap returns, scrub it from the shell:
unset GITHUB_TOKEN
# then revoke the PAT in GitHub settings
This command:
- Creates a repository if it doesn’t exist
- Generates a deploy key with read/write access
- Installs Flux components in your cluster
- Configures Flux to sync the specified path in your repository
For production, bootstrap Flux itself with a read-only deploy key and add image automation separately. The two needs are different: Flux core only reads from Git, while image-automation writes tag updates back. Bundling them onto a single read/write key widens the blast radius of every Flux component to “can rewrite the GitOps repo”:
# Step 1: bootstrap Flux core with a READ-ONLY deploy key.
# No image-automation components here, so --read-write-key=false is honest.
flux bootstrap github \
--owner=$GITHUB_OWNER \
--repository=$GITHUB_REPO \
--branch=main \
--path=./clusters/production \
--personal=false \
--private=true \
--read-write-key=false \
--network-policy=true
Read-only is the right default for Flux core because the blast radius of a leaked write key is much larger. A leaked read key exposes your manifests — embarrassing, but the attacker can’t change what runs in your cluster through that path. A leaked write key lets an attacker push malicious manifests back to the GitOps repo, where Flux will then obediently reconcile them onto your cluster. You’ve just given them remote code execution on production via a deploy key that nobody rotates because “it’s just a Git key”.
Add image automation as a second step. The hardened path is a separate deploy key attached to a second GitRepository that only ImageUpdateAutomation references, so Flux core keeps its read-only key and the write permission is isolated to one controller:
# Step 2: generate a dedicated write-capable deploy key for image automation,
# attach it to the repo manually (GitHub -> Settings -> Deploy keys, allow write),
# then create a Secret + GitRepository that ONLY the ImageUpdateAutomation uses.
umask 077
WORK="$(mktemp -d)"
ssh-keygen -t ed25519 -N '' -f "$WORK/flux-image-automation" -C flux-image-automation
# Paste $WORK/flux-image-automation.pub into GitHub as a deploy key with WRITE access.
# Pin GitHub's SSH host keys against the published fingerprints before trusting
# the output of ssh-keyscan. TOFU ("we'll verify later") is not verification --
# if the first scan is hijacked, the write-capable deploy key talks to the
# attacker's host and pushes tag updates into an attacker-controlled mirror.
# Current fingerprints are published at:
# https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/githubs-ssh-key-fingerprints
EXPECTED_RSA="SHA256:uNiVztksCsDhcc0u9e8BujQXVUpKZIDTMczCvj3tD2s"
EXPECTED_ED25519="SHA256:+DiY3wvvV6TuJJhbpZisF/zLDA0zPMSvHdkr4UvCOqU"
ssh-keyscan -t rsa,ed25519 github.com | tee "$WORK/kh" \
| ssh-keygen -lf - | grep -E "$EXPECTED_RSA|$EXPECTED_ED25519" \
|| { echo "host key mismatch -- refusing to build known_hosts Secret"; exit 1; }
# Sanity-check the file isn't empty (network failure, DNS hijack to a silent host).
# Note: don't redirect ssh-keyscan stderr to /dev/null -- that hides the failure
# signal, and the only reason we can detect a silent-host scenario is by checking
# output non-emptiness here.
[ -s "$WORK/kh" ] || { echo "ssh-keyscan produced no output"; exit 1; }
kubectl -n flux-system create secret generic flux-image-automation-key \
--from-file=identity="$WORK/flux-image-automation" \
--from-file=identity.pub="$WORK/flux-image-automation.pub" \
--from-file=known_hosts="$WORK/kh"
# Tear down the ephemeral working directory. Plain rm is the right tool here;
# `shred` is ineffective on journaled/COW filesystems -- see the GPG section
# below for the full rationale.
rm -rf "$WORK"
On rm vs shred: shred -u was the old advice, but it is effectively a no-op
on ext4 with journaling, btrfs, zfs, APFS, and any tmpfs-backed path — the
filesystem may redirect writes to new blocks, leaving the original key material
in freed extents. The meaningful controls are (1) generate the key inside an
$(mktemp -d) with umask 077 so intermediate files never sit in /tmp with
world-readable defaults, (2) delete the directory after use, and (3) rely on
full-disk encryption at rest on the workstation running this script. For
production automation, generate the key inside an HSM-backed keyring or an
ephemeral CI container that’s torn down after the Secret is created, not on a
long-lived laptop.
Then commit a second GitRepository that uses this write key, install the image automation controllers (flux install --components-extra=image-reflector-controller,image-automation-controller --export > ... && git commit), and reference the new GitRepository from ImageUpdateAutomation.sourceRef. The existing flux-system GitRepository stays read-only.
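That second GitRepository is a small manifest; a sketch, reusing the flux-image-automation-key Secret created above and the source name the ImageUpdateAutomation later references (the file path is illustrative):

```yaml
# clusters/production/flux-system/image-automation-source.yaml (illustrative path)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system-image-automation
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: main
  url: ssh://git@github.com/organization/flux-infrastructure
  secretRef:
    name: flux-image-automation-key  # write-capable deploy key from step 2
```

Only ImageUpdateAutomation should reference this source; everything else keeps reconciling through the read-only flux-system GitRepository.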
The pragmatic shortcut — if you’d rather not manage two keys — is to re-run flux bootstrap with --read-write-key=true and the image-automation components; but be honest about what you’re doing: that rotates the single Flux deploy key to read/write, so every Flux controller (source-controller, kustomize-controller, image-automation-controller) gets write access to the repo, not just image-automation. That’s a wider blast radius than the two-keys path above. For a homelab or dev cluster, fine. For production, do the two keys.
Branch Protection Is a Prerequisite, Not a Nice-to-Have
Everything GitOps claims — audit trail, single source of truth, reviewed changes — collapses if main is not protected. Before you merge the bootstrap PR, configure on main:
- Required pull request reviews (at least 1, 2 for production repos).
- Required signed commits.
- Required status checks (lint, policy, schema validation).
- Restrict who can push directly to the branch (ideally: nobody).
- No force-push, no branch deletion, linear history on.
Without these, any developer PAT with write access becomes a path straight to production: commit to main, and Flux reconciles it within minutes. The attack window is exactly your sync interval (10 minutes in the examples below); assume an attacker knows it.
Verify Commit Signatures at Reconcile Time
Branch protection enforces signed commits in GitHub. Flux can verify those signatures independently before applying anything, which closes the gap where a compromised GitHub account or a bypass of branch protection (admin override, stale webhooks) slips an unsigned commit onto main:
# clusters/production/flux-system/gotk-sync.yaml (excerpt)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: flux-system
namespace: flux-system
spec:
interval: 1m
ref:
branch: main
url: ssh://git@github.com/organization/flux-infrastructure
secretRef:
name: flux-system # deploy-key Secret provisioned by `flux bootstrap`
verify:
    mode: HEAD  # v1 API enum is HEAD/Tag/TagAndHEAD; lowercase "head" is the old v1beta2 spelling
secretRef:
name: git-signing-keys
Note the ssh:// URL: flux bootstrap github provisions an SSH deploy key by default, so the GitRepository authenticates over SSH, not HTTPS. If you’re using HTTPS (token-based) auth, replace the URL with https://... and reference a token Secret — the two must match.
git-signing-keys is a Secret holding the public keys (GPG, or cosign/gitsign public key) of committers allowed to change production. If a commit on main isn’t signed by a key in that Secret, Flux refuses to reconcile and raises an alert. This is the trust anchor that makes “git is the source of truth” actually true — without it, Flux trusts whoever wrote the last commit, which is too much trust to hand an external system.
Secrets Do Not Go in Git
The single most common GitOps failure mode: someone commits a Helm values.yaml with a database password, notices a week later, force-pushes to remove it (which doesn’t actually remove it), and now you have a plaintext credential in git history that your compliance auditor will find. Don’t be that team.
Pick one of these and standardize early:
- SOPS (Mozilla): encrypts YAML fields with KMS/age keys; Flux decrypts at reconcile time via the kustomize-controller’s decryption config. My default for small-to-midsize teams — clear ergonomics, files are readable diffs, keys rotate cleanly.
- External Secrets Operator: git holds references; actual secrets live in Vault / AWS Secrets Manager / GCP Secret Manager. My default when there’s already a secrets backend in place, or when non-Flux workloads need the same secret.
- Sealed Secrets (Bitnami): encrypts Secrets with a cluster-bound public key. Simple to start, but the controller’s key is the single root of trust and rotation is awkward. I’d pick SOPS over this today.
What I won’t do: commit plaintext secrets “just for dev.” Dev environments leak. Treat the repo as untrusted for secret material from day one.
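With SOPS, the Flux-side wiring is a single stanza on the Kustomization; a sketch, assuming an age key delivered out-of-band as a sops-age Secret (the Secret name is a convention, not a requirement — the data key just has to end in .agekey):

```yaml
# clusters/production/apps.yaml (excerpt) -- kustomize-controller decrypts
# SOPS-encrypted fields in memory before applying; the age private key
# lives only in this Secret, never in Git
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  # ...interval, path, sourceRef as in the apps Kustomization below...
  decryption:
    provider: sops
    secretRef:
      name: sops-age  # Secret with an age.agekey entry, created out-of-band
```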
Repository Structure
This is the structure I use for multi-environment deployments. It follows the Kustomize bases/overlays pattern:
├── clusters/
│ ├── development/
│ │ ├── flux-system/ # Flux components for dev
│ │ ├── infrastructure.yaml # Infrastructure for dev
│ │ └── apps.yaml # Applications for dev
│ ├── staging/
│ │ ├── flux-system/ # Flux components for staging
│ │ ├── infrastructure.yaml # Infrastructure for staging
│ │ └── apps.yaml # Applications for staging
│ └── production/
│ ├── flux-system/ # Flux components for prod
│ ├── infrastructure.yaml # Infrastructure for prod
│ └── apps.yaml # Applications for prod
├── infrastructure/
│ ├── base/ # Base infrastructure definitions
│ │ ├── ingress-nginx/ # Ingress controller
│ │ ├── cert-manager/ # Certificate management
│ │ └── monitoring/ # Prometheus and Grafana
│ └── overlays/ # Environment-specific overrides
│ ├── development/
│ ├── staging/
│ └── production/
└── apps/
├── base/ # Base application definitions
│ ├── app1/
│ └── app2/
└── overlays/ # Environment-specific overrides
├── development/
├── staging/
└── production/
The key insight: base/ defines the common configuration, overlays/ applies environment-specific patches. You define ingress-nginx once and override replica counts and resource limits per environment.
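Concretely, each overlay directory is a kustomization.yaml that pulls in the base and layers patches on top. A minimal production overlay for the infrastructure tree, under the assumption that the patch file contains a full HelmRelease fragment (so kustomize can infer the target from its kind and name):

```yaml
# infrastructure/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/ingress-nginx
  - ../../base/cert-manager
  - ../../base/monitoring
patches:
  # Strategic-merge the production overrides onto the base HelmRelease
  - path: ingress-nginx/release-patch.yaml
```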
Core Workflows
Infrastructure First
Infrastructure components (ingress, cert-manager, monitoring) deploy before applications. Start by declaring your Helm repository sources:
# infrastructure/base/sources/helm-repositories.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: ingress-nginx
namespace: flux-system
spec:
interval: 1h
url: https://kubernetes.github.io/ingress-nginx
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: cert-manager
namespace: flux-system
spec:
interval: 1h
url: https://charts.jetstack.io
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: prometheus-community
namespace: flux-system
spec:
interval: 1h
url: https://prometheus-community.github.io/helm-charts
Then define the ingress controller HelmRelease. Chart versions in this post are illustrative — pin to a current release that you’ve cross-checked against the project’s security advisories before applying to a cluster:
# infrastructure/base/ingress-nginx/release.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: ingress-nginx
namespace: ingress-nginx
spec:
interval: 1h
chart:
spec:
chart: ingress-nginx
version: "4.0.13"
sourceRef:
kind: HelmRepository
name: ingress-nginx
namespace: flux-system
values:
controller:
metrics:
enabled: true
serviceMonitor:
enabled: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Production gets more resources and autoscaling via an overlay patch:
# infrastructure/overlays/production/ingress-nginx/release-patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: ingress-nginx
namespace: ingress-nginx
spec:
values:
controller:
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 80
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
A Flux Kustomization ties it all together and adds health checks:
# clusters/production/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: infrastructure
namespace: flux-system
spec:
interval: 10m
path: ./infrastructure/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: flux-system
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: ingress-nginx-controller
namespace: ingress-nginx
The healthChecks field is critical — Flux won’t report the Kustomization as ready until the ingress controller deployment is actually running. Without health checks, you get false positives.
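Two adjacent Kustomization v1 fields are worth setting alongside healthChecks: timeout bounds how long Flux waits before marking the reconciliation failed, and wait: true waits for readiness of every applied resource (note that when wait is set, the explicit healthChecks list is superseded):

```yaml
# clusters/production/infrastructure.yaml (excerpt)
spec:
  interval: 10m
  timeout: 5m  # fail the reconciliation instead of hanging on a stuck rollout
  wait: true   # require readiness of ALL applied resources (supersedes healthChecks)
```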
Application Deployment
Applications follow the same pattern. The important bit is the dependsOn field that ensures infrastructure deploys first:
# apps/base/app1/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app1
namespace: apps
spec:
replicas: 2
selector:
matchLabels:
app: app1
template:
metadata:
labels:
app: app1
spec:
containers:
- name: app1
image: ghcr.io/organization/app1:v1.0.0
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Production scales up via overlay:
# apps/overlays/production/app1/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app1
namespace: apps
spec:
replicas: 5
The Flux Kustomization for apps references infrastructure as a dependency:
# clusters/production/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: apps
namespace: flux-system
spec:
interval: 5m
path: ./apps/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: flux-system
dependsOn:
- name: infrastructure
dependsOn: infrastructure means Flux won’t even attempt to deploy apps until infrastructure health checks pass. This prevents the common failure mode of apps starting before their ingress controller or cert-manager is ready.
Automated Image Updates
This is one of Flux’s most powerful features and the one that surprised me most. Flux can watch a container registry, detect new image tags matching a policy, update the manifests in Git, and apply the changes — fully automated.
Configure an image repository to scan:
# apps/base/app1/image-repository.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
name: app1
namespace: flux-system
spec:
image: ghcr.io/organization/app1
interval: 1m
Define a semver policy for which tags to accept. Keep the range tight — patch-only for anything that touches production, so a backdoored minor release can’t auto-deploy before a human reads the changelog:
# apps/base/app1/image-policy.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
name: app1
namespace: flux-system
spec:
imageRepositoryRef:
name: app1
policy:
semver:
# Unambiguous explicit bounds. `~` and `^` semantics drift across
# toolchains (npm, Composer, Flux, Helm) -- spell out the range you
# actually mean so the policy doesn't silently change if the library
# Flux uses for semver parsing swaps interpretation.
range: '>=1.2.0 <1.3.0' # patches only (1.2.x), no minor bumps
Auto-promoting an unsigned image from a registry is the supply-chain pattern that produced Codecov and SolarWinds-class incidents — don’t skip cosign verification. Important: ImageRepository has no native verify field. Flux verifies signatures natively only on OCIRepository (used for Helm charts and OCI-packaged manifests). For container images scanned by ImageRepository, verification must be enforced at admission time — the reflector will happily discover an unsigned tag otherwise.
Two concrete options. Pick one; don’t just add a “(with verification)” comment and call it done.
Option A: co-locate a Kyverno ClusterPolicy with a verifyImages rule. This is the admission-time gate that rejects any unsigned image when Flux applies the Deployment, regardless of how the tag got into the manifest:
# infrastructure/base/policy/verify-app1-signatures.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: verify-app1-signatures
spec:
validationFailureAction: Enforce
webhookTimeoutSeconds: 30
rules:
- name: verify-cosign-signature
match:
any:
- resources:
# Match at the controller level too, otherwise `mutateDigest: true`
# rewrites the Pod spec only after the controller has already
# created a Pod from a mutable tag -- the Deployment/StatefulSet
# in etcd still references the tag, and the next rollout fetches
# whatever image the tag points to at that moment. Including the
# controller kinds makes the digest pin happen on the template,
# which is the object Flux actually reconciles.
kinds:
- Pod
- Deployment
- StatefulSet
- DaemonSet
- ReplicaSet
- Job
- CronJob
verifyImages:
- imageReferences:
- "ghcr.io/organization/app1:*"
failureAction: Enforce
mutateDigest: true # pin resolved digest into the Pod spec
verifyDigest: true
required: true
attestors:
- entries:
- keys:
publicKeys: |-
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...your cosign pubkey...
-----END PUBLIC KEY-----
rekor:
url: https://rekor.sigstore.dev
With this in place, the existing ImageRepository stays as-is; Kyverno rejects any Pod whose image lacks a valid cosign signature from the listed key, so ImageUpdateAutomation can’t land an unsigned tag into production even if the registry is compromised.
Option B: switch chart/manifest sources to OCIRepository with native cosign verify. This works for Helm charts and OCI-packaged manifests (not for container images consumed via ImageRepository):
# infrastructure/base/sources/app1-oci.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
name: app1-manifests
namespace: flux-system
spec:
interval: 5m
url: oci://ghcr.io/organization/app1-manifests
ref:
semver: ">=1.2.0 <1.3.0" # explicit bounds; see ImagePolicy note on `~`/`^` ambiguity
verify:
provider: cosign
secretRef:
name: cosign-pub # Secret containing cosign.pub
For the container-image path this article uses (ImageRepository + ImageUpdateAutomation), Option A is the required control. Option B is the right call if you’re already packaging manifests as OCI artifacts.
And configure the automation to update manifests. For production, don’t push directly to main — commit to a separate branch and open a PR, so the human review and required-signatures gate still apply:
# apps/base/app1/image-update.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
name: app1
namespace: flux-system
spec:
interval: 1h
sourceRef:
kind: GitRepository
# Points at the write-capable GitRepository from the two-keys bootstrap path,
# NOT the read-only flux-system GitRepository. Only image-automation holds
# the write key; other controllers reconcile through flux-system (read-only).
name: flux-system-image-automation
git:
checkout:
ref:
branch: main
commit:
author:
email: fluxcdbot@users.noreply.github.com
name: fluxcdbot
signingKey:
secretRef:
name: flux-bot-gpg
messageTemplate: 'Update app1 to {{.NewTag}}'
push:
# Flux pushes to this branch but does NOT open the PR itself.
# Wire up one of: a GitHub Actions workflow triggered on push to
# `flux-image-updates` that runs `gh pr create --base main`, or a
# Flux Notification-controller Provider + Alert that calls a webhook
# which opens the PR. Without either, the branch accumulates commits
# that never reach main and the production rollout stalls silently.
branch: flux-image-updates # production: separate branch + PR
update:
path: ./apps/base/app1
strategy: Setters
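One wiring detail the Setters strategy depends on: the automation only rewrites fields annotated with an image-policy marker comment. The app1 Deployment from earlier needs its image line marked, or the bot will find nothing to update and commit nothing:

```yaml
# apps/base/app1/deployment.yaml (excerpt)
spec:
  containers:
    - name: app1
      # Marker format is "<namespace>:<ImagePolicy name>" -- this points at
      # the app1 ImagePolicy in flux-system defined above
      image: ghcr.io/organization/app1:v1.0.0 # {"$imagepolicy": "flux-system:app1"}
```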
The signingKey.secretRef: flux-bot-gpg is a Secret holding the bot’s GPG private key. Create it out-of-band (it must not be committed) and add the corresponding public key to the git-signing-keys Secret referenced by GitRepository.verify, otherwise Flux will reject the bot’s own commits:
# Generate a dedicated GPG key for the bot. The key has no passphrase because
# a non-interactive controller can't supply one; compensate by keeping the
# private material inside an ephemeral, isolated GNUPGHOME that we delete
# immediately after creating the Kubernetes Secret. For production, generate
# this key in an HSM-backed keyring (smartcard, KMS-backed pkcs11, cloud HSM)
# or inside an ephemeral CI container that's torn down after the Secret exists.
# Don't do this on a shared laptop.
umask 077
EPHEMERAL_HOME="$(mktemp -d)"
export GNUPGHOME="$EPHEMERAL_HOME"
# Expiration: 1y. Automation keys that "never" expire outlive the team that
# owns them, and a leak discovered two years later still reconciles commits.
# Rotation procedure: every ~10 months, generate a new bot key in a fresh
# ephemeral home, append its public key to git-signing-keys (ADDITIVE -- see
# below), cut over ImageUpdateAutomation to the new flux-bot-gpg Secret, then
# remove the old public key from git-signing-keys once no commits signed by
# the old key remain on main.
gpg --batch --passphrase '' --quick-gen-key 'fluxcdbot <fluxcdbot@users.noreply.github.com>' default default 1y
KEY_ID=$(gpg --list-secret-keys --with-colons fluxcdbot@users.noreply.github.com | awk -F: '/^sec:/ {print $5; exit}')
# Private key -> Secret used by ImageUpdateAutomation to sign commits. Flux
# expects the armored private key under the `git.asc` key of this Secret.
# Streamed directly into kubectl; never touches /tmp or a persistent FS path.
kubectl create secret generic flux-bot-gpg \
  --namespace=flux-system \
  --from-literal="git.asc=$(gpg --export-secret-keys --armor "$KEY_ID")"
# Public key -> ADD to git-signing-keys so GitRepository.verify accepts bot
# commits. This is the critical part: the existing git-signing-keys Secret
# already contains every human committer's public key. A naive `create secret
# ... --dry-run=client | kubectl apply -f -` REPLACES the Secret with only
# fluxbot.asc, evicting every human key, after which Flux rejects every
# human-signed commit and the cluster stops reconciling.
#
# Correct pattern: patch the existing Secret so the bot key is added alongside
# the humans' keys. `kubectl patch` with a strategic merge on `data` keeps
# every other key intact.
FLUXBOT_PUB_B64=$(gpg --export --armor "$KEY_ID" | base64 -w0)
kubectl patch secret git-signing-keys \
--namespace=flux-system \
--type=strategic \
-p "{\"data\":{\"fluxbot.asc\":\"$FLUXBOT_PUB_B64\"}}"
# If git-signing-keys does not yet exist, create it once from a directory that
# holds every committer's .asc file (bot + humans) and commit that directory
# to a sealed-secrets / SOPS workflow:
# kubectl -n flux-system create secret generic git-signing-keys \
# --from-file=keys/
# Never rebuild the Secret from scratch in a script -- the pattern must be
# additive.
# Tear down the ephemeral GNUPGHOME. Plain `rm -rf` is the right tool here;
# `shred` is ineffective on journaled/COW filesystems (ext4, btrfs, zfs, APFS)
# and on tmpfs it's a no-op. Rely on disk encryption at rest for the residue.
unset GNUPGHOME
rm -rf "$EPHEMERAL_HOME"
If you’d rather not manage a bot signing key, skip signingKey entirely — but then exclude the bot from git-signing-keys and route every bot commit through the PR flow. The human-signed merge commit becomes what GitRepository.verify sees on main, so the trust anchor is the reviewer’s key, not the bot’s.
Here the bot signs its own commits (so commit-signature verification on GitRepository still holds), pushes to flux-image-updates, and a CI job or GitHub automation opens a PR against main. The PR runs policy checks, waits for review, and only then merges — at which point Flux reconciles the change. For staging and dev I'll still set push.branch: main and skip the PR, because the whole point of lower environments is fast feedback. For production, the extra hop is worth it. The Git history shows exactly when each image version was deployed, and every deployment is traceable to an approved PR.
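The PR-opening automation flagged in the push comment can be a single workflow; a sketch, assuming the default workflow token is allowed to open PRs in your org settings (workflow and job names are mine):

```yaml
# .github/workflows/flux-image-pr.yaml
name: open-image-update-pr
on:
  push:
    branches: [flux-image-updates]
permissions:
  contents: read
  pull-requests: write
jobs:
  open-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          # gh exits non-zero if a PR for this branch already exists; swallow
          # that case so repeated bot pushes don't fail the workflow
          gh pr create --base main --head flux-image-updates \
            --title "Flux image updates" \
            --body "Automated image tag updates; review before merge." \
            || echo "PR already open"
        env:
          GH_TOKEN: ${{ github.token }}
```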
Advanced Patterns
Multi-Tenancy
When multiple teams share a cluster, each team gets its own namespace with RBAC isolation — and, critically, their Flux Kustomization runs as a scoped ServiceAccount, not as the default flux-system controller SA. This is the point where most GitOps multi-tenancy setups fail quietly:
# tenants/team-a/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: team-a
labels:
# Enforce restricted PodSecurity so a wildcard Role can't spawn privileged pods
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
---
# tenants/team-a/reconciler-sa.yaml
# This is the SA Flux impersonates when reconciling team-a's manifests.
apiVersion: v1
kind: ServiceAccount
metadata:
name: team-a-reconciler
namespace: team-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: team-a-reconciler
namespace: team-a
subjects:
- kind: ServiceAccount
name: team-a-reconciler
namespace: team-a
roleRef:
kind: Role
name: team-a-namespace-admin
apiGroup: rbac.authorization.k8s.io
---
# tenants/team-a/netpol-default-deny.yaml
# Default-deny ingress + egress for the namespace. Required mitigation (2)
# referenced by the Role below. Without this, a compromised tenant pod can
# reach the cluster API, cloud metadata (169.254.169.254), and sibling tenants.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
namespace: team-a
spec:
podSelector: {}
policyTypes: ["Ingress", "Egress"]
# No ingress/egress rules = deny-all. Tenants add explicit allow rules
# (DNS, specific services) on top of this baseline.
---
# tenants/team-a/rbac.yaml
# Namespace-scoped Role. Wildcards here are acceptable ONLY because all
# three mitigations are in place:
# (1) PodSecurity=restricted blocks privileged pods (Namespace labels above),
# (2) default-deny NetworkPolicy above caps lateral/egress movement,
# (3) the RoleBinding above is the ONLY binding to this Role.
# If you can't guarantee all three, enumerate verbs instead of "*" and
# explicitly EXCLUDE "escalate", "bind", and "impersonate" -- those three
# let a holder of this Role grant themselves additional permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: team-a-namespace-admin
namespace: team-a
rules:
- apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]
The critical field on the Kustomization is spec.serviceAccountName. Without it, Flux reconciles tenant manifests using the kustomize-controller SA in flux-system, which has cluster-admin. That means any manifest team-a commits — including a ClusterRoleBinding granting themselves cluster-admin — gets applied with cluster-admin privileges. The namespace boundary becomes cosmetic:
# tenants/team-a/flux.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: team-a
namespace: flux-system
spec:
interval: 10m
path: ./tenants/team-a
prune: true
sourceRef:
kind: GitRepository
name: flux-system
targetNamespace: team-a
# This is the line that makes multi-tenancy real.
# Flux will impersonate this SA; any resource it tries to create
# outside team-a's RBAC is rejected by the API server.
serviceAccountName: team-a-reconciler
With serviceAccountName set, the API server enforces the tenant boundary for you: if team-a commits a ClusterRoleBinding, Flux tries to apply it as system:serviceaccount:team-a:team-a-reconciler, the API server says no, and reconciliation fails with a permission error. That failure is visible, auditable, and non-damaging. Skipping this field is the single most common Flux multi-tenancy mistake, and it turns a “tenant namespace” into a decorative label.
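Two controller flags harden this further, straight from Flux's multi-tenancy lockdown: --no-cross-namespace-refs=true stops tenants referencing Sources in other namespaces, and --default-service-account gives any Kustomization that omits serviceAccountName a low-privilege SA instead of silently running as the controller. A sketch of the patch in the bootstrap kustomization.yaml, shown for kustomize-controller (apply the same patch to helm-controller):

```yaml
# clusters/production/flux-system/kustomization.yaml (excerpt)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --no-cross-namespace-refs=true
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --default-service-account=default
    target:
      kind: Deployment
      name: kustomize-controller
```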
Promotion Workflows
For regulated environments, changes must flow through dev, staging, then production. I’ve used both branch-based and path-based promotion, and I strongly prefer path-based.
Branch-based promotion (each environment syncs from a different branch):
# clusters/development/flux-system/gotk-sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: flux-system
namespace: flux-system
spec:
interval: 1m
ref:
branch: development
url: ssh://git@github.com/organization/flux-infrastructure
secretRef:
name: flux-system
Path-based promotion (all environments on main, different paths) is simpler and what I recommend:
# clusters/development/flux-system/gotk-sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: flux-system
namespace: flux-system
spec:
interval: 1m
ref:
branch: main
url: ssh://git@github.com/organization/flux-infrastructure
secretRef:
name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: flux-system
namespace: flux-system
spec:
interval: 10m
path: ./clusters/development
prune: true
sourceRef:
kind: GitRepository
name: flux-system
Promotion is a PR that copies config from one path to another. With branch-based promotion, you end up with merge conflicts and cherry-pick headaches. Concrete example: you land a dev-only experiment on the development branch that touches values.yaml for ingress-nginx. Two weeks later you want to promote an unrelated fix on the same file to staging and production. Now you’re cherry-picking individual commits across three long-lived branches, each of which has drifted independently, and Git happily gives you a three-way conflict every time someone forgot which branch was ahead. Path-based keeps everything on main and promotion is just a file copy — much cleaner.
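To make "promotion is just a file copy" concrete, here is a self-contained sketch using a scratch repository (fabricated with mktemp so the commands run anywhere; paths mirror the layout above):

```shell
# Build a throwaway repo so the promotion commands are runnable in isolation
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main .
git config user.email bot@example.invalid && git config user.name bot
mkdir -p apps/overlays/staging/app1 apps/overlays/production/app1
printf 'spec:\n  replicas: 5\n' > apps/overlays/staging/app1/deployment-patch.yaml
git add -A && git commit -qm "seed"
# The promotion itself: copy the staging patch to the production path,
# branch, commit -- then a human opens the PR against main
git switch -qc promote-app1
cp apps/overlays/staging/app1/deployment-patch.yaml \
   apps/overlays/production/app1/deployment-patch.yaml
git add apps/overlays/production/app1
git commit -qm "Promote app1: staging -> production"
# The PR diff is exactly the one copied file
git diff --name-only main..promote-app1
# -> apps/overlays/production/app1/deployment-patch.yaml
```

No cherry-picks, no three-way merges: reviewers see one file appear in the production path and approve or reject it.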
Policy Enforcement with Kyverno
GitOps handles how changes get deployed. But you also need to enforce what can be deployed. I pair Flux with Kyverno for policy enforcement.
Kyverno is a cluster-wide admission controller — a compromised Kyverno chart means an attacker can rewrite or silently bypass every admission policy, which is effectively full-cluster takeover. For admission-controller charts, I pull them as a cosign-verified OCI artifact rather than a traditional HelmRepository, so Flux refuses to reconcile an unsigned chart:
```yaml
# infrastructure/base/policy/kyverno.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: kyverno
  namespace: flux-system
spec:
  interval: 1h
  url: oci://ghcr.io/kyverno/charts/kyverno
  ref:
    semver: "3.2.x"  # illustrative -- pin to a version you've reviewed
  verify:
    provider: cosign
    secretRef:
      name: kyverno-cosign-pub  # Secret holding the Kyverno project's cosign.pub
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kyverno
  namespace: kyverno
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: kyverno
    namespace: flux-system
  values:
    admissionController:
      replicas: 3
```
Versions above are illustrative — cross-check current security advisories and the project’s published cosign key before pinning. For non-cluster-wide charts (app-level releases), a regular HelmRepository is usually fine; for anything that gates admission, signs images, or holds cluster-admin, OCIRepository + spec.verify is the baseline.
Kyverno evaluates policies at admission time — resources that violate your rules are rejected before they're ever created:
```yaml
# infrastructure/base/policy/require-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce  # Pascal-case; lowercase is deprecated in Kyverno v1.10+
  rules:
    - name: require-team-label
      match:
        resources:
          kinds:
            - Deployment
            - Service
      validate:
        message: "The label 'team' is required"
        pattern:
          metadata:
            labels:
              team: "?*"
```
If someone submits a Deployment without a team label, the admission controller rejects it. Combined with Flux, this means the GitOps reconciliation will fail and Flux will report the error — giving you a clear signal that the manifest in Git doesn’t comply.
Monitoring
Flux exposes Prometheus metrics. I scrape them with a PodMonitor:
```yaml
# infrastructure/base/monitoring/flux-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      # Match every Flux controller, not just one of them
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom
      interval: 15s
```
The metrics I alert on: reconciliation failures (something in Git doesn’t apply cleanly), reconciliation duration spikes (the cluster is struggling to converge), and source fetch failures (Git or Helm repo is unreachable). A Grafana dashboard showing these three things gives you full visibility into your GitOps pipeline health.
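The first of those alerts can be sketched as a PrometheusRule. This uses Flux's `gotk_reconcile_condition` series; the exposed metric names have changed across Flux releases, so treat this as a starting point and cross-check against the metrics your version actually emits:

```yaml
# infrastructure/base/monitoring/flux-alerts.yaml -- illustrative
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-alerts
  namespace: monitoring
spec:
  groups:
    - name: flux
      rules:
        - alert: FluxReconciliationFailure
          # Ready=False sustained for 10 minutes on any Flux-managed resource
          expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (namespace, name, kind) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.kind }}/{{ $labels.name }} in {{ $labels.namespace }} has failed to reconcile for 10m"
```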
Lessons Learned
After running Flux in production across multiple clusters, here’s what I wish I’d known at the start:
Start with prune: false. When you’re first setting up Flux, disable pruning until you’re confident in your manifests. Pruning means Flux deletes resources that are no longer in Git — which is exactly what you want eventually, but terrifying when you’re still learning the repo structure. Enable it once you trust the workflow.
Pin your Helm chart versions. Never use version: "*" or omit the version field. A Helm chart upgrade that you didn’t review will eventually break something. Pin versions, update deliberately, test in staging first.
Use dependsOn liberally. Infrastructure before apps. CRDs before the controllers that use them. cert-manager before anything that needs TLS. The dependency graph is your safety net.
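A sketch of that ordering (names illustrative): the apps Kustomization declares a dependency on the infrastructure one, so Flux won't touch application manifests until the infrastructure layer reports Ready.

```yaml
# clusters/production/apps.yaml -- illustrative
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure  # CRDs, cert-manager, ingress land before any app
  interval: 10m
  path: ./apps/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```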
Path-based promotion over branch-based. I’ve tried both. Branch promotion creates merge conflicts and makes it hard to see the current state of all environments at once. Path-based keeps everything on main and promotion is just a file copy in a PR.
GitOps does not mean secure by default. The audit trail is only as good as your branch protection and commit-signature verification. The tenancy model is only as good as the reconciler ServiceAccount it impersonates. The image automation is only as safe as the signatures you verify. I’ve watched more than one team adopt Flux, skip all three, and treat the result as “we have GitOps now” — right up until a compromised developer token or a typosquatted image made the point for them. Budget a day for the security controls during bootstrap, not later.
The hardest part isn’t technical. It’s convincing teams to stop SSH-ing into boxes and running kubectl directly. The first time someone manually edits a resource and Flux reverts it, there will be frustration. That’s the system working as designed. Once people internalize that Git is the only way to make changes, everything gets better.
GitOps with Flux is the best operational pattern I’ve adopted in the last five years. The audit trail alone is worth it — but the real payoff is the confidence that what’s in Git is what’s running in your cluster. No drift, no surprises, no more weekend debugging sessions caused by stale local manifests.