Ship Audits Faster: Pragmatic Compliance for 99% Uptime Teams
Turn regulators into quiet stakeholders without slowing deploys.

Stop Treating Compliance As Paperwork; Make It a Feature

We’ve all been there: a compliance audit lands on the calendar, and suddenly we’re printing screenshots, rewriting policies, and promising never to run kubectl as admin again. The root problem isn’t that regulations exist; it’s that we treat compliance as paperwork glued on top of engineering, instead of as a product feature. Features are owned, tested, versioned, and observable. Paperwork is… well, paperwork. So let’s flip it. We define compliance as a latency budget on risk: keep security and governance within limits without slowing deploys below our service-level needs. If a control adds time, it had better decrease risk in a measurable way.

We start by making controls visible in the same way we expose error budgets. When a developer opens a PR, they should see not just tests and coverage, but also whether the change violates a control. When SREs tune autoscaling, they should be able to see whether logs still meet retention requirements. Treat every control like a testable spec: name it, codify it, and give it an owner. The owner isn’t a department; it’s a person who can say “yes” or explain exactly what must change.

We’re not chasing perfection. We’re aiming for repeatability. If we can run a build and deterministically learn whether we meet the baseline, we can ship faster. It’s the uncertainty that burns weekends. Reducing compliance to a set of pass/fail checks and automations turns it from mystery into an engineering problem we can actually solve.

Trace Controls to Data and Code, Not Departments

Compliance only helps if it protects something we care about. That “something” isn’t a meeting invite; it’s data and the systems that process it. Let’s map controls to data classification and code paths. Start with an inventory that ties data types to repositories, services, and environments. For each data class—public, internal, confidential, restricted—identify what controls apply: encryption, retention, access, monitoring, change approvals. The point isn’t to produce a massive spreadsheet; it’s to draw a straight line from a law or standard to a block of code, a Terraform module, or a pipeline step.
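
As a sketch of what that line-drawing can look like when it lives next to the code (the file name and fields below are our own convention, not a standard schema):

# data-inventory.yaml -- illustrative format; ties data classes to code and controls
- data_class: restricted          # e.g., cardholder data
  examples: [pan, cvv]
  services: [payments-api]
  repos: [acme/payments]
  environments: [prod]
  controls: [encryption-at-rest-cmk, tls-in-transit, quarterly-access-review, log-retention-400d]
- data_class: internal
  examples: [service_logs, metrics]
  services: [checkout, payments-api]
  environments: [prod, staging]
  controls: [log-retention-90d, centralized-logging]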

We prefer standards that already speak the language of controls. The catalog in NIST SP 800-53 gives us a structured vocabulary: access control, audit and accountability, configuration management, and so on. We make a short list of the controls that actually matter for our threat model and customers, then translate each into a “how would we test this?” statement. For example, “all data at rest is encrypted” becomes “every storage resource must have CMK encryption enabled, and every connection string in code must specify TLS.”
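
One way to codify that translation is a small control catalog checked into the repo, one file per control, each naming an owner and the automated checks that test it. The schema below is purely illustrative; only the control ID comes from NIST SP 800-53:

# controls/sc-28.yaml -- illustrative schema, not a standard format
id: SC-28                          # NIST SP 800-53: Protection of Information at Rest
statement: "All data at rest is encrypted"
test: "Every storage resource has CMK encryption enabled; every connection string specifies TLS"
owner: jane.doe@acme.example       # a person who can say yes, not a department
checks:
  - policy/terraform/encryption.rego       # evaluated against every plan in CI
  - scripts/check-connection-strings.sh    # flags non-TLS connection strings in code
exceptions: []                     # time-bound, ticketed, reviewed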

Finally, we draw the flow of the most sensitive data from ingress to egress. Any step where it’s stored or transformed is a control point. If we can’t point to a test or automation at that step, we’ve got a gap. When we talk to auditors, we lead with this map. It’s remarkable how many fewer questions we get when we speak in systems rather than policy binders.

Policy as Code That Fails Builds, Not People

If a control lives in a PDF, it will eventually die in production. We put controls in the same place as our tests: the CI/CD pipeline. Policy-as-code tools let us express rules in a way that our systems can evaluate on every commit. The goal is to give fast, actionable feedback. A developer shouldn’t need to read a compliance manual to learn that an S3 bucket is public; the pipeline should reject the change with a clear message.

Open Policy Agent (OPA) and similar frameworks are perfect for this. We write rules that inspect Terraform plans, Kubernetes manifests, or even Dockerfile contents. Critically, we scope rules so they enforce the baseline and leave room for exceptions through documented annotations and pull-request approvals. That way, we don’t box ourselves into a corner when an edge case appears. And we treat exceptions as tech debt: they expire, they carry a ticket, and they’re reviewed.

Here’s a tiny Rego example that fails the build when a Terraform plan turns off S3 public-access blocking, while still allowing a time-bound exception:

package terraform.s3

# Fail the build when public-ACL blocking is turned off and no documented
# exception tag is present.
deny[msg] {
  input.resource_type == "aws_s3_bucket_public_access_block"
  input.values.block_public_acls == false
  not input.values.tags["compliance-exception"]
  msg := sprintf("Public ACL blocking disabled: %v needs a compliance-exception tag with an expiry", [input.name])
}

# Exceptions are time-bound: once the expiry tag lapses, the build fails again.
deny[msg] {
  input.values.tags["compliance-exception"]
  time.now_ns() > time.parse_rfc3339_ns(input.values.tags["compliance-exception-expiry"])
  msg := sprintf("Exception expired for %v; re-review required", [input.name])
}

Put this in the build, fail fast with readable errors, and we’ll see developers fix issues before they hit production. It’s kinder than a flaming Slack thread at midnight.
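
Wiring that rule into the build can be a single pipeline job. Here’s a minimal sketch with the OPA CLI, assuming the Rego above lives under policy/ and that the Terraform plan JSON has been shaped into the per-resource input the rule expects:

# .github/workflows/policy.yaml -- a sketch; file paths and workflow names are ours
name: policy-checks
on: [pull_request]
jobs:
  terraform-policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: open-policy-agent/setup-opa@v2   # installs the opa CLI
      - run: terraform init -backend=false && terraform plan -out=tfplan
      - run: terraform show -json tfplan > tfplan.json
      # --fail-defined fails the step whenever any deny message is produced
      - run: opa eval --fail-defined -d policy/ -i tfplan.json "data.terraform.s3.deny[msg]"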

Kubernetes and Cloud Guardrails You Can Live With

We like cluster-level controls that don’t require developers to memorize yet another policy doc. In Kubernetes, Pod Security Standards and admission policies are our friends. They enforce least-privilege defaults automatically while letting us grant exceptions intentionally. We backstop that with cloud resource policies so that even if a risky container makes it through, the blast radius is small.

Let’s start with an admission policy that enforces non-root execution, read-only root filesystems, and dropping all Linux capabilities. This example is intentionally simple but useful in real life:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: baseline-security
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  # has() guards keep the expressions from erroring when securityContext is unset
  - expression: "has(object.spec.securityContext) && object.spec.securityContext.runAsNonRoot == true"
    message: "Pods must run as non-root"
  - expression: "object.spec.containers.all(c, has(c.securityContext) && c.securityContext.readOnlyRootFilesystem == true)"
    message: "Containers must use read-only root filesystems"
  - expression: "object.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.capabilities) && c.securityContext.capabilities.drop.exists(cap, cap == 'ALL'))"
    message: "Drop ALL Linux capabilities by default"

Pair this with namespace-level exceptions via labels and admission bindings so teams can request temporary waivers. We complement runtime guardrails with build-time checks using Gatekeeper or conftest. The CIS Kubernetes Benchmark is a pragmatic baseline for cluster hardening, and Gatekeeper gives us a policy deployment model that scales across clusters with CRDs.
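
A binding along these lines applies the baseline cluster-wide while honoring the label-based waivers mentioned above; the label key is our own convention:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: baseline-security-binding
spec:
  policyName: baseline-security
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchExpressions:
      # namespaces carrying an explicit, time-boxed waiver label are skipped
      - key: compliance.acme.example/waiver
        operator: DoesNotExist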

Don’t forget cloud-side guardrails—S3 public access blocks, VPC endpoint-only data paths, and KMS CMKs with rotation. The goal isn’t handcuffs; it’s rails. We make the safe thing the default thing, and exceptions are visible, intentional, and temporary.

Evidence on Autopilot: Logs, Attestations, and Retention

Auditors don’t need our jokes; they need evidence. We’d rather not assemble it manually. So we wire evidence collection directly into our pipelines and platforms. Build systems emit attestations about what ran, by whom, and with what inputs. Artifact registries store provenance. Production logs are structured, retained, and searchable. When someone asks, “prove that only signed images were deployed,” we can show it in one query.

We like supply-chain attestation because it’s specific and machine-verifiable. The SLSA levels help us step up maturity without boiling the ocean: start by generating provenance at build time, then enforce verification at deploy. Store SBOMs alongside images and link them to a release ticket. For logging and retention, the guidance in the AWS Well-Architected Security and Reliability pillars maps nicely to what auditors expect: immutable logs, centralized visibility, and lifecycle policies.

Here’s a minimal GitHub Actions snippet that builds and pushes an image, generates an SBOM, then signs and attests it, uploading the evidence at the end. It’s not production-ready, but it illustrates the flow:

name: build-and-attest
on: [push]
permissions:
  contents: read
  packages: write   # lets GITHUB_TOKEN push to ghcr.io
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      IMAGE: ghcr.io/acme/app:${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      - uses: sigstore/cosign-installer@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # cosign signs by digest, so the image must exist in the registry first
      - run: docker build -t "$IMAGE" . && docker push "$IMAGE"
      - run: curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sudo sh -s -- -b /usr/local/bin
      - run: syft "$IMAGE" -o cyclonedx-json > sbom.json
      # an encrypted signing key also needs COSIGN_PASSWORD in env
      - run: cosign sign --yes --key env://COSIGN_PRIVATE_KEY "$IMAGE"
        env:
          COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_KEY }}
      - run: cosign attest --yes --key env://COSIGN_PRIVATE_KEY --type cyclonedx --predicate sbom.json "$IMAGE"
        env:
          COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_KEY }}
      # signatures and attestations live in the registry; keep the SBOM as evidence
      - uses: actions/upload-artifact@v4
        with:
          name: build-evidence
          path: sbom.json
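
On the deploy side, verification becomes a gate that runs before any rollout step. A sketch under the same assumptions (hypothetical image name, deploy script, and a COSIGN_PUBLIC_KEY secret holding the verification key):

name: verify-and-deploy
on: [workflow_dispatch]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: sigstore/cosign-installer@v3
      - run: cosign verify --key env://COSIGN_PUBLIC_KEY ghcr.io/acme/app:${{ github.sha }}
        env:
          COSIGN_PUBLIC_KEY: ${{ secrets.COSIGN_PUBLIC_KEY }}
      - run: cosign verify-attestation --key env://COSIGN_PUBLIC_KEY --type cyclonedx ghcr.io/acme/app:${{ github.sha }}
        env:
          COSIGN_PUBLIC_KEY: ${{ secrets.COSIGN_PUBLIC_KEY }}
      # only reached if both the signature and the SBOM attestation verify
      - run: ./scripts/deploy.sh   # hypothetical deploy entrypoint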

Evidence shouldn’t be a scavenger hunt. We collect it where the work happens, store it where it can’t be edited, and surface it when asked.

Least-Privilege Access Without Making On-Call Miserable

Least privilege can feel like death by a thousand denied requests. Our trick is to bind permissions to workflows rather than to humans. CI has the right to deploy; humans approve the workflow, not push to production directly. Engineers get break-glass roles with auto-expiry and webhook alerts. Secrets are short-lived and issued just-in-time. Combining these patterns gets us strong controls without breaking incident response.

Terraform helps us standardize IAM policies so we’re not handcrafting JSON at 2 a.m. We keep policies terse and testable, tag them with owners, and embed conditions that align with our process. For example, we allow deployments only from a dedicated CI role, scoped to specific resources and guarded by a source identity condition:

resource "aws_iam_role" "ci_deploy" {
  name = "ci-deploy"
  assume_role_policy = data.aws_iam_policy_document.ci_assume_role.json
}

data "aws_iam_policy_document" "ci_assume_role" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals { type = "Federated", identifiers = [aws_iam_openid_connect_provider.github.arn] }
    condition {
      test = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = ["repo:acme/app:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_policy" "deploy" {
  name   = "ci-deploy-policy"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect   = "Allow",
      Action   = ["ecs:UpdateService", "lambda:UpdateFunctionCode"],
      Resource = ["arn:aws:ecs:...", "arn:aws:lambda:..."]
    }]
  })
}

The Security pillar of the AWS Well-Architected framework reinforces this approach: identity federation, scoped roles, and conditions tied to source and time. We keep on-call happy by making elevation fast, logged, and reversible, and we automate the revocation path so nobody forgets.

Ship Faster With Guardrails, Not Gates

Compliance shouldn’t be a cage. We want fast, frequent deploys that quietly meet the baseline. That means designing guardrails that help developers move faster by making the secure, compliant path the easiest path. Lint the things that matter, fail early with clear messages, and provide a fix or a link to one. If we can’t point to an automation for a control, we ask ourselves whether we need the control or the automation. “Manual forever” isn’t a plan; it’s a tax.

We also measure the system. How long does it take for a control failure to be discovered? How often do we see exceptions, and how quickly do they expire? What percentage of deploys carry attestations? We publish a simple scoreboard next to our error budget and throughput metrics. Teams are competitive in all the right ways when they can see where they stand. Compliance becomes part of “how we ship” rather than an annual event.

Finally, we bring compliance folks into planning the same way we bring SRE into design reviews. When a new service needs PII, we invite the right stakeholders early, ask them to help define the tests, and write the policies while the code is still fresh. We’ve learned that an extra hour up front beats fifty emails later. The net result: fewer Friday fire drills, happier auditors, and more time building things our users care about.

Audit Day as a Non-Event: Dashboards and Dry Runs

We practice audits the way we practice failovers. A month before the real thing, we run a “quiet audit.” We pick a representative set of controls and ask ourselves to produce evidence within a fixed window—say, two hours, end to end. If we can’t, we don’t write an essay; we add an automation. Dashboards must answer common questions: Who deployed last Tuesday? Where are the SBOMs for the payments service? Which clusters enforce non-root containers? We treat each request as a query against our systems, not a safari through Confluence.

The dashboard is a living thing. It pulls build attestations, artifact signatures, Terraform compliance checks, cluster admission metrics, and IAM role usage. Anything stale gets flagged. Exceptions show up in red with expiry dates and owners. We embed quick actions: regenerate a report, download the evidence bundle, or open the remediation PR. When the real auditor arrives, we screen-share the dashboards and walk through how the system ensures the controls. Papers and PDFs are supporting actors, not the main show.

We finish with a short retro: which controls created the most friction, which automations saved us the most time, and what would make next time even more boring. Boring is the goal. If audit day feels like any other Tuesday, we’ve won. And if someone still wants screenshots, fine—we’ll generate them with a button and get back to shipping.
