Incident Response

Security incidents happen. A dependency is compromised, credentials are leaked, an attacker finds an exposed endpoint. The difference between a contained incident and a catastrophe is how quickly and methodically you respond.

Incident response is not improvisation under pressure — it's a rehearsed sequence of steps designed to restore service, limit damage, and preserve the evidence needed to understand what happened.

Phases

Detection

You can't respond to what you can't see. Detection depends on the observability infrastructure described in logging and metrics, augmented with security-specific signals.

Anomaly signals to monitor:

  • Unexpected spike in authentication failures — potential credential stuffing or brute force
  • Requests from unusual geographic locations or IP ranges
  • API calls to admin endpoints from non-admin accounts
  • Sudden increase in outbound network traffic — potential data exfiltration
  • Container or process spawning unexpected child processes
  • Changes to firewall rules, security groups, or IAM policies outside of normal deploy windows

These signals should feed into alerting rules. A security alert follows the same principles as an operational alert (see metrics) — it fires on sustained conditions, has a severity level, and links to a runbook.
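The "sustained conditions" requirement can be sketched as a small stateful check — this is an illustrative sketch, not any particular monitoring system's API; the class name and threshold values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class AuthFailureAlert:
    """Fires only when the failure rate stays above threshold
    for several consecutive windows, to avoid one-off noise."""
    threshold: int = 100        # failures per window before it counts as a breach
    sustained_windows: int = 3  # consecutive breached windows required to fire
    _breaches: int = 0

    def observe_window(self, failure_count: int) -> bool:
        """Feed one window's failure count; return True when the alert fires."""
        if failure_count >= self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0  # a quiet window resets the streak
        return self._breaches >= self.sustained_windows
```

A single noisy window never fires; three breached windows in a row do. Severity and the runbook link would live in the alert's routing metadata, as with operational alerts.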

Containment

Once an incident is confirmed, the immediate priority is stopping the bleeding. Containment means limiting the attacker's access and preventing further damage — not fixing the root cause. The fix comes later.

Credential compromise:

  1. Rotate the compromised credential immediately. API keys, tokens, database passwords — generate new ones and deploy them.
  2. Revoke all active sessions associated with the compromised credential.
  3. Audit the access logs for the compromised credential to determine what was accessed.

# Example: rotate a Kubernetes secret and restart affected pods
kubectl create secret generic db-credentials \
    --from-literal=password="$(openssl rand -base64 32)" \
    --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/api-server

Compromised host or container:

  1. Isolate the host — remove it from the load balancer and restrict its network access. Don't shut it down yet; the running state contains forensic evidence.
  2. Capture the current state: memory dump, process list, network connections, filesystem snapshot.
  3. Spawn a replacement from a known-good image. Restore service first.
  4. Analyze the isolated host offline.
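Part of step 2 — snapshotting the process table — can be sketched with nothing but the Linux /proc filesystem. This is a hedged illustration, not a complete forensic capture (real evidence collection would also grab memory, open sockets, and filesystem state with dedicated tooling); the `proc_root` parameter is there so the function can also be pointed at a copied /proc tree:

```python
import json
from pathlib import Path

def snapshot_processes(proc_root: str = "/proc") -> dict:
    """Map each PID to its command line by reading proc_root/<pid>/cmdline."""
    snapshot = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are process directories
        try:
            raw = (entry / "cmdline").read_bytes()
            # cmdline arguments are NUL-separated; join them with spaces
            snapshot[entry.name] = raw.replace(b"\0", b" ").decode().strip()
        except OSError:
            snapshot[entry.name] = ""  # process exited between listing and read
    return snapshot

def write_evidence(snapshot: dict, out_path: str) -> None:
    """Persist the snapshot alongside other captured evidence."""
    Path(out_path).write_text(json.dumps(snapshot, indent=2))
```

Write the output to storage *off* the compromised host, so the evidence survives when the machine is eventually destroyed.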

Exposed data:

  1. Identify exactly what was exposed — which records, which fields, which time window.
  2. Revoke or rotate any credentials that were part of the exposed data.
  3. Determine whether notification obligations apply (customers, regulators, partners).
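Step 1 usually reduces to filtering structured access logs down to the exposure window. A minimal sketch, assuming a log schema with `record_id` and `timestamp` fields (both names are assumptions about your logging format):

```python
def records_in_window(entries: list, start, end) -> list:
    """Return the distinct record IDs read between start and end, inclusive."""
    return sorted({e["record_id"] for e in entries
                   if start <= e["timestamp"] <= end})
```

The resulting list of records is the input to the notification decision in step 3 — obligations often depend on exactly which fields and which users were affected.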

Recovery

Recovery means returning to a known-good state with confidence that the attack vector is closed.

Roll back to a known-good state. If the incident involved a compromised deployment, revert to the last known-good build. If it involved infrastructure changes, restore from the last verified configuration. The version control history in your infrastructure-as-code repository is your source of truth for what "known good" looks like.

Rotate broadly, not narrowly. If one set of credentials was compromised, assume adjacent credentials may be as well. Rotate all credentials in the affected scope — not just the one you know was leaked. The marginal cost of rotating an extra secret is far lower than the cost of missing one that was also compromised.
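"Rotate broadly" can be expressed as a scope-wide sweep over the secrets store. In this sketch a plain dict stands in for Vault, AWS Secrets Manager, or similar, and the path-prefix convention for scopes is an assumption:

```python
import secrets

def rotate_scope(store: dict, scope_prefix: str) -> list:
    """Replace every secret whose path falls inside the affected scope,
    not just the one known to be leaked. Returns the rotated paths."""
    rotated = []
    for path in list(store):
        if path.startswith(scope_prefix):
            store[path] = secrets.token_urlsafe(32)  # fresh random credential
            rotated.append(path)
    return rotated
```

After the sweep, every service reading from the scope must be redeployed or restarted to pick up the new values — which is why the zero-downtime pattern described under Secrets Rotation matters.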

Verify the fix. After closing the attack vector, confirm it's actually closed. If the incident was an exposed endpoint, verify the firewall rule is in place. If it was a vulnerable dependency, verify the patched version is deployed. Trust, but verify — by testing the specific path the attacker used.
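Verifying a closed network path can be as direct as attempting the connection the attacker made. A minimal sketch — host and port are placeholders for the endpoint that was exposed:

```python
import socket

def port_is_closed(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port is refused or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection succeeded: the path is still open
    except OSError:
        return True  # refused, unreachable, or timed out
```

Run the probe from *outside* the perimeter the firewall rule is supposed to enforce; a check from inside the network proves nothing about external exposure.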

Post-Incident Review

After service is restored and the immediate threat is contained, conduct a structured review. The goal is not blame — it's understanding what happened and what to change so it doesn't happen again.

Document:

  • Timeline — when the incident started, when it was detected, when containment began, when service was restored. The gaps between these timestamps reveal where your detection and response can improve.
  • Attack vector — how did the attacker get in? A vulnerable dependency, a misconfigured firewall rule, a leaked credential, social engineering?
  • Blast radius — what was affected? Which services, which data, which users?
  • What worked — which detection signals fired? Which runbooks were useful? What allowed you to respond quickly?
  • What to change — concrete action items, filed as issues, with owners and deadlines. "Improve monitoring" is not an action item. "Add an alert for authentication failures exceeding 100/minute on the auth service" is.
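The timeline gaps are worth computing explicitly, since each one maps to a different improvement: time-to-detect points at monitoring, time-to-contain at runbooks and paging, time-to-restore at recovery automation. A small sketch with assumed example timestamps:

```python
from datetime import datetime

def timeline_gaps(started, detected, contained, restored):
    """Break an incident timeline into the three gaps worth improving."""
    return {
        "time_to_detect": detected - started,
        "time_to_contain": contained - detected,
        "time_to_restore": restored - contained,
    }

gaps = timeline_gaps(
    datetime(2024, 3, 1, 2, 14),  # attack began (illustrative values)
    datetime(2024, 3, 1, 3, 40),  # alert fired
    datetime(2024, 3, 1, 4, 5),   # host isolated
    datetime(2024, 3, 1, 6, 30),  # service restored
)
```

In this example, the 86-minute time-to-detect is the largest improvable gap on the detection side.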

Secrets Rotation

Secrets rotation is both an incident response activity and a preventive practice. Regular rotation limits the window during which a compromised credential is useful.

Rotation Strategy

Secret Type                Rotation Frequency                    Method
API keys (third-party)     90 days, or on suspected compromise   Generate new key, deploy, revoke old key
Database passwords         90 days, or on suspected compromise   Update password, update config, restart connections
TLS certificates           Automated via ACME (Let's Encrypt)    caddy handles this automatically
Internal CA certificates   Annually                              Reissue and distribute; see host-config CA trust section
SSH keys                   On personnel change, or annually      Replace authorized_keys, remove old public keys
JWT signing keys           90 days                               Deploy new key, allow overlap period for token expiry, remove old key

Zero-Downtime Rotation

The hardest part of rotation is avoiding downtime during the transition. The standard pattern:

  1. Add the new credential alongside the old one. The service accepts both.
  2. Update all clients to use the new credential.
  3. Remove the old credential after confirming no traffic uses it.

For JWT signing keys, this means the verification side accepts tokens signed by either key during the overlap window. For database passwords, this means the database allows both passwords briefly while application instances are restarted with the new one.
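The overlap window for signing keys can be sketched with stdlib HMAC standing in for a real JWT library (the `sign`/`verify` functions here are illustrative, not PyJWT's API): the verifier tries the new key first and falls back to the old one until the overlap period ends.

```python
import base64
import hashlib
import hmac

def sign(payload: bytes, key: bytes) -> bytes:
    """Produce an HMAC-SHA256 signature over the payload."""
    return base64.urlsafe_b64encode(hmac.new(key, payload, hashlib.sha256).digest())

def verify(payload: bytes, signature: bytes, keys: list) -> bool:
    """Accept the token if any key in the overlap set produced it."""
    return any(hmac.compare_digest(sign(payload, k), signature) for k in keys)
```

Tokens signed with the old key keep verifying against `[new_key, old_key]`; once the old key is dropped from the list, they stop — which is exactly step 3 of the pattern above.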

The alternative — rotating atomically and restarting everything at once — works for small deployments but causes unnecessary downtime at scale.

References