Logging

Logs capture discrete events — things that happened at a specific moment in time. A good log tells you what occurred, where, and with enough context to understand why. A bad log is noise that fills your storage and teaches you nothing.

The goal is not to log everything. It's to log the right things, in a consistent format, with enough context to be useful when something goes wrong.

Conventions

Event-centric messages. Each log entry should describe a single event. Name it clearly: gap_scan_started, order_sync_completed, payment_failed. The event name is the primary thing you search for. It should be specific enough to identify the code path that produced it.

Structured key-value pairs. Context belongs in fields, not embedded in a message string. {"event": "order_sync_completed", "order_id": "abc-123", "duration_ms": 450} is queryable. "Order abc-123 sync completed in 450ms" is not — or at least, not without writing a regex.

snake_case everywhere. Event names, field names, and field values all use snake_case. This keeps things consistent with our metric naming conventions and avoids friction when correlating across signals.

Watch cardinality. In backends that index by label, such as grafana-loki, every distinct value of an indexed field creates a new stream. Fields like user_id or order_id are fine as unindexed log fields for filtering individual requests, but think twice before promoting unbounded values to indexed labels — they can quietly degrade query performance.

Avoid accidental field names. In libraries with variadic key-value pairs like slog.Info(msg, ...any), arguments must come in pairs: key, value, key, value. If you accidentally pass an odd number of arguments, slog records the dangling value under the placeholder key !BADKEY. If you swap the order, a variable's value becomes a field name.

// Wrong — the value of orderID becomes the field name, "order_id" becomes the value
slog.Info("order_sync_completed", orderID, "order_id")

// Wrong — 450 has no key, so slog records it under !BADKEY
slog.Info("order_sync_completed", "duration_ms", durationMS, 450)

// Correct
slog.Info("order_sync_completed", "order_id", orderID, "duration_ms", durationMS)

When a variable ends up in the key position, your logging backend creates a new field for every distinct value. If that variable is a number or ID, you'll pollute your schema with thousands of meaningless field names and degrade query performance. Many logging systems (including grafana-loki) enforce cardinality limits and will reject ingestion once you exceed them. Linters like sloglint catch these mistakes statically, before they reach production.
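If you run golangci-lint, sloglint can be enabled with a config fragment along these lines (a sketch — option names vary by sloglint version, so check its documentation):

```yaml
# .golangci.yml
linters:
  enable:
    - sloglint

linters-settings:
  sloglint:
    kv-only: true            # require key-value pairs, flagging odd or swapped arguments
    key-naming-case: snake   # matches the snake_case convention above
```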

Log Levels

Log levels signal severity. Use them consistently so that filtering by level actually works.

ERROR — something failed and requires attention. A request couldn't be completed, a dependency is unreachable, data was lost or corrupted. Errors should be actionable: if nobody needs to investigate, it's not an error.

WARN — something unexpected happened but the system recovered. A retry succeeded, a fallback was used, a deprecated code path was hit. Warnings are a leading indicator — a cluster of warnings often precedes errors.

INFO — significant state transitions in normal operation. Service started, sync completed, configuration loaded, job finished. INFO logs should tell the story of what the system did at a glance. Most production systems run at INFO level.

DEBUG — detailed internal state useful during development or incident investigation. Function inputs and outputs, cache hits and misses, query plans. DEBUG logs are high-volume and should never be enabled in production by default. Use the LOG_LEVEL environment variable to toggle them on when needed.

Planning Log Events

When designing a new feature or workflow, document the events you plan to emit before writing code. This forces clarity about what matters and prevents gaps in visibility after deployment.

A simple table works well:

| Event                  | Fields                                                  | Description                               |
|------------------------|---------------------------------------------------------|-------------------------------------------|
| gap_scan_started       | bucket_size, retention_days                             | Beginning gap detection                   |
| gap_scan_completed     | bucket_size, items_scanned, gaps_found, duration_ms     | Scan finished                             |
| gap_fill_started       | item_id, bucket_size, gap_count                         | Filling gaps for item                     |
| gap_fill_completed     | item_id, bucket_size, buckets_inserted                  | Successfully filled                       |
| gap_fill_skipped       | item_id, bucket_size, reason                            | Skipped — e.g., outside retention window  |
| gap_fill_failed        | item_id, bucket_size, error                             | API or database error                     |
| gap_fill_run_completed | items_processed, total_gaps_filled, errors, duration_ms | Nightly run summary                       |

This becomes a contract between the code and the dashboards that depend on it. When an event name changes, you know which queries and alerts need updating.

Implementation

Structured Logging in Go

Use log/slog (standard library, Go 1.21+) or zerolog for structured logging. Both produce JSON output suitable for machine parsing and human reading.

import (
    "log/slog"
    "os"
)

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))
    slog.SetDefault(logger)

    slog.Info("service_started",
        "port", cfg.Server.Port,
        "version", version,
    )
}

Output:

{
  "time": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "msg": "service_started",
  "port": 8080,
  "version": "1.2.0"
}

Context Propagation

Attach request-scoped fields (request ID, user ID, trace ID) to a logger instance and pass it through the request context. This ensures every log entry within a request automatically includes the fields needed to correlate them.

func middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestID := r.Header.Get("X-Request-ID")
        if requestID == "" {
            requestID = uuid.NewString()
        }

        logger := slog.Default().With(
            "request_id", requestID,
            "method", r.Method,
            "path", r.URL.Path,
        )

        // loggerKey is an unexported context-key type defined elsewhere in the package
        ctx := context.WithValue(r.Context(), loggerKey, logger)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Every downstream function that extracts the logger from context gets request_id, method, and path for free — no manual threading required.

What to Log

At service boundaries:

  • Inbound requests — method, path, status code, duration, request ID
  • Outbound calls — target service, method, status, duration, whether a retry was needed
  • Background jobs — job name, start/complete/fail events, items processed, duration

At decision points:

  • Branching logic that affects behavior — "used fallback because primary was unavailable"
  • Skipped work — "skipped sync because last run was less than 5 minutes ago"
  • Retries — each attempt, with the error that triggered it

Don't log request or response bodies (they may contain PII), individual loop iterations in bulk operations (log the summary instead), or expected non-events ("user was already active, no action taken" — unless you're debugging).

Log Shipping

Applications write to stdout. That's it. The application's only responsibility is producing structured log lines. Everything else — collection, buffering, routing, and storage — is the infrastructure's job.

In containerized environments, a log agent like fluent-bit tails the container's stdout, enriches the entries with metadata (container name, namespace, host), and forwards them to a storage backend like grafana-loki. This separation means the application doesn't need to know where its logs end up, and the shipping pipeline can be changed without modifying application code.

Application → stdout → Fluent Bit → Loki → Grafana

This pipeline is simple, debuggable, and each stage can be replaced independently.
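As a sketch, a minimal fluent-bit configuration for this pipeline might look like the following (paths, hostnames, and labels are placeholders — consult the Fluent Bit and Loki documentation for your environment):

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  docker

[OUTPUT]
    Name    loki
    Match   *
    Host    loki.monitoring.svc
    Port    3100
    Labels  job=fluent-bit
```

Note that the Labels line stays small and static — per the cardinality guidance above, high-cardinality values belong in the log line itself, not in Loki labels.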
