# Logging
Logs capture discrete events — things that happened at a specific moment in time. A good log tells you what occurred, where, and with enough context to understand why. A bad log is noise that fills your storage and teaches you nothing.
The goal is not to log everything. It's to log the right things, in a consistent format, with enough context to be useful when something goes wrong.
## Conventions
**Event-centric messages.** Each log entry should describe a single event. Name it clearly: `gap_scan_started`, `order_sync_completed`, `payment_failed`. The event name is the primary thing you search for. It should be specific enough to identify the code path that produced it.
**Structured key-value pairs.** Context belongs in fields, not embedded in a message string. `{"event": "order_sync_completed", "order_id": "abc-123", "duration_ms": 450}` is queryable. `"Order abc-123 sync completed in 450ms"` is not — or at least, not without writing a regex.
**snake_case everywhere.** Event names, field names, and field values all use snake_case. This keeps things consistent with our metric naming conventions and avoids friction when correlating across signals.
**Watch cardinality.** In backends that index by label, such as grafana-loki, every distinct value of an indexed field creates a new series. Fields like user_id or order_id are fine for filtering individual requests, but think twice before turning unbounded values into index labels — they can quietly degrade query performance.
**Avoid accidental field names.** In libraries with variadic key-value pairs like slog.Info(msg, ...any), arguments must come in pairs: key, value, key, value. If you pass an odd number of arguments, the dangling argument is logged as a value under a placeholder key (slog uses `!BADKEY`). If you swap the order, a variable's value becomes a field name.
```go
// Wrong: the value of orderID (e.g. "order-123") becomes the field
// name, and the string "order_id" becomes its value.
slog.Info("order_sync_completed", orderID, "order_id")

// Wrong: "duration_ms" and durationMS pair up, then 450 is a dangling
// value that slog reports under the !BADKEY field name.
slog.Info("order_sync_completed", "duration_ms", durationMS, 450)

// Correct
slog.Info("order_sync_completed", "order_id", orderID, "duration_ms", durationMS)
```
When a variable ends up in the key position, your logging backend creates a new field for every distinct value. If that variable is a number or ID, you'll pollute your schema with thousands of meaningless field names and degrade query performance. Many logging systems (including grafana-loki) enforce cardinality limits and will reject ingestion once you exceed them. Linters like sloglint catch these mistakes statically, before the code ever runs.
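If the project already runs golangci-lint, enabling sloglint is a one-line change. A minimal sketch — merge this into your existing `.golangci.yml` rather than replacing it:

```yaml
# .golangci.yml (fragment) — enable sloglint alongside your other linters
linters:
  enable:
    - sloglint
```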
## Log Levels
Log levels signal severity. Use them consistently so that filtering by level actually works.
**ERROR** — something failed and requires attention. A request couldn't be completed, a dependency is unreachable, data was lost or corrupted. Errors should be actionable: if nobody needs to investigate, it's not an error.

**WARN** — something unexpected happened but the system recovered. A retry succeeded, a fallback was used, a deprecated code path was hit. Warnings are a leading indicator — a cluster of warnings often precedes errors.

**INFO** — significant state transitions in normal operation. Service started, sync completed, configuration loaded, job finished. INFO logs should tell the story of what the system did at a glance. Most production systems run at INFO level.

**DEBUG** — detailed internal state useful during development or incident investigation. Function inputs and outputs, cache hits and misses, query plans. DEBUG logs are high-volume and should never be enabled in production by default. Use the LOG_LEVEL environment variable to toggle them on when needed.
## Planning Log Events
When designing a new feature or workflow, document the events you plan to emit before writing code. This forces clarity about what matters and prevents gaps in visibility after deployment.
A simple table works well:
| Event | Fields | Description |
|---|---|---|
| gap_scan_started | bucket_size, retention_days | Beginning gap detection |
| gap_scan_completed | bucket_size, items_scanned, gaps_found, duration_ms | Scan finished |
| gap_fill_started | item_id, bucket_size, gap_count | Filling gaps for item |
| gap_fill_completed | item_id, bucket_size, buckets_inserted | Successfully filled |
| gap_fill_skipped | item_id, bucket_size, reason | Skipped — e.g., outside retention window |
| gap_fill_failed | item_id, bucket_size, error | API or database error |
| gap_fill_run_completed | items_processed, total_gaps_filled, errors, duration_ms | Nightly run summary |
This becomes a contract between the code and the dashboards that depend on it. When an event name changes, you know which queries and alerts need updating.
## Implementation
### Structured Logging in Go
Use `log/slog` (standard library, Go 1.21+) or `zerolog` for structured logging. Both produce JSON output suitable for machine parsing and human reading.
```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))
	slog.SetDefault(logger)

	// cfg and version come from your configuration loading and build metadata.
	slog.Info("service_started",
		"port", cfg.Server.Port,
		"version", version,
	)
}
```
Output (pretty-printed here; slog emits each entry as a single line):

```json
{
  "time": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "msg": "service_started",
  "port": 8080,
  "version": "1.2.0"
}
```
### Context Propagation
Attach request-scoped fields (request ID, user ID, trace ID) to a logger instance and pass it through the request context. This ensures every log entry within a request automatically includes the fields needed to correlate them.
```go
import (
	"context"
	"log/slog"
	"net/http"

	"github.com/google/uuid"
)

// ctxKey is unexported so the context key can't collide with keys
// from other packages.
type ctxKey struct{}

var loggerKey ctxKey

func middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		requestID := r.Header.Get("X-Request-ID")
		if requestID == "" {
			requestID = uuid.NewString()
		}
		logger := slog.Default().With(
			"request_id", requestID,
			"method", r.Method,
			"path", r.URL.Path,
		)
		ctx := context.WithValue(r.Context(), loggerKey, logger)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```
Every downstream function that extracts the logger from context gets request_id, method, and path for free — no manual threading required.
## What to Log
At service boundaries:
- Inbound requests — method, path, status code, duration, request ID
- Outbound calls — target service, method, status, duration, whether a retry was needed
- Background jobs — job name, start/complete/fail events, items processed, duration
At decision points:
- Branching logic that affects behavior — "used fallback because primary was unavailable"
- Skipped work — "skipped sync because last run was less than 5 minutes ago"
- Retries — each attempt, with the error that triggered it
Don't log request or response bodies (they often contain PII or secrets), individual loop iterations in bulk operations (log the summary instead), or expected non-events ("user was already active, no action taken" — unless you're debugging).
## Log Shipping
Applications write to stdout. That's it. The application's only responsibility is producing structured log lines. Everything else — collection, buffering, routing, and storage — is the infrastructure's job.
In containerized environments, a log agent like fluent-bit tails the container's stdout, enriches the entries with metadata (container name, namespace, host), and forwards them to a storage backend like grafana-loki. This separation means the application doesn't need to know where its logs end up, and the shipping pipeline can be changed without modifying application code.
Application → stdout → Fluent Bit → Loki → Grafana
This pipeline is simple and debuggable, and each stage can be replaced independently.
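A fluent-bit configuration for this pipeline might look like the following sketch; the path, host, and label values are illustrative, not production settings:

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  docker

[FILTER]
    Name    kubernetes
    Match   *

[OUTPUT]
    Name    loki
    Match   *
    Host    loki.example.internal
    Port    3100
    Labels  job=fluent-bit
```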