Observability

Observability is the ability to understand a system's internal state from its external outputs. In practice, this means designing your services to answer the question: what is happening, and why?

A well-observed system combines three complementary signals — logs, metrics, and traces — each offering a different lens on the same underlying behavior. Logs tell you what happened. Metrics tell you how things are behaving over time. Traces show you the path a request takes across services. Together they let you move from "something is wrong" to "here's why" without guessing.

The Four Golden Signals

These four measurements, drawn from Google's Site Reliability Engineering book, capture the essential health of any service. If you instrument nothing else, instrument these.

Latency measures the time it takes to serve a request. Track it as a distribution, not an average — the 99th percentile matters more than the median. A service with a 50ms median and a 2-second p99 is a service with a problem that averages will hide.
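To see how an average hides a tail, here is a small sketch with illustrative numbers: 980 fast requests and 20 slow ones. The nearest-rank percentile helper and the sample values are hypothetical, not from any library.

```python
def percentile(samples, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# 980 requests at 50 ms, 20 requests at 2000 ms (illustrative data).
latencies_ms = [50] * 980 + [2000] * 20

mean = sum(latencies_ms) / len(latencies_ms)  # 89.0 ms -- looks healthy
p50 = percentile(latencies_ms, 50)            # 50 ms
p99 = percentile(latencies_ms, 99)            # 2000 ms -- the tail the mean hides
```

The mean sits at 89 ms even though one request in fifty takes two seconds; only the p99 exposes the problem.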

Traffic quantifies demand. For an API, this is requests per second. For a pipeline, it might be records processed per minute. The shape of your traffic tells you what "normal" looks like, which is the prerequisite for recognizing abnormal.
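A sliding-window counter is one simple way to turn raw events into a rate. This is a hypothetical helper, not the API of any particular metrics library:

```python
from collections import deque

class RateMeter:
    """Estimate events per second over a sliding time window (sketch only)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of recorded events

    def record(self, now):
        self.events.append(now)

    def rate(self, now):
        # Evict events that fell out of the window, then average over it.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

meter = RateMeter(window_seconds=5)
for t in [1, 2, 3, 4, 5]:
    meter.record(t)
current = meter.rate(5)  # 1.0 request per second
```

Production systems usually get this from their metrics library as a rate over a counter, but the window-and-divide idea is the same.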

Errors are the rate of requests that fail, whether explicitly (a 500 response) or implicitly (a 200 that returns wrong data). Distinguish between client errors and server errors — they have different causes and different remedies. A spike in 4xx errors is your users' problem; a spike in 5xx errors is yours.
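The client/server split above can be expressed directly on status codes. A minimal sketch, with an illustrative batch of responses (a real service would track these as counters rather than batches):

```python
def error_rates(status_codes):
    """Split HTTP status codes into client-error and server-error rates."""
    total = len(status_codes)
    client = sum(1 for s in status_codes if 400 <= s < 500)
    server = sum(1 for s in status_codes if 500 <= s < 600)
    return client / total, server / total

codes = [200] * 90 + [404] * 6 + [500] * 4  # illustrative sample
client_rate, server_rate = error_rates(codes)  # 0.06 and 0.04
```

Tracking the two rates separately means an alert on 5xx pages you, while a 4xx spike can route to a different investigation.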

Saturation reflects how close a resource is to its limit. CPU, memory, disk, connection pools — when any of these approaches capacity, latency rises and errors follow. Saturation is the early warning signal that gives you time to act before users notice.
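Saturation is just usage over capacity, alerted on before it reaches 1.0. A sketch with a hypothetical 80% threshold:

```python
def saturation(in_use, capacity):
    """Fraction of a resource's capacity currently consumed."""
    return in_use / capacity

def should_alert(in_use, capacity, threshold=0.8):
    # Alert with headroom to spare: rising saturation precedes
    # rising latency and errors.
    return saturation(in_use, capacity) >= threshold

ok = should_alert(45, 100)       # False -- plenty of headroom
warning = should_alert(85, 100)  # True  -- act before exhaustion
```

The right threshold depends on how fast the resource fills and how long remediation takes; 0.8 here is only an example.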

Articles

Document   Description
logging    Structured logging conventions, log levels, event planning, and implementation patterns
metrics    Metric types, naming conventions, instrumentation, and alerting fundamentals

The Three Signals

Each signal has distinct strengths. Understanding where they overlap — and where they don't — prevents both gaps in visibility and redundant instrumentation.

Logs are the richest signal. They capture arbitrary context about individual events: request IDs, user IDs, error messages, stack traces. They're what you reach for when debugging a specific incident. The cost is volume — logs are expensive to store and query at scale, so they require discipline around what you emit and how long you retain it.
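A minimal structured-logging sketch using only the standard library; real deployments typically reach for a dedicated library, but the core idea is the same: one JSON object per event, with machine-readable fields. The field names and IDs here are illustrative.

```python
import json

def format_event(message, **fields):
    """Render one log event as a single JSON line (sketch only)."""
    return json.dumps({"message": message, **fields})

line = format_event(
    "payment failed",
    level="error",
    request_id="req-1234",  # hypothetical request ID
    user_id="u-5678",       # hypothetical user ID
    error="card_declined",
)
print(line)
```

Because every field is a key rather than free text, the line can be filtered and aggregated without regex archaeology.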

Metrics are the most efficient signal. A counter or histogram consumes a fixed amount of storage regardless of traffic volume. They're ideal for dashboards, alerting, and capacity planning — answering questions like "what's the error rate?" or "how full is the connection pool?" The tradeoff is that metrics are aggregated; they tell you that errors are happening, not which request failed or why.
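The fixed-storage property is easiest to see in a bucketed histogram: the counts array never grows, no matter how many observations arrive. A sketch of the usual metrics-library idea, with illustrative bucket bounds:

```python
class Histogram:
    """Bucketed histogram with constant storage (sketch only)."""

    def __init__(self, buckets):
        self.buckets = buckets                   # upper bounds, ascending
        self.counts = [0] * (len(buckets) + 1)   # +1 overflow bucket
        self.total = 0

    def observe(self, value):
        self.total += 1
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1  # larger than every bound

h = Histogram(buckets=[0.1, 0.5, 1.0])  # seconds, illustrative
for v in [0.05, 0.3, 0.7, 2.0]:
    h.observe(v)
# h.counts == [1, 1, 1, 1]; four integers regardless of traffic volume
```

The aggregation is also the tradeoff: the histogram can tell you a request took over a second, but not which one.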

Traces follow a single request as it moves through multiple services. Each service adds a span to the trace, recording its contribution to the overall latency. Traces are essential for diagnosing performance problems in distributed systems — they show you where time is being spent. The cost is instrumentation complexity and sampling decisions.
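The span mechanics can be sketched in a few lines: a trace ID generated once at the edge and a timed interval per service. This is an illustration of the concept, not the API of OpenTelemetry or any real tracing library.

```python
import time
import uuid

class Span:
    """One timed unit of work, tagged with a shared trace ID (sketch only)."""

    def __init__(self, name, trace_id):
        self.name = name
        self.trace_id = trace_id
        self.start = time.monotonic()
        self.duration = None

    def finish(self):
        self.duration = time.monotonic() - self.start

trace_id = uuid.uuid4().hex  # generated once, propagated to every service
span = Span("checkout-service", trace_id)
# ... handle the request ...
span.finish()
```

Because every span carries the same trace ID, a collector can reassemble the request path and show where the latency accrued.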

Logs and metrics are the foundation. Traces become important as your architecture grows beyond a few services. Start with the first two and add tracing when you need to answer "which service is slow?" across a request path.

Correlation

The signals become more powerful when they're connected. A metric alert fires — error rate is elevated. You click through to the logs filtered by the same time window and service. A log entry includes a trace ID, which takes you to the distributed trace showing the failing downstream call.

This requires consistency: shared labels between metrics and log fields, trace IDs propagated through request context, and a common time reference. The conventions in logging and metrics are designed with this correlation in mind.
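The consistency requirement can be made concrete: the same label values appear in the metric, the log fields, and the trace, so each signal can pivot to the others. All names and IDs below are illustrative.

```python
import json

trace_id = "4bf92f3577b34da6"  # hypothetical ID, propagated via request context

# Labels on the metric that fired the alert.
metric_labels = {"service": "checkout", "status": "500"}

# A log event from the same service, carrying the same keys plus the trace ID.
log_line = json.dumps({
    "service": "checkout",   # matches the metric label exactly
    "status": "500",
    "trace_id": trace_id,    # links this event to its distributed trace
    "message": "downstream call failed",
})
```

With shared labels and a propagated trace ID, the alert-to-logs-to-trace walk in the previous paragraph takes clicks rather than guesswork.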
