# Metrics
Metrics are quantitative measurements sampled over time. Where logs capture individual events, metrics describe continuous signals — request rates, error percentages, resource utilization. They answer the question "how is the system behaving?" without the per-event cost of logging.
A single counter takes the same amount of storage whether your service handles ten requests or ten million. This efficiency is what makes metrics the right tool for dashboards, alerting, and capacity planning.
## Conventions
**snake_case for everything.** Metric names, label keys, and label values all follow the same convention we use for logging. Consistency across signals makes dashboarding and correlation straightforward.

**Include the unit in the name.** `http_request_duration_seconds` is unambiguous. `http_request_duration` is not — is it seconds, milliseconds, or microseconds? Prometheus conventions use base units: seconds for duration, bytes for size, ratios for percentages (0.0 to 1.0).

**Mind your cardinality.** Each unique combination of label values creates a distinct time series. A metric with labels `method` and `status` is manageable. Add `user_id` and you've created a series for every user — that's a fast path to storage bloat and slow queries. Reserve high-cardinality identifiers for logs and traces.
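To make the cost concrete: the series count is the product of the number of distinct values per label. A quick back-of-envelope sketch (the label counts here are illustrative, not from any real deployment):

```go
package main

import "fmt"

// seriesCount multiplies the number of distinct values per label —
// each unique combination of label values is its own time series.
func seriesCount(distinctValues ...int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	// method (5) x status (5): perfectly manageable.
	fmt.Println(seriesCount(5, 5)) // 25 series

	// Add user_id with 100k users: 2.5 million series for one metric.
	fmt.Println(seriesCount(5, 5, 100_000)) // 2500000 series
}
```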
## Metric Types
Understanding the four fundamental types prevents misuse — and misuse leads to dashboards that look right but tell you the wrong thing.
**Counters** are monotonically increasing values. They only go up (or reset to zero when the process restarts). Total requests served, total errors, total bytes transferred. You almost never look at a counter's raw value — you apply `rate()` to see the per-second change.
```go
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests processed.",
	},
	[]string{"method", "status"},
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
}

// In your handler:
httpRequestsTotal.WithLabelValues(r.Method, strconv.Itoa(statusCode)).Inc()
```
**Gauges** represent a value that goes up and down. Current memory usage, active connections, queue depth, temperature. Gauges are sampled at scrape time — they show the instantaneous value, not the average.
```go
var activeConnections = prometheus.NewGauge(
	prometheus.GaugeOpts{
		Name: "db_active_connections",
		Help: "Number of active database connections.",
	},
)

func init() {
	prometheus.MustRegister(activeConnections)
}

// When connections change:
activeConnections.Set(float64(pool.Stats().InUse))
```
**Histograms** observe values and count them into configurable buckets. They're the right type for latency and size distributions — anything where the distribution matters more than the average. A histogram produces three time series: `_bucket` (counts per bucket), `_sum` (total of all observed values), and `_count` (total number of observations).
```go
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency distribution.",
		Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
	},
	[]string{"method", "path"},
)

func init() {
	prometheus.MustRegister(requestDuration)
}

// In your handler:
start := time.Now()
// ... handle request ...
requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(time.Since(start).Seconds())
```
Choose bucket boundaries that reflect your SLOs. If your target is "95% of requests under 500ms," you need buckets around that boundary to measure it accurately.
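One way to honor that advice is to derive the bucket list from the SLO itself, clustering resolution around the boundary. A sketch — `bucketsAroundSLO` is our hypothetical helper, not a Prometheus client function:

```go
package main

import "fmt"

// bucketsAroundSLO returns histogram bucket boundaries with extra
// resolution near the SLO threshold, so the quantile you care about
// is measured precisely instead of interpolated across a wide bucket.
func bucketsAroundSLO(sloSeconds float64) []float64 {
	return []float64{
		sloSeconds / 10, // far below the boundary: coarse is fine
		sloSeconds / 4,
		sloSeconds / 2,
		sloSeconds * 0.9, // tight resolution around the boundary
		sloSeconds,
		sloSeconds * 1.1,
		sloSeconds * 2, // far above: coarse again
		sloSeconds * 5,
	}
}

func main() {
	// For a 500ms SLO: 0.05, 0.125, 0.25, 0.45, 0.5, 0.55, 1, 2.5
	fmt.Println(bucketsAroundSLO(0.5))
}
```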
**Summaries** are similar to histograms but calculate quantiles on the client side. They're less flexible — you can't aggregate summaries across instances, and the quantile values are fixed at instrumentation time. Prefer histograms unless you have a specific reason not to.
## Instrumentation

### The Standard Metrics
Every HTTP service should expose at minimum:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `http_requests_total` | Counter | `method`, `status` | Traffic and error rate |
| `http_request_duration_seconds` | Histogram | `method`, `path` | Latency distribution |
| `http_requests_in_flight` | Gauge | — | Current concurrency / saturation |
These three metrics, combined with infrastructure metrics from prometheus-exporters, cover all four golden signals.
### Background Jobs
For recurring jobs and batch processing, the pattern is similar — just adapted for the lifecycle:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `job_runs_total` | Counter | `job_name`, `result` | Completion count (success/failure) |
| `job_duration_seconds` | Histogram | `job_name` | How long runs take |
| `job_items_processed_total` | Counter | `job_name` | Throughput |
| `job_last_success_timestamp_seconds` | Gauge | `job_name` | Staleness detection |
The `job_last_success_timestamp_seconds` gauge is particularly useful for alerting — if the value is too far in the past, the job has stopped succeeding.
### Exposing a Metrics Endpoint

Prometheus scrapes metrics over HTTP. Expose a `/metrics` endpoint that serves the Prometheus exposition format:
```go
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

mux := http.NewServeMux()
mux.Handle("/metrics", promhttp.Handler())
```
This endpoint should not require authentication — the Prometheus server needs to reach it, and the metrics themselves are not sensitive. Place it on an internal port or behind network-level access controls if needed.
## Collection Strategies
**Pull-based collection** means a central system (like Prometheus) scrapes your service at regular intervals. The service exposes a metrics endpoint; the collector comes to it. This works well when your infrastructure can reach every target, and it makes it easy to tell when a service has gone silent — a missing scrape target is itself a signal.

**Push-based collection** means the service sends metrics to a central receiver. This is the better choice when network isolation prevents inbound connections, or when services are short-lived and may not survive until the next scrape.
Neither approach is universally better. Choose based on your network topology and service lifecycle.
## Alerting
Metrics become actionable through alerting rules. An alert defines a condition that, when true for a sustained period, triggers a notification.
```yaml
# Alert when error rate exceeds 5% for 5 minutes
groups:
  - name: http
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m  # must persist to avoid flapping
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.instance }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency exceeds 2 seconds"
```
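The arithmetic in the HighErrorRate expression is just a ratio of per-second rates. The same check stated in Go, with illustrative traffic numbers:

```go
package main

import "fmt"

// errorRate mirrors the alert expression: the rate of 5xx responses
// divided by the rate of all responses over the same window.
func errorRate(errorsPerSec, totalPerSec float64) float64 {
	if totalPerSec == 0 {
		return 0 // no traffic means no error rate to report
	}
	return errorsPerSec / totalPerSec
}

func main() {
	// 12 errors/sec out of 200 req/sec is 6% — above the 5% threshold.
	fmt.Println(errorRate(12, 200) > 0.05) // true
}
```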
### Alert Design Principles
**Alert on symptoms, not causes.** Users experience high latency and errors — alert on those. CPU usage is a cause; it might matter or it might not. If high CPU doesn't affect latency or error rate, it isn't a problem yet.

**Use the `for` clause.** A condition must persist for the specified duration before firing. This prevents alerts from flapping on transient spikes. Five minutes is a reasonable default for most service-level alerts.

**Severity should reflect urgency.** `critical` means someone needs to look now — it pages on-call. `warning` means something needs attention within business hours. Don't cry wolf with `critical` alerts on non-urgent conditions.

**Every alert needs a runbook.** The annotation should link to (or contain) enough context that the person responding knows what to check first. An alert without guidance is just anxiety.
## Runbooks
A runbook is a sequence of steps that takes you from "this alert fired" to "service is restored." The priority is always restoring normal operation first — especially for user-facing errors. Root cause analysis comes after the bleeding stops.
Good runbooks share a structure: confirm the problem, correlate with other signals, take the most conservative corrective action, verify recovery, then investigate at leisure. The best runbooks can be automated entirely.
### HighErrorRate

**Symptom:** 5xx error rate exceeds threshold.
- **Check logs.** Open grafana-loki filtered to the alerting service and the time window of the alert. The error-level logs should tell you exactly what's failing — a downstream dependency, a database query, a nil pointer, a bad deployment.
- **Check recent deployments.** Query git history for the service: `git log --since="2 hours ago" --oneline`. If a deploy landed shortly before the error spike, it's the likely cause.
- **Roll back.** If a recent change correlates, revert to the last known-good commit and redeploy. Don't debug in production while users are affected — restore service first, investigate the diff later.
- **Check dependencies.** If no recent deploy, check the health of downstream services (databases, APIs, message queues). A dependency outage will surface as errors in every service that calls it.
- **Verify recovery.** Watch the error rate metric. Once it drops below threshold and stays there for the `for` duration, the alert will resolve.
### HighLatency

**Symptom:** p99 request latency exceeds SLO.
- **Check saturation metrics.** High latency often follows resource exhaustion. Look at CPU, memory, connection pool utilization, and disk I/O on the affected service and its database.
- **Check logs for slow operations.** Filter for requests with high `duration_ms` values. Are they concentrated on a specific endpoint or query?
- **Scale if saturated.** If the cause is resource exhaustion under legitimate load:
  - **Horizontal scaling** — increase replica count or spawn additional workers. This is the right response when the workload is parallelizable and the bottleneck is compute or concurrency.
  - **Vertical scaling** — upgrade the compute tier of the constrained resource (e.g., a database server that's hit its memory or IOPS ceiling). This is the right response when the bottleneck is a single resource that can't be distributed.
- **Check for regressions.** If load hasn't changed but latency has, a recent code change may have introduced an expensive query or removed a cache. Correlate with deploy history and roll back if needed.
- **Verify recovery.** Watch the p99 metric return to normal levels.
### JobStale

**Symptom:** `job_last_success_timestamp_seconds` is too far in the past — a scheduled job has stopped completing successfully.
- **Check job logs.** Filter for the job name. Look for error or panic entries. Common causes: a dependency is down, input data has changed shape, a rate limit was hit.
- **Check if the job is still running.** A job that's stuck (not failing, just slow) won't update the timestamp either. Look at process or pod status.
- **Restart the job.** If it's stuck, kill the current run and trigger a fresh one. If it's failing, fix the immediate cause (restore the dependency, adjust the input) and retrigger.
- **Verify recovery.** Confirm the timestamp updates after the next successful run.
### Automating Runbooks
The steps above are deliberately mechanical — confirm, correlate, act, verify. This makes them candidates for automation. A runbook that a human follows reliably is a runbook that a script can execute:
- **Auto-rollback** — if error rate spikes within minutes of a deploy, automatically revert to the previous version. This is the highest-value automation because it covers the most common cause of production incidents.
- **Auto-scaling** — if saturation metrics cross a threshold, trigger horizontal scaling (add replicas) or vertical scaling (resize the instance) without human intervention. Build in cooldown periods to prevent thrashing.
- **Auto-restart** — if a job has been stuck beyond its expected duration, kill and retrigger it. This handles transient failures (network blips, lock contention) that resolve on retry.
Automation doesn't replace understanding — it buys you time. The system restores itself while you investigate why it broke.
## Retention
Not all data needs the same resolution forever. A common pattern:
- Recent data at full resolution for real-time debugging
- Downsampled data at reduced resolution (e.g., 5-minute averages) for weeks-to-months analysis
- Long-term archives compacted and stored in object storage (S3, GCS) for historical trends and capacity planning
Federated collection — aggregating from multiple regional nodes — extends this model to larger-scale systems where no single instance can hold everything.