Improving the Observability of Golang Services

This blog post is aimed at Go developers looking to improve their services’ observability. It skips the basics and jumps straight to advanced topics: asynchronous structured logging, metrics with exemplars, tracing with TraceQL, aggregating pprof and continuous profiling, microbenchmarks and basic statistics with benchstat, blackbox performance tests, and basic PID controllers for determining a system’s maximum load. We’ll also briefly touch on current research in the observability space, including active causal profiling and passive critical section detection.

The Three Pillars of Observability: Logs, Metrics, Traces

If you’re reading this, you likely don’t need a refresher on the basics of observability. Let’s dive into the non-obvious stuff and focus on making it as easy as possible to move between the three main observability surfaces. We’ll also discuss how to add profiling to the mix so that pprof data can be linked to traces and back.

If you’re instead looking for a short and straightforward introduction to monitoring basics and ways to introduce basic observability into your service quickly, “Distributed Systems Observability” by Cindy Sridharan is a great place to start.

Structured Logging

Logging can become a bottleneck if you’re not using a zero-allocation logging library. If you haven’t already, consider using zap or zerolog – both are great choices.

Library   Time          Memory       Allocations
zerolog    767 ns/op     552 B/op     6 allocs/op
zap        848 ns/op     704 B/op     2 allocs/op
go-kit    3614 ns/op    2895 B/op    66 allocs/op
logrus    5661 ns/op    6092 B/op    78 allocs/op

Go also has an ongoing proposal for introducing structured logging: slog. Be sure to check it out and provide feedback on the proposal!

Up-to-date benchmarks for the libraries above can be found on Logbench.

Structured logging is essential for extracting data from logs. Adopting a JSON or logfmt format simplifies ad-hoc troubleshooting and allows for quick-and-dirty graphs and alerts while you work on proper metrics. Most log libraries also have ready-to-use hooks for gRPC/HTTP clients and servers and for common database clients, which greatly simplifies introducing them into existing codebases.

If text-based formats prove too inefficient, there is still room to optimize. For instance, zerolog supports the binary CBOR format, and Envoy uses protobufs for its structured access logs.

In some cases, logs themselves can become performance bottlenecks. You don’t want your service to get stuck because Docker can’t pull events out of the stderr pipe fast enough when you enable DEBUG logs.

One solution is to sample your logs:


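// Let up to 5 debug messages through per second; beyond that,
// log only 1 in every 100. Other levels are not sampled.
sampled := log.Sample(zerolog.LevelSampler{
    DebugSampler: &zerolog.BurstSampler{
        Burst:       5,
        Period:      1 * time.Second,
        NextSampler: &zerolog.BasicSampler{N: 100},
    },
})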

Alternatively, you can make their emission fully asynchronous so they never block: