Skip to content

runtime/trace: execution tracer overhaul #60773

Closed
@mknyszek

Description

@mknyszek

Execution tracer overhaul

Authored by mknyszek@google.com with a mountain of input from others.

In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan Boreham for your invaluable feedback.

Background

Original design document.

Go execution traces provide a moment-to-moment view of what happens in a Go program over some duration. This information is invaluable in understanding program behavior over time and can be leveraged to achieve significant performance improvements. Because Go has a runtime, it can provide deep information about program execution without any external dependencies, making traces particularly attractive for large deployments.

Unfortunately limitations in the trace implementation prevent widespread use.

For example, the process of analyzing execution traces scales poorly with the size of the trace. Traces need to be parsed in their entirety to do anything useful with them, making them impossible to stream. As a result, trace parsing and validation has very high memory requirements for large traces.

Also, Go execution traces are designed to be internally consistent, but don't provide any way to align with other kinds of traces, for example OpenTelemetry traces and Linux sched traces. Alignment with higher level tracing mechanisms is critical to connecting business-level tasks with resource costs. Meanwhile alignment with lower level traces enables a fully vertical view of application performance to root out the most difficult and subtle issues.

Lastly, the implementation of the execution tracer has evolved organically over time and it shows. The codebase also has many old warts and some age-old bugs that make collecting traces difficult, and seem broken. Furthermore, many significant decision decisions were made over the years but weren't thoroughly documented; those decisions largely exist solely in old commit messages and breadcrumbs left in comments within the codebase itself.

Thanks to work in Go 1.21 cycle, the execution tracer's run-time overhead was reduced from about -10% throughput and +10% request latency in web services to about 1% in both for most applications. This reduced overhead in conjunction with making traces more scalable enables some exciting and powerful new opportunities for traces.

Goals

The goal of this document is to define an alternative implementation for Go execution traces that scales up to large Go deployments.

Specifically, the design presented aims to achieve:

  • Make trace parsing require a small fraction of the memory it requires today.
  • Streamable traces, to enable analysis without storage.
  • Partially self-describing traces, to reduce the upgrade burden on trace consumers.
  • Fix age-old bugs and present a path to clean up the implementation.

This document also describes the existing state of the tracer in detail and explains how we got there.

Design

Link to design document.

CC @felixge @nsrip-dd @prattmic @aclements @rhysh @dominikh @bboreham @thepudds

Metadata

Metadata

Assignees

Labels

FrozenDueToAgeNeedsFixThe path to resolution is known, but the work has not been done.compiler/runtimeIssues related to the Go compiler and/or runtime.

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions