
runtime: remove idle GC workers for the 2-minute GC cycle #44163

Closed
@mknyszek

Description


Background

Today the Go GC will spin up dedicated GC workers that use 25% of GOMAXPROCS worth of CPU resources. But it will also soak up any additional processors that are idle, running "idle GC workers" on them.

The idea here is that those cores are idle anyway, so why not use them to speed up the GC cycle? After all, there are negative performance implications to having the GC on longer than necessary: floating garbage accumulates, costing memory, and the write barrier stays enabled, costing CPU time.

EDIT: To be clear, I'm not proposing removing dedicated GC workers, which use 25% of CPU resources during a GC cycle. Those are very, very important. Just the idle workers that can increase that number to 100% if the application is idle.
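For a concrete sense of scale, here is a small sketch (hypothetical names, not the runtime's actual code) of how a 25% background utilization target turns into a dedicated worker count. The real runtime (see runtime/mgcpacer.go) additionally handles the remainder by running a part-time "fractional" worker:

```go
package main

import "fmt"

// gcBackgroundUtilization mirrors the GC's 25% background CPU target.
// This is an illustrative constant, not an import from the runtime.
const gcBackgroundUtilization = 0.25

// dedicatedWorkers sketches how a utilization target maps onto worker
// counts: the whole part becomes fully dedicated mark workers, and any
// remainder would be served by a part-time "fractional" worker.
func dedicatedWorkers(gomaxprocs int) (dedicated int, fractional float64) {
	target := float64(gomaxprocs) * gcBackgroundUtilization
	dedicated = int(target)
	fractional = target - float64(dedicated)
	return
}

func main() {
	for _, p := range []int{1, 4, 8} {
		d, f := dedicatedWorkers(p)
		fmt.Printf("GOMAXPROCS=%d: dedicated=%d fractional=%.2f\n", p, d, f)
	}
}
```

The point of this issue is what happens on top of that baseline: on an otherwise idle machine, idle workers can take utilization from 25% all the way to 100%.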

Why?

When a Go process is mostly idle, idle enough that the "every-2-minutes" GC kicks in, it's going to go full blast on all CPU cores, resulting in a CPU utilization spike. While this on its own isn't a big deal, it makes a difference in the face of monitoring systems and auto-scaling systems. The former is surprising: one might see these spikes in their monitoring and spend time investigating them, only to find there's nothing to be done about them (today). The latter is dangerous: if an auto-scaling system decides to give an application more resources because it notices this spike, then the Go GC is going to soak up those resources too, resulting in a significant (if temporary, and generally bounded) increase in resource usage.

From the earlier proposal to remove idle GC workers (no longer relevant):

I believe today these idle GC workers are doing more harm than good.

Firstly, there's the CPU utilization spike, and its interaction with monitoring and auto-scaling systems, described in the "Why?" section above.

Secondly, the Go pacer fails to account for this extra CPU utilization. This significantly skews the pacer's function: its goal is to reach some GC CPU utilization target, but that target is somewhat meaningless if the idle GC workers can kick in and blow past it in any application that has idle time. The result is that when idle GC workers run, the pacer consistently undershoots its target in the steady state.

Worse still, this extra CPU utilization isn't easy to account for in the pacer. If we pace assuming the idle GC workers are running, it's possible for the GC to really panic and spike utilization just when you really need it. For instance, take an application with many goroutines blocked on syscalls. Let's say most of the time they're not doing a lot, but suddenly (by chance) many wake up at once during a GC cycle. Suddenly the GC CPU utilization will drop (no space for idle workers) and the GC will think it's extremely behind, so at that moment it'll start forcing all these goroutines to assist, basically up to the utilization that the idle workers were doing. This makes our notion of a steady-state fragile, since it now depends on the CPU resources available. (I have a new pacer design in the works and while it can deal with this, it ends up with consistent undershoot when the idle workers are running; I don't think you can win here.)

Thirdly, there's a latency cost. Consider again the case of goroutines waiting on, say, network requests. The Go scheduler is going to schedule idle GC workers, and while they yield readily (they're "low priority" in a sense), the work they run is largely non-preemptible and likely will be for the foreseeable future. The cost of scheduling them away is small, but not free.

Why not?

The application could suddenly become busy during the GC cycle, and there would be a latency cost in the form of assists (vs. idle GC workers, which tend to introduce very little latency cost). Given that the 2-minute GC has kicked in, however, I really doubt that this will ever be an issue. The application's steady state would have to change mid-GC, and on a time scale of 2 minutes, that's already extremely unlikely.

From the earlier proposal to remove idle GC workers (no longer relevant):

I'm honestly not sure. The only scenario I could construct where the idle GC workers come in handy is a bursty application: if the GC happens to fire outside of a burst, the idle workers let it finish as fast as possible while the application is quiet, which is a nice boost to throughput and latency.

But this argument is still somewhat dubious because there's no guarantee the GC won't just fire during the burst, so I'm not sure how much this actually helps. However, one reason I'm creating this issue is that maybe someone out there can think of another scenario, or has another use-case.

Why not now?

I don't think there's any reason not to do this very targeted fix. It will require some additional testing to make sure everything still works correctly, because the 2-minute GC is a fairly rare case, so bugs here will manifest as surprising and very rare failures.

From the earlier proposal to remove idle GC workers (no longer relevant):

Removing the idle GC workers is unfortunately going to be risky from an implementation standpoint. @prattmic tried this out and unfortunately it uncovered at least one bug. It's possible that over time the idle GC workers have become critical to GC progress in some scenarios. Oops.

@prattmic concluded from his investigation into this that it's going to uncover a non-trivial amount of latent, subtle bugs (and I say bugs since theoretically the idle GC workers should have no effect on the function of the GC otherwise) and it's going to take a while to squash all of them.

As a result, it's going to be critical to land this early in the development cycle and give it plenty of time to soak, with an easy roll-back option (perhaps a runtime-internal constant).
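The "runtime-internal constant" roll-back option could take roughly this shape (all names here are hypothetical, not the runtime's actual code): a compile-time gate that the scheduler consults before handing an idle P a mark worker, so the old behavior is one constant-flip away.

```go
package main

import "fmt"

// allowIdleGCWorkers gates whether idle Ps may be given GC mark work.
// Flipping it back to true would be the easy roll-back described above.
const allowIdleGCWorkers = false

// canStartIdleWorker sketches the scheduler-side check: an idle P may
// run a GC mark worker only while a mark phase is active and the
// roll-back gate is enabled.
func canStartIdleWorker(gcMarkActive bool) bool {
	return allowIdleGCWorkers && gcMarkActive
}

func main() {
	// With the gate off, an idle P never picks up mark work,
	// even mid-cycle.
	fmt.Println(canStartIdleWorker(true))
	fmt.Println(canStartIdleWorker(false))
}
```

A constant (rather than a runtime flag) keeps the check free at runtime: the branch compiles away entirely when the gate is off.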

However, despite all this, I think this is worth doing sooner rather than later. The longer we wait, the greater chance we have that seemingly orthogonal changes to the GC tangle this more.

CC @aclements @prattmic

Metadata

Labels: FrozenDueToAge, GarbageCollector, NeedsDecision (feedback is required from experts, contributors, and/or the community before a change can be made), early-in-cycle (a change that should be done early in the 3 month dev cycle)