Go channels on steroids
Dmitry Vyukov
dvyukov@
Jan 28, 2014
Channels are the main synchronization and communication primitive in Go, so they need to be fast and scalable.
Goals:
- make single-threaded (non-contended) channel operations faster
- make contended buffered (producer/consumer) channel operations faster
- make non-blocking failing operations (e.g. checking of "stop" channel) faster
- make chan semaphores (chan struct{}) faster
- make select statements faster
Non-goals:
- make channels completely lock-free (this would significantly complicate implementation and make it slower for common cases)
- make contended synchronous channel operations faster
The rest of the document describes details of the design.
There are 3 internal types of channels:
1. Sync channels. They do not need any buffering and buffer management code. Also they implement direct hand off semantics (a goroutine directly chooses the pair and accomplishes communication with it).
2. Async channels. These are traditional producer-consumer queues based on a ring buffer. They do not implement hand off semantics -- an unblocked consumer competes on equal terms with other consumers; if it loses the competition, it blocks again.
3. Async channels with zero-sized elements (chan struct{}). These are essentially semaphores. They do not need buffers (they consume O(1) memory) and do not implement hand off semantics.
Before diving into selects, let's first consider how standalone send/recv work.
Sync channels are mostly mutex-protected, except for non-blocking operation failure fast-path (e.g. non-blocking recv from an empty chan). Sync chan contains the following data:
struct Hchan {
    Lock;
    bool closed;
    SudoG* sendq; // waiting senders
    SudoG* recvq; // waiting receivers
};
Send locks the mutex, and checks whether it needs to block or satisfy an inverse operation:
bool syncchansend(Hchan *c, T val, bool block) {
    if(c->closed) // fast-path
        panic("closed");
    if(!block && c->recvq == nil) // fast-path
        return false;
    lock(c);
    if(c->closed) {
        unlock(c);
        panic("closed");
    }
    if(sg = removewaiter(&c->recvq)) {
        // Have a blocked receiver, communicate with it.
        unlock(c);
        sg->val = val;
        sg->completed = true;
        unblock(sg->g);
        return true;
    }
    if(!block) {
        unlock(c);
        return false;
    }
    // Block and wait for a pair.
    sg->g = g;
    sg->val = val;
    addwaiter(&c->sendq, sg);
    unlock(c);
    block();
    if(!sg->completed)
        panic("closed"); // unblocked by close
    // Unblocked by a recv.
    return true;
}
Async send/recv is lock-free when it does not need to manipulate wait queues; the wait queues themselves are protected by the mutex. Non-blocking failing operations are fast-pathed as well.
Let’s first consider how non-blocking operations proceed.
The async channel contains the following data:
struct Hchan {
    uint32 cap; // channel capacity
    Elem* buf; // ring buffer of size cap
    // send and receive positions,
    // low 32 bits represent position in the buffer,
    // high 32 bits represent the current "lap" over the ring buffer
    uint64 sendx;
    uint64 recvx;
};
struct Elem {
    // current lap,
    // the element is ready for writing on laps 0, 2, 4, ...
    // for reading -- on laps 1, 3, 5, ...
    uint32 lap;
    T val; // user data
};
Sends synchronize with each other by advancing sendx with CAS; whoever advances the position writes to the element. Sends synchronize with recvs by means of the lap variable in each element: the lap value says whether the element is ready for reading/writing on the current lap (the high 32 bits of sendx/recvx).
Below is the send algorithm:
bool asyncchansend_nonblock(Hchan* c, T val) {
    uint32 pos, lap, elap;
    uint64 x, newx;
    Elem *e;
    for(;;) {
        x = atomicload64(&c->sendx);
        pos = (uint32)x;
        lap = (uint32)(x >> 32);
        e = &c->buf[pos];
        elap = atomicload32(&e->lap);
        if(lap == elap) {
            // The element is ready for writing on this lap.
            // Try to claim the right to write to this element.
            if(pos + 1 < c->cap)
                newx = x + 1; // just increase the pos
            else
                newx = (uint64)(lap + 2) << 32; // wrap around, advance the lap
            if(!cas64(&c->sendx, x, newx))
                continue; // lost the race, retry
            // We own the element, do a non-atomic write.
            e->val = val;
            // Make the element available for reading.
            atomicstore32(&e->lap, elap + 1);
            return true;
        } else if((int32)(lap - elap) > 0) {
            // The element is not yet read on the previous lap,
            // the chan is full.
            return false;
        } else {
            // The element has already been written on this lap,
            // which means that c->sendx has changed as well,
            // retry.
        }
    }
}
The recv operation is completely symmetrical, except that recvs start at lap 1 and read the element instead of writing it.
Now, let's consider how blocking operations are implemented. The channel structure is extended with a mutex and send/recv waiter queues:
struct Hchan {
    ...
    Lock;
    SudoG* sendq;
    SudoG* recvq;
};
To do a blocking send, a goroutine tries to do the non-blocking send. If it succeeds, it checks whether there are recv waiters, and if so unblocks one of them.
If the non-blocking send fails (the chan is full), it locks the mutex, adds itself to the send waiter queue and after that re-checks if the chan is still full. If the chan is full, the goroutine blocks. If the chan is not full, the goroutine removes itself from waiter queue, unlocks the mutex and retries.
Blocking receive proceeds exactly the same way, modulo s/send/recv/ and s/recv/send/.
The main tricky aspect of such blocking algorithms is to ensure that no deadlocks are possible (a sender blocked indefinitely on a non-full channel, or a receiver blocked indefinitely on a non-empty channel). By doing this check-store-recheck dance, we ensure that either (1) the sender sees that there is a recv waiter and unblocks it, or (2) the receiver sees the element in the buffer and consumes it, or (3) both 1 and 2 (in this case the competition is resolved by means of the mutex); but NOT (4) the sender does not see recv waiters, the receiver does not see the element in the buffer, and both block indefinitely.
Below is the blocking send algorithm:
void asyncchansend(Hchan* c, T val) {
    for(;;) {
        if(asyncchansend_nonblock(c, val)) {
            // Send succeeded, see if we need to unblock a receiver.
            if(c->recvq != nil) {
                lock(c);
                sg = removewaiter(&c->recvq);
                unlock(c);
                if(sg != nil)
                    unblock(sg->g);
            }
            return;
        } else {
            // The channel is full.
            lock(c);
            sg->g = g;
            addwaiter(&c->sendq, sg);
            if(notfull(c)) {
                removewaiter(&c->sendq, sg);
                unlock(c);
                continue;
            }
            unlock(c);
            block();
            // Retry send.
        }
    }
}
Zero-sized async channels generally resemble non-zero-sized async channels:
- operations are lock-free in non-blocking case
- wait queues are still protected by the mutex
- non-blocking failing operations are fast-pathed
The differences are:
- Hchan contains a single counter instead of send/recv positions and the ring buffer; the counter represents the number of elements in the channel
- non-blocking send/receive do a CAS loop to adjust the counter
- full/empty predicates merely check the counter value
The rest, including the blocking algorithm, is the same.
The close operation locks the mutex, sets the closed flag and then unblocks all waiters. Async send/recv operations check the closed flag before blocking.
This achieves the same guarantees as for blocking async send/recv. Namely, either (1) close sees a waiter, or (2) a waiter sees the closed flag set, or (3) both 1 and 2 (in this case the competition is again resolved by means of the mutex).
Now we are ready for The Select.
The select operation does not lock the mutexes of all involved channels at once; instead, it proceeds by doing fine-grained operations on individual channels.
Select consists of 4 phases:
0. Shuffle all involved channels to provide the pseudo-random guarantee (all subsequent phases work with this shuffled list of channels).
1. Check all channels one-by-one to see if any of them is ready for communication, if so do the communication and exit. This makes selects that do not block faster and more scalable, as they do not need to sort and lock mutexes. Moreover, such select does not even need to touch all channels if the first one is ready.
2. Prepare for blocking on all channels.
3. Block. Goto 1.
Phase 2 needs a more detailed description.
Essentially it proceeds the same way as blocking in async send/recv. That is, lock channel mutex, add itself to send/recv waiter queue and after that re-check if the chan is still not ready for communication. If the channel is not ready, then proceed to the next channel. Otherwise, remove itself from all waiter queues and goto phase 1.
There is another tricky aspect. We add the select as a waiter to several channels, but we do not want several sync channel operations to complete communication with the select (for sync channels, unblocking completes a successful communication). To prevent this, select-related entries in waiter queues contain a pointer to a select-global state word. Before unblocking such a waiter, other goroutines try to CAS(statep, nil, sg), which gives them the right to unblock/communicate with the waiter. If the CAS fails, the goroutine ignores the waiter (it is being signaled by somebody else).
This algorithm requires implementing an isready(c) predicate for all channel types, which is not a significant problem. The high-level algorithm of a select operation follows:
Scase *select(Select *sel) {
    randomize channel order;
    for(;;) {
        // Phase 1.
        foreach(Scase *cas in sel) {
            if(chansend/recv_nonblock(cas->c, ...))
                return cas;
        }
        // Phase 2.
        selectstate = nil;
        foreach(Scase *cas in sel) {
            lock(cas->c);
            cas->sg->g = g;
            cas->sg->selectstatep = &selectstate;
            addwaiter(&cas->c->sendq/recvq, cas->sg);
            if(isready(cas->c)) {
                unlock(cas->c);
                goto ready;
            }
            unlock(cas->c);
        }
        // Phase 3.
        block();
    ready:
        CAS(&selectstate, nil, 1);
        foreach(Scase *cas in sel) {
            lock(cas->c);
            removewaiter(&cas->c->sendq/recvq, cas->sg);
            unlock(cas->c);
        }
        // If we were unblocked by a sync chan operation,
        // the communication has completed.
        if(selectstate > 1)
            return selectstate; // denotes the completed case
    }
}
There are two prerequisites (on top of revision 18999). Since chan-related data structures and algorithms will change significantly, we need to:
1. Remove special handling of chans from GC (GC_CHAN_PTR program). It's possible to adopt GC_CHAN_PTR for new representations, but it looks better to just give chan objects proper types.
2. Similarly, move chanlen/chancap into runtime instead of guessing in the compiler which word means what (there is no single len word in async chans).
Below is an annotated evaluation of the prototype (https://codereview.appspot.com/12544043/) on standard synthetic channel benchmarks.
benchmark old ns/op new ns/op delta
BenchmarkChanNonblocking 24 8 -66.53%
BenchmarkChanNonblocking-2 92 4 -95.57%
BenchmarkChanNonblocking-4 104 2 -97.95%
BenchmarkChanNonblocking-8 114 1 -99.02%
BenchmarkChanNonblocking-16 92 0 -99.37%
BenchmarkChanNonblocking-32 82 0 -99.46%
// This takes advantage of the non-blocking fast path.
BenchmarkSelectUncontended 222 159 -28.38%
BenchmarkSelectUncontended-2 128 97 -23.98%
BenchmarkSelectUncontended-4 87 52 -39.79%
BenchmarkSelectUncontended-8 46 29 -37.69%
BenchmarkSelectUncontended-16 23 15 -34.75%
BenchmarkSelectUncontended-32 18 11 -36.26%
// Single-threaded speedup, because we don’t sort/lock all mutexes and don’t zero select descriptor.
BenchmarkSelectContended 221 152 -31.22%
BenchmarkSelectContended-2 459 159 -65.36%
BenchmarkSelectContended-4 678 254 -62.54%
BenchmarkSelectContended-8 985 365 -62.94%
BenchmarkSelectContended-16 749 337 -55.01%
BenchmarkSelectContended-32 731 421 -42.41%
// Scalability improvements caused by finer-grained locking and lock-free paths.
BenchmarkSelectNonblock 104 34 -66.92%
BenchmarkSelectNonblock-2 51 17 -66.60%
BenchmarkSelectNonblock-4 25 8 -64.56%
BenchmarkSelectNonblock-8 12 4 -63.44%
BenchmarkSelectNonblock-16 6 2 -63.22%
BenchmarkSelectNonblock-32 5 2 -56.15%
// Again, non-blocking fast paths.
BenchmarkChanUncontended 61 37 -38.30%
BenchmarkChanUncontended-2 30 18 -39.07%
BenchmarkChanUncontended-4 15 9 -38.58%
BenchmarkChanUncontended-8 8 5 -37.20%
BenchmarkChanUncontended-16 4 2 -36.43%
BenchmarkChanUncontended-32 3 2 -32.46%
// Single-threaded speedup, because instead of lock/unlock we do a single CAS.
BenchmarkChanContended 60 39 -34.05%
BenchmarkChanContended-2 268 291 +8.58%
BenchmarkChanContended-4 301 289 -3.99%
BenchmarkChanContended-8 331 332 +0.30%
BenchmarkChanContended-16 254 591 +132.68%
BenchmarkChanContended-32 242 726 +200.00%
// Extremely contended async chans are slower because of increased contention in lock-free paths.
BenchmarkChanSync 127 127 +0.00%
BenchmarkChanSync-2 395 393 -0.51%
BenchmarkChanSync-4 358 340 -5.03%
BenchmarkChanSync-8 293 312 +6.48%
BenchmarkChanSync-16 353 361 +2.27%
BenchmarkChanSync-32 216 220 +1.85%
// Mostly noise due to extreme contention.
BenchmarkChanProdCons0 134 140 +4.48%
BenchmarkChanProdCons0-2 407 573 +40.79%
BenchmarkChanProdCons0-4 667 745 +11.69%
BenchmarkChanProdCons0-8 934 883 -5.46%
BenchmarkChanProdCons0-16 760 762 +0.26%
BenchmarkChanProdCons0-32 700 759 +8.43%
// Highly contended sync channels are slower (due to fast paths), but are not interesting.
BenchmarkChanProdCons10 88 71 -18.86%
BenchmarkChanProdCons10-2 223 104 -53.36%
BenchmarkChanProdCons10-4 699 197 -71.82%
BenchmarkChanProdCons10-8 821 512 -37.64%
BenchmarkChanProdCons10-16 616 588 -4.55%
BenchmarkChanProdCons10-32 553 575 +3.98%
// Highly contended async chans are faster with few threads (because they are lock-free), but slower with more threads (because of increased contention in the lock-free paths); both results are not very interesting.
BenchmarkChanProdCons100 68 43 -37.12%
BenchmarkChanProdCons100-2 206 95 -53.83%
BenchmarkChanProdCons100-4 397 338 -14.86%
BenchmarkChanProdCons100-8 400 469 +17.25%
BenchmarkChanProdCons100-16 310 773 +149.35%
BenchmarkChanProdCons100-32 291 753 +158.76%
// Same as BenchmarkChanProdCons10.
BenchmarkChanProdConsWork0 730 691 -5.34%
BenchmarkChanProdConsWork0-2 427 487 +14.05%
BenchmarkChanProdConsWork0-4 767 855 +11.47%
BenchmarkChanProdConsWork0-8 1096 1062 -3.10%
BenchmarkChanProdConsWork0-16 906 965 +6.51%
BenchmarkChanProdConsWork0-32 858 916 +6.76%
// Sync channels are a bit slower under moderate load (goroutines do some local work), because of the fast paths. Use buffering!
BenchmarkChanProdConsWork10 648 594 -8.33%
BenchmarkChanProdConsWork10-2 593 519 -12.48%
BenchmarkChanProdConsWork10-4 1089 344 -68.41%
BenchmarkChanProdConsWork10-8 1334 500 -62.52%
BenchmarkChanProdConsWork10-16 1107 572 -48.33%
BenchmarkChanProdConsWork10-32 1005 575 -42.79%
// This is a much more interesting producer/consumer case - buffered channel and goroutines do some local work -- new channels are significantly faster because there are no mutexes.
BenchmarkChanProdConsWork100 620 607 -2.10%
BenchmarkChanProdConsWork100-2 543 365 -32.78%
BenchmarkChanProdConsWork100-4 823 211 -74.36%
BenchmarkChanProdConsWork100-8 1030 563 -45.34%
BenchmarkChanProdConsWork100-16 855 755 -11.70%
BenchmarkChanProdConsWork100-32 788 760 -3.55%
// Same as BenchmarkChanProdConsWork10.
BenchmarkSelectProdCons 1180 1031 -12.63%
BenchmarkSelectProdCons-2 880 683 -22.39%
BenchmarkSelectProdCons-4 1213 433 -64.30%
BenchmarkSelectProdCons-8 1613 578 -64.17%
BenchmarkSelectProdCons-16 1298 805 -37.98%
BenchmarkSelectProdCons-32 1289 773 -40.03%
// Takes advantage of finer-grained locking and lock-free paths.
BenchmarkChanCreation 150 108 -28.00%
BenchmarkChanCreation-2 104 65 -37.31%
BenchmarkChanCreation-4 56 56 -0.70%
BenchmarkChanCreation-8 57 47 -18.06%
BenchmarkChanCreation-16 63 48 -23.17%
BenchmarkChanCreation-32 77 51 -34.53%
// This is faster because the benchmark also includes send/recv operations; chan creation itself should not be significantly affected.
BenchmarkChanSem 55 28 -48.01%
BenchmarkChanSem-2 260 76 -70.69%
BenchmarkChanSem-4 303 95 -68.38%
BenchmarkChanSem-8 309 116 -62.46%
BenchmarkChanSem-16 215 134 -37.67%
BenchmarkChanSem-32 196 153 -21.94%
// Takes advantage of special lock-free paths for chan struct{}.