runtime: defer/recover support in race detector #26813
Comments
CC @dvyukov |
Difference between |
@adilbaig can you provide a step-by-step repro instructions? |
It is there in the code comments, I believe. Did you want anything specific? |
Something that I can copy-paste into my console. |
@dvyukov How about this:

```sh
# Have RabbitMQ and Postgres set up (tested with version 9.6), on Ubuntu
sudo apt install rabbitmq-server postgresql-9.6
go get "github.com/streadway/amqp"

go build -o race rabbitmqmemoryleak.go && ./race
# Open `top` in another shell and see that memory usage (RES) is flat for this process. CTRL+C ./race.

go build -race -o race rabbitmqmemoryleak.go && ./race
# Now, open `top` in another shell and see the memory leak.
```

You don't actually need to create the Postgres tables. I see the leaks when the SQL errors out (for example due to bad login credentials). |
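The rabbitmqmemoryleak.go source itself is not reproduced in this thread. The sketch below is only a hypothetical reconstruction of a program with the shape described: the queue name, connection strings, and the choice of lib/pq as the Postgres driver are assumptions, and failingQuery stands in for the someErroneousSQLTransaction function shown later in the thread.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver
	"github.com/streadway/amqp"
)

// failingQuery stands in for someErroneousSQLTransaction (shown later in the
// thread): any query that errors out on every message reproduces the report.
func failingQuery(db *sql.DB) error {
	var id int64
	return db.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&id)
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	msgs, err := ch.Consume("hits", "", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	db, err := sql.Open("postgres", "postgres://bad:creds@localhost/ads?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// One long-lived goroutine handling one message after another; every
	// message hits a SQL error. Build with and without -race and compare
	// RES in top.
	for range msgs {
		if err := failingQuery(db); err != nil {
			log.Println(err)
		}
	}
}
```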
Can reproduce it now.
So it will crash without the race detector sooner or later too, make the GC slow, make crash messages unreadable, etc. |
I haven't seen that happen, as I ran it without the race detector over the weekend. Can you confirm that it's stable without `-race`? I'm a novice at Go, but, given your explanation, am I right to think this will be an issue for any evented system (that errors out on events)? |
no.
I don't know about these libraries, but what first comes to my mind is that there is a way to use them without infinite recursion. You both have an infinite loop in handler and call it recursively. I would expect that you need to do only 1 of these things, but not both. Have you looked at other usage examples of these libraries? Do all of them use it in the same way? |
I guess it only looks recursive in the dump, but it isn't; it's just running in a loop, one message after another, erroring out at every message. So I'm guessing that, with the race detector on, a reference to something in the
Yep, it's largely just fetching and processing a message in a loop. Here's a canonical example from the site: https://www.rabbitmq.com/tutorials/tutorial-one-go.html (Code is at https://github.com/rabbitmq/rabbitmq-tutorials/blob/master/go/receive.go) |
No, it calls someErroneousSQLTransaction which calls sql.(*DB).Begin which calls handler again. |
Maybe I am missing something, but I fail to see how a call to `Begin` can end up calling the handler again. Also, @adilbaig, you are creating a transaction but are not using the tx to make your queries; instead, you are using the main db handle itself to make the query. Also, we already have options for transaction isolation levels - https://golang.org/pkg/database/sql/#IsolationLevel. You can set them directly while creating a transaction with `BeginTx`. My suspicion is that creating multiple transactions from multiple goroutines is somehow causing this behavior in the race detector. Can you try that and see if it makes a difference? |
@agnivade That was a mistake. Thanks for pointing it out!
Nice! I will use that in my code. The change doesn't make a difference though. I also think the bug resides somewhere in sql creating connections (not necessarily transactions) inside goroutines. Here's the code:

```go
func someErroneousSQLTransaction(pgb *sql.DB) error {
	ctx := context.Background()
	refTx, err := pgb.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			refTx.Rollback()
		} else {
			refTx.Commit()
		}
	}()

	var referenceID int64
	switch err = refTx.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&referenceID); err {
	case sql.ErrNoRows:
		break
	case nil:
		break
	default:
		return err
	}
	return nil
}
```
If that is the case, then I think we can take rabbitMQ out of the equation. Perhaps you can shorten the test by just spawning 10 goroutines which share a `*sql.DB` and perform the same query, and run that with the race detector to see what happens. Also, we are still unclear as to why the behavior does not show with `go run -race`. |
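A stripped-down test along the lines suggested above might look like the following. This is only a hypothetical sketch: the deliberately bad DSN, the query, and the use of lib/pq as the driver are assumptions, not code from the original report.

```go
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq" // assumed Postgres driver
)

func main() {
	// sql.Open does not connect immediately; every query below will fail
	// because the DSN is intentionally bogus.
	db, err := sql.Open("postgres", "postgres://bad:creds@localhost/nosuchdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				var n int64
				// Expected to error out on every iteration. Build with and
				// without -race and watch RES in top while this runs.
				_ = db.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&n)
			}
		}()
	}
	wg.Wait()
}
```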
@agnivade - Well no, it only leaks when you're processing RMQ messages. Just plain channels throwing sql errors doesn't leak. The combination of the two libraries is playing a part. |
Right. It seems that the trace is indeed broken. |
Here is a minimized reproducer:
The repeated frames must not be present in the trace. We could either defer racefuncexit instead of calling it, or manually call racefuncexit the necessary number of times before calling deferred functions (can we figure out how many frames we skip?). In both cases we need to be careful when handling runtime functions, because the runtime is not instrumented, so we can potentially call racefuncexit too many times, or rely on racefuncexit being deferred when it's actually not. |
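The minimized reproducer itself does not appear above. The sketch below is only an illustration of the mechanism described, not the original program: every recover unwinds frames whose racefuncexit calls are skipped, so under -race the goroutine's bookkeeping grows on each iteration.

```go
package main

// Each iteration panics a few frames deep and recovers at the top of leaky.
// Without -race this loop runs in constant memory; with -race, the
// racefuncexit calls for the unwound frames are skipped on every recover,
// so per-goroutine shadow-stack bookkeeping keeps growing.
func main() {
	for {
		leaky()
	}
}

func leaky() {
	defer func() {
		recover()
	}()
	a()
}

func a() { b() }

func b() { panic("boom") }
```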
I think this is unstaffed, so we can move this right to Unplanned. |
Roger that @dvyukov! I might be interested in this so will kindly page you sometime for CL reviews on it. I'll move it over. |
Hi @dvyukov @odeke-em, can you describe a little bit more about the current behavior of these types of programs? I have a program which uses lib/pq to connect to a Postgres database. lib/pq uses panics extensively, as a way of short-circuiting the call stack. I was using it with the race detector on. I also observe large and growing memory usage when I keep the program running, on both Mac and Linux, with Go 1.13 and tip. Is it possible that panics are somehow not being garbage collected properly when the race detector is on? |
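For readers unfamiliar with that style, the pattern being described is roughly the following. This is a generic sketch, not lib/pq's actual code: internal helpers panic on failure, and the exported entry points recover and convert the panic back into an ordinary error.

```go
// Package pqstyle is a hypothetical name, for illustration only.
package pqstyle

import "fmt"

// mustSend panics on failure so that deeply nested internal calls can bail
// out without threading an error return through every layer.
func mustSend(ok bool) {
	if !ok {
		panic(fmt.Errorf("send failed"))
	}
}

// Query is the exported entry point: it recovers any internal panic and
// hands it back to the caller as a normal error. Under -race, each such
// recover leaves some per-goroutine bookkeeping behind.
func Query(ok bool) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = r.(error)
		}
	}()
	mustSend(ok)
	return nil
}
```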
I would expect the memory leak to be minimal (8 bytes per frame). That is tied to goroutine, so if the goroutine exits, the memory should be released. |
That actually fits the profile perfectly. I had a long-running program that was reading rows out of a database in a for loop, then sending them to a downstream server, all in one blocking thread. The database library (github.com/lib/pq) uses panic and recover while sending and reading data, and I was using `-race`. It also makes sense why I didn't see this on HTTP servers that are also long running. Those also use defer and recover(), but inside the goroutine spawned for an incoming connection, so the recover will get cleaned up. See #37233 for more information. |
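Building on the earlier point that the leaked bookkeeping is tied to the goroutine and released when it exits, one workaround is to run each unit of work in its own short-lived goroutine, the way an HTTP server effectively does per connection. This is only a sketch under that assumption, not an approach endorsed elsewhere in the thread.

```go
package main

import (
	"errors"
	"log"
)

// runIsolated executes fn in its own short-lived goroutine and waits for it.
// Any per-goroutine bookkeeping leaked by panic/recover inside fn is released
// when that goroutine exits, so a long-running caller stays flat under -race.
func runIsolated(fn func() error) error {
	errc := make(chan error, 1)
	go func() {
		errc <- fn()
	}()
	return <-errc
}

func main() {
	for i := 0; i < 1000000; i++ {
		_ = runIsolated(func() error {
			defer func() { recover() }()
			panic(errors.New("driver-style failure")) // stands in for a failing DB call
		})
	}
	log.Println("done")
}
```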
Change https://golang.org/cl/220586 mentions this issue: |
As far as I can tell, there is no public documentation on this topic, which cost me several days of debugging. I am possibly unusual in that I run binaries in production with the race detector turned on, but I think that others who do the same may want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Change https://golang.org/cl/221019 mentions this issue: |
…ak potential

As far as I can tell, there is no public documentation on this topic, which cost me several days of debugging. I am possibly unusual in that I run binaries in production with the race detector turned on, but I think that others who do the same may want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit ba093c4)
Reviewed-on: https://go-review.googlesource.com/c/go/+/221019
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
We just realized we were hitting this for many months in our dev and qa environments, where we are running a server compiled with the race detector enabled. Ultimately, disabling it kept the resident memory stable and not leaking to the point of being OOM killed over the course of the day. We were, however, not seeing 8 bytes of leakage at a time, but rather many MB. When I core dumped the process and ran … I eventually ran strace -k filtering for mmap calls, since malloc can call mmap if the size requested is over a threshold. Doing this showed a lot of calls into tsan/sanitizer library shadow methods. I'm wondering if the shadow variables being maintained end up making wholesale copies of certain values or regions of memory? Either way, the detector is too useful and we will keep it on, but getting it to not leak memory would excise a lot of noise from our development and testing processes. |
No, the additional regions allocated by the race runtime should only hold internal metadata and never user data, or copies of user data. |
I've had users report this when using https://github.com/anacrolix/stm with the race detector enabled. Per other comments above, there is a lot of defer/recover, and very long call stacks that don't look correct. |
Change https://go.dev/cl/458218 mentions this issue: |
Once this is fixed, we should unwind the documentation change in CL 221019 |
Currently, all stack walking logic is in one venerable, large, and very, very complicated function: runtime.gentraceback. This function has three distinct operating modes: printing, populating a PC buffer, or invoking a callback. And it has three different modes of unwinding: physical Go frames, inlined Go frames, and cgo frames. It also has several flags. All of this logic is very interwoven.

This CL reimplements the monolithic gentraceback function as an "unwinder" type with an iterator API. It moves all of the logic for stack walking into this new type, and gentraceback is now a much-simplified wrapper around the new unwinder type that still implements printing, populating a PC buffer, and invoking a callback. Follow-up CLs will replace uses of gentraceback with direct uses of unwinder.

Exposing traceback functionality as an iterator API will enable a lot of follow-up work such as simplifying the open-coded defer implementation (which should in turn help with #26813 and #37233), printing the bottom of deep stacks (#7181), and eliminating the small limit on CPU stacks in profiles (#56029).

Fixes #54466.

Change-Id: I36e046dc423c9429c4f286d47162af61aff49a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/458218
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
FYI, @dvyukov's test case no longer reproduces for me. I also tried forcing baz to use stack-allocated defers (by adding a named result parameter) and heap-allocated defers (by adding a I also tried adding a |
Could something have changed in recent versions of Go (I presume 1.21, @mdempsky?) that could have fixed this issue? |
Hi Go Devs,
I've noticed a strange memory leak when running my Go program with the race detector. The program is an ad-hit tracker: it receives data from RabbitMQ and writes it to the database. When there are errors in the DB, the program leaks memory (the RES mem usage grows unchecked until the machine runs out of RAM).
Run the program (below) with and without the -race detector to see the difference.
What version of Go are you using (`go version`)?

go version go1.10.3 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (`go env`)?

What did you do?
Run this program without the race detector:

Observe the memory usage in `top`. It should flatline. Then, run it with the -race detector. The memory usage grows unchecked.
What did you expect to see?

Flatline memory usage, as measured with `top`, when running with -race. I expect it to take more memory to maintain internal data structures, but it should flatline at some point.

What did you see instead?
Constantly growing RES mem usage in `top`.

Strangely, running `go run -race experimental/rabbitmqmemoryleak.go` does not leak memory.