Detecting Race Conditions With Go

Sep 26, 2013


I always find it interesting when I realize that something I have been practicing or dealing with for a long time has a name. This time it happens to be race conditions. This is something you can’t avoid thinking about as soon as you have more than one routine sharing any kind of resource. If you’re not thinking about race conditions in your code, now is the time.

A race condition is when two or more routines have access to the same resource, such as a variable or data structure and attempt to read and write to that resource without any regard to the other routines. This type of code can create the craziest and most random bugs you have ever seen. It usually takes a tremendous amount of logging and luck to find these types of bugs. Over the years I have really perfected my logging skills to help identify these problems when they occur.

Back in June with Go version 1.1, the Go tooling introduced a race detector. The race detector is code that is built into your program during the build process. Then once your program is running, it is able to detect and report any race conditions it finds. It is seriously cool and does an incredible job in identifying the code that is the culprit.

Let’s take a very simple program that contains a race condition and build the code with the race detector.

package main

import (
    "fmt"
    "sync"
)

var Wait sync.WaitGroup
var Counter int = 0

func main() {

    for routine := 1; routine <= 2; routine++ {

        Wait.Add(1)
        go Routine(routine)
    }

    Wait.Wait()
    fmt.Printf("Final Counter: %d\n", Counter)
}

func Routine(id int) {

    for count := 0; count < 2; count++ {

        value := Counter
        value++
        Counter = value
    }

    Wait.Done()
}

The program looks innocent enough. The program spawns two routines that each increment the global Counter variable twice. When both routines are done running, the program displays the value of the global Counter variable. When I run the program it displays the number 4 which is the correct answer. So everything must be working correctly, right?

Let’s run the code through the Go race detector and see what it finds. Open a Terminal session where the source code is located and build the code using the -race option.

go build -race

Then run the program:

==================
WARNING: DATA RACE
Read by goroutine 5:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:29 +0x44
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

Previous write by goroutine 4:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:33 +0x65
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

Goroutine 5 (running) created at:
  main.main()
      /Users/bill/Spaces/Test/src/test/main.go:17 +0x66
  runtime.main()
      /usr/local/go/src/pkg/runtime/proc.c:182 +0x91

Goroutine 4 (finished) created at:
  main.main()
      /Users/bill/Spaces/Test/src/test/main.go:17 +0x66
  runtime.main()
      /usr/local/go/src/pkg/runtime/proc.c:182 +0x91

==================
Final Counter: 4
Found 1 data race(s)

Looks like the tool detected a race condition with the code. If you look below the race condition report, you can see the output for the program. The value of the global Counter variable is 4. This is the problem with these types of bugs, the code could work most of the time and then randomly something bad happens. The race detector is telling us something bad is lurking in the trees.

The result of the warning tells us exactly where the problem is:

Read by goroutine 5:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:29 +0x44
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

        value := Counter

Previous write by goroutine 4:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:33 +0x65
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

        Counter = value

Goroutine 5 (running) created at:
  main.main()
      /Users/bill/Spaces/Test/src/test/main.go:17 +0x66
  runtime.main()
      /usr/local/go/src/pkg/runtime/proc.c:182 +0x91

        go Routine(routine)

You can see that the race detector has pulled out the two lines of code that is reading and writing to the global Counter variable. It also identified the point in the code where the routine was spawned.

Let’s make a quick change to the program to cause the race condition to raise its ugly head:

package main

import (
    "fmt"
    "sync"
    "time"
)

var Wait sync.WaitGroup
var Counter int = 0

func main() {

    for routine := 1; routine <= 2; routine++ {

        Wait.Add(1)
        go Routine(routine)
    }

    Wait.Wait()
    fmt.Printf("Final Counter: %d\n", Counter)
}

func Routine(id int) {

    for count := 0; count < 2; count++ {

        value := Counter
        time.Sleep(1 * time.Nanosecond)
        value++
        Counter = value
    }

    Wait.Done()
}

I have added a billionth of a second pause into the loop. I put the pause right after the routine reads the global Counter variable and stores a local copy. Let’s run the program and see what the value of the global Counter variable is with this simple change:

Final Counter: 2

This pause in the loop has caused the program to fail. The value of the Counter variable is now 2 and no longer 4. So what happened? Let’s break down the code and understand why the billionth of a second pause revealed the bug.

Without the pause the program runs as follows:


Without the pause the first routine that is spawned runs to completion and then the second routine begins to run. This is why the program appears to be running properly. The code is serializing itself because of how fast it is able to run on my machine.

Let’s look at how the program runs with the pause:


I didn’t complete the diagram for space but it shows enough. The pause is causing a context switch between the two routines that are running. This time we have a much different story. Let’s look at the code that is being run in the diagram:

value := Counter

time.Sleep(1 * time.Nanosecond)

value++

Counter = value

With each iteration of the loop the value of the global Counter variable is captured locally, then the local copy is incremented and finally written back to the global Counter variable. If these three lines of code do not run immediately, without interruption, we begin to have problems. The diagram shows how the read of the global Counter variable and then the context switch is causing all of the initial problems.

In the diagram, before the incremented value by Routine 1 is written back to the global Counter variable, Routine 2 wakes up and reads the global Counter variable. Essentially both routines perform the same exact reads and writes to the global Counter variable so we end up with a final value of 2.

To fix this problem you might think we just need to reduce the incrementing of the global Counter variable from three lines of code to one line of code:

package main

import (
    "fmt"
    "sync"
    "time"
)

var Wait sync.WaitGroup
var Counter int = 0

func main() {

    for routine := 1; routine <= 2; routine++ {

        Wait.Add(1)
        go Routine(routine)
    }

    Wait.Wait()
    fmt.Printf("Final Counter: %d\n", Counter)
}

func Routine(id int) {

    for count := 0; count < 2; count++ {

        Counter = Counter + 1
        time.Sleep(1 * time.Nanosecond)
    }

    Wait.Done()
}

When we run this version of the program we get the right answer again:

Final Counter: 4

If we run this code through the race detector our problems should go away:

go build -race

And the output:

==================
WARNING: DATA RACE
Write by goroutine 5:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:30 +0x44
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

Previous write by goroutine 4:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:30 +0x44
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

Goroutine 5 (running) created at:
  main.main()
      /Users/bill/Spaces/Test/src/test/main.go:18 +0x66
  runtime.main()
      /usr/local/go/src/pkg/runtime/proc.c:182 +0x91

Goroutine 4 (running) created at:
  main.main()
      /Users/bill/Spaces/Test/src/test/main.go:18 +0x66
  runtime.main()
      /usr/local/go/src/pkg/runtime/proc.c:182 +0x91

==================
Final Counter: 4
Found 1 data race(s)

We still have a race condition with line 30 of the program:

Write by goroutine 5:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:30 +0x44
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

        Counter = Counter + 1

Previous write by goroutine 4:
  main.Routine()
      /Users/bill/Spaces/Test/src/test/main.go:30 +0x44
  gosched0()
      /usr/local/go/src/pkg/runtime/proc.c:1218 +0x9f

        Counter = Counter + 1

Goroutine 5 (running) created at:
  main.main()
      /Users/bill/Spaces/Test/src/test/main.go:18 +0x66
  runtime.main()
      /usr/local/go/src/pkg/runtime/proc.c:182 +0x91

        go Routine(routine)

The program runs correctly using one line of code to perform the increment. So why do we still have a race condition? Don’t be deceived by the one line of Go code we have for incrementing the counter. Let’s look at the assembly code generated for that one line of code:

0064 (./main.go:30) MOVQ Counter+0(SB),BX ; Copy the value of Counter to BX
0065 (./main.go:30) INCQ ,BX              ; Increment the value of BX
0066 (./main.go:30) MOVQ BX,Counter+0(SB) ; Move the new value to Counter

There are actually three lines of assembly code being executed to increment the counter. These three lines of assembly code eerily look like the original Go code. There could be a context switch after any of these three lines of assembly code. Even though the program is working now, technically the bug still exists.

Even though the example I am using is simple, it shows you how complex finding these bugs can be. Any line of assembly code produced by the Go compiler can be paused for a context switch. Our Go code may look like it is safely accessing resources when actually the underlying assembly code is not safe at all.

To fix this program we need to guarantee that reading and writing to the global Counter variable always happens to completion before any other routine can access the variable. Channels are a great way to serialize access to resources. In this case I will use a Mutex (Mutual Exclusion Lock).

package main

import (
    "fmt"
    "sync"
    "time"
)

var Wait sync.WaitGroup
var Counter int = 0
var Lock sync.Mutex

func main() {

    for routine := 1; routine <= 2; routine++ {

        Wait.Add(1)
        go Routine(routine)
    }

    Wait.Wait()
    fmt.Printf("Final Counter: %d\n", Counter)
}

func Routine(id int) {

    for count := 0; count < 2; count++ {

        Lock.Lock()

        value := Counter
        time.Sleep(1 * time.Nanosecond)
        value++
        Counter = value

        Lock.Unlock()
    }

    Wait.Done()
}

Let’s build the program with the race detector and see the result:

go build -race
./test

Final Counter: 4

This time we get the right answer and no race condition is identified. The program is clean. The Mutex protects all the code between the Lock and Unlock, making sure only one routine can execute that code at a time.

To learn more about the Go race detector and to see more examples read this post:

http://blog.golang.org/race-detector

It’s not a bad idea to test your programs with the race detector on if you are using multiple routines. It will save you a lot of time and headaches early on in your unit and quality assurance testing. We are lucky as Go developers to have such a tool so check it out.