Thursday, October 17, 2013

My Channel Select Bug

I was testing new functionality on a program that is already running in production when suddenly the code behaved very badly. What I saw shocked me and then it became obvious why it happened. I also have a race condition just waiting to be a problem.

I have tried to provide a simplified version of the code and the two bugs.

package main

import (
    "fmt"
    "os"
    "os/signal"
    "time"
)

var Shutdown bool = false

func main() {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, os.Interrupt)

    for {
        select {
        case <-sigChan:
            Shutdown = true
            continue

        case <-func() chan struct{} {
            complete := make(chan struct{})
            go LaunchProcessor(complete)
            return complete
        }():
            return
        }
    }
}

func LaunchProcessor(complete chan struct{}) {
    defer func() {
        close(complete)
    }()

    fmt.Printf("Start Work\n")

    for count := 0; count < 5; count++ {
        fmt.Printf("Doing Work\n")
        time.Sleep(1 * time.Second)

        if Shutdown == true {
            fmt.Printf("Kill Early\n")
            return
        }
    }

    fmt.Printf("End Work\n")
}

The idea behind this code is to run a task and terminate. It allows the operating system to request the program to terminate early. I always like shutting down the program cleanly when possible.

The sample code creates a channel that is bound to an operating system signal and looks for <ctrl> C from the terminal window. If <ctrl> C is issued, the Shutdown flag is set to true and the program continues back into the select statement. The code also spawns a Go routine that performs the work. That routine checks the Shutdown flag to determine if the program needs to terminate early.

Bug Number 1

Take a look at this part of the code:

case <-func() chan struct{} {
    complete := make(chan struct{})
    go LaunchProcessor(complete)
    return complete
}():

I thought I was being so clever when I wrote this code. I thought it would be cool to execute a function on the fly to spawn the Go routine. It returns a channel that the select waits on to be told the work is complete. When the Go routine is done it closes the channel and the program terminates.

Let's run the program:

Start Work
Doing Work
Doing Work
Doing Work
Doing Work
Doing Work
End Work

As expected the program starts and spawns the Go routine. Once the Go routine is complete the program terminates.

This time I will hit <ctlr> C while the program is running:

Start Work
Doing Work
^CStart Work
Doing Work
Kill Early
Kill Early

When I hit <ctrl> C the program launched the Go routine again!!

I thought that the function associated with the case would only be executed once. Then the select would just wait on the channel moving forward. I had no idea that the function would be executed every time the loop iterated back to the select statement.

To fix the code I needed to remove the function out of the select statement and spawn the Go routine outside of the loop:

func main() {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, os.Interrupt)

    complete := make(chan struct{})
    go LaunchProcessor(complete)

    for {

        select {
        case <-sigChan:
            Shutdown = true
            continue

        case <-complete:
            return
        }
    }
}

Now when we run the program we get a better result:

Start Work
Doing Work
Doing Work
^CKill Early

This time when I hit <ctrl> C the program terminate early and doesn't spawn another Go routine again.

Bug Number 2

There is a second less obvious bug lurking in the code as well. Take a look at these pieces of code:

var Shutdown bool = false

if whatSig == syscall.SIGINT {
    Shutdown = true
}

if Shutdown == true {
    fmt.Printf("Kill Early\n")
    return
}

The code is using a package level variable to signal the running Go routine to shut down when <ctrl> C is hit. The code is working every time I hit <ctrl> C so why is there a bug?

First let's run the race detector against the code:

go build -race
./test

While it is running I hit <ctrl> C again:

Start Work
Doing Work
^C==================
WARNING: DATA RACE
Read by goroutine 5:
    main.LaunchProcessor()
        /Users/bill/Spaces/Test/src/test/main.go:46 +0x10b
    gosched0()
        /Users/bill/go/src/pkg/runtime/proc.c:1218 +0x9f

Previous write by goroutine 1:
    main.main()
        /Users/bill/Spaces/Test/src/test/main.go:25 +0x136
    runtime.main()
        /Users/bill/go/src/pkg/runtime/proc.c:182 +0x91

Goroutine 5 (running) created at:
    main.main()
        /Users/bill/Spaces/Test/src/test/main.go:18 +0x8f
    runtime.main()
        /Users/bill/go/src/pkg/runtime/proc.c:182 +0x91

Goroutine 1 (running) created at:
    _rt0_amd64()
        /Users/bill/go/src/pkg/runtime/asm_amd64.s:87 +0x106

==================
Kill Early
Found 1 data race(s)

My use of the Shutdown flag comes up on the race detector. This is because I have two Go routines trying to access the variable in an unsafe way.

My initial reason for not securing access to the variable was practical but wrong. I figured that since the variable is only used to shutdown the program when it becomes necessary, I didn't care about a dirty read. If by chance, within the microsecond of glare there was between writing to the variable and reading the variable, if a dirty read occurred, I would catch it again on the next loop. No harm done, right?  Why add complicated channel or locking code for something like this?

Well, there is a little thing called the Go Memory Model.

http://golang.org/ref/mem

The Go Memory Model does not guarantee that the Go routine reading the Shutdown variable will ever see the write by the main routine. It is valid for the write to the Shutdown variable by the main routine to never be written back to main memory. This is because the main routine never reads the Shutdown variable.

This is not happening today but as the Go compiler becomes more sophisticated it could decide to eliminate the write to the Shutdown variable altogether. This behavior is allowed by the Go Memory Model. Also, we don't want code that can't pass the race detector, it is just bad practice, even for practical reasons.

Here is a final version of the code with all bugs fixed:

package main

import (
    "fmt"
    "os"
    "os/signal"
    "sync/atomic"
    "time"
)

var Shutdown int32 = 0

func main() {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, os.Interrupt)

    complete := make(chan struct{})
    go LaunchProcessor(complete)

    for {

        select {
        case <-sigChan:
            atomic.StoreInt32(&Shutdown, 1)
            continue

        case <-complete:
            return
        }
    }
}

func LaunchProcessor(complete chan struct{}) {
    defer func() {
        close(complete)
    }()

    fmt.Printf("Start Work\n")

    for count := 0; count < 5; count++ {
        fmt.Printf("Doing Work\n")
        time.Sleep(1 * time.Second)

        if atomic.LoadInt32(&Shutdown) == 1 {
            fmt.Printf("Kill Early\n")
            return
        }
    }

    fmt.Printf("End Work\n")
}

I prefer to use an if statement to check if the Shutdown flag is set so I can sprinkle that code as needed. This solution changes the Shutdown flag from a boolean value to an int32 and uses the atomic functions Store and Load.

In the main routine if a <ctrl> C is detected, the Shutdown flag is safely changed from 0 to 1. In the LaunchProcessor Go routine, the value of the Shutdown flag is compared to 1. If that condition is true the Go routine returns.

It's amazing sometimes how a simple program like this can contain a few gotchas. Things you may have never thought about or realized when you started. Especially when the code always seems to work.

3 comments:

  1. See http://play.golang.org/p/Xy4WdIOmKY for a better way to write this program.

    - Move work to main routine to eliminate complete channel.
    - Use platform independent os.Interrupt constant instead of syscall constant.
    - Register for the one signal that we are interested in.
    - Check channel directly to determine if signal is delivered.
    - This eliminates a goroutine.
    - This eliminates concurrent access to shutdown variable.

    ReplyDelete
    Replies
    1. Thanks for the comment. The actual work is much more complex than the sample. I just tried to simply it for the article. Assume you can't do the work inside the loop and you need the go routine. The point however was how the select statement behaved.

      Delete
    2. OK, see http://play.golang.org/p/84EHvmRMBy

      At least fix the concurrency bug. There's no guarantee that the worker goroutine will observe the change to Shutdown.

      Delete