
Simple lock implementation

This is a simple sync.Locker implementation in Go, written for learning purposes. It is not a lock to be used seriously.


I’ve been reading about concurrent programming, and I’m going to walk through some examples I found in the book The Art of Multiprocessor Programming on lock implementations that helped me understand the inner workings of locks.

Locks are an important construct in concurrent programs. However, they are a lower-level concept compared to channels, and the Go team recommends the latter whenever possible. From the documentation of the sync package:

Package sync provides basic synchronization primitives such as mutual exclusion locks. Other than the Once and WaitGroup types, most are intended for use by low-level library routines. Higher-level synchronization is better done via channels and communication.

Although they are not recommended for regular applications, there are still situations where you might want to use them.

Review of mutual exclusion and critical sections

Before jumping into the implementation, let’s recap why locks are needed. Mutual exclusion is a property of software where two (or more) processes shouldn’t access the same resource simultaneously. This is usually the case when accessing shared memory. A portion of code that is mutually exclusive is called a critical section. Critical sections of different threads do not overlap, and at any moment at most one thread is engaged in its critical section.

Here’s a simple example assuming that Thread A gets to the lock first:

Time  | Thread A             | Thread B
1     | Concurrent Operation | Concurrent Operation
2     | Critical section     | Wait
3     | Concurrent Operation | Critical section
4     | Concurrent Operation | Concurrent Operation

This is an oversimplification of how concurrent programs are modelled, and of how parallelism actually happens at the OS and CPU level. But I hope it is sufficient to set the context for the remainder of this post.

Implementing a simple lock algorithm

The book offers its code examples in Java. I ported them to Go and ran them myself with benchmarks.

This lock uses the CAS (compare-and-swap) atomic operation, which does exactly what the name says: it compares a value and, if it matches, swaps it with another value. The entire operation is done atomically, which means that if the comparison was successful, once the swap is done we can be sure that the value swapped out was the value we checked for, and no other thread changed it in the meantime [1]. The CAS API in Go receives an address plus the old and new values, and returns a boolean indicating whether the swap was successful:

func CompareAndSwapInt32(addr *int32, old, new int32) (swapped bool)
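
To make those semantics concrete, here’s a tiny throwaway example using sync/atomic directly (the function and variable are made up for illustration):

func casExample() {
	var state int32 // starts at 0

	ok := atomic.CompareAndSwapInt32(&state, 0, 1)
	fmt.Println(ok) // true: state was 0, so it is now 1

	ok = atomic.CompareAndSwapInt32(&state, 0, 1)
	fmt.Println(ok) // false: state is already 1, the comparison fails and nothing changes
}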

Spinning Locks

A common strategy for implementing locks is to “spin” the processor while the lock isn’t available, also known as “busy waiting”: the program keeps trying to acquire the lock in a loop. Of course, the drawback is that the spinning thread burns CPU time on its core while waiting for the thread that actually holds the lock to finish. Spinning makes sense if you expect the delay until the lock is freed to be short.

Code

The idea for this implementation is ridiculously simple: if the state is 0, set it to 1 and enter the critical section. If the state is 1, wait until it’s 0 and go in.

type CASLock struct {
	state uint32 // 0 = unlocked, 1 = locked
}

func (l *CASLock) Lock() {
	// Spin until this goroutine is the one that flips state from 0 to 1.
	for !atomic.CompareAndSwapUint32(&l.state, 0, 1) {
	}
}

func (l *CASLock) Unlock() {
	// Release the lock by setting state back to 0.
	atomic.StoreUint32(&l.state, 0)
}
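
Because it has Lock and Unlock methods, CASLock already satisfies the sync.Locker interface and can be dropped in wherever a lock is expected. Here’s a minimal usage sketch (the shared counter and the number of goroutines are made up for illustration):

func usageExample() {
	var (
		lock    CASLock
		wg      sync.WaitGroup
		counter int
	)

	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lock.Lock()
			counter++ // critical section: only one goroutine at a time touches counter
			lock.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter) // always 4
}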

I wrote tests and a benchmark for this function to check its correctness and how the runtime changes under higher contention. I will skip the tests; I hope you’ll take my word that this works. With that said, let’s focus only on the benchmark results.

The benchmark cases were 2, 10, 25 and 50 goroutines, together counting up to 500 in every case. This means the total number of increment operations is the same in every test case; only the number of goroutines varies.

Since the contention increases with every test case, it’s expected that the running time will also increase. However, there’s another interesting property in the results. Using benchstat, I could observe a big deviation across the 10 collected samples:

$ benchstat ./results/caslock
> name            time/op
> _All/2,_250-12  11.6µs ± 2%
> _All/10,_50-12  44.3µs ±21%
> _All/25,_20-12  84.7µs ±21%
> _All/50,_10-12   100µs ±14%

There’s a 21% spread between the fastest and slowest runs with 25 goroutines! This reduces the confidence that our benchmark is accurate.

The root cause of this issue is probably that our lock is burning CPU in that tight for loop: the goroutine is constantly busy waiting for the lock to be freed up. This can be fixed by backing off when the lock isn’t acquired. Here’s the new implementation using a simple sleep call [2]:

func (l *CASLock) Lock() {
	for !atomic.CompareAndSwapUint32(&l.state, 0, 1) {
		// Back off briefly before trying again, instead of spinning flat out.
		time.Sleep(time.Nanosecond)
	}
}

With this new version, we get a significant improvement in every test case, and a slight improvement in the deviation:

$ benchstat ./results/caslock ./results/caslock-sleep
> name            old time/op  new time/op  delta
> _All/2,_250-12  11.6µs ± 2%  10.0µs ± 1%  -13.81%  (p=0.000 n=10+10)
> _All/10,_50-12  44.3µs ±21%  20.0µs ± 2%  -54.84%  (p=0.000 n=10+10)
> _All/25,_20-12  84.7µs ±21%  33.5µs ±10%  -60.49%  (p=0.000 n=10+10)
> _All/50,_10-12   100µs ±14%    44µs ± 6%  -55.85%  (p=0.000 n=10+10)

This is likely due to the sleeping goroutine being moved into a waiting state, freeing up the CPU for goroutines that can actually get work done.

As a side note: the book shows an example implementing an exponential backoff with a random timer. I used 1~5 nanoseconds in the randomised timer and got smaller improvements than directly sleeping for 1 nanosecond. Perhaps the exponential backoff timer would make more sense if the critical section weren’t so fast? Since the lock gets acquired and shortly after unlocked, a fast retry loop with a very small sleep timer reduces the time that the critical section is under-utilised.
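
For reference, here’s roughly what that exponential backoff could look like in Go. This is only a sketch based on my reading of the book’s version; the starting limit and the cap are arbitrary values I picked, and I haven’t benchmarked this exact code:

type BackoffLock struct {
	state uint32
}

func (l *BackoffLock) Lock() {
	limit := time.Nanosecond              // arbitrary starting limit
	const maxLimit = 64 * time.Nanosecond // arbitrary cap
	for !atomic.CompareAndSwapUint32(&l.state, 0, 1) {
		// Sleep for a random duration up to the current limit,
		// then double the limit for the next failed attempt.
		time.Sleep(time.Duration(rand.Int63n(int64(limit)) + 1))
		if limit < maxLimit {
			limit *= 2
		}
	}
}

func (l *BackoffLock) Unlock() {
	atomic.StoreUint32(&l.state, 0)
}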

I’ll leave that exercise to another day.


There are some follow-ups to this exercise that I might try in the next couple of days.

Footnotes

[1]: As long as all operations on this variable are atomic, that is.

[2]: After playing around with this example and trying to compare different implementations, I accidentally arrived at an implementation that is faster by swapping the time.Sleep(...) call for runtime.Gosched(). They do pretty much the same thing: remove the goroutine from the CPU to allow other goroutines to make progress. But I was surprised to see that the latter improved the numbers. I didn’t investigate it further, so I decided to leave it out of this write-up.
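
For completeness, that variant looks like this: the same spin loop, only yielding the processor instead of sleeping.

func (l *CASLock) Lock() {
	for !atomic.CompareAndSwapUint32(&l.state, 0, 1) {
		runtime.Gosched() // yield, letting other goroutines run before retrying
	}
}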

Appendix A: Benchmark function

type testCase struct {
	workers    int
	countUntil int
	expected   int
}

var (
	counter   = 0
	testCases = []testCase{
		{2, 250, 500},
		{10, 50, 500},
		{25, 20, 500},
		{50, 10, 500},
	}
)

func runTestCase(lock sync.Locker, workers, countUntil int) {
	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for c := 0; c < countUntil; c++ {
				lock.Lock()
				counter += 1
				lock.Unlock()
			}
		}()
	}
	wg.Wait()
}

func Benchmark_All(b *testing.B) {
	for _, tc := range testCases {
		testName := fmt.Sprintf("%d, %d", tc.workers, tc.countUntil)
		b.Run(testName, func(b *testing.B) {
			// Benchmark can be run for different implementation of sync.Locker
			// Implement `lockerFactory()` that returns any lock implementation
			// given an external parameter or environment variable.
			lockerInstance := lockerFactory()
			benchWrapper(b, lockerInstance, &tc)
		})
	}
}
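
lockerFactory and benchWrapper are left out above. Here’s one possible sketch of them; the LOCK environment variable name and the counter reset are assumptions I made for this example, not necessarily what produced the numbers above:

// lockerFactory picks a lock implementation based on an environment variable,
// e.g. LOCK=mutex go test -bench=. (variable name made up for this sketch).
func lockerFactory() sync.Locker {
	switch os.Getenv("LOCK") {
	case "mutex":
		return &sync.Mutex{}
	default:
		return &CASLock{}
	}
}

// benchWrapper runs the test case b.N times, resetting the shared counter
// between iterations.
func benchWrapper(b *testing.B, lock sync.Locker, tc *testCase) {
	for i := 0; i < b.N; i++ {
		counter = 0
		runTestCase(lock, tc.workers, tc.countUntil)
	}
}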

