Fun with Futex
Implementing an optimized lock in Linux requires some help from the Operating System. You can only get so far by doing everything in user land. We are going to take a look at how one can implement a simple spin lock in C (just like the spin lock in Go I implemented a while back) and then a slightly more elaborate lock using the operating system's primitives. The idea behind a mutex (mutual exclusion) is straightforward: we want some part of the code to be accessed by only a single thread at a time. If the resource a thread is trying to access (the critical zone) is already being accessed by something else: wait. The tricky part about implementing a mutex is how you do the wait part!
As usual, for the ones allergic to rambling, you can jump directly to the final code on GitHub.
Behold the spin lock!
The simplest way to wait in a mutex is to “spin”, in other words: retry until your condition is met. Here’s a first-pass implementation of our mutex in C:
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} spin_lock_t;

void init_lock(spin_lock_t *lock) {
    atomic_init(&lock->locked, false);
}

void lock(spin_lock_t *lock) {
    while (atomic_exchange(&lock->locked, true)) {
        // spin!
    }
}

void unlock(spin_lock_t *lock) {
    atomic_store(&lock->locked, false);
}
You don’t really need more than that for a simple spin lock, and it probably won’t differ too much from implementations in other programming languages with similar atomic constructs. `atomic_exchange` returns the old value of `locked`. If that value was `false`, it means the lock was free and the caller thread can claim it. If the value was `true`, then the lock is already in use by another thread, so the caller must wait!
To test this out, I wrote the following test harness. We start three threads, and each thread tries to increment the counter, acquiring the lock for each increment.
Note that this is not a great example for locks; in a real scenario you would likely use an atomic for the increment. This is just a simple example and shouldn’t be a distraction from the actual implementation I’m interested in: the mutex!
#include <pthread.h>
#include <stdio.h>

unsigned int counter = 0;
spin_lock_t spin_lock;

void* counter_thread(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        lock(&spin_lock);
        counter++;
        unlock(&spin_lock);
    }
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    init_lock(&spin_lock);
    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, counter_thread, NULL) != 0) {
            perror("Failed to create thread");
            return 1;
        }
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    printf("Final counter value: %u\n", counter);
    return 0;
}
The problem with this implementation is that your CPU consumption will be pretty high, especially if you spin up as many threads as available cores (noticeable by the room temperature going up). To exacerbate the problem, let’s sleep for a second in the critical zone (please don’t ever do this):
@@ -91,6 +91,7 @@ spin_lock_t spin_lock;
 void* counter_thread(void* arg) {
     for (int i = 0; i < 1000000; i++) {
         lock(&spin_lock);
+        sleep(1);
         counter++;
         unlock(&spin_lock);
     }
 }
No fun!
htop showing 9 of my cores begging for help
Clearly only one thread is doing the work at a time (incrementing the counter) while the others are burning through my CPU in a while loop.
Enter futex!
The `futex` syscall from Linux is a secret weapon for dealing better with the “wait” part of the lock. Instead of mindlessly spinning, let’s put threads to sleep when the lock is not available, and wake them up when another thread releases it.
Futex offers a few different operations, but in order to implement our basic lock we only need two: `FUTEX_WAIT` to put threads to sleep and `FUTEX_WAKE` to wake them up. They are counterintuitive at first. When waiting on a futex, we tell the API the following:

don’t wake this thread up while the value is still X

However, this does not mean that the thread will be woken up once the value changes. We still need another operation to actually change the value to something other than X, and then call `FUTEX_WAKE`.
Here are a few scenarios I’ve played around with to better understand how futexes behave:
Case 1: bread and butter case
The most important thing to understand about futexes is that putting threads to sleep and waking them up revolves around the value of the underlying 32-bit futex integer. When you sleep, you sleep for as long as the value is set to some `A`. If you want to wake a thread up, you have to make sure the futex value is no longer `A`, and then call the `FUTEX_WAKE` syscall.
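Here is a minimal standalone sketch of this case (a toy example of mine, not the lock implementation): a waiter thread sleeps while the futex word is 1, and the main thread changes the word and then wakes it.

```c
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic uint32_t word = 1;

static void *waiter(void *arg) {
    (void)arg;
    // sleep for as long as the value is still 1
    syscall(SYS_futex, &word, FUTEX_WAIT, 1, NULL, NULL, 0);
    printf("woken up, word is now %u\n", atomic_load(&word));
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);                // give the waiter time to go to sleep
    atomic_store(&word, 0);  // 1. make sure the value is no longer 1...
    syscall(SYS_futex, &word, FUTEX_WAKE, 1, NULL, NULL, 0);  // 2. ...then wake
    pthread_join(t, NULL);
    return 0;
}
```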
Case 2: Thread is not put to sleep because condition isn’t met
When you wait on a value, the kernel first atomically re-checks that value, to avoid putting the thread to sleep if the value was updated between your previous atomic read and the futex call. In that case the futex call fails with `EAGAIN`.
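This is easy to see with a tiny sketch (again my own toy example): the word is already 0, but we ask to wait “while the value is still 1”, so the kernel refuses to put us to sleep.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    uint32_t word = 0;
    // ask to sleep "while the value is still 1": the kernel atomically
    // re-checks the word, sees 0, and returns without sleeping
    long s = syscall(SYS_futex, &word, FUTEX_WAIT, 1, NULL, NULL, 0);
    if (s == -1 && errno == EAGAIN)
        printf("EAGAIN: value was not 1, thread was never put to sleep\n");
    return 0;
}
```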
Case 3: Calling wake without changing the value won’t do anything
If you call FUTEX_WAKE without changing the value, a thread that waits in a loop (like our lock below) will simply observe the unchanged value and go right back to sleep, so effectively nothing happens. Likewise, just changing the value has no effect on the waiters if it’s not followed by a `FUTEX_WAKE` syscall.
Mutex implementation using futex
Let’s start building our new mutex! The struct will look pretty similar to the previous one, but we will now use a 32-bit integer rather than a boolean. This is required by the futex API.
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t ftx;  // the 32-bit futex word
} futex_lock;
We are first going to implement a wrapper around the syscall (as suggested in the `futex` man page):
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static int futex(uint32_t *uaddr, int futex_op, uint32_t val,
                 const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3) {
    return syscall(SYS_futex, uaddr, futex_op, val,
                   timeout, uaddr2, val3);
}
We still need the operation in a `while` loop, so the condition to check if the lock is available is virtually the same. We can’t guarantee that another thread didn’t call `lock` after the present one was woken up, so we need to repeat the compare-and-swap operation.
void lock(futex_lock *lock) {
    uint32_t unlocked = 0;
    // fast path: if the CAS succeeds we take the lock without entering the loop
    while (!atomic_compare_exchange_weak(&lock->ftx, &unlocked, 1)) {
        // C11 atomic API: on failure, the CAS writes the current value of
        // lock->ftx into `unlocked`, so reset it before trying again
        unlocked = 0;
        const uint32_t wait_condition = 1;
        // call futex with `wait_condition=1`. Remember that the kernel will
        // first atomically check the value and return early, without putting
        // the thread to sleep, if `lock->ftx != wait_condition`
        long s = futex((uint32_t *)&lock->ftx, FUTEX_WAIT,
                       wait_condition,
                       NULL,
                       NULL,
                       0);
        // we get EAGAIN here if the thread was _not_ put to sleep
        if (s == -1 && errno != EAGAIN)
            err(EXIT_FAILURE, "futex-FUTEX_WAIT");
    }
}
When calling `FUTEX_WAKE`, we need to specify at most how many threads should be woken up. Because we are using this in a mutex implementation and there is no priority logic here, we only wake up one thread and let that thread take the lock. Technically, you could pass `INT_MAX` and wake up every thread that is waiting on that address. The implementation makes no guarantees about which thread gets picked first.
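Just to illustrate the count argument, a hypothetical “wake everyone” helper (not something our mutex needs, and assuming the `futex` wrapper and `futex_lock` struct from above) would look like this:

```c
#include <limits.h>

// broadcast wake: wake every thread currently sleeping on this futex word
static void wake_all(futex_lock *lock) {
    futex((uint32_t *)&lock->ftx, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```

Back to our mutex: the unlock function only ever wakes a single waiter.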
void unlock(futex_lock *self) {
    uint32_t locked = 1;
    if (atomic_compare_exchange_strong(&self->ftx, &locked, 0)) {
        const uint32_t callers_to_awake = 1;
        // wake up at most one caller that was put to sleep while the
        // futex word held a value other than 0 (unlocked)
        long s = futex((uint32_t *)&self->ftx, FUTEX_WAKE,
                       callers_to_awake,
                       NULL,
                       NULL,
                       0);
        if (s == -1)
            err(EXIT_FAILURE, "futex-FUTEX_WAKE");
    }
    // for safety: if the CAS failed, `locked` was overwritten with 0,
    // which means unlock was called on a lock that wasn't held
    assert(locked == 1);
}
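To plug this into the earlier counter harness, the futex word just needs to start at 0 (unlocked). A minimal init helper, analogous to the spin lock’s `init_lock` (my naming here, the final code may differ), could be:

```c
void init_lock(futex_lock *lock) {
    // 0 means unlocked; a waiter only sleeps while the word is 1
    atomic_init(&lock->ftx, 0);
}
```

With that, the harness stays the same apart from the lock type.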
So, futexes are always better… right?
Well, not quite. Compared to the unbounded spin lock, the futex implementation is definitely faster and uses fewer resources. However, a very simple change to the spin lock can make it significantly better than our futex version:
diff --git a/main.c b/main.c
index 86c0141..414d4f0 100644
--- a/main.c
+++ b/main.c
@@ -11,8 +11,7 @@
+#include <sched.h>
 void lock(spin_lock_t *lock) {
-    while (atomic_exchange(&lock->locked, true)) {}
+    while (atomic_exchange(&lock->locked, true)) {
+        sched_yield();
+    }
 }
This makes our spinning a little more efficient: by yielding the CPU from this thread, the scheduler can run a thread that could actually do some work and unlock our mutex. This is the poor man’s wait!
❯ perf stat ./futex-lock-count
Final counter value: 1000000
Performance counter stats for './futex-lock-count':
692.93 msec task-clock:u # 8.577 CPUs utilized
<REDACTED>
0.080790594 seconds time elapsed
0.100786000 seconds user
0.575473000 seconds sys
❯ perf stat ./spin-lock-count
Final counter value: 1000000
Performance counter stats for './spin-lock-count':
133.78 msec task-clock:u # 8.654 CPUs utilized
<REDACTED>
0.015458606 seconds time elapsed
0.028445000 seconds user
0.105634000 seconds sys
Note that this benchmark is very fragile and results will vary depending on other workloads on my machine. But for my silly little exercise this is enough to give me a ballpark of how they compare.
Our spin lock solution is at least 5x faster than the futex one! Notice that a similar number of CPUs was utilized, but we spent much more CPU time (`task-clock`) in the futex implementation. Another important thing to notice is that the time spent in system (kernel) mode is significantly larger in the futex solution. This gives us a hint about what needs to be improved.
Improved futex lock (avoid syscalls)
Let’s take the reference implementation from the paper “Futexes Are Tricky” and take it for a spin (or no spin) in our test harness.
It turns out that one expensive operation we do here is calling wake on the futex even when there are no threads in the queue, which means unnecessarily dipping into kernel land. The main thing the paper’s approach does differently is to avoid this extra trip into kernel space, and only call wake if we believe there is a thread waiting.
As we will see below, we might still call wake when there are no waiters. But in scenarios where we are certain there are none, no syscall is made.
The first thing we need is for the mutex to distinguish three states:
- It’s available;
- It’s locked and there are no waiters; or
- It’s locked and there are waiters
const uint32_t AVAILABLE = 0;       // unlocked
const uint32_t LOCKED = 1;          // locked, no waiters
const uint32_t LOCKED_WAITERS = 2;  // locked, and there may be waiters
The lock function now has to deal with the case where we try to lock, the lock is not available, and there are already threads waiting.
I strongly suggest that you read the paper to understand how they arrived at this algorithm. It’s short and very informative on futexes.
void lock(futex_lock *lock) {
    uint32_t c = AVAILABLE;
    // fast path: uncontended lock, no syscall at all
    if (!atomic_compare_exchange_weak(&lock->ftx, &c, LOCKED)) {
        // mark the lock as contended before going to sleep
        if (c != LOCKED_WAITERS)
            c = atomic_exchange(&lock->ftx, LOCKED_WAITERS);
        while (c != AVAILABLE) {
            long s = futex((uint32_t *)&lock->ftx, FUTEX_WAIT,
                           LOCKED_WAITERS,
                           NULL,
                           NULL,
                           0);
            if (s == -1 && errno != EAGAIN)
                err(EXIT_FAILURE, "futex-FUTEX_WAIT");
            // we were woken up (or never slept): try to grab the lock,
            // conservatively assuming there may still be other waiters
            c = atomic_exchange(&lock->ftx, LOCKED_WAITERS);
        }
    }
}
The important difference happens in the `unlock` function. We want to make sure that if we are confident there were no waiters, `FUTEX_WAKE` is not called!
void unlock(futex_lock *lock) {
    // fetch_sub returns the previous value: if it was LOCKED (1), there
    // were no waiters and we are done without any syscall
    if (atomic_fetch_sub(&lock->ftx, 1) != LOCKED) {
        atomic_store(&lock->ftx, AVAILABLE);
        const uint32_t callers_to_awake = 1;
        long s = futex((uint32_t *)&lock->ftx, FUTEX_WAKE,
                       callers_to_awake,
                       NULL,
                       NULL,
                       0);
        if (s == -1)
            err(EXIT_FAILURE, "futex-FUTEX_WAKE");
    }
}
Can you spot why we might still call `FUTEX_WAKE` even if there are no waiters in the queue? Take a closer look at this line in the `lock` function:
c = atomic_exchange(&lock->ftx, LOCKED_WAITERS);
When a thread wakes up, it will try again to take the lock, and it will swap the value to `LOCKED_WAITERS`. There are two possibilities here:
- There were multiple threads asleep, so the value is correct.
- The thread that was woken up was the only one in the queue, so we will set the value incorrectly.
Setting the value incorrectly will cause the `unlock` function to call `FUTEX_WAKE` unnecessarily. However, that’s a small price to pay compared to the correctness issue of assuming the woken thread was the only one asleep and risking the next unlock incorrectly not calling `FUTEX_WAKE` at all!
There are surely implementations that optimize this further, but for now let’s see how our new mutex fares against the spin lock:
❯ perf stat ./futex-lock2-count
Final counter value: 1000000
Performance counter stats for './futex-lock2-count':
158.13 msec task-clock:u # 6.908 CPUs utilized
<REDACTED>
0.022890447 seconds time elapsed
0.031259000 seconds user
0.119167000 seconds sys
Sadly, spin locks are still faster.
Counting is a bad example
However, this might not be entirely due to our implementation. Counting integers is hardly a good example for mutexes. I picked it since it’s an easy exercise to illustrate the concepts, but if we wanted to validate the performance of our implementation, we probably wouldn’t even solve this problem with a mutex! If all we need to do is count, simply incrementing an atomic variable would be sufficient.
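For reference, the lock-free version of the counting example is basically a one-liner with a C11 atomic (just a sketch to illustrate the point, not part of the benchmark):

```c
#include <stdatomic.h>

atomic_uint counter = 0;

void* counter_thread(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        // a single atomic read-modify-write; no mutex needed
        atomic_fetch_add(&counter, 1);
    }
    return NULL;
}
```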
I propose that we look at the performance of our two mutex implementations in an example with higher contention, meaning that we need more waiting time. A good way to exemplify that would be adding an operation in the critical zone that requires a little more than a single CPU instruction, and would therefore make threads hold onto the lock for a while longer. We should also add another expensive computation outside the lock, so the example is at least somewhat realistic: some of the work can be parallelized and some can’t.
const unsigned int FIB_VALUE = 30;
void* fib_thread(void* arg) {
for (int i = 0; i < 100; ++i) {
fib(FIB_VALUE);
lock(&external_lock);
fib(FIB_VALUE);
unlock(&external_lock);
fib(FIB_VALUE);
}
return NULL;
}
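The snippet assumes a `fib` helper and an `external_lock` global that aren’t shown here. A naive, deliberately slow recursive Fibonacci (my assumption of what it looks like; the final code on GitHub may differ) would do the trick:

```c
// assumption: a naive, exponential-time Fibonacci to burn CPU cycles
unsigned long fib(unsigned int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// the lock under test; swap the type to benchmark the spin lock instead
futex_lock external_lock;
```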
Let’s now run our harness using this new thread and see the results!
Performance counter stats for './output/spin_lock.out 2':
37,905.51 msec task-clock:u # 8.582 CPUs utilized
<REDACTED>
4.416729782 seconds time elapsed
16.072219000 seconds user
21.731787000 seconds sys
Performance counter stats for './output/futex_lock2.out 2':
9,588.06 msec task-clock:u # 2.681 CPUs utilized
<REDACTED>
3.576809001 seconds time elapsed
9.537161000 seconds user
0.020920000 seconds sys
Nice! Now not only is the total elapsed time of our futex lock implementation lower, we’ve also used much less CPU time. The spin lock implementation spent a lot of time spinning rather than doing useful work; even if its elapsed time was not too much longer, our program hogged CPU that could have been used by other processes.
The fun part is that it was even audibly noticeable, since the CPU fans went crazy while running the example with the spin lock!
What’s next?
There are many other aspects of locks and futexes that I would like to explore, but this article is large enough as it is, so I will leave it at that.
- The best of both worlds would be to implement a lock that spins for a little bit before sleeping, making sure that if the lock gets freed up fast, we can benefit from the low-overhead approach (see the sketch after this list).
- I’ve read some code from Rust’s standard library and Go’s runtime to understand how they tackled their mutex implementations. It would be nice to dig a little deeper and fully understand their implementations and trade-offs. As a matter of fact, both the Rust and Go implementations use a hybrid approach of spin locks and futexes (on Linux, at least)!
- Fairness: our mutex implementation is not exactly fair. If a thread swoops in and grabs the atomic before a sleeping thread is woken up, it will very impolitely cut the queue.
- Memory ordering: I’ve written one implementation of the futex lock in C that uses acquire ordering for lock and release ordering for unlock. This sort of makes sense intuitively after reading some reference code and explanations of memory ordering. However, I don’t understand it well enough yet to write about it, so I skipped it in this article.
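As a teaser for the first point, here’s a rough, untested sketch of such a hybrid lock, reusing the states and `futex()` wrapper from the improved futex lock above (the spin count of 100 is an arbitrary pick, and error handling is omitted for brevity):

```c
void hybrid_lock(futex_lock *lock) {
    // optimistic phase: spin a bounded number of times hoping the lock
    // frees up quickly, without ever touching the kernel
    for (int i = 0; i < 100; i++) {
        uint32_t c = AVAILABLE;
        if (atomic_compare_exchange_weak(&lock->ftx, &c, LOCKED))
            return;  // got the lock with no syscall at all
    }
    // slow path: mark the lock as contended and sleep on the futex,
    // same idea as the improved lock above
    uint32_t c = atomic_exchange(&lock->ftx, LOCKED_WAITERS);
    while (c != AVAILABLE) {
        futex((uint32_t *)&lock->ftx, FUTEX_WAIT, LOCKED_WAITERS, NULL, NULL, 0);
        c = atomic_exchange(&lock->ftx, LOCKED_WAITERS);
    }
}
```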
References
Here are some good references I read in preparation for writing this article:
- futex(2) syscall man page
- The best source of text for understanding what the syscall does
- Futexes Are Tricky (2005) paper by Ulrich Drepper
- This paper is a must read for understanding the weird parts of futexes. It walks through from a simple (and buggy) implementation of a mutex using futex to a more complicated but correct implementation.
- Chapter 8: Operating System Primitives from Mara Bos’ Rust Atomics and Locks book.
- Her whole book is worth a read on the topic. More specifically, chapters 4 and 9 also cover lock implementations.
- Chapter 28: Locks from Operating Systems: Three Easy Pieces
- This chapter is quite condensed and covers the need for locks, the OS API, as well as more complex implementations, like queue locks, which I won’t cover in this article.
- Eli Bendersky’s Basics of Futexes article
- Chapter 7 on Spin Locks and Contention from The Art of Multiprocessor Programming
- Rust’s Mutex implementation for Linux using futex
If you hated this post, and can't keep it to yourself, consider sending me an e-mail at fred.rbittencourt@gmail.com. I'm more responsive to positive comments though.