Implementing atomic wait and notify
I’ve recently had to port a Windows 10 exclusive program to Windows 7. This program made extensive use of the WaitOnAddress and WakeByAddress functions, which are only available since Windows 8. These are low-level, immediate-mode primitives that can turn any memory location into a synchronization object. If you’re up to speed with C++20, you might know them as atomic_wait and atomic_notify_one/atomic_notify_all. In Linux land this is a futex. And in the WASM threading proposal these operations also have dedicated instructions.
Here’s how you might use these functions to create a simple lock:
void Acquire(long *lock) {
    long taken = 1;
    while (InterlockedCompareExchange(lock, 1, 0))
        WaitOnAddress(lock, &taken, sizeof(long), INFINITE);
}
void Release(long *lock) {
    InterlockedExchange(lock, 0);
    WakeByAddressSingle(lock);
}
Notice how this looks almost identical to the equivalent spinlock implementation:
void AcquireSpinlock(long *lock) {
    while (InterlockedCompareExchange(lock, 1, 0))
        YieldProcessor();
}
void ReleaseSpinlock(long *lock) {
    InterlockedExchange(lock, 0);
}
These functions are incredibly natural to use. They’re also really flexible - here’s a lock implementation that only uses a single bit of state:
void AcquireBitlock(long *bits, int index) {
    for (;;) {
        long old = InterlockedOr(bits, (1<<index));
        if (!(old & (1<<index)))
            return;
        WaitOnAddress(bits, &old, sizeof(long), INFINITE);
    }
}
void ReleaseBitlock(long *bits, int index) {
    InterlockedAnd(bits, ~(1<<index));
    WakeByAddressSingle(bits);
}
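For instance, a single long could back 32 independent locks - a hypothetical sketch, where g_bitlocks and UseResource are illustrative names:
// Hypothetical usage of the bitlock: 32 independent locks packed into one long.
long g_bitlocks = 0;

void UseResource(int index) { // index in [0, 31]
    AcquireBitlock(&g_bitlocks, index);
    // ... exclusive access to resource `index` ...
    ReleaseBitlock(&g_bitlocks, index);
}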
This kind of flexibility would be impossible with traditional OS-provided synchronization primitives. I didn’t want to lose these capabilities when porting over to Windows 7. So I decided to implement WaitOnAddress and WakeByAddress myself. I wasn’t even sure if this was possible to do efficiently, but it turns out to be relatively simple.
Naïve implementation#
The first implementation that came to mind was this incredibly straightforward one:
void Wait(long *address, long compare) {
    while (*address == compare)
        Sleep(1);
}
void Wake(void *address, bool all) {
    // No-op.
}
This obviously works, but it’s incredibly inefficient. As I’ve shown in my previous blog post, the Sleep function is imprecise. Sleep(1) can delay the program by 10 milliseconds on average, unless the OS scheduler frequency is increased. But even under ideal conditions, a 1 millisecond delay between every check of the address is really bad… or is it actually?
Measuring thread wake latency#
Suppose we have a Win32 synchronization primitive which is initially unsignaled. Suppose we also have a thread B which blocks on that primitive by calling WaitForSingleObject or the like. Now suppose thread A comes along some time later and signals the primitive. How long will it take for thread B to wake up? I’ve always assumed that this wake latency is quite low, less than 100 µs. But I’ve never actually measured it. Who knows, maybe the latency is so bad that the 10 millisecond delay from Sleep actually isn’t too bad in comparison. So I wrote some benchmarks to measure the latency:
event = UNSIGNALLED;
void ThreadA() {
    Sleep(DELAY);
    t0 = Stopwatch();
    Signal(event);
}
void ThreadB() {
    Wait(event);
    t1 = Stopwatch();
}
The Sleep at the start of thread A is necessary to ensure thread B is actually unscheduled while waiting for the signal. Many operating systems don’t actually put the thread to sleep immediately when you call Wait; instead they will spin a couple of times first in an attempt to avoid a costly context switch. This is called hybrid spinning, and I don’t want it influencing the results here.
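To make this concrete, here’s roughly what the Win32 event variant of the benchmark boils down to - a minimal sketch that measures a single run (instead of averaging 10,000) and assumes QueryPerformanceCounter as the stopwatch:
// Single-run sketch of the wake latency benchmark for a Win32 event.
#include <windows.h>
#include <stdio.h>

#define DELAY 50 // milliseconds

static HANDLE event;
static LARGE_INTEGER t0, t1, frequency;

static DWORD WINAPI ThreadA(LPVOID param) {
    Sleep(DELAY);                 // make sure thread B is really asleep
    QueryPerformanceCounter(&t0);
    SetEvent(event);              // signal, then see how long B takes to wake
    return 0;
}
static DWORD WINAPI ThreadB(LPVOID param) {
    WaitForSingleObject(event, INFINITE);
    QueryPerformanceCounter(&t1);
    return 0;
}
int main(void) {
    QueryPerformanceFrequency(&frequency);
    event = CreateEvent(NULL, FALSE, FALSE, NULL); // auto-reset, initially unsignaled

    HANDLE threads[2];
    threads[0] = CreateThread(NULL, 0, ThreadB, NULL, 0, NULL);
    threads[1] = CreateThread(NULL, 0, ThreadA, NULL, 0, NULL);
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);

    double us = (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)frequency.QuadPart;
    printf("wake latency: %.2f us\n", us);

    CloseHandle(threads[0]);
    CloseHandle(threads[1]);
    CloseHandle(event);
    return 0;
}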
I tested the latency of 4 different synchronization primitives: spinning, WaitOnAddress and WakeByAddress (atomic wait/notify), the default Win32 event, and a polling loop that calls Sleep for 1 millisecond between each check of the address. I measured the time between t0 and t1 from the code above, averaged over 10,000 runs. The DELAY was set to 50 milliseconds. Here are the results:
method | wake latency (µs) |
---|---|
spin | 0.35 |
atomic wait | 13.10 |
Win32 event | 23.50 |
sleep | 997.00 |
To summarize, atomic wait (WaitOnAddress) and events (as well as other OS primitives) have wake latencies in the range of roughly 10-25 microseconds. Sleeping has the expected 1 millisecond latency - way too high in comparison. Don’t use Sleep to implement synchronization primitives on Windows.
Note that this can be entirely different on other operating systems. Here are the results from my M1 MacBook. I used ulock_wait as the atomic wait, dispatch semaphores instead of events, and nanosleep for 1 µs instead of Sleep(1):
method | wake latency (µs) |
---|---|
spin | 0.10 |
atomic wait | 22.70 |
semaphore | 20.40 |
sleep | 25.90 |
macOS’ nanosleep seems much more precise than Windows’ Sleep, and so sleeping for a couple of microseconds had a surprisingly low latency. Of course this says nothing about how power efficient this method is - the thread needs to be scheduled in and out every microsecond, which can’t be too good for battery life and CPU usage, but I didn’t measure that in this test.
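For reference, the sleep-based wait in that test boils down to something like this - a sketch assuming a plain 1 µs nanosleep between checks of the address:
// Sketch of the nanosleep-based wait: poll the address and sleep 1 µs between checks.
#include <time.h>

void SleepWait(long *address, long compare) {
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000 }; // 1 µs
    while (*address == compare) // atomic read relaxed
        nanosleep(&delay, NULL);
}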
Implementing synchronization primitives using sleep might not be such a terrible idea on macOS. This is an interesting discovery, but it doesn’t matter too much for me since the program I was porting was Windows-only. On Windows, implementing anything based on sleep is a terrible idea because the wake latency is just too high.
You can find the code for the Windows experiment here, and the code for the Mac experiment here.
Apple’s awful C++ 20 implementation#
I have to go on a little rant about this. Originally the macOS tests above were written in C++20, directly using the new atomic_wait, semaphore, and barrier. However, Apple clang’s libc++ implementation of these primitives had very peculiar behavior:
method | latency (µs), DELAY = 1 ms | latency (µs), DELAY = 10 ms | latency (µs), DELAY = 50 ms |
---|---|---|---|
spin | 0.12 | 0.14 | 0.33 |
atomic wait | 327.00 | 4900.00 | 15100.00 |
semaphore | 332.00 | 5080.00 | 15000.00 |
sleep | 4.23 | 6.33 | 6.71 |
Not only are the latencies for the atomic wait and semaphore abysmal, but the DELAY between the waiting thread blocking on the primitive and the other thread waking it up also heavily influences the result. I expected these primitives to map directly to ulock_wait and the dispatch semaphores, which shouldn’t behave this poorly. When I took a look at the source code I immediately noticed the cause: the barrier, semaphore, and atomic wait all use some dumb version of polling with exponential backoff instead of the native primitives.
The authors had a good reason to avoid using ulock_wait, as it’s a private API, meaning your app can get rejected from the App Store for using it. This is something I wasn’t aware of before. But still, the dispatch semaphore API is not private, so why didn’t they use that? And a possible 15 millisecond latency on atomic wait is unacceptably high in my opinion. Maybe they should read this blog post and implement a better atomic wait ;)
Either way, I would avoid using the C++20 synchronization primitives on Mac if wake latency is important to you. You can find the C++20 test implementation here, in case you want to check the behavior on your system.
Actual implementation#
Now that I’d confirmed my instinct that an atomic wait implemented with Sleep is terrible, I had to decide which Win32 synchronization primitive to use as the backbone. I narrowed the possibilities down to events, semaphores, and condition variables. All 3 of these are suited to implementing an atomic wait, and they’re all supported on Windows 7.
Events seem like the most natural fit for this use case, but they are heavyweight kernel objects that must be manually managed. Condition variables aren’t particularly well suited, but they live entirely in user-space and can be created on the fly from any pointer. The SRW lock they’re attached to also comes in handy, since we’ll need a lock anyway. Semaphores aren’t well suited to the problem, and are also kernel objects - so they have the downsides of both events and condition variables without any of the upsides. I decided to go with condition variables in the end.
Wait lists#
I map addresses to wait lists. When a thread waits on an address, it adds itself to the corresponding wait list. And when a thread notifies an address, it goes through the corresponding wait list to find which threads it needs to wake up. The wait list is a simple circular doubly linked list:
struct Node {
    Node *prev;
    Node *next;
    long *address;
    CONDITION_VARIABLE condition;
};
struct List {
    SRWLOCK lock;
    Node sentinel;
};
#define TABLE_SIZE 4096
List table[TABLE_SIZE];
I make use of a sentinel node because it greatly simplifies the insertion and removal logic. The downside is that the sentinel node’s prev and next pointers need to be initialized before the list can be used. In a more modern language like Zig, this could easily be done at compile time. But in C you would need to write a static initializer like this:
List table[TABLE_SIZE] = {
    {.sentinel={ .prev=&table[0].sentinel, .next=&table[0].sentinel}},
    {.sentinel={ .prev=&table[1].sentinel, .next=&table[1].sentinel}},
    {.sentinel={ .prev=&table[2].sentinel, .next=&table[2].sentinel}},
    // repeat TABLE_SIZE (4096) times...
};
This is of course possible with metaprogramming, or with a lot of patience. But I just run an initialization routine near the top of the program’s main function:
void Init() {
    for (int i = 0; i < TABLE_SIZE; ++i) {
        Node *sentinel = &table[i].sentinel;
        sentinel->prev = sentinel;
        sentinel->next = sentinel;
    }
}
Locks#
You might have noticed that each List has an SRW lock; this is how I prevent race conditions when multiple threads use the same wait list. I was kind of pushed into this choice by deciding to use a condition variable as the underlying synchronization primitive. Condition variables can only be used with SRW locks or Win32 critical sections, and critical sections are a lot more heavyweight than SRW locks, so there’s no point in using them.
The actual WaitOnAddress itself is apparently fully lock-free, so Microsoft probably used something like this lock-free hash table instead of resorting to locks. But I think locks work well enough. They’re only contended here when many threads are synchronizing on the same address - in which case there is already bound to be high contention regardless. I haven’t actually tested this, but my intuition tells me that the overhead of a lock-free hash table dwarfs any concurrency advantage it has over the simple blocking table I use.
Wait and Wake#
With all the pieces in place, and without further ado, here is the complete implementation of the Wait function:
void Wait(long *address, long compare) {
    int slot = (uintptr_t)address % TABLE_SIZE;
    List *list = &table[slot];
    Node node = { .address = address };
    AcquireSRWLockExclusive(&list->lock);
    {
        node.prev = list->sentinel.prev;
        node.next = &list->sentinel;
        list->sentinel.prev->next = &node;
        list->sentinel.prev = &node;
        while (*address == compare) // atomic read relaxed
            SleepConditionVariableSRW(&node.condition, &list->lock, INFINITE, 0);
        node.prev->next = node.next;
        node.next->prev = node.prev;
    }
    ReleaseSRWLockExclusive(&list->lock);
}
It’s quite simple and small actually. I didn’t expect it to turn out that way. We map the address to one of the wait lists from the table, and then add ourselves to it. We then sleep until the address changes, waking up whenever someone notifies us, and finally remove ourselves from the list.
The critical thing to look out for is the short time gap right after the thread tests the address and decides to wait, but before it actually waits. If another thread can signal the address during this gap, it could do so while no one is waiting, and the waiting thread will go to sleep indefinitely. The condition variable avoids this because SleepConditionVariableSRW releases the SRW lock and goes to sleep atomically, and Wake only notifies while holding that same lock. This actually bit me on my first attempt, when I was using spinlocks instead of SRW locks to protect the wait lists.
Of course this is just the minimal implementation for demonstration purposes. In a real implementation you would first poll the address a couple of times in a loop before taking the lock and waiting. And if your platform doesn’t support lightweight user-space locks and condition variables, you would cache them in thread-local storage. You could also allow 1, 2, 4, and 8 byte sized addresses, just like the real WaitOnAddress.
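For example, the polling wrapper could look something like this - a sketch where the spin count of 64 is an arbitrary illustration, not a tuned value:
// Sketch of the "poll first" idea: spin on the address a few times
// before falling back to the blocking Wait.
void WaitSpin(long *address, long compare) {
    for (int i = 0; i < 64; ++i) {
        if (*address != compare) // atomic read relaxed
            return;
        YieldProcessor();
    }
    Wait(address, compare);
}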
Since I like making diagrams, here is an illustration of the Wait algorithm:
The Wake implementation is even more straightforward:
void Wake(void *address, bool all) {
    int slot = (uintptr_t)address % TABLE_SIZE;
    List *list = &table[slot];
    AcquireSRWLockExclusive(&list->lock);
    {
        Node *end = &list->sentinel;
        for (Node *node = end->next; node != end; node = node->next) {
            if (node->address == address) {
                WakeConditionVariable(&node->condition);
                if (!all)
                    break;
            }
        }
    }
    ReleaseSRWLockExclusive(&list->lock);
}
It just iterates the wait list for the passed-in address and wakes up the matching threads. You can also choose whether you want to wake up at most 1 thread or all of them. You could instead have a numThreadsToWakeUp parameter if you need that level of control, but I can’t imagine a use case for that.
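As a quick sanity check, the simple lock from the introduction can be rebuilt on top of these two functions, with Wait and Wake standing in for WaitOnAddress and WakeByAddressSingle (the Portable suffix is just for illustration):
// The lock from the introduction, now built on our own Wait/Wake.
void AcquirePortable(long *lock) {
    long taken = 1;
    while (InterlockedCompareExchange(lock, 1, 0))
        Wait(lock, taken);
}
void ReleasePortable(long *lock) {
    InterlockedExchange(lock, 0);
    Wake(lock, false);
}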
Here’s a diagram of the Wake algorithm:
Events#
I ended up using condition variables and SRW locks in my implementation, but these are only available since Windows Vista. If you needed to also support Windows XP or Windows 2000 then you could use events. Or if you’re on another operating system that doesn’t natively support events, you could use semaphores.
The implementation with these two primitives is only slightly different. The biggest difference is that you’re free to use a spinlock for mutual exclusion on the wait lists. With condition variables I was forced into using the matching locks. My untested intuition tells me spinlocks are better suited to this problem because the critical sections of code involve very few operations.
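The Node and List would also change accordingly - a sketch, assuming the Acquire and Release used below are a plain spinlock like the AcquireSpinlock/ReleaseSpinlock pair from the beginning of the post:
// Hypothetical Node/List for the event-based variant: the condition
// variable becomes an event handle, and the SRW lock becomes a plain
// long driven by a spinlock.
struct Node {
    Node *prev;
    Node *next;
    long *address;
    HANDLE event;
};
struct List {
    long spinlock; // 0 = free, 1 = taken
    Node sentinel;
};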
Here’s a sketch of what Wait would look like:
void Wait(long *address, long compare) {
    int slot = (uintptr_t)address % TABLE_SIZE;
    List *list = &table[slot];
    Node node = {
        .address = address,
        .event = CreateEvent(NULL, FALSE, FALSE, NULL),
    };
    Acquire(&list->spinlock);
    {
        node.prev = list->sentinel.prev;
        node.next = &list->sentinel;
        list->sentinel.prev->next = &node;
        list->sentinel.prev = &node;
    }
    Release(&list->spinlock);
    while (*address == compare) // atomic read relaxed
        WaitForSingleObject(node.event, INFINITE);
    Acquire(&list->spinlock);
    {
        node.prev->next = node.next;
        node.next->prev = node.prev;
    }
    Release(&list->spinlock);
    CloseHandle(node.event);
}
So it’s pretty much exactly the same, except the event is a kernel object that we need to manually manage with CreateEvent and CloseHandle. You would definitely cache these in thread-local storage in a real implementation, instead of creating a new one every time.
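A sketch of that caching, assuming MSVC’s __declspec(thread) (C11 _Thread_local works too); GetThreadEvent is an illustrative helper name:
// Cache one auto-reset event per thread instead of creating and
// closing a new one on every Wait. Thread-local statics are zero-initialized.
static __declspec(thread) HANDLE tls_event;

static HANDLE GetThreadEvent(void) {
    if (!tls_event)
        tls_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    return tls_event;
}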
Wake looks pretty much identical in this implementation:
void Wake(void *address, bool all) {
    int slot = (uintptr_t)address % TABLE_SIZE;
    List *list = &table[slot];
    Acquire(&list->spinlock);
    {
        Node *end = &list->sentinel;
        for (Node *node = end->next; node != end; node = node->next) {
            if (node->address == address) {
                SetEvent(node->event);
                if (!all)
                    break;
            }
        }
    }
    Release(&list->spinlock);
}
Note that with these separately managed kernel objects, you could do something a little bit more efficient here. Instead of waking up the waiting threads while holding the spinlock, you could do it outside:
void Wake(void *address, bool all) {
    int slot = (uintptr_t)address % TABLE_SIZE;
    List *list = &table[slot];
    int cursor = 0;
    HANDLE wake[MAX_THREADS]; // MAX_THREADS: assumed upper bound on waiters per list
    Acquire(&list->spinlock);
    {
        Node *end = &list->sentinel;
        for (Node *node = end->next; node != end; node = node->next) {
            if (node->address == address) {
                wake[cursor++] = node->event;
                if (!all)
                    break;
            }
        }
    }
    Release(&list->spinlock);
    for (int i = 0; i < cursor; ++i)
        SetEvent(wake[i]);
}
This way, waiting threads wouldn’t be woken up only to immediately block on the spinlock while it’s still held by the Wake thread. Also, calling kernel functions such as SetEvent under a lock is always a bit of a smell - they can be quite expensive if a kernel transition occurs.
Normally this wouldn’t be thread safe: after you release the lock, a waiting thread could remove itself from the list and delete its event before you call SetEvent - so you would be calling it on an invalid event handle. However, this will only result in an error on most OSes; it won’t actually do anything bad like corrupting memory. The worst that can happen is that, if you cache the event handle in a thread’s local storage, that thread will spuriously wake up a couple of times the next time it calls Wait - no big deal.