
Software Engineer @alexdcode

Why lock-free data structures just aren’t lock-free enough

Published 19 March 2012 10:39 pm

Today’s post will explore why the current ways to communicate between threads don’t scale, and show you a possible way to build scalable parallel programming on top of shared memory.

The problem with shared memory

Soon, we will have dozens, hundreds and then millions of cores in our computers. It’s inevitable, because individual cores just can’t get much faster. At some point, that’s going to mean that we have to rethink our architecture entirely, as millions of cores can’t all access a shared memory space efficiently. But millions of cores are still a long way off, and in the meantime we’ll see machines with dozens of cores, struggling with shared memory.

Alex’s tip: The best way for an application to make use of that increasing parallel power is to use a concurrency model like actors, which deals with synchronisation issues for you. Then the maintainer of the actors framework can find the most efficient way to coordinate access to shared memory, so your actors can pass messages to each other efficiently.

At the moment, NAct uses the .NET thread pool and a few locks to marshal messages. That works well on dual- and quad-core machines, but it won’t scale to more cores. Every time we take a lock, our core performs an atomic memory operation (e.g. CAS) on the cell of memory representing the lock, so it can be sure no other core holds that lock. This is very fast when the lock isn’t contended, but the core still has to notify all the other cores, in case any of them holds that cell in a cache. As the number of cores increases, the total cost of taking a lock grows linearly with it.
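
To make that concrete, here’s roughly what a lock amounts to underneath (a sketch, not NAct’s actual code): acquiring it is just a CAS loop on one shared cell.

    using System.Threading;

    // A minimal spin lock, just to show what a lock amounts to underneath:
    // acquiring it is a CAS loop on one shared cell, and every other core
    // has to find out when that cell changes. (Not NAct's actual code.)
    class SpinLockSketch
    {
        private int _held; // 0 = free, 1 = held

        public void Enter()
        {
            // Atomically swap 0 -> 1; keep trying until we win the race.
            while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            {
                Thread.SpinWait(1); // back off briefly while contended
            }
        }

        public void Exit()
        {
            // Another atomic operation to publish the release to other cores.
            Interlocked.Exchange(ref _held, 0);
        }
    }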

A lot of work has been done on “lock-free” data structures, which avoid locks by using atomic memory operations directly. These give fairly dramatic performance improvements, particularly on systems with a few (2 to 4) cores. The .NET 4 concurrent collections in System.Collections.Concurrent are mostly lock-free. However, lock-free data structures still don’t scale indefinitely, because any use of an atomic memory operation still involves every core in the system.
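
For comparison, here’s the shape of a classic lock-free push (a Treiber stack, in the same spirit as the .NET 4 collections, though not their actual code). Even without a lock, every push still ends in a CAS on the shared head cell:

    using System.Threading;

    // Classic lock-free stack push. No lock is taken, but the CAS on _head
    // is still an atomic memory operation that every core has to observe.
    class LockFreeStack<T>
    {
        private class Node { public T Value; public Node Next; }

        private Node _head;

        public void Push(T value)
        {
            var node = new Node { Value = value };
            Node oldHead;
            do
            {
                oldHead = _head;
                node.Next = oldHead;
                // Succeeds only if _head hasn't changed since we read it.
            } while (Interlocked.CompareExchange(ref _head, node, oldHead) != oldHead);
        }
    }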

A sync-free data structure

Some concurrent data structures can be written in a completely synchronisation-free way, without using any atomic memory operations at all. One useful example is a single producer, single consumer (SPSC) queue. It’s easy to write a sync-free, fixed-size SPSC queue using a circular buffer*. Slightly trickier is a queue that grows as needed. You can use a linked list to represent the queue, but if you leave the nodes to be garbage collected once you’re done with them, the GC will need to involve all the cores in collecting the finished nodes. Instead, I’ve implemented a proof of concept, inspired by this Intel article, which reuses the nodes by putting them in a second queue to send back to the producer.

* In all these cases, you need to use memory barriers correctly, but these are local to a core, so don’t have the same scalability problems as atomic memory operations.
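
To show what I mean by sync-free, here’s a sketch of the fixed-size circular buffer version. The names and capacity handling are mine rather than the proof-of-concept code, but the important property is there: only the producer writes the tail index and only the consumer writes the head index, so ordinary volatile reads and writes are enough, with no atomic read-modify-write anywhere.

    // A fixed-size, single-producer single-consumer queue over a circular buffer.
    // The producer is the only thread that writes _tail and the consumer is the
    // only thread that writes _head, so volatile (ordered) reads and writes are
    // all the synchronisation we need.
    class SpscRingBuffer<T>
    {
        private readonly T[] _buffer;
        private volatile int _head; // next slot to read  (written by consumer only)
        private volatile int _tail; // next slot to write (written by producer only)

        public SpscRingBuffer(int capacity)
        {
            _buffer = new T[capacity + 1]; // one slot stays empty to tell full from empty
        }

        // Called only from the producer thread.
        public bool TryEnqueue(T item)
        {
            int tail = _tail;
            int next = (tail + 1) % _buffer.Length;
            if (next == _head)
                return false; // full

            _buffer[tail] = item;
            _tail = next;     // volatile write publishes the item to the consumer
            return true;
        }

        // Called only from the consumer thread.
        public bool TryDequeue(out T item)
        {
            int head = _head;
            if (head == _tail)
            {
                item = default(T);
                return false; // empty
            }

            item = _buffer[head];
            _head = (head + 1) % _buffer.Length; // volatile write frees the slot
            return true;
        }
    }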

Performance tests

I tried benchmarking my SPSC queue against the .NET ConcurrentQueue, and against a standard Queue protected by locks. In some ways, this isn’t a fair comparison, because both of these support multiple producers and multiple consumers, but I’ll come to that later.

I started on my dual-core laptop, running a simple test that had one thread producing 64-bit integers, and another consuming them, to measure the pure overhead of the queue.
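
The test was roughly this shape (the message count and the spin-waits are illustrative rather than my exact figures, and SpscRingBuffer is the sketch from above):

    using System;
    using System.Diagnostics;
    using System.Threading;

    // Roughly the shape of the benchmark: one thread enqueues a run of longs,
    // another dequeues them, and we time the whole run. It also checks that the
    // values come out in order, so a buggy queue would fail rather than look fast.
    static class QueueBenchmark
    {
        const long Count = 10000000;

        public static TimeSpan Run(SpscRingBuffer<long> queue)
        {
            var stopwatch = Stopwatch.StartNew();

            var producer = new Thread(() =>
            {
                for (long i = 0; i < Count; i++)
                    while (!queue.TryEnqueue(i))
                        Thread.SpinWait(1); // queue full, wait for the consumer
            });

            var consumer = new Thread(() =>
            {
                for (long expected = 0; expected < Count; expected++)
                {
                    long value;
                    while (!queue.TryDequeue(out value))
                        Thread.SpinWait(1); // queue empty, wait for the producer

                    if (value != expected)
                        throw new InvalidOperationException("Queue lost or reordered a message");
                }
            });

            producer.Start();
            consumer.Start();
            producer.Join();
            consumer.Join();

            stopwatch.Stop();
            return stopwatch.Elapsed;
        }
    }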

[Chart: benchmark results on the dual-core laptop for the locked Queue, ConcurrentQueue and the SPSC queue]

So, nothing very interesting here. Both concurrent collections perform better than the lock-based one as expected, but there’s not a lot to choose between the ConcurrentQueue and my SPSC queue. I was a little disappointed, but then, the .NET Framework team spent a lot longer optimising it than I did.

So I dug out a more powerful machine that Red Gate’s DBA tools team had been using for testing: a 6-core Intel i7 with hyper-threading, giving 12 logical cores. Now the results get more interesting.

[Chart: benchmark results on the 12-logical-core machine as the number of producer-consumer pairs increases]

As I increased the number of producer-consumer pairs to 6 (to saturate all 12 logical cores), the locking approach was slow, and got even slower, as you’d expect.

What I didn’t expect to be so clear was the drop-off in performance of the lock-free ConcurrentQueue. I could see the machine only using about 20% of the available CPU cycles when it should have been saturated. My interpretation is that as all the cores used atomic memory operations to safely access the queue, they ended up spending most of their time notifying each other about cache lines that needed invalidating.

The sync-free approach scaled perfectly, despite still working via shared memory, which after all, should still be a bottleneck. I can’t quite believe that the results are so clear, so if you can think of any other effects that might cause them, please comment!

Obviously, this benchmark isn’t realistic because we’re only measuring the overhead of the queue. Any real workload, even on a machine with 12 cores, would dwarf the overhead, and there’d be no point worrying about this effect. But would that be true on a machine with 100 cores?

Still to be solved

The trouble is, you can’t build many concurrent algorithms using only an SPSC queue to communicate. In particular, I can’t see a way to build something as general-purpose as actors on top of just SPSC queues. Fundamentally, an actor needs to be able to receive messages from multiple other actors, which seems to need a multiple producer, single consumer (MPSC) queue. I’ve been thinking about ways to build a sync-free MPSC queue out of multiple SPSC queues and some kind of sign-up mechanism, as sketched below. Hopefully I’ll have something to tell you about soon, but leave a comment if you have any ideas.
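
To give a flavour of the kind of thing I mean (very much a sketch of one possible shape, not a worked-out design): each producer signs up once for its own SPSC queue, which is the only step that needs any synchronisation, and the single consumer polls the queues in turn. It does nothing about ordering between producers, which is the hard part.

    using System.Collections.Generic;

    // Sketch of an MPSC queue built from per-producer SPSC queues. The sign-up
    // step takes a lock, but it only happens once per producer, so it doesn't
    // affect the steady state. The consumer polls a snapshot of the queues.
    class MpscOverSpsc<T>
    {
        private readonly object _signUpLock = new object();
        private readonly List<SpscRingBuffer<T>> _queues = new List<SpscRingBuffer<T>>();
        private volatile SpscRingBuffer<T>[] _snapshot = new SpscRingBuffer<T>[0];

        // Called once per producer thread; rare, so a lock here doesn't hurt scaling.
        public SpscRingBuffer<T> SignUp(int capacity)
        {
            var queue = new SpscRingBuffer<T>(capacity);
            lock (_signUpLock)
            {
                _queues.Add(queue);
                _snapshot = _queues.ToArray(); // published for the consumer to poll
            }
            return queue;
        }

        // Called only from the single consumer thread.
        public bool TryReceive(out T item)
        {
            foreach (var queue in _snapshot)
                if (queue.TryDequeue(out item))
                    return true;

            item = default(T);
            return false;
        }
    }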

Edit: The example code is available on BitBucket.

3 Responses to “Why lock-free data structures just aren’t lock-free enough”

  1. CliveT says:

    I really enjoyed your article! I’m fascinated by memory barriers and weak memory models – I came across MESI protocols etc a few years ago and think that they’re really interesting.

    Just a couple of questions:

    (i) 20% CPU as evidence of lock contention in the cache coherency protocol… surely the OS doesn’t know that the CPU is stalled on a cache miss, and hence wouldn’t be able to subtract the time spent handling the miss from the time allocated to the process?
    i.e. the OS accounts for the time spent in the code of the process, but there’s no context switch while handling a memory miss.

    (ii) The sync-free approach scaled perfectly, despite still working via shared memory… perhaps not. If the producer and consumer act within the same cache hierarchy, then all of the access can be happening inside the cache – the store buffers may not get flushed for a very long time. Hence you might not be using shared memory at all.

    (iii) Are you going to make the code available? I imagine that the Microsoft code has been proved correct – any bugs in your code and the measurements are invalid. Did your benchmark check that the produced/consumed numbers matched? i.e. I assume you had a counter and checked that the values you read from the queue were incrementing steadily too.

    (iv) As a base for actors I imagine you’ll need to maintain temporal ordering between the SPSCs that are used to implement MPSC. If A sends B a message, and C sends B a message, then the messages need to arrive in the order “fromA”, “fromC” (since A and C might have communicated in the meantime). I imagine that this need to interleave the SPSC queues correctly is going to cause the interaction problems.

  2. Alex.Davies says:

    Hi Clive,
    i) Yes, I’m dubious that I’m right about the reason that ConcurrentQueue gets so much slower. Given that you’re probably right that the CPU% counter can’t measure cache misses, it could be something else; maybe it allocates objects that need garbage collecting.

    ii) Good point, even faster :)

    iii) Added a link to the code, yes, it checks that there are no missing messages as it proceeds

    iv) Yes, it will be hard to guarantee that messages arrive in causal order, I don’t know how to solve that

  3. jhm says:

    I thought the actor model doesn’t have a requirement on the order of message arrival. Perhaps I’m misunderstanding Clive’s 4th point?
