Wednesday, July 23, 2008

A working lock-free stack implementation

WiP

This is the third post in the mini-series on a lock-free stack, included in the OmniThreadLibrary. You've learned how the lock-free stack was implemented in the first place and which problem was hidden in that implementation. I've also mentioned that the problem was fixed in the meantime and that the working implementation is already available in the repository and as a snapshot, but I haven't said a word about the new implementation. This will now be rectified.

In the initial implementation lied an excellent example of the ABA problem.  You can read more about it in the Wikipedia, where you'll also learn that ABA problem is commonly solved by adding ‘tag’ bits. That's exactly what was done in the OTL, except that we didn't use just few measly bits, but a 4-byte counter.

Instead of containing only a pointer to the first node, each chain head is now composed of two parts — a pointer to the first node and a 4-byte spin count. This second part will make sure that the ABA problem cannot occur.

type
TOmniHeadAndSpin = packed record
Head: POmniLinkedData;
Spin: cardinal;
end;{ TOmniHeadAndSpin }

As the problem lied in the PopLink method, we'll start today's exploration just there. This is the new version in all its glory. (Of course, assembler is still provided by the ‘GJ’, I had nothing to do with it.)

class function TOmniBaseContainer.PopLink(var chain: TOmniHeadAndSpin): POmniLinkedData;
asm
push edi
push ebx
mov edi, eax //edi = @chain
@Spin:
mov ecx, 1 //Increment spin reference for 1
lock xadd [edi + 4], ecx //Get old spin reference to ecx
mov eax, [edi] //eax := chain.Head
mov edx, [edi +4] //edx := chain.Spin
test eax, eax
jz @Exit //Is Empty?
inc ecx //Now we are ready to cmpxchg8b
cmp edx, ecx //Is reference the some?
jnz @Spin
mov ebx, [eax] //ebx := Result.Next
lock cmpxchg8b [edi] //Now try to xchg
jnz @Spin //Do spin ???
@Exit:
pop ebx
pop edi
end;

You may have noticed that this is now a class function (a class static function, to be more precise). That made more sense as it needs no information from the class — it only works on a chain, which is passed as a parameter. Also, the code is much longer and not very easy to understand, so I'll try the trick that helped in the previous installment. I'll write an approximation of the PopLink in Delphi.

class function TOmniBaseContainer.PopLink(var chain: TOmniHeadAndSpin): POmniLinkedData;
label
Spin;
var
chainSpin: cardinal;
nextNode : POmniLinkedData;
spinRef : cardinal;
begin
// push edi
// push ebx
// mov edi, eax //edi = @chain
Spin:
// mov ecx, 1 //Increment spin reference for 1
// lock xadd [edi + 4], ecx //Get old spin reference to ecx
{.$ATOMIC}
spinRef := chain.Spin;
chain.Spin := chain.Spin + 1;
{.$END ATOMIC}
// mov eax, [edi] //eax := chain.Head
// mov edx, [edi +4] //edx := chain.Spin
{.$ATOMIC}
Result := chain.Head;
chainSpin := chain.Spin;
// test eax, eax
// jz @Exit //Is Empty?
if not assigned(Result) then
Exit;
// inc ecx //Now we are ready to cmpxchg8b
Inc(spinRef);
// cmp edx, ecx //Is reference the some?
// jnz @Spin
if spinRef <> chainSpin then
goto Spin;
// mov ebx, [eax] //ebx := Result.Next
// nextNode = ebx
nextNode := Result.Next;
// lock cmpxchg8b [edi] //Now try to xchg
{.$ATOMIC}
if (Result = chain.Head) and (chainSpin = chain.Spin) then begin
chain.Head := nextNode;
chain.Spin := spinRef;
end
else begin
chainSpin := chain.Spin;
Result := chain.Head;
{.$END ATOMIC}
// jnz @Spin //Do spin ???
goto Spin;
end;
//@Exit:
// pop ebx
// pop edi
end;

In the beginning, a little housekeeping is done (two registers stored away because Delphi compiler expects us not to modify them) and address of the chain record is loaded into the edi register.

With a little help of locked xadd, current version of the header spin value is loaded into the spinRef (ecx register) and is incremented in the header.

//  mov   ecx, 1                            //Increment spin reference for 1
// lock xadd [edi + 4], ecx //Get old spin reference to ecx
{.$ATOMIC}
spinRef := chain.Spin;
chain.Spin := chain.Spin + 1;
{.$END ATOMIC}

Xadd instructions stores first operand ([edi + 4], which equals to chain.Spin) into the second operand (ecx) and sum of both operands (original value of the second operand is used for summation) into the first operand.

Then we fetch current chain.Head and chain.Spin and store them into the Result (eax) and chainSpin (edx), respectively. This is a non-atomic fetch and doesn't guarantee that both values really belong to the same state of the chain record. Another thread may have been invoked between those two movs and it could call PopLink on the same chain and modify the header.

//  mov   eax, [edi]                        //eax := chain.Head
// mov edx, [edi +4] //edx := chain.Spin
Result := chain.Head;
chainSpin := chain.Spin;
// test eax, eax
// jz @Exit //Is Empty?
if not assigned(Result) then
Exit;

Because of the same reason there is no guarantee  that chainSpin is indeed equal to spinRef+1 so this is now checked for.

//  inc   ecx                               //Now we are ready to cmpxchg8b
Inc(spinRef);
// cmp edx, ecx //Is reference the some?
// jnz @Spin
if spinRef <> chainSpin then
goto Spin;

Now we're getting ready for the pointer swap. First we need an address of the second node in the ebx register.

//  mov   ebx, [eax]                        //ebx := Result.Next
nextNode := Result.Next;

Finally, we can use cmpxchg8b instruction to do the swap. Cmpxchg8b is a very special beast. It is similar to the cmpxchg, but it compares and swaps full 8 bytes in one go!

//  lock cmpxchg8b [edi]                    //Now try to xchg
{.$ATOMIC}
if (Result = chain.Head) and (chainSpin = chain.Spin) then begin
chain.Head := nextNode;
chain.Spin := spinRef;
end
else begin
chainSpin := chain.Spin;
Result := chain.Head;
{.$END ATOMIC}
// jnz @Spin //Do spin ???
goto Spin;
end;

Cmpxchg8b compares combination of edx and eax registers with its operand [edi]. In our case, edi is pointing to the chain (8 byte value in which first 4 bytes are a node pointer and second 4 a spin count), and eax and edx were loaded from the same structure just few instructions back. In other words, just like in the initial implementation we're checking if our internal values are still relevant. If that is not true, cmpxchg8b reloads edx and eax from the [edi] and the code then jumps back to the Spin label.

If values match, registers ecx and ebx are copied into the operand [edi]. Here, value in the ecx register is equal to the edx (we just verified that) and ebx contains the pointer to the next node. In other words, chain head is modified to point to the second node in chain.

There's not much difference between this implementation and the original one. The biggest and most important change is introduction of spin counter, which makes sure that the ABA problem, mentioned in the previous post, cannot occur.

[To be completely frank, ABA situation can still occur, at least in theory. To experience a problem with the new code, one thread would have to stop just before cmpxchg8b and wait there for a very long time. So long, in fact, that other threads would be able to wrap the spin counter from $FFFFFFFF to 0 and back to the original value. A little less than 4,3 billion PushLinks would have to be called. If the original thread is then awaken — and the spin counter contains the exact same value as it had before the thread was paused — the ABA problem would occur. That will never happen in practice, I'm totally sure.]

Besides the PopLink and introduction of spin counters, nothing much has changed. The PushLink method was slightly modified, as it was also changed from a normal method to a class static one, but it still works exactly as before.

class procedure TOmniBaseContainer.PushLink(const link: POmniLinkedData;
var chain: TOmniHeadAndSpin);
asm
mov ecx, eax
mov eax, [edx] //edx = chain.Head
@Spin:
mov [ecx], eax //link.Next := chain.Head
lock cmpxchg [edx], ecx //chain.Head := link
jnz @Spin
end; { TOmniBaseStack.PushLink }

That's it. This version of the lock-free stack has successfully passed 4-hour stress test, which makes me believe that it really works. Winking

Next time on the agenda: How the lock-free queue is made. This time I'll really blog about it. Nothing more will distract me. Promise!

Labels: , , , , , ,

3 Comments:

Anonymous Anonymous said...

This is really interesting! Keep up the great work!

I'm eager to try it out as soon as I make meet my current deadline.

Also, would love to hear your impression of AsyncCalls 2.21:

http://andy.jgknet.de/async/

Any estimate on when you'll tag OmniThreadLibrary as 1.0? It'll help me get permission to use it at work.

22:53  
Blogger gabr said...

Andy, hi!

I know of AsyncCalls (and I'm recommending it around), but I never took a deeper look. Maybe one of those days ...

As of version 1.0, hmmm. I want to finish the thread pool (not much work here) and I also want to take another try at the IOmniCommunicationEndpoint. Sending objects and interfaces is much too complicated at the moment.

The problem here is that I'll be going on vacation in a week. I don't think version 1.0 will be released before that. Maybe somewhere near the end of the August.

Better to get it right then to release it quick, I think.

23:42  
Anonymous Yuhong Bao said...

Note that the XADD and CMPXCHG instructions were introduced in the 486 processor and CMPXCHG8B in the Pentium. Older processors that do not have these instructions will crash with an illegal opcode exception.

16:47  

Post a Comment

Links to this post:

Create a Link

<< Home