Improving the ESB Toolkit: Fixing the “endless loop” bug when creating Fault messages

The ESB Toolkit is great. However, it’s not without its fair share of bugs. I’ve been
meaning to blog about this since I first came across this bug a year or so ago, but
it’s taken me a while.

If you use the ExceptionManagementDb functionality, you’re probably familiar
with putting something like this in your exception handling logic in an orchestration:


FaultMessage = Microsoft.Practices.ESB.ExceptionHandling.ExceptionMgmt.CreateFaultMessage();

// Set properties
FaultMessage.FailureCategory = "Error";
FaultMessage.FaultCode = "1";
FaultMessage.FaultDescription = ex.Message;
FaultMessage.FaultSeverity = Microsoft.Practices.ESB.ExceptionHandling.FaultSeverity.Critical;

// Add original request
Microsoft.Practices.ESB.ExceptionHandling.ExceptionMgmt.AddMessage(FaultMessage, RequestMessage);

One of the more difficult bugs is the “endless loop” bug you get when you use the
above code either:

a) outside of an exception handler

OR

b) when catching a BizTalk exception (i.e. any exception which inherits from Microsoft.XLANGs.BaseTypes.XLANGsException)

The first case is documented here whilst
the second is something I came across the first time we hit an EmptyPartException in
a catch block.

In these cases (and possibly others), the Host Instance running the orchestration
pegs the CPU at 100% and stops responding: your code effectively never gets past
the CreateFaultMessage statement.

The reason for this is that there is a bug in the ESB code that causes an endless
loop to occur in certain scenarios. At the point the bug occurs, the code walks through
the exception context segments in the orchestration, attempting to find the last exception.
It’s supposed to keep looping until it finds the last exception segment. Instead,
in certain situations, it never exits the loop.

I found this out by decompiling the Microsoft.Practices.ESB.ExceptionHandling.dll assembly
using a decompiler (.NET Reflector / ILSpy).

I recompiled the assembly and then traced through the code as it executed.

The problem lies in a method called GetServiceXlangInfo().

The code that causes the problem is the while loop below:


try
{
    int exceptionSegmentIndex = segmentIndex;
    object successorSegment = null;
    exception = null;

    while (exception == null && exceptionSegmentIndex > -1)
    {
        exception = Context.RootService.RootContext.__MyService._stateMgrs[index].__MyService._segments[exceptionSegmentIndex].ExceptionContext._exception;

        if (exception == null)
        {
            successorSegment = Context.RootService.RootContext.__MyService._stateMgrs[index].__MyService._segments[exceptionSegmentIndex].ExceptionContext._successorSegment;

            if (successorSegment == null)
            {
                break;
            }

            exceptionSegmentIndex = (int)successorSegment;
        }
    }
    // (rest of the try block not shown)

Looking carefully, you can see that a while loop is entered, which will only exit
if successorSegment == null, if an exception is found, or if exceptionSegmentIndex
drops below zero.

However, the logic is faulty: what’s supposed to happen is that starting at the current
segment in the orchestration it looks for the exception object. If it doesn’t find
it, then it moves down the segments looking for the exception object, and then exits
the loop when it finds the exception object, or if there are no more segments to search.

However, it appears that if you’re not in an exception handler, or you’re catching
an exception that inherits from XLANGsException, then you end up with a situation
where not only is no exception object found, but successorSegment is always equal
to the current segment. In other words, you stop moving down the tree of segments
and just keep iterating over the same segment, never finding the exception.

The fix is to break out of the loop if no successor segment is found OR if the current
segment is the same as the successor segment.

i.e. replace the line:

if (successorSegment == null)

with this:

if ((successorSegment == null) || (exceptionSegmentIndex == (int)successorSegment))

Whilst you’re at it, you may as well also check for out-of-bounds indexes, as I can
foresee other bugs arising. The entire bit of code to replace would therefore look
like this:


try
{
    int exceptionSegmentIndex = segmentIndex;
    object successorSegment = null;
    exception = null;

    while ((exception == null) && (exceptionSegmentIndex > -1))
    {
        // FIX: Added code to check if exceptionSegmentIndex is out of bounds before using it as an indexer
        if (exceptionSegmentIndex < Context.RootService.RootContext.__MyService._stateMgrs[index].__MyService._segments.Length)
        {
            exception = Context.RootService.RootContext.__MyService._stateMgrs[index].__MyService._segments[exceptionSegmentIndex].ExceptionContext._exception;
        }

        if (exception == null)
        {
            // FIX: Added code to check if exceptionSegmentIndex is out of bounds before using it as an indexer
            if (exceptionSegmentIndex < Context.RootService.RootContext.__MyService._stateMgrs[index].__MyService._segments.Length)
            {
                successorSegment = Context.RootService.RootContext.__MyService._stateMgrs[index].__MyService._segments[exceptionSegmentIndex].ExceptionContext._successorSegment;
            }

            // FIX: Fixes the endless loop error that happens occasionally
            if ((successorSegment == null) || (exceptionSegmentIndex == (int)successorSegment))
            {
                break;
            }

            exceptionSegmentIndex = (int)successorSegment;
        }
    }
    // (rest of the try block not shown)
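
If you want to convince yourself that the extra check is enough, here is a minimal, self-contained C# sketch of the same walk. SegmentStub, SegmentWalk and FindException are made-up names for illustration only; they are not part of the ESB Toolkit or the XLANG/s runtime, and the real code reads this state from private fields rather than a simple array. Treat it as a model of the loop logic, not as the toolkit's implementation.

```csharp
using System;

// Hypothetical stand-in for one orchestration segment's exception context.
public sealed class SegmentStub
{
    public Exception Exception;      // null when the segment holds no exception
    public object SuccessorSegment;  // boxed int index of the next segment, or null
}

public static class SegmentWalk
{
    // Follows the successor chain from startIndex and returns the first
    // exception found, or null. Stops on a missing successor, a successor
    // that points back at the current segment, or an out-of-bounds index.
    public static Exception FindException(SegmentStub[] segments, int startIndex)
    {
        int i = startIndex;
        while (i > -1 && i < segments.Length)
        {
            if (segments[i].Exception != null)
            {
                return segments[i].Exception;
            }

            object successor = segments[i].SuccessorSegment;
            if (successor == null || (int)successor == i)
            {
                return null; // end of chain, or self-reference: give up
            }
            i = (int)successor;
        }
        return null;
    }
}

public static class Program
{
    public static void Main()
    {
        // Segment 1 points at itself and holds no exception: the shape that
        // spins forever when the self-reference check is missing.
        var stuck = new[]
        {
            new SegmentStub { SuccessorSegment = null },
            new SegmentStub { SuccessorSegment = 1 },
        };
        Console.WriteLine(SegmentWalk.FindException(stuck, 1) == null); // True

        // Segment 1 points at segment 0, which holds the exception.
        var found = new[]
        {
            new SegmentStub { Exception = new InvalidOperationException("boom") },
            new SegmentStub { SuccessorSegment = 0 },
        };
        Console.WriteLine(SegmentWalk.FindException(found, 1).Message); // boom
    }
}
```

Note that this check only guards against a segment whose successor is itself, which is the case I actually hit. A longer cycle (segment A pointing to B and B back to A) would still loop; if you are worried about that, capping the number of iterations at the number of segments would cover it.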

This isn’t a simple issue for an end-user to fix as Microsoft doesn’t supply you with
the source code for this assembly (although the source for much else of the ESB toolkit
is supplied).

The solution is to pull the CreateFaultMessage() and GetServiceXlangInfo() methods
(and any other methods/member variables required) into your own class, and then call
your fixed version of the CreateFaultMessage() method.
I’m unsure how much Microsoft would frown upon this, but if it fixes a production
issue then I don’t see many other choices.

I was hoping that the toolkit v2.1 would fix this bug, but it hasn’t – here’s hoping
a future release will.

I appreciate that the current thinking is not to use this code outside of an exception
block – but if you do, it shouldn’t bring BizTalk to its knees!

In the meantime, I logged a bug report on this with Microsoft, although from what
I can see they’re aware of the issue.
