This post identifies the cause of, and a fix for, the intermittent exception raised during the local (i.e. workstation) build and test cycle of the Lean engine "backtesting-desktop" configuration.
The Lean engine backtesting desktop configuration, compiled in debug mode, raised an exception that appears to be the result of an intermittent race condition between the disposal of a .NET blocking collection instance and the abort of the thread the collection runs on.
An intermittent exception is an unacceptable user experience. The objective is to remove the possibility of the exception being raised.
Computing environment configuration
· Lean engine assembly v22.214.171.124, pulled locally on 18Feb17 from https://github.com/QuantConnect/Lean
· VS Community 2015 Update 3.
· Windows 10 version 1607
Symptoms – the user experience
In VS the Lean engine build completed cleanly, but Start Debugging (F5), run without breakpoints, intermittently raised a ThreadAbortException.
On a run that raised an exception the Lean WinForms control was instantiated, but the trade bar graph and results were not always painted. Log messages in the bottom WinForms pane were displayed. A variety of content was logged to the console; this message was rare:
And the log didn’t always display the “Engine.Main(): Analysis Complete …” message.
The debugger manifests the exception at the closing brace of ParallelRunnerController.ProcessHoldQueue:
With the VS exception helper View Detail … showing:
Root cause assessment
The exception appears to be the result of an intermittent race condition between the disposal of a .NET blocking collection instance and the abort of the thread that collection runs on.
The ParallelRunnerController.ProcessHoldQueue method enqueues work items on _holdQueue, a blocking collection instance. In addition to encapsulating a blocking collection, ProcessHoldQueue is the ThreadStart delegate of _processQueueThread, a thread started by the ParallelRunnerController.Start method.
The abort of _processQueueThread is initiated by ParallelRunnerController.Dispose, which is called from the finally block of FileSystemData.Run. By that point FileSystemData.Exit has already called _cancellationTokenSource.Cancel to signal the enumerator returned by _holdQueue.GetConsumingEnumerable to stop blocking.
What this means is that if the .NET blocking collection implementation has not finished cleaning up after the cancellation signal when ParallelRunnerController.Dispose calls _holdQueue.Dispose, the disposal is delayed and then clobbered by _processQueueThread.Abort, the next statement in the Dispose method.
If that assessment is correct, then introducing a delay between the _holdQueue.Dispose and _processQueueThread.Abort statements should give _holdQueue.Dispose time to complete, and so stop the exception being raised.
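The cancellation signalling described above can be sketched in isolation. This is a minimal illustration, not the Lean source: the class name CancelConsumerSketch, the method ConsumerStopsOnCancel and the local variable names are hypothetical. It assumes only the documented behaviour that a consumer blocked in GetConsumingEnumerable(token) is released with an OperationCanceledException once the token is cancelled.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch; names are illustrative and not taken from Lean.
public static class CancelConsumerSketch
{
    // Returns true if the consumer thread observed the cancellation and
    // exited within timeoutMs after Cancel was called.
    public static bool ConsumerStopsOnCancel(int timeoutMs)
    {
        using (var holdQueue = new BlockingCollection<int>())
        using (var cts = new CancellationTokenSource())
        {
            var sawCancellation = false;
            var consumer = new Thread(() =>
            {
                try
                {
                    // Blocks while the collection is empty, until an item
                    // arrives or the token is cancelled.
                    foreach (var item in holdQueue.GetConsumingEnumerable(cts.Token))
                    {
                        // Work item handling omitted in this sketch.
                    }
                }
                catch (OperationCanceledException)
                {
                    // Expected route out of the loop after Cancel.
                    sawCancellation = true;
                }
            });

            consumer.Start();
            holdQueue.Add(1);   // may or may not be consumed before the cancel
            cts.Cancel();       // the signal the post attributes to Exit

            // Join waits for the consumer to finish unwinding, so the
            // collection is only disposed (by the using blocks) afterwards.
            var exited = consumer.Join(timeoutMs);
            return exited && sawCancellation;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ConsumerStopsOnCancel(5000));
    }
}
```

The point of the sketch is the ordering: the consumer needs some time after Cancel to unwind, and anything that disposes the collection or aborts the thread inside that window is racing with it.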
The assertion above was tested by adding a Thread.Sleep between the _holdQueue.Dispose and _processQueueThread.Abort statements of ParallelRunnerController.Dispose.
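The shape of that change can be sketched as follows. This is a hypothetical stand-in for ParallelRunnerController, not the Lean source: the member names _holdQueue, _processQueueThread, _cancellationTokenSource, ProcessHoldQueue, Start, Exit and Dispose follow the post, while the class name HoldQueueOwnerSketch, the disposeDelayMs parameter, the WorkerIsAlive property and all method bodies are assumptions. Exit is folded into the same class purely for brevity. The sketch targets .NET Framework, as used in the post; newer .NET runtimes do not support Thread.Abort, so that call is guarded.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

// Hypothetical stand-in for ParallelRunnerController. Member names follow
// the post; the bodies are assumptions, not the Lean source.
public sealed class HoldQueueOwnerSketch : IDisposable
{
    private readonly BlockingCollection<string> _holdQueue = new BlockingCollection<string>();
    private readonly CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    private readonly int _disposeDelayMs;
    private Thread _processQueueThread;

    public HoldQueueOwnerSketch(int disposeDelayMs)
    {
        _disposeDelayMs = disposeDelayMs;
    }

    public bool WorkerIsAlive
    {
        get { return _processQueueThread != null && _processQueueThread.IsAlive; }
    }

    public void Start()
    {
        _processQueueThread = new Thread(ProcessHoldQueue) { IsBackground = true };
        _processQueueThread.Start();
    }

    // Mirrors the Cancel call the post attributes to FileSystemData.Exit.
    public void Exit()
    {
        _cancellationTokenSource.Cancel();
    }

    private void ProcessHoldQueue()
    {
        try
        {
            foreach (var workItem in _holdQueue.GetConsumingEnumerable(_cancellationTokenSource.Token))
            {
                // Work item handling omitted in this sketch.
            }
        }
        catch (Exception)
        {
            // Sketch-only catch-all: cancellation is the expected exit, and
            // BlockingCollection.Dispose is documented as not thread-safe,
            // so a racing disposal must not take down the demo process.
        }
    }

    public void Dispose()
    {
        _holdQueue.Dispose();

        // The experimental delay between the two statements named in the post.
        Thread.Sleep(_disposeDelayMs);

        if (_processQueueThread == null) return;
        try
        {
            _processQueueThread.Abort();
        }
        catch (PlatformNotSupportedException)
        {
            // Thread.Abort is only supported on .NET Framework.
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var owner = new HoldQueueOwnerSketch(200);
        owner.Start();
        owner.Exit();
        var timer = Stopwatch.StartNew();
        owner.Dispose();
        Console.WriteLine("Dispose took about {0} ms; worker alive: {1}",
            timer.ElapsedMilliseconds, owner.WorkerIsAlive);
    }
}
```

The sketch shows only where the delay sits. It is not expected to reproduce the exception, because the worker here normally exits on its own shortly after Cancel.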
Verify user experience change
Two tests were used to determine whether introducing the execution delay changed the user experience, i.e. whether the number of exceptions was reduced. Both tests ran \Launcher\bin\Debug\QuantConnect.Lean.Launcher.exe ten times from within the VS debugger, via Start Debugging (F5).
Test one set the execution delay (i.e. Thread.Sleep) to 10 seconds (10000 ms). The idea was that if a 10 second delay didn't alleviate the exception then the hypothesis was wrong. Shorter sleeps haven't been explored. The expected outcome was no exceptions in a sample of ten consecutive Launcher runs. After the ten runs no exceptions had been raised, so introducing the delay improved the user experience.
Test two set the execution delay to 0 seconds. The expected test outcome was frequent exceptions in a similar ten run sample. An exception was raised in each test run.
The test runs demonstrated that this lean engine configuration (default, backtesting-desktop) and build (debug) triggered a thread abort exception until there was sufficient delay between the blocking collection disposal and the thread abort calls to enable the blocking collection to be fully disposed.