[SCHEDULER-835]  Weather-Related Failures During Dawn Tasks
Type Bug
Priority High
Severity Major
Component Dispatch Scheduler Engine
Fixed In Version [3.5.33.5.3
Versions Affected [3.53.5
Severity Closed
Resolution Complete
Reported By Bob Denny
Resources Bob Denny
Start Date 4/5/2012

Description
This is ugly. See this fantastic report. Clearly, weather must remain disconnected during Scheduler's dawn cal frames, so that a change from safe to unsafe won't prevent or interrupt them. I gave him a workaround in AcquireSupport, but it will still die if the weather changes from safe to unsafe during dawn cal frames.

This became a big grab-bag of things. 

Comments
5/2/2012 4:16:26 PM   Bob Denny
SVN Comment
Author rbdenny
Repository svn+ssh://rbdenny@a2_svn_dc3/home/rbdenny/svn/astro/scheduler
SVN Revision 290
Affected files /trunk/Help/RelNotes.htm (Modified)
/trunk/Scheduler/ACPSequencer.cs (Modified)
/trunk/Scheduler/Engine.cs (Modified)
Check-in comment CLeanup from yesterday's fixes, logging improvements, release notes, all for 3.5.2 GEM:835
5/1/2012 8:03:48 PM   Bob Denny
SVN Comment
Author rbdenny
Repository svn+ssh://rbdenny@a2_svn_dc3/home/rbdenny/svn/astro/scheduler
SVN Revision 289
Affected files /trunk/Scheduler/ACPSequencer.cs (Modified)
Check-in comment More changes for script error detection and reporting. ** UNTESTED ** Committing for safety. GEM:835
5/1/2012 8:02:14 PM   Bob Denny
Startup script failed, Scheduler did not know! Next it looked to open the  dome, but didn't because the scope was never connected so the dome looked like it was not there. Next it tried to AF, failed for no scope.

Need to do much better at detecting scriptfails and aborts. Read the Console.Text property. But this turned out to be a pain because it does not convert via TLBIMP and thus that member is missing from the ACP early binding interface! It's due to having an optional parameter (# lines). So I had to access it using late binding. 

This resulted in some substantial changes and improvements in the ACP Sequencer handling code. Better error reporting (now the script error is reported in the engine log). 
4/20/2012 10:57:13 AM   Bob Denny
SVN Comment
Author rbdenny
Repository svn+ssh://rbdenny@a2_svn_dc3/home/rbdenny/svn/astro/scheduler
SVN Revision 288
Affected files /trunk/Scheduler/ACPSequencer.cs (Modified)
Check-in comment Status reporting fix GEM:835
4/11/2012 11:04:41 AM   Bob Denny
Released 3.5.2 executable to the Comm Center for testing.
4/11/2012 10:31:27 AM   Bob Denny
SVN Comment
Author rbdenny
Repository svn+ssh://rbdenny@a2_svn_dc3/home/rbdenny/svn/astro/scheduler
SVN Revision 286
Affected files /trunk/Help/Advanced.htm (Modified)
/trunk/Help/RelNotes.htm (Modified)
/trunk/Scheduler/ACPSequencer.cs (Modified)
/trunk/Scheduler/Engine.cs (Modified)
/trunk/Scheduler/MainForm.cs (Modified)
Check-in comment Grab-bag of changes to the ACP Sequencer interface and the Engine for the issues shown in this Gemini issue, plus a few more relating to stability. GEM:835
4/11/2012 8:39:14 AM   Bob Denny
Liebenberg's problem ended up being thread safety between the engine-disabled loop's polling for weather and the test shutdown running in a separate thread from the menu selection. Wow.
4/6/2012 3:33:06 PM   Bob Denny
Oh, man I just looked at the files that David attached to this message and one of them shows different things than the body of his message.

20120329@173924:
29-Mar-2012 21:16:05.9: Send Observation M95 to ACP Sequencer
29-Mar-2012 21:16:15.9: Sequencer is now active
29-Mar-2012 21:24:07.4: ** Script interrupted by weather safety event.
29-Mar-2012 21:24:10.4: Post-job status check done (stat=Running)
29-Mar-2012 21:24:10.4: ACP ABORT: Acquisition process was interrupted by weather unsafe.
29-Mar-2012 21:24:10.4: ACP data acquisition failed for Observation M95.
29-Mar-2012 21:24:10.4:   (Plan M95 will be resubmitted in its entirety.
29-Mar-2012 21:24:10.4:   (Observation M95 interrupted by weather unsafe event.)
29-Mar-2012 21:24:10.8: **Dome closing when asked to open. Assuming weather closure.
29-Mar-2012 21:24:10.9: **EXCEPTION IN SCHEDULER:
29-Mar-2012 21:24:10.9: **Dome failed to open for reason other than hardware weather closure
29-Mar-2012 21:24:11.0: Traceback:
   at DC3.Scheduler.Engine.DoStartupIf(Boolean openDome)
   at DC3.Scheduler.Engine.Run()

This looks like a too-brief weather unsafe. It tried to open up before the shutdown completed. Probably the shutter ended up in an error condition.
4/6/2012 5:36:50 AM   Bob Denny
See this update on the Comm Center. It looks like the Clarity problems were due to a watchdog script he was using. This may turn out to be a tempest in a teapot.
4/5/2012 4:39:39 PM   Bob Denny
Oh, Mark Foster has lots of extra stuff tacked on, using POTH, MaxIm is connected to the scope, I think FocusMax is bypassing ACP for AcquireStar, lots of unknowns. I asked for a test under more controlled conditions.
4/5/2012 4:35:36 PM   Bob Denny
SVN Comment
Author rbdenny
Repository svn+ssh://rbdenny@a2_svn_dc3/home/rbdenny/svn/astro/scheduler
SVN Revision 282
Affected files /trunk/Help/RelNotes.htm (Modified)
/trunk/Scheduler/ACPSequencer.cs (Modified)
/trunk/Scheduler/AssemblyInfo.cs (Modified)
/trunk/Scheduler/Scheduler.csproj (Modified)
Check-in comment Disconnect weather after assuring that dome is closed and before starting dawn cal frames. Add logic to wait for broken weather servers that complete the Connected call before they are ready to provide data. GEM:835
4/5/2012 2:47:53 PM   Bob Denny
Adding Dale Liebenberg's problem to this one and closing SCHEDULER-834 which I had opened for it.

04-Apr-2012 00:23:50.5: ACP data acquisition failed for Observation NGC 1803.
04-Apr-2012 00:23:50.5: (Observation NGC 1803 interrupted by operator or system.)
04-Apr-2012 00:24:51.8: -- Observatory Shutdown --
04-Apr-2012 00:24:54.7: **EXCEPTION IN SCHEDULER:
04-Apr-2012 00:24:54.7: The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
04-Apr-2012 00:24:54.7: Traceback:
at ACP.WeatherClass.get_Safe()
at DC3.Scheduler.ACPSequencer.get_FilteredWeatherSafe()
at DC3.Scheduler.ACPSequencer.get_WeatherSafe()
at DC3.Scheduler.Engine.PollWeather()
at DC3.Scheduler.Engine.Run()

4/5/2012 2:35:38 PM   Bob Denny
More related info, I am expanding this to cover the weather related issues including this one from David Campbell and a second one from Mark Foster. It's likely that there are several separate but related issues here.

David:
02-Apr-2012 09:21:11.4: Weather successfully connected.
02-Apr-2012 09:21:11.7: **EXCEPTION IN SCHEDULER:
02-Apr-2012 09:21:11.7: Invalid procedure call or argument
02-Apr-2012 09:21:11.7: Traceback:
at ACP.WeatherClass.get_Safe()
at DC3.Scheduler.ACPSequencer.get_FilteredWeatherSafe ()
at DC3.Scheduler.ACPSequencer.get_WeatherSafe()
at DC3.Scheduler.Engine.PollWeather()
at DC3.Scheduler.Engine.Run()
02-Apr-2012 09:21:11.7: Release ACP sequencer

31-Mar-2012 04:21:54.9: -- Observatory Shutdown --
31-Mar-2012 04:21:54.9: Weather disconnected.
31-Mar-2012 04:21:54.9: Start ACP Sequencer's ShutdownObs script
31-Mar-2012 04:22:04.9: Sequencer is now active
31-Mar-2012 04:24:29.6: Sequencer is no longer active
31-Mar-2012 04:24:29.9: Weather successfully connected.
31-Mar-2012 04:24:29.9: **EXCEPTION IN SCHEDULER:
31-Mar-2012 04:24:29.9: Invalid procedure call or argument
31-Mar-2012 04:24:29.9: Traceback:
at ACP.WeatherClass.get_Safe()
at DC3.Scheduler.ACPSequencer.get_FilteredWeatherSafe ()
at DC3.Scheduler.ACPSequencer.get_WeatherSafe()
at DC3.Scheduler.Engine.PollWeather()
at DC3.Scheduler.Engine.Run()
31-Mar-2012 04:24:30.4: Run statistics:


See above, this was wrong.
28-Mar-2012 20:39:19.1: Send Observation M95 to ACP Sequencer
28-Mar-2012 20:39:29.1: Sequencer is now active
28-Mar-2012 20:40:40.1: Sequencer is no longer active
28-Mar-2012 20:40:43.1: Post-job status check done (stat=Running)
28-Mar-2012 20:40:43.1: ACP ERROR: Failed to complete acquisition process (see ACP run log)
28-Mar-2012 20:40:43.1: Data acquisition failed for Observation M95.
28-Mar-2012 20:40:43.1: (ACP reported See the ACP run log file)
28-Mar-2012 20:40:43.3: **EXCEPTION IN SCHEDULER:
28-Mar-2012 20:40:43.3: Property DomeOpen: Unexpected shutter status shutterClosing
28-Mar-2012 20:40:43.3: Traceback:
at DC3.Scheduler.ACPSequencer.get_DomeOpen()
at DC3.Scheduler.Engine.DoStartupIf(Boolean openDome)
at DC3.Scheduler.Engine.Run()
28-Mar-2012 20:40:43.8: Run statistics:


Mark

05-Apr-2012 09:51:27.3: Start ACP Sequencer's SchedulerCalFrames.txt observing plan
05-Apr-2012 09:51:37.4: Sequencer is now active
05-Apr-2012 10:46:33.3: Sequencer is no longer active
05-Apr-2012 10:46:33.3: -- Observatory Shutdown --
05-Apr-2012 15:41:22.7: **EXCEPTION IN SCHEDULER:
05-Apr-2012 15:41:22.7: Exception from HRESULT: 0x800A9C68
05-Apr-2012 15:41:22.7: Traceback:
at ACP.UtilClass.set_WeatherConnected(Boolean )
at DC3.Scheduler.ACPSequencer.DisconnectWeather()
at DC3.Scheduler.ACPSequencer.DoShutdown()
at DC3.Scheduler.Engine.DoShutdownIf()
at DC3.Scheduler.Engine.Run()
05-Apr-2012 15:41:22.8: Run statistics:

4/5/2012 11:45:48 AM   Bob Denny
Damn, I removed the code to disconnect the weather during dawn cal frames during the 3.5 refactoring. This (at a minimum) means 3.5.2 hot fix. Changes made.

In addition, I added defensive code for misbehaving weather servers that return from Connected but aren't really connected. It now waits for Weather.Safe to stop raising errors after doing a successful Connected = True.