Windows 8 VIDEO_TDR_FAILURE Madness

I finally figured out why I kept getting VIDEO_TDR_FAILURE BSODs when installing Windows 8 on my SuperMicro workstations. It turns out that the problem goes away when I use a PCIe slot associated with CPU #1 instead of a slot associated with CPU #2.

Some history on my adventures with Windows 8 and SuperMicro SuperWorkstations:
I got ACPI_BIOS_ERROR BSODs while installing Windows 8; SuperMicro provided a beta BIOS that resolved the problem.
The Windows 8 install hangs when installing to an SSD on an LSI 2308 SAS controller. That issue is still unresolved, but it can be worked around by connecting the SSD to the Intel SATA controller.
I got VIDEO_TDR_FAILURE BSODs while installing Windows 8 with an NVidia Quadro 5000 graphic card, and the same with an ATI FirePro V7900, an NVidia GeForce GTX 680, or an ATI HD 7970. This post is about resolving that problem.

 

SuperMicro released v1.0a BIOS updates for the X9DAi and X9DA7 motherboards used in the 7047A-T and 7047A-73 SuperWorkstations. I was hoping this would resolve the VIDEO_TDR_FAILURE BSODs, but no.

The X9DA7 BIOS updated without issue, but the X9DAi update reported an error at the end of the update process: “Error when sending Enable Message to ME”.

I contacted SuperMicro support, and they asked me to make sure that there is no jumper on JPME1. There is no mention of JPME1 in the motherboard manual, but it is located next to JIPMB1, next to PCIe slot #1. The header had a jumper on pins 2 and 3, where the same header on the X9DA7 motherboard had a jumper between 1 and 2. I removed the jumper, and the BIOS update succeeded.

JPME1

 

Unlike the ACPI_BIOS_ERROR BSOD that happens during the WinPE phase of the install, the VIDEO_TDR_FAILURE BSOD happens on the first boot after the install, during the hardware detection and driver install phase. This means that the technique I used to kernel debug the initial boot phase will not work, as the second boot is using the BCD already deployed to the target hard drive. I had to modify the BCD of the already installed image, prior to the install continuing after the reboot.

 

I tested many permutations of graphic cards and configurations, and it quickly became very annoying to have to type my Win8 product key every single time I booted and installed. To avoid this I created configuration files in the sources directory on the install media, which bypassed the key question. You can read more about the meaning of the file contents here:

EI.cfg:

[EditionID]
Professional
[Channel]
Retail
[VL]
0

PID.txt:

[PID]
Value=XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
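
If you rebuild the install media often, writing these two files can be scripted. The following is only a minimal sketch in Python: the function name write_setup_configs and its parameters are my own invention for illustration, and it assumes the sources directory already exists on the media. The file contents are exactly the ones shown above.

```python
from pathlib import Path

# Contents of EI.cfg exactly as listed above.
EI_CFG = "[EditionID]\nProfessional\n[Channel]\nRetail\n[VL]\n0\n"


def write_setup_configs(sources_dir, product_key):
    """Write EI.cfg and PID.txt into an existing sources directory.

    write_setup_configs is a hypothetical helper, not part of any
    Microsoft tool. Returns the paths of the two files written.
    """
    sources = Path(sources_dir)
    ei_path = sources / "EI.cfg"
    pid_path = sources / "PID.txt"

    # newline="\r\n" makes Python emit Windows-style line endings.
    with open(ei_path, "w", newline="\r\n") as f:
        f.write(EI_CFG)
    with open(pid_path, "w", newline="\r\n") as f:
        f.write("[PID]\nValue=" + product_key + "\n")

    return ei_path, pid_path
```

For example, write_setup_configs("E:\\sources", "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX") would write both files to a USB key mounted as E:, where the drive letter is just a placeholder.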

 

To modify the BCD of the installed image, and be able to easily repeat the second phase of install testing, I installed a second hard drive, and deployed WinPE to the second drive. By using F11 during boot to choose the boot drive, I could select booting from the second drive at any time.

 

I have a variety of WinPE v3 (Win7) based utility images, and I updated them to use WinPE v4 (Win8). In the process I lost the boot menu, and the first image in the menu automatically started booting. After some trial and error, I found the bootmenupolicy BCD option, and when it is set to legacy mode, the old style menu is back:

bcdedit /set {default} bootmenupolicy legacy

 

I installed Win8 on the primary drive, and during the reboot, instead of booting to the installed Win8 drive, I used F11 and booted to my secondary WinPE drive. From WinPE I modified the boot BCD to enable kernel debugging over the network:

bcdedit -store c:\boot\bcd /set {default} nocrashautoreboot yes
bcdedit -store c:\boot\bcd /set {default} debugtype net
bcdedit -store c:\boot\bcd /set {default} hostip 3232235876
bcdedit -store c:\boot\bcd /set {default} port 50000
bcdedit -store c:\boot\bcd /set {default} key my.secret.debug.key
bcdedit -store c:\boot\bcd /debug {default} yes

This is equivalent to:

bcdedit /dbgsettings net hostip:192.168.1.100 port:50000 key:my.secret.debug.key

But unlike the dbgsettings command, this allows me to specify a BCD store. Also note that the IP address is stored as a single numeric value instead of the dotted IP format.
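
The numeric hostip value is simply the four octets of the IPv4 address packed into one 32-bit integer, most significant octet first. As a small sketch (the function names are mine, chosen for illustration), Python's standard ipaddress module can convert in both directions:

```python
import ipaddress


def ip_to_bcd_hostip(dotted):
    """Convert a dotted IPv4 address to the single integer form."""
    return int(ipaddress.IPv4Address(dotted))


def bcd_hostip_to_ip(value):
    """Convert the integer form back to a dotted IPv4 address."""
    return str(ipaddress.IPv4Address(value))


print(ip_to_bcd_hostip("192.168.1.100"))  # 3232235876
print(bcd_hostip_to_ip(3232235876))       # 192.168.1.100
```

The first result matches the hostip value used in the bcdedit commands above.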

 

While still in WinPE, I captured the state of the primary Win8 drive by making a drive image with Symantec Ghost. This is the real Ghost, currently sold as Symantec Ghost Solution Suite, not the identically named but volume snapshot based Norton Ghost or Symantec System Recovery. With a saved drive image, I can easily change hardware or configurations, test the install starting at the second phase, reboot to the secondary WinPE drive using F11, restore the entire drive image, and try again, all while leaving the kernel debug options intact.

 

I tested with the following hardware configurations in various permutations:

 

With the kernel debugger attached, I captured the following crash details in WinDbg for NVidia based cards:

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: fffffa80211cd010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff8800782d0d8, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000002, Optional internal context dependent data.

Debugging Details:
------------------

FAULTING_IP:
nvlddmkm+1ae0d8
fffff880`0782d0d8 4055 push rbp

DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT

BUGCHECK_STR: 0x116

PROCESS_NAME: System

CURRENT_IRQL: 0

STACK_TEXT:
fffff880`12c76078 fffff801`66fef0ea : 00000000`00000000 00000000`00000116 fffff880`12c761e0 fffff801`66f734b8 : nt!DbgBreakPointWithStatus
fffff880`12c76080 fffff801`66fee742 : 00000000`00000003 fffff880`12c761e0 fffff801`66f73e90 00000000`00000116 : nt!KiBugCheckDebugBreak+0x12
fffff880`12c760e0 fffff801`66ef4144 : fffffa80`2094b100 fffff880`021ee9c0 fffffa80`1f54e400 00000000`00000000 : nt!KeBugCheck2+0x79f
fffff880`12c76800 fffff880`04b33dcb : 00000000`00000116 fffffa80`211cd010 fffff880`0782d0d8 00000000`00000000 : nt!KeBugCheckEx+0x104
fffff880`12c76840 fffff880`04b32518 : fffff880`0782d0d8 fffffa80`211cd010 fffff880`12c76949 00000000`000000c7 : dxgkrnl!TdrBugcheckOnTimeout+0xef
fffff880`12c76880 fffff880`04a1e608 : fffffa80`211cd010 fffff880`12c76949 00000000`00000000 00000000`00000002 : dxgkrnl!TdrIsRecoveryRequired+0x168
fffff880`12c768b0 fffff880`04a4d539 : 00000000`00000000 fffff780`00000320 00000000`00000000 fffffa80`1f54e400 : dxgmms1!VidSchiReportHwHang+0x438
fffff880`12c769b0 fffff880`04a4ba49 : fffffa80`00000002 fffffa80`1f54e400 fffffa80`1f54e840 fffffa80`1f54e840 : dxgmms1!VidSchiCheckHwProgress+0xe5
fffff880`12c76a00 fffff880`04a16fe5 : ffffffff`ff676980 00000000`00000001 fffff880`12c76b69 fffffa80`1f54e400 : dxgmms1!VidSchiWaitForSchedulerEvents+0x20d
fffff880`12c76aa0 fffff880`04a4b646 : 00000000`00000000 00000000`0000000f fffffa80`1f54e400 fffffa80`1f54e400 : dxgmms1!VidSchiScheduleCommandToRun+0x289
fffff880`12c76bd0 fffff801`66e9b521 : fffffa80`1f5abb00 fffffa80`1f54e400 fffff880`03b01140 00000000`06a21e1e : dxgmms1!VidSchiWorkerThread+0xca
fffff880`12c76c10 fffff801`66ed9dd6 : fffff880`03af5180 fffffa80`1f5abb00 fffff880`03b01140 fffffa80`19aac040 : nt!PspSystemThreadStartup+0x59
fffff880`12c76c60 00000000`00000000 : fffff880`12c77000 fffff880`12c71000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: .bugcheck ; kb

FOLLOWUP_IP:
nvlddmkm+1ae0d8
fffff880`0782d0d8 4055 push rbp

SYMBOL_NAME: nvlddmkm+1ae0d8

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nvlddmkm

IMAGE_NAME: nvlddmkm.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4fdf93d7

FAILURE_BUCKET_ID: 0x116_IMAGE_nvlddmkm.sys

BUCKET_ID: 0x116_IMAGE_nvlddmkm.sys

 

With the kernel debugger attached, I captured the following crash details in WinDbg for ATI based cards:

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: fffffa801ed114d0, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff8800725cefc, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 000000000000000d, Optional internal context dependent data.

Debugging Details:
------------------

FAULTING_IP:
atikmpag+8efc
fffff880`0725cefc 4055 push rbp

DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT

BUGCHECK_STR: 0x116

PROCESS_NAME: System

CURRENT_IRQL: 0

STACK_TEXT:
fffff880`06fa9ee8 fffff803`e6ff20ea : 00000000`00000000 00000000`00000116 fffff880`06faa050 fffff803`e6f764b8 : nt!DbgBreakPointWithStatus
fffff880`06fa9ef0 fffff803`e6ff1742 : 00000000`00000003 fffff880`06faa050 fffff803`e6f76e90 00000000`00000116 : nt!KiBugCheckDebugBreak+0x12
fffff880`06fa9f50 fffff803`e6ef7144 : fffffa80`1e2df4e0 fffff880`020b99c0 fffffa80`1d31f010 00000000`00000000 : nt!KeBugCheck2+0x79f
fffff880`06faa670 fffff880`04d31dcb : 00000000`00000116 fffffa80`1ed114d0 fffff880`0725cefc 00000000`00000000 : nt!KeBugCheckEx+0x104
fffff880`06faa6b0 fffff880`04d30548 : fffff880`0725cefc fffffa80`1ed114d0 fffff880`06faa7b9 00000000`00000180 : dxgkrnl!TdrBugcheckOnTimeout+0xef
fffff880`06faa6f0 fffff880`04c11608 : fffffa80`1ed114d0 fffff880`06faa7b9 00000000`0000000f fffffa80`1d31f8f8 : dxgkrnl!TdrIsRecoveryRequired+0x198
fffff880`06faa720 fffff880`04c459f9 : 00000000`00000001 fffff880`06faa8a0 fffff880`06faa920 00000000`00000000 : dxgmms1!VidSchiReportHwHang+0x438
fffff880`06faa820 fffff880`04c3ff72 : fffffa80`1d31f010 fffff780`00000320 fffffa80`1d31f770 fffffa80`1d31f010 : dxgmms1!VidSchWaitForCompletionEvent+0x411
fffff880`06faa8e0 fffff880`04c4206c : fffffa80`1d31f010 fffffa80`1d31f450 fffffa80`1d31f450 00000000`00000000 : dxgmms1!VidSchiWaitForEmptyHwQueue+0x9a
fffff880`06faa9d0 fffff880`04c3ea85 : 00000000`00000000 fffffa80`1d31f010 fffffa80`1d31f450 00000000`00000000 : dxgmms1!VidSchiSuspend+0x74
fffff880`06faaa00 fffff880`04c09fe5 : ffffffff`ff676980 00000000`00000001 fffff880`06faab69 fffffa80`1d31f010 : dxgmms1!VidSchiWaitForSchedulerEvents+0x249
fffff880`06faaaa0 fffff880`04c3e646 : 00000000`00000000 fffffa80`1d585660 fffffa80`1d44d7f0 fffffa80`1d31f010 : dxgmms1!VidSchiScheduleCommandToRun+0x289
fffff880`06faabd0 fffff803`e6e9e521 : fffffa80`1d6b9b00 fffffa80`1d31f010 fffff880`03932140 00000000`04d91ecb : dxgmms1!VidSchiWorkerThread+0xca
fffff880`06faac10 fffff803`e6edcdd6 : fffff880`03926180 fffffa80`1d6b9b00 fffff880`03932140 fffffa80`19ac7500 : nt!PspSystemThreadStartup+0x59
fffff880`06faac60 00000000`00000000 : fffff880`06fab000 fffff880`06fa5000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: .bugcheck ; kb

FOLLOWUP_IP:
atikmpag+8efc
fffff880`0725cefc 4055 push rbp

SYMBOL_NAME: atikmpag+8efc

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: atikmpag

IMAGE_NAME: atikmpag.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4fdf9279

FAILURE_BUCKET_ID: 0x116_IMAGE_atikmpag.sys

BUCKET_ID: 0x116_IMAGE_atikmpag.sys
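
One detail that can be pulled out of both dumps is the build date of the in-box drivers. I am assuming here that DEBUG_FLR_IMAGE_TIMESTAMP is the driver image's link timestamp, expressed in hex as seconds since the Unix epoch. Under that assumption, this short Python sketch (the function name is mine) decodes the two values:

```python
from datetime import datetime, timezone


def image_timestamp_to_utc(hex_stamp):
    """Decode a hex image timestamp, assumed to be Unix epoch seconds."""
    return datetime.fromtimestamp(int(hex_stamp, 16), tz=timezone.utc)


# Timestamps copied from the two crash dumps above.
for name, stamp in (("nvlddmkm.sys", "4fdf93d7"), ("atikmpag.sys", "4fdf9279")):
    print(name, image_timestamp_to_utc(stamp).isoformat())
```

If the assumption holds, both drivers decode to 18 June 2012, within a few minutes of each other.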

 

This was not really helping me much, and I decided to repeat the tests but use the checked build of Windows 8 to help troubleshoot.

With the kernel debugger attached, I captured the following ASSERT during the boot:

Windows 8 Kernel Version 9200 MP (1 procs) Checked x64
Built by: 9200.16384.amd64chk.win8_rtm.120725-1247
Machine Name:
Kernel base = 0xfffff802`0e01d000 PsLoadedModuleList = 0xfffff802`0e760ac0
System Uptime: 0 days 0:00:06.228 (checked kernels begin at 49 days)
Assertion: The BIOS has reported inconsistent resources (_CRS). Please upgrade your BIOS.
ACPI!PnpBiosGetDeviceResourceList+0x15e:
fffff880`012c3c2a cd2c int 2Ch
...
Unknown bugcheck code (0)
Unknown bugcheck description
Arguments:
Arg1: 0000000000000000
Arg2: 0000000000000000
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

PROCESS_NAME: System

FAULTING_IP:
ACPI!PnpBiosGetDeviceResourceList+15e
fffff880`012c3c2a cd2c int 2Ch

ERROR_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.

EXCEPTION_CODE: (NTSTATUS) 0xc0000420 - An assertion failure has occurred.

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT

BUGCHECK_STR: 0x0

CURRENT_IRQL: 0

LOCK_ADDRESS: fffff8020e7c5d60 -- (!locks fffff8020e7c5d60)

Resource @ nt!PiEngineLock (0xfffff8020e7c5d60) Exclusively owned
Threads: fffffa8019a36040-01<*>
1 total locks, 1 locks currently held

PNP_TRIAGE:
Lock address : 0xfffff8020e7c5d60
Thread Count : 1
Thread address: 0xfffffa8019a36040
Thread wait : 0x105eccd4

LAST_CONTROL_TRANSFER: from fffff880012b736f to fffff880012c3c2a

STACK_TEXT:
fffff880`009b4b30 fffff880`012b736f : fffffa80`23a9e900 fffff880`012a7e01 fffff880`009b4c08 fffff880`012a7e70 : ACPI!PnpBiosGetDeviceResourceList+0x15e
fffff880`009b4bd0 fffff880`0125acba : fffffa80`23a9e900 fffffa80`19ac54c0 fffff880`012a7e70 fffffa80`1f477010 : ACPI!ACPIBusIrpQueryResourceRequirements+0x8b
fffff880`009b4c50 fffff802`0e91b6a4 : fffffa80`23a9e900 fffffa80`19ac54c0 fffff880`009b4db0 fffffa80`23a9e900 : ACPI!ACPIDispatchIrp+0x2a6
fffff880`009b4cf0 fffff802`0e91cd1b : fffffa80`23a9e900 fffff880`009b4db0 00000001`c00000bb 00000000`00000000 : nt!IopSynchronousCall+0x10c
fffff880`009b4d80 fffff802`0e915bdb : fffffa80`23a9e900 fffff880`009b4e50 fffffa80`23a4f850 00000000`0000001e : nt!PpIrpQueryResourceRequirements+0x5f
fffff880`009b4e10 fffff802`0e91748d : fffffa80`23a9b8e0 00000000`00000000 ffffffff`80000218 fffffa80`23a9b8e0 : nt!PiQueryResourceRequirements+0x47
fffff880`009b4ea0 fffff802`0e91a1f2 : fffffa80`23a9b8e0 fffffa80`23a9b8e0 00000000`00000001 00000000`00000000 : nt!PiProcessNewDeviceNode+0x159d
fffff880`009b5070 fffff802`0e08feb5 : fffffa80`19adcd20 00000000`00000000 fffff880`009b5358 00000000`00000000 : nt!PipProcessDevNodeTree+0x1fe
fffff880`009b5310 fffff802`0e08fb59 : 00000000`00000000 00000000`00000000 00000000`00000000 fffffa80`37e19cc0 : nt!PnpDeviceActionWorker+0x345
fffff880`009b53d0 fffff802`0ed4010d : 00000000`00000000 fffff8a0`00000007 fffff8a0`00f08c00 00000000`00000000 : nt!PnpRequestDeviceAction+0x2ed
fffff880`009b5420 fffff802`0ed3b39d : fffff802`0d536800 fffff802`0e7c83c0 00000000`00000006 fffff802`0d536800 : nt!IopInitializeBootDrivers+0x905
fffff880`009b5650 fffff802`0ed2deb5 : fffff802`0d536800 00000000`00000000 fffff802`0d536800 fffff802`0d51ebf0 : nt!IoInitSystem+0xb5d
fffff880`009b59b0 fffff802`0e82d013 : fffff802`0d536800 fffffa80`19a36040 00000000`00000000 fffffa80`19ab3040 : nt!Phase1InitializationDiscard+0x1899
fffff880`009b5bc0 fffff802`0e1b289e : fffff802`0d536800 fffff802`0d536800 00000000`00000000 00000000`00000000 : nt!Phase1Initialization+0x13
fffff880`009b5bf0 fffff802`0e24ef96 : fffff802`0e82d000 fffff802`0d536800 fffff802`0e6c6180 00000000`f8ffffff : nt!PspSystemThreadStartup+0x1a2
fffff880`009b5c60 00000000`00000000 : fffff880`009b6000 fffff880`009b0000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: kb

FOLLOWUP_IP:
ACPI!PnpBiosGetDeviceResourceList+15e
fffff880`012c3c2a cd2c int 2Ch

SYMBOL_STACK_INDEX: 0

SYMBOL_NAME: ACPI!PnpBiosGetDeviceResourceList+15e

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: ACPI

IMAGE_NAME: ACPI.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 50109dd0

BUCKET_ID_FUNC_OFFSET: 15e

FAILURE_BUCKET_ID: 0x0_ACPI!PnpBiosGetDeviceResourceList

BUCKET_ID: 0x0_ACPI!PnpBiosGetDeviceResourceList

 

This is interesting: the kernel ASSERTs on a problem reported by the BIOS.

I contacted SuperMicro support. They said they would investigate the BIOS failure, and they suggested I try PCIe slot #3 instead of slot #5. The motherboard manual mentions that slots #1, #2, and #3 are to be used if CPU #1 is installed, and slots #4, #5, and #6 are to be used only if CPU #2 is installed.

PCIe

I have both processors installed, so not using the more conveniently located slot #5 never came to mind. I moved the graphic card to CPU #1 slot #3, and voila, install succeeded and Windows 8 was up and running!

 

I repeated the checked build test with the graphic card in slot #3, and the same BIOS ASSERT error was reported, so the BIOS ASSERT seems to be unrelated to the VIDEO_TDR_FAILURE error.

 

This was a very frustrating problem, and I still don’t understand the root cause, but I am happy to be able to finally switch both workstations to Windows 8.

Windows 8 Install Crash With NVidia Quadro 5000

I got Windows 8 RTM installed on my two SuperMicro SuperWorkstation machines, with a bit of trouble along the way, but nothing I could not work around. But, I ran into a problem with NVidia Quadro 5000 cards causing a VIDEO_TDR_FAILURE BSOD during the Windows 8 install process.

 

I was running my two workstations with ATI FirePro V7900 graphic cards, but I decided I wanted a bit more rendering horsepower. I wanted a card that had a good balance between modern architecture, great 2D performance, good 3D performance, OpenCL or CUDA support, and reasonable power consumption. I found the Tom’s Hardware Workstation Graphics 2012 benchmark site to be very informative, and I decided that the NVidia Quadro 5000 was a very good choice.

I replaced my FirePro V7900 with the Quadro 5000, and started the Windows 8 x64 RTM install. All went well until the first reboot during the install, when the machine would blue screen crash with a VIDEO_TDR_FAILURE. During the install process the hardware is identified, the appropriate drivers extracted, and on the reboot those drivers are started. It appears that the NVidia driver crashes soon after it loads.

 

The Timeout Detection and Recovery (TDR) feature was added to Windows Vista, and was a way for the OS to recover from a renderer failure without the need to restart the machine. Typically the user will see a notification that the graphic subsystem was restarted, but in cases where the restart fails, a VIDEO_TDR_FAILURE blue screen crash is generated.

The web is full of reports of NVidia VIDEO_TDR_FAILURE crashes, and solutions typically involve replacing the hardware or updating drivers. In my case I had two new machines, and two new graphic cards, and a brand new operating system, and both cards on both machines crashed.

I contacted SuperMicro support, and, responsive as they always are, they said they would investigate.

I also contacted PNY support, as PNY is the manufacturer of the NVidia Quadro 5000. Here is their reply:

Again, I am sorry, but we do not list Windows 8 (yet) as being compatible with the Quadro 5000, or any other Quadro or Geforce card we manufacture. Until it is publically and commercially available, we cannot provide support for Windows 8. Windows 8 is not available to the end user yet, and it is in testing, as is the Nvidia driver. If you find issues, you must report them to Microsoft in order to improve compatibility in the final release. There is obviously a compatibility problem with Windows 8 and the Quadro 5000 right now (according to your testing of TWO cards), and unfortunately there is nothing we can do to fix it while in is not available to the public. My best advice is to try it again when it is officially released sometime in 2013.

Not very helpful at all, and their concept of Windows 8 release timing, and their responsibility, is way out there.

 

The real problem here is that it is the in-box NVidia drivers that are crashing, not drivers I install later. And as it is the in-box graphic drivers that crash, there is no (easy) way to update the drivers used by the Windows 8 install media.

 

I had previously used a Quadro 4000 card on the same machines, and they installed without incident, so it appears to be something unique to the Quadro 5000 cards.

At this time I am waiting for SuperMicro to get back to me with suggestions, as I have little hope of hearing anything useful from PNY.

Debugging Windows 8 Install BSOD

In my last post I described how to prevent Windows from automatically restarting when encountering a BSOD during the OS install process. This allowed me to see the  ACPI_BIOS_ERROR fault code while installing Windows 8 on my new SuperMicro workstation. The new Windows 8 BSOD page looks friendly, but no longer displays any error parameters other than the main fault code.

In order to get additional details of the crash, I had to hook up a kernel debugger to the machine. Windows 8 adds USB3 and TCPIP kernel debug support, and I will describe how I used the TCPIP network option to capture details of the crash.

 

The first thing to do is prepare our tools: download the Windows 8 Debugging Tools for Windows package and the Windows 8 Symbols.

Unfortunately the debugging tools are no longer available as a standalone download, and you need to install the SDK or WDK on a Windows 8 system in order to get them, but you can choose to only install the debugging tools. Once you installed the debugging tools on one machine, you can copy the MSI installers or the directory to any other machines, including Windows 7 systems. You will find the tools in the “C:\Program Files (x86)\Windows Kits\8.0\Debuggers” folder.

Microsoft is pretty good at publishing symbols for most released versions of their products to their public symbol server, but I prefer to extract the symbols to a working directory on my machine, or to upload the symbols to our internal symbol server. You can install the downloaded symbols MSI package directly, or use the following command to extract the symbols from the MSI file to a location on disk. Run an elevated (right click run as administrator) command prompt, and type:

msiexec /a [symbol msi file name] /qb targetdir="[output directory]"

 

Next we need to enable kernel network debugging in the BCD options. This needs to be done on a Windows 8 machine as the network debugging command is not supported in older versions of BCDEdit. I should also call out that network debugging support is required for hardware logo certification, but not all current adapters support it. Insert the bootable Windows 8 USB key, run an elevated command prompt, and type:

bcdedit -store [usb key drive]:\boot\bcd /dbgsettings net hostip:[IP of WinDbg machine] port:50000

BCDEdit will output the connection security key that is required by WinDbg.

 

Start WinDbg, and enable network kernel debugging, entering the port number and security key.

WinDbg.Network

 

Boot the target machine, and you will see it connect to WinDbg:

Microsoft (R) Windows Debugger Version 6.2.8400.0 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Using NET for debugging
Opened WinSock 2.0
Waiting to reconnect...
Connected to target 192.168.1.106 on port 50000 on local IP 192.168.1.100.
Connected to Windows 8 8400 x64 target at (Fri Jul 20 11:07:21.583 2012 (UTC - 7:00)), ptr64 TRUE
Kernel Debugger connection established.

And then the ACPI_BIOS_ERROR crash:

25: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

ACPI_BIOS_ERROR (a5)
The ACPI Bios in the system is not fully compliant with the ACPI specification.
The first value indicates where the incompatibility lies:
This bug check covers a great variety of ACPI problems.  If a kernel debugger
is attached, use "!analyze -v".  This command will analyze the precise problem,
and display whatever information is most useful for debugging the specific
error.
Arguments:
Arg1: 0000000000000003, ACPI_FAILED_MUST_SUCCEED_METHOD
    ACPI tried to run a control method while creating device extensions
    to represent the ACPI namespace, but this control method failed.
Arg2: fffffa8019f2f288, The ACPI Object that was being run
Arg3: ffffffffc0000034, return value from the interpreter
Arg4: 00000000494e495f, Name of the control method (in ULONG format)

Debugging Details:
------------------

ACPI_OBJECT:  fffffa8019f2f288

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  0xA5

PROCESS_NAME:  System

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from fffff803ca1e617a to fffff803ca0e5870

STACK_TEXT: 
fffff880`053eb418 fffff803`ca1e617a : 00000000`00000000 00000000`000000a5 fffff880`053eb580 fffff803`ca16b930 : nt!DbgBreakPointWithStatus
fffff880`053eb420 fffff803`ca1e57d2 : 00000000`00000003 00000000`494e495f fffff803`ca168810 00000000`000000a5 : nt!KiBugCheckDebugBreak+0x12
fffff880`053eb480 fffff803`ca0eb044 : 00000000`c0000034 fffff880`01038255 fffffa80`1a50fe78 00000000`c0000034 : nt!KeBugCheck2+0x79f
fffff880`053ebba0 fffff880`01043949 : 00000000`000000a5 00000000`00000003 fffffa80`19f2f288 ffffffff`c0000034 : nt!KeBugCheckEx+0x104
fffff880`053ebbe0 fffff880`0103bded : 00000000`00000000 00000000`00000000 00000000`00008004 00000000`c0000034 : ACPI!ACPIBuildCompleteMustSucceed+0x39
fffff880`053ebc20 fffff880`010346bd : fffffa80`1a500000 00000000`00008000 00000000`00000000 fffffa80`37e80000 : ACPI!AsyncCallBack+0x7f
fffff880`053ebc50 fffff880`01034f56 : fffffa80`1a500000 fffff880`01072be0 00000000`00000000 00000000`00000002 : ACPI!RunContext+0x141
fffff880`053ebc90 fffff880`010386e3 : fffffa80`19b1c3a0 00000000`00000000 00000000`00000000 fffffa80`19a35258 : ACPI!InsertReadyQueue+0xd6
fffff880`053ebcc0 fffff880`0103862a : fffff803`ca2eb490 fffff880`01072be0 00000000`00000000 00000000`546c6d41 : ACPI!RestartCtxtPassive+0x2f
fffff880`053ebcf0 fffff803`ca0cb181 : fffffa80`19e06b00 00000000`00000080 fffff880`04ac6540 00000000`00000000 : ACPI!ACPIWorkerThread+0xea
fffff880`053ebd50 fffff803`ca0dae26 : fffff880`04aba180 fffffa80`19e06b00 fffff880`04ac6540 fffffa80`19a8f940 : nt!PspSystemThreadStartup+0x59
fffff880`053ebda0 00000000`00000000 : fffff880`053ec000 fffff880`053e6000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND:  kb

FOLLOWUP_IP:
ACPI!ACPIBuildCompleteMustSucceed+39
fffff880`01043949 cc              int     3

SYMBOL_STACK_INDEX:  4

SYMBOL_NAME:  ACPI!ACPIBuildCompleteMustSucceed+39

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: ACPI

IMAGE_NAME:  ACPI.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  4fe6a2b1

BUCKET_ID_FUNC_OFFSET:  39

FAILURE_BUCKET_ID:  0xA5_ACPI!ACPIBuildCompleteMustSucceed

BUCKET_ID:  0xA5_ACPI!ACPIBuildCompleteMustSucceed

Followup: MachineOwner
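
Arg4 in the analysis above is described as the name of the control method "in ULONG format". ACPI names are four ASCII characters, so reading the low 32 bits of the value as little-endian bytes recovers the name. A minimal Python sketch, with a function name of my own choosing:

```python
def acpi_name_from_ulong(value):
    """Decode a 4-character ACPI name packed into a little-endian ULONG."""
    return (value & 0xFFFFFFFF).to_bytes(4, "little").decode("ascii")


# Arg4 copied from the bugcheck analysis above.
print(acpi_name_from_ulong(0x00000000494E495F))  # _INI
```

So the control method that failed was an _INI method, the ACPI device initialization method. The dump does not say which device it belongs to.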

 

Even with all the crash details, it still doesn’t really help me make progress, as it has been two days since I logged the support request with SuperMicro, and no response yet.

Windows 8 and Server 2012 on SuperMicro results in ACPI_BIOS_ERROR BSOD

I ran out of disk space on my development workstation, all those VM images add up. The machine has four drive bays, and all four have 3TB drives. I can replace the 3TB drives with 4TB drives, but migrating the RAID5 array will be time consuming and risky. I can add an external SAS storage enclosure, but they do not power down when the machine goes to sleep. So I looked at buying a new machine with more drive bays.

I’ve been using DELL Precision Workstations for my development machines for many years, they are fast and very reliable. My current workstation is a T5500, and I specifically chose the T5500 over the T7600 because of its features to physical size ratio. The T7600 does offer five drive bays over the T5500’s four, but if I’m going to change machines, adding only one more drive is not really worth the cost and effort.

Rather than buying a pre-configured and tested machine, I opted for the more exciting, sometimes rewarding, often frustrating, option of building my own. In order not to spend too much time on the project, I opted to use a chassis and motherboard combo, and just add peripherals. I chose the SuperMicro SuperWorkstation 7047A-T, containing the X9DAi motherboard. I specifically picked this model because it has eight hot-swap drive bays, is low noise, has a high efficiency PSU, and supports dual Intel Xeon E5-2600 processors.

I used 32GB Kingston KVR1600D3D4R11SK4/32GI memory, two Xeon E5-2660 processors, and an NVidia Quadro 4000 graphic card.

I prepared a USB key with Windows 8 x64 Release Preview. Microsoft does provide a tool to convert ISO images to USB keys, but I’ve been doing this by hand since long before the tool existed, and it is really easy and ultimately quicker to update.

Mount the ISO install image as a virtual drive using Virtual CloneDrive. Launch an elevated (right click run as administrator) command prompt, and run:

diskpart

list disk
select disk [number]
clean
create partition primary
select partition 1
active
format fs=fat32 quick
assign
exit

robocopy [virtual cd drive]:\ [usb key drive]:\ /mir

Once the USB key has been properly formatted, you only have to repeat the robocopy steps for any new builds or bits you want to copy.

I booted from the USB key, black screen with spinning circle animation, blue screen of sad face death, and an immediate reboot.

The machine rebooted so quickly I didn’t get a chance to see what the error was.

I tried Windows Server 2012 RC, same problem. I tried later builds of Windows 8 and Server 2012 (we are part of the Windows 8 Pre-Release Program, I hope I can say that now, at some point I was not even allowed to say that, like the Fight Club rules).

I logged a support case with SuperMicro, and I posted on the Microsoft Windows Server support forum. No reply yet from SuperMicro, no useful reply yet from the forum.

I think it is really silly that the default configuration of Windows is set to automatically reboot after a BSOD, even more so for an install situation. BSODs are serious; users and administrators need to know something terrible happened, even if they don’t immediately know what the error codes mean or what to do about it. I do know how to change the reboot option from inside Windows, but I don’t know how to change it in the installer.

I was looking for a BCD option to disable auto-reboot, and after quite a bit of searching, I found a BcdOSLoaderBoolean_DisableCrashAutoReboot WMI BCD option on MSDN. After some more searching I found a NOCRASHAUTOREBOOT BCDEdit option.

That was really unusually difficult to find. Try it yourself, search for “nocrashautoreboot” and restrict the results to microsoft.com, there was only one hit on a Microsoft site, in a Word DOC file. Try the search on the rest of the web, and you get more hits.

Now that I knew what option to set, the rest was pretty easy. Insert the bootable USB key back in a working machine, open an elevated command prompt, and set the BCD option:

attrib -r [usb key drive]:\boot\bcd
bcdedit -store [usb key drive]:\boot\bcd -set {default} nocrashautoreboot yes

Start the install again, wait for the crash, and this time we can see the error is ACPI_BIOS_ERROR:

ACPI_BIOS_ERROR

There are many reports on the web about ACPI_BIOS_ERROR and Windows 8, most resolved by updating the BIOS, but also several reports of this error with SuperMicro motherboards, and unfortunately it seems without a positive resolution.

To make sure the problem was not peripheral or hardware related, I also installed Windows 7 and Windows Server 2008 R2, both installed and ran ok.

I use a KVM switch, and as I switched back to the machine while it was applying Windows Updates, there was some screen corruption that went away after the reboot. I updated the NVidia driver and the problem has not resurfaced, this may be a driver issue, or it may be a hardware issue:

NVIDIA

I am very disappointed that my brand new machine can only run Windows 7 and not Windows 8. I have yet to hear from SuperMicro support, but I hope they can resolve the problem with a BIOS update before Windows 8 and Windows Server 2012 are released in August.

CrashPlan Memory Utilization

I’ve been using CrashPlan as an online backup solution for quite some time, and it works really well.

I like the fact that I can subscribe to the consumer plan, with almost 3.5TB of data backed up, and that the backup client installs on a server OS. Many of the other “unlimited” backup providers I tested have restrictions in place that make such a setup impossible.

CrashPlan sends email notifications about backup status, and I noticed that something was wrong with the backup:
CrashPlan.Email

I logged onto the machine and opened the main UI, and after a few seconds the UI just closed. I opened it again, same thing: after about 15s the UI closed.

My initial thought was that it was a crash, but on attaching a debugger, the exit call stack showed that the process was cleanly terminated after receiving a signal.

On looking at the NT eventlog I could see that the service was restarting about every 15s:

The CrashPlan Backup Service service entered the stopped state.
The CrashPlan Backup Service service entered the running state.
The CrashPlan Backup Service service entered the stopped state.
The CrashPlan Backup Service service entered the running state.
The CrashPlan Backup Service service entered the stopped state.
The CrashPlan Backup Service service entered the running state.

The service wasn’t crashing, it was being stopped and restarted externally. I looked in the CrashPlan directory, and I found several log files with names like restart_1342296082496.log. The contents of these files looked like this:

Sat 07/14/2012 13:01:22.53 : "C:\Program Files\CrashPlan\bin\restart.bat"
ECHO is off.
Sat 07/14/2012 13:01:22.53 : APP_BASE_NAME=CrashPlan
Sat 07/14/2012 13:01:22.53 : APP_DIR=C:\Program Files\CrashPlan
ECHO is off.
Sat 07/14/2012 13:01:22.53 : Stopping CrashPlanService
The CrashPlan Backup Service service is stopping.
The CrashPlan Backup Service service was stopped successfully.

Sat 07/14/2012 13:01:25.05 : Sleeing 15 seconds...

Pinging 127.0.0.1 with 32 bytes of data:
Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
Reply from 127.0.0.1: bytes=32 time<1ms TTL=128

Ping statistics for 127.0.0.1:
Packets: Sent = 15, Received = 15, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
Sat 07/14/2012 13:01:39.08 : Starting CrashPlanService

The CrashPlan Backup Service service was started successfully.

ECHO is off.
Sat 07/14/2012 13:01:39.13 : Exiting...
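
The number in the log file name looks like a Unix timestamp in milliseconds. That is an assumption on my part, not something CrashPlan documents here, but it is easy to check with a few lines of Python (the function name is made up for this sketch):

```python
import re
from datetime import datetime, timezone


def restart_log_time(file_name):
    """Return the UTC time encoded in a restart_<number>.log file name, or None."""
    match = re.fullmatch(r"restart_(\d+)\.log", file_name)
    if match is None:
        return None
    millis = int(match.group(1))
    # Assumption: the number is milliseconds since the Unix epoch.
    seconds, ms = divmod(millis, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=ms * 1000)


print(restart_log_time("restart_1342296082496.log"))
# prints 2012-07-14 20:01:22.496000+00:00
```

20:01:22 UTC is 13:01:22 on a machine seven hours behind UTC, which lines up with the first timestamp inside the log, so each file appears to be named after the moment the restart script was started.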

I looked for a newer version, but 3.2.1 was the latest version. I logged a support ticket with CrashPlan, but I continued my investigation. I found a log file service.log.0, several MB in size, and inside it I found this:

[07.14.12 12:32:39.480 ERROR   QPub-BackupMgr       backup42.service.backup.BackupController] OutOfMemoryError occurred...RESTARTING! message=OutOfMemoryError in BackupQueue!

So it seems that the service is running out of memory. I now had a few good keywords to search on, and I found this post from a user with the same problem. At about the same time I received a reply from CrashPlan support, not bad for weekend service, with the same solution.
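
With a log file of several MB it helps to pull out only the out-of-memory entries. The following is a rough sketch, assuming every entry starts with a "[MM.DD.YY HH:MM:SS.mmm LEVEL" prefix like the line quoted above; the function name is made up, and I have not checked the pattern against other kinds of entries.

```python
import re
from datetime import datetime

# Assumption: entries start with "[MM.DD.YY HH:MM:SS.mmm LEVEL ...".
ENTRY = re.compile(r"^\[(\d{2}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s")


def out_of_memory_times(lines):
    """Return the timestamps of ERROR entries that mention OutOfMemoryError."""
    times = []
    for line in lines:
        match = ENTRY.match(line)
        if match and match.group(2) == "ERROR" and "OutOfMemoryError" in line:
            times.append(datetime.strptime(match.group(1), "%m.%d.%y %H:%M:%S.%f"))
    return times


sample = [
    "[07.14.12 12:32:39.480 ERROR   QPub-BackupMgr       backup42.service.backup.BackupController] "
    "OutOfMemoryError occurred...RESTARTING! message=OutOfMemoryError in BackupQueue!",
]
print(out_of_memory_times(sample))
```

To scan the real file, pass an open file object instead of the sample list. Comparing consecutive timestamps then shows how often the service ran out of memory.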

The CrashPlan backup service and desktop applications are Java apps, and as such the maximum amount of memory they use is capped by configuration. I have had similar problems with other memory hungry Java apps, like Jaikoz, that simply fail unless you increase the memory limit.

To fix the problem, shut down the service, open the CrashPlanService.ini file in the program directory, increase the maximum memory parameter to 2GB (the default is 512MB), and restart the service:

Virtual Machine Parameters=-Xrs -Xms15M -Xmx2048M
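
If you would rather script the change than edit the file by hand, something like the following Python sketch would do. The function name is made up, and the "before" line is my reconstruction of the default from the 512MB figure above, not a copy of the original file.

```python
import re


def set_max_heap(ini_text, megabytes):
    """Replace the -Xmx value on the 'Virtual Machine Parameters' line."""
    def fix_line(line):
        if not line.startswith("Virtual Machine Parameters="):
            return line
        # Only an ASCII hyphen matches; other options are left as they are.
        return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{megabytes}M", line)
    return "\n".join(fix_line(line) for line in ini_text.split("\n"))


# Assumed default line, reconstructed from the 512MB default.
before = "Virtual Machine Parameters=-Xrs -Xms15M -Xmx512M"
print(set_max_heap(before, 2048))
# prints Virtual Machine Parameters=-Xrs -Xms15M -Xmx2048M
```

The service still has to be stopped before the file is changed and started again afterwards.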

After upping the memory all seemed well, and the service has been running for more than a day. But I wanted to know just how much memory CrashPlan is using, and it turns out to be insane.

Here are the current stats for the amount of data I backup, as well as the resource utilization by the backup service and desktop app:

CrashPlan.Size
CrashPlan.Memory.Desktop
CrashPlan.Memory.Service

As you can see, the desktop app’s peak private bytes exceed 250MB, and the service exceeds 1.3GB. That’s right, 1.3GB of memory!

Those numbers are simply outrageous.

Trend Micro SafeSync, neat, but unreliable

I wanted to write about Trend Micro SafeSync, but it reminded me of my Streamload experience, and I ended up writing this instead. This time I am really going to write about SafeSync.

SafeSync is another online backup and sync and share application. Actually, they offer both online storage through a mapped drive and online syncing of folders, which makes it unique compared to many existing offerings.

I have used almost all online backup and sync and share type applications out there, and my favorite remains DropBox. SafeSync used to be Humyo, before being acquired by Trend. I used Humyo when they were in Beta, and it was just ok, but between then and now their product seems to have come a long way.

Of all the online backup and sync and share applications, a few things remain constant:
Free is unsustainable, somebody has to pay for the staff, the bandwidth, the disks, and the infrastructure. These vendors are running on venture capital, waiting for acquisition, for paid customers, for indirect monetization, or failure.
Unlimited storage is unfeasible, the increased home bandwidth capacity makes it easy to upload Terabytes of data, and we are back at the cost factor.
Usability and coolness are critical, especially usability on mobile devices, and coolness on web frontends.
Reliability is critical, and this brings me back to SafeSync.

SafeSync offers many things common to many other backup or sync and share providers, but three things stood out: they offer unlimited storage, they offer data access using WebDAV, and the web frontend allows convenient access to pictures and music.

The product is offered as a yearly service, listed as $59.99 on the Trend eStore, or $35.95 on the Trend US product page, weird. Regardless, when you add the product to the cart, the cost is $35.95.

I installed the software on three systems, two running Windows 7 Ultimate x64, and one running Windows Server 2008 R2. The install creates a new drive that is mapped to the online storage, and a user session application that can be used to sync a local folder to the online storage.

Here are some screenshots:
SafeSync.Config.Folders
SafeSync.Config.Connection
SafeSync.Status

The web frontend really reminds me of Streamload:
SafeSync.Online

A very neat feature is WebDAV access to the storage. This means that you can access the data using any WebDAV client, and there is no need to install the SafeSync client software. Here is a Trend KB for details, basically you connect to “dav.trendmicro.safesync.com” using your SafeSync credentials.

You can use the built-in Windows WebDAV client to access the storage, but you have to make a registry change, else you will get a "the folder you entered does not appear to be valid" error. After you make the change reboot, or just restart the WebClient service. See this Microsoft KB for details:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\WebClient\Parameters]
"BasicAuthLevel"=dword:00000002
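
As far as I understand the setting, a value of 2 allows Basic authentication over non-SSL connections as well as SSL ones, which Windows does not allow by default. The reason for the restriction is that Basic authentication only base64-encodes the credentials. The short Python sketch below illustrates this; the function names and the credentials are placeholders and have nothing to do with the SafeSync client.

```python
import base64


def basic_auth_header(user, password):
    """Return the HTTP Authorization header value for Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def decode_basic_auth(header_value):
    """Recover user and password, showing that the encoding is not encryption."""
    token = header_value.split(" ", 1)[1]
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password


header = basic_auth_header("user", "pass")  # placeholder credentials
print(header)                     # prints Basic dXNlcjpwYXNz
print(decode_basic_auth(header))  # prints ('user', 'pass')
```

Anybody who can see the traffic on an unencrypted connection can reverse the header the same way, so it is worth keeping in mind before making the registry change.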

Open explorer and map a network drive to “\\dav.trendmicro.safesync.com”:
WebDAV.Windows

Here are some explorer screenshots of a mapped drive using SafeSync, Windows, and WebDrive:
SafeSync.Cross
Windows.Cross
WebDrive.Cross

So this all sounds great. Well, not so great: the client application has serious stability issues.

On two machines, every time I log out of Windows, Windows reports that SafeSync is not responding, and after a minute or so, Windows eventually logs out.
On one machine, every time I log out, Windows paints the logging out screen, and never completes, requiring a power cycle.

SafeSync interferes with applications that are accessing files in a shared folder. It appears that SafeSync notices a file modification, then opens the file, and does not allow other applications access to the file. As an example, I create backups of my CD collection using dBpoweramp, and I shared the output folder in SafeSync. While dBpoweramp is still using the files, SafeSync opens the file and dBpoweramp fails. This is not a problem with dBpoweramp, and other sync applications, like DropBox, work just fine in the same situation.
dBPoweramp.Error.Writing

Adobe PhotoShop CS4 x64 crashes every time I open an image that is located in a shared folder. The crash is caused by the SafeSync explorer shell extension.
FAULTING_IP:
HrfsShellExtension!DllUnregisterServer+202ef

If a sync is in progress, and the machine goes to sleep, then later wakes up, SafeSync does not reconnect, instead it reports that the server is unavailable. In order to resolve this you have to log out and log back in.

I added a very large folder to sync, and the status window indicated it would take several days to complete, so I wanted to remove the mapping. On clicking the remove button, I received this funny error message, “Unexpected and unknown error, it is possible a logical error”. The only way to stop the sync was to uninstall.
SafeSync.Logical.Error

SafeSync crashed while uninstalling.
SafeSync.Uninstall.Crash

Lastly, the website does not display any file extensions.

I have not quite given up on the service, just the client software, it is simply too buggy. I am still using the online storage through a WebDrive mapped drive.
But, I still do not believe that unlimited storage is a sustainable business practice, and I would not be surprised if SafeSync limits storage, dramatically increases pricing, or is terminated.

Zotac ZBOXHD-ID11 4GB RAM

In this post I describe my experience while upgrading the BIOS, in order to support 4GB of memory.

This is the third post in a series of posts related to the Zotac ZBOX ZBOXHD-ID11.

Summary:
– 4GB is supported after upgrading the BIOS.
– BIOS has to be updated using less than 4GB, else ID11 fails to POST.

[Update: 20 May 2010]
After writing this post, the machine started bluescreen / BSOD crashing.
Mostly MEMORY_MANAGEMENT / 0x0000001A errors, with occasional 0x000000BE and 0x0000003B crashes.
When I initially installed the 4GB RAM, I ran memtest for one cycle, and the RAM tested fine. I just reran memtest, and it is reporting the memory as bad.
I replaced the memory with a new stick, I ran memtest overnight, and everything seems back to normal.
I hope it was just a bad stick, and not the ID11 that killed the memory.

When I ordered my ID11, I also ordered a 4GB Kingston SO-DIMM RAM stick.
When I received the ID11, the specs said 2GB only, and after contacting Zotac support, and posting in their support forum, they confirmed that 4GB is not supported.
I reverted to using a 2GB Kingston SO-DIMM RAM stick.

I was pleasantly surprised when Zotac announced a BIOS update that added 4GB support.

The BIOS changes are described as follows:
Version 05/11/10
.Added support on 4GB memory modules
.Added CMOS selection on Logo LED

I downloaded the BIOS update, extracted the contents, and tried running the AFUWIN AMI BIOS update utility. After a warning message appeared telling me to not run other apps and not to power down, on clicking ok, nothing happened. I tried again this time running AFUWIN.exe as administrator, still nothing.

I went to the AMI site, and downloaded their latest Windows BIOS update utility. Since I was running Windows 7 Ultimate x64, I ran AFUWINx64.exe, this binary automatically UAC prompted for elevated access, and presented this warning:

I opened the A140PA19.rom file, and the information tab showed the following:

I started the flash, and got this warning:

I accepted, and the flash completed:

I rebooted, and the POST screen showed a CMOS Checksum Bad error:

I pressed F1 to enter setup, and I made the following changes:
[Exit] [Load Optimal Defaults]
[Advanced] [PC Health Monitor] [CPUFAN TargetTemp Value] = 50
[Advanced] [IDE Configuration] [Configure SATA as] = AHCI
[Advanced] [PCIPnP] [Plug & Play OS] = Yes

The two BIOS changes are visible under these sections:
[Chipset] [North Bridge Configuration] “PCI MMIO Allocation: 4GB to 3072MB”
[Chipset] [South Bridge Configuration] [LOGO LED indicator:]

I rebooted, and everything worked fine.

Next I powered down, and replaced the 2GB RAM with 4GB RAM.

On reboot the following changes were visible on the POST screen and in the BIOS:

Booting into Windows, the following 4GB related changes were visible:

So far everything appears to work fine.
One of these days I will really get to testing media playback performance.

By the way.
In my first impressions post I reported that the ID11 came with the wrong power cable. Zotac support sent me the correct replacement cables free of charge: