Unraid vs. Ubuntu Bare Metal SMB Performance

In my last test I compared Unraid SMB performance against an Ubuntu VM running on Unraid, and Ubuntu outperformed Unraid. I wondered whether the VM disk image artificially improved performance, perhaps through IO caching, so this time I tested Ubuntu on the same bare metal hardware that runs Unraid.

I configured the system to boot from either the Unraid USB stick or an Ubuntu Server USB stick. In both cases the hardware was exactly the same, and the SMB share was on the same 4 x 1TB Samsung 860 Pro SSD BTRFS volume. I mounted the BTRFS volume using the same mount options that Unraid uses. The Ubuntu Samba server used default options; the only change I made was to register the share.

Samba Config:
[cstshare]
comment = Samba on Ubuntu
path = /mnt/cache/CacheSpeedTest
read only = no
browsable = yes

BTRFS Mount:
mount -t btrfs -o noatime,nodiratime -U 89d1ad3a-83f3-4086-9006-5f0931370d36 /mnt/cache
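
Applying and sanity-checking the share definition on stock Ubuntu looks something like this (a sketch; testparm validates smb.conf and prints the effective share settings):

$ sudo systemctl restart smbd
$ testparm -s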

I ran the same tests as before, and the results again showed that Unraid's SMB ReadWrite and Write performance is much worse than Ubuntu's. It was interesting to note that the Ubuntu ReadWrite performance was higher than the theoretical 1Gbps limit at 1MB and 2MB block sizes. I re-tested twice and got the same results; my assumption is that the DiskSpd options to disable local and remote caching were not effective.
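
For reference, the caching control in question is DiskSpd's -S flag; these tests used -Srw, which disables local caching for remote filesystems and opens the target write-through. A representative invocation (server and share names illustrative):

>diskspd -w50 -b1M -F2 -r -o8 -W60 -d120 -Srw -Rtext \\server\share\testfile64g.dat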

I have now tested Unraid against a W2K19 VM, an Ubuntu VM, and now Ubuntu on bare metal, and Unraid's ReadWrite and Write performance is always abysmal.

I have again reported my findings in the Unraid SMB performance issue thread, and we continue to wait for a fix.

Recovering the Firmware on a Supermicro BPN-SAS3-846EL1 Backplane

In a previous adventure I replaced an Adaptec HBA with an LSI SAS3 HBA, and the chassis drive bay LEDs stopped working. I suspect the LSI card does not play nice with the SGPIO sideband controller, so I decided to replace the chassis with one similar to my SC846 chassis, where the LSI card and drive bay LEDs do work fine.

What should have been a simple replacement turned into quite a recovery operation.

Since I now had SAS3 HBAs in both my servers, I really wanted SAS3 backplanes, but I did not want to pay SAS3 chassis prices. I found a refurbished SuperChassis 846E16-R1200B chassis and two refurbished Supermicro BPN-SAS3-846EL1 backplanes on eBay. One SAS3 backplane would replace the SAS2 backplane in my existing SC846, and the second SAS3 backplane would replace the SAS2 backplane in the newly purchased SC846. The combination of a 24 bay 4U chassis with a SAS2 backplane and a replacement SAS3 backplane is much cheaper than any native SAS3 chassis I could find.

I do understand that using SATA3 drives on a SAS3 backplane will not perform like SAS3 drives, but with multipath the aggregate throughput can still outperform SAS2.

I received my chassis and my backplanes. The chassis was clean but a bit dinged in one corner, and the expanders were clean, but the metal frames showed a little rust. I had no idea how old the firmware on the backplanes was, so I contacted Supermicro support to ask for the current firmware. After asking for my serial numbers, Supermicro sent me the latest firmware for my hardware. The firmware update instructions were included in the “ReleaseNote.txt” file that came with the firmware.

I removed the motherboard from the old chassis and installed it in the new SC846. I removed the SAS2 backplane and installed the SAS3 backplane in its place. The power cable layout on the SAS3 backplane is a bit different, and I had to use a few molex power splitters to extend the power cables to reach the power plugs on the SAS3 backplane. The standard rails are too long to fit in my rack, and the short rails are too short for the chassis, so I again used short outer rails and standard inner rails.

I powered the machine up through remote IPMI KVM, all looked good, and I booted into my Ubuntu Server USB stick so I could SSH into the box and update the firmware.

The instructions from “ReleaseNote.txt” say:

How to Flash Firmware
-------------------------
Under Linux/Windows Environment to use CLIXTL (ver.6.00)
1. use "CLIXTL -l" to show SAS addresses
2. use "CLIXTL -f all  -d " to update firmware
3. use "CLIXTL -f 3  -d  -r" to update MFG and reset expander

example:
CLIXTL -l
CLIXTL -f all -t 500304800000007F -d SAS3-EXPFW_66.16.11.00.fw
CLIXTL -f 3 -t 500304800000007F -d BPN-SAS3-846EL-PRI_16_11.bin -r

The existing firmware on the expanders was v66.06.01.02 with MFG v06.02, while the new firmware was v66.16.11.00 with MFG v16.11.

Firmware Update History
-------------------------
01. migrate expander firmware to phase 16 
02. enhance TMP, VOL, FAN, and PWS status in SES pages
03. present version information of current running firmware
04. Dynamic SES page element presentation
05. move BMC IP to SCSI network inquiry 
06. support I2C R/W as slave to let BMC to identify platforms
07. redundant side has sensor information 
08. firmware rewrite and optimization

I did the following:

$ sudo ./CLIXTL -i -t 5003048001B24DBF
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 5003048001B24DBF
ENCLOSURE ID - 5003048001B24DBF
ENCLOSURE INFORMATION:
PLATFORM NAME - SMC846ELSAS3P
SERIAL NUMBER -
VENDOR ID - LSI
PRODUCT ID - SAS3x40
VERSION INFORMATION:
FLASH REGION 0 - 66.06.01.02
FLASH REGION 1 - 66.06.01.02
FLASH REGION 2 - 66.06.01.02
FLASH REGION 3 - 06.02
DEVICE INFORMATION:
DEVICE NAME - /dev/sg0
BMC IP - NULL

$ sudo ./CLIXTL -f all -t 5003048001B24DBF -d SAS3-EXPFW_66.16.11.00.fw
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Firmware Region 0 - Finished
Firmware Region 1 - Finished
Firmware Region 2 - Finished
Please reset expander to activate

pieter@ubuntuusb:~/SAS3$ sudo ./CLIXTL -f 3 -t 5003048001B24DBF -d BPN-SAS3-846EL-PRI_16_11.bin -r
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Error, incompatible file type or directory

And the expander dropped and never came back up.

$ sudo ./CLIXTL -l
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Error, no enclosure has been found

I can hear the comments now: why risk updating the firmware if it is not broken? True, but I didn’t know if it would work as delivered, and I’d rather start fresh. Do also note that I did the update on my secondary server; my primary server was still running unmodified, so there was no interruption of home services or work experiments.

I still had the second SAS3 backplane, so I replaced the “bricked” one with it, leaving the firmware as is, and brought my Unraid server back up; all appeared fine. At least I had two working servers, giving me time to try to recover the backplane.

I sent Supermicro support an email asking for help, but since it was the weekend I had to wait, so I did my own research. I found a forum post from a user who recovered a “bricked” SAS2 expander via the factory serial port, and I decided to give it a try.

I used the FTDI USB RS232 adapter I normally use for Arduino programming, and the pin connections for PRI-SDB / 8 are:

PRI-SDB : 1 : TX  -> RS232 : RX
PRI-SDB : 2 : GND -> RS232 : GND
PRI-SDB : 3 : RX  -> RS232 : TX

The current XTools v6.10.C CLI does not include COM support, at least none that I could find in the documentation or CLI help, so I used the older v1.4 version.

>g3Xflash.exe -s com4 get avail
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface.....
Expander: Unknown (SAS_3X_40)
1) Unknown (SAS_3X_40) (00000000:00000000)

A good sign: the COM port worked and the expander hardware was detected, but it did not have a SAS address.

I flashed the firmware and the MFG data:

>g3Xflash.exe -s com4 down fw SAS3-EXPFW_66.16.11.00.fw 0
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface..
Expander: Unknown (SAS_3X_40)
Expander Validation: Passed
Checksum: Passed
Target Firmware Region: 00
Current Version: 255.255.255.255
Replacement Version: 66.16.11.00
Image Validation: Passed
Pre-Validation of image is successful.
Are you sure to download file to expander?(y/n):y
Downloading File...........................................................................Download Complete.
Post-validating........................................................Post-Validation of image is successful.
Download Successful.

>g3Xflash.exe -s com4 down mfg BPN-SAS3-846EL-PRI_16_11.bin 3
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface.....
Expander: Unknown (SAS_3X_40)
Image Validation: Passed
Checksum: Passed
Reading MFG version from flash...Unable to retrieve version.
Replacement Version: 10.0b
Pre-Validation of image is successful.
Are you sure to download file to expander?(y/n):y
Downloading File............................Download Complete.
Post-validating.........Post-Validation of image is successful.
Download Successful.

I reset the expander, the LEDs now ran a test pattern they had not run before, and things looked good:

>g3Xflash.exe -s com4 reset exp
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface....................
Expander: SC846-P (SAS_3X_40)
Are you sure you want to reset Expander?(y/n):y
Expander reset successful.

>g3Xflash.exe -s com4 get avail
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface..
INFO: Bootstrap is not present on board.
Downloading the Bootstrap
................................................................Download Bootstrap Complete.
..................
Expander: SC846-P (SAS_3X_40)
1) SC846-P (SAS_3X_40) (50030480:0000007F)

>g3Xflash.exe -s com4 get exp

********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface....................
Expander: SC846-P (SAS_3X_40)
Reading the expander information................
Expander: SC846-P (SAS_3X_40) C1
SAS Address: 50030480:0000007F
Enclosure Logical Id: 50030480:0000007F
Component Identifier: 0x0232
Component Revision: 0x03

>g3Xflash.exe -s com4 get ver 0
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface....................
Expander: SC846-P (SAS_3X_40)
Firmware Region Version: 66.16.11.00

Everything looked good, except the SAS address defaulted to 50030480:0000007F.

The firmware “ReleaseNote.txt” file states that the v6.00 CLIXTL tool can change the SAS address, but the only version on the Supermicro site is the v6.10.C version, which does not support changing the SAS address.

How to modify SAS address
-------------------------
Under Linux/Windows Environment to use CLIXTL (ver.6.00)
1. use "CLIXTL -l" to show SAS addresses
2. use "CLIXTL -s  -t  -r" to change SAS address and reset expander

example:
CLIXTL -l
CLIXTL -s 500304801234567F -t 500304800000007F -r

The v1.4 GUI does support changing the SAS address. It appears that the GUI dynamically creates an MFG image (I could see a BIN file get created in the directory), but after it changed the address, the backplane was back in a borked state, and I had to repeat the recovery process.

The next week I heard back from Supermicro; they confirmed that the instructions in the “ReleaseNote.txt” file were wrong, and that I should use the instructions from the “Command-line Xtool 6.10.C.pdf” file.

Wrong, will bork the MFG data:
CLIXTL -f 3 -t 5003048001B24DBF -d BPN-SAS3-846EL-PRI_16_11.bin -r

Right:
CLIXTL -c -t 5003048001B24DBF -d BPN-SAS3-846EL-PRI_16_11.bin

Better, update firmware and MFG and retain settings:
CLIXTL -a usc -t 5003048001B24DBF -d ~/

I used the all-in-one update method on the server whose backplane still ran the original firmware, and it updated without issue:

# ./CLIXTL -a usc -t 500304800914683F -d ~/CLIXTL6.10.C_Linux/
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Firmware Region 0 - Finished
Firmware Region 1 - Finished
Firmware Region 2 - Finished
MFG page Region 3 - Finished
New configuration is uploaded successfully
Please reset expander to activate

[Reboot]

# ./CLIXTL -i -t 500304800914683F
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 500304800914683F
ENCLOSURE ID - 500304800914683F
ENCLOSURE INFORMATION:
PLATFORM NAME - SMC846ELSAS3P
SERIAL NUMBER -
VENDOR ID - LSI
PRODUCT ID - SAS3x40
VERSION INFORMATION:
FLASH REGION 0 - 66.16.11.00
FLASH REGION 1 - 66.16.11.00
FLASH REGION 2 - 66.16.11.00
FLASH REGION 3 - 16.11
DEVICE INFORMATION:
DEVICE NAME - /dev/sg11
BMC IP - NULL

I now had one perfectly updated backplane preserving all the original MFG data, and one backplane with default MFG data. I wanted to apply the MFG data from the good backplane to the one with default values.

I downloaded the firmware and manufacturing data from the good backplane using the v1.4 tools (not supported in current CLIXTL):

./g3Xflash -i get avail
./g3Xflash -y -i 500304800914683F up fw up_fw_loader_0.fw 0
./g3Xflash -y -i 500304800914683F up fw up_fw_loader_1.fw 1
./g3Xflash -y -i 500304800914683F up fw up_fw_loader_2.fw 2
./g3Xflash -y -i 500304800914683F up mfg up_mfg_loader.bin 3

The downloaded files are larger than the actual images and are padded with 0xFF or 0x00. I trimmed the MFG file to the right size and modified the SAS address in two places:

[Image: hex diff of the MFG file, showing the SAS address modified in two places]
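
The edit itself was nothing fancy. A minimal PowerShell sketch of the same idea, assuming the trimmed size and the two offsets were found beforehand with a hex diff (file names, size, and offsets are illustrative); as it turned out below, a plain byte patch is not sufficient:

# read the dumped MFG image and trim the 0xFF/0x00 padding to the expected size (size illustrative)
$bytes = [System.IO.File]::ReadAllBytes('up_mfg_loader.bin')[0..(0x10000 - 1)]

# the new SAS address bytes, written at both offsets found in the hex diff (offsets illustrative)
$sas = 0x50, 0x03, 0x04, 0x80, 0x01, 0xB2, 0x4D, 0xBF
foreach ($offset in 0x10, 0x30) {
    for ($i = 0; $i -lt $sas.Length; $i++) { $bytes[$offset + $i] = $sas[$i] }
}

[System.IO.File]::WriteAllBytes('BPN-SAS3-846EL-PRI_16_11.bin', [byte[]]$bytes)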

I tried to upload the modified MFG data:

>g3Xflash.exe -y -s com4 down mfg BPN-SAS3-846EL-PRI_16_11.bin 3
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface.....
Expander: Unknown (SAS_3X_40)
Image Validation: Passed
Checksum: Failed

But the tool complained that the checksum failed. From the file diff we can see that more than just the SAS address changed; I assume some sort of checksum is calculated over the data.

The v1.4 g3Xflash CLI help does reference XML options for converting between binary and XML MFG formats, but there are no instructions on how to use them. Like the serial recovery procedure, these tools are probably for internal use only, and I could find no public references.

>g3Xflash.exe -h
********************************************************************************
    g3Xflash
    LSI SAS Expander Flash Utility
    Version: 2.0.0.0
    Copyright (c) 2013 LSI Corporation.  All rights reserved.
********************************************************************************
SYNTAX:
    g3Xflash OPTIONS INTERFACE COMMAND
OPTIONS:
...
COMMAND:
...
        up mfg <image file> <region>
...
            In case of mfg upload using XML file, command syntax changes
            as below. XML file needs to be specified at two places.
            e.g.
            "g3Xflash -x   up mfg  3"

Supermicro support confirmed the current tools cannot change the SAS address; they would not supply the older version of the tools, and recommended I either send the backplane in for service or allow them to SSH into the machine remotely and change it for me. A bit disappointing that something so simple is made so complicated, and really way too much trouble.

Since I only had one expander in the chassis, there would be no issues using a default SAS address, and I decided to leave it as is. I replaced the other SAS2 backplane with the recovered SAS3 backplane, and the expander and all drives were back online.

If anybody knows how to update the SAS address, or has a copy of the v6.00 CLIXTL tools that supposedly can change the address, please do let me know.

Unraid Repeat Parity Errors on Reboot

This post started with a quick experiment, but after hardware incompatibilities forced me to swap SSD drives, and I subsequently lost a data volume, it turned into a much bigger effort.

My two Unraid servers have been running nonstop without any issues for many months; last I looked, the uptime on v6.7.2 was around 240 days. We recently experienced an extended power failure, and after the servers were restarted I noticed 5 parity errors on both servers.

Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934168
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934176
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934184
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934192
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934200

Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

I initially suspected that a dirty shutdown caused the corruption, but my entire rack is on a large UPS, and the servers are configured, and tested, to shut down cleanly on a low battery condition. Unfortunately Unraid does not persist logs across reboots, so it was impossible to verify the shutdown behavior via logs. Unraid logs to memory and not to the USB flash drive to prevent flash wear, but I think this needs to be at least configurable, as no logs means troubleshooting after an unexpected reboot is near impossible. Yes, I know I can enable the Unraid syslog server and redirect syslog to write to the flash drive, but syslog is not as reliable or complete as native logging, especially during a shutdown scenario; more importantly, syslog was not enabled, so there were no shutdown logs.
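
For what it is worth, redirecting syslog to the flash drive boils down to a one-line rsyslog rule along these lines (a sketch; the drop-in path and log file name are assumptions, and writing every message to flash does add wear):

# /etc/rsyslog.d/flash.conf (path illustrative): append all messages to a file on the USB flash drive
*.* /boot/logs/syslog.txt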

I could not entirely rule out a dirty shutdown, but I could test a clean reboot scenario. I restarted from within Unraid, ran a parity check, and the exact same 5 parity errors were back; I ran a parity check again, and it was clean. It takes more than a day to run a single parity check, so this is a cumbersome and time consuming exercise. It is very suspicious that it is exactly the same 5 sectors, every time.

Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

I searched the Unraid forums, and found other reports of similar repeat parity errors, in some instances attributed to a Marvell chipset, a Supermicro AOC-SASLP-MV8 controller, or the SASLP2 driver. My systems use Adaptec RAID cards, 7805Q SAS2 and 81605ZQ SAS3, in HBA mode, so no Marvell chipset and no SASLP2 driver, but the same symptoms.

An all too common forum reply to storage problems is to switch to an LSI HBA, and I got the same reply when I reported the parity problem with my Adaptec hardware.

I was sceptical: causation vs. correlation. As an example, take the SQLite corruption bug introduced in v6.7; for the longest time it was blamed on hardware or 3rd party apps, but it eventually turned out to be an Unraid bug.

Arguing my case on a community support forum is not productive, and I just want the parity problem resolved, so I decided to switch to LSI HBA cards. I really do have a love-hate relationship with community support, especially when I pay for a product, like Unraid or Plex Pass, but have no avenue to dedicated support.

I am no stranger to LSI cards, and the problems flashing from IR to IT mode firmware, so I got my LSI cards preflashed with the latest IT mode firmware at the Art of Server eBay store. My systems are wired with miniSAS HD SFF-8643 cables, and the only cards offered with miniSAS HD ports were LSI SAS9340-8i ServeRAID M1215 cards. I know the RAID hardware is overkill when using IT mode, and maybe I should have gone for vanilla LSI SAS 9300-8i cards, especially when the Unraid community was quick to comment that a 9340 is not a “true” HBA.

I replaced the 7805Q with the SAS9340 in Server-2, and noticed that none of my SSD drives showed up in the LSI BIOS utility; only the spinning disks did. I put the 7805Q card back, and all the drives, including the SSDs, showed up in the Adaptec BIOS utility. I replaced the 81605ZQ with the SAS9340 in Server-1, and this time some of the SSDs showed up. None of my Samsung EVO 840 SSDs showed up, but the Samsung Pro 850 and Pro 860 SSDs did. I again replaced the 7805Q in Server-2 with the SAS9340, but this time I added a Samsung Pro 850, and it did show up.

The problem seemed to be limited to my Samsung EVO drives. I reached out to Art of Server for help, and although he was very responsive, he had not seen or heard of this problem. I looked at the LSI hardware compatibility list, and the EVO drives were listed. Some more searching, and I found an LSI KB article mentioning TRIM not being supported on Samsung Pro 850 drives. It seems that the LSI HBAs need the drive's TRIM implementation to support DRAT (Deterministic Read After TRIM, shown by hdparm as “Data Set Management TRIM supported (limit 8 blocks)”) and RZAT (Deterministic read ZEROs after TRIM). The Wikipedia article on TRIM lists specific drives with faulty TRIM implementations, including the Samsung 840 and 850 (without specifying Pro or EVO), and the Linux kernel has special handling for Samsung 840 and 850 drives.

	/* devices that don't properly handle queued TRIM commands */
	{ "Micron_M500IT_*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Micron_M500_*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Crucial_CT*M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Micron_M5[15]0_*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Crucial_CT*M550*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Crucial_CT*MX100*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 840*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 850*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "FCCT*M500*",			NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },

This is all still circumstantial, as it does not explain why the LSI controller would not mount the EVO 840 drives but will mount the Pro 850 drive, when both are listed as problematic and both are included on the LSI hardware compatibility list. I do not have EVO 850s to test with, so I cannot confirm whether the problem is limited to EVO 840s.

I still had the original parity problem to deal with, and to verify that an LSI HBA would resolve it, I needed a working Unraid system with an LSI HBA. Server-1 had two EVO 840s, a Pro 850, and a Pro 860 for the BTRFS cache volume. I pulled a Pro 850 and a Pro 860 drive from another system, and proceeded to replace the two EVO 840s. Per the Unraid FAQ, I should be able to replace the drives one at a time, waiting for the BTRFS volume to rebuild. I replaced the first disk, and it took about a day to rebuild. I replaced the second disk using the same procedure, but something went wrong: my cache volume would not mount, and was reported as corrupt.

Jan  6 07:25:41 Server-1 kernel: BTRFS info (device sdf1): allowing degraded mounts
Jan  6 07:25:41 Server-1 kernel: BTRFS info (device sdf1): disk space caching is enabled
Jan  6 07:25:41 Server-1 kernel: BTRFS info (device sdf1): has skinny extents
Jan  6 07:25:41 Server-1 kernel: BTRFS warning (device sdf1): devid 4 uuid 94867179-94ed-4580-ace4-f026694623f6 is missing
Jan  6 07:25:41 Server-1 kernel: BTRFS error (device sdf1): failed to verify dev extents against chunks: -5
Jan  6 07:25:41 Server-1 root: mount: /mnt/cache: wrong fs type, bad option, bad superblock on /dev/sdr1, missing codepage or helper program, or other error.
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7033): exit status: 32
Jan  6 07:25:41 Server-1 emhttpd: /mnt/cache mount error: No file system
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7034): umount /mnt/cache
Jan  6 07:25:41 Server-1 kernel: BTRFS error (device sdf1): open_ctree failed
Jan  6 07:25:41 Server-1 root: umount: /mnt/cache: not mounted.
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7034): exit status: 32
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7035): rmdir /mnt/cache

In retrospect I should have known something was wrong when Unraid reported the array as stopped but I still saw lots of disk activity on the SSD drive bay lights. I suspect the BTRFS rebuild was still ongoing, or the volume still mounted, even though Unraid reported the array as stopped. No problem, I thought: I make daily data backups to Backblaze B2 using Duplicacy, and weekly Unraid (appdata and docker) backups that are then backed up to B2. I recreated the cache volume and got the server started again, but my Unraid data backups were missing.
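
In hindsight, a quick console check before pulling a disk would have shown whether BTRFS was actually done; a minimal sketch, assuming the pool is mounted at /mnt/cache:

$ btrfs replace status /mnt/cache   # progress of an in-flight device replace, if one is running
$ btrfs balance status /mnt/cache   # progress of any balance restriping data onto the new device
$ btrfs fi show /mnt/cache          # per-device usage; a device still filling up is not done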

It was an oversight and a configuration mistake: I had configured my backup share to be cached, I ran daily backups of the backup share to B2 at 2am, and weekly Unraid backups to the backup share on Mondays at 3am. The last B2 backup was Monday morning at 2am; the last Unraid backup was Monday morning at 3am. When the cache died, all data on the cache was lost, including the last Unraid backup, which never made it to B2. My last recoverable Unraid backup on B2 was a week old.

So a few key learnings: do not use the cache for backup storage, schedule offsite backups to run after onsite backups, and if the lights are still blinking don’t pull the disk.

Once I had all the drives installed, I tested for TRIM support.

Samsung Pro 860, supports DRAT and RZAT:

root@Server-1:/mnt# hdparm -I /dev/sdf | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM

Samsung Pro 850, supports DRAT:

root@Server-2:~# hdparm -I /dev/sdf | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)

Samsung EVO 840, supports DRAT, but does not work with the LSI HBA:

root@Server-2:~# hdparm -I /dev/sdc | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)

The BTRFS volume consisting of 4 x Pro 860 drives reported trimming what looks like all disks, 3.2 TiB:

root@Server-1:~# fstrim -v /mnt/cache
/mnt/cache: 3.2 TiB (3489240088576 bytes) trimmed

The BTRFS volume consisting of 2 x Pro 860 + 2 x Pro 850 drives reported trimming what looks like only 2 disks, 1.8 TiB:

root@Server-2:~# fstrim -v /mnt/cache
/mnt/cache: 1.8 TiB (1946586398720 bytes) trimmed

In summary, Samsung EVO 840 no good, Samsung Pro 850 avoid, Samsung Pro 860 is ok.

Server-2 uses SFF-8643 to SATA breakout cables with sideband SGPIO connectors controlling the drive bay lights. With the Adaptec controller the drive bay lights worked fine, but with the LSI the lights do not appear to work. I am really tempted to replace the chassis with one that has a SAS expander backplane, alleviating the need for the breakout cables, but that is a project for another day.

After I recreated the cache volume, I reinstalled the Duplicacy web container and tried to restore my now week-old backup file. I could not get the web UI to restore the 240GB backup file; either the session timed out or the network connection was dropped. I reverted to using the CLI and, with a few retries, eventually restored the file. It was disappointing to learn that the web UI must remain open during the restore, and that the CLI does not automatically retry on network failures. Fortunately Duplicacy does block-based restores and can resume restoring large files.

2020/01/06 07:59:01 Created restore session 1o8nqw
2020/01/06 07:59:01 Running /home/duplicacy/.duplicacy-web/bin/duplicacy_linux_x64_2.3.0 [-log restore -r 101 -storage B2-Backup -overwrite -stats --]
2020/01/06 07:59:01 Set current working directory to /cache/localhost/restore
2020/01/06 09:37:35 Deleted listing session jnji7l
2020/01/06 09:37:41 Invalid session
2020/01/06 12:07:57 Stopping the restore operation in session 1o8nqw
2020/01/06 12:07:57 Failed to restore files for backup B2-Backup-Backup revision 101 in the storage B2-Backup: Duplicacy was aborted
2020/01/06 12:07:57 closing log file restore-20200106-075901.log
2020/01/06 12:08:17 Deleted restore session 1o8nqw
Downloaded chunk 34683 size 13140565, 15.05MB/s 01:20:44 70.1%
Failed to download the chunk 
1ff9d2c082d06226b0d81019338d048bf5a4428827a3fc0d3f6f337d66fd7fa9: read tcp 192.168.1.113:49858->206.190.215.16:443: wsarecv: An existing connection was forcibly closed by the remote host.
...
Downloaded chunk 49473 size 2706939, 13.52MB/s 00:00:01 100.0%
Downloaded Unraid/2019-12-30@03.00/CA_backup.tar.gz (255834283004)
Restored C:\Users\piete\Downloads\Duplicacy to revision 101
Files: 1 total, 243982.58M bytes
Downloaded 1 file, 243982.58M bytes, 14795 chunks
Total running time: 01:30:23
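
Since the CLI resumes already-downloaded chunks, a dumb retry loop is enough to ride out the connection resets. A PowerShell sketch of what I mean, reusing the flags from the log above (binary name and paths are illustrative):

# re-run the restore until it succeeds; duplicacy resumes from the chunks already downloaded
Set-Location 'C:\Users\piete\Downloads\Duplicacy'    # the initialized repository directory
do {
    & .\duplicacy.exe -log restore -r 101 -storage B2-Backup -overwrite -stats
    if ($LASTEXITCODE -ne 0) { Start-Sleep -Seconds 30 }    # back off before retrying a dropped connection
} while ($LASTEXITCODE -ne 0)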

I did lose my DW Spectrum IPVMS running on an Ubuntu Server VM. I’ve known that I don’t have a VM backup solution, but the video footage is on the storage server, not in the VM, video backups go to B2, and it is reasonably easy to recreate the VM. I am still working on a DW Spectrum docker solution for Unraid, but as of today the VMS does not recognize Unraid mapped storage volumes.

After all this trouble, I could finally test a parity check after reboot with the LSI HBA. With the system up I ran a parity check, all clear; rebooted, ran the parity check again, and … no errors. I performed this operation on both servers, with no problems.

I was really sceptical that the LSI would work where the Adaptec failed, and this does not rule out Unraid as the cause, but it does show that Unraid with the LSI HBA does not have the dirty parity on reboot problem.

Unraid in Production, a Bit Rough Around the Edges, and Terrible SMB Performance

In my last two posts I described how I migrated from W2K16 and hardware RAID6 to Unraid. Now that I’ve had two Unraid servers in production for a while, I’ll describe some of the good and not so good I experienced.

Running Docker on Unraid is orders of magnitude easier than getting Docker to work on Windows. Docker allowed me to move all but one of my workloads from VMs to containers, simplifying updates, reducing the memory footprint, and improving performance.

For my IP security camera NVR software I switched from Milestone XProtect Express running on a W2K16 VM to DW Spectrum running on an Ubuntu Server VM. DW Spectrum is the brand name under which the Nx Witness product is sold in the US. I chose to switch from XProtect to Nx Witness, now DW Spectrum, because Nx Witness is lighter in resource consumption, easier to deploy, easier to update, has perpetual licenses, includes native remote viewing, and an official Docker release is forthcoming.

I have been a long time user of CrashPlan, and I switched to CrashPlan Pro when they stopped offering a consumer product. I tested CrashPlan Pro and Duplicati containers on Unraid, with Duplicati backing up to Backblaze B2. Duplicati was the clear winner: backups were very fast and completed in about 3 days, whereas after 5 days I stopped CrashPlan when it estimated another 18 days to complete the same backup operation and showed the familiar out of memory error. My B2 storage cost will be a few dollars higher than a single seat license for CrashPlan Pro, but the Duplicati plus B2 functionality and speed are superior.

When the Unraid 6.7.0 release went public, I immediately updated, but soon realized my mistake when several plugins stopped working. It took several weeks before plugin updates were released that restored full functionality. It is worth mentioning, again, that I find it strange that Unraid is really not that usable without community provided plugins; the functionality lives in the plugins, not in Unraid. Next time I will wait a few weeks for the dust to settle in the plugin community before updating.

Storage and disk management is reasonably easy, and much more flexible compared to hardware RAID management. But adding and removing disks is still mostly a manual process, and doing it without invalidating parity is very cumbersome and time consuming. Several times I gave up on the convoluted steps required to add or remove disks without invalidating parity, and just reconfigured the array and rebuilt parity, hoping nothing would go wrong during the parity rebuild. This is in my opinion a serious shortcoming, maybe not in technology, but in the lack of an easy to use and reliable workflow to help retain redundant protection at all times.

In order to temporarily make enough storage space in my secondary server, I removed all the SSD cache drives and replaced them with 12TB Seagate IronWolf drives. I moved all the data that used to be on the cache to regular storage, including the docker appdata folder. This should not have been a big deal, but I immediately started getting SQLite DB corruption errors in apps like Plex that store data in SQLite on the appdata share. After some troubleshooting I found many people complaining about this issue, which seems to have been exacerbated by the recent Unraid 6.7.0 update. Apparently this is a known problem with the FUSE filesystem used by Unraid. FUSE dynamically spans shares and folders across disks, but apparently breaks the file and file-region locking required by SQLite. The recommended workaround is to put all files that require locking on the cache, or on a single disk, effectively bypassing FUSE, as in the example below. If it is FUSE that breaks file locking behavior, I find it troubling that this is not considered a critical bug.
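
Concretely, the workaround is just a path change in each container's host volume mapping, pointing at the cache (or a specific disk) directly instead of through the user share layer; e.g. for a hypothetical Plex container:

/mnt/cache/appdata/plex   (direct cache path, bypasses FUSE)
instead of
/mnt/user/appdata/plex    (FUSE user share path)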

I am quite familiar with VM snapshot management using Hyper-V and VMware; it is a staple of VM management. In Unraid I am using a Docker based virt-manager, which seems far less flexible, but more importantly, fails to take snapshots of UEFI based VMs. Apparently this is a known shortcoming. I have not looked very hard for alternatives, but this seems to be a serious functional gap compared to Hyper-V or VMware’s snapshot capabilities.

As I started using the SMB file shares, now hosted on Unraid, in my regular day to day activities, I noticed that under some conditions the write speed becomes extremely slow, often dropping to around 2MB/s. This seems to happen when other file read operations are in progress; even a few KB/s of reads can drastically reduce the array SMB write performance. Interestingly, the issue does not appear to affect my use of rsync between Unraid servers, only SMB. I did find at least one other recent report of similar slowdowns where only SMB is affected.

Since the problem appeared to be specific to Unraid SMB, and not general network performance, I compared the Unraid SMB performance with Windows SMB in a W2K19 VM running on the same Unraid system. By running W2K19 as a VM on the same Unraid system, the difference in performance will be mostly the SMB stack, not hardware or network.

On Unraid I created a share backed by the SSD cache array; that same SSD cache array holds the W2K19 VM disk image, so the storage subsystems are similar. I also ran a similar test against an Unraid share backed by disk instead of cache.

I found a few references (1, 2) to SMB benchmarking using DiskSpd, and I used them as the basis for my test options. Start by creating a 64GB test file on all test shares; we reuse the file, and it saves a lot of time to not recreate it every time. Note that we get a warning when creating the file on Unraid, due to SetFileValidData() not being supported by Unraid’s SMB implementation, but that should not be an issue.

>diskspd.exe -c64G \\storage\testcache\testfile64g.dat
WARNING: Could not set valid file size (error code: 50); trying a slower method of filling the file (this does not affect performance, just makes the test preparation longer)

>diskspd.exe -c64G \\storage\testmnt\testfile64g.dat
WARNING: Could not set valid file size (error code: 50); trying a slower method of filling the file (this does not affect performance, just makes the test preparation longer)

>diskspd.exe -c64G \\WIN-EKJ8HU9E5QC\TestW2K19\testfile64g.dat

I ran several tests similar to the following command lines:

>diskspd -w50 -b512K -F2 -r -o8 -W60 -d120 -Srw -Rtext \\storage\testcache\testfile64g.dat > d:\diskspd_unraid_cache.txt
>diskspd -w50 -b512K -F2 -r -o8 -W60 -d120 -Srw -Rtext \\storage\testmnt\testfile64g.dat > d:\diskspd_unraid_mnt.txt
>diskspd -w50 -b512K -F2 -r -o8 -W60 -d120 -Srw -Rtext \\WIN-EKJ8HU9E5QC\TestW2K19\testfile64g.dat > d:\diskspd_w2k19.txt

For a full explanation of the commandline arguments see here. The test will do 50% reads and 50% writes, block sizes varied from 4KB to 2048KB, 2 threads, 8 outstanding IO operations, random aligned IO, a 60s warm up, a 120s run, write-through, and local caching disabled for remote filesystems.

[Charts: DiskSpd results for the W2K19 VM share (W2K19.1, W2K19.3), the Unraid cache share (Cache.1, Cache.3), and the Unraid disk share (Mount.1, Mount.3)]

From the results we can see that the Unraid SMB performance for this test is pretty poor. I redid the tests, this time doing independent read and write tests, and instead of various block sizes, I just did a 512KB block size test (I got lazy).

[Charts: independent read and write results at a 512KB block size (RW.1, RW.2)]

No matter how we look at it, the Unraid SMB write performance is still really bad.

I wanted to validate the synthetic test results with a real world test, so I collected a folder containing around 65.2GB of fairly large files, on SSD, and copied the files up and down using robocopy from my Win10 system. I chose the total size to be about double the memory of the Unraid system, so that the impact of caching is minimized. I made sure to use a RAW VM disk to eliminate any performance impact of growing a QCOW2 image file.

>robocopy d:\temp\out \\storage\testmnt\in /mir /fft > d:\robo_pc_mnt.txt
>robocopy d:\temp\out \\storage\testcache\in /mir /fft > d:\robo_pc_cache.txt
>robocopy d:\temp\out \\WIN-EKJ8HU9E5QC\TestW2K19\in /mir > d:\robo_pc_w2k19.txt

>robocopy \\storage\testmnt\in d:\temp\in /mir /fft > d:\robo_mnt_pc.txt
>robocopy \\storage\testcache\in d:\temp\in /mir /fft > d:\robo_cache_pc.txt
>robocopy \\WIN-EKJ8HU9E5QC\TestW2K19\in d:\temp\in /mir > d:\robo_w2k19_pc.txt

During the robocopy to Unraid I noticed that sporadically the Unraid web UI, and web browsing in general, became very slow. This never happened while copying to W2K19. I can’t explain this: I see no errors reported in my Win10 client event log or resource monitor, no unusual errors on the network switches, and no errors in Unraid. I suspect whatever is impacting SMB performance is affecting network performance in general, but without data I am really just speculating.

The robocopy read results are pretty even, but again show inferior Unraid SMB write performance. Do note that the W2K19 VM is still not as fast as my previous W2K16 RAID6 setup, where I could consistently saturate the 1Gbps link for reads and writes, on the same hardware and using the same disks.

[Charts: robocopy transfer results (Robocopy.1, Robocopy.2)]

It is very disappointing to discover the poor SMB performance. I reported my findings to the Unraid support forum, and I hope they can do something to improve performance, or maybe invalidate my findings.

Unraid and Robocopy Problems

In my last post I described how I converted one of my W2K16 servers to Unraid, and how I am preparing for conversion of the second server.

As I’ve been copying all my data from W2K16 to Unraid, I discovered some interesting discrepancies between W2K16 SMB and Unraid SMB. I use robocopy to mirror files from one server to the other, and once the first run completes, any subsequent runs should complete without needing to copy any files again (unless they were modified).

First, you have to use the /fft (FAT file times) option, i.e. “robocopy.exe [source] [destination] /mir /fft”, allowing for 2 seconds of drift in file timestamps.

I found a large number of files that would copy over and over with no changes to the source files. I also found a particular folder that would “magically” show up on Unraid and could not be deleted from the Unraid share by robocopy.

After some troubleshooting, I discovered that files with old timestamps, and folder names that end in a dot, do not copy correctly to Unraid.

I looked at the files that would not copy, and discovered that their modified timestamps were all set to “1 Jan 1970 00:00”. I experimented by changing the modified timestamp to today’s date, and the files copied correctly. It seems that if the modified timestamp on the source file is older than 1 Jan 1980, the modified timestamp on Unraid for the newly created file will always be set to 1 Jan 1980. When running robocopy again, the source files will always be reported as older, and the files copied again.

Below is an example of a folder of test files with a created date of 1 Jan 1970 UTC: I copy the files using robocopy, and copy them again. The second run of robocopy again copies all the files, instead of reporting them as similar. One can see that the destination timestamp is set to 1 Jan 1980, not 1 Jan 1970 as expected.
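
A quick way to create such a test file, if you want to reproduce this (a PowerShell sketch; paths are illustrative):

# create a file and force its modified time back to the Unix epoch
New-Item -ItemType File 'd:\temp\out\epoch.txt' -Force | Out-Null
(Get-Item 'd:\temp\out\epoch.txt').LastWriteTimeUtc = [datetime]::new(1970, 1, 1, 0, 0, 0, [DateTimeKind]::Utc)

# mirror twice; the second run should report the file as unchanged, but against Unraid it copies again
robocopy d:\temp\out \\storage\testmnt\in /mir /fft
robocopy d:\temp\out \\storage\testmnt\in /mir /fft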

The second set of problem files occurs with folder names ending in a dot. Unraid ignores the dots at the end of the folder names, and when another folder exists without dots, the copy operation uses the wrong folder.

Below is an example of a folder that contains two directories, one named “LocalState”, and one named “LocalState..”. I robocopy the folder contents, and when running robocopy again, it reports an extra folder. That extra folder gets “magically” created in the destination directory, but the “LocalState..” folder is missing.

The same robocopy operations to the W2K16 server over SMB works as expected.

From what I researched, the timestamp range for NTFS is 1 January 1601 to 14 September 30828, for FAT it is 1 January 1980 to 31 December 2107, and for EXT4 it is 1 January 1970 to 19 January 2106 (2038 + 408). I could not create files with a date earlier than 1 Jan 1980, but I could set file modified timestamps to dates greater than 2106, so I do not know what the Unraid timestamp range is.

Creating and accessing directories with trailing dots requires special care on Windows using the NT style notation, e.g. CreateDirectoryW(L"\\\\?\\C:\\Users\\piete\\Unraid.Badfiles\\TestDot..", NULL), but robocopy does handle that correctly against W2K16 SMB.

I don’t know if the observed behavior is specific to Unraid SMB, or if it would apply to Samba on Linux in general. But, it posed a problem as I wanted to make sure I do indeed have all files correctly backed up.

I decided to write a quick little app to find problem files and folders. The app iterates through all files and folders, fixes timestamps that are out of range, and reports files or folders that end in a dot. I ran it through my files, it fixed the timestamps for me, and I deleted the folders ending in a dot by hand. Multiple robocopy runs now complete as expected.
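
For what it is worth, the core of that app boils down to something like this PowerShell sketch (my app differs in the details; the 1 Jan 1980 cutoff follows the behavior described above, and the root path is illustrative):

$cutoff = [datetime]::new(1980, 1, 1, 0, 0, 0, [DateTimeKind]::Utc)
Get-ChildItem 'd:\temp\out' -Recurse -Force | ForEach-Object {
    # clamp out-of-range modified timestamps so robocopy comparisons become stable
    if ($_.LastWriteTimeUtc -lt $cutoff) {
        $_.LastWriteTimeUtc = $cutoff
        Write-Host "Fixed timestamp: $($_.FullName)"
    }
    # names ending in a dot need NT-style paths to manipulate, so just report them
    if ($_.Name.EndsWith('.')) {
        Write-Host "Trailing dot: $($_.FullName)"
    }
}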

eNom Dynamic DNS Update Problems

Update: On 27 July 2018 eNom support notified me by email that the issue is resolved. I tested it, and all is back to normal with DNS-O-Matic.

Sometime between 12 May 2018 and 24 May 2018 the eNom dynamic DNS update mechanism stopped working.

I use the very convenient DNS-O-Matic dynamic DNS update service to update my OpenDNS account, and several host records at eNom, pointing them to my home IP address.

I was first alerted to the problem by a DNS-O-Matic status failure email, but as I was about to get on a plane for a business trip, I ignored the issue, hoping it was temporary.

eNom response for 'foo.bar.net':
--------------------
;URL Interface
;Machine is SJL0VWAPI03
;Encoding Type is utf-8
Command=SETDNSHOST
APIType=API.NET
Language=eng
ErrCount=1
Err1=Domain name not found
ResponseCount=1
ResponseNumber1=316153
ResponseString1=Validation error; not found; domain name(s)
MinPeriod=1
MaxPeriod=10
Server=sjl0vwapi03
Site=eNom
IsLockable=
IsRealTimeTLD=
TimeDifference=+0.00
ExecTime=0.053
Done=true
TrackingKey=5d09a343-b2d6-44e2-8d70-0ad9bcabcb8d
RequestDateTime=6/21/2018 6:11:11 PM
--------------------

Here is the update history from DNS-O-Matic:

47.44.1.123, Jun 29, 2018 4:58 pm, ERROR
47.44.1.123, Jun 29, 2018 4:53 pm, ERROR
47.44.1.123, Jun 21, 2018 6:11 pm, ERROR
47.44.1.123, May 24, 2018 6:10 pm, ERROR
47.44.1.124, May 12, 2018 8:56 am, OK
47.44.1.124, May 4, 2018 2:48 pm, OK
47.44.1.124, May 3, 2018 1:42 pm, OK
47.44.1.124, Apr 1, 2018 12:39 pm, OK
47.44.1.124, Apr 1, 2018 9:58 am, OK
47.44.1.124, Mar 24, 2018 5:06 pm, OK

As of yesterday, I could not find any other reports of similar issues on Google, and the eNom status page showed no problems.

I use a Ubiquiti UniFi Security Gateway Pro as my home router, and I have the dynamic DNS service in the UniFi controller configured to point to DNS-O-Matic, but it offered no additional hints as to the cause of the problem.

[Screenshot: UniFi controller dynamic DNS configuration pointing to DNS-O-Matic]

I contacted eNom support over chat, and they informed me they know there is an issue, and they said I should use the following format for the update:

http://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=%1&PW=%2&Zone=%3&DomainPassword=%4

%1 = Is username in Enom
%2 = Is password
%3 = Is my host and domain
%4 = Is my domain access password

This was interesting: I had looked at several eNom update scripts, even the eNom sample code, and they all used a different command format. I looked up the SetDNSHost documentation, and sure enough, it looks like eNom changed the API.

Old format:

https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&HostName=[host]&Zone=[domain]&DomainPassword=[password]&Address=[IP]

New format:

https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&PW=[LoginPassword]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]

eNom changed the meaning of the “Zone” parameter to be the fully qualified domain name, and added the requirement to include the account username and password.

I tried the old format in my browser, and I got the same “Domain name not found” error. As I tried the URL, I noticed that HTTPS failed with a certificate mismatch. The certificate for https://dynamic.name-services.com points to reseller.enom.com.

Broken SSL and embedding my account username and password was not an acceptable option; additionally, I use 2FA on my account, so I had doubts that my password would even work. I tried the command as described in the documentation, but omitted my account password, and it worked.

https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]

I still find it very weird that this has been broken for so long, and that I could not find other reports of the problem on Google; are people not using eNom or eNom resellers with dynamic DNS?

I also find it disappointing that the status page does not reflect this problem, and that the SSL domain does not match; one would expect more from a domain company.

Until eNom fixes the problem, or until DNS-O-Matic updates support for the new API format, I created a PowerShell script to update my domains, maybe it is useful for others with the same problem.

$UserName = 'eNom account username'
$HostNames = @('www', 'name1', 'name2', 'etc')
$DomainName = 'yourdomain.com'
$Password = 'Domain change password'
 
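# look up the current public IP address via DNS-O-Matic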
$url = 'http://myip.dnsomatic.com'
$webclient = New-Object System.Net.WebClient
$result = $webclient.DownloadString($url)
Write-Host $result
$IPAddress = $result.ToString()
$webclient.Dispose()

# Ignore SSL error caused by dynamic.name-services.com SSL certificate pointing to a different domain
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = {$true}
$webclient = New-Object System.Net.WebClient
foreach ($hostname in $HostNames)
{
    # https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&HostName=[host]&Zone=[domain]&DomainPassword=[password]&Address=[IP]
    # https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]
    $url = "https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=$UserName&Zone=$hostname.$DomainName&DomainPassword=$Password&Address=$IPAddress"
    Write-Host $url
    $result = $webclient.DownloadString($url);
    Write-Host $result
}
$webclient.Dispose()
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = $null

Razer BSOD When Driver Verifier is Enabled

Not too long ago I complained about Razer’s poor UX and support; this time it is a BSOD in one of their drivers, and the forever crashing Razer Stargazer camera software.

I’ve been looking for a Windows Hello capable webcam, and the Razer Stargazer, based on Intel RealSense technology, looked promising. The device is all metal and tactical looking, but the software experience is buggy beyond belief: install this, install that, then crash after crash after crash. I ended up returning it for a refund and got a Logitech BRIO instead; the BRIO is cheaper, and works great.

A couple of days ago I was greeted with a BSOD on one of my test machines, a crash in the RZUDD.SYS “Razer Rzudd Engine” driver, part of the Razer Synapse software. What makes this interesting is that the issue seems to be triggered by having Driver Verifier enabled.

[Photo: BSOD showing the crash in RZUDD.SYS]

One may be tempted to say do not enable Driver Verifier, but the point of Driver Verifier is to help detect bugs in drivers, and passing it is a basic requirement for driver certification. Per the WinDbg analysis, this appears to be a memory corruption bug. After some searching, I found that the Driver Verifier BSOD has been reported by other users, with no acknowledgement and no fix forthcoming. I contacted Razer support, and not surprisingly, they suggested an uninstall and reinstall. I tried the community forums, and was just pointed back to support.

FAULTING_IP:
rzudd+28c80
...
DEFAULT_BUCKET_ID:  CODE_CORRUPTION
...
PROCESS_NAME:  RzSynapse.exe
...
STACK_TEXT:
nt!KeBugCheckEx
nt!MiSystemFault+0x12e69c
nt!MmAccessFault+0xae6
nt!KiPageFault+0x132
rzudd+0x28c80
rzudd+0x218d4
rzudd+0x7a9f
Wdf01000!FxIoQueue::DispatchRequestToDriver+0x1bf [minkernel\wdf\framework\shared\irphandlers\io\fxioqueue.cpp @ 3325]
Wdf01000!FxIoQueue::DispatchEvents+0x3bf [minkernel\wdf\framework\shared\irphandlers\io\fxioqueue.cpp @ 3125]
Wdf01000!FxPkgIo::DispatchStep1+0x53e [minkernel\wdf\framework\shared\irphandlers\io\fxpkgio.cpp @ 324]
Wdf01000!FxDevice::DispatchWithLock+0x5a5 [minkernel\wdf\framework\shared\core\fxdevice.cpp @ 1430]
nt!IovCallDriver+0x245
...
FAILURE_BUCKET_ID:  MEMORY_CORRUPTION_LARGE

I am done with Razer, exciting promises for technology on paper, great looking hardware, terrible support, terrible software.

Razer Shoddy Support and Bad Software UX

This post is just me venting my frustration at Razer’s poor software user experience, and their shoddy support practices. I’m writing this after I just had to go and find a working mouse, so I could click a button on a dialog that had no keyboard navigation support.

I’ve been using Razer keyboards and mice for some time, love them, their software not so much. I had to replace an aging ThinkPad, and the newly released Razer Blade Stealth looked like a great candidate, small and fast, reasonably priced, should be perfect, well, not so much.

I keep my monitors color calibrated, and I cringe whenever I see side-by-side monitors that clearly don’t match, or when somebody creates graphic content (yes, you graphic artists using MacBooks to create content for PC software without proper color profiles) that looks like shades of vomit on a projector or a cheap screen, but I digress. My monitor of choice is NEC with their native SpectraView color calibration software. Unfortunately the Blade, with its lower end Intel graphics processor and HDMI port, does not support DDC/CI, so there is no ability to color calibrate my monitor. My main monitor is a NEC MultiSync EA275UHD 4K monitor, and the internal Intel graphics processor is frustratingly slow on this high resolution display. And the HDMI connectivity would drop out whenever the monitor went into power saving mode. Why not use a more standard mini-DisplayPort connector? It would not have solved the speed problem, but it would at least have resolved the connection reliability and allowed for proper color calibration.

To solve the problem, I decided to get a Razer Core with an EVGA GeForce GTX 1070 graphics adapter. The Core is an external USB and network dock, with a PSU and PCIe connector for a graphics card, all connected to the notebook by Thunderbolt 3 over a too-short USB-C cable. I connected my monitor to the GTX 1070 DisplayPort connector; connectivity was fine, I could color calibrate my monitor, and the display performance with the GTX 1070 was fast. Great. By the way, JayzTwoCents has a great video on the performance of external graphics cards.

But my USB devices connected to the dock kept on dropping out. I found several threads on the Razer support forum complaining about the same USB problems, and the threads are promptly closed with a contact support message. I contacted Razer support and they told me they are working on the problem, and closed my ticket. I contacted them again stating that closing my ticket did not resolve the problem, and they said my choice was to RMA the device, with no known solution, or wait; then they closed my ticket again. To this day the issue has not been resolved, and I have to connect my USB devices directly to the notebook, defeating the purpose of a dock. They did publish a FAQ advising users to not use 2.4GHz WiFi, but to stick with 5GHz due to interference issues; so much for their hardware testing.

Now, let’s talk about their Razer Synapse software, the real topic of this post. The software is used to configure all the Razer devices, and syncs the device preferences across computers with a cloud account; a neat idea. The color scheme and custom drawn controls of this software match their edgy “brand”, but their choice of thin grey font on a dark background fails in my usability book when used in a brightly lit office space.

[Screenshot Synapse.1: Razer Synapse UI]

Whenever Windows 10 updates, the stupid Synapse software pops up while the Windows install is still in progress; if you say yes, install now, the Synapse install then, as expected, fails because Windows is still installing. I logged the issue with Razer support, and they told me it is behaving as designed. Really, designed to fail.

[Screenshot Synapse.2: Synapse update prompt during a Windows update]

So, today the Synapse software, again, prompts me to update, a frequent occurrence, and my mouse dies during the update, presumably because they updated the mouse driver, but this time I am prompted with a reboot-required dialog. Dead mouse, no problem, have keyboard, tab over, wait, no keyboard navigation on the stupid owner-drawn custom-control dialog, no way to interact with the dialog without a mouse. Just fail.

[Screenshot Synapse.3: reboot-required dialog]

Moral of the story: UX is important, people. I should just stick with ThinkPad or Microsoft Surface Book hardware; it costs more, but it never disappoints.

Circumventing ThinkPad’s WiFi Card Whitelisting

What started as a simple Mini PCI Express WiFi card swap on a ThinkPad T61 notebook, turned into deploying a custom BIOS in order to get the card to work.

I love ThinkPad notebooks; they are workhorses that keep going and going. I always keep my older models around for testing, and one of my old T61’s had an Intel 4965AGN card that worked fine with Windows 10, until the release of the Anniversary / Redstone 1 update. After the RS1 update, WiFi would either fail to connect, or randomly drop out. The 4965AGN card is not supported by Intel on Win10, and the internet is full of problem reports about Win10 and 4965AGN cards.

Ok, no problem, I’ll just get a cheap, reasonably new Mini PCIe WiFi card with Win10 support, and swap the card. I got an Intel 3160 dual band 802.11ac card and mounting bracket for about $20. The 3160 is a circa 2013 card with Win10 support. I installed the card, booted, and got a BIOS error: 1802: Unauthorized network card is plugged in.

This led me to the discovery of ThinkPad hardware whitelisting, where the BIOS only allows specific cards to be used, which in turn led me to Middleton’s BIOS, a custom T61 BIOS that removes the hardware whitelisting and enables SATA-2 support. I found working download links to the v2.29-1.08 Middleton BIOS here.

The BIOS update is packaged as a Win7 x86 executable or a bootable DOS ISO image. As I’m running Win10 x64 and could not find any CD-R discs around, I used Rufus to create a bootable DOS USB key, and extracted the ISO contents using 7-Zip to a directory on the USB key. The ISO is built from a bootable 1.44MB DOS floppy image whose AUTOEXEC.BAT launches “FLASH2.EXE /U”, so I created a batch file that does the same (sketched below).
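A minimal sketch of that batch file, assuming the extracted ISO contents sit in the root of the USB key; the FLASH2.EXE /U invocation comes straight from the ISO’s AUTOEXEC.BAT, the rest is my own:

FLASH.BAT:
@ECHO OFF
REM Launch the BIOS flasher the same way the ISO's AUTOEXEC.BAT does
FLASH2.EXE /U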

I removed the WiFi card, booted from USB, ran the flash, and got error 1, complaining that flashing over the LAN is disabled. Ok, I enabled BIOS flashing over the LAN in the BIOS setup, and rebooted.

I ran the update again, and this time got error 99, complaining that BitLocker is enabled and must be temporarily disabled. I did not have BitLocker enabled, so I removed the hard drive and tried again; same error. It must be something in the BIOS, so I disabled the security chip, tried again, and the update started, but a minute or so later the screen went crazy with INVALID OPCODE messages.

Hmm, maybe the updater does not like the FreeDOS boot image used by Rufus. Ok, let me create an MS-DOS USB key; uhh, on Win10 that turned out to be near impossible. Win10 does not include the MS-DOS files, Rufus does not support custom locations for MS-DOS files, nor does it support getting them from floppy or CD images (readily available for download), the HP USB Disk utility complains my USB drive is locked, and writing a raw floppy image to USB (sketched below) results in a FAT12 disk structure that is too small to use. I say near impossible because I gave up, and instead went looking for an existing MS-DOS USB key I had made a long time ago. I am sure that with a bit more persistence I could have found a way to create MS-DOS bootable USB keys on Win10, but that is an exercise for another day.
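For the record, the raw image write attempt looked something like this; a sketch assuming dd for Windows, a hypothetical image file name, and that the USB key is \\.\PhysicalDrive1 (double check the drive number, raw writes are destructive):

dd if=msdos144.img of=\\.\PhysicalDrive1 bs=512

The write succeeds, but the key ends up as a 1.44MB FAT12 volume, too small to be of any use.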

Trying again with an MS-DOS USB key, and voilà, the BIOS flashed, and WiFi is working.

I am annoyed that I had to go to this much trouble to get the new WiFi card working, but the best part of the exercise turned out to be the SATA-2 speed increase. This machine has an SSD that I always found to be slow, but with the SATA-2 speed bump in Middleton’s BIOS, the machine is noticeably snappier.

A couple of hours later my curiosity got the better of me, and I made my own version of Rufus that allows formatting MS-DOS USB drives on Win10. In the process I engaged in an interesting discussion with the author of Rufus. I say interesting, but it was rather frustrating: Microsoft removed the MS-DOS files from Win10, and Rufus refuses to add support for sourcing MS-DOS files from a user-specified location, citing legal reasons and my reluctance to first report the issue to FreeDOS. Anyway, can code, have compiler, if have time, will solve problem.

UPS Battery Replacement Turns Into Unrecoverable Firmware Update

Two lessons learned: do not trust scheduled battery tests, and leave working firmware be!

As the saying goes, if it is not broken, do not fix it, especially when it comes to firmware.

I have a couple of APC Smart-UPS’s at my house, the same models I like to use at the office: SMT750 units with AP9631 Network Management Cards. The problem started when we had a short power outage, and the UPS that powers the home network switch, cell repeater, alarm internet connection, and PoE IP cameras unexpectedly died. A battery replacement led to the opportunity to do a UPS firmware update, which led to an unrecoverable firmware update.

It started when I woke up one morning and it was obvious the power had been out: the first indicator was the blinking clocks on the kitchen appliances, the second the numerous power failure email notifications, and the emails that stood out were from the alarm system, saying it had lost power and internet connectivity. The alarm has its own backup battery, the network switches and FiOS internet have their own battery backups, and the outage was only about 4 minutes. So how is it that the UPS died, killing the switch and disconnecting the internet, when the outage was only 4 minutes and the typical runtime on this UPS should be about an hour?

Here is the UPS outage log produced by the NMC card:

10/11/2016 06:53:05 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 06:12:27 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 06:12:24 Device UPS: Restored the local network management interface-to-UPS communication. 0x0101
10/11/2016 06:12:12 System Network service started. IPv6 address FE80::2C0:B7FF:FE98:9BAF assigned by link-local autoconfiguration. 0x0007
10/11/2016 06:12:10 Device Environment: Restored the local network management interface-to-integrated Environmental Monitor (Universal I/O at Port 1) communication. 0x0344
10/11/2016 06:12:09 System Network service started. System IP is 192.168.1.11 from manually configured settings. 0x0007
10/11/2016 06:12:02 System Network Interface coldstarted. 0x0001
10/11/2016 05:45:36 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 05:45:36 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 05:45:35 Device UPS: The output power is turned off. 0x0114
10/11/2016 05:45:35 Device UPS: The graceful shutdown period has ended. 0x014F
10/11/2016 05:45:35 Device UPS: No longer on battery power. 0x010A
10/11/2016 05:45:35 Device UPS: Main outlet group, UPS Outlets, has been commanded to shutdown with on delay. 0x0174
10/11/2016 05:45:35 Device UPS: The power for the main outlet group, UPS Outlets, is now turned off. 0x0135
10/11/2016 05:45:18 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 05:41:36 Device UPS: On battery power in response to rapid change of input. 0x0109

I could see from the log that the UPS battery ran out within 4 minutes: 05:41:36 on battery, 05:45:18 battery too low, 05:45:35 output turned off. The UPS status page was equally puzzling: the load was at 9.7%, yet the reported runtime was only 5 minutes. Impossible.
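A quick back-of-the-envelope check using the UPS’s own numbers makes the point; the only outside figure is the SMT750’s rated output of 500W, which I am taking from the spec sheet:

9.7% load x 500W ≈ 48W
48W x 1.8h (the 1h48m runtime later reported with a new battery) ≈ 87Wh usable
48W x 4min ≈ 3Wh usable

The old battery was delivering only a few percent of the energy a healthy battery provides.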

Here is the status screenshot:

[Screenshot apc-1: UPS status page with the failed battery]

I ran a manual battery test and the test passed, yet from the log it was clear the battery had failed. I have bi-weekly scheduled battery tests on all my UPS’s and never received a failure report. So what is the point of a battery test if it comes back clean when the logs clearly show the battery has failed?

Here is the log:

10/11/2016 18:07:06 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 18:07:06 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 18:07:05 Device UPS: Self-Test passed. 0x0105
10/11/2016 18:06:59 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 18:06:59 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 18:06:58 Device UPS: Self-Test started by management device. 0x0137
10/11/2016 18:06:56 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107

I asked for advice on the APC forum (no reply yet), and ordered a replacement RBC48 battery. I received the battery, installed it, and the reported runtime is back to normal: 1 hour 48 minutes.

Here is a status screenshot with the new battery:

[Screenshot apc-2: UPS status page with the new battery]

Here is where I should have stopped and called it a day, but no. I knew the UPS’s were on old firmware, and I decided to use the opportunity to update it. I’d normally leave firmware be unless I have a good reason to update, but I convinced myself that the new firmware readme listed fixes that might help with the false pass on the battery test:

Release Notes (UPS09.3):
========================
2. Improved self-test logging for PCBE / NMC.
11. Repaired an occasional math error in the battery replacement date algorithm that resulted in incorrect dates.

I updated the UPS where I had just replaced the battery; the instructions are pretty simple. The only hassle is that I had to bypass the UPS and put the network equipment on mains power, so that I could turn the UPS outputs off during the firmware update while maintaining network connectivity.

I did the same for my office UPS, PC and office switch on mains power, and when I powered down the output, I made sure not to notify PowerChute Network Shutdown (PCNS) clients, as my PC had the PowerChute client installed to receive power state over the network. I started the firmware update over the network, and a few seconds later I got a Windows message that a shutdown had been initiated by PCNS. What? I sat there in frustration, nothing to do but watch my PC shut down while it was still delivering the firmware update.

On rebooting my PC, the NMC came up, but reported that the UPS had stopped communicating. I pulled AC power from the UPS, no change; I also pulled the batteries, and when I plugged the batteries and mains back in, beeeeeeeeeep. The NMC now reports no UPS found, while the UPS LCD panel reports all is fine. And still beeeeeeeeeep, with no way to stop the beeeeeeeeeep.

Here is the NMC status page:

[Screenshot apc-3: NMC status page]

I tried a Firmware Upgrade Wizard update via USB: I plugged in a USB cable, the PC saw the UPS and reported a critical condition, but the upgrade wizard reported no UPS found on USB.

Here is the wizard error page:

[Screenshot apc-4: Firmware Upgrade Wizard error]

So here I am, stuck with a bricked UPS. Lesson learned, actually two lessons learned: do not trust scheduled battery tests, and leave working firmware be!