## Unraid in production, a bit rough around the edges, and terrible SMB performance

In my last two posts I described how I migrated from W2K16 and hardware RAID6 to Unraid. Now that I’ve had two Unraid servers in production for a while, I’ll describe some of the good and not so good I experienced.

Running Docker on Unraid is orders of magnitude easier than getting Docker to work on Windows. Docker allowed me to move all but one of my workloads from VMs to containers, simplifying updates, reducing the memory footprint, and improving performance.

For my IP security camera NVR software I switched from Milestone XProtect Express running on a W2K16 VM to DW Spectrum running on an Ubuntu Server VM. DW Spectrum is the US brand name for the Nx Witness product. I chose to switch from XProtect to Nx Witness (sold as DW Spectrum in the US) because Nx Witness is lighter in resource consumption, easier to deploy, easier to update, has perpetual licenses, includes native remote viewing, and has an official Docker release forthcoming.

I have been a long-time user of CrashPlan, and I switched to CrashPlan Pro when they stopped offering a consumer product. I tested CrashPlan Pro and Duplicati containers on Unraid, with Duplicati backing up to Backblaze B2. Duplicati is the clear winner: backups were very fast, and completed in about 3 days. By contrast, I stopped CrashPlan after 5 days, when it estimated another 18 days to complete the same backup operation and showed the familiar out of memory error. My B2 storage cost will be a little higher than a single seat license for CrashPlan Pro, but the Duplicati plus B2 functionality and speed is superior.

When the Unraid 6.7.0 release went public, I immediately updated, but soon realized my mistake when several plugins stopped working. It took several weeks before plugin updates were released that restored full functionality. It is worth mentioning, again, that I find it strange that Unraid without community provided plugins is really not that usable, yet that functionality still lives in community provided plugins, not in Unraid itself. Next time I will wait a few weeks for the dust to settle in the plugin community before updating.

Storage and disk management is reasonably easy, and much more flexible than hardware RAID management. But adding and removing disks is still mostly a manual process, and doing it without invalidating parity is very cumbersome and time consuming. Several times I gave up on the convoluted steps required to add or remove disks without invalidating parity, and just reconfigured the array and then rebuilt parity, hoping nothing would go wrong during the parity rebuild. This is in my opinion a serious shortcoming, maybe not in technology, but in the lack of an easy to use and reliable workflow that helps retain redundant protection at all times.

In order to temporarily make enough storage space in my secondary server, I removed all the SSD cache drives and replaced them with 12TB Seagate IronWolf drives.
I did move all the data that used to be on the cache to regular storage, including the docker appdata folder. This should not be a big deal, but I immediately started getting SQLite DB corruption errors in apps like Plex, that store data in SQLite on the appdata share. After some troubleshooting I found many people complaining about this issue, which seems to have been exacerbated by the recent Unraid 6.7.0 update. Apparently this is a known problem with the FUSE filesystem used by Unraid. FUSE dynamically spans shares and folders across disks, but apparently breaks the file and file-region locking required by SQLite. The recommended workaround is to put all files that need working locks on the cache, or on a single disk, effectively bypassing FUSE. If it is FUSE that breaks file locking behavior, I find it troubling that this is not considered a critical bug.

I am quite familiar with VM snapshot management using Hyper-V and VMware, it is a staple of VM management. In Unraid I am using a Docker based Virt-Manager, which seems far less flexible, but more importantly, fails to take snapshots of UEFI based VMs. Apparently this is a known shortcoming. I have not looked very hard for alternatives, but this seems to be a serious functional gap compared to the snapshot capabilities of Hyper-V or VMware.

As I started using the SMB file shares, now hosted on Unraid, in my regular day to day activities, I noticed that under some conditions the write speed becomes extremely slow, often dropping to around 2MB/s. This seems to happen when there are other file read operations in progress, and even a few KB/s of reads can drastically reduce the array SMB write performance. Interestingly the issue does not appear to affect my use of rsync between Unraid servers, only SMB. I did find at least one other recent report of similar slowdowns, where only SMB is affected.
Since the problem appeared to be specific to Unraid SMB, and not general network performance, I compared the Unraid SMB performance with Windows SMB in a W2K19 VM running on the same Unraid system. By running W2K19 as a VM on the same Unraid system, the difference in performance will be mostly the SMB stack, not hardware or network. On Unraid I created a share that is backed by the SSD cache array; that same SSD cache array holds the W2K19 VM disk image, so the storage subsystems are similar. I ran a similar test against an Unraid share backed by disk instead of cache.

I found a few references (1, 2) to SMB benchmarking using DiskSpd, and I used them as a basis for the test options I used.

Start by creating a 64GB test file on all test shares; we reuse the file, and it saves a lot of time to not recreate it every time. Note, we get a warning when creating the file on Unraid, due to SetFileValidData() not being supported by Unraid’s SMB implementation, but that should not be an issue.

    >diskspd.exe -c64G \\storage\testcache\testfile64g.dat
    WARNING: Could not set valid file size (error code: 50); trying a slower method of filling the file (this does not affect performance, just makes the test preparation longer)
    >diskspd.exe -c64G \\storage\testmnt\testfile64g.dat
    WARNING: Could not set valid file size (error code: 50); trying a slower method of filling the file (this does not affect performance, just makes the test preparation longer)
    >diskspd.exe -c64G \\WIN-EKJ8HU9E5QC\TestW2K19\testfile64g.dat

I ran several tests similar to the following commandlines:

    >diskspd -w50 -b512K -F2 -r -o8 -W60 -d120 -Srw -Rtext \\storage\testcache\testfile64g.dat > d:\diskspd_unraid_cache.txt
    >diskspd -w50 -b512K -F2 -r -o8 -W60 -d120 -Srw -Rtext \\storage\testmnt\testfile64g.dat > d:\diskspd_unraid_mnt.txt
    >diskspd -w50 -b512K -F2 -r -o8 -W60 -d120 -Srw -Rtext \\WIN-EKJ8HU9E5QC\TestW2K19\testfile64g.dat > d:\diskspd_w2k19.txt

For a full explanation of the commandline arguments see here.
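Because the runs covered block sizes from 4KB to 2048KB, the command lines can be generated rather than typed by hand. The following is only a minimal Python sketch of that idea, not the way the tests were actually driven. The flags are copied from the commands above; the assumption that the block sizes double at each step, and the per-size output file names, are hypothetical and only there for illustration.

```python
def diskspd_command(target, out_file, block_kb, write_pct=50):
    """Build one DiskSpd command line using the flags from the commands above."""
    args = [
        "diskspd",
        f"-w{write_pct}",   # percentage of writes
        f"-b{block_kb}K",   # block size in KB
        "-F2", "-r", "-o8", "-W60", "-d120", "-Srw", "-Rtext",
        target,
    ]
    return " ".join(args) + f" > {out_file}"


# Assumption: block sizes double from 4KB up to 2048KB.
BLOCK_SIZES_KB = [4 * 2 ** i for i in range(10)]

if __name__ == "__main__":
    for kb in BLOCK_SIZES_KB:
        # Hypothetical output file naming, one result file per block size.
        print(diskspd_command(r"\\storage\testcache\testfile64g.dat",
                              rf"d:\diskspd_unraid_cache_{kb}k.txt", kb))
```

The printed lines can be pasted into a batch file, which keeps every run on identical flags apart from the block size.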
The test will do 50% read and 50% write, block sizes varied from 4KB to 2048KB, 2 threads, 8 outstanding IO operations, random aligned IO, warm up for 60s, run for 120s, disable local caching for remote filesystems.

From the results we can see that the Unraid SMB performance for this test is pretty poor. I redid the tests, this time doing independent read and write tests, and instead of various block sizes, I just did a 512KB block size test (I got lazy). No matter how we look at it, the Unraid SMB write performance is still really bad.

I wanted to validate the synthetic test results with a real world test, so I collected a folder containing around 65.2GB of fairly large files, on SSD, and copied the files up and down using robocopy from my Win10 system. I chose the total size of the files to be about double the size of the memory on the Unraid system, so that the impact of caching is minimized. I made sure to use a RAW VM disk to eliminate any performance impact of growing a QCOW2 image file.

    >robocopy d:\temp\out \\storage\testmnt\in /mir /fft > d:\robo_pc_mnt.txt
    >robocopy d:\temp\out \\storage\testcache\in /mir /fft > d:\robo_pc_cache.txt
    >robocopy d:\temp\out \\WIN-EKJ8HU9E5QC\TestW2K19\in /mir > d:\robo_pc_w2k19.txt
    >robocopy \\storage\testmnt\in d:\temp\in /mir /fft > d:\robo_mnt_pc.txt
    >robocopy \\storage\testcache\in d:\temp\in /mir /fft > d:\robo_cache_pc.txt
    >robocopy \\WIN-EKJ8HU9E5QC\TestW2K19\in d:\temp\in /mir > d:\robo_w2k19_pc.txt

During the robocopy to Unraid I noticed that sporadically the Unraid web UI, and web browsing in general, becomes very slow. This never happens while copying to W2K19. I can’t explain this: I see no errors reported in my Win10 client eventlog or resource monitor, I see no unusual errors on the network switches, and no errors in Unraid. I suspect whatever is impacting SMB performance is affecting network performance in general, but without data I am really just speculating.
The robocopy read results are pretty even, but the write results again show inferior Unraid SMB write performance. Do note that the W2K19 VM is still not as fast as my previous W2K16 RAID6 setup, where I could consistently saturate the 1Gbps link for reads and writes, on the same hardware and using the same disks.

It is very disappointing to discover the poor SMB performance. I reported my findings to the Unraid support forum, and I hope they can do something to improve performance, or maybe invalidate my findings.

## Unraid and Robocopy Problems

In my last post I described how I converted one of my W2K16 servers to Unraid, and how I am preparing for conversion of the second server.

As I’ve been copying all my data from W2K16 to Unraid, I discovered some interesting discrepancies between W2K16 SMB and Unraid SMB. I use robocopy to mirror files from one server to the other, and once the first run completes, any subsequent runs should complete without needing to copy any files again (unless they were modified).

First, you have to use the /fft option, as in “robocopy.exe [source] [dest] /mir /fft”, for FAT file times, allowing for 2 seconds of drift in file timestamps.

I found a large number of files that would copy over and over with no changes to the source files. I also found a particular folder that would “magically” show up on Unraid, and could not be deleted from the Unraid share by robocopy. After some troubleshooting, I discovered that files with old timestamps, and folder names that end in a dot, do not copy correctly to Unraid.

I looked at the files that would not copy, and I discovered that the file modified timestamps were all set to “1 Jan 1970 00:00”. I experimented by changing the modified timestamp to today’s date, and the files copied correctly. It seems that if the modified timestamp on the source file is older than 1 Jan 1980, the modified timestamp on Unraid for the same newly created file will always be set to 1 Jan 1980.
When robocopy then runs again, the source files will always be reported as older, and the files are copied again.

Below is an example of a folder of test files with a created date of 1 Jan 1970 UTC. I copy the files using robocopy, and copy them again. The second run of robocopy again copies all the files, instead of reporting them as similar. One can see that the destination timestamp is set to 1 Jan 1980, not 1 Jan 1970 as expected.

The second set of problems occurs with folder names ending in a dot. Unraid ignores the dots at the end of the folder names, and when another folder exists without dots, the copy operation uses the wrong folder.

Below is an example of a folder that contains two directories, one named “LocalState”, and one named “LocalState..”. I robocopy the folder contents, and when running robocopy again, it reports an extra folder. That extra folder gets “magically” created in the destination directory, but the “LocalState..” folder is missing.

The same robocopy operations to the W2K16 server over SMB work as expected.

From what I researched, the timestamp range for NTFS is 1 January 1601 to 14 September 30828, FAT is 1 January 1980 to 31 December 2107, and EXT4 is 1 January 1970 to 19 January 2106 (2038 + 408). I could not create files with a date earlier than 1 Jan 1980, but I could set file modified timestamps to dates greater than 2106, so I do not know what the Unraid timestamp range is.

Creating and accessing directories with trailing dots requires special care on Windows using the NT style notation, e.g. “CreateDirectoryW(L"\\\\?\\C:\\Users\\piete\\Unraid.Badfiles\\TestDot..", NULL)”, but robocopy does handle that correctly on W2K16 SMB.

I don’t know if the observed behavior is specific to Unraid SMB, or if it would apply to Samba on Linux in general. But it posed a problem, as I wanted to make sure I do indeed have all files correctly backed up. I decided to write a quick little app to find problem files and folders.
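As a rough illustration of the idea (this is not the app itself, just a minimal Python sketch), the two checks that follow from the problems described above are: a modified timestamp older than 1 Jan 1980, and a file or folder name ending in a dot. The function name `scan`, the `fix_timestamps` flag, and the choice to reset old timestamps to the current time are all assumptions made for this example. On Windows, walking folders with trailing dots would likely also need the `\\?\` path prefix, which is not handled here.

```python
import os
import time
from datetime import datetime, timezone

# 1 Jan 1980 UTC, the oldest modified time that survived the copy to Unraid.
FAT_EPOCH = datetime(1980, 1, 1, tzinfo=timezone.utc).timestamp()


def scan(root, fix_timestamps=False):
    """Return (old_files, dot_names) found under root.

    old_files: files whose modified time is older than 1 Jan 1980.
    dot_names: files or folders whose name ends in a dot.
    """
    old_files, dot_names = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name.endswith("."):
                dot_names.append(os.path.join(dirpath, name))
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if info.st_mtime < FAT_EPOCH:
                old_files.append(path)
                if fix_timestamps:
                    # Assumption: resetting to "now" is an acceptable fix.
                    os.utime(path, (info.st_atime, time.time()))
    return old_files, dot_names


if __name__ == "__main__":
    old, dots = scan(".")
    for path in old:
        print("old timestamp:", path)
    for path in dots:
        print("trailing dot:", path)
```

Running it once in report-only mode before enabling `fix_timestamps` makes it easier to review what would change.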
The app iterates through all files and folders, it will fix timestamps that are out of range, and report on finding files or folders that end in a dot. I ran it through my files, it fixed the timestamps for me, and I deleted the folders ending in dot by hand. Multiple robocopy runs now complete as expected.

## eNom Dynamic DNS Update Problems

Update: On 27 July 2018 eNom support notified me by email that the issue is resolved. I tested it, and all is back to normal with DNS-O-Matic.

Sometime between 12 May 2018 and 24 May 2018 the eNom dynamic DNS update mechanism stopped working.

I use the very convenient DNS-O-Matic dynamic DNS update service to update my OpenDNS account, and several host records at eNom, pointing them to my home IP address. I was first alerted to the problem by a DNS-O-Matic status failure email, but as I was about to get on a plane for a business trip, I ignored the issue, hoping it was temporary.

    eNom response for 'foo.bar.net':
    --------------------
    ;URL Interface
    ;Machine is SJL0VWAPI03
    ;Encoding Type is utf-8
    Command=SETDNSHOST
    APIType=API.NET
    Language=eng
    ErrCount=1
    Err1=Domain name not found
    ResponseCount=1
    ResponseNumber1=316153
    ResponseString1=Validation error; not found; domain name(s)
    MinPeriod=1
    MaxPeriod=10
    Server=sjl0vwapi03
    Site=eNom
    IsLockable=
    IsRealTimeTLD=
    TimeDifference=+0.00
    ExecTime=0.053
    Done=true
    TrackingKey=5d09a343-b2d6-44e2-8d70-0ad9bcabcb8d
    RequestDateTime=6/21/2018 6:11:11 PM
    --------------------

Here is the update history from DNS-O-Matic:

    47.44.1.123, Jun 29, 2018 4:58 pm, ERROR
    47.44.1.123, Jun 29, 2018 4:53 pm, ERROR
    47.44.1.123, Jun 21, 2018 6:11 pm, ERROR
    47.44.1.123, May 24, 2018 6:10 pm, ERROR
    47.44.1.124, May 12, 2018 8:56 am, OK
    47.44.1.124, May 4, 2018 2:48 pm, OK
    47.44.1.124, May 3, 2018 1:42 pm, OK
    47.44.1.124, Apr 1, 2018 12:39 pm, OK
    47.44.1.124, Apr 1, 2018 9:58 am, OK
    47.44.1.124, Mar 24, 2018 5:06 pm, OK

As of yesterday, I could not find any other reports of similar issues on Google, and the eNom
status page showed no problems. I use a Ubiquiti UniFi Security Gateway Pro as my home router, and I have the dynamic DNS service in the UniFi controller configured to point to DNS-O-Matic, but it offered no additional hints as to the cause of the problem.

I contacted eNom support over chat, and they informed me they know there is an issue, and they said I should use the following format for the update:

    http://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=%1&PW=%2&Zone=%3&DomainPassword=%4

    %1 = Is username in Enom
    %2 = Is password
    %3 = Is my host and domain
    %4 = Is my domain access password

This was interesting. I had looked at several eNom update scripts, even the eNom sample code, and they all used a different command format. I looked up the SetDNSHost documentation, and sure enough, it looks like eNom changed the API.

Old format:

    https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&HostName=[host]&Zone=[domain]&DomainPassword=[password]&Address=[IP]

New format:

    https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&PW=[LoginPassword]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]

eNom changed the meaning of the “Zone” parameter to be the fully qualified domain name, and they required the addition of the account username and password.

I tried the old format in my browser, and I got the same “Domain name not found” error. As I tried the URL, I noticed that HTTPS failed with a certificate mismatch. The certificate for https://dynamic.name-services.com points to reseller.enom.com.

Broken SSL, combined with sending my account username and password, was not an acceptable option. Additionally, I use 2FA on my account, so I had doubts that my password would even work. I tried the command as described in the documentation, but I omitted my account password, and it worked:
https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]

I still find it very weird that this has been broken for so long, and that I could not find other reports of the problem on Google. Are people not using eNom or eNom resellers with dynamic DNS? I also find it disappointing that the status page is not reflecting this problem, and that the SSL domain does not match; one would expect more from a domain company.

Until eNom fixes the problem, or until DNS-O-Matic updates support for the new API format, I created a PowerShell script to update my domains; maybe it is useful for others with the same problem.

    $UserName = 'eNom account username'
    $HostNames = @('www', 'name1', 'name2', 'etc')
    $DomainName = 'yourdomain.com'
    $Password = 'Domain change password'

    # Look up the current public IP address
    $url = 'http://myip.dnsomatic.com'
    $webclient = New-Object System.Net.WebClient
    $result = $webclient.DownloadString($url)
    Write-Host $result
    $IPAddress = $result.ToString()
    $webclient.Dispose()

    # Ignore SSL error caused by dynamic.name-services.com SSL certificate pointing to a different domain
    [System.Net.ServicePointManager]::ServerCertificateValidationCallback = {$true}
    $webclient = New-Object System.Net.WebClient
    foreach ($hostname in $HostNames)
    {
        $url = "https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=$UserName&Zone=$hostname.$DomainName&DomainPassword=$Password&Address=$IPAddress"
        Write-Host $url
        $result = $webclient.DownloadString($url);
        Write-Host $result
    }
    $webclient.Dispose()
    # Restore default certificate validation
    [System.Net.ServicePointManager]::ServerCertificateValidationCallback = $null

## Razer BSOD When Driver Verifier is Enabled

I am done with Razer, exciting promises for technology on paper, great looking hardware, terrible support, terrible software.

Not too long ago I complained about Razer’s poor UX and Support, this time it is a BSOD in one of their drivers, and forever crashing Razer Stargazer camera software.

I’ve been looking for a Windows Hello capable webcam, and the Razer Stargazer, based on Intel RealSense technology, looked promising. The device is all metal and tactical looking, but the software experience is so buggy: install this, install that, then crash after crash after crash. I ended up returning it for a refund, and got a Logitech BRIO instead; the BRIO is cheaper, and works great.

A couple of days ago I was greeted with a BSOD on one of my test machines, a crash in the RZUDD.SYS “Razer Rzudd Engine” driver, part of the Razer Synapse software. What makes this interesting is that the issue seems to be triggered by having Driver Verifier enabled. One may be tempted to say do not enable Driver Verifier, but the point of Driver Verifier is to help detect bugs in drivers, and it is a basic requirement for driver certification.

Per the WinDbg analysis, this appears to be a memory corruption bug. After some searching, I found that the Driver Verifier BSOD has been reported by other users, with no acknowledgement, and no fix forthcoming. I contacted Razer support, and not surprisingly, they suggested uninstall and reinstall. I tried the community forums, and I was just pointed back to support.

    FAULTING_IP: rzudd+28c80
    ...
    DEFAULT_BUCKET_ID: CODE_CORRUPTION
    ...
    PROCESS_NAME: RzSynapse.exe
    ...
    STACK_TEXT:
    nt!KeBugCheckEx
    nt!MiSystemFault+0x12e69c
    nt!MmAccessFault+0xae6
    nt!KiPageFault+0x132
    rzudd+0x28c80
    rzudd+0x218d4
    rzudd+0x7a9f
    Wdf01000!FxIoQueue::DispatchRequestToDriver+0x1bf [minkernel\wdf\framework\shared\irphandlers\io\fxioqueue.cpp @ 3325]
    Wdf01000!FxIoQueue::DispatchEvents+0x3bf [minkernel\wdf\framework\shared\irphandlers\io\fxioqueue.cpp @ 3125]
    Wdf01000!FxPkgIo::DispatchStep1+0x53e [minkernel\wdf\framework\shared\irphandlers\io\fxpkgio.cpp @ 324]
    Wdf01000!FxDevice::DispatchWithLock+0x5a5 [minkernel\wdf\framework\shared\core\fxdevice.cpp @ 1430]
    nt!IovCallDriver+0x245
    ...
    FAILURE_BUCKET_ID: MEMORY_CORRUPTION_LARGE

I am done with Razer, exciting promises for technology on paper, great looking hardware, terrible support, terrible software.

## Razer Shoddy Support and Bad Software UX

This post is just me venting my frustration at Razer’s poor software user experience, and their shoddy support practices. I’m writing this after I just had to go and find a working mouse, so I could click a button on a dialog that had no keyboard navigation support.

I’ve been using Razer keyboards and mice for some time, love them, their software not so much. I had to replace an aging ThinkPad, and the newly released Razer Blade Stealth looked like a great candidate, small and fast, reasonably priced, should be perfect, well, not so much.

I keep my monitors color calibrated, and I cringe whenever I see side-by-side monitors that clearly don’t match, or when somebody creates graphic content (yes you graphic artists using MacBooks to create content for PC software without proper color profiles) that looks like shades of vomit on a projector or a cheap screen, but I digress. My monitor of choice is NEC and their native SpectraView color calibration software. Unfortunately, the Blade with its lower end Intel graphics processor, and HDMI port, does not support DDC/CI, so no ability to color calibrate my monitor.
My main monitor is a NEC MultiSync EA275UHD 4K monitor, and the internal Intel graphics processor is frustratingly slow on this high resolution display. And the HDMI connectivity would drop out whenever the monitor went into power saving mode. Why not use a more standard mini-DisplayPort connector? It would not solve the speed problem, but at least it would have resolved the connection reliability and allowed for proper color calibration.

To solve the problem, I decided to get a Razer Core with an EVGA GeForce GTX 1070 graphics adapter. The Core is an external USB and network dock, with a PSU and PCIe connector for a graphics card, all connected to the notebook by Thunderbolt 3 over a, too short, USB-C cable. I connected my monitor to the GTX 1070 DisplayPort connector, connectivity was fine, I could color calibrate my monitor, and the display performance with the GTX 1070 was fast, great. By the way, JayzTwoCents has a great video on the performance of external graphics cards.

But my USB devices connected to the dock kept on dropping out. I found several threads on the Razer support forum complaining about the same USB problems, and the threads are promptly closed with a contact support message. I contacted Razer support and they told me they are working on the problem, and closed my ticket. I contacted them again stating that closing my ticket did not resolve the problem, and they said my choice is to RMA the device, with no known solution, or wait, and then they closed my ticket again. To this day this issue has not been resolved, and I have to connect my USB devices directly to the notebook, defeating the purpose of a dock. They did publish a FAQ advising users to not use 2.4GHz WiFi, but to stick with 5GHz due to interference issues, so much for their hardware testing.

Now, let’s talk about their Razer Synapse software, the real topic of this post.
The software is used to configure all the Razer devices, and sync the device preferences across computers with a cloud account, neat idea. The color scheme and custom drawn controls of this software match their edgy “brand”, but their choice of thin grey font on a dark background fails in my usability book when used in a brightly lit office space.

Whenever Windows 10 updates, the stupid Synapse software pops up while the install is still going. If you say yes, install now, then as expected the install fails due to Windows still installing. I logged the issue with Razer support, and they told me it is behaving as designed, really, designed to fail.

So, today the Synapse software, again, prompts me to update, a frequent occurrence, and my mouse dies during the update, presumably because they updated the mouse driver, but this time I am prompted with a reboot required dialog. Dead mouse, no problem, have keyboard, tab over, wait, no keyboard navigation on the stupid owner drawn custom control dialog, no way to interact with the dialog without a mouse, just fail.

Moral of the story, UX is important people, and I should just stick with ThinkPad or Microsoft Surface Book hardware, costs more, but never disappoints.

## Circumventing ThinkPad’s WiFi Card Whitelisting

What started as a simple Mini PCI Express WiFi card swap on a ThinkPad T61 notebook turned into deploying a custom BIOS in order to get the card to work.

I love ThinkPad notebooks, they are workhorses that keep on going and going. I always keep my older models around for testing, and one of my old T61s had an Intel 4965AGN card that worked fine with Windows 10, until the release of the Anniversary / Redstone 1 update. After the RS1 update, WiFi would either fail to connect, or randomly drop out. The 4965AGN card is not supported by Intel on Win10, and the internet is full of problem reports of Win10 and 4965AGN cards.

Ok, no problem, I’ll just get a cheap, reasonably new Mini PCIe WiFi card with support for Win10, and swap the card. I got an Intel 3160 dual band 802.11AC card and mounting bracket for about $20. The 3160 is a circa 2013 card with Win10 support. I installed the card, booted, and got a BIOS error 1802: Unauthorized network card is plugged in.

This led me to the discovery of ThinkPad hardware whitelisting, where the BIOS only allows specific cards to be used, which led me to Middleton’s BIOS, a custom T61 BIOS that removes the hardware whitelisting and enables SATA-2 support. I found working download links to the v2.29-1.08 Middleton BIOS here.

The BIOS update is packaged as a Win7 x86 executable or DOS bootable ISO image. As I’m running Win10 x64, and I could not find any CD-R discs around, I used Rufus to create a bootable DOS USB key, and I extracted the ISO contents using 7-Zip to a directory on the USB key. The ISO is created using a bootable 1.44MB DOS floppy image, and AUTOEXEC.BAT launches “FLASH2.EXE /U”, so I created a batch file that does the same.

I removed the WiFi card, booted from USB, ran the flash, and got an error 1, complaining that flashing over the LAN is disabled. Ok, I enabled flashing the BIOS over the LAN in the BIOS, and rebooted.

I ran the update again, and this time I got error 99, complaining that BitLocker is enabled, and to temporarily disable BitLocker. I did not have BitLocker enabled, so I removed the hard drive and tried again, same error. Must be something in the BIOS, I disabled the security chip in the BIOS, tried again, and the update starts, but a minute or so later the screen goes crazy with INVALID OPCODE messages.

Hmm, maybe the updater does not like the FreeDOS boot image used by Rufus. Ok, let me create a MS-DOS USB key, uhh, on Win10, that turned out to be near impossible. Win10 does not include MS-DOS files, Rufus does not support custom locations for MS-DOS files, nor does it support getting them from floppy or CD images (readily available for download), the HP USB Disk utility complains my USB drive is locked, and writing raw images to USB results in a FAT12 disk structure that is too small to use. I say near impossible because I gave up, and instead went looking for an existing MS-DOS USB key I had made a long time ago. I am sure with a bit more persistence I could have found a way to create MS-DOS bootable USB keys on Win10, but that is an exercise for another day.

Trying again with a MS-DOS USB key, and voilà, BIOS flashed, and WiFi working.

I am annoyed that I had to go to this much trouble to get the new WiFi card working, but the best part of the exercise turns out to be the SATA-2 speed increase. This machine had an SSD that I always found to be slow, but with the SATA-2 speed bump in Middleton’s BIOS, the machine is noticeably snappier.

A couple hours later, my curiosity got the better of me, and I made my own version of Rufus that will allow formatting of MS-DOS USB drives on Win10. In the process I engaged in an interesting discussion with the author of Rufus. I say interesting, but it was rather frustrating, Microsoft removed the MS-DOS files from Win10, and Rufus refuses to add support for sourcing of MS-DOS files from a user specified location, citing legal reasons, and my reluctance to first report the issue to FreeDOS. Anyway, can code, have compiler, if have time, will solve problem.

## UPS Battery Replacement Turns Into Unrecoverable Firmware Update

Two lessons learned: do not trust scheduled battery tests, and leave working firmware be!

As the saying goes, if it is not broken do not fix it, especially when it comes to firmware.

I have a couple of APC Smart-UPS units at my house, the same models I like to use at the office. I use the SMT750 models with AP9631 Network Monitoring Cards. The problem started when we had a short power outage, and the UPS that powers the home network switch, cell repeater, alarm internet connection, and PoE IP cameras, unexpectedly died. A battery replacement led to the opportunity to do a UPS firmware update, which led to an unrecoverable firmware update.

It started when I woke up one morning and it was obvious the power had been out. The first indicator was the blinking clocks on the kitchen appliances, the second was the numerous power failure email notifications, and the emails that stood out were from the alarm system, saying it had lost power and internet connectivity. The alarm has its own backup battery, the network switches and FiOS internet have their own battery backups, and the outage was only about 4 minutes. So how is it that the UPS died, killing the switch and disconnecting the internet, when the outage was only 4 minutes, and typical runtimes on the UPS should be about an hour?

Here is the UPS outage log produced by the NMC card:

10/11/2016 06:53:05 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 06:12:27 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 06:12:24 Device UPS: Restored the local network management interface-to-UPS communication. 0x0101
10/11/2016 06:12:12 System Network service started. IPv6 address FE80::2C0:B7FF:FE98:9BAF assigned by link-local autoconfiguration. 0x0007
10/11/2016 06:12:10 Device Environment: Restored the local network management interface-to-integrated Environmental Monitor (Universal I/O at Port 1) communication. 0x0344
10/11/2016 06:12:09 System Network service started. System IP is 192.168.1.11 from manually configured settings. 0x0007
10/11/2016 06:12:02 System Network Interface coldstarted. 0x0001
10/11/2016 05:45:36 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 05:45:36 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 05:45:35 Device UPS: The output power is turned off. 0x0114
10/11/2016 05:45:35 Device UPS: The graceful shutdown period has ended. 0x014F
10/11/2016 05:45:35 Device UPS: No longer on battery power. 0x010A
10/11/2016 05:45:35 Device UPS: Main outlet group, UPS Outlets, has been commanded to shutdown with on delay. 0x0174
10/11/2016 05:45:35 Device UPS: The power for the main outlet group, UPS Outlets, is now turned off. 0x0135
10/11/2016 05:45:18 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 05:41:36 Device UPS: On battery power in response to rapid change of input. 0x0109

I could see from the log that the UPS battery power ran out within 4 minutes, 05:41:36 on battery, 05:45:18 battery too low, 05:45:35 output turned off. The UPS status page was equally puzzling, load was at 9.7%, yet reported runtime was only 5 minutes, impossible.
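As a quick check of the timing, here is a minimal Python sketch that computes the elapsed time between the three log entries quoted above. The timestamps are copied from the log; the date format is assumed to be month/day/year, which does not affect the result since all three entries share the same date. The dictionary keys and the function name are my own labels, not anything produced by the NMC.

```python
from datetime import datetime

# Timestamps copied from the NMC log; the keys are illustrative labels.
EVENTS = {
    "on_battery": "10/11/2016 05:41:36",
    "battery_too_low": "10/11/2016 05:45:18",
    "output_off": "10/11/2016 05:45:35",
}

# Assumed date format (month/day/year); all entries are on the same day.
LOG_FORMAT = "%m/%d/%Y %H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Return the whole seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)
    return int(delta.total_seconds())


to_low = seconds_between(EVENTS["on_battery"], EVENTS["battery_too_low"])
to_off = seconds_between(EVENTS["on_battery"], EVENTS["output_off"])

print(f"On battery to low battery: {to_low // 60}m {to_low % 60}s")
print(f"On battery to output off:  {to_off // 60}m {to_off % 60}s")
```

That works out to 3 minutes 42 seconds until the low battery warning, and 3 minutes 59 seconds until the output was turned off.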

Here is the status screenshot:

I ran a manual battery test, the test passed, but from the log it was clear the battery failed. I have bi-weekly scheduled battery tests for all UPS’s, never received a failure report. So what is the point of a battery test if the test comes back no problem yet it is clear to me from the logs that the battery failed?

Here is the log:

10/11/2016 18:07:06 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 18:07:06 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 18:07:05 Device UPS: Self-Test passed. 0x0105
10/11/2016 18:06:59 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 18:06:59 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 18:06:58 Device UPS: Self-Test started by management device. 0x0137
10/11/2016 18:06:56 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107

I asked for advice on the APC forum, no reply yet, and I ordered a replacement RBC48 battery. I received the battery, installed it, and the reported runtime is back to normal, 1 hour 48 minutes.

Here is a status screenshot with the new battery:

Here is where I should have stopped and called it a day, but no. I knew that the UPS’s were on old firmware, and I decided to use this opportunity to update the firmware. I’d normally let firmware be, unless I have a good reason to update, but I convinced myself that the new firmware readme had some fixes that may help with the false pass on the battery test:

Release Notes (UPS09.3):
========================
2. Improved self-test logging for PCBE / NMC.
11. Repaired an occasional math error in the battery replacement date algorithm that resulted in incorrect dates.

I updated the UPS where I had just replaced the battery first; the instructions are pretty simple. The only hassle is that I had to move the network equipment to mains power, so I could turn the UPS outputs off while updating the firmware and still maintain network connectivity.

I did the same for my office UPS, PC and office switch on mains power, and when I powered down the output, I made sure to not notify PowerChute Network Shutdown (PCNS) clients, as my PC had the PowerChute client installed to receive power state via the network. I started the firmware update over the network, and a few seconds later I got a Windows message that shutdown had been initiated by PCNS, what? I sat there in frustration, nothing to do but watch my PC shut down while it was still delivering the firmware update.

On rebooting my PC, NMC comes up, but reports the UPS has stopped communicating. I pull AC power from the UPS, no change, I also pull the batteries, and when I plug the batteries and mains back in, beeeeeeeeeep. NMC now reports no UPS found, while the UPS LCD panel reports all is fine. And still beeeeeeeeeep, and no way to stop the beeeeeeeeeep.

Here is the NMC status page:

I try to do Firmware Upgrade Wizard update via USB, plug a USB cable in, PC sees UPS, reports critical condition, but the upgrade wizard reports no UPS found on USB.

Here is the wizard error page:

So, here I am, stuck with a bricked UPS, lesson learned, actually two lessons learned; do not trust scheduled battery tests, and leave working firmware be!

## Amazon Associate’s Account Closed

Amazon just notified me in email that my Associate’s account was closed due to not being in compliance with their operating agreement:

“You are not in compliance with Participation Requirement Number 29 because purchases resulting from Special Links on your site have been used for resale or commercial use.”

I have no idea how or why this happened.

A couple of years ago I moved my blog from the free Blogger platform to a paid WordPress.com hosted site. About the same time I signed up for an Amazon Associate’s account, profiting from any Amazon links resulting in sales, hoping that the proceeds would cover the costs of WordPress hosting and domain registration.

A quick calculation shows Amazon payouts of $669.57 between 2 August 2012 and 30 August 2016, that is about $167.39 per year, less the $99.00 for WordPress hosting, less $36.00 for Akismet blog spam filtering, less $19.00 for domain registration, leaves a profit of $13.39. Less $99.00 for bulk domain registration fees, not really fair to charge this fee to one domain, leaves a loss of $85.61 per year.
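For anyone who wants to check the math, here is the same calculation as a small Python sketch. The dollar amounts are the ones quoted above; treating the payout period as an even 4 years is a simplifying assumption (the actual span is about four weeks longer), and the variable names are my own.

```python
# Amounts quoted in the post, in US dollars.
total_payouts = 669.57
years = 4  # assumption: 2 August 2012 to 30 August 2016 treated as 4 years

yearly_costs = {
    "WordPress hosting": 99.00,
    "Akismet spam filtering": 36.00,
    "Domain registration": 19.00,
}
bulk_domain_fee = 99.00  # arguably unfair to charge in full to one domain

income_per_year = round(total_payouts / years, 2)
profit_per_year = round(income_per_year - sum(yearly_costs.values()), 2)
net_with_bulk_fee = round(profit_per_year - bulk_domain_fee, 2)

print(f"Income per year:    ${income_per_year:.2f}")
print(f"Profit per year:    ${profit_per_year:.2f}")
print(f"After bulk fee:     ${net_with_bulk_fee:.2f}")
```

The last figure comes out negative, that is the yearly loss once the bulk registration fee is included.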

I do not know why I was suddenly out of compliance, I made no changes to either my Amazon Associates or WordPress accounts, and I’ve not posted any new content in a number of months. The WordPress stats show typical traffic (ignore the last two days), but the Amazon Associates report does show a marked increase in traffic:

I sent an email to Amazon support to clarify the violation, and to request my account be reinstated, but based on similar reports from other low traffic users,  I do not expect a resolution.

Instead, I opted-in to use WordPress’s own WordAds advertising platform, I still need to adjust the blog theme and settings to not interfere with reading, and I have no idea what the monetization would be, but at least I no longer have to bother with making special Amazon links.

Please comment and let me know if you find the ads to be intrusive, and I’ll consider funding the site without advertising assistance.

[Update: 1 September 2016]

A day after sending Amazon a request asking for an explanation, I received the following in email:

Looks like my account has been reinstated, no explanation of what happened.

## Nest Protect False Alarms

2AM, beep, smoke alarm low battery warning, and when one beeps, all the interconnected ones beep, now it is impossible to find which one has a low battery. As for how smoke alarms look, I’ve always wondered who made those terrible aesthetic design choices, maybe it is some kind of industry insider competition to see who can design the ugliest unit with the most obnoxious markings, and still get them sold.

I was thrilled when Nest announced the Nest Protect combination smoke and CO alarm, finally usability and technology catching up with smoke alarms, and an attractive looking unit. I’ve been a long time fan and user of the Nest thermostats, first one v1 unit, and later two v2 units, and I hoped the Nest Protect would do for smoke alarms what Nest did for thermostats.

I pre-ordered ten alarms from Amazon in October 2013, delivered in December 2013. Installation was easy, but I do wish there was a way to get more spoken locations, e.g. “smoke in kids bedroom”, which kid’s bedroom, wait, let me get my phone to see, not.

A week or two after installation we are having friends over for a barbecue, I show the alarm units, I show the mobile app, I explain how great the wave to silence alarm feature is, and how it will warn you before the alarm sounds, everybody is very impressed. Until a few hours later when one of the units goes off, “smoke in the guest bedroom”, what smoke. I wave at it, nothing, I press the button, “this alarm cannot be silenced”. Keep in mind they are all interconnected with a mesh wireless network, so all ten units are screaming. After the kids stopped crying and we moved the party outside, I get a ladder and remove the unit, still screaming, I take it to fresh air, still screaming, I get a screwdriver, open it up and remove the batteries, silence, but the rest of the units are still screaming, and pressing the button on those units still says “this alarm cannot be silenced”. About 5 minutes after removing the battery from the failed alarm the other alarms stop. Egg on my face.

Nest support exchanged the unit and sent out a replacement.

As I was browsing the Nest support forums I noticed many other users reporting false alarms, some reporting that replacement units resolved the problem, some reporting repeat problems. Things got worse for Nest when they issued a recall, offering refunds, disabling the wave feature with a firmware update, and halting sales until they had swapped stock for units with the newer firmware, before re-releasing at a reduced price.

October 2014 early AM the alarm goes off, false alarm again, at least this time the alarm silenced itself after a minute. After some back and forth, and an escalation, Nest support agreed to replace all units. The new units have September 2014 manufacturing dates, so I hope these new units are less buggy.

January 2015 early AM the alarm goes off, false alarm again, this time the alarm stopped after only a few seconds. I’ve had enough, my kids are scared, my wife is mad, Nest, you’re out.

Nest support agreed to issue a refund for all ten units, we’ll see how long it takes to receive the refund. And now I’m in the market for combination smoke and CO alarms again, and there are not many choices, if you want something that is functional and good looking.

I was tempted to wait for the First Alert Wi-Fi enabled combination smoke and CO alarm, available for pre-order on Amazon, and although this unit is from a well established manufacturer, hopefully no false alarms, I’m not making the same mistake I made with Nest. Regardless of the pre-order option, it still leaves me unprotected, and I need something now. I could simply not find a decent looking, combination smoke and CO, interconnectable, and hardwired unit, big problem being decent looking.

In the end I opted for the First Alert PC910V units, they are low profile voice enabled combination smoke and CO units with a built-in 10 year battery, sold at Lowes or Amazon. Not interconnected, not hardwired, but at least they look half decent.

Installing these units turned out to be a bit more tricky than I anticipated. The install base is so small that the round ceiling junction boxes are barely hidden, and the instructions specifically call out that they are not to be installed on junction boxes due to air flow concerns.

Below are some pictures showing the size differences between the Nest base (left, bottom), First Alert base (center, middle), and a round cover plate (right, top):

To account for the junction box ventilation warning I sealed between the junction boxes and the ceiling drywall, and between the cover plate and the ceiling. The alarm bases were mounted on the cover plates, see pics below.

Due to the small footprint of the alarm, the cover plate and imperfections around the hole in the ceiling can be seen when looking up at an angle. (Sorry for the crappy pictures, iPhone in low light not so great)

Let’s hope I never hear them peep, at least not for ten years if we can trust the battery life, and at least not without a real emergency.

## Electrical Power Quality

Earlier this year we moved a couple miles from Redondo Beach to Manhattan Beach, bigger house, better school district.
As far as the house and area are concerned, it is definitely an upgrade, but not so for the utilities.

Monthly utilities are a lot more expensive, not so much the per unit fees, but the base service fees, not just a couple of dollars more, but three or four times what we paid in Redondo Beach. Now, if it came with better offerings, or better service, or higher quality, ok, but it is the opposite. Water quality is worse, specifically hardness, MB supplies its own water, RB gets water from LADWP, and then there is that unsightly water tower that no longer serves any practical purpose, with efforts to demolish it always being thwarted. As a new resident, trash collection makes me pay almost $30 extra per month for an extra trash can, while grandfathered-in residents keep extras for free. Now, I know it is unfair to judge a service by its employees’ actions, or is it, but the trash collection guy is a jerk; if a little dust and having to get out of the truck is going to get you agitated, you are in the wrong business, especially when compared with the pack of trash collection men in RB who were always friendly and willing to give a hand.
But, I really digress, I want to discuss electrical power quality problems.

In the six plus years we lived in RB, I think we had one scheduled power outage, and maybe two short unplanned outages. Since moving to MB earlier this year, we’ve had two scheduled outages, one lasting an entire day, and several unscheduled outages.
The power is unreliable, SCE knows it, the city knows it, there are some plans addressing it, see here here here here.

My concern is not really power being on or off, it is power being on but of poor quality; an electronic equipment killer.

When we moved in, the first signs of electrical problems were flickering lights. At first I thought it was a problem with the Vantage light control system, but even lights directly on utility power flickered. As soon as I hooked up UPS’s to my servers and the signal distribution system, the UPS’s started complaining about power quality. Occasionally during the day I would get a notification from the UPS’s that it detected a distorted input, and every night the UPS’s would complain about low input voltage.
It may be coincidental, but I’ve also had two astronomical clock light timers fail at the same time, the casings were scorched in what appears to be signs of electrical damage.

UPS Event Log:

In order to quantify the problem, I used a Fluke VR1710 Voltage Quality Recorder. The device plugs into a mains outlet, and records events, and a USB port is used to configure the device, and download recorded data.

As I am not a power quality expert, I referred to Wikipedia and Power Quality In Electrical Systems for information and reference material. To further simplify the analysis, I opted to compare my office power with my home power, this allowed me to easily visualize the quality differences, granted, I am assuming my office power is good.

I configured the VR1710 to take measurements every 10s, and to record exceptional events, about 10 days worth of data. I set the dip threshold to 106V, the swell threshold to 127V, and the transient sensitivity to 5V.
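To put those thresholds in perspective, here is a minimal Python sketch that classifies a voltage reading against the dip and swell settings, and expresses them as a percentage of nominal. The 106V and 127V values are my VR1710 settings; the 120V nominal, the strict comparisons, and the function names are assumptions for illustration, this is not how the Fluke software is implemented.

```python
NOMINAL_V = 120.0          # assumed nominal mains voltage
DIP_THRESHOLD_V = 106.0    # VR1710 dip setting
SWELL_THRESHOLD_V = 127.0  # VR1710 swell setting


def classify(voltage: float) -> str:
    """Label a single RMS voltage reading as a dip, a swell, or normal."""
    if voltage < DIP_THRESHOLD_V:
        return "dip"
    if voltage > SWELL_THRESHOLD_V:
        return "swell"
    return "normal"


def percent_of_nominal(voltage: float) -> float:
    """Return the deviation from nominal voltage, in percent."""
    return (voltage - NOMINAL_V) / NOMINAL_V * 100.0


for name, limit in (("dip", DIP_THRESHOLD_V), ("swell", SWELL_THRESHOLD_V)):
    print(f"{name} threshold: {limit:.0f}V ({percent_of_nominal(limit):+.1f}% of nominal)")

# Hypothetical readings, just to exercise the classifier.
for reading in (102.0, 119.0, 128.0):
    print(f"{reading:.1f}V -> {classify(reading)}")
```

With a 120V nominal, the dip threshold sits roughly 12% below nominal and the swell threshold roughly 6% above it.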

VR1710 Settings:

Below are reports detailing the recorded events, click graphs to view full resolution:

Home Voltage:

There is a clear pattern of voltage drops below 102V every evening, these drops are also observed in the UPS logs showing low voltage warnings around 7:30PM every evening.

Office Voltage:

The office voltage is very stable.

Home Flicker:

According to Wikipedia and PQW short term flicker (Pst) is noticeable at values exceeding 1.0, and long term flicker (Plt) is noticeable at values exceeding 0.65. These results would explain why we observe lights flickering.
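For reference, the long term value is derived from the short term values: as I understand the IEC flickermeter definition, Plt is the cubic mean of consecutive Pst values, typically twelve 10 minute readings over 2 hours. Below is a minimal Python sketch of that relationship, using the noticeability limits quoted above. The function names are my own, and the sample values are hypothetical, they are not readings from my recorder.

```python
PST_LIMIT = 1.0   # short term flicker limit quoted above
PLT_LIMIT = 0.65  # long term flicker limit quoted above


def long_term_flicker(pst_values: list[float]) -> float:
    """Cubic mean of short term flicker values (typically 12 x 10 minute Pst)."""
    if not pst_values:
        raise ValueError("at least one Pst value is required")
    mean_of_cubes = sum(p ** 3 for p in pst_values) / len(pst_values)
    return mean_of_cubes ** (1.0 / 3.0)


def is_noticeable(pst_values: list[float]) -> bool:
    """True if any Pst, or the resulting Plt, exceeds the quoted limits."""
    plt = long_term_flicker(pst_values)
    return plt > PLT_LIMIT or any(p > PST_LIMIT for p in pst_values)


# Hypothetical 2 hour window: mostly quiet, with a few bad intervals.
sample = [0.3, 0.4, 0.3, 1.2, 1.4, 0.5, 0.3, 0.3, 0.4, 1.1, 0.3, 0.3]
print(f"Plt = {long_term_flicker(sample):.2f}, noticeable: {is_noticeable(sample)}")
```

Because the values are cubed before averaging, a few bad intervals dominate the long term number, which is why short bursts of flicker can push Plt over the limit.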

Office Flicker:

Office flicker values are well within acceptable ranges.

Home Statistics:

From this distribution we can see the wide spread in voltages, well below the 120V theoretical norm. This chart does not show it, but the 95% distribution is 115.5V, and the 5% distribution is 106.1V.

Office Statistics:

The office voltage distribution is nicely clustered around 119V, with the 95% distribution at 119.6V, and the 5% distribution at 117.4V.
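If you export the readings and want to compute similar figures yourself, here is a minimal Python sketch. It reads the 5% and 95% distribution values as the 5th and 95th percentiles, which is my assumption about how the Fluke software defines them, and the sample readings are hypothetical, not my recorded data.

```python
import statistics


def p5_p95(samples: list[float]) -> tuple[float, float]:
    """Return the (5th, 95th) percentiles of a list of voltage readings."""
    # n=20 splits the data into 5% steps, giving 19 cut points:
    # the first is the 5th percentile, the last is the 95th.
    cuts = statistics.quantiles(samples, n=20)
    return cuts[0], cuts[-1]


# Hypothetical readings: mostly around 115V, with evening sags.
readings = [115.2, 114.8, 115.5, 113.9, 112.4, 109.7, 106.3, 104.9,
            108.8, 111.6, 114.1, 115.0, 115.3, 114.6, 113.2, 110.5,
            107.1, 105.4, 109.9, 114.4]

low, high = p5_p95(readings)
print(f"5th percentile: {low:.1f}V, 95th percentile: {high:.1f}V")
print(f"Spread: {high - low:.1f}V")
```

The spread between the two values is a simple way to compare locations; using the figures quoted above, home spans about 9.4V while the office spans about 2.2V.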

Home Dips And Swells:

ITIC and CBEMA are standards for acceptable power quality, see here for a detailed description.
To describe the graph, I quote from the Fluke Power Log software manual:
Dips and swells are shown on a CBEMA (Computer Business Equipment Manufacturers Association) and ITIC (Information Technology Industry Council) plot classification table according to EN50160. On the CBEMA (blue) and ITIC (red), curve markers are plotted for each dip and swell. The height on the vertical axis shows the severity of the dip or swell relative to the nominal voltage. The horizontal position shows the duration of the dip or swell. These curves show an ac input voltage envelope which typically can be tolerated (no interruption in function) by most Information Technology Equipment (ITE).

Based on the graph we can see a large number of events exceeding the acceptable ranges. Since there were no dips at the office, there is no graph for the office.

Home Transients:

I only show the transients graph for home, as the wave forms all look different, and the only difference between home and office is 87 events were recorded at home while 10 events were recorded at the office for the same approximate time duration. See PQW for an explanation of transients.

We can clearly see that the power quality at my house is significantly worse compared to the power at my office.

I am speculating, but I wonder if the old transformer across the road can supply sufficient power, given that it used to supply power to three small very old houses on four lots, demolished to make room for four new larger houses?

I just opened a support ticket with SCE, let’s hope they can do something about the problem.