Unraid and Robocopy Problems

In my last post I described how I converted one of my W2K16 servers to Unraid, and how I am preparing for conversion of the second server.

As I’ve been copying all my data from W2K16 to Unraid, I discovered some interesting discrepancies between W2K16 SMB and Unraid SMB. I use robocopy to mirror files from one server to the other, and once the first run completes, any subsequent runs should complete without needing to copy any files again (unless they were modified).

First, you have to use the /FFT option, for FAT file times, which allows for 2 seconds of drift in file timestamps, e.g. “robocopy.exe [source] [dest] /MIR /FFT”.
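
As a concrete example, the kind of mirror run I use looks something like this (the server and share names here are placeholders, and the retry and log options are just my own conveniences):

# Mirror the W2K16 share to the Unraid share, tolerating 2 seconds of timestamp drift (/FFT)
robocopy \\w2k16\media \\unraid\media /MIR /FFT /R:1 /W:1 /LOG:C:\Temp\robocopy-media.log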

I found a large number of files that would copy over and over with no changes to the source files. I also found a particular folder that would “magically” show up on Unraid, and cannot be deleted from the Unraid share by robocopy.

After some troubleshooting, I discovered that files with old timestamps, and folder names that end in a dot, do not copy correctly to Unraid.

I looked at the files that would not copy, and I discovered that the file modified timestamps were all set to “1 Jan 1970 00:00”. I experimented by changing the modified timestamp to today’s date, and the files copied correctly. It seems that if the modified timestamp on the source file is older than 1 Jan 1980, the modified timestamp on Unraid for the newly created file will always be set to 1 Jan 1980. When robocopy then runs again, the source file is always reported as older, and the file is copied again.

Below is an example of a folder of test files with timestamps of 1 Jan 1970 UTC: I copy the files using robocopy, then copy them again. The second run of robocopy copies all the files again, instead of reporting them as unchanged. One can see that the destination timestamp is set to 1 Jan 1980, not 1 Jan 1970 as expected.
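
For reference, a rough PowerShell sketch of how such a test can be set up (the local folder and the \\unraid\test share are placeholders):

# Create a few test files and back-date their modified timestamps to 1 Jan 1970 UTC
$source = 'C:\Temp\OldFiles'
New-Item -ItemType Directory -Path $source -Force | Out-Null
1..3 | ForEach-Object {
    $file = New-Item -ItemType File -Path (Join-Path $source "test$_.txt") -Force
    $file.LastWriteTimeUtc = [datetime]::new(1970, 1, 1, 0, 0, 0, [DateTimeKind]::Utc)
}

# The first run copies everything; the second run should copy nothing, yet it copies everything again
robocopy $source \\unraid\test /MIR /FFT
robocopy $source \\unraid\test /MIR /FFT

# Check what the destination ended up with; the timestamps come back as 1 Jan 1980
Get-ChildItem \\unraid\test | Select-Object Name, LastWriteTimeUtc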

The second set of problem files occurs in folders whose names end in a dot. Unraid ignores the dots at the end of the folder names, and when another folder exists without the dots, the copy operation uses the wrong folder.

Below is an example of a folder that contains two directories, one named “LocalState”, and one named “LocalState..”. I robocopy the folder contents, and when running robocopy again, it reports an extra folder. That extra folder gets “magically” created in the destination directory, but the “LocalState..” folder is missing.

The same robocopy operations to the W2K16 server over SMB work as expected.

From what I researched, the timestamp ranges are: NTFS from 1 January 1601 to 14 September 30828, FAT from 1 January 1980 to 31 December 2107, and EXT4 from 1 January 1970 to somewhere in the year 2446 (2038 + 408). I could not create files with a date earlier than 1 Jan 1980, but I could set file modified timestamps to dates greater than 2106, so I do not know what the Unraid timestamp range is.
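
A quick way to probe a server is to set out-of-range modified timestamps over SMB and read them back; a minimal sketch, again using a placeholder \\unraid\test share:

$file = Get-Item '\\unraid\test\test1.txt'

$file.LastWriteTimeUtc = [datetime]::new(1970, 1, 1, 0, 0, 0, [DateTimeKind]::Utc)
(Get-Item $file.FullName).LastWriteTimeUtc   # reads back as 1 Jan 1980

$file.LastWriteTimeUtc = [datetime]::new(2200, 1, 1, 0, 0, 0, [DateTimeKind]::Utc)
(Get-Item $file.FullName).LastWriteTimeUtc   # dates well past 2106 could still be set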

Creating and accessing directories with trailing dots requires special care on Windows, using the NT-style path notation, e.g. CreateDirectoryW(L"\\\\?\\C:\\Users\\piete\\Unraid.Badfiles\\TestDot..", NULL), but robocopy does handle that correctly against W2K16 SMB.
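
The same trailing dot folders can be created and removed from the command line by handing an NT-style path to cmd.exe; a rough sketch using a hypothetical C:\Temp location:

# The \\?\ prefix bypasses Win32 path normalization, which otherwise strips trailing dots from names
cmd /c md '\\?\C:\Temp\TestDot..'         # create a folder whose name really ends in dots
cmd /c dir /x C:\Temp                     # the folder is listed with its trailing dots intact
cmd /c rd /s /q '\\?\C:\Temp\TestDot..'   # and this is also how to delete such a folder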

I don’t know if the observed behavior is specific to Unraid SMB, or if it would apply to Samba on Linux in general. But, it posed a problem as I wanted to make sure I do indeed have all files correctly backed up.

I decided to write a quick little app to find problem files and folders. The app iterates through all files and folders, it will fix timestamps that are out of range, and report on finding files or folders that end in a dot. I ran it through my files, it fixed the timestamps for me, and I deleted the folders ending in dot by hand. Multiple robocopy runs now complete as expected.
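
A minimal PowerShell sketch of the same idea (not the actual app): walk a tree, clamp too-old modified timestamps, and report names ending in a dot, with the root path as a placeholder:

$root    = 'D:\Data'                    # placeholder root folder to scan
$minDate = [datetime]::new(1980, 1, 2, 0, 0, 0, [DateTimeKind]::Utc)

Get-ChildItem -LiteralPath $root -Recurse -Force | ForEach-Object {
    if ($_.Name.EndsWith('.')) {
        # These need to be renamed or deleted by hand (or via a \\?\ prefixed path)
        Write-Warning "Name ends in a dot: $($_.FullName)"
    }
    if ($_.LastWriteTimeUtc -lt $minDate) {
        Write-Host "Fixing timestamp on $($_.FullName) ($($_.LastWriteTimeUtc))"
        $_.LastWriteTimeUtc = $minDate
    }
}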


Moving from W2K16 to Unraid

I have been happy with my server rack running my UniFi network equipment and two Windows Server 2016 (W2K16) instances. I use the servers for archiving my media collection and running Hyper-V for all sorts of home projects and work related experiments. But, time moves on, one can never have enough storage, and technology changes. So I set about a path that led to me replacing my W2K16 servers with Unraid.

I currently use Adaptec 7805Q and 81605ZQ RAID cards, with a mixture of SSD for caching, SSD RAID1 for boot and VM images, and HDD RAID6 for the large media storage array. The setup has been solid, and although I’ve had both SSD and HDD failures, the hot spares kicked in, and I replaced the failed drives with new hot spares, no data lost.

For my large RAID6 media array I used lots of HGST 4TB Ultrastar (enterprise) and Deskstar (consumer) drives, but I am out of open slots in my 24-bay 4U case, so adding more storage has become a problem. I can replace the 4TB drives with larger drives, but in order to expand the RAID6 volume without losing data, I need to replace all disks in the array, one-by-one, rebuilding parity in between every drive upgrade, and then expand the volume. This will be very expensive, take a very long time, and risk the data during every drive rebuild.

I have been looking for more flexible provisioning solutions, including Unraid, FreeNAS, OpenMediaVault, Storage Spaces, and Storage Spaces Direct. I am not just looking for dynamic storage; I also want a system that can run VM’s and Docker containers, I want it to work with consumer and/or small business hardware, and I do not want to spend all my time messing around in a CLI.

I have tried Storage Spaces with limited success, but that was a long time ago. Storage Spaces Direct offers significant improvements, but its more stringent enterprise hardware requirements would make it too costly and complicated for my home use.

FreeNAS offers the best storage capabilities, but I found the VM and Docker ecosystem to be an afterthought and still lacking.

OpenMediaVault (OMV) is a relative newcomer with a modern web front-end (think of OMV as Facebook, and FreeNAS and Unraid as MySpace) and growing support for VM’s and Docker. Compared to FreeNAS and Unraid, the OMV community is still very small, and I was reluctant to entrust my data to it.

Unraid offered a good balance between storage, VM, and Docker, with a large support community. Unlike FreeNAS and OMV, Unraid is not free, but the price is low enough.

An ideal solution would have been the storage flexibility offered by FreeNAS, the Docker and VM app ecosystem offered by Unraid, and the UI of OMV. Since that does not exist, I opted to go with Unraid.

Picking a replacement OS was one problem, but moving the existing systems to run on it, without losing data or workloads, quite another. I decided to convert the two servers one at a time, so I moved all the Hyper-V workloads from Server-2, with the 8-bay chassis, to Server-1, with the 24-bay chassis. This left Server-2 unused, and I could go about converting it to Unraid. I not only had to install Unraid, I also had to provision enough storage in the 8-bay chassis to hold all the data from the 24-bay chassis, so that I could then move the data on Server-1 to Server-2, convert Server-1 to Unraid, and move the data back to Server-1. And I had to do this without risking the data, and without an extended outage.

To get all the data from Server-1 to fit on Server-2, I pruned the near 60TB set down to around 40TB. You know how it works, no matter how much storage you have it will always be filled. I purchased 4 x 12TB Seagate IronWolf ST12000VN0007 drives, which, combined with 2 x 4TB HGST drives, gave me around 44TB of usable storage space, enough to copy all the important data from Server-1 to Server-2.

While I was at it, I decided to upgrade the IPMI firmware, motherboard BIOS, and RAID controller firmware. I knew it was possible to upgrade the SuperMicro BIOS through IPMI, but you have to buy a per-motherboard locked Out-of-Band feature key from SuperMicro to enable this, something I had never bothered doing. While looking for a way to buy a code online, I found an interesting post describing a method of creating my own activation keys, and it worked.

IPMI updated, motherboard BIOS updated, RAID firmware updated, I set about converting the Adaptec RAID controller from RAID to HBA mode. Unlike the LSI controllers that need to be re-flashed with IR or IT firmware to change modes, the Adaptec controller allows this configuration via the controller BIOS. In order to change modes, all drives have to be uninitialized, but there were two drives that I could not uninitialize. After some troubleshooting I discovered that it is not possible to delete MaxCache arrays from the BIOS. I had to boot using the Adaptec bootUSB utility, which is a bootable Linux image that runs the MaxView storage controller GUI. With the MaxCache volumes deleted, I could convert to HBA mode.

With the controller in HBA mode, I set about installing Unraid. Well, it is not really installing in the classic sense; Unraid runs from a USB drive, and all drives in the system are used for storage. There is lots of info online on installing and configuring Unraid, but I found very good info on the Spaceinvader One YouTube channel. I have seen some reports of issues with USB drives, but I had no problems using a SanDisk Cruzer Fit drive.

It took a couple iterations before I was happy with the setup, and here are a few important things I learned:

  • Unraid does not support SSD drives as data drives, see the install docs; “Do not assign an SSD as a data/parity device. While unRAID won’t stop you from doing this, SSDs are only supported for use as cache devices due TRIM/discard and how it impacts parity protection. Using SSDs as data/parity devices is unsupported and may result in data loss at this time.” This is one area where FreeNAS and OMV, using e.g. ZFS, offer much better redundancy solutions than Unraid’s parity scheme, as do the many commercial solutions that have been using SSD’s in drive arrays for years.
  • Unraid’s caching solution using SSD drives and BTRFS works just ok. Unlike e.g. Adaptec MaxCache, which seamlessly caches block storage regardless of the file system, the Unraid cache works at the file level. While this does create flexibility in deciding which files from which shares should use the cache, it greatly complicates matters when running out of space on the cache. When a file is created on the cache, and the file is then enlarged to the point that it no longer fits in the available space, the file operation will permanently fail. E.g. when copying a large file to a cached share, and the file is larger than the available space, the copy will proceed until the cache runs out of space and then fail; repeat, and get the same result. To avoid this, one has to set the minimum free space setting to a value larger than the largest file that would ever be created on the cache, which for large files is very wasteful. Imagine a thin provisioned VM image: it can grow until there is no space, and then fail, until manually moved to a different drive.
  • The cache re-balancing and file moving algorithm is very rudimentary; the operation is scheduled per time period, and will move files from the cache to regular storage. There is no support for flushing the cache in real-time as it runs out of space, there are no high water or low water mechanisms, and no LRU or MRU file access logic. I installed the Mover Tuning plugin that allows balancing the cache based on consumed space; better, but still not good enough.
  • Exhausting the cache space while copying files to Unraid is painfully slow. I used robocopy to copy files from W2K16 to a share on Unraid that had caching set to “preferred”, meaning use the cache if it has space, and as soon as the cache ran out of space, the copy operation slowed down to a crawl. As soon as the cache ran out of space, new files were supposed to be written to HDD, but my experience showed that something was not working, and I had to disable the cache and then copy the files. The whole SSD and caching thing is a big disappointment.
  • Building parity while copying files is very slow. Copying files using robocopy while the parity was building resulted in about 200Mbps throughput, very slow. I cancelled the parity operation, disabled the parity drive, and copied with no parity protection in place, and got near the expected 1Gbps throughput. I will re-enable parity building after all data is copied across.
  • Performing typical disk based operations, like adding, removing, or replacing a drive, is very cumbersome. The wiki tries to explain, but it is still very confusing. I really expected much easier ways of doing typical disk based operations, especially when almost all operations result in the parity becoming invalid, leaving the system exposed to failure.
  • It is really easy to use Docker, with containers directly from Docker Hub, or from the Community Applications plugin that acts like an app store.
  • It is reasonably easy to create VM’s; one has to manually install the VirtIO (KVM/QEMU) drivers in Windows OS’s, but that is made easy by the automatic mounting of the VirtIO driver ISO.
  • I could not get any Ubuntu Desktop VM’s working, they would all hang during install. I had no problems with Ubuntu Server installs. I am sure there is a solution, I just did not try looking yet as I only needed Ubuntu Server.
  • VM runtime management is lacking, there is no support for snapshots or backups. One can install the Virt-Manager container to help, but it is still rather rudimentary compared to offerings from VMWare, Hyper-V, and VirtualBox.
  • In order to get things working I had to install several community plugins, I would have expected this functionality to be included in the base installation. Given how active the plugin authors are in the community, I wonder if not including said functionality by default may be intentional?
  • Drive power saving works very well, and drives are spun down when not in use. I will have to revisit the file and folder to drive distribution, as common access patterns to common files should be constrained to the same physical drive.
  • The community forum is very active and very helpful.

I still have a few days of file copying left, and I will keep my W2K16 server operational until I am confident in the integrity and performance of Unraid. When I’m ready, I’ll convert the second server to Unraid, and then re-balance the storage, VM, and Docker workloads between the two servers.

Monoprice MP Voxel 3D Printer Setup

[Update]
After using the printer for about two weeks, I returned it to Monoprice for a refund. I would suggest you stay away and look elsewhere, or wait until Monoprice addresses the serious issues: Polar Cloud disconnects while printing, an IO timeout error breaks the camera function, and, the deal breaker, it hangs during printing with the touch screen unresponsive and the extruder and print bed heater still on.

 

I’ve been looking for a new 3D printer to use at home, and I just installed and configured my new Monoprice Voxel 3D Printer.

I bought my first 3D printer, a Makerbot Replicator 2, for our office new-tech lab in 2012. It was a shared printer, hard to maintain, and eventually printed to death.

A few months ago I found the HobbyKing Turnigy Mini Fabrikator V2 on sale at a great price. It sat unopened, until a few weeks ago, when I wanted to use it to print Christmas ornaments for the kids to decorate. What a disappointment: difficult to configure, flimsy construction, and during the first print the filament guide adapter broke away from the print head. It was impossible to fix without ordering replacement parts, so I just trashed it.

In search of a replacement, I decided to look for something the kids could use with minimal (wishful thinking at this age) help from me. That meant an enclosed printer and iPad capable software. A semi-helpful Reddit post made it clear to me that the best printers, and the best value for money printers, are not really kid friendly, but it did point me in the right direction: to Polar Cloud, to FlashForge, and eventually the Monoprice Voxel, which is a cheaper Monoprice branded FlashForge Adventurer 3.

Unboxing and setting up was easy, and in a few minutes I printed the sample cube that is included on the 8GB internal memory. The unit is quiet while printing, but the continuous high pitched fan noise is annoying.

Next I configured WiFi and connected to Polar Cloud, and this is where things started going wrong. The manual that is included in the box is not complete, and I found an updated manual and instructional video on the Monoprice site. The steps to connect to WiFi are easy, but I found that using the touch screen was very difficult. The WiFi password is hidden with asterisks, and the touch screen has a tendency to not respond, or to pick multiple characters, or the wrong character. Since I could not see the password, it took several frustrating tries to get the right password entered. This experience could easily be improved by simply not hiding the password, or by allowing configuration and control via a mobile app.

The Monoprice site lists the FlashPrint software as version 3.23.2, while the version on the FlashForge site is 3.25.1, so naturally I installed the latest software from FlashForge. Mistake: I could not select the Voxel as the printer, and it turns out that “FlashPrint-MP” is for Monoprice printers, and “FlashPrint” is for FlashForge printers. So back to installing FlashPrint-MP 3.23.2. The software is typical modeling and slicing, and allowed me to connect to the printer over the network.

I followed the instructions to connect the Voxel to Polar Cloud, but I kept getting an error about my printer MAC address already belonging to a different user. I found a KB article that instructed me to hard reset the printer, and I dreaded having to reenter my WiFi password. The article mentioned that this problem would be addressed in a future firmware update, so I tried updating the firmware first. I could not find any downloadable firmware, but I found that updating from the printer pulled new firmware over the internet. After a reboot, no more error, and I was connected to Polar Cloud.

I encountered some IO timeout errors being displayed on the printer. I thought it was related to Polar Cloud, but I later encountered them during normal operation while navigating the camera configuration menus. I suspect it is related to the camera, since I lost the camera view on Polar Cloud as soon as I got this error. I hope this gets fixed in a firmware update.

My first cloud print after the firmware update did not go so well; the head scratched the print plate. I did find other users (Amazon reviews) complaining of the same scratching problem, and I suspect the firmware update may have reset the calibration. After re-leveling, cloud prints worked fine again, but this should never have happened.

Printing from Polar Cloud is super simple: select a community model, or upload your own model, and print. The job is sliced per the selected options and sent to the printer, and it is a nice touch to have the printer camera images displayed on the website while printing. I did notice that the time estimate displayed on the printer keeps changing, e.g. Polar Cloud listed the time remaining as 2 hours 26 minutes, while the Voxel showed 2 minutes … 11 minutes … 45 minutes. I suspect the job is spooled such that the printer only knows what is left in the buffer vs. the complete job.

In closing, so far, I think it is a great printer for the money, and I hope future firmware updates improve the experience.

Bad:

  • Instructions are incomplete and all over the place.
  • Poor touchscreen experience, difficult to press, double press, wrong press.
  • Difficult WiFi setup, password is hidden, and combined with wrong key presses results in frustration.
  • Scratched build plate, re-leveling seems to be required after a firmware update.
  • Out of the box Polar Cloud would not connect due to duplicate MAC address, requires firmware update to fix.
  • Sporadic IO timeout errors, suspect related to camera.
  • High pitched fan noise (I may try to mod this by replacing the fan or adding sound dampening material).
  • Hangs during printing, with the extruder and build plate heaters still on, and touchscreen unresponsive.

Good:

  • Low price
  • Enclosed
  • Easy to use
  • Cloud print

 

For others setting up this printer, I would recommend the following steps:

  1. Ignore the manual in the box, download the manual from Monoprice.
  2. Connect to WiFi; I suggest using the eraser end of a pencil to touch the screen while entering the password, as this helps prevent typing mistakes.
  3. Update the firmware from the tools menu.
  4. Re-level the build plate.
  5. Connect to Polar Cloud.
  6. Print, do not print until after re-leveling.

 

The next step for me is to find suitable software for the kids to use for modeling on their iPads.

 

PurpleAir Sensor Installation

I’ve had an Ambient Weather WS-1400-IP weather station installed for some time, reporting to Weather Underground. During the fires of previous years, I considered getting an air quality monitor, but I could never find anything worth the installation effort. During this year’s fire season I saw several ads for PurpleAir, advertising that they collaborate with Weather Underground, so I decided to purchase and install a PA-II outdoor sensor.

The installation instructions are sparse, and the device is not really what I would call rugged or weather proof. I would not put money on it surviving outdoors for longer than a year, specifically because of the use of a vanilla Micro-USB power plug that offers no corrosion protection. The unit I received came with a Nest outdoor camera power cable, but unlike the Nest camera, which uses a watertight plug, the sensor uses an open USB cable. The instructions do say to point the open USB port downwards; instead I opted to seal it in using clear silicone sealer.

 

The ideal installation location would have been near my Rachio water sprinkler controller, where I have a waterproof enclosure with power, but that location is also near the HVAC, instant hot water heater, and dryer vents, so not ideal due to local air pollutants.

I installed the sensor next to my UniFi AC Mesh outdoor AP; the cables do look a bit messy, and if the sensor survives long enough, I may install an enclosure to clean up the cables.

Configuring the device is reasonably simple, but a mobile app would have been easier: power up the device, connect to its WiFi access point, access a web page hosted by the device, configure the local WiFi SSID and password, connect to the local WiFi, then register the device with PurpleAir. After all is done, I received a welcome email, and I could see the device on the PurpleAir map.

[Screenshot: the new sensor on the PurpleAir map]

Next up, I have to figure out how to view combined weather and air quality data on Weather Underground, how to get a direct link to my sensor’s data (the map link shows an area only), and how to use the data API (I archive all my data).

eNom Dynamic DNS Update Problems

Update: On 27 July 2018 eNom support notified me by email that the issue is resolved. I tested it, and all is back to normal with DNS-O-Matic.

Sometime between 12 May 2018 and 24 May 2018 the eNom dynamic DNS update mechanism stopped working.

I use the very convenient DNS-O-Matic dynamic DNS update service to update my OpenDNS account, and several host records at eNom, pointing them to my home IP address.

I was first alerted to the problem by a DNS-O-Matic status failure email, but as I was about to get on a plane for a business trip, I ignored the issue, hoping it was temporary.

eNom response for 'foo.bar.net':
--------------------
;URL Interface
;Machine is SJL0VWAPI03
;Encoding Type is utf-8
Command=SETDNSHOST
APIType=API.NET
Language=eng
ErrCount=1
Err1=Domain name not found
ResponseCount=1
ResponseNumber1=316153
ResponseString1=Validation error; not found; domain name(s)
MinPeriod=1
MaxPeriod=10
Server=sjl0vwapi03
Site=eNom
IsLockable=
IsRealTimeTLD=
TimeDifference=+0.00
ExecTime=0.053
Done=true
TrackingKey=5d09a343-b2d6-44e2-8d70-0ad9bcabcb8d
RequestDateTime=6/21/2018 6:11:11 PM
--------------------

Here is the update history from DNS-O-Matic:

47.44.1.123, Jun 29, 2018 4:58 pm, ERROR
47.44.1.123, Jun 29, 2018 4:53 pm, ERROR
47.44.1.123, Jun 21, 2018 6:11 pm, ERROR
47.44.1.123, May 24, 2018 6:10 pm, ERROR
47.44.1.124, May 12, 2018 8:56 am, OK
47.44.1.124, May 4, 2018 2:48 pm, OK
47.44.1.124, May 3, 2018 1:42 pm, OK
47.44.1.124, Apr 1, 2018 12:39 pm, OK
47.44.1.124, Apr 1, 2018 9:58 am, OK
47.44.1.124, Mar 24, 2018 5:06 pm, OK

As of yesterday, I could not find any other reports of similar issues on Google, and the eNom status page showed no problems.

I use a Ubiquiti UniFi Security Gateway Pro as my home router, and I have the dynamic DNS service in the UniFi controller configured to point to DNS-O-Matic, but it offered no additional hints as to the cause of the problem.

[Screenshot: UniFi controller dynamic DNS settings]

I contacted eNom support over chat, and they informed me they know there is an issue, and they said I should use the following format for the update:

http://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=%1&PW=%2&Zone=%3&DomainPassword=%4

%1 = my username in eNom
%2 = my password
%3 = my host and domain
%4 = my domain access password

This was interesting, I had looked at several eNom update scripts, even the eNom sample code, and they all used a different command format. I looked up the SetDNSHost documentation, and sure enough, it looks like eNom changed the API.

Old format:

https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&HostName=[host]&Zone=[domain]&DomainPassword=[password]&Address=[IP]

New format:

https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&PW=[LoginPassword]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]

eNom changed the meaning of the “Zone” parameter to be the fully qualified domain name, and they required the addition of the account username and password.

I tried the old format in my browser, and I got the same “Domain name not found” error. As I tried the URL, I noticed that HTTPS failed with a certificate mismatch. The certificate for https://dynamic.name-services.com points to reseller.enom.com.

Broken SSL, and including my account username and password, was not an acceptable option; additionally, I use 2FA on my account, so I had doubts that my password would even work. I tried the command as described in the documentation, but omitted my account password, and it worked.

https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]

I still find it very weird that this has been broken for so long, and that I could not find other reports of the problem on Google; are people not using eNom or eNom resellers with dynamic DNS?

I also find it disappointing that the status page does not reflect this problem, and that the SSL certificate domain does not match; one would expect more from a domain company.

Until eNom fixes the problem, or until DNS-O-Matic updates support for the new API format, I created a PowerShell script to update my domains, maybe it is useful for others with the same problem.

$UserName = 'eNom account username'
$HostNames = @('www', 'name1', 'name2', 'etc')
$DomainName = 'yourdomain.com'
$Password = 'Domain change password'
 
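# Look up my current public IP address using the DNS-O-Matic myip service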
$url = 'http://myip.dnsomatic.com'
$webclient = New-Object System.Net.WebClient
$result = $webclient.DownloadString($url)
Write-Host $result
$IPAddress = $result.ToString()
$webclient.Dispose()

# Ignore SSL error caused by dynamic.name-services.com SSL certificate pointing to a different domain
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = {$true}
$webclient = New-Object System.Net.WebClient
foreach ($hostname in $HostNames)
{
    # https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&HostName=[host]&Zone=[domain]&DomainPassword=[password]&Address=[IP]
    # https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=[LoginName]&Zone=[FQDN]&DomainPassword=[Password]&Address=[IP]
    $url = "https://dynamic.name-services.com/interface.asp?Command=SetDNSHost&UID=$UserName&Zone=$hostname.$DomainName&DomainPassword=$Password&Address=$IPAddress"
    Write-Host $url
    $result = $webclient.DownloadString($url);
    Write-Host $result
}
$webclient.Dispose()
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = $null

 

CrashPlan throws in the towel … for home users

Today CrashPlan, my current online backup provider, announced on Facebook, of all places, that they threw in the towel, and will no longer provide service to home users. The backlash was heated, and I found the CEO’s video message on the blog post rather condescending.

I’ve been a long time user of online backup providers, and many have thrown in the towel, especially when free file sync from Google and Microsoft offers ever expanding capabilities and more and more free storage. Eventually even the cheapest backup storage implementation becomes expensive, when compared to a cloud provider, and not profitable as a primary business.

I’ve been using CrashPlan’s unlimited home plan for quite some time now; they were one of a few providers, today none, that were reasonably priced, allowed unlimited storage, and supported server class OS’s. But I could see the writing on the wall: they split the home and business Facebook accounts, they split the website, the home support site has not seen activity in ages, they made major improvements to the enterprise backup agent, switching to a much leaner and faster C++ agent, while the home agent remained the old Java app with its many shortcomings, and there were vague rumors on the street of a home business selloff attempt.

The transition offered a free switch to the small business plan, for the remaining duration of the home subscription, plus 3 months, and then a 75% discount on next year’s plan. For my account, this means free CrashPlan Pro until 12 June 2018, then $2.50 per month until 12 June 2019, and then $10.00 per month.

I’ve switched to the Pro plan; as promised, the agent updated itself, going from the old Java app to the new C++ agent, the already backed up data was retained without needing to be backed up again, and all seems well, for now…

Razer BSOD When Driver Verifier is Enabled


Not too long ago I complained about Razer’s poor UX and support; this time it is a BSOD in one of their drivers, and the forever crashing Razer Stargazer camera software.

I’ve been looking for a Windows Hello capable webcam, and the Razer Stargazer, based on Intel RealSense technology, looked promising. The device is all metal and tactical looking, but the software experience is so buggy: install this, install that, then crash after crash after crash. I ended up returning it for a refund, and got a Logitech BRIO instead; the BRIO is cheaper, and works great.

A couple of days ago I was greeted with a BSOD on one of my test machines, a crash in the RZUDD.SYS “Razer Rzudd Engine” driver, part of the Razer Synapse software. What makes this interesting is that the issue seems to be triggered by having Driver Verifier enabled.

[Photo: the BSOD on the test machine]

One may be tempted to say do not enable Driver Verifier, but the point of Driver Verifier is to help detect bugs in drivers, and passing it is a basic requirement for driver certification. Per the WinDbg analysis, this appears to be a memory corruption bug. After some searching, I found that the Driver Verifier BSOD has been reported by other users, with no acknowledgement, and no fix forthcoming. I contacted Razer support, and not surprisingly, they suggested uninstall and reinstall. I tried the community forums, and I was just pointed back to support.

FAULTING_IP:
rzudd+28c80
...
DEFAULT_BUCKET_ID:  CODE_CORRUPTION
...
PROCESS_NAME:  RzSynapse.exe
...
STACK_TEXT:
nt!KeBugCheckEx
nt!MiSystemFault+0x12e69c
nt!MmAccessFault+0xae6
nt!KiPageFault+0x132
rzudd+0x28c80
rzudd+0x218d4
rzudd+0x7a9f
Wdf01000!FxIoQueue::DispatchRequestToDriver+0x1bf [minkernel\wdf\framework\shared\irphandlers\io\fxioqueue.cpp @ 3325]
Wdf01000!FxIoQueue::DispatchEvents+0x3bf [minkernel\wdf\framework\shared\irphandlers\io\fxioqueue.cpp @ 3125]
Wdf01000!FxPkgIo::DispatchStep1+0x53e [minkernel\wdf\framework\shared\irphandlers\io\fxpkgio.cpp @ 324]
Wdf01000!FxDevice::DispatchWithLock+0x5a5 [minkernel\wdf\framework\shared\core\fxdevice.cpp @ 1430]
nt!IovCallDriver+0x245
...
FAILURE_BUCKET_ID:  MEMORY_CORRUPTION_LARGE
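
For reference, Driver Verifier can be pointed at just this one driver from an elevated prompt; a minimal sketch (each change requires a reboot, and /reset turns it off again):

verifier /standard /driver rzudd.sys   # enable the standard verifier checks for just the Razer driver
verifier /querysettings                # confirm which drivers are currently being verified
verifier /reset                        # turn Driver Verifier off again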

I am done with Razer, exciting promises for technology on paper, great looking hardware, terrible support, terrible software.

Razer Shoddy Support and Bad Software UX

This post is just me venting my frustration at Razer’s poor software user experience, and their shoddy support practices. I’m writing this after I just had to go and find a working mouse, so I could click a button on a dialog that had no keyboard navigation support.

I’ve been using Razer keyboards and mice for some time, love them, their software not so much. I had to replace an aging ThinkPad, and the newly released Razer Blade Stealth looked like a great candidate, small and fast, reasonably priced, should be perfect, well, not so much.

I keep my monitors color calibrated, and I cringe whenever I see side-by-side monitors that clearly don’t match, or when somebody creates graphic content (yes, you graphic artists using MacBooks to create content for PC software without proper color profiles) that looks like shades of vomit on a projector or a cheap screen, but I digress. My monitor of choice is NEC and their native SpectraView color calibration software. Unfortunately, the Blade, with its lower end Intel graphics processor and HDMI port, does not support DDC/CI, so no ability to color calibrate my monitor. My main monitor is a NEC MultiSync EA275UHD 4K monitor, and the internal Intel graphics processor is frustratingly slow on this high resolution display. And the HDMI connectivity would drop out whenever the monitor went into power saving mode. Why not use a more standard mini-DisplayPort connector? It would not solve the speed problem, but it would at least have resolved the connection reliability and allowed for proper color calibration.

To solve the problem, I decided to get a Razer Core with an EVGA GeForce GTX 1070 graphics adapter. The Core is an external USB and network dock, with a PSU and PCIe connector for a graphics card, all connected to the notebook by Thunderbolt 3 over a (too short) USB-C cable. I connected my monitor to the GTX 1070 DisplayPort connector, connectivity was fine, I could color calibrate my monitor, and the display performance with the GTX 1070 was fast, great. By the way, JayzTwoCents has a great video on the performance of external graphic cards.

But, my USB devices connected to the dock kept on dropping out. I found several threads on the Razer support forum complaining about the same USB problems, and the threads are promptly closed with a contact support message. I contacted Razer support and they told me they are working on the problem, and closed my ticket. I contacted them again stating that closing my ticket did not resolve the problem, and they said my choice is to RMA the device, with no known solution, or wait, and then they closed my ticket again. To this day this issue has not been resolved, and I have to connect my USB devices directly to the notebook, defeating the purpose of a dock. They did publish a FAQ advising users to not use 2.4GHz WiFi, but to stick with 5GHz due to interference issues; so much for their hardware testing.

Now, let’s talk about their Razer Synapse software, the real topic of this post. The software is used to configure all the Razer devices and sync the device preferences across computers with a cloud account, a neat idea. The color scheme and custom drawn controls of this software match their edgy “brand”, but their choice of a thin grey font on a dark background fails in my usability book when used in a brightly lit office space.

[Screenshot: Razer Synapse UI]

Whenever Windows 10 updates, the stupid Synapse software pops up while the install is still going; if you say yes, install now, then as expected the install fails because Windows is still installing. I logged the issue with Razer support, and they told me it is behaving as designed; really, designed to fail.

[Screenshot: Synapse update prompt during a Windows 10 update]

So, today the Synapse software, again, prompts me to update, a frequent occurrence, and my mouse dies during the update, presumably because they updated the mouse driver, but this time I am prompted with a reboot required dialog. Dead mouse, no problem, have keyboard, tab over, wait, no keyboard navigation on the stupid owner drawn custom control dialog, no way to interact with the dialog without a mouse, just fail.

[Screenshot: Synapse reboot required dialog]

Moral of the story, UX is important people, and I should just stick with ThinkPad or Microsoft Surface Book hardware, costs more, but never disappoints.

Circumventing ThinkPad’s WiFi Card Whitelisting

What started as a simple Mini PCI Express WiFi card swap on a ThinkPad T61 notebook, turned into deploying a custom BIOS in order to get the card to work.

I love ThinkPad notebooks, they are workhorses that keep on going and going. I always keep my older models around for testing, and one of my old T61’s had an Intel 4965AGN card that worked fine with Windows 10, until the release of the Anniversary / Redstone 1 (RS1) update. After the RS1 update, WiFi would either fail to connect or randomly drop out. The 4965AGN card is not supported by Intel on Win10, and the internet is full of problem reports about Win10 and 4965AGN cards.

Ok, no problem, I’ll just get a cheap, reasonably new Mini PCIe WiFi card with Win10 support, and swap the card. I got an Intel 3160 dual band 802.11ac card and mounting bracket for about $20. The 3160 is a circa 2013 card with Win10 support. I installed the card, booted, and got a BIOS error: 1802: Unauthorized network card is plugged in.

This led me to the discovery of ThinkPad hardware whitelisting, where the BIOS only allows specific cards to be used, which led me to Middleton’s BIOS, a custom T61 BIOS that removes the hardware whitelisting and enables SATA-2 support. I found working download links to the v2.29-1.08 Middleton BIOS here.

The BIOS update is packaged as a Win7 x86 executable or DOS bootable ISO image. As I’m running Win10 x64, and I could not find any CD-R discs around, I used Rufus to create a bootable DOS USB key, and I extracted the ISO contents using 7-Zip to a directory on the USB key. The ISO is created using a bootable 1.44MB DOS floppy image, and AUTOEXEC.BAT launches “FLASH2.EXE /U”, I created a batch file that does the same.

I removed the WiFi card, booted from USB, ran the flash, and got an error 1, complaining that flashing over the LAN is disabled. Ok, I enabled flashing the BIOS over the LAN in the BIOS, and rebooted.

I ran the update again, and this time I got error 99, complaining that BitLocker is enabled and to temporarily disable BitLocker. I did not have BitLocker enabled, so I removed the hard drive and tried again; same error. It must be something in the BIOS, so I disabled the security chip in the BIOS and tried again; the update started, but a minute or so later the screen went crazy with INVALID OPCODE messages.

Hmm, maybe the updater does not like the FreeDOS boot image used by Rufus. Ok, let me create an MS-DOS USB key, uhh, on Win10; that turned out to be near impossible. Win10 does not include MS-DOS files, Rufus does not support custom locations for MS-DOS files, nor does it support getting them from floppy or CD images (readily available for download), the HP USB Disk utility complains my USB drive is locked, and writing raw images to USB results in a FAT12 disk structure that is too small to use. I say near impossible because I gave up, and instead went looking for an existing MS-DOS USB key I had made a long time ago. I am sure with a bit more persistence I could have found a way to create MS-DOS bootable USB keys on Win10, but that is an exercise for another day.

Trying again with a MS-DOS USB key, and voilà, BIOS flashed, and WiFi working.

I am annoyed that I had to go to this much trouble to get the new WiFi card working, but the best part of the exercise turned out to be the SATA-2 speed increase. This machine has an SSD drive that I always found to be slow, but with the SATA-2 speed bump in Middleton’s BIOS, the machine is noticeably snappier.

A couple of hours later, my curiosity got the better of me, and I made my own version of Rufus that allows formatting of MS-DOS USB drives on Win10. In the process I engaged in an interesting discussion with the author of Rufus. I say interesting, but it was rather frustrating: Microsoft removed the MS-DOS files from Win10, and Rufus refuses to add support for sourcing the MS-DOS files from a user specified location, citing legal reasons and my reluctance to first report the issue to FreeDOS. Anyway, can code, have compiler, if have time, will solve problem.

UPS Battery Replacement Turns Into Unrecoverable Firmware Update

Two lessons learned; do not trust scheduled battery tests, and leave working firmware be!

As the saying goes, if it is not broken do not fix it, especially when it comes to firmware.

I have a couple of APC Smart-UPS’s at my house, the same models I like to use at the office. I use the SMT750 models with AP9631 Network Management Cards (NMC). The problem started when we had a short power outage, and the UPS that powers the home network switch, cell repeater, alarm internet connection, and PoE IP cameras unexpectedly died. A battery replacement led to the opportunity to do a UPS firmware update, which led to an unrecoverable firmware update.

It started when I woke up one morning and it was obvious the power had been out: the first indicator is the kitchen appliances with blinking clocks, the second is the numerous power failure email notifications, and the emails that stood out were from the alarm system saying it lost power and internet connectivity. The alarm has its own backup battery, the network switches and FiOS internet have their own battery backups, and the outage was only about 4 minutes. So how is it that the UPS died, killing the switch and disconnecting the internet, especially when the outage was only 4 minutes, and the typical runtime on the UPS should be about an hour?

Here is the UPS outage log produced by the NMC card:

10/11/2016 06:53:05 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 06:12:27 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 06:12:24 Device UPS: Restored the local network management interface-to-UPS communication. 0x0101
10/11/2016 06:12:12 System Network service started. IPv6 address FE80::2C0:B7FF:FE98:9BAF assigned by link-local autoconfiguration. 0x0007
10/11/2016 06:12:10 Device Environment: Restored the local network management interface-to-integrated Environmental Monitor (Universal I/O at Port 1) communication. 0x0344
10/11/2016 06:12:09 System Network service started. System IP is 192.168.1.11 from manually configured settings. 0x0007
10/11/2016 06:12:02 System Network Interface coldstarted. 0x0001
10/11/2016 05:45:36 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 05:45:36 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 05:45:35 Device UPS: The output power is turned off. 0x0114
10/11/2016 05:45:35 Device UPS: The graceful shutdown period has ended. 0x014F
10/11/2016 05:45:35 Device UPS: No longer on battery power. 0x010A
10/11/2016 05:45:35 Device UPS: Main outlet group, UPS Outlets, has been commanded to shutdown with on delay. 0x0174
10/11/2016 05:45:35 Device UPS: The power for the main outlet group, UPS Outlets, is now turned off. 0x0135
10/11/2016 05:45:18 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 05:41:36 Device UPS: On battery power in response to rapid change of input. 0x0109

I could see from the log that the UPS battery power ran out within 4 minutes: 05:41:36 on battery, 05:45:18 battery too low, 05:45:35 output turned off. The UPS status page was equally puzzling; the load was at 9.7%, yet the reported runtime was only 5 minutes, impossible.

Here is the status screenshot:

[Screenshot: UPS status page]

I ran a manual battery test and the test passed, but from the log it was clear the battery had failed. I have bi-weekly scheduled battery tests for all the UPS’s, and never received a failure report. So what is the point of a battery test if the test comes back with no problems, yet it is clear to me from the logs that the battery failed?

Here is the log:

10/11/2016 18:07:06 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 18:07:06 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 18:07:05 Device UPS: Self-Test passed. 0x0105
10/11/2016 18:06:59 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 18:06:59 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 18:06:58 Device UPS: Self-Test started by management device. 0x0137
10/11/2016 18:06:56 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107

I asked for advice on the APC forum, no reply yet, and I ordered a replacement RBC48 battery. I received the battery, installed it, and the reported runtime is back to normal, 1 hour 48 minutes.

Here is a status screenshot with the new battery:

[Screenshot: UPS status with the new battery]

Here is where I should have stopped and called it a day, but no. I knew that the UPS’s were on old firmware, and I decided to use this opportunity to update the firmware. I’d normally let firmware be, unless I have a good reason to update, but I convinced myself that the new firmware readme had some fixes that may help with the false pass on the battery test:

Release Notes (UPS09.3):
========================
2. Improved self-test logging for PCBE / NMC.
11. Repaired an occasional math error in the battery replacement date algorithm that resulted in incorrect dates.

I updated the UPS where I had just replaced the battery; the instructions are pretty simple. The only hassle is that I have to bypass the network equipment to mains power, so I can turn the UPS outputs off while updating the firmware while maintaining network connectivity.

I did the same for my office UPS, with the PC and office switch on mains power, and when I powered down the output, I made sure not to notify PowerChute Network Shutdown (PCNS) clients, as my PC had the PowerChute client installed to receive power state via the network. I started the firmware update over the network, and a few seconds later I got a Windows message that a shutdown had been initiated by PCNS, what? I sat there in frustration, with nothing to do but watch my PC shut down while it was still delivering the firmware update.

On rebooting my PC, the NMC comes up, but reports the UPS has stopped communicating. I pull AC power from the UPS, no change; I also pull the batteries, and when I plug the batteries and mains back in, beeeeeeeeeep. The NMC now reports no UPS found, while the UPS LCD panel reports all is fine. And still beeeeeeeeeep, and no way to stop the beeeeeeeeeep.

Here is the NMC status page:

[Screenshot: NMC status page reporting no UPS found]

I try to do the Firmware Upgrade Wizard update via USB: I plug a USB cable in, the PC sees the UPS and reports a critical condition, but the upgrade wizard reports no UPS found on USB.

Here is the wizard error page:

[Screenshot: Firmware Upgrade Wizard error]

So, here I am, stuck with a bricked UPS, lesson learned, actually two lessons learned; do not trust scheduled battery tests, and leave working firmware be!