Recovering the Firmware on a Supermicro BPN-SAS3-846EL1 Backplane

In a previous adventure I replaced an Adaptec HBA with a LSI SAS3 HBA, and the chassis drive bay LED’s stopped working. I suspect the LSI card does not play nice with the SGPIO sideband controller, and I decided to replace the chassis with one similar to my SC846 chassis, where the LSI card and drive bay LED’s do work fine.

What should have been a simple replacement turned into quite a recovery operation.

Since I now had SAS3 HBA’s in both my servers, I really wanted to get SAS3 backplanes, but I did not want to pay SAS3 chassis prices. I found a refurbished SuperChassis 846E16-R1200B chassis and two refurbished Supermicro BPN-SAS3-846EL1 backplanes on eBay. The one SAS3 backplane would replace the SAS2 backplane in my existing SC846, and the second SAS3 backplane would replace the SAS2 backplane in the newly purchased SC846. The combination of the 24 bay 4U chassis with a SAS2 backplane and a replacement SAS3 backplane is much cheaper compared to any native SAS3 chassis I could find.

I do understand that using SATA3 drives on a SAS3 backplane will not perform like SAS3 drives, but with multipath the aggregate throughput can still outperform SAS2.

I received my chassis and my backplanes. The chassis was clean but a bit dinged in one corner, and the expanders were clean, but the metal frames were showing a little rust. I had no idea how old the firmware on the backplanes was, so I contacted Supermicro support to ask for the current firmware. After asking for my serial numbers, Supermicro sent me the latest firmware for my hardware. The firmware update instructions were included in the “ReleaseNote.txt” file that came with the firmware.

I removed the motherboard from the old chassis and installed it in the new SC846. I removed the SAS2 backplane, and installed the SAS3 backplane in its place. The power cable layout on the SAS3 backplane is a bit different, and I had to use a few molex power splitters to extend the power cables to reach the power plugs on the SAS3 backplane. The standard rails are too long to fit in my rack, and the short rails are too short for the chassis, so I again used short outer rails and standard inner rails.

I powered the machine up through the remote IPMI KVM, all looked good, and I booted into my Ubuntu Server USB stick so I could SSH into the box and update the firmware.
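
Before running the update tools, it is worth confirming that the expander’s SES device is visible to the OS. A minimal sanity check, assuming the lsscsi and sg3_utils packages are installed and the expander enumerated as /dev/sg0 (the device node is an assumption):

# List SCSI devices with their sg nodes and look for the enclosure/expander
lsscsi -g | grep -i enclosu
# Ask the SES device for its supported diagnostic pages to confirm it responds
sg_ses /dev/sg0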

The instructions from “ReleaseNote.txt” say:

How to Flash Firmware
-------------------------
Under Linux/Windows Environment to use CLIXTL (ver.6.00)
1. use "CLIXTL -l" to show SAS addresses
2. use "CLIXTL -f all -t <SAS address> -d <firmware file>" to update firmware
3. use "CLIXTL -f 3 -t <SAS address> -d <MFG file> -r" to update MFG and reset expander

example:
CLIXTL -l
CLIXTL -f all -t 500304800000007F -d SAS3-EXPFW_66.16.11.00.fw
CLIXTL -f 3 -t 500304800000007F -d BPN-SAS3-846EL-PRI_16_11.bin -r

The existing firmware on the expanders was v66.06.01.02 and MFG v06.02, while the new firmware was v66.16.11.00 and MFG v16.11.

Firmware Update History
-------------------------
01. migrate expander firmware to phase 16 
02. enhance TMP, VOL, FAN, and PWS status in SES pages
03. present version information of current running firmware
04. Dynamic SES page element presentation
05. move BMC IP to SCSI network inquiry 
06. support I2C R/W as slave to let BMC to identify platforms
07. redundant side has sensor information 
08. firmware rewrite and optimization

I did the following:

$ sudo ./CLIXTL -i -t 5003048001B24DBF
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 5003048001B24DBF
ENCLOSURE ID - 5003048001B24DBF
ENCLOSURE INFORMATION:
PLATFORM NAME - SMC846ELSAS3P
SERIAL NUMBER -
VENDOR ID - LSI
PRODUCT ID - SAS3x40
VERSION INFORMATION:
FLASH REGION 0 - 66.06.01.02
FLASH REGION 1 - 66.06.01.02
FLASH REGION 2 - 66.06.01.02
FLASH REGION 3 - 06.02
DEVICE INFORMATION:
DEVICE NAME - /dev/sg0
BMC IP - NULL

$ sudo ./CLIXTL -f all -t 5003048001B24DBF -d SAS3-EXPFW_66.16.11.00.fw
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Firmware Region 0 - Finished
Firmware Region 1 - Finished
Firmware Region 2 - Finished
Please reset expander to activate

pieter@ubuntuusb:~/SAS3$ sudo ./CLIXTL -f 3 -t 5003048001B24DBF -d BPN-SAS3-846EL-PRI_16_11.bin -r
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Error, incompatible file type or directory

And the expander dropped and never came back up.

$ sudo ./CLIXTL -l
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Error, no enclosure has been found

I can hear the comments now: why risk updating the firmware if it is not broken? True, but I didn’t know if the old firmware would work properly, and I’d rather start fresh. Do also note that I did the update on my secondary server; my primary server is still running unmodified, so no interruption of home services or work experiments.

I still had the second SAS3 backplane, so I replaced the “bricked” one, leaving the firmware as is, and brought my Unraid server back up, and all appeared fine. At least I had two working servers, giving me time to try and recover the backplane.

I sent Supermicro support an email asking for help, but since it was the weekend I had to wait, so I did my own research. I found a forum post from a user who recovered a “bricked” SAS2 expander via the factory serial port, and I decided to give it a try.

I used the FTDI USB RS232 adapter I normally use for Arduino programming; the pin connections for PRI-SDB / 8 are:

PRI-SDB : 1 : TX  -> RS232 : RX
PRI-SDB : 2 : GND -> RS232 : GND
PRI-SDB : 3 : RX  -> RS232 : TX
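
If you do the serial recovery from a Linux box instead of Windows, a quick way to confirm the FTDI adapter enumerated before pointing the tools at it (device names are typical, not guaranteed):

# Confirm the USB serial adapter was detected and note its device node
dmesg | grep -i ftdi
ls -l /dev/ttyUSB*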

The current XTools v6.10.C CLI does not include COM support, at least none that I could find in the documentation or CLI help, so I used the older v1.4 version.

>g3Xflash.exe -s com4 get avail
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface.....
Expander: Unknown (SAS_3X_40)
1) Unknown (SAS_3X_40) (00000000:00000000)

A good sign: the COM port worked and the expander hardware was detected, but it did not have an address.

I flashed the firmware and the MFG data:

>g3Xflash.exe -s com4 down fw SAS3-EXPFW_66.16.11.00.fw 0
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface..
Expander: Unknown (SAS_3X_40)
Expander Validation: Passed
Checksum: Passed
Target Firmware Region: 00
Current Version: 255.255.255.255
Replacement Version: 66.16.11.00
Image Validation: Passed
Pre-Validation of image is successful.
Are you sure to download file to expander?(y/n):y
Downloading File...........................................................................Download Complete.
Post-validating........................................................Post-Validation of image is successful.
Download Successful.

>g3Xflash.exe -s com4 down mfg BPN-SAS3-846EL-PRI_16_11.bin 3
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface.....
Expander: Unknown (SAS_3X_40)
Image Validation: Passed
Checksum: Passed
Reading MFG version from flash...Unable to retrieve version.
Replacement Version: 10.0b
Pre-Validation of image is successful.
Are you sure to download file to expander?(y/n):y
Downloading File............................Download Complete.
Post-validating.........Post-Validation of image is successful.
Download Successful.

I reset the expander, the LED’s now did a test pattern that they did not do before, and things looked good:

>g3Xflash.exe -s com4 reset exp
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface....................
Expander: SC846-P (SAS_3X_40)
Are you sure you want to reset Expander?(y/n):y
Expander reset successful.

>g3Xflash.exe -s com4 get avail
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface..
INFO: Bootstrap is not present on board.
Downloading the Bootstrap
................................................................Download Bootstrap Complete.
..................
Expander: SC846-P (SAS_3X_40)
1) SC846-P (SAS_3X_40) (50030480:0000007F)

>g3Xflash.exe -s com4 get exp

********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface....................
Expander: SC846-P (SAS_3X_40)
Reading the expander information................
Expander: SC846-P (SAS_3X_40) C1
SAS Address: 50030480:0000007F
Enclosure Logical Id: 50030480:0000007F
Component Identifier: 0x0232
Component Revision: 0x03

>g3Xflash.exe -s com4 get ver 0
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface....................
Expander: SC846-P (SAS_3X_40)
Firmware Region Version: 66.16.11.00

Everything looked good, except the SAS address defaulted to 50030480:0000007F.

The firmware “ReleaseNote.txt” file states that the v6.00 CLIXTL tool can change the SAS address, but the only version on the Supermicro site is the v6.10.C version, which does not support changing the SAS address.

How to modify SAS address
-------------------------
Under Linux/Windows Environment to use CLIXTL (ver.6.00)
1. use "CLIXTL -l" to show SAS addresses
2. use "CLIXTL -s <new SAS address> -t <current SAS address> -r" to change SAS address and reset expander

example:
CLIXTL -l
CLIXTL -s 500304801234567F -t 500304800000007F -r

The v1.4 GUI does support changing the SAS address. It appears that the GUI dynamically creates a MFG image (I could see a BIN file get created in the directory), but after it changed the address, the backplane was back to a borked state, and I had to repeat the recovery process.
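
If you want to keep a copy of that dynamically generated MFG image, something like the sketch below could capture it as it is written, assuming the tools run on a Linux host, the file lands in the working directory, and inotify-tools is installed (on Windows, simply watching the folder works):

# Copy any new .bin file the GUI writes into the current directory
inotifywait -m -e close_write --format '%f' . | while read -r f; do
  [[ $f == *.bin ]] && cp -- "$f" "captured_$f"
done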

By the next week I heard back from Supermicro, and they confirmed the instructions from the “ReleaseNote.txt” file were wrong, and I should use the instructions from the “Command-line Xtool 6.10.C.pdf” file.

Wrong, will bork the MFG data:
CLIXTL -f 3 -t 5003048001B24DBF -d BPN-SAS3-846EL-PRI_16_11.bin -r

Right:
CLIXTL -c -t 5003048001B24DBF -d BPN-SAS3-846EL-PRI_16_11.bin

Better, update firmware and MFG and retain settings:
CLIXTL -a usc -t 5003048001B24DBF -d ~/

I used the all-in-one update method on the server with the backplane still running the original firmware, and it updated without issue:

# ./CLIXTL -a usc -t 500304800914683F -d ~/CLIXTL6.10.C_Linux/
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
Firmware Region 0 - Finished
Firmware Region 1 - Finished
Firmware Region 2 - Finished
MFG page Region 3 - Finished
New configuration is uploaded successfully
Please reset expander to activate

[Reboot]

# ./CLIXTL -i -t 500304800914683F
================================================================================
COMMAND-LINE INTERFACE XTOOL
version 6.10.C
Supermicro Computer ,Inc.
================================================================================
UNIT SPECIFIC INFORMATION:
SAS ADDRESS - 500304800914683F
ENCLOSURE ID - 500304800914683F
ENCLOSURE INFORMATION:
PLATFORM NAME - SMC846ELSAS3P
SERIAL NUMBER -
VENDOR ID - LSI
PRODUCT ID - SAS3x40
VERSION INFORMATION:
FLASH REGION 0 - 66.16.11.00
FLASH REGION 1 - 66.16.11.00
FLASH REGION 2 - 66.16.11.00
FLASH REGION 3 - 16.11
DEVICE INFORMATION:
DEVICE NAME - /dev/sg11
BMC IP - NULL

I now had one perfectly updated backplane preserving all the original MFG data, and one backplane with default MFG data. I wanted to apply the MFG data from the good backplane to the backplane with the default values.

I downloaded the firmware and manufacturing data from the good backplane using the v1.4 tools (not supported in current CLIXTL):

./g3Xflash -i get avail
./g3Xflash -y -i 500304800914683F up fw up_fw_loader_0.fw 0
./g3Xflash -y -i 500304800914683F up fw up_fw_loader_1.fw 1
./g3Xflash -y -i 500304800914683F up fw up_fw_loader_2.fw 2
./g3Xflash -y -i 500304800914683F up mfg up_mfg_loader.bin 3

The downloaded files are larger than the originals and are padded with 0xFF or 0x00. I trimmed the MFG file to the right size, and modified the SAS address in two places:

[Screenshot: editing the SAS address in two places in the MFG file]
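
For reference, a minimal sketch of how the trim and patch could be done with standard tools. The region size and the two offsets are placeholders, not the real values, and the byte order of the stored SAS address would need to be confirmed against the dump; the actual locations come from diffing the dump against the stock MFG .bin:

# Trim the padded dump to the assumed MFG region size (8 KiB is a placeholder)
truncate -s 8192 up_mfg_loader.bin
# Patch the SAS address 5003048001B24DBF at two assumed offsets (0x10 and 0x30)
printf '\x50\x03\x04\x80\x01\xB2\x4D\xBF' | dd of=up_mfg_loader.bin bs=1 seek=$((0x10)) conv=notrunc
printf '\x50\x03\x04\x80\x01\xB2\x4D\xBF' | dd of=up_mfg_loader.bin bs=1 seek=$((0x30)) conv=notrunc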

I tried to upload the modified MFG data:

>g3Xflash.exe -y -s com4 down mfg BPN-SAS3-846EL-PRI_16_11.bin 3
********************************************************************************
g3Xflash
LSI SAS Expander Flash Utility
Version: 2.0.0.0
Copyright (c) 2013 LSI Corporation. All rights reserved.
********************************************************************************
Initializing Interface.....
Expander: Unknown (SAS_3X_40)
Image Validation: Passed
Checksum: Failed

But the tool complains that the checksum failed. From the file diff we can see that more than just the SAS address changed; I assume there is some sort of checksum calculated over the data.
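
For reference, a byte-level comparison like the one below is enough to show the extra changed bytes; the file names are illustrative:

# List every differing byte between the original and edited MFG dumps
cmp -l up_mfg_original.bin up_mfg_modified.bin
# Or view the differences as hex dumps
diff <(xxd up_mfg_original.bin) <(xxd up_mfg_modified.bin)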

The v1.4 g3Xflash CLI help does reference XML options for converting between binary and XML MFG formats, but there are no instructions on how to use them. Like the serial recovery procedure, these tools are probably for internal use only, and I could find no public references.

>g3Xflash.exe -h
********************************************************************************
    g3Xflash
    LSI SAS Expander Flash Utility
    Version: 2.0.0.0
    Copyright (c) 2013 LSI Corporation.  All rights reserved.
********************************************************************************
SYNTAX:
    g3Xflash OPTIONS INTERFACE COMMAND
OPTIONS:
...
COMMAND:
...
        up mfg <mfg file> <region>
...
            In case of mfg upload using XML file, command syntax changes
            as below. XML file needs to be specified at two places.
            e.g.
            "g3Xflash -x <xml file> <interface> up mfg <xml file> 3"

Supermicro support confirmed the current tools cannot change the SAS address; they would not supply the older version of the tools, and recommended I either send the backplane in for service or allow them to remotely SSH into the machine and change it for me. It is a bit disappointing that something so simple is made so complicated, and really way too much trouble.

Since I only had one expander in the chassis, there would be no issues using a default SAS address, and I decided to leave it as is. I replaced the other SAS2 backplane with the recovered SAS3 backplane, and the expander and all drives were back online.

If anybody knows how to update the SAS address, or has a copy of the v6.00 CLIXTL tools that supposedly can change the address, please do let me know.

Unraid repeat parity errors on reboot

This post started as a quick experiment, but after hardware incompatibilities forced me to swap SSD drives, and I subsequently lost a data volume, it turned into a much bigger effort.

My two Unraid servers have been running nonstop without any issues for many months; last I looked the uptime on v6.7.2 was around 240 days. We recently experienced an extended power failure, and I noticed 5 parity errors on both servers after they were restarted.

Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934168
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934176
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934184
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934192
Jan 1 06:09:23 Server-1 kernel: md: recovery thread: PQ corrected, sector=1962934200

Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan 1 04:42:39 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

I initially suspected that a dirty shutdown caused the corruption, but my entire rack is on a large UPS, and the servers are configured, and tested, to cleanly shut down on a low battery condition. Unfortunately Unraid does not persist logs across reboots, so it was impossible to verify the shutdown behavior from the logs. Unraid logs to memory rather than to the USB flash drive to prevent flash wear, but I think this needs to be at least configurable, as no logs means troubleshooting after an unexpected reboot is near impossible. Yes, I know I can enable the Unraid syslog server, and I can redirect syslog to write to the flash drive, but syslog is not as reliable or complete as native logging, especially during a shutdown; more importantly, syslog was not enabled, so there were no shutdown logs.
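
For anyone setting this up before the next incident, the underlying rsyslog rule is a one-liner; the collector address is a placeholder, the include path assumes a stock rsyslog layout, and Unraid exposes the same thing through its syslog server settings:

# Forward all syslog traffic to a remote collector (@@ = TCP, @ = UDP)
echo '*.* @@192.168.1.10:514' >> /etc/rsyslog.d/90-remote.conf
# Restart rsyslog afterwards; the exact service command depends on the distribution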

I could not entirely rule out a dirty shutdown, but I could test a clean reboot scenario. I restarted from within Unraid, ran a parity check, and the exact same 5 parity errors were back; I ran a parity check again, and it came back clean. It takes more than a day to run a single parity check, so this is a cumbersome and time consuming exercise. It is very suspicious that it is exactly the same 5 sectors, every time.

Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934168
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934176
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934184
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934192
Jan  3 10:03:07 Server-2 kernel: md: recovery thread: P corrected, sector=1962934200

I searched the Unraid forums, and found other reports of similar repeat parity errors, in some instances attributed to a Marvell chipset, a Supermicro AOC-SASLP-MV8 controller, or the SASLP2 driver. My systems use Adaptec RAID cards, 7805Q SAS2 and 81605ZQ SAS3, in HBA mode, so no Marvell chipset and no SASLP2 driver, but the same symptoms.

An all too common forum reply to storage problems is to switch to a LSI HBA, and I got the same reply when I reported the parity problem with my Adaptec hardware.

I was sceptical: causation vs. correlation. As an example, take the SQLite corruption bug introduced in v6.7; for the longest time it was blamed on hardware or 3rd party apps, but it eventually turned out to be an Unraid bug.

Arguing my case on a community support forum is not productive, and I just want the parity problem resolved, so I decided to switch to LSI HBA cards. I really do have a love hate relationship with community support, especially when I pay for a product, like Unraid or Plex Pass, but have no avenue to dedicated support.

I am no stranger to LSI cards, and the problems flashing from IR to IT mode firmware, so I got my LSI cards preflashed with the latest IT mode firmware at the Art of Server eBay store. My systems are wired with miniSAS HD SFF-8643 cables, and the only cards offered with miniSAS HD ports were LSI SAS9340-8i ServeRAID M1215 cards. I know the RAID hardware is overkill when using IT mode, and maybe I should have gone for vanilla LSI SAS 9300-8i cards, especially when the Unraid community was quick to comment that a 9340 is not a “true” HBA.

I replaced the 7805Q with the SAS9340 in Server-2, and noticed that none of my SSD drives showed up in the LSI BIOS utility, only the spinning disks showed up. I put the 7805Q card back, and all the drives, including the SSD drives, showed up in the Adaptec BIOS utility. I replaced the 81605ZQ with the SAS9340 in Server-1, and this time some of the SSD’s showed up. None of my Samsung EVO 840 SSD’s showed up, but the Samsung Pro 850 and Pro 860 SSD’s did show up. I again replaced the 7805Q in Server-2 with the SAS9340, but this time I added a Samsung Pro 850, and it did show up.

The problem seemed to be limited to my Samsung EVO drives. I reached out to Art of Server for help, and although he was very responsive, he had not seen or heard of this problem. I looked at the LSI hardware compatibility list, and the EVO drives were listed. Some more searching, and I found a LSI KB article mentioning that TRIM is not supported on Samsung Pro 850 drives. It seems that the LSI HBA’s need drives that support DRAT (Deterministic Read After TRIM, which hdparm shows as “Data Set Management TRIM supported (limit 8 blocks)”) and RZAT (Deterministic read ZEROs after TRIM). The Wikipedia article on TRIM lists specific drives with faulty TRIM implementations, including the Samsung 840 and 850 (without specifying Pro or EVO), and the Linux kernel has special handling for Samsung 840 and 850 drives.

	/* devices that don't properly handle queued TRIM commands */
	{ "Micron_M500IT_*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Micron_M500_*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Crucial_CT*M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Micron_M5[15]0_*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Crucial_CT*M550*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Crucial_CT*MX100*",		"MU01",	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 840*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "Samsung SSD 850*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },
	{ "FCCT*M500*",			NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
						ATA_HORKAGE_ZERO_AFTER_TRIM, },

This is all still circumstantial, as it does not explain why the LSI controller would not detect the 840 EVO drives but would detect the 850 Pro drive, when both are listed as problematic, and both are included on the LSI hardware compatibility list. I do not have EVO 850’s to test with, so I cannot confirm whether the problem is limited to EVO 840’s.

I still had the original parity problem to deal with, and to verify that a LSI HBA would resolve the problem I needed a working Unraid system with a LSI HBA. Server-1 had two EVO 840’s, a 850 Pro, and a 860 Pro in the BTRFS cache volume. I pulled a Pro 850 and a Pro 860 drive from another system, and proceeded to replace the two EVO 840’s. Per the Unraid FAQ, I should be able to replace the drives one at a time, waiting for the BTRFS volume to rebuild. I replaced the first disk, and it took about a day to rebuild. I replaced the second disk using the same procedure, but something went wrong: my cache volume would not mount, and reported being corrupt.

Jan  6 07:25:41 Server-1 kernel: BTRFS info (device sdf1): allowing degraded mounts
Jan  6 07:25:41 Server-1 kernel: BTRFS info (device sdf1): disk space caching is enabled
Jan  6 07:25:41 Server-1 kernel: BTRFS info (device sdf1): has skinny extents
Jan  6 07:25:41 Server-1 kernel: BTRFS warning (device sdf1): devid 4 uuid 94867179-94ed-4580-ace4-f026694623f6 is missing
Jan  6 07:25:41 Server-1 kernel: BTRFS error (device sdf1): failed to verify dev extents against chunks: -5
Jan  6 07:25:41 Server-1 root: mount: /mnt/cache: wrong fs type, bad option, bad superblock on /dev/sdr1, missing codepage or helper program, or other error.
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7033): exit status: 32
Jan  6 07:25:41 Server-1 emhttpd: /mnt/cache mount error: No file system
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7034): umount /mnt/cache
Jan  6 07:25:41 Server-1 kernel: BTRFS error (device sdf1): open_ctree failed
Jan  6 07:25:41 Server-1 root: umount: /mnt/cache: not mounted.
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7034): exit status: 32
Jan  6 07:25:41 Server-1 emhttpd: shcmd (7035): rmdir /mnt/cache

In retrospect I should have known something was wrong when Unraid reported the array as stopped but I still saw lots of disk activity on the SSD drive bay lights. I suspect the BTRFS rebuild was still ongoing, or the volume still mounted, even though Unraid reported the array as stopped. No problem, I thought: I make daily data backups to Backblaze B2 using Duplicacy, and weekly Unraid (appdata and docker) backups, which are then backed up to B2. I recreated the cache volume and got the server started again, but my Unraid data backups were missing.
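
Two commands that could have confirmed whether the rebuild had actually finished before I pulled the next drive, assuming the cache pool is mounted at /mnt/cache:

# Is a balance/rebuild still running on the cache pool?
btrfs balance status /mnt/cache
# Are all member devices present and accounted for?
btrfs filesystem show /mnt/cache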

It was an oversight and a configuration mistake: I had configured my backup share to be cached, I ran daily backups of the backup share to B2 at 2am, and weekly Unraid backups to the backup share on Mondays at 3am. The last B2 backup was Monday morning at 2am, and the last Unraid backup was Monday morning at 3am. When the cache died, all data on the cache was lost, including the last Unraid backup, which never made it to B2. My last recoverable Unraid backup on B2 was a week old.

So a few key learnings: do not use the cache for backup storage, schedule offsite backups to run after onsite backups, and if the lights are still blinking don’t pull the disk.
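
The ordering lesson in particular is easier to enforce by chaining the jobs than by relying on the clock; a minimal sketch, with illustrative script paths:

# Run the onsite Unraid backup first, then push the backup share to B2,
# so the newest onsite backup always makes it offsite (paths are illustrative)
/boot/scripts/unraid_backup.sh && /boot/scripts/backup_share_to_b2.sh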

Once I had all the drives installed, I tested for TRIM support.

Samsung Pro 860, supports DRAT and RZAT:

root@Server-1:/mnt# hdparm -I /dev/sdf | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM

Samsung Pro 850, supports DRAT:

root@Server-2:~# hdparm -I /dev/sdf | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)

Samsung EVO 840, supports DRAT, but does not work with the LSI HBA:

root@Server-2:~# hdparm -I /dev/sdc | grep TRIM
* Data Set Management TRIM supported (limit 8 blocks)

The BTRFS volume consisting of 4 x Pro 860 drives reported trimming what looks like all disks, 3.2 TiB:

root@Server-1:~# fstrim -v /mnt/cache
/mnt/cache: 3.2 TiB (3489240088576 bytes) trimmed

The BTRFS volume consisting of 2 x Pro 860 + 2 x Pro 850 drives reported trimming what looks like only 2 disks, 1.8 TiB:

root@Server-2:~# fstrim -v /mnt/cache
/mnt/cache: 1.8 TiB (1946586398720 bytes) trimmed

In summary, Samsung EVO 840 no good, Samsung Pro 850 avoid, Samsung Pro 860 is ok.

Server-2 uses SFF-8643 to SATA breakout cables with sideband SGPIO connectors, controlling the drive bay lights. With the Adaptec controller the drive bay lights worked fine, but with the LSI the lights do not appear to work. I am really tempted to replace the chassis with a SAS expander, alleviating the need for the breakout cables, but that is a project for another day.

After I recreated the cache volume, I reinstalled the Duplicacy web container and tried to restore my now week-old backup file. I could not get the web UI to restore the 240GB backup file; either the session timed out or the network connection was dropped. I reverted to using the CLI, and with a few retries, eventually restored the file. It was disappointing to learn that the web UI must remain open during the restore, and that the CLI does not automatically retry on network failures. Fortunately Duplicacy does block-based restores and can resume restoring large files.

2020/01/06 07:59:01 Created restore session 1o8nqw
2020/01/06 07:59:01 Running /home/duplicacy/.duplicacy-web/bin/duplicacy_linux_x64_2.3.0 [-log restore -r 101 -storage B2-Backup -overwrite -stats --]
2020/01/06 07:59:01 Set current working directory to /cache/localhost/restore
2020/01/06 09:37:35 Deleted listing session jnji7l
2020/01/06 09:37:41 Invalid session
2020/01/06 12:07:57 Stopping the restore operation in session 1o8nqw
2020/01/06 12:07:57 Failed to restore files for backup B2-Backup-Backup revision 101 in the storage B2-Backup: Duplicacy was aborted
2020/01/06 12:07:57 closing log file restore-20200106-075901.log
2020/01/06 12:08:17 Deleted restore session 1o8nqw
Downloaded chunk 34683 size 13140565, 15.05MB/s 01:20:44 70.1%
Failed to download the chunk 
1ff9d2c082d06226b0d81019338d048bf5a4428827a3fc0d3f6f337d66fd7fa9: read tcp 192.168.1.113:49858->206.190.215.16:443: wsarecv: An existing connection was forcibly closed by the remote host.
...
Downloaded chunk 49473 size 2706939, 13.52MB/s 00:00:01 100.0%
Downloaded Unraid/2019-12-30@03.00/CA_backup.tar.gz (255834283004)
Restored C:\Users\piete\Downloads\Duplicacy to revision 101
Files: 1 total, 243982.58M bytes
Downloaded 1 file, 243982.58M bytes, 14795 chunks
Total running time: 01:30:23
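
Since the CLI does not retry by itself, a small wrapper loop is enough to ride out connection drops. It reuses the flags visible in the log above; the binary name, revision, and working directory are assumptions, and the repository must already be initialized:

# Keep re-running the restore until it succeeds; Duplicacy resumes partially
# restored large files, so each retry continues where the last one stopped
until ./duplicacy -log restore -r 101 -storage B2-Backup -overwrite -stats; do
  echo "restore interrupted, retrying in 60 seconds..."
  sleep 60
done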

I did lose my DW-Spectrum IPVMS running on an Ubuntu Server VM. I’ve known that I don’t have a VM backup solution, but the video footage is on the storage server not in the VM, video backups go to B2, and it is reasonably easy to recreate the VM. I am still working on a DW-Spectrum docker solution for Unraid, but as of today the VMS does not recognize Unraid mapped storage volumes.

After all this trouble, I could finally test a parity check after reboot with the LSI HBA.  With the system up I ran a parity check, all clear, rebooted, ran the parity check again, and … no errors. I performed this operation on both servers, no problems.

I was really sceptical that the LSI would work where the Adaptec failed, and this does not rule out Unraid as the cause, but it does show that Unraid with the LSI HBA does not have the dirty parity on reboot problem.

LSI turns their back on Green

I previously blogged here and here on my research into finding power saving RAID controllers.

I have been using LSI MegaRAID SAS 9280-4i4e controllers in my Windows 7 workstations and LSI MegaRAID SAS 9280-8e controllers in my Windows Server 2008 R2 servers. These controllers work great: my workstations go to sleep and wake up, and in both workstations and servers the drives spin down when not in use.

I am testing a new set of workstation and server systems running Windows 8 and Server 2012, and using the “2nd generation” PCIe 3.0 based LSI RAID controllers. I’m using LSI MegaRAID SAS 9271-8i with CacheVault and LSI MegaRAID SAS 9286CV-8eCC controllers.

I am unable to get any of the configured drives to spin down on either of the controllers, in either Windows 8 or Windows Server 2012.

LSI has not yet published any Windows 8 or Server 2012 drivers on their support site. In September 2012, after the public release of Windows Server 2012, LSI support told me drivers would ship in November, and now they tell me drivers will ship in December. All is not lost as the 9271 and 9286 cards are detected by the default in-box drivers, and appear to be functional.

I had hoped the no spin-down problem was a driver issue, and that it would be corrected by updated drivers, but that appears to be wishful thinking.

I contacted LSI support about the drive spin-down issue, and was referred to this August 2011 KB 16563, pointing to KB 16385 stating:

newer versions of firmware no longer support DS3; the newest version of firmware to support DS3 was 12.12.0-0045_SAS_2108_FW_Image_APP-2.120.33-1197

When I objected to the removal, support replied with this canned quote:

In some cases, when Dimmer Switch with DS3 spins down the volume, the volume cannot spin up in time when I/O access is requested by the operating system.  This can cause the volume to go offline, requiring a reboot to access the volume again.

LSI basically turned their back on green by disabling drive spin-down on all new controllers and new firmware versions.

I have not had any issues with this functionality on my systems, and spinning down unused drives to save power and reduce heat is a basic operational requirement. Maybe there are issues with some systems, but at least give me the choice of enabling it in my environment.

A little bit of searching shows I am not alone in my complaint, see here and here.

Intel has a November 2012 KB 033877 stating that they have disabled drive power save on all their RAID controllers, which is maybe not that surprising given that Intel uses rebranded LSI controllers.

After a series of overheating batteries and S3 failures, I gave up on Adaptec RAID controllers long ago, but this situation with LSI is making me take another look at them.

Adaptec advertises Intelligent Power Management as a feature of their controllers, so I ordered a 7805Q controller, and will report my findings in a future post.

Storage Spaces Leaves Me Empty

I was very intrigued when I found out about Storage Spaces and ReFS being introduced in Windows Server 2012 and Windows 8. But now that I’ve spent some time with them, I’m left disappointed, and I will not be trusting my precious data with either of these features, just yet.

Microsoft publicly announced Storage Spaces and ReFS in early Windows 8 blog posts. Storage Spaces was of special interest to the Windows Home Server community in light of Microsoft first dropping support for Drive Extender in Windows Home Server 2011, and then completely dropping Windows Home Server, and replacing it with Windows Server 2012 Essentials. My personal interest was more geared towards expanding my home storage capacity in a cost effective and energy efficient way, without tying myself to proprietary hardware solutions.

I archive all my CD’s, DVD’s, and BD discs, and store the media files on a Synology DS2411+ with 12 x 3TB drives in a RAID6 volume, giving me approximately 27TB of usable storage. Seems like a lot of space, but I’ve run out of space, and I have a backlog of BD discs that need to be archived. In general I have been very happy with Synology (except for an ongoing problem with “Local UPS was plugged out” errors), and they do offer devices capable of more storage, specifically the RS2212+ with the RX1211 expansion unit offering up to 22 combined drive bays. But, at $2300 plus $1700, this is expensive, capped at 22 drives, and further ties me in with Synology. Compare that with $1400 for a Norco DS24-E or $1700 for a SansDigital ES424X6+BS 24 bay 4U storage unit, an inexpensive LSI OEM branded SAS HBA from eBay, or a LSI SAS 9207-8e if you like the real thing, connected to Windows Server 2012, running Storage Spaces and ReFS, and things look promising.

Arguably I am swapping one proprietary technology for another, but with native Windows support I have many more choices for expansion. One could make the same argument for the use of ZFS on Linux, and if I were a Linux expert, that may have been my choice, but I’m not.

I tested using a SuperMicro SuperWorkstation 7047A-73, with dual Xeon E5-2660 processors and 32GB RAM. The 7047A-73 uses a X9DA7 motherboard, that includes a LSI SAS2308 6Gb/s SAS2 HBA, connected to 8 hot-swap drive bays.

For comparison with a hardware RAID solution I also tested using a LSI MegaRAID SAS 9286CV-8e 6Gb/s SAS2 RAID adapter, with the CacheCade 2.0 option, and a Norco DS12-E 12 bay SAS2 2U expander.

For drives I used Hitachi Deskstar 7K4000 4TB SATA3 desktop drives and Intel 520 series 480GB SATA3 SSD drives. I did not test with enterprise class drives; 4TB models are still excessively expensive, and that defeats the purpose of cost effective home storage.

I previously reported that the Windows Server 2012 and Windows 8 install will hang when trying to install on a SSD connected to the SAS2308. As such I installed Server 2012 Datacenter on an Intel 480GB SSD connected to the onboard SATA3 controller.

Windows automatically installed the drivers for the LSI SAS2308 controller.

I had to manually install the drivers for the C600 chipset RSTe controller, and as reported before, the driver works, but suffers from dyslexia.

The SAS2308 controller firmware was updated to the latest released SuperMicro v13.0.57.0.

Since LSI already released v14.0.0.0 firmware for their own SAS2308 based boards like the SAS 9207-8e, I asked SuperMicro support for their v14 version, and they provided me with an as yet unreleased v14.0.0.0 firmware version for test purposes. Doing a binary compare between the LSI version and the SuperMicro version, the differences appear to be limited to descriptive model numbers, and a few one byte differences that are probably configuration or default parameters. It is possible to cross-flash between some LSI and OEM adapters, but since I had a SuperMicro version of the firmware, this was not necessary.

SuperMicro publishes a v2.0.58.0 LSI driver that lists Windows 8 support, but LSI has not yet released Windows 8 or Server 2012 drivers for their own SAS2308 based products. I contacted LSI support, and their Windows 8 and Server 2012 drivers are scheduled for release in the P15 November 2012 update.

I tested the SuperMicro v14.0.0.0 firmware with the SuperMicro v2.0.58.0 driver, the SuperMicro v14.0.0.0 firmware with the Windows v2.0.55.84 driver, and the SuperMicro v2.0.58.0 driver with the SuperMicro v13.0.57.0 firmware. Any combination that included the SuperMicro v2.0.58.0 driver or the SuperMicro v14.0.0.0 firmware resulted in problems with the drives or controller not responding. The in-box Windows v2.0.55.84 driver and the released SuperMicro v13.0.57.0 firmware was the only stable combination.

Below are some screenshots of the driver versions and errors:

[Screenshots: LSI driver versions 2.0.55.84 and 2.0.58.0]

[Screenshots: controller error, IO retried, and device reset event log entries; format failed]

One of the reasons I am not yet prepared to use Storage Spaces or ReFS is the complete lack of decent documentation, best practice guides, or deployment recommendations. As an example, the only documentation on SSD journal drive configuration is in a TechNet forum post from a Microsoft employee, requiring the use of PowerShell, and even then there is no mention of scaling or size ratio requirements. Yes, the actual PowerShell cmdlet parameters are documented on MSDN, but not their use or meaning.

PowerShell is very powerful and Server 2012 is completely manageable using PowerShell, but an appeal of Windows has always been the management user interface, especially important for adoption by SMB’s that do not have a dedicated IT staff. With Windows Home Server being replaced by Windows Server 2012 Essentials, the lack of storage management via the UI will require regular users to become PowerShell experts, or maybe Microsoft anticipates that configuration UI’s will be developed by hardware OEM’s deploying Windows Storage Server 2012 or Windows Server 2012 Essentials based systems.

My feeling is that Storage Spaces will be one of those technologies that matures and becomes generally usable after one or two releases or service packs post the initial release.

I tested disk performance using ATTO Disk Benchmark 2.47, and CrystalDiskMark 3.01c.

I ran each test twice, back to back, and report the average. I realize two runs are not statistically significant, but with just two runs it took several days to complete the testing in between regular work activities. I opted to only publish the CrystalDiskMark data as the ATTO Disk Benchmark results varied greatly between runs, while the CrystalDiskMark results were consistent.

Consider the values useful for relative comparison under my test conditions, but not useful for absolute comparison with other systems.

Before we get to the results, a word on the tests.

The JBOD tests were performed using the C600 SATA3 controller.
The Simple, Mirror, Triple, and RAID0 tests were performed using the SAS 2308 SAS2 controller.
The Parity, RAID5, RAID6, and CacheCade tests were performed using the SAS 9286CV-8e controller.

The Simple test created a simple storage pool.
The Mirror test created a 2-way mirrored storage pool.
The Triple test created a 3-way mirrored storage pool.
The Parity test created a parity storage pool.
The Journal test created a parity storage pool, with SSD drives used for the journal disks.
The CacheCade test created RAID sets, with SSD drives used for caching.

As I mentioned earlier, there is next to no documentation on how to use Storage Spaces. In order to use SSD drives as journal drives, I followed information provided in a TechNet forum post.

Create the parity storage pool using PowerShell or the GUI. Then associate the SSD drives as journal drives with the pool.

Windows PowerShell
Copyright (C) 2012 Microsoft Corporation. All rights reserved.

PS C:\Users\Administrator> Get-PhysicalDisk -CanPool $True

FriendlyName CanPool OperationalStatus HealthStatus Usage Size
------------ ------- ----------------- ------------ ----- ----
PhysicalDisk4 True OK Healthy Auto-Select 447.13 GB
PhysicalDisk5 True OK Healthy Auto-Select 447.13 GB

PS C:\Users\Administrator> $PDToAdd = Get-PhysicalDisk -CanPool $True
PS C:\Users\Administrator>
PS C:\Users\Administrator> Add-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks $PDToAdd -Usage Journal
PS C:\Users\Administrator>
PS C:\Users\Administrator>
PS C:\Users\Administrator> Get-VirtualDisk

FriendlyName ResiliencySettingNa OperationalStatus HealthStatus IsManualAttach Size
me
------------ ------------------- ----------------- ------------ -------------- ----
Pool Parity OK Healthy False 18.18 TB

PS C:\Users\Administrator> Get-PhysicalDisk

FriendlyName CanPool OperationalStatus HealthStatus Usage Size
------------ ------- ----------------- ------------ ----- ----
PhysicalDisk0 False OK Healthy Auto-Select 3.64 TB
PhysicalDisk1 False OK Healthy Auto-Select 3.64 TB
PhysicalDisk2 False OK Healthy Auto-Select 3.64 TB
PhysicalDisk3 False OK Healthy Auto-Select 3.64 TB
PhysicalDisk4 False OK Healthy Journal 446.5 GB
PhysicalDisk5 False OK Healthy Journal 446.5 GB
PhysicalDisk6 False OK Healthy Auto-Select 3.64 TB
PhysicalDisk7 False OK Healthy Auto-Select 3.64 TB
PhysicalDisk8 False OK Healthy Auto-Select 447.13 GB
PhysicalDisk10 False OK Healthy Auto-Select 14.9 GB

PS C:\Users\Administrator>

I initially added the journal drives after the virtual drive was already created, but that would not use the journal drives. I had to delete the virtual drive, recreate it, and then the journal drives kicked in. There must be some way to manage this after virtual drives already exist, but again, no documentation.

In order to test Storage Spaces using the SAS 9286CV-8e RAID controller I had to switch it to JBOD mode using the commandline MegaCli utility.


D:\Install>MegaCli64.exe AdpSetProp EnableJBOD 1 a0

Adapter 0: Set JBOD to Enable success.

Exit Code: 0x00

D:\Install>MegaCli64.exe AdpSetProp EnableJBOD 0 a0

Adapter 0: Set JBOD to Disable success.

Exit Code: 0x00

D:\Install>

The RAID and CacheCade disk sets were created using the LSI MegaRAID Storage Manager GUI utility.

Below is a summary of the throughput results:

[Chart: read/write throughput, KB/s]

[Chart: read/write IOPS]

Not surprisingly the SSD drives had very good scores all around for JBOD, Simple, and RAID0. I only had two drives to test with, but I expect more drives to further improve performance.

The Simple, Mirror, and Triple test results speak for themselves, performance halving, and halving again.

The Parity test shows good read performance, and bad write performance. The write performance approaches that of a single disk.

The Parity with SSD Journal disks shows about the same read performance as without journal disks, and the write performance double that of a single disk.

The RAID0 and Simple throughput results are close, but the RAID0 write IOPS doubling that of the Simple volume.

The RAID5 and RAID6 read performance is close to Parity, but the write performance is almost tenfold that of Parity. It appears that the LSI card writes to all drives in parallel, while Storage Spaces parity writes to one drive only.

The CacheCade read and write throughput is less than without CacheCade, but the IOPS are tenfold higher.

The ReFS performance is about 30% less than the equivalent NTFS performance.

Until Storage Spaces gets thoroughly documented and improves performance, I’m sticking with hardware RAID solutions.

Windows 8 Install Hangs Booting From LSI 2308 SAS Controller

I’ve previously posted about problems installing Windows 8 on SuperMicro machines, and that SuperMicro released a Beta BIOS that solved the install problems. I’ve since run into two more problems: the install hanging when booting from the LSI 2308 SAS controller, and a BSOD when using a Quadro 5000 video card (more on that in a later post).

I have two SuperWorkstation machines, a 7047A-T using a X9DAi motherboard, and a 7047A-73 using a X9DA7 motherboard.

The X9DAi and X9DA7 both use the Intel C602 chipset. The X9DAi and X9DA7 both have 2 x SATA3 ports, 4 x SATA2 ports, and 4 x SAS / SATA2 ports. The X9DA7 has an additional LSI 2308 controller with 8 x SAS2 / SATA3 ports.

On the 7047A-T / X9DAi machine, the 8 x hot-swap drive trays are connected to the 2 x SATA3, 2 x SATA2, and 4 x SAS / SATA2 ports.

On the 7047A-73 / X9DA7 machine, the 8 x hot-swap drive trays are connected to the 8 x LSI 2308 ports.

SuperMicro support provided me with Beta BIOS’s for the X9DAi and X9DA7 motherboards, this resolved the ACPI_BIOS_ERROR, and allowed me to install Windows 8 RTM on these machines, or at least get past the BSOD while booting the install media.

I configured both machines with:

In the 7047A-T / X9DAi machine, I installed the SSD drive in slot-0 of the hot-swap trays, connected to SATA3 port-0. I installed Windows 8 x64 RTM without issue.

In the 7047A-73 / X9DA7 machine I installed the SSD drive in slot-0 of the hot-swap trays, connected to LSI2308 port-0. I installed Windows 8 x64 RTM, and the install hung at 0% while copying files.

While in this state, I suspected the problem was IO related, so I pressed Shift-F10 to open a console window and ran diskpart, and diskpart hung.

I downloaded the latest LSI 2308 drivers from the Supermicro FTP site. I ran the install again, this time I manually loaded the drivers instead of using the in-box drivers, same problem, hang at 0%.

LSI does not make drivers directly available for HBA chips, but the LSI SAS 9205-8e uses the LSI 2308, and I downloaded the drivers from LSI. They were the same version as the drivers available on the SuperMicro FTP site.

I contacted SuperMicro support, they suggested I install using the SATA3 port while they research the problem. Connecting the SSD drive to SATA3 port-0 installed fine.

I tested the same setup using Windows 7, and although Windows 7 did not include in-box drivers for the LSI 2308, after loading the drivers, Windows 7 installed fine with the SSD connected to LSI2308 port-0.

This probably indicates a Windows 8 compatibility problem with the LSI 2308 driver, or HBA firmware.

LSI HBA’s can be configured to run in Initiator Target (IT) or Integrated RAID (IR) mode. This can be changed by flashing with the appropriate IT or IR firmware. IT firmware is typically preferred where there is no need for hardware RAID and all disks will be in JBOD mode, e.g. for use with ZFS or Storage Spaces.

When you flash between IT and IR mode, you need to erase the firmware before re-flashing, i.e. you cannot simply flash one mode on top of the other. On the SuperMicro motherboards, you also need to perform the flash operation from within the EFI shell; flashing from other environments will fail. You can follow these KB’s to help with the process: LSI 16266, SuperMicro 14368, and SuperMicro 14151. I would not recommend using the SuperMicro 14368 method, as it wipes the entire firmware memory and you will need to manually re-enter the SAS address. It is basically the difference between using “sas2flash -o -e 6” and “sas2flash -o -e 7”; see the SAS2Flash reference guide for details.

The X9DA7 motherboard came with firmware version 13.0.0.56 for the LSI 2308, configured in IR mode. Using the firmware from the SuperMicro FTP site, I updated it to 13.0.0.57 in IT mode.

The update process I followed was to boot into the EFI shell with a USB drive attached containing the firmware update; the drive must contain the firmware, the boot code, and the SAS2Flash.efi tool.

In the EFI shell, run the “map” command to list the hardware and see which drive is the USB drive, mount that drive using “mount fs[drive number]:”, e.g. “mount fs1:”, then change to the USB drive using “fs1:”:

map
mount fs1:
fs1:

Then wipe the flash with “sas2flash -o -e 6”, program the new firmware and boot code with “sas2flash -o -f [firmware file] -b [bootcode file]”, e.g. “sas2flash -o -f 2308IT13.5FW -b mptsas2.rom”, and restart:

sas2flash -o -e 6
sas2flash -o -f 2308IT13.5FW -b mptsas2.rom
reset

Same problem, hang at 0%.

I again referred to the LSI site for updated firmware for the LSI 2308; the LSI SAS 9205-8e and LSI SAS 9207-8e include firmware P14 version 14.0.0.0, a major revision upgrade from version 13.0.0.57 on the SuperMicro site.

The P14 firmware package does not include the EFI version of SAS2Flash, but a bit of search engine exploration showed it is still included in the P13 packages.

I am not quite brave enough to flash to this version yet, as a failed flash will require a hardware swap. I’ll continue running this machine with the SSD connected to the SATA3 port.

At this point I am waiting for SuperMicro support to get back to me with a solution, or confirming that I can flash the P14 firmware and see if that resolves the issue.

Hitachi Ultrastar and Seagate Barracuda LP 2TB drives

In my previous post I talked about Western Digital RE4-GP 2TB drive problems.

In this post I present my test results for 2TB drives from Seagate and Hitachi.
The test setup is the same as for the RE4-GP testing, except that I only tested 4 drives from each manufacturer.
Unlike the enterprise class WD RE4-GP and Hitachi Ultrastar A7K2000 drives, the Seagate Barracuda LP drive is a desktop drive.
The equivalent should have been a Seagate Constellation ES drive, but as far as I know the 2TB drives are not yet available.
To summarize:
The Hitachi A7K2000 drives performed without issue on all three controllers, the Seagate Barracuda LP drive failed to work with the Adaptec controller.
The Hitachi Ultrastar A7K2000 outperformed the Seagate Barracuda LP drive, but this was not really a surprise given the drive specs.
The Areca ARC1680 controller produced the best and most reliable results, the Adaptec was close, but given the overheating problem, it is not reliable unless additional cooling is added.
Test hardware:
Intel S5000PSL motherboard, dual Xeon E5450, 32GB RAM, firmware BIOS-98 BMC-65 FRUSDR-48
Adaptec 51245 RAID controller, firmware 17517, driver 5.2.0.17517
Areca ARC1680ix-12 RAID controller, firmware 1.47, driver 6.20.00.16_80819
LSI 8888ELP RAID controller, firmware 11.0.1-0017 (APP-1.40.62-0665), driver 4.16.0.64
Chenbro CK12803 28-port SAS expander, firmware AA11
Drive setup:
– Boot drive, 1 x 1TB WD Caviar Black WD1001FALS, firmware 05.00K05
Simple volume, connected to onboard Intel ICH10R controller running in RAID mode
– Data drives, 4 x 2TB Hitachi Ultrastar A7K2000 HUA722020ALA330 drives, firmware JKAOA20N
1 x hot spare, 3 x drive RAID5 4TB, configured as GPT partitions, dynamic disks, and simple volumes
– Data drives, 4 x 2TB Seagate Barracuda LP ST32000542AS drives, firmware CC32
1 x hot spare, 3 x drive RAID5 4TB, configured as GPT partitions, dynamic disks, and simple volumes

I tested the drives as shipped, with no jumpers, running at SATA-II / 3Gb/s speeds.
Adaptec 51245, SATA-II / 3Gb/s:
As in my previous test I had to use an extra fan to keep the Adaptec card from overheating.
The Hitachi drives had no problems.
The Hitachi drives completed initialization in 16 hours.
The Seagate drives would not show up on the system, I tried different ports, resets, cable swaps, no go.
Adaptec, RAID5, Hitachi:

Adaptec, RAID5, WD:

Areca ARC1680ix-12, SATA-II / 3Gb/s:
The Areca had no problems with the Hitachi or Seagate drives.
The Hitachi drives completed initialization in 40 hours.
The Seagate drives completed initialization in 49 hours.
The array initialization time of the Areca is significantly longer compared to Adaptec or LSI.
Areca, RAID5, Hitachi:

Areca, RAID5, Seagate:

Areca, RAID5, WD:

LSI 8888ELP and Chenbro CK12803, SATA-II / 3Gb/s:
The Hitachi drives reported a few "Invalid field in CDB" errors, but they did not appear to affect the operation of the array.
The Hitachi drives completed initialization in 4 hours.
The Seagate drives reported lots of "Invalid field in CDB" and "Power on, reset, or bus device reset occurred" errors, but they did not appear to affect the operation of the array.
The Seagate drives made clicking sounds when they powered on, and occasionally during normal operation.
The Seagate drives completed initialization in 4 hours.

LSI, RAID5, Hitachi:

LSI, RAID5, Seagate:

LSI, RAID5, WD:

The Hitachi A7K2000 drives performed without issue on all three controllers, the Seagate Barracuda LP drive failed to work with the Adaptec controller.
The Hitachi A7K2000 outperformed the Seagate Barracuda LP drive, but this was not really a surprise given the drive specs.
The Areca ARC1680 controller produced the best and most reliable results, the Adaptec was close, but given the overheating problem, it is not reliable unless additional cooling is added.

I will be scaling my test up from 4 to 12 Hitachi drives, using the Areca controller, and I will expand the Areca cache from 512MB to 2GB.

Western Digital RE4-GP 2TB Drive Problems

In my previous two posts I described my research into the power saving features of various enterprise class RAID controllers.
In this post I detail the results of my testing of the Western Digital RE4-GP enterprise class “green” drives when used with hardware RAID controllers from Adaptec, Areca, and LSI.
To summarize, the RE4-GP drive fails with a variety of problems; Adaptec, Areca, and LSI acknowledge the problem and lay the blame on WD, yet WD insists there are no known problems with the RE4-GP drives.
Test hardware:
Intel S5000PSL motherboard, dual Xeon E5450, 32GB RAM, firmware BIOS-98 BMC-65 FRUSDR-48
Adaptec 51245 RAID controller, firmware 17517, driver 5.2.0.17517
Areca ARC1680ix-12 RAID controller, firmware 1.47, driver 6.20.00.16_80819
LSI 8888ELP RAID controller, firmware 11.0.1-0017 (APP-1.40.62-0665), driver 4.16.0.64
Chenbro CK12803 28-port SAS expander, firmware AA11
Drive setup:
– Boot drive, 1 x 1TB WD Caviar Black WD1001FALS, firmware 05.00K05
Simple volume, connected to onboard Intel ICH10R controller running in RAID mode
– Data drives, 10 x 2TB WD RE4-GP WD2002FYPS drives, firmware 04.05G04
1 x hot spare, 3 x drive RAID5 4TB, 6 x drive RAID6 8TB, configured as GPT partitions, dynamic disks, and simple volumes
I started testing the drives as shipped, with no jumpers, running at SATA-II / 3Gb/s speeds.
Adaptec 51245, SATA-II / 3Gb/s:
The Adaptec card has 3 x internal SFF-8087 ports and 1 x external SFF-8088 port, supporting 12 internal drives.
The Adaptec card had immediate problems with the RE4-GP drives, in the ASM utility the drives would randomly drop out and in.
I could not complete testing.
Areca ARC1680ix-16, SATA-II / 3Gb/s:
The Areca card has 3 x internal SFF-8087 ports and 1 x external SFF-8088 port, supporting 12 internal drives.
Unlike the LSI and Adaptec cards that require locally installed management software, the Areca card is completely managed through a web interface from an embedded Ethernet port.
The Areca card allowed the RAID volumes to be created, but during initialization at around 7% the web interface stopped responding, requiring a cold reset.
I could not complete testing.
LSI 8888ELP and Chenbro CK12803, SATA-II / 3Gb/s:
The LSI card has 2 x internal SFF-8087 ports and 2 x external SFF-8088 ports, supporting 8 internal drives.
Since I needed to host 10 drives, I used the Chenbro 28 port SAS expander.
The 8888ELP support page only lists the v3 series drivers, while W2K8R2 ships with the v4 series drivers, so I used the latest v4 drivers from the new 6Gb/s LSI cards.
The LSI and Chenbro allowed the volumes to be created, but during initialization 4 drives dropped out, and initialization failed.
I could not complete testing.
I contacted WD, Areca, Adaptec, and LSI support with my findings.
WD support said there is nothing wrong with the RE4-GP, and that they are not aware of any problems with any RAID controllers.
When I insisted that there must be something wrong, they suggested I try to force the drives to SATA-I / 1.5Gb/s speed and see if that helps.
I tested at SATA-I / 1.5Gb/s speed, and achieved some success, but I still insisted that WD acknowledge the problem.
The case was escalated to WD engineering, and I am still waiting for an update.
Adaptec support acknowledged a problem with RE4-GP drives when used with high port count controllers, and that a card hardware fix is being worked on.
I asked if the fix will be firmware or hardware, and was told hardware, and that the card will have to be swapped, but the timeframe is unknown.
Areca support acknowledged a problem between the Intel IOP348 controller and RE4-GP drives, and that Intel and WD are aware of the problem, and that running the drives at SATA-I / 1.5Gb/s speed resolves the problem.
I asked if a fix to run at SATA-II / 3Gb/s speeds will be made available, I was told this will not be possible without hardware changes, and no fix is planned.
LSI support acknowledged a problem with RE4-GP drives, and that they have multiple cases open with WD, and that my best option is to use a different drive, or to contact WD support.
I asked if a fix will become available, they said that it is unlikely that a firmware update would be able to resolve the problem, and that WD would need to provide a fix.
This is rather disappointing: WD advertises the RE4-GP as an enterprise class drive, yet all three of the enterprise class RAID controllers I tested failed with the RE4-GP, all three vendors blame WD, and WD insists there is nothing wrong with the RE4-GP.
I continued testing, this time with the SATA-I / 1.5Gb/s jumper set.
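Before rerunning the tests it is worth confirming that the drives really negotiated the lower link speed. As a rough example, booted into Linux the negotiated speed can be checked with smartctl and the kernel log; the /dev/sdb path is only an illustration:

$ sudo smartctl -i /dev/sdb
(recent smartctl versions print a “SATA Version is:” line listing the supported and the currently negotiated link speed)
$ dmesg | grep -i "SATA link up"
(the kernel logs the per-port link speed as the drives are detected)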
Adaptec 51245, SATA-I / 1.5Gb/s:
This time the Adaptec card had no problems seeing the arrays, although some of the drives continued to report link errors.
A much bigger problem was that the controller and battery were overheating, with the controller running at 103C / 217F.
In order to continue my testing I had to install an extra chassis fan to provide additional ventilation over the card.
The Adaptec and LSI have passive cooling, whereas the Areca has active cooling and ran at only around 51C / 124F.
The Areca and LSI batteries are off-board, and although a bit inconvenient to mount, they did not overheat like the Adaptec.
Initialization completed in 22 hours, compared to 52 hours for Areca and 8 hours for LSI.
The controller supports power management, and drives are spun down when not in use.
3 x Drive RAID5 4TB performance:

6 x Drive RAID6 8TB Performance:

Areca ARC1680ix-16, SATA-I / 1.5Gb/s:
This time the Areca card had no problems initializing the arrays.
Initialization completed in 52 hours, much longer compared to 22 hours for Adaptec and 8 hours for LSI.
Areca support said initialization time depends on the drive speed and controller load, and that the RE4-GP drives are known to be slow.
The controller supports power management, and drives are spun down when not in use.

3 x Drive RAID5 4TB performance:

6 x Drive RAID6 8TB Performance:

LSI 8888ELP and Chenbro CK12803, SATA-I / 1.5Gb/s:
This time only 2 drives dropped out, one from each array, and initialization completed after I forced the drives back online.
Initialization completed in 8 hours, much quicker compared to 22 hours for Adaptec and 52 hours for Areca.

The controller only supports power management on unassigned drives; there is no support for spinning down configured but idle drives.

3 x Drive RAID5 4TB performance:

6 x Drive RAID6 8TB Performance:

Although all three cards produced results when the RE4-GP drives were forced to SATA-I / 1.5Gb/s speeds, the results still show that the drives are unreliable.
The RE4-GP drive fails with a variety of problems; Adaptec, Areca, and LSI acknowledge the problem and lay the blame on WD, yet WD insists there are no known problems with the RE4-GP drives.
There are alternative low power drives available from Seagate and Hitachi.
I still haven’t forgiven Seagate for the endless troubles they caused with ES.2 drives and Intel IOP348 based controllers; like WD, they also denied any problems with the drives, yet eventually released two firmware updates for the ES.2 drives.
I’ve always had good service from Hitachi drives, so maybe I’ll give the new Hitachi A7K2000 drives a run.
One thing is for sure, I will definitely be returning the RE4-GP drives.
[Update: 11 October 2009]
I tested the Seagate Barracuda LP and Hitachi Ultrastar 2TB drives.
[Update: 24 October 2009]
WD support still has not responded to my request for the firmware.

Power Saving RAID Controller (Continued)

This post continues from my last post on power saving RAID controllers.
It turns out the Adaptec 5 series controllers are not that workstation friendly.
I was testing with Western Digital drives; 1TB Caviar Black WD1001FALS, 2TB Caviar Green WD20EADS, and 1TB RE3 WD1002FBYS.
I also wanted to test with the new 2TB RE4-GP WD2002FYPS drives, but they are on backorder.
I found that the Caviar Black WD1001FALS and Caviar Green WD20EADS drives were just dropping out of the array for no apparent reason, yet they were still listed in ASM as if nothing was wrong.
I also noticed that over time ASM listed medium errors and aborted command errors for these drives.
In comparison the RE3 WD1002FBYS drives worked perfectly.
A little searching pointed me to a feature of WD drives called Time Limited Error Recovery (TLER).
You can read more about TLER here, or here, or here.
Basically, the enterprise class drives have TLER enabled and the consumer drives do not, so when the RAID controller issues a command and the drive does not respond in a reasonable amount of time, the controller drops the drive out of the array.
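On drives that implement SCT Error Recovery Control, the TLER timeout can be read, and on models that allow it, changed with smartctl. This is only a sketch; the /dev/sdb path is an illustration, and WD drives of this era were typically toggled with WD’s own DOS utility rather than through SCT commands:

$ sudo smartctl -l scterc /dev/sdb
(reports the current read and write recovery timeouts, in tenths of a second, or that the drive does not support SCT ERC)
$ sudo smartctl -l scterc,70,70 /dev/sdb
(attempts to set both timeouts to 7 seconds, the typical enterprise/TLER value)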
The same drives worked perfectly in single drive, RAID-0, and RAID-1 configurations with an Intel ICH10R RAID controller, granted, the Intel chipset controller is not in the same performance league.
The Adaptec 5805 and 5445 controllers I tested did let the drives spin down, but the controller is not S3 sleep friendly.
Every time my system resumed from S3 sleep ASM would complain “The battery-backup cache device needs a new battery: controller 1.”, yet when I looked in ASM it told me the battery was fine.
Whenever the system enters S3 sleep the controller does not spin down any of the drives, which means that all the drives in external enclosures, or on external power, keep spinning while the machine is sleeping.
This defeats the purpose of power saving and sleep.
The embedded Intel ICH10R RAID controller did correctly spin down all drives before entering sleep.
Since installing the ASM utility my system is taking a noticeably longer time to shut down.
Vista provides a convenient, although not always accurate, way to see what is impacting system performance in terms of event timing, and ASM was identified as adding 16s to every shutdown.
Under [Computer Management][Event Viewer][Applications and Services Logs][Microsoft][Windows][Diagnostics-Performance][Operational], I see this for every shutdown event:
This service caused a delay in the system shutdown process:
File Name : AdaptecStorageManagerAgent
Friendly Name :
Version :
Total Time : 20002ms
Degradation Time : 16002ms
Incident Time (UTC) : 6/11/2009 3:15:57 AM
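The same events can also be pulled from the command line with wevtutil; a rough example, noting that the EventID 203 filter is my assumption for the “service caused a delay” shutdown events:

wevtutil qe Microsoft-Windows-Diagnostics-Performance/Operational /q:"*[System[(EventID=203)]]" /f:text /rd:true /c:5

This returns the most recent shutdown delay events in plain text, the same information shown in Event Viewer.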
It really seems that Adaptec did not design or test the 5 series controllers for use in workstations, which is unfortunate, because performance-wise the 5 series cards really are great.
[Update: 22 August 2009]
I received several WD RE4-GP / WD2002FYPS drives.
I tested with W2K8R2 booted from a WD RE3 / WD1002FBYS drive connected to an Intel ICH10R controller on an Intel S5000PSL server board.
I tested 8 drives in RAID6 connected to a LSI 8888ELP controller, worked perfectly.
I connected the same 8 drives to an Adaptec 51245 controller, at boot only 2 out of 8 drives were recognized.
After booting, ASM showed all 8 drives, but they were continuously dropping out and back in.
I received confirmation of similar failures with the RE4 drives and Adaptec 5 series cards from a blog reader.
Adaptec support told him to temporarily run the drives at 1.5Gb/s; apparently this does work, although I did not test it myself, and it is clearly not a long term or acceptable solution.
I am still waiting to hear back from Adaptec and WD support.
[Update: 30 August 2009]
I received a reply from Adaptec support, and the news is not good; there is a hardware compatibility problem between the WD RE4-GP / WD2002FYPS drives and the onboard expander on the 51245 controller.
“I am afraid currently these drives are not supported with this model of controller. This is due to a compatibility issue with the onboard expander on the 51245 card. We are working on a hardware solution to this problem, but I am currently not able to say in what timeframe this will come.”
[Update: 31 August 2009]
I asked support if a firmware update will fix the issue, or if a hardware change will be required.
“Correct, a hardware solution, this would mean the card would need to be swapped, not a firmeware update. I can’t tell you for sure when the solution would come as its difficult to predict the amount of time required to certify the solution but my estimate would be around the end of September.”
[Update: 6 September 2009]
I experienced similar timeouts testing an Areca ARC-1680 controller.
Areca support was very forthcoming with the problem and the solution.
“this issue had been found few weeks ago and problem had been reported to WD and Intel which are vendors for hard drive and processor on controller. because the problem is physical layer issue which Areca have no ability to fix it.
but both Intel and WD have no fix available for this issue, the only solution is recommend customer change to SATA150 mode.
and they had closed this issue by this solution.
so i do not think a fix for SATA300 mode may available, sorry for the inconvenience.”
That explains why the problem happens with the Areca and Adaptec controllers but not the LSI: both the Areca and the Adaptec use the Intel IOP348 processor.