UPS Battery Replacement Turns Into Unrecoverable Firmware Update

Two lessons learned: do not trust scheduled battery tests, and leave working firmware be!

As the saying goes, if it is not broken do not fix it, especially when it comes to firmware.

I have a couple of APC Smart-UPS units at my house, the same models I like to use at the office: SMT750s with AP9631 Network Management Cards. The problem started when we had a short power outage, and the UPS that powers the home network switch, cell repeater, alarm internet connection, and PoE IP cameras unexpectedly died. A battery replacement led to the opportunity to do a UPS firmware update, which led to an unrecoverable firmware update.

It started when I woke up one morning and it was obvious the power had been out. The first indicator was the blinking clocks on the kitchen appliances, the second was the numerous power failure email notifications, and the emails that stood out were from the alarm system, saying it had lost power and internet connectivity. The alarm has its own backup battery, the network switches and FiOS internet have their own battery backups, and the outage was only about 4 minutes. So how is it that the UPS died, killing the switch and disconnecting the internet, when the outage was only 4 minutes and typical runtime on the UPS should be about an hour?

Here is the UPS outage log produced by the NMC card:

10/11/2016 06:53:05 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 06:12:27 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 06:12:24 Device UPS: Restored the local network management interface-to-UPS communication. 0x0101
10/11/2016 06:12:12 System Network service started. IPv6 address FE80::2C0:B7FF:FE98:9BAF assigned by link-local autoconfiguration. 0x0007
10/11/2016 06:12:10 Device Environment: Restored the local network management interface-to-integrated Environmental Monitor (Universal I/O at Port 1) communication. 0x0344
10/11/2016 06:12:09 System Network service started. System IP is 192.168.1.11 from manually configured settings. 0x0007
10/11/2016 06:12:02 System Network Interface coldstarted. 0x0001
10/11/2016 05:45:36 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 05:45:36 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 05:45:35 Device UPS: The output power is turned off. 0x0114
10/11/2016 05:45:35 Device UPS: The graceful shutdown period has ended. 0x014F
10/11/2016 05:45:35 Device UPS: No longer on battery power. 0x010A
10/11/2016 05:45:35 Device UPS: Main outlet group, UPS Outlets, has been commanded to shutdown with on delay. 0x0174
10/11/2016 05:45:35 Device UPS: The power for the main outlet group, UPS Outlets, is now turned off. 0x0135
10/11/2016 05:45:18 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 05:41:36 Device UPS: On battery power in response to rapid change of input. 0x0109

I could see from the log that the UPS battery ran out within 4 minutes: 05:41:36 on battery, 05:45:18 battery too low, 05:45:35 output turned off. The UPS status page was equally puzzling: load was at 9.7%, yet the reported runtime was only 5 minutes, which should be impossible.
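
The elapsed times can be checked directly from the log timestamps. This is a minimal Python sketch, assuming the NMC log uses MM/DD/YYYY HH:MM:SS timestamps; the helper name elapsed_seconds is my own and not part of any APC tool.

```python
from datetime import datetime

# Assumed timestamp layout of the NMC event log (MM/DD/YYYY HH:MM:SS).
LOG_FORMAT = "%m/%d/%Y %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> int:
    """Return the whole seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)
    return int(delta.total_seconds())


# Timestamps copied from the outage log above.
on_battery = "10/11/2016 05:41:36"
battery_low = "10/11/2016 05:45:18"
output_off = "10/11/2016 05:45:35"

to_low = elapsed_seconds(on_battery, battery_low)
to_off = elapsed_seconds(on_battery, output_off)

print(f"On battery until low battery warning: {to_low // 60}m {to_low % 60}s")
print(f"On battery until output turned off:   {to_off // 60}m {to_off % 60}s")
```

Both intervals come out just under 4 minutes, which matches the length of the outage and is nowhere near the expected runtime.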

Here is the status screenshot:

apc-1

I ran a manual battery test and the test passed, even though the outage log made it clear the battery had failed. I have bi-weekly scheduled battery tests for all my UPS’s, and I never received a failure report. So what is the point of a battery test if it comes back with no problem when the logs clearly show that the battery failed?

Here is the log:

10/11/2016 18:07:06 Device UPS: A low battery condition no longer exists. 0x0110
10/11/2016 18:07:06 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107
10/11/2016 18:07:05 Device UPS: Self-Test passed. 0x0105
10/11/2016 18:06:59 Device UPS: A discharged battery condition no longer exists. 0x0108
10/11/2016 18:06:59 Device UPS: The battery power is too low to continue to support the load; the UPS will shut down if input power does not return to normal soon. 0x010F
10/11/2016 18:06:58 Device UPS: Self-Test started by management device. 0x0137
10/11/2016 18:06:56 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately. 0x0107

I asked for advice on the APC forum, no reply yet, and I ordered a replacement RBC48 battery. I received the battery, installed it, and the reported runtime is back to normal, 1 hour 48 minutes.

Here is a status screenshot with the new battery:

apc-2

Here is where I should have stopped and called it a day, but no. I knew that the UPS’s were on old firmware, and I decided to use this opportunity to update the firmware. I’d normally let firmware be, unless I have a good reason to update, but I convinced myself that the new firmware readme had some fixes that may help with the false pass on the battery test:

Release Notes (UPS09.3):
========================
2. Improved self-test logging for PCBE / NMC.
11. Repaired an occasional math error in the battery replacement date algorithm that resulted in incorrect dates.

I updated the UPS where I had just replaced the battery; the instructions are pretty simple. The only hassle is that I had to move the network equipment to mains power, so I could turn the UPS outputs off during the firmware update while maintaining network connectivity.

I did the same for my office UPS, with the PC and office switch on mains power. When I powered down the output, I made sure not to notify PowerChute Network Shutdown (PCNS) clients, as my PC had the PowerChute client installed to receive power state via the network. I started the firmware update over the network, and a few seconds later I got a Windows message that a shutdown had been initiated by PCNS. What? I sat there in frustration, with nothing to do but watch my PC shut down while it was still delivering the firmware update.

On rebooting my PC, the NMC came up, but reported that the UPS had stopped communicating. I pulled AC power from the UPS, no change. I also pulled the batteries, and when I plugged the batteries and mains back in, beeeeeeeeeep. The NMC now reported no UPS found, while the UPS LCD panel reported all is fine. And still beeeeeeeeeep, with no way to stop the beeeeeeeeeep.

Here is the NMC status page:

apc-3

I tried to update with the Firmware Upgrade Wizard via USB. I plugged a USB cable in, and the PC saw the UPS and reported a critical condition, but the upgrade wizard reported no UPS found on USB.

Here is the wizard error page:

apc-4

So here I am, stuck with a bricked UPS. Lesson learned, actually two lessons learned: do not trust scheduled battery tests, and leave working firmware be!

Electrical Power Quality

Earlier this year we moved a couple miles from Redondo Beach to Manhattan Beach, bigger house, better school district.
As far as the house and area are concerned, it is definitely an upgrade, but not so for the utilities.

Monthly utilities are a lot more expensive, not so much the per unit fees as the base service fees, and not by just a couple of dollars, but three or four times what we paid in Redondo Beach. Now, if it came with better offerings, better service, or higher quality, OK, but it is the opposite.
Water quality is worse, specifically hardness, MB supplies its own water, RB gets water from LADWP, and that unsightly water tower that no longer serves any practical purpose, with efforts to demolish it always being thwarted.
As a new resident I pay almost thirty dollars extra per month for an extra trash can, while grandfathered-in residents keep extras for free. Now, I know it is unfair to judge a service by its employees’ actions, or is it, but the trash collection guy is a jerk. If a little dust and having to get out of the truck is going to get you agitated, you are in the wrong business, especially when compared with the trash collection crew in RB, who were always friendly and willing to give a hand.
But, I really digress, I want to discuss electrical power quality problems.

In the six plus years we lived in RB, I think we had one scheduled power outage, and maybe two short unplanned outages. Since moving to MB earlier this year, we’ve had two scheduled outages, one lasting an entire day, and several unscheduled outages.
The power is unreliable, SCE knows it, the city knows it, and there are some plans addressing it, see here, here, here, and here.

My concern is not really power being on or off, it is power being on but of poor quality; an electronic equipment killer.

When we moved in, the first signs of electrical problems were flickering lights. At first I thought it was a problem with the Vantage light control system, but even lights directly on utility power flickered. As soon as I hooked up UPS’s to my servers and the signal distribution system, the UPS’s started complaining about power quality. Occasionally during the day I would get a notification from a UPS that it had detected a distorted input, and every night the UPS’s would complain about low input voltage.
It may be coincidental, but I’ve also had two astronomical clock light timers fail at the same time, the casings were scorched in what appears to be signs of electrical damage.

UPS Event Log:
APC Event Log

In order to quantify the problem, I used a Fluke VR1710 Voltage Quality Recorder. The device plugs into a mains outlet, and records events, and a USB port is used to configure the device, and download recorded data.

As I am not a power quality expert, I referred to Wikipedia and Power Quality In Electrical Systems for information and reference material. To further simplify the analysis, I opted to compare my office power with my home power. This allowed me to easily visualize the quality differences, granted, I am assuming my office power is good.

I configured the VR1710 to take measurements every 10s, and to record exceptional events, about 10 days worth of data. I set the dip threshold to 106V, the swell threshold to 127V, and the transient sensitivity to 5V.
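
To make the thresholds concrete, here is a minimal Python sketch of how a reading would be classified with these settings. The classify function and the sample readings are hypothetical illustrations, not recorded data, and I am assuming a reading exactly on a threshold counts as normal; the recorder's actual event logic may differ.

```python
NOMINAL_V = 120.0          # nominal mains voltage
DIP_THRESHOLD_V = 106.0    # dip threshold from the VR1710 settings
SWELL_THRESHOLD_V = 127.0  # swell threshold from the VR1710 settings


def classify(voltage: float) -> str:
    """Classify one RMS voltage reading against the configured thresholds."""
    if voltage < DIP_THRESHOLD_V:
        return "dip"
    if voltage > SWELL_THRESHOLD_V:
        return "swell"
    return "normal"


# Hypothetical readings for illustration only.
samples = [119.8, 105.2, 101.7, 127.6, 116.0]

for volts in samples:
    percent = 100.0 * volts / NOMINAL_V
    print(f"{volts:6.1f} V  {percent:5.1f}% of nominal  {classify(volts)}")
```

With a 120V nominal, the 106V dip threshold sits at roughly 88% of nominal and the 127V swell threshold at roughly 106%.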

VR1710 Settings:
VR1710 Settings

Below are reports detailing the recorded events, click graphs to view full resolution:

Home Voltage:
Home - Voltage

There is a clear pattern of voltage drops below 102V every evening, these drops are also observed in the UPS logs showing low voltage warnings around 7:30PM every evening.

Office Voltage:
Office - Voltage

The office voltage is very stable.

Home Flicker:
Home - Flicker

According to Wikipedia and PQW short term flicker (Pst) is noticeable at values exceeding 1.0, and long term flicker (Plt) is noticeable at values exceeding 0.65. These results would explain why we observe lights flickering.
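
As I understand the IEC flickermeter definition, Plt is the cubic mean of consecutive 10 minute Pst values, normally twelve of them covering two hours. The sketch below is a minimal Python illustration of that relationship using the visibility limits quoted above; the Pst values are made up for the example and are not taken from my recordings.

```python
PST_LIMIT = 1.0   # short term flicker visibility limit quoted above
PLT_LIMIT = 0.65  # long term flicker visibility limit quoted above


def long_term_flicker(pst_values):
    """Combine consecutive Pst values into Plt using the cubic mean."""
    if not pst_values:
        raise ValueError("need at least one Pst value")
    cubic_mean = sum(p ** 3 for p in pst_values) / len(pst_values)
    return cubic_mean ** (1.0 / 3.0)


# Hypothetical two hour window of twelve 10 minute Pst values.
pst_window = [0.4, 0.5, 1.2, 0.9, 0.6, 0.5, 1.4, 0.7, 0.5, 0.4, 0.6, 0.8]

plt_value = long_term_flicker(pst_window)
pst_exceeded = sum(1 for p in pst_window if p > PST_LIMIT)

print(f"Plt = {plt_value:.2f}, above Plt limit: {plt_value > PLT_LIMIT}")
print(f"Pst intervals above limit: {pst_exceeded} of {len(pst_window)}")
```

Because of the cubing, a few bad 10 minute intervals dominate the long term value, so a handful of flicker bursts per evening is enough to push Plt over the limit.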

Office Flicker:
Office - Flicker

Office flicker values are well within acceptable ranges.

Home Statistics:
Home - Statistics

From this distribution we can see the wide spread in voltages, well below the 120V nominal. This chart does not show it, but the 95th percentile is 115.5V, and the 5th percentile is 106.1V.

Office Statistics:
Office - Statistics

The office voltage distribution is nicely clustered around 119V, with the 95th percentile at 119.6V, and the 5th percentile at 117.4V.
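
To put the home and office numbers side by side, the following Python sketch expresses the reported 5% and 95% values as a percentage of the 120V nominal and computes the spread between them. It only uses the figures quoted above; the variable and function names are my own.

```python
NOMINAL_V = 120.0

# (5% value, 95% value) in volts, as reported in the statistics above.
sites = {
    "home": (106.1, 115.5),
    "office": (117.4, 119.6),
}


def percent_of_nominal(voltage: float, nominal: float = NOMINAL_V) -> float:
    """Express a voltage as a percentage of the nominal voltage."""
    return 100.0 * voltage / nominal


for name, (low, high) in sites.items():
    spread = high - low
    print(
        f"{name:6s} 5%: {percent_of_nominal(low):5.1f}%  "
        f"95%: {percent_of_nominal(high):5.1f}%  spread: {spread:.1f} V"
    )
```

The home spread is more than four times the office spread, and even the home 95% value stays below the office 5% value.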

Home Dips And Swells:
Home - Dips Swells

ITIC and CBEMA are standards for acceptable power quality, see here for a detailed description.
To describe the graph, I quote from the Fluke Power Log software manual:
Dips and swells are shown on a CBEMA (Computer Business Equipment Manufacturers Association) and ITIC (Information Technology Industry Council) plot classification table according to EN50160. On the CBEMA (blue) and ITIC (red), curve markers are plotted for each dip and swell. The height on the vertical axis shows the severity of the dip or swell relative to the nominal voltage. The horizontal position shows the duration of the dip or swell. These curves show an ac input voltage envelope which typically can be tolerated (no interruption in function) by most Information Technology Equipment (ITE).

Based on the graph we can see a large number of events exceeding the acceptable ranges. Since there were no dips at the office, there is no graph for the office.
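
For readers who want to check an event against the ITIC envelope without the Fluke software, here is a simplified Python sketch. The breakpoints are the commonly cited ITIC values as I remember them (0% up to 20 ms, 70% up to 0.5 s, 80% up to 10 s, 90% steady state, and 120% up to 0.5 s or 110% beyond that for swells), so treat them as assumptions and verify against the published curve. The sub 3 ms transient region is deliberately left out, and the example events are hypothetical.

```python
def within_itic(percent_of_nominal: float, duration_s: float) -> bool:
    """Approximate check of a dip or swell against a simplified ITIC envelope.

    Assumed breakpoints; durations shorter than about 3 ms are not modeled.
    """
    if percent_of_nominal < 100.0:
        # Lower boundary: the longer the dip, the less depth is tolerated.
        if duration_s <= 0.020:
            floor = 0.0
        elif duration_s <= 0.5:
            floor = 70.0
        elif duration_s <= 10.0:
            floor = 80.0
        else:
            floor = 90.0
        return percent_of_nominal >= floor
    # Upper boundary: short swells may reach 120%, sustained ones 110%.
    ceiling = 120.0 if duration_s <= 0.5 else 110.0
    return percent_of_nominal <= ceiling


# Hypothetical events: (percent of nominal, duration in seconds).
events = [(85.0, 1.0), (85.0, 60.0), (65.0, 0.1), (115.0, 0.2), (115.0, 5.0)]

for percent, duration in events:
    verdict = "inside" if within_itic(percent, duration) else "OUTSIDE"
    print(f"{percent:5.1f}% for {duration:6.2f} s: {verdict} envelope")
```

Note how the same 85% dip is acceptable for one second but not for a minute; a sustained evening sag is exactly the kind of event that falls outside the envelope.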

Home Transients:
Home - Transients

I only show the transients graph for home, as the waveforms all look different, and the only meaningful difference between home and office is that 87 events were recorded at home versus 10 at the office over approximately the same duration. See PQW for an explanation of transients.

We can clearly see that the power quality at my house is significantly worse compared to the power at my office.

I am speculating, but I wonder if the old transformer across the road can supply sufficient power, given that it used to supply power to three small very old houses on four lots, demolished to make room for four new larger houses?

I just opened a support ticket with SCE, let’s hope they can do something about the problem.

LSI turns their back on Green

I previously blogged here and here about my research into finding power saving RAID controllers.

I have been using LSI MegaRAID SAS 9280-4i4e controllers in my Windows 7 workstations and LSI MegaRAID SAS 9280-8e controllers in my Windows Server 2008 R2 servers. These controllers work great: my workstations go to sleep and wake up, and in both workstations and servers drives spin down when not in use.

I am testing a new set of workstation and server systems running Windows 8 and Server 2012, and using the “2nd generation” PCIe 3.0 based LSI RAID controllers. I’m using LSI MegaRAID SAS 9271-8i with CacheVault and LSI MegaRAID SAS 9286CV-8eCC controllers.

I am unable to get any of the configured drives to spin down on either controller, in either Windows 8 or Windows Server 2012.

LSI has not yet published any Windows 8 or Server 2012 drivers on their support site. In September 2012, after the public release of Windows Server 2012, LSI support told me drivers would ship in November, and now they tell me drivers will ship in December. All is not lost as the 9271 and 9286 cards are detected by the default in-box drivers, and appear to be functional.

I had hoped the no spin-down problem was a driver issue, and that it would be corrected by updated drivers, but that appears to be wishful thinking.

I contacted LSI support about the drive spin-down issue, and was referred to this August 2011 KB 16563, pointing to KB 16385 stating:

newer versions of firmware no longer support DS3; the newest version of firmware to support DS3 was 12.12.0-0045_SAS_2108_FW_Image_APP-2.120.33-1197

When I objected to the removal, support replied with this canned quote:

In some cases, when Dimmer Switch with DS3 spins down the volume, the volume cannot spin up in time when I/O access is requested by the operating system.  This can cause the volume to go offline, requiring a reboot to access the volume again.

LSI basically turned their back on green by disabling drive spin-down on all new controllers and new firmware versions.

I have not had any issues with this functionality on my systems, and spinning down unused drives to save power and reduce heat is a basic operational requirement. Maybe there are issues with some systems, but at least give me the choice of enabling it in my environment.

A little bit of searching shows I am not alone in my complaint, see here and here.

Intel also published KB 033877 in November 2012, stating that they have disabled drive power save on all their RAID controllers, which is maybe not that surprising given that Intel uses rebranded LSI controllers.

After a series of overheating batteries and S3 failures, I gave up on Adaptec RAID controllers long ago, but this situation with LSI is making me take another look at them.

Adaptec advertises Intelligent Power Management as a feature of their controllers. I ordered a 7805Q controller, and will report my findings in a future post.

Synology DS2411+ Performance Review

In my last post I compared the performance of the Synology DS1511+ against the QNAP TS-859 Pro. As I finished writing that post, Synology announced the new Synology DS2411+.
Instead of using a DS1511+ and DX510 extender for 10 disks, the DS2411+ offers 12 disks in a single device. The price difference is also marginal: the DS1511+ is $836, the DX510 is $500, and the DS2411+ is $1700. That is a difference of only $364, and well worth it for the extra storage space, and the reliability and stability of all drives in one enclosure. I ended up returning my DX510 and DS1511+, and got a DS2411+ instead.
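
The price comparison works out as follows. This is a small Python sketch using only the prices and disk counts quoted above; the cost per bay figure is my own addition for comparison and ignores the cost of the disks themselves.

```python
# Prices in US dollars, as quoted above.
prices = {"DS1511+": 836, "DX510": 500, "DS2411+": 1700}

# Disk counts as stated above: 10 for the combination, 12 for the DS2411+.
combo_bays = 10
ds2411_bays = 12

combo_price = prices["DS1511+"] + prices["DX510"]
price_difference = prices["DS2411+"] - combo_price

print(f"DS1511+ and DX510: ${combo_price} (${combo_price / combo_bays:.2f} per bay)")
print(f"DS2411+:           ${prices['DS2411+']} (${prices['DS2411+'] / ds2411_bays:.2f} per bay)")
print(f"Difference:        ${price_difference}")
```

Per bay the DS2411+ is slightly more expensive, so the argument for it rests on the two extra bays and the single enclosure rather than on price alone.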

To test the DS2411+, I ran the same performance tests, using the same MPIO setup as I described in my previous post. The only slight difference was in the way I configured the iSCSI LUN; the DS1511+ was configured as SHR2, while the DS2411+ was configured as RAID6. Theoretically both are the same when all the disks are the same size, and SHR2 ends up using RAID6 internally.
iSCSI LUN configuration:
DS2411.iSCSI.LUN

At idle the DS2411+ used 42W, and under load it used 138W. The idle figure is close to the advertised 39W, but the load figure is quite a bit more than the advertised 105W.
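
To quantify how far the measurements are from the advertised figures, here is a minimal Python sketch using the four wattages above; the function name is my own.

```python
def overshoot_percent(measured_w: float, advertised_w: float) -> float:
    """Return how far the measured power exceeds the advertised power, in percent."""
    return 100.0 * (measured_w - advertised_w) / advertised_w


# (measured, advertised) in watts, from the figures above.
readings = {"idle": (42.0, 39.0), "load": (138.0, 105.0)}

for state, (measured, advertised) in readings.items():
    print(
        f"{state}: measured {measured:.0f}W vs advertised {advertised:.0f}W, "
        f"{overshoot_percent(measured, advertised):.1f}% over"
    )
```

The comparison is only indicative, as my disk count, disk models, and test load are unlikely to match the conditions behind the advertised numbers.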

I use Remote Desktop Manager to manage all my devices in one convenient application. RDM supports web portals, Remote Desktop, Hyper-V, and many more remote configuration options, all in a single tabbed UI. What I found was that the Synology DSM has some problems when running in a tabbed IE browser. When I open the log history, I get a script error, and whenever I focus away and back on the browser window, the DSM desktop windows shift all the way to the left. I assume this is a DSM problem related to absolute and relative referencing. I logged a support case, and I hope they can fix it.
Script error:
DS2411.DSM.Script.Error

Test results:

Device       ATTO Read   ATTO Write   CDM Read   CDM Write
PM810        267.153     260.839      256.674    251.850
DS2411+      244.032     165.564      149.802    156.673
DS1511+      244.032     126.030      141.213    115.032
TS-859 Pro   136.178     95.152       116.015    91.097
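
The relative difference between the two Synology units is easier to see as percentages. This Python sketch uses the DS2411+ and DS1511+ rows from the results above, in whatever units the tools reported; the percent_gain helper is my own.

```python
# Results copied from the DS2411+ and DS1511+ rows above.
results = {
    "DS2411+": {"ATTO Read": 244.032, "ATTO Write": 165.564,
                "CDM Read": 149.802, "CDM Write": 156.673},
    "DS1511+": {"ATTO Read": 244.032, "ATTO Write": 126.030,
                "CDM Read": 141.213, "CDM Write": 115.032},
}


def percent_gain(new: float, old: float) -> float:
    """Return the gain of new over old as a percentage of old."""
    return 100.0 * (new - old) / old


gains = {
    test: percent_gain(results["DS2411+"][test], results["DS1511+"][test])
    for test in results["DS2411+"]
}

for test, gain in gains.items():
    print(f"{test:10s} DS2411+ vs DS1511+: {gain:+.1f}%")
```

The gains are concentrated in the write tests, while the ATTO read result is identical for both units.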

Chart
DS2411+:
Atto.Synology.MPIO
CDM.Synology.MPIO
DS1511+:
Atto.Synology.MPIO
CDM.Synology.MPIO

The DS2411+ published performance numbers are slightly better than the DS1511+ numbers, and my testing confirms that. So far I am really impressed with the DS2411+.

Power Saving RAID Controller (Continued)

This post continues from my last post on power saving RAID controllers.
It turns out the Adaptec 5 series controllers are not that workstation friendly.
I was testing with Western Digital drives; 1TB Caviar Black WD1001FALS, 2TB Caviar Green WD20EADS, and 1TB RE3 WD1002FBYS.
I also wanted to test with the new 2TB RE4-GP WD2002FYPS drives, but they are on backorder.
I found that the Caviar Black WD1001FALS and Caviar Green WD20EADS drives were just dropping out of the array for no apparent reason, yet they were still listed in ASM as if nothing was wrong.
I also noticed that over time ASM listed medium errors and aborted command errors for these drives.
In comparison the RE3 WD1002FBYS drives worked perfectly.
A little searching pointed me to a feature of WD drives called Time Limited Error Recovery (TLER).
You can read more about TLER here, or here, or here.
Basically the enterprise class drives have TLER enabled and the consumer drives do not, so when a consumer drive spends too long on its own error recovery and does not respond to a command from the RAID controller in a reasonable amount of time, the controller drops the drive out of the array.
The same drives worked perfectly in single drive, RAID-0, and RAID-1 configurations with an Intel ICH10R RAID controller, granted, the Intel chipset controller is not in the same performance league.
The Adaptec 5805 and 5445 controllers I tested did let the drives spin down, but the controller is not S3 sleep friendly.
Every time my system resumes from S3 sleep ASM would complain “The battery-backup cache device needs a new battery: controller 1.”, and when I look in ASM it tells me the battery is fine.
Whenever the system enters S3 sleep the controller does not spin down any of the drives, this means that all the drives in external enclosures, or on external power, will keep on spinning while the machine is sleeping.
This defeats the purpose of power saving and sleep.
The embedded Intel ICH10R RAID controller did correctly spin down all drives before entering sleep.
Since installing the ASM utility my system is taking a noticeably longer time to shut down.
Vista provides a convenient, although not always accurate, way to see what is impacting system performance in terms of event timing, and ASM was identified as adding 16s to every shutdown.
Under [Computer Management][Event Viewer][Applications and Services Logs][Microsoft][Windows][Diagnostics-Performance][Operational], I see this for every shutdown event:
This service caused a delay in the system shutdown process:
File Name : AdaptecStorageManagerAgent
Friendly Name :
Version :
Total Time : 20002ms
Degradation Time : 16002ms
Incident Time (UTC) : 6/11/2009 3:15:57 AM
It really seems that Adaptec did not design or test the 5 series controllers for use in workstations. This is unfortunate, for performance wise the 5 series cards really are great.
[Update: 22 August 2009]
I received several WD RE4-GP / WD2002FYPS drives.
I tested with W2K8R2 booted from a WD RE3 / WD1002FBYS drive connected to an Intel ICH10R controller on an Intel S5000PSL server board.
I tested 8 drives in RAID6 connected to a LSI 8888ELP controller, worked perfectly.
I connected the same 8 drives to an Adaptec 51245 controller, at boot only 2 out of 8 drives were recognized.
After booting, ASM showed all 8 drives, but they were continuously dropping out and back in.
I received confirmation of similar failures with the RE4 drives and Adaptec 5 series cards from a blog reader.
Adaptec support told him to temporarily run the drives at 1.5Gb/s. Apparently this does work, although I did not test it myself. Clearly this is neither a long term solution nor acceptable.
I am still waiting to hear back from Adaptec and WD support.
[Update: 30 August 2009]
I received a reply from Adaptec support, and the news is not good: there is a hardware compatibility problem between the WD RE4-GP / WD2002FYPS drives and the controller.
“I am afraid currently these drives are not supported with this model of controller. This is due to a compatibility issue with the onboard expander on the 51245 card. We are working on a hardware solution to this problem, but I am currently not able to say in what timeframe this will come.”
[Update: 31 August 2009]
I asked support if a firmware update will fix the issue, or if a hardware change will be required.
“Correct, a hardware solution, this would mean the card would need to be swapped, not a firmeware update. I can’t tell you for sure when the solution would come as its difficult to predict the amount of time required to certify the solution but my estimate would be around the end of September.”
[Update: 6 September 2009]
I experienced similar timeouts testing an Areca ARC-1680 controller.
Areca support was very forthcoming with the problem and the solution.
“this issue had been found few weeks ago and problem had been reported to WD and Intel which are vendors for hard drive and processor on controller. because the problem is physical layer issue which Areca have no ability to fix it.
but both Intel and WD have no fix available for this issue, the only solution is recommend customer change to SATA150 mode.
and they had closed this issue by this solution.
so i do not think a fix for SATA300 mode may available, sorry for the inconvenience.”
That explains why the problem happens with the Areca and Adaptec controllers but not the LSI: the Areca and Adaptec both use the Intel IOP348 processor.

Power Saving SATA RAID Controller

I’ve been a longtime user of Adaptec SATA RAID cards (3805, 5805, 51245), but over the years I’ve become more energy saving conscious, and the Adaptec controllers did not support Windows power management.
My workstations are normally running in the “Balanced” power mode so that they will go to sleep after an hour, but sometimes I need to run computationally intensive tasks that leave the machines running 24/7.
During these periods the disks don’t need to be on and I want the disks to spin down, like they would had they been directly connected and not in a RAID configuration.
I was building a new system with 4 drives in RAID10, and I decided to try a 3Ware / AMCC SATA 9690SA-4I RAID controller. Their sales support confirmed that the card does support native Windows power management.
I also ordered a battery backup unit with the card, and my first impressions of installing the battery backup unit were less than favorable. The BBU comes with 4 plastic screws with pillars, but the 9690SA card only had one mounting hole. After inserting the BBU in the IDC header I had to pull it back out and adjust it so that it would align properly.
After running the card for a few hours I started getting battery overheating warnings. The BBU comes with an extension cable, and I had to use the extension cable and mount the battery away from the controller board. After making this adjustment the BBU seemed to operate at normal temperature.
Getting back to installation, the 3Ware BIOS utility is very rudimentary (compared to Adaptec), I later found out that the 3Ware Disk Manager 2 (3DM2) utility is not much better. The BIOS only allowed you to create one boot volume, and the rest of the disk space was automatically allocated. The BIOS also only supports INT13 booting from the boot volume.
I installed Vista Ultimate x64 on the boot volume, and used the rest of the disk space for data. I also installed the 3DM2 management utility, and the client tray alerting application. The client utility does not work on Vista because it requires elevation, and elevation is not allowed for auto start items. The 3DM2 utility is a web server and you connect using your web browser.
At first the lack of management functionality did not bother me, I did not need it, and the drives seemed to perform fine. After a month or so I noticed that I was getting more and more controller reset messages in the eventlog. I contacted 3Ware support, and they told me they see CRC errors and that the fanout cable was probably bad. I replaced the cable, but the problems persisted.
The CRC errors reminded me of problems I had with Seagate ES2 drives on other systems, and I updated the firmware in the 4 500 GB Seagate drives I was using. No change, same problem.
I needed more disk space anyway, so I decided to upgrade the 500GB Seagate drives to 1TB WD Caviar Black drives. The normal procedure would be to remove the drives one by one, insert the new drive, wait for the array to rebuild, and when all drives have been replaced, to expand the volume.
A 3Ware KB article confirmed this operation, but, there was no support for volume expansion, what?
In order to expand the volume I would need to boot from DOS, as Windows is not supported, run a utility to collect data, and send the data to 3Ware. They would then create a custom expansion script for me to run against the volume to rewrite the metadata. They highly recommend that I back up the data before proceeding.
I know the Adaptec Storage Manager (ASM) utility does support volume expansion, I’ve used it, it’s easy, it’s a right click in the GUI.
I never got to the point of actually trying the expansion procedure. After swapping the last drive I ran a verify, and one of the mirror units would not go past 22%. Support told me to try various things, disable scheduling, enable scheduling, stop the verify, restart the verify. When they eventually told me it seems there are some timeouts, and that the cause was Native Command Queuing (NCQ) and a bad BBU, I decided I had enough.
The new Adaptec 5-series cards do support power management, but unlike the 9690SA card they do not support native Windows power management, and require power savings to be enabled through the ASM utility.
I ordered an Adaptec 5445 card, booted my system from WinPE with the 9690SA still in place, made image backups using Symantec Ghost Solution Suite (SGSS), installed the 5445 card, created new RAID10 volumes, booted from WinPE, restored the images using Ghost, and Vista booted just fine.
From past experience I knew that when changing RAID controllers I had to make sure that the Adaptec driver would be ready after swapping the hardware, else the boot will fail. So before I swapped the cards and made the Ghost backup, I used regedit and changed the start type of the “arcsas” driver from disabled to boot. I know that SGSS does have support for driver injection used for bare metal restore, but since the Adaptec driver comes standard with Vista, I just had to enable it.
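
For reference, the same registry change can be scripted instead of made by hand in regedit. The Python sketch below only builds the reg add command line and does not touch the registry itself. I am assuming the driver lives under the usual HKLM\SYSTEM\CurrentControlSet\Services key and that the standard service start type values apply (0 for boot, 4 for disabled); the function name is hypothetical.

```python
# Standard Windows service start type values for the Start registry value.
SERVICE_START_TYPES = {
    "boot": 0,
    "system": 1,
    "automatic": 2,
    "manual": 3,
    "disabled": 4,
}

# Assumed location of driver and service entries in the registry.
SERVICES_KEY = "HKLM\\SYSTEM\\CurrentControlSet\\Services\\"


def reg_add_start_command(service: str, start_type: str) -> str:
    """Build a reg add command that sets the start type of a driver or service."""
    value = SERVICE_START_TYPES[start_type]
    key = SERVICES_KEY + service
    return f'reg add "{key}" /v Start /t REG_DWORD /d {value} /f'


# Change the arcsas driver from disabled to boot start before imaging.
print(reg_add_start_command("arcsas", "boot"))
```

The printed command would need to be run from an elevated command prompt, and the change has to be made before the image backup so the restored system already has the driver enabled.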
It has only been a few days, but the system is running stable with no errors. Based purely on boot times, I do think the WD WD1001FALS Caviar Black drives are faster than the Seagate ST3500320AS Barracuda drives I used before.
Let’s hope things stay this way.
[Updated: 17 July 2009]
The Adaptec was not that power friendly after all.
Read the continued post.