HPE users: patch our SAS SSDs to quash permanent crash bug

by Mark Tyson on 27 November 2019, 13:11

Tags: HP (NYSE:HPQ)


Hewlett Packard Enterprise (HPE) has issued a support bulletin for users of its SAS solid state drives (SSDs). If left unpatched, certain models, sold both individually and as components of other server and storage products, will fail as they pass 32,768 hours of operation. Understandably, HPE lists this as a critical fix and advises users to apply the patches immediately.

If you are running any HPE SAS SSD with drive firmware older than version HPD8, please head over to the support bulletin page and check your HPE model number / SKU to see if it is affected by this critical bug. The affected parts list is rather long and includes a number of HPE server and storage products from the ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 lines.

Looking more closely at this peculiar bug, it is said to cause "drive failure and data loss at 32,768 hours of operation and require restoration of data from backup in non-fault tolerance, such as RAID 0 and in fault tolerance RAID mode if more drives fail than what is supported by the fault tolerance RAID mode logical drive". The "firmware defect" will make itself known after 32,768 hours of operation, which equates to 3 years, 270 days and 8 hours of runtime. HPE adds that if one of these storage devices runs past this threshold and fails, "neither the SSD nor the data can be recovered". Compounding the problem, SSDs put into service as a batch will likely all fail at almost the same time.
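
For those checking the arithmetic: 32,768 hours divided by 24 gives 1,365 days and 8 hours, and 1,365 days is (3 × 365) + 270 days, which matches HPE's figure of 3 years, 270 days and 8 hours (leap days ignored).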

HPE's Offline Smart Storage Administrator software for Windows and Linux can show you your SSD's power-on hours in its GUI. However, it is probably advisable to simply head over to the new firmware download links, about two-thirds of the way down the HPE support bulletin page, and apply the firmware if you are anywhere near three years of use on your HPE SAS SSD product. Meanwhile, there have already been reports on Reddit of HPE SSD failures attributed to this bug.

Please note that only eight of the 20 affected drives have patches available today; patch files for the remaining 12 affected SKUs will not arrive until the second week of December. HPE assures that those still waiting for patches cannot yet have run their SSDs long enough for the critical failure to occur.

While HPE didn't provide any hints about what might be tripping this 'SSD kill switch' at 32,768 hours of operation, some experts have speculated that it could be a permanent crash caused by integer overflow. A signed 16-bit integer can hold values from -32,768 up to 32,767, so a counter incremented past 32,767 hours would wrap around, which is probably relevant to both the problem and the solution.
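
To illustrate the theory, here is a minimal, hypothetical C sketch; HPE has not published its firmware internals, so the counter and its width are assumptions. A power-on-hours value held in a signed 16-bit field misbehaves the moment it is incremented past 32,767:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical power-on-hours counter stored in 16 signed bits. */
        int16_t hours = INT16_MAX;   /* 32767: the largest value it can hold */

        /* The addition happens in plain int, producing 32768, which does not
         * fit back into int16_t; on two's-complement machines the conversion
         * wraps the value around to -32768. */
        hours = (int16_t)(hours + 1);

        printf("hour 32768 reads back as %d\n", hours);   /* prints -32768 */
        return 0;
    }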



HEXUS Forums :: 7 Comments

DanceswithUnix:
Hopefully the patch doesn't make it fail at 65536 hours of use (which would be outside warranty ;) )

edit: Ooh, I speculated an overflow as soon as I saw the 32768 number, guess that makes me an expert!

Tabbykatze:
That was immediately my first thought too: this is just a simple maximum-value overflow bug!

Such a silly bug to have in 2019 xD

DanceswithUnix:
It isn't usually the overflow that directly kills your code though; it is usually some secondary effect, like using the resulting -32768 value from the overflow to search or index into a table which doesn't have any entries for negative numbers. Given that power-on hours isn't usually considered that important a metric, I can imagine it not being that heavily tested either.

OTOH, if it was something like using the top bit as a debug flag then someone needs to be taken out and shot :D
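
To picture that secondary effect, here is a minimal sketch in C; the table, the divisor and every name in it are invented for illustration and have nothing to do with HPE's actual firmware:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical lookup table indexed by a function of power-on hours;
     * 32 buckets, none of which exist for negative values. */
    static uint32_t wear_bucket[32];

    static uint32_t lookup(int16_t power_on_hours)
    {
        /* 32767/1024 = 31: the last valid bucket. But once the counter has
         * wrapped, -32768/1024 = -32, and the read below is out of bounds:
         * undefined behaviour, and in MMU-less firmware most likely a crash
         * or silently corrupted state. */
        int index = power_on_hours / 1024;
        return wear_bucket[index];
    }

    int main(void)
    {
        printf("%u\n", lookup(32767));    /* fine: reads bucket 31 */
        printf("%u\n", lookup(-32768));   /* the failure mode described above */
        return 0;
    }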

Tabbykatze:
Ha ha, chemical sheds and the ditches!

It is very interesting that the drive is completely inoperable/irrecoverable when this value is hit, which definitely follows your logic of the secondary effect. Maybe the time is used as an input to a SMART calculation, and the SMART code crashes and takes the controller with it?

Edit: to qualify my thought, the flipped bit would make the time negative, so the calculations, if uncaught, would just drop out of range. Why they're counting time using a signed 16-bit integer is a little bit odd…
DanceswithUnix (replying to Tabbykatze's "Why they're counting time using a signed 16-bit integer is a little bit odd…"):

Thinking about it, there is a good chance they aren't, and this isn't an overflow…

Imagine you store that value in a word of flash; then every hour you erase the page it is in and re-write it with the new value, one higher. That's 65535 erase/write cycles on a page just to store one thing, where a page in a modern flash device has an endurance of about 3000 erase cycles. Just to count.

Now imagine you instead choose a 4KB page of flash; that's 32768 bits in total. On first ever power-up you erase the page so all the bits are 1s. Every hour, you clear one bit. Flash is written by erasing an entire page to all-1 bits (each byte 0xFF) and then clearing the bits you want cleared to get the value you want stored, so you can actually zero a bit in flash at any time without erasing it first (flash programming fun fact!); you only need an erase to flip a zero back into a one. Now you get 3.7 years of counting hours before you have to erase to count the next 3.7 years, so your 3000-erase endurance gets you over 11000 years of counting. Handling the 3.7-year boundary would take some careful testing though (with how many cycles you had been through being stored elsewhere).

That's probably how I would do it anyway, and given storage devices use a 4K filesystem page that fits nicely.

Hmm, so now I don't think it is an overflow. Will have to hand my expert title back ;)
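
For the curious, that bit-per-hour scheme is easy to sketch in C, simulated here in a RAM buffer rather than real flash; all names are hypothetical, and whether HPE's firmware does anything like this is pure speculation:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_BYTES 4096                   /* one 4KB flash page       */
    #define PAGE_BITS  (PAGE_BYTES * 8)       /* 32768 bits = 32768 hours */

    static uint8_t page[PAGE_BYTES];          /* stand-in for the flash page */

    /* Erasing flash sets every bit to 1. */
    static void erase_page(void) { memset(page, 0xFF, sizeof page); }

    /* Count one hour by clearing the next set bit; clearing needs no erase. */
    static void tick_hour(uint32_t hour)
    {
        page[hour / 8] &= (uint8_t)~(1u << (hour % 8));
    }

    /* Power-on hours = number of zeroed bits in the page. */
    static uint32_t read_hours(void)
    {
        uint32_t zeros = 0;
        for (uint32_t i = 0; i < PAGE_BITS; i++)
            if (!(page[i / 8] & (1u << (i % 8))))
                zeros++;
        return zeros;
    }

    int main(void)
    {
        erase_page();
        for (uint32_t h = 0; h < PAGE_BITS; h++)
            tick_hour(h);

        /* After hour 32768 every bit is zero: the page must be erased and a
         * cycle count bumped somewhere else. Mishandle that boundary and the
         * counter logic breaks at exactly 32,768 hours. */
        printf("hours counted: %u\n", read_hours());
        return 0;
    }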