Seagate boot-of-death analysis – nothing but overhyped FUD
The nature and scope of the Seagate 7200.11 boot-of-death problem has been blown way out of proportion, and people are making grossly incorrect assumptions. Seagate recently released a failure analysis report under non-disclosure to some (or all, I don’t know) OEM partners and distributors that describes the issue in great detail. Why under NDA? In my opinion, full knowledge of the problem could potentially create a blueprint for virus writers who want to go beyond just erasing files on targeted machines. So to be safe, full specifics aren’t being disclosed (but you can now find a little more info on Tom’s Hardware if you know where to look).
As part of the manufacturing process, Seagate writes diagnostic information to reserved areas of the disk drives. These bit patterns work with the test equipment and drive firmware to perform diagnostic actions such as placing the drive in a secure lockdown mode. Once SATA disks (well, all disks) are powered up, they maintain counters in these reserved areas for purposes of diagnostics and usage reporting. Reserved areas are used because they are non-volatile, are hidden from the O/S, and require special commands to read from or write to them. The firmware includes error-recovery logic that is triggered based on the contents of log page values. (For example, when the drive temperature gets too high, the disk spins down.)
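To picture how log-page-driven recovery logic works, here is a minimal sketch of the temperature example above. Everything here is invented for illustration (the page ID, the threshold, and the function names are assumptions, not Seagate internals); the point is simply that this kind of check is a threshold comparison.

```python
# Illustrative sketch only -- the real firmware logic is proprietary.
# TEMP_PAGE and SPIN_DOWN_TEMP_C are invented placeholder values.

TEMP_PAGE = 0x05          # hypothetical log-page ID holding drive temperature
SPIN_DOWN_TEMP_C = 60     # hypothetical over-temperature limit

def recovery_action(log_pages):
    """Threshold-style check: any value AT OR ABOVE the limit triggers it."""
    if log_pages.get(TEMP_PAGE, 0) >= SPIN_DOWN_TEMP_C:
        return "spin_down"
    return "ok"
```

Note the `>=`: a normal threshold fires for the limit and everything above it. That detail matters for the next paragraph, where the bricking logic behaves very differently.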
So here is what happened. For whatever reason, some of Seagate’s test equipment didn’t zero out the test pattern once the test suite completed, and these disks were shipped. When disks that have this test pattern pre-loaded into the reserved area are put into service, they are subjected to certain errors, warnings, or I/O activity [remember, I’m not going to tell you what the specific trigger is, but the information is available to people who need to know] that results in a counter reaching a certain value. (This is NOT a threshold, but an exact value. I.e., if the magic number was 12345, then 12346 and higher would NOT trigger the bricking logic. Only 12345 triggers it.) Furthermore, this value is stored in a circular buffer, so it can go up and down over the life of the disk. In order for the disk to brick, the disk must be spun down at the EXACT MOMENT this field is set to this magic number. (The magic number is not an 8-bit value, either.) So either on power-down, or power-up, the firmware saw the bit pattern, and the magic number in the circular buffer, and likely did what it was programmed to do: perform a type of lockdown test that is supposed to happen in the safety of the manufacturing/test lab, where it can be unlocked and appropriate action taken by test engineers.
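The exact-match condition is the key point, so here is a sketch of the three conditions that must coincide. The value 12345 is the article’s own placeholder (the real value is undisclosed), and the function name is invented:

```python
MAGIC = 12345  # placeholder from the text above, NOT the real undisclosed value

def would_brick(test_pattern_present, counter, spinning_down):
    """All three conditions must hold at the same instant:
    1. the factory test pattern was left behind in the reserved area,
    2. the counter EQUALS the magic value (==, not >=), and
    3. the drive is being spun down at that exact moment."""
    return test_pattern_present and counter == MAGIC and spinning_down
```

Because the check is `==` rather than `>=`, a counter that sails past the magic value while the drive stays spinning does no harm at all.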
So, let’s say you have a disk with the naughty firmware that was tested on the wrong test systems at the wrong time. Let’s say that the magic number is a 16-bit number. Then even if you had one of the disks that are at risk, the odds are greater than 65,000:1 against powering the disk off while the counter is set to this magic number. If the magic number is stored in a 32-bit field, then buy lottery tickets, because you have a higher probability of winning the lottery than you do of the disk spinning down with the register set to the right value. (BTW, the magic number is not something simple like the number of cumulative hours.)
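The back-of-envelope odds above can be written out directly. This assumes the counter value at spin-down is roughly uniformly distributed over the field, which is an assumption for illustration, not a known fact about the firmware:

```python
def per_spindown_risk(field_bits):
    """Chance the counter equals the magic value at one given spin-down,
    assuming (hypothetically) a uniform distribution over a field of
    `field_bits` bits."""
    return 1.0 / (2 ** field_bits)

# 16-bit field: 1 in 65,536 per power-down.
# 32-bit field: 1 in ~4.29 billion per power-down.
```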
What also happened is that everybody and their brother who ever had an unrelated problem, like a general drive failure on a Barracuda disk, has been coming out of the woodwork complaining about the problem. They see posts about bricking and boot-of-death, and the next thing you know they are adding fuel to the fire that got picked up by the media. Remember that people who don’t have problems generally don’t log into Seagate support sites.
Common Questions and Answers
Q. So why doesn’t Seagate just post an alert saying if you have certain firmware, your disk is in danger of bricking?
A. Unless your disk was hooked up to the test equipment in question, then there is no danger. The reserved area must already contain the magic bit pattern.
Q. Will a firmware update fix the problem, really?
A. Yes. Too many things have to fall into place for the bricking function to kick in, so the update merely has to add further constraints, or clear out the reserved area.
Q. Will the boot-of-death destroy data?
A. No, but to recover a bricked disk you have to either hook it up to a diagnostic board (approx. $500) or send it back to Seagate, who will fix it for free.
Q. What about those reports where 30-40% of disks are affected?
A. 30-40% of disks have affected firmware. 30-40% of the disks were not tested on affected hardware. 30-40% of disks will not encounter the specific condition of reaching the specific bricking value in a certain log page. So do the math and consider the source. While the statement may be true concerning affected firmware, once you factor in everything else, the risk is “nominal”.
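To make “do the math” concrete, here is one way the factors multiply out. The 35% figures are just the midpoint of the claimed 30-40% range, and the 16-bit field is the same illustrative assumption as before; none of these are confirmed numbers:

```python
# Illustrative only: every input below is an assumption for the sake of
# showing how quickly the headline percentage shrinks.
p_affected_fw = 0.35          # midpoint of the claimed 30-40% with affected firmware
p_bad_test_rig = 0.35         # assumed share actually exposed to the bad test equipment
p_magic_at_spindown = 1 / 65536  # per-spin-down odds if the field is 16 bits

combined = p_affected_fw * p_bad_test_rig * p_magic_at_spindown
# combined is on the order of a couple per million per spin-down,
# not 30-40% of the installed base.
```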
Q. Are patches / fixes out yet?
A. Yes. As of the date of this posting, they are out for every affected model. The problem, of course, is that it is a pain to update the firmware.
Q. Do I need to update my firmware?
A. I don’t know. See if you have an affected serial number.
Q. What are some of the caveats?
A. None, really, if you are running Windows. If your affected disks are installed in a Solaris, IRIX, HP/UX, or any UNIX-based system, or if the disk(s) are behind a RAID controller, then you need to get out your screwdrivers. (For what it is worth, I’m testing some code which will allow you to flash the revised firmware on UNIX platforms. Feel free to contact me if you are interested.)