Ever since I wrote the first S.M.A.R.T. disk monitor program for Windows back in 1999, I have been amazed at the amount of FUD out there about disk drives and why they break, and at the finger-pointing between disk manufacturers, controller manufacturers, end users, and operating system vendors.

This section will address some of that fear, uncertainty, and doubt.

My favorite research papers that study real-world failures in large populations of disks come from Google and from Carnegie Mellon University (CMU).  Their results are counterintuitive and well worth reading.

Both papers are highly technical and written for storage industry professionals.  The CMU article reads like an advanced calculus textbook, so be warned:

Figure 8.… In this section, we focus on the second key property of a Poisson failure process, the exponentially distributed time between failures. Figure 8 shows the empirical cumulative distribution function of time between disk replacements as observed in the HPC1 system and four distributions matched to it. We find that visually the gamma and Weibull distributions are the best fit to the data, while exponential and lognormal distributions …
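To get a feel for what that excerpt is testing, here is a minimal Python sketch. It is not the CMU authors' method and it uses no data from either paper: the samples are synthetic, and the seed, the sample size, the scale of 100, and the Weibull shape of 0.7 are all values I assumed purely for illustration. It measures how far a sample's empirical cumulative distribution function sits from an exponential curve with the same mean, which is the shape a Poisson failure process would produce.

```python
import math
import random


def exponential_cdf(t, mean):
    """CDF of an exponential distribution with the given mean."""
    return 1.0 - math.exp(-t / mean)


def max_gap_to_exponential(gaps):
    """Largest vertical distance between the empirical CDF of `gaps`
    and an exponential CDF with the same mean (a Kolmogorov-Smirnov-style
    statistic)."""
    ordered = sorted(gaps)
    n = len(ordered)
    mean = sum(ordered) / n
    worst = 0.0
    for i, t in enumerate(ordered):
        model = exponential_cdf(t, mean)
        # The empirical CDF jumps from i/n to (i+1)/n at t; check both sides.
        worst = max(worst, abs(model - i / n), abs((i + 1) / n - model))
    return worst


rng = random.Random(1999)  # fixed seed (assumed) so the run is repeatable
n = 5000                   # sample size (assumed)

# Synthetic "time between replacements" data, NOT measurements from the papers.
# Exponential gaps: what a Poisson failure process would generate.
poisson_like = [rng.expovariate(1.0 / 100.0) for _ in range(n)]
# Weibull gaps with shape < 1 (scale 100, shape 0.7; both assumed values).
clustered = [rng.weibullvariate(100.0, 0.7) for _ in range(n)]

gap_exp = max_gap_to_exponential(poisson_like)
gap_weibull = max_gap_to_exponential(clustered)
print(f"exponential sample:  max CDF gap = {gap_exp:.3f}")
print(f"Weibull(0.7) sample: max CDF gap = {gap_weibull:.3f}")
```

The exponential sample should hug the exponential curve closely, while the Weibull sample should show a visibly larger gap. That is the kind of mismatch the paper describes when it says the exponential distribution is not the best fit to the observed replacement times.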

If you aren’t up on the terminology, technology, and calculus, don’t feel too bad if some of it doesn’t make sense to you; start with the Google paper.  Pay close attention to the study of failure rates vs. disk drive temperature.  It makes a strong case that disk drive coolers are a waste of money.

If you need software that can analyze and report this type of information, especially if you are on a UNIX platform or have embedded RAID controllers, check out the SANtools software.  We are celebrating the 10th anniversary of shipping the first and original S.M.A.R.T. Disk Monitor (see the links on the About page for some old product announcements from 1999).  It is very depressing that we didn’t get a copyright on it.

