Google datacentre SSD study offers surprising conclusions

by Mark Tyson on 29 February 2016, 10:31

Tags: Google (NASDAQ:GOOG)

Quick Link: HEXUS.net/qacyzi


On Friday a report entitled 'Flash Reliability in Production: The Expected and the Unexpected' was published. Its authors include Professor Bianca Schroeder of the University of Toronto, and Raghav Lagisetty and Arif Merchant of Google Inc. According to the authors, this is among the first studies of the reliability characteristics of flash-based storage in datacentres. What is more, some of the findings and conclusions might be surprising.

As mentioned in the abstract of the new paper, most of the published data on the durability and reliability of flash chips comes from lab experiments. Friday's report, presented at the USENIX FAST '16 Conference on File and Storage Technologies in Santa Clara, is instead based upon a large-scale field study, with Google's datacentres, no less, as the source of the data presented. The study covered ten different drive models, varying flash technologies (MLC, eMLC and SLC), and over six years of production use in Google's datacentres.

SLC and MLC drives are equally reliable

From analysing the data, the researchers came to a number of surprising conclusions, both in comparing flash-based drives to each other and in comparing them against traditional spinning-disc hard drives. I've compiled a bullet-point list of some of the key findings below:

  • SLC drives, which are targeted at the enterprise market and considered to be higher end, are not more reliable than the lower end MLC drives.
  • Age, rather than amount of usage, correlates with higher error rates. So flash memory wearing out isn't really a problem with the SSD designs we have now.
  • Between 20 and 63 per cent of drives experience at least one uncorrectable error during their first four years in the field.
  • Between 30 and 80 per cent of drives develop at least one bad block and 2 to 7 per cent develop at least one bad chip during the first four years in the field.
  • RBER (raw bit error rate), the standard metric for drive reliability, is not a good predictor of those failure modes that are the major concern in practice.
  • RBER and the number of uncorrectable errors grow with PE cycles in a linear fashion.
  • UBER (uncorrectable bit error rate), the standard metric to measure uncorrectable errors, is not very meaningful.
  • While flash drives offer lower field replacement rates than hard disk drives, they have a significantly higher rate of problems that can impact the user, such as uncorrectable errors.
  • Drives tend to either have less than a handful of bad blocks, or a large number of them, suggesting that impending chip failure could be predicted based on prior number of bad blocks.
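To make the two metrics in the list above concrete: RBER and UBER are both simple ratios of error counts to bits read. The sketch below illustrates that arithmetic only; the function names and every counter value are hypothetical and do not come from the paper.

```python
# Illustration of the two error-rate metrics discussed above.
# RBER: raw bit errors (caught and corrected by ECC) per bit read.
# UBER: uncorrectable bit errors (beyond the ECC's power) per bit read.
# All counter values below are made up, for illustration only.

def rber(raw_bit_errors: int, bits_read: int) -> float:
    """Raw bit error rate: corrected errors per bit read."""
    return raw_bit_errors / bits_read

def uber(uncorrectable_bit_errors: int, bits_read: int) -> float:
    """Uncorrectable bit error rate: unrecoverable errors per bit read."""
    return uncorrectable_bit_errors / bits_read

# A drive that has read 1 PiB over its life, with hypothetical error counts:
bits_read = 8 * 2**50                                # 1 PiB, expressed in bits
print(f"RBER = {rber(3_000_000, bits_read):.2e}")    # raw errors are common
print(f"UBER = {uber(2, bits_read):.2e}")            # uncorrectable ones are rare
```

The study's point is that a low RBER does not imply a low rate of the failures users actually notice, so the two numbers are best read independently.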

The above are rather interesting and perhaps surprising conclusions, and quite a few of the surprises are pleasant. However, the way that flash-based drives tend to fail should encourage users to be even more diligent with their backup regimes. You can download the full report here (PDF).



HEXUS Forums :: 8 Comments

at the end mechanical drives have a higher tendency to fail.
interesting I'll have a read of that later.

lumireleon
at the end mechanical drives have a higher tendency to fail.

not sure it's that clear cut.

While flash drives offer lower field replacement rates than hard disk drives, they have a significantly higher rate of problems that can impact the user, such as uncorrectable errors.
“Between 20 and 63 per cent of drives…” “Between 30 and 80 per cent of drives…”

How is that useful data? is it 20 or 63%? is it 30% or 80%. Might as well say between 1 and 100% of drives had issues!
ik9000
interesting I'll have a read of that later.



not sure it's that clear cut.

I think it basically says: spinning drives fail more often but SSDs corrupt more data.

Corruption is always worse than failure.
Gunbust3r
“Between 20 and 63 per cent of drives…” “Between 30 and 80 per cent of drives…”

How is that useful data? is it 20 or 63%? is it 30% or 80%. Might as well say between 1 and 100% of drives had issues!

Margins like that just make the results sound unreliable.