I’ve just finished reading this report from Wikibon and, even though I agree with most of what it says, I’m not sure the timing is right. In fact, from my point of view, it doesn’t reflect the reality of modern datacenters and their evolution, especially when large capacities are involved.

Capacity is important

A few days ago I wrote an article about a two-tier storage architecture for the enterprise. I did it because the economics of disks is still good if you look at the overall picture, and there are a couple of reasons why I think so.
The next wave of applications will heavily involve IoT (the Internet of Things) and Big Data analytics of all sorts. On top of that, data lakes or not, you also have to retain most of your data for an indefinite amount of time (presumably forever!). Integrated hybrid infrastructures capable of managing different kinds of data and workloads at the same time will be the most sustainable in terms of cost, performance and capacity growth.

Comparing Apples with Apples

Industry-wide, we usually compare flash with disks, but the problem is that most of these comparisons are flawed: they pit consumer-grade flash (MLC or TLC) against enterprise-grade disks.

On the contrary, when you look at what’s actually happening around us, vendors and end users are beginning to use consumer-grade disk drives to store their data. The reliability and durability of an individual disk is no longer that important, especially when the numbers get really big.

Implementing consumer-grade flash in enterprise storage systems became possible because vendors, at every level, started to optimize how the NAND is used. It’s not only about basic techniques like wear leveling and optimized garbage collection: the storage system itself is now capable of writing and reading data in the proper way (for example, using the proper block size) to avoid any unnecessary write operation on the media that could shorten its life. Something similar has been happening for disks too.
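To make the idea concrete, here is a minimal sketch (in Python, purely illustrative and not any vendor’s actual code) of what “writing in the proper block size” can mean in practice: small writes are coalesced and flushed only as full, aligned blocks, so the flash layer doesn’t have to do extra read-modify-write cycles on partially filled pages. The block size and the AlignedWriteBuffer class are assumptions made for the example.

```python
# Illustrative sketch only: coalesce small writes and flush them in full,
# aligned blocks so the media never sees a partial-block write.

FLASH_BLOCK_SIZE = 16 * 1024  # assumed program/erase granularity; varies by NAND

class AlignedWriteBuffer:
    def __init__(self, backend_write):
        self.backend_write = backend_write  # callable that takes bytes
        self.pending = bytearray()

    def write(self, data: bytes):
        self.pending.extend(data)
        # Flush only whole blocks; the remainder waits for more data.
        full = len(self.pending) // FLASH_BLOCK_SIZE * FLASH_BLOCK_SIZE
        if full:
            self.backend_write(bytes(self.pending[:full]))
            del self.pending[:full]

    def flush(self):
        if self.pending:
            # Pad the tail to a block boundary; a real system would also track
            # the logical length so the padding is never returned on reads.
            self.pending.extend(b"\x00" * (FLASH_BLOCK_SIZE - len(self.pending)))
            self.backend_write(bytes(self.pending))
            self.pending.clear()
```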

Software-defined reliability and efficiency

Most of the alternatives to classic RAID data protection, like multiple data copies or erasure coding, have solved the problem of potential data loss with big disks. And when you look at modern storage systems, data is organized much more efficiently than in the past, making rebuilds much quicker even when slow disks are involved.

Scale-out architectures are the best you can find for flash, and that is even more true for disks. Scaling capacity horizontally, thanks to commodity x86 hardware and fast, cheap interconnects, has opened up new possibilities for developers. If I take object storage as an example, most vendors keep a copy of all metadata in RAM, considerably speeding up access to information without actually hitting the low-speed disks until it is really necessary to fetch an object. Disks in these systems could, theoretically, be slowed down (or switched off) to save power when idle, making the solution much more efficient and interesting for certain use cases. On top of that, other techniques, like caching, further improve disk access.
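As a rough illustration of that pattern (a minimal sketch, not any specific vendor’s design; the ObjectIndex class and its layout are assumptions for the example), metadata lookups are served entirely from memory and the spinning disks are touched only when an object body is actually read or written:

```python
# Minimal sketch: an in-memory metadata index in front of disk-resident objects.

class ObjectIndex:
    def __init__(self):
        self.meta = {}  # object name -> {"size": int, "path": str}

    def put(self, name: str, data: bytes, path: str):
        with open(path, "wb") as f:           # one sequential write to disk
            f.write(data)
        self.meta[name] = {"size": len(data), "path": path}

    def stat(self, name: str):
        return self.meta.get(name)            # served from RAM, no disk I/O

    def get(self, name: str) -> bytes:
        with open(self.meta[name]["path"], "rb") as f:  # disk touched only here
            return f.read()
```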

All the characteristics of modern system design, coupled with modern disk drives, make these solutions still attractive when it’s time to think about capacity. Cloudian, just to give one example, claims a TCO of around $0.01/GB per month for its appliances. And, just a few days ago, HDS launched a new high-capacity solution for its HCP platform that leverages consumer-grade disks to drive down costs while maintaining high resiliency thanks to erasure coding.

Thinking about “multi-petabyte scale”

I’d like to work through an example here. Let’s suppose you have a 1,000-disk system (made of consumer-grade 4TB SATA drives): in the worst-case scenario (with a triple data copy) you get around 1.3PB of usable capacity, and if you can leverage erasure coding it could easily be more than 2.5PB of effective capacity! That’s without involving any form of data compression.

Now, look at the math:

Case 1: three data copies. You can potentially lose two thirds of your disks before losing a single piece of data! In this 1,000-disk example that means more than 600 disks!

Case 2: erasure coding (considering a 10+6 scheme, which means you need 10 segments out of 16 to rebuild your information). You can lose around 370 disks before losing a single bit!
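A quick back-of-the-envelope sketch (in Python, using the assumptions above: 1,000 disks of 4TB each, triple replication versus a 10+6 erasure-coding scheme) shows where those numbers come from. Note that the failure counts are best-case figures, assuming failures don’t all land on the copies or segments of the same object:

```python
# Back-of-the-envelope check of the capacity and failure-tolerance numbers.
DISKS = 1000
DISK_TB = 4
RAW_TB = DISKS * DISK_TB                    # 4,000 TB = 4 PB raw

# Case 1: three full copies of every object.
usable_replication_tb = RAW_TB / 3          # ~1,333 TB, i.e. ~1.3 PB usable
survivable_replication = DISKS * 2 // 3     # up to ~666 disks, best case

# Case 2: 10 data segments + 6 parity segments per object.
data, parity = 10, 6
usable_ec_tb = RAW_TB * data // (data + parity)    # 2,500 TB = 2.5 PB usable
survivable_ec = DISKS * parity // (data + parity)  # ~375 disks, best case

print(usable_replication_tb, survivable_replication)  # ~1333 TB, 666 disks
print(usable_ec_tb, survivable_ec)                     # 2500 TB, 375 disks
```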

In practice, you could sustain very high disk mortality without changing hard disks often. You could potentially collect all the alarms and visit your datacenter once a month to replace failed disks and nodes. And we are talking about very cheap disks ($140 on Amazon.com), probably under warranty, that you can send back to the vendor to be replaced or repaired.
In some cases, due to the natural growth of the storage system, you won’t even need to replace disks but only add new, higher-capacity nodes… Even with 500,000-hour MTBF disks, it’s practically impossible to lose enough drives in a month for it to become a problem, as the rough estimate below shows.
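Here is that estimate as a quick sketch (assuming independent failures and the usual constant-failure-rate reading of MTBF, which real fleets only approximate):

```python
# Expected monthly drive failures in a 1,000-disk fleet at 500,000-hour MTBF.
DISKS = 1000
MTBF_HOURS = 500_000
HOURS_PER_MONTH = 730            # ~365 * 24 / 12

expected_failures_per_month = DISKS * HOURS_PER_MONTH / MTBF_HOURS
print(round(expected_failures_per_month, 1))   # ~1.5 drives per month
```

Losing one or two drives a month out of a thousand is nowhere near the hundreds of failures the protection schemes above can tolerate.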
This is already a successful practice in very demanding environments, like Backblaze for example, where costs have to be kept very low to stay competitive while granting very high reliability and availability to end users. If they can do that at 100+ PB, you can do the same for the 1,000-disk system of my example.

Last but not least, if you have a 1,000-disk system, overall throughput shouldn’t be a problem at all, even if you are using 5400 RPM disks.
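A quick sanity check makes the point (the per-disk figure is an assumption: a 5400 RPM SATA drive typically sustains somewhere around 80 MB/s of sequential throughput, depending on the model):

```python
# Aggregate sequential throughput of 1,000 slow disks, under an assumed per-disk speed.
DISKS = 1000
SEQ_MB_S_PER_DISK = 80                         # conservative assumption for 5400 RPM

aggregate_gb_s = DISKS * SEQ_MB_S_PER_DISK / 1000
print(aggregate_gb_s)                          # ~80 GB/s across the whole system
```

Even if networking, protection overhead and contention eat a large share of that, there is plenty of bandwidth left for capacity-driven, sequential workloads.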

Closing the circle

Flash will rule the datacenter eventually, but not in the immediate future… It will take years.

The Wikibon article is certainly a good study, but it only focuses on a generic notion of “active data” which is not entirely clear to me. And, in fact, it only looks at tier-1 all-flash arrays (as you can see in the tables and charts). I totally agree with them if the intention is to compare products like Solidfire with EMC VMAX, for example.
But if we consider that most of the growth in the coming years will be driven by IoT and other Big Data needs, it means the study is constrained to only specific kinds of workloads, effectively covering only a small part of overall storage spending.

Talking about “active data” can indeed be misleading. The conversation should probably be about types of workloads rather than types of data, especially going forward. Flash is going to be king for every kind of transactional workload, but storing, managing and analyzing huge amounts of data involves high sequential throughput and big capacities… and you can’t call that storage inactive! Can you?