Enterprises are storing much more data than in the past (no news here), and they are going to be storing much more in the near future (no news here either). About a year ago I wrote an article about the need for enterprises to consider a new two-tier strategy, based on Flash and Object storage technologies.
Now, if you look around, you can see the first signs of this happening. But we are only at the beginning of a long road, and there are some aspects we need to take seriously to make it really successful.
Flash, Flash, Flash… and Disks
Flash memory, in all its nuances and implementations, isn’t a niche anymore and every decent deal in 2015 (where primary data is concerned) will contain a certain amount of Flash. Some will be all-flash, others will be hybrid, but it can no longer be avoided. The economics of running traditional primary workloads (IOPS- and latency-sensitive) on flash memory are beyond debate when compared to spinning media. But, at the same time, the opposite is also true: when it comes to space, the hard disk still wins hands down.
Furthermore, another strong point of the hard disk has always been throughput or, at least, $/MB/sec. This doesn’t mean that HDD is better than flash but, when data is correctly organized, you can stream data out of a disk very quickly and at a lower cost than Flash (for example, HDFS blocks are huge, on the order of 128 or 256 MB).
In the next few years Flash will become more and more relevant and we will see it growing to 10-20% of total data storage capacity in most enterprises. This is why the right integration between Flash and Disk tiers will bring a lot of advantages in terms of simplification, and will definitely drive down TCO as well.
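To make the $/MB/sec comparison concrete, here is a minimal sketch of the arithmetic. The `Device` type and its fields are hypothetical names of mine, and no real prices or benchmark figures are implied: you would plug in numbers from your own quotes and measurements.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """A storage device described by three inputs you supply yourself."""
    name: str
    price_usd: float      # purchase price (your own quote, not a reference value)
    capacity_gb: float    # usable capacity in GB
    seq_mb_per_s: float   # sustained sequential throughput in MB/s


def usd_per_gb(device: Device) -> float:
    """Cost of capacity: what one GB of space costs on this device."""
    return device.price_usd / device.capacity_gb


def usd_per_mb_per_s(device: Device) -> float:
    """Cost of throughput: what one MB/s of streaming bandwidth costs."""
    return device.price_usd / device.seq_mb_per_s


def seconds_to_stream(block_mb: float, device: Device) -> float:
    """Time to read one large, contiguous block (e.g. an HDFS block)."""
    return block_mb / device.seq_mb_per_s
```

Comparing `usd_per_gb` and `usd_per_mb_per_s` for a disk and a flash device side by side shows where each medium wins for a given workload; `seconds_to_stream` only makes sense for large sequential reads, which is exactly the case where data is "correctly organized".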
Flash and Object-stores need to talk
Let’s suppose that we are talking about a large infrastructure. In this case it wouldn’t be a single large hybrid system but, more generally, a hybrid storage infrastructure made up of different systems.
Primary storage could be part of a hyper-converged infrastructure or external arrays, and it has all the smart data services we are now used to seeing in modern systems (thin provisioning, snapshots, remote replicas and so on). On the other side we could have huge object-based, scale-out, distributed infrastructures capable of managing several petabytes of data for all non-primary (or better, non-IOPS/latency-sensitive) workloads: in practice, everything ranging from file services and big data to backup and cold data (archiving, for example).
As you are probably aware, some vendors are already proposing systems to de-stage snapshots to a secondary storage system. For example, SolidFire has the ability to copy snapshots and data volumes directly to S3-compatible storage and manage their retention. You can find something similar on HP 3PAR systems (even if in this case it only works with HP StoreOnce VTLs). These kinds of mechanisms lead to better overall efficiency in terms of space used and simplification, but they can also help bring more automation to the infrastructure layer without needing separate or additional software such as traditional backup servers. And even though some backup software can already leverage data services available in the array to do backups, I would like to see more arrays directly supporting Object Storage APIs to move data between primary and secondary systems.
I know that other array vendors are working on similar functionality and I hope we will see more Object-enabled primary storage systems on the market soon.
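The general idea of de-staging snapshots to an object tier and managing their retention there can be sketched in a few lines. This is not how SolidFire or any other vendor implements it: `InMemoryObjectStore`, `destage_snapshot` and `apply_retention` are hypothetical names, and a plain dictionary stands in for an S3-compatible bucket.

```python
from datetime import datetime, timedelta


class InMemoryObjectStore:
    """Hypothetical stand-in for an S3-compatible bucket: key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata):
        self._objects[key] = (data, dict(metadata))

    def delete(self, key):
        self._objects.pop(key, None)

    def items(self):
        # Return a copy so callers can delete while iterating.
        return list(self._objects.items())


def destage_snapshot(store, volume, snapshot_id, data, taken_at):
    """Copy one snapshot to the object tier, recording when it was taken."""
    key = f"{volume}/{snapshot_id}"
    store.put(key, data, {"volume": volume, "taken_at": taken_at.isoformat()})
    return key


def apply_retention(store, now, keep_for):
    """Delete de-staged snapshots older than keep_for; return the deleted keys."""
    expired = []
    for key, (_data, meta) in store.items():
        taken_at = datetime.fromisoformat(meta["taken_at"])
        if now - taken_at > keep_for:
            store.delete(key)
            expired.append(key)
    return sorted(expired)
```

Keeping the snapshot timestamp in object metadata means the retention pass needs nothing but the object store itself, which is the kind of backup-server-free automation described above.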
They need to be smarter
Smart cloud-based analytics is becoming more and more common for primary storage vendors (vendors like Nimble are giving Analytics a central role in their strategy, and rightly so!) but we can’t say the same for secondary systems (which are becoming not so secondary after all). If object-storage becomes the platform to store all the rest of our data, then it’s clear that analytics will become considerably more important in the future.
More types of data and workloads will be managed concurrently by single, large, distributed systems. With this in mind, it’s quite obvious we need to have a clear view of what is happening, when and why. And, of course, predictive analytics will be fundamental too.
Fortunately the first signs of change are visible this year too. Cloudian, an object-storage startup, has launched a new version of its product which now continuously collects information from installed systems and feeds it to an analytics tool to help customers. This is the first release (and I haven’t had the chance to look at it personally yet) but it is going in the right direction, for sure!
In the future I would like to get more insights from my storage analytics than I do now. What is happening in primary storage systems should also happen in secondary storage systems, where it is even more important. Technology already shown by companies like Data Gravity would be even more interesting if applied to huge data repositories (call them “expanded data lakes” if you like, just because they don’t contain only data that you are collecting for BI)… and I can’t wait to see some startups, still in stealth mode, showing their software solutions for analyzing storage content.
They need to be application-aware and data-aware
One more important aspect is application awareness. Some primary storage systems know when they are working with a particular DB or a hypervisor (just to bring up a couple of examples), and they enable specific performance profiles or features to offload some heavy tasks from the servers (VMware VAAI is a vivid example here).
We need similar functionality on secondary object-based storage too, but in this case it is necessary to climb up the stack. It’s not only about being aware of the application but also about how the application works with data. Fortunately, object-based storage systems provide rich metadata capabilities which can be leveraged to build new features and make gateways or native interfaces much smarter than they currently are.
For example, in a big data environment, during the collection-preparation-analysis process, metadata could be used to tag data and offload some basic operations to the storage system during the preparatory stage. This is not easy to achieve but it could enrich data during the ingestion process and open new opportunities for other applications. Searching and Big Data Analytics are the first types of applications that come to my mind, and it is very interesting to see some object storage vendors evolving from being a cold (big) data archive/repository to active storage, now capable of being used instead of HDFS in conjunction with a (diskless) Hadoop cluster. Scality demonstrated it last year and Cloudian raised the bar yesterday with the announcement of its HyperStore 5.1, certified to work with Hortonworks’ HDP!
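As a rough illustration of tagging data at ingestion time and then using those tags instead of re-reading the data, consider the sketch below. `TaggingStore`, its metadata fields and the naive format detection are all assumptions of mine; a real object store would expose this through its own metadata and search APIs.

```python
class TaggingStore:
    """Hypothetical in-memory stand-in for an object store with per-object metadata."""

    def __init__(self):
        self._objects = {}

    def ingest(self, key, data, source):
        """Store the object and enrich it with metadata computed at write time."""
        # Naive format detection, for illustration only.
        looks_like_json = data.lstrip().startswith((b"{", b"["))
        metadata = {
            "source": source,
            "size_bytes": len(data),
            "format": "json" if looks_like_json else "raw",
        }
        self._objects[key] = (data, metadata)
        return metadata

    def find(self, **criteria):
        """Return the keys whose metadata matches every given criterion."""
        return sorted(
            key
            for key, (_data, meta) in self._objects.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )
```

With tags written once at ingestion, a preparation job can call something like `find(source="web", format="json")` and touch only the objects it needs, which is the kind of offload to the storage layer suggested above.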
This also leads back to what I wrote a while ago about the use of containers, flash and objects to build next generation data-driven infrastructures.
Closing the circle
This post stands somewhere between prediction and wishful thinking, I know. But many of the pieces needed to build the complete picture are ready or in development. However, connecting all the dots will take time, probably somewhere in the range of 4-5 years!
Fast and application-aware primary storage on one side and highly reliable, distributed and data-aware secondary storage on the other… capable of talking together. Isn’t it exciting?
In the meantime, you can already check out startups like Primary Data, for example. They are working on products that will make most of the stuff I talked about in the first part of this article possible, up to a certain level at least. Their approach adds a virtualization layer on top (and some complexity) but it also has the big advantage of abstracting the entire storage infrastructure and making it software-defined for real.
[Disclaimer: I recently did some work for Scality and Cloudian.]