Everyone likes hyper-converged systems. They are cool, dense, fast, energy-saving, agile, scalable, manageable, easy to use and whatever else you want. But you know what? They have their limits too… They are good for average workloads, indeed for a large range of workloads, but not for those workloads that need huge amounts of a specific resource to the detriment of the others, like Big Data for example.
Why Big Data needs divergence (an example)
Data grows (steadily… and exponentially) and nothing gets thrown away. Since data adds up, the concept of the “data lake” has taken shape. Even systems created for big data are starting to sense this problem and system architects are beginning to think differently about storage.
I’m going to take Hadoop as an example, because it gives a good idea of a hyper-converged infrastructure, doesn’t it?
Today, most Hadoop clusters are built on top of HDFS (the Hadoop Distributed File System). HDFS characteristics make this FS much cheaper, more reliable and more scalable than many other solutions but, at the same time, it’s limited by the cluster design itself.
CPU/RAM/network/capacity ratios are important for designing well-balanced systems, but things change so rapidly that what you have implemented today could become very inefficient tomorrow. I know that we are living in a very commodity-hardware world now, but despite the short lifespan of modern hardware I’m not convinced that enterprises are willing to change their infrastructures (and spend boatloads of money) very often.
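To make the ratio problem concrete, here is a back-of-the-envelope sketch. All the node specs and per-TB workload demands are made-up assumptions, not benchmarks; the point is only that the same hyper-converged node can be balanced for one workload and badly skewed for the next one:

```python
# Toy resource-balance check for a hyper-converged node.
# Every number here is an illustrative assumption, not a vendor spec.

NODE = {"cores": 16, "ram_gb": 128, "raw_tb": 48}
HDFS_REPLICATION = 3  # usable capacity = raw capacity / replication factor

# Hypothetical per-usable-TB demands of two workload generations
WORKLOADS = {
    "mapreduce_batch": {"cores_per_tb": 0.5, "ram_gb_per_tb": 4},
    "spark_in_memory": {"cores_per_tb": 1.5, "ram_gb_per_tb": 24},
}

usable_tb = NODE["raw_tb"] / HDFS_REPLICATION

for name, w in WORKLOADS.items():
    need_cores = usable_tb * w["cores_per_tb"]
    need_ram = usable_tb * w["ram_gb_per_tb"]
    print(f"{name}: needs {need_cores:.0f} cores / {need_ram:.0f} GB RAM, "
          f"node offers {NODE['cores']} cores / {NODE['ram_gb']} GB")
```

With these made-up numbers, a node that was well balanced for the batch profile comes up short on both CPU and, above all, RAM for the in-memory one.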
Look at what is happening. Two years ago it was all about MapReduce, then it was all about Hive/Impala and the like, and now it’s all about Spark (and other in-memory technologies). What’s next?
Now, my first doubt: “Can they run on the same cluster?” Yes, of course, because the underlying infrastructure (now Hadoop 2.6, with YARN) has evolved as well.
But the real question is “Can they run with the same level of *efficiency* on the same two-year-old cluster?” Probably not.
And another question arises: “Can you update that cluster to meet the new requirements?” Well, this is a tough one to answer… Capacity grows, but you don’t normally need to process all the data at the same time; on the other hand, applications, business needs and workloads change very quickly, making it difficult to build a hyper-converged cluster that serves them all efficiently.
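Just to make the efficiency question concrete, here is a hedged sketch of the kind of per-executor resources an in-memory Spark job typically asks YARN for (the figures are invented for illustration, not tuning advice); whether a two-year-old node can satisfy them without leaving cores and spindles idle is exactly the problem:

```python
# Sketch: resource requests of a RAM-hungry Spark-on-YARN job.
# Executor sizes are illustrative assumptions, not recommendations.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("in-memory-analytics")
         .master("yarn")
         .config("spark.executor.instances", "20")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "24g")   # RAM-heavy by design
         .getOrCreate())

# A node sized two years ago for disk-bound MapReduce (say 64 GB of RAM)
# fits only a couple of these executors, leaving CPU and disks under-used.
df = spark.range(10**8)
print(df.selectExpr("sum(id) AS total").collect())
```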
Things get even more complicated if the big data analytics cluster becomes an enterprise-wide utility. Classic Hadoop tools are no longer the only ones, and more departments of your organization need to have different views and run different analyses on different data sets (which often come from the same raw data…); it’s one of the advantages of a data lake.
Why divergence
I’m sure you are already aware that there is an interesting trend around Hadoop and Object storage. In fact, most of the object storage (and scale-out NAS) vendors are developing HDFS interfaces on top of their platforms while the Hadoop guys are working to use object storage as a sort of extension of HDFS. Doing this means that you’ll be able to increase storage independently from other resources. This is the first step but, at the moment, it only solves the capacity problem… storage must also be fast.
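From the application side, this separation is almost transparent. The sketch below (bucket names, paths and endpoint are all placeholders, and it assumes the s3a connector and credentials are already set up) shows the same Spark code reading from the local HDFS cluster and from an S3-compatible object store just by switching the URI scheme:

```python
# Sketch: one job, two storage back-ends. Assumes the hadoop-aws (s3a)
# connector is on the classpath; endpoint and paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-plus-object-storage")
         .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
         .getOrCreate())

# Hot data on the local HDFS cluster...
hot = spark.read.parquet("hdfs:///datalake/events/last_30_days")

# ...colder data on the object store, through the same DataFrame API.
cold = spark.read.parquet("s3a://datalake-archive/events/history")

print(hot.union(cold).count())
```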
Caching technology helps in this case, and a good mix of DRAM and NAND memory on your compute nodes can do the magic. These nodes, without 3.5” or 2.5” disks, are much smaller, so you can build denser clusters. MCS (Memory Channel Storage) is an example of what I’m talking about: developed by Diablo Technologies and manufactured by companies like SanDisk, it’s a flash memory device with the form factor of a standard DDR3 DIMM. This type of device is very sophisticated and brings very low, predictable latency to each single node. Two weeks ago I also got to meet with Diablo Technologies and they talked about the next step of this technology. In practice, thanks to their physical position (a DIMM slot), clever engineering and a set of APIs, they’ll be able to “extend” DRAM with NAND (they are only different kinds of memory after all), making it possible to build multi-terabyte-memory nodes. This is mind-blowing tech and its best characteristic is not pure performance but agility.
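Just to illustrate the idea (this is a toy sketch in plain Python, not how MCS or its APIs actually work), you can think of it as a small, fast tier backed transparently by a much larger, slightly slower one:

```python
from collections import OrderedDict

class TieredStore:
    """Toy model of a small fast tier (think DRAM) in front of a larger,
    slower tier (think NAND on the memory channel). Purely illustrative."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # small, fast tier, kept in LRU order
        self.slow = {}              # large, slower tier
        self.fast_capacity = fast_capacity

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Spill least-recently-used items to the slow tier when full.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

    def get(self, key):
        if key in self.fast:        # hit in the fast tier
            self.fast.move_to_end(key)
            return self.fast[key]
        value = self.slow.pop(key)  # miss: promote from the slow tier
        self.put(key, value)
        return value

store = TieredStore(fast_capacity=2)
for i in range(5):
    store.put(f"block-{i}", i)
print(store.get("block-0"))         # transparently served from the slow tier
```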
The next step is micro-convergence (aka “Dockerization”)
One of the best characteristics of hyper-converged infrastructures is their VM-centric view of the world. But here we are talking about an application- and data-centric world; in this case the VM only limits agility and manageability because of all the redundant parts it is made of.
If this big data cluster becomes the center of many different kinds of workloads, used by various departments of the same organization, then containers will be the best way to go.
You could instantiate different kinds of jobs (CPU- or RAM-bound), manage peaks, multi-tenancy, different applications and so on. It’s just a matter of reconfiguring the cluster for that particular application or specific temporary workload. And thanks to container characteristics, it can be done very quickly.
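A minimal sketch of what that reconfiguration could look like, using the Docker SDK for Python (the image names and resource limits are hypothetical, and a running Docker daemon is assumed):

```python
# Sketch: launching resource-shaped containers for different job profiles.
# Image names and limits are hypothetical; requires the `docker` Python SDK.
import docker

client = docker.from_env()

JOB_PROFILES = {
    # CPU-bound analytics job: many cores, modest memory.
    "cpu_bound": {"image": "example/analytics-job:latest",
                  "cpu_quota": 800_000, "mem_limit": "8g"},
    # RAM-bound in-memory job: few cores, lots of memory.
    "ram_bound": {"image": "example/inmemory-job:latest",
                  "cpu_quota": 200_000, "mem_limit": "64g"},
}

def launch(profile_name, command):
    p = JOB_PROFILES[profile_name]
    return client.containers.run(
        p["image"], command,
        cpu_period=100_000,          # together with cpu_quota, caps the CPU share
        cpu_quota=p["cpu_quota"],
        mem_limit=p["mem_limit"],
        detach=True)

container = launch("cpu_bound", "run-my-job --input /data")
print(container.id)
```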
Closing the circle
We are far from achieving these results and I’m merely trying to connect all the dots after seeing what is happening all around.
If different resources grow differently over time, splitting them into separate pools is the best way to lower costs and carve the maximum out of them. At the same time, the most interesting part of hyper-convergence is ease of use and management… and this will be the most challenging aspect of the problem.
Technologies such as Diablo’s MCS could be great enablers for building more agile hardware underneath different types of containers, created to meet specific needs. APIs are (or will be) available, but most of these components are still immature and others are still in the early stages of development. Smart caching techniques and file-system abstraction, for example, would be important elements of this type of design (especially if you’ve already separated and abstracted other components).
Do you think I’m crazy? I’m not. During Storage Field Day 6, Andy Warfield (CTO of Coho Data) spent some time talking about similar concepts, and they already have a PoC showing that their storage can be extended into a Hadoop cluster through containers. Here’s the video:
Disclaimer: I was invited to the meeting with Diablo Technologies by Condor Consulting Group and they paid for travel and accommodation, I have not been compensated for my time and am not obliged to blog. Furthermore, the content is not reviewed, approved or published by any other person than the Juku team.