A couple of weeks ago, during Storage Field Day 5, I had the opportunity to learn more about interesting “new” technologies set to have a tremendous impact on the development of future infrastructures: Flash and RAM.
I’m not kidding! It might sound strange, but it’s not about the technology itself; it’s about how it can now be used to build next-generation infrastructures.

Yesterday

In legacy environments, scale-up architectures were the standard: to obtain performance, you had to add resources (CPU, RAM, disks, etc.) to the same box. The problem is that once you have filled up the box (a server, a storage system or whatever), you need a bigger one.
Installing a bigger box has its own problems and costs: complex migrations, forklift upgrades, downtime, and so on. They all combine to make it a pain in the neck.

Today

Recently, with the advent of virtualization, the web and modern software, it has become possible to build efficient scale-out systems, where performance and scalability are obtained by adding more similar nodes to a cluster. Each node contributes its own resources to the cluster and, all together, they add up to a bigger system.
The advantages over a scale-up architecture are many: not only scalability and performance, but also costs, availability and resiliency are better. Migrations, forklift upgrades and downtime are all things of the past.
This is why Web 2.0 infrastructures are all scale-out.

Scale-out is not perfect though. If you look in depth at the technology and at how each node is organized, you can easily find flaws that can compromise scalability, especially when the software isn’t smart enough to manage hardware deficiencies. In particular, most of these scale-out architectures are not well balanced.
I’m not talking about specialized infrastructures, like a web farm or a storage system specifically designed from the ground up to be scale-out. I’m talking about general-purpose systems, like the new hyper-converged solutions we like so much and that are beginning to show up in data centers.

These systems work like a charm, both in terms of efficiency and performance, when the numbers are average; but when specific resources are stressed, they have to trade efficiency for performance, or vice versa, to guarantee an acceptable response time to applications.

Unbalanced

I think it’s quite easy to see why this happens: it all comes down to the fact that modern cluster nodes are unbalanced.
In a single node, if you look outside the CPU, you have RAM, the network and local storage. Thinking about the latency of these components, you can easily understand where the bottlenecks are:

– RAM: nanoseconds (highly predictable)
– 10GbE network: 4 to 20 µs (predictable with the right networking equipment)
– Flash storage (PCIe flash): 15 to 100+ µs (depends on the read/write mix and is not highly predictable, e.g. because of garbage collection)
– Disk (10K RPM drive): 4–7 ms (poorly predictable)

Do you see that? Accessing RAM on a remote node over a network connection can be faster than accessing local storage, especially when the storage is stressed by certain types of workloads. (It should be the other way around…)
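To put some purely illustrative numbers behind that claim, here is a minimal back-of-the-envelope sketch in Python. The figures are simply taken from the ranges listed above, not from any benchmark, and the remote-RAM model (one network round trip plus a DRAM access) is an assumption made for the sake of the example:

```python
# Back-of-the-envelope latency comparison (all values in microseconds).
# These figures are illustrative assumptions taken from the ranges above,
# not measurements.

RAM_LOCAL_US   = 0.1            # ~100 ns local DRAM access
NETWORK_RTT_US = (4, 20)        # 10GbE round trip on a well-tuned network
FLASH_PCIE_US  = (15, 100)      # PCIe flash, workload/GC dependent
DISK_10K_US    = (4000, 7000)   # 10K RPM disk: 4-7 ms

def remote_ram_read_us(rtt_us: float) -> float:
    """Model a remote RAM read as one network round trip plus a DRAM access."""
    return rtt_us + RAM_LOCAL_US

best  = remote_ram_read_us(NETWORK_RTT_US[0])   # ~4.1 µs
worst = remote_ram_read_us(NETWORK_RTT_US[1])   # ~20.1 µs

print(f"Remote RAM over 10GbE: {best:.1f}-{worst:.1f} µs")
print(f"Local PCIe flash:      {FLASH_PCIE_US[0]}-{FLASH_PCIE_US[1]}+ µs")
print(f"Local 10K RPM disk:    {DISK_10K_US[0]}-{DISK_10K_US[1]} µs")

# Even the worst-case remote RAM read (~20 µs) is below a stressed local
# flash read (100+ µs) and two orders of magnitude below a disk access.
```

The exact numbers will obviously vary with the hardware, but the ordering is what matters: a network hop to another node’s RAM can land below local flash under stress, and far below spinning disk.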

Tomorrow

As I mentioned at the beginning of this post, my thoughts come from a series of briefings with SanDisk, Diablo Technologies, PernixData and Atlantis Computing (the latter was not at SFD5). In one way or another, they are all talking about better storage latency and predictability… and to me, in this context, that means more balanced and efficient nodes.

In particular, SanDisk’s ULLtraDIMM has a very consistent latency, usually under 7 µs(!), and the brilliant design of the product makes garbage collection a secondary problem.
At the same time, I was impressed by the new features of PernixData’s FVP (which is now also able to use RAM for caching).
Furthermore, Atlantis has proved that its in-memory storage technology can dramatically improve storage performance even when installed on top of a VSAN!

Dropping storage latency and improving its predictability re-establishes the right model, where the closer the data is to the CPU, the faster it is to access. This, in turn, leads to more balanced nodes and overall better (and more balanced) scale-out infrastructures.

Why it is important

Leveraging RAM and next-generation Flash is the key to obtaining the right level of performance and latency from this type of resource.
Hyper-converged systems are potentially disruptive but, often, true scalability without compromise exists more on paper than in reality. In fact, the inefficiency of a single resource can severely compromise the efficiency of an entire cluster.

Disclaimer: I was invited to this meeting by Gestalt IT and they paid for travel and accommodation, I have not been compensated for my time and am not obliged to blog. Furthermore, the content is not reviewed, approved or published by any other person than the Juku team.