I came across two interesting news items this week – OpenIO introducing a 96-HDD appliance for its object storage platform and Western Digital launching 12 and 14TB disks!

At first glance, if you put the two together it sounds crazy: just think of a single 96-slot appliance full of 14TB disks. That's roughly 1.3PB in a 4U box, or about 13PB in a datacenter rack. Again, it sounds crazy, but in reality it's quite different, and it is absolutely brilliant!

Is a 14TB HDD too big?

14TB is a lot (the 12TB model is based on PMR technology, while the 14TB is based on SMR); and as far as I know, HDD vendors are expecting to release 20 and 25TB HDDs in the not-too-distant future (but I must also admit that some are skeptical about this roadmap).

No matter what the future holds for us, 14TB is a lot for a 3.5″ HDD, and it's quite unmanageable with traditional storage architectures. RAID makes no sense at all (whether it's single, dual or even triple parity!): losing a 14TB disk could easily become a nightmare, with very long rebuilds impacting performance the whole time (and that's without taking into account that triple-parity RAID is terrible performance-wise).
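
To put the rebuild problem into numbers, here is a rough back-of-the-envelope sketch (the throughput figures are my own assumptions, not vendor data): a classic RAID rebuild has to write the entire replacement disk through a single spindle, so even the best case takes the better part of a day, and far longer on a loaded array.

```python
# Back-of-the-envelope RAID rebuild estimate (assumed throughput figures, not vendor data).
# A traditional rebuild has to write the whole replacement disk, so the best case is
# bounded by the sustained throughput of a single spindle.

def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Best-case single-disk rebuild time in hours."""
    capacity_mb = capacity_tb * 1_000_000  # TB -> MB (decimal units, as HDD vendors count)
    return capacity_mb / throughput_mb_s / 3600

print(f"14TB, idle array (~180 MB/s): {rebuild_hours(14, 180):.0f} h")   # ~22 h
print(f"14TB, loaded array (~90 MB/s): {rebuild_hours(14, 90):.0f} h")   # ~43 h
```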

Distributed RAID mechanisms or, better yet, erasure coding could be a solution. Blocks are distributed across a very large number of disks and, thanks to an N:N rebuilding mechanism, the impact is limited… but how many disks can you fit in a single system? (For example, IIRC an HPE 3PAR 28000 can have 1920 disks max, but I'm pretty sure this number could be halved for 3.5″ drives… and I'm not sure you'd buy such a powerful, expensive array just for the capacity!)
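
To make the trade-off concrete, here is a minimal sketch assuming a hypothetical k+m erasure-coding layout and some made-up per-disk rebuild throughput; the point is that the capacity overhead stays modest while the rebuild work gets spread over many disks.

```python
# Minimal erasure-coding sketch (hypothetical k+m layout and throughput figures).
# With k data fragments plus m parity fragments, a lost disk is reconstructed by
# many peers reading in parallel instead of a single spare doing all the work.

def ec_overhead(k: int, m: int) -> float:
    """Raw-to-usable capacity ratio for a k+m erasure-coding scheme."""
    return (k + m) / k

def parallel_rebuild_hours(capacity_tb: float, peers: int, per_disk_mb_s: float) -> float:
    """Rough rebuild time when `peers` disks each contribute a share of the reads."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / (peers * per_disk_mb_s) / 3600

print(f"10+4 overhead: {ec_overhead(10, 4):.1f}x raw per usable TB")                    # 1.4x
print(f"14TB, 50 peers at 50 MB/s each: {parallel_rebuild_hours(14, 50, 50):.1f} h")    # ~1.6 h
```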

Go Scale-out then!

Let's think scale-out then. Easier and cheaper, right? Well… maybe!

Since you can't think of a 12/14TB HDD as a performance device, the lowest $/GB is most likely what you are aiming for. And how many disks can you fit in a modern storage server? Between 60 and 90, depending on the design compromises you are willing to accept. But hey! We're talking about something between 840 and 1260TB in 4U; that's absolutely huge!

Huge, in this case, is also a synonym for issues. You solve the problem of a single disk failure, but what happens if one of these servers stops? That could easily become a major nightmare! In fact, this solution is unfeasible for small clusters, and in this case small refers only to the number of nodes, not to capacity. 10 nodes in 1 rack equals roughly 12PB of storage. That's raw capacity, but even if we account for a 40% capacity loss for data protection, we are still in the range of 8PB! Losing a node in this scenario means losing 1/10th of that: about 800TB! Think about rebuilding data, metadata and hash tables for all of that: what will it take to get your cluster back to full speed? It is true that some storage systems are more clever than others and can rebuild quickly, but it's still a massive job…
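
For a sense of scale, here is a rough sketch of what re-protecting a whole node means, assuming roughly 800TB per node and a made-up share of cluster bandwidth dedicated to rebuild traffic (both numbers are mine, purely illustrative).

```python
# Rough estimate of re-protecting a failed dense node (illustrative numbers only).
# The surviving nodes must re-replicate or re-encode everything the dead node held,
# and the bandwidth the cluster can spare for rebuilds becomes the bottleneck.

def node_rebuild_days(node_tb: float, rebuild_gbit_s: float) -> float:
    """Days needed to move `node_tb` of data at `rebuild_gbit_s` aggregate throughput."""
    bytes_total = node_tb * 1e12
    bytes_per_s = rebuild_gbit_s * 1e9 / 8
    return bytes_total / bytes_per_s / 86_400

print(f"800TB node, 10 Gbit/s spare for rebuild: {node_rebuild_days(800, 10):.1f} days")  # ~7.4 days
print(f"800TB node, 40 Gbit/s spare for rebuild: {node_rebuild_days(800, 40):.1f} days")  # ~1.9 days
```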

A simple workaround exists, of course, but it doesn't make any sense from the $/GB perspective. Putting fewer disks on more nodes is easy, but it simply means more CPUs, servers, datacenter footprint and power… hence a higher $/GB.

Making nonsense work

Even taking the ability to scale for granted (and I know that's not always the case), a larger number of nodes introduces a lot of issues and higher costs. More of everything: servers, cables, network equipment, time and so on. In one word: complexity. And, again, not all scale-out storage systems are easy to manage, with easy-to-use GUIs and so on.

I think that OpenIO, with its SLS, has found the right solution. Their box is particularly dense (96 3.5″ HDDs or SSDs!), but the box is the least interesting piece of the solution. In fact, density is just a (positive) consequence.

You can think of SLS as a complete scale-out cluster-in-a-box. Each one of the 96 slots can host a nano-node: a very small card with the hard disk on the back, equipped with a dual-core ARMv8 CPU, RAM, flash memory and two 2.5Gb/s Ethernet links. The front-end connector, very similar to what you usually find on a SAS drive, plugs directly into the SLS chassis, just as a normal hard disk does in a JBOD.

All the 96×2 Ethernet links are connected internally to two high-speed 40Gb/s Ethernet switches. The switches expose 6 ports that can be used for back-to-back expansion of the chassis or for external connectivity.
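
A quick sanity check on that fabric, using the figures above (and assuming the six 40Gb/s ports are the total across the two switches):

```python
# Quick bandwidth arithmetic for the SLS chassis, based on the figures quoted above
# (assuming the 6 x 40Gb/s ports are the total external capacity of the two switches).

internal_gbps = 96 * 2 * 2.5   # 96 nano-nodes, two 2.5Gb/s links each -> 480 Gb/s
external_gbps = 6 * 40         # six 40Gb/s ports for expansion/uplink -> 240 Gb/s

print(f"internal: {internal_gbps:.0f} Gb/s, external: {external_gbps} Gb/s")
print(f"oversubscription: {internal_gbps / external_gbps:.0f}:1")  # 2:1
```

A 2:1 ratio is not unusual for capacity-oriented storage, where rebuild and rebalancing traffic can stay inside the chassis.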

The failure domain is one disk, which equals one node. And hardware maintenance can become lazier than before: you can afford to lose quite a few disks before going into the datacenter and swapping all of them in a single (monthly?) operation.
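
How lazy can maintenance actually get? A quick expectation sketch, assuming a made-up 2% annualized failure rate per drive:

```python
# Expected nano-node failures per chassis per month, with an assumed 2% AFR.
# Real AFRs vary by drive model, age and workload; this is only an illustration.

disks_per_chassis = 96
afr = 0.02  # assumed annualized failure rate per disk

failures_per_month = disks_per_chassis * afr / 12
print(f"~{failures_per_month:.2f} expected failures per chassis per month")  # ~0.16
```

At that rate, even a quarterly swap visit would only find a handful of dead nano-nodes, as long as the data they held has already been re-protected elsewhere in the meantime.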

Not that this is a new idea. I heard about it for the first time years ago, and projects like Kinetic are going in the same direction, not to mention that a Ceph-based cluster was built on ARM not so long ago. This one is just more polished and refined. A product that makes a lot of sense nonetheless, and has a lot of potential. And, truth be told, the hardware components are designed by Marvell. But, again, it's still the software that does all the magic!

OpenIO's object storage platform, SDS, has a very lightweight backend, allowing it to run smoothly on a small ARM-based device. What's more, SDS has some unique characteristics when it comes to load balancing and data placement that make it scalable and perfectly suited for this kind of infrastructure. A nice web GUI and a lot of automation in cluster management are the other key components to get it right. They briefed me a couple of weeks ago and were able to get a new 8TB nano-node up and running in less than a minute, with no intervention other than the disk swap! (And from what I can see of their internal design, 12 or 14TB disks won't change much.)

They claim 0.008/GB/month (for a 96×8TB SLS-4U96 box with a 3-year support contract), and I think that is incredibly low.
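
Just to put that figure in context, here is the raw-capacity arithmetic for a single box (my own calculation; usable capacity after data protection will be lower, so the effective rate per usable GB is higher):

```python
# What 0.008/GB/month means for one 96 x 8TB box, on raw capacity (illustrative arithmetic).

raw_gb = 96 * 8 * 1000          # 96 nano-nodes x 8TB, in decimal GB -> 768,000 GB
rate_per_gb_month = 0.008

print(f"raw capacity     : {raw_gb:,} GB")
print(f"per month (raw)  : {raw_gb * rate_per_gb_month:,.0f}")        # ~6,144
print(f"over 36 months   : {raw_gb * rate_per_gb_month * 36:,.0f}")   # ~221,184
```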

I didn't get the chance to ask about performance figures, but the first customers have already received their SLS-4U96 boxes, and in January I'll be able to meet one of them. I can't wait!

Closing the circle

No matter what you think, HDDs and SSDs are going to have similar problems: large capacities are challenging. We are talking about 14TB HDDs now, but vendors are already talking about future 50 and 100TB SSDs, with 32TB models already available! They sound big today, but you'll be storing much more in the future…

Data is growing, and you have to think about putting it somewhere at the lowest possible cost. The problem is that a lower $/GB usually comes at the price of durability, resiliency and availability, and you want it cheap, but not really, really cold! We like it cold-ish or, better, warm-ish, and with the growing number of use cases for object storage in the enterprise (backup repositories, storage consolidation, collaboration, big data lakes and so on), you absolutely need something that can give you the best $/GB, but without too many compromises or risks.

[Disclaimer: OpenIO is a client of Juku consulting]