In recent chats with end users and vendors, I noticed a common pattern that made me think about Big Data analytics and how data is collected, organized and analyzed in many organizations. It is also, I think, part of the explanation for the slow growth of some Big Data companies and the slower-than-expected ROI of some Big Data investments. I eventually discovered some interesting things that I’d like to share with you in this post.

The rise of data ponds (and broom closet IT)

Instead of building a single huge data lake, many organizations, or departments/organizational units, prefer to build their own “data ponds” (I know, “data pond” is my invention, but it gives the right idea of what I mean, doesn’t it?).
 
This is done for several reasons but, on top of the list, there are always two major factors:
 
1. Having more control over data and infrastructure. The perception is that, by having direct control over a small, fully-owned infrastructure (usually in the range of two or three to eight nodes), everything is much easier to manage and control.
Procurement is easier, IT operations are easier, and only one or a few applications run at the same time. In some cases these are cloud-based infrastructures.
2. Supposed flexibility and efficiency. The perception is that this kind of approach is much more efficient and brings better results in the short term. There is little or no resource contention, for example, and developers are free to pick the tools/platforms they like best without any additional consideration.
Also, system management is much “easier”, especially if you consider that most of these clusters are out of enterprise IT control and, sometimes, are deployed in closets (“the small computer room at the end of the hallway”… so I’ve been told 😉 )

Data ponds could become data swamps

As you can imagine, this approach poses a series of major problems. Weak security is just the first that comes to mind, system administration left to external consultants without rules comes immediately after, and that is without considering the risks associated with data management (starting from backups, for example). Just a nightmare for the IT department… and sooner or later the IT department will get involved (and it’ll be too late by then).
 
Other problems concern the actual efficiency of the data pond. Isolating data sets is not the best choice you can make: soon you’ll need data that isn’t present in your infrastructure, or the pond will simply become too small for your needs… You know, the “broom closet” is what it is, and you can’t add more compute resources to it.

Pond consolidation is the first step

Building a single data lake should be the answer, but it’s not that easy… Consolidating data and compute with conventional Big Data tools has some major issues, but it’s clear that ponds are not sustainable in the long term.
From the infrastructure standpoint, there are some interesting things to consider:
 

  • Traditional Big Data infrastructure has to be revised: HDFS doesn’t work for everything. It was designed for large sequential workloads, batch jobs and MapReduce. Now we also have the ELK stack, Spark and many other tools working on the same infrastructure, and workload concurrency can result in nasty behavior (I/O blending, especially on hard-disk-based infrastructures, is still a huge problem). It’s not surprising that companies like Cloudera are working on more flexible storage solutions, is it?
  • Virtualization: virtualization is usually not considered a good fit for Big Data workloads, again because of problems with I/O blending and throughput. Most virtualized architectures for Big Data fail because they try to solve the storage cost problem by installing HDDs on local nodes… and just running a couple of different VMs/workloads on the same node makes performance fall apart.
  • Orchestration and resource contention: different apps generate different workloads, and you need to be very flexible in the way the cluster is managed. New apps must start in seconds, do their job, and quickly release unused resources to others. Resource contention can be a huge problem (one of the main reasons we find these small clusters in the field, actually). It’s not difficult to find batch and real-time jobs running simultaneously (for example, an end-of-the-week batch job running while you need a specific report for something else; see the sketch after this list).
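
To make that last point more concrete, here is a minimal sketch of a job that claims resources only while it runs, assuming a YARN-managed cluster with Spark dynamic allocation; the queue name and dataset path are made up for illustration:

```python
# Minimal sketch (assumption: YARN-managed cluster, Spark with dynamic
# allocation). The "reports" queue and the dataset path are hypothetical.
# The job grabs executors while it runs and hands idle ones back to other tenants.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("weekly-report")
    .config("spark.yarn.queue", "reports")                         # dedicated queue for interactive work
    .config("spark.dynamicAllocation.enabled", "true")             # scale executors with the workload
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # release executors that sit idle
    .config("spark.shuffle.service.enabled", "true")               # required for dynamic allocation on YARN
    .getOrCreate()
)

# Run the report, then give everything back to the cluster.
spark.read.parquet("hdfs:///warehouse/sales").groupBy("region").count().show()
spark.stop()
```

With settings like these, an interactive report can start next to a long batch job and hand its executors back as soon as it goes idle, which is exactly the behavior a consolidated cluster needs.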

Looking at possible solutions

Once again, looking at the problem from the infrastructure point of view, we have to find solutions that allow data to be consolidated while enabling several workloads to run at the same time and keeping users happy… and I believe it’s not as hard as you might think.
 
Storage consolidation is probably one of the easiest parts, and likely the first step… but it can’t be done on an HDFS/HDD-based cluster. Leaving HDFS as the access layer, so that nothing has to be changed in the upper stack, is still the best solution however. And now there are solutions that can be really effective. Some of them, like Caringo and Cloudian, have the right functionality but may lack the necessary performance to cover all the use cases. Others, like Coho Data and EMC Isilon, could be a really interesting compromise, both in terms of scalability and performance. Thanks to flash memory, hard disks and high-speed connections, it is possible to cover large capacity and many different workloads at the same time.
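
As a small illustration of how little the upper stack has to change, here is a hedged PySpark sketch assuming the consolidated store exposes an S3-compatible endpoint through the s3a connector (hadoop-aws on the classpath); host names, bucket and paths are hypothetical:

```python
# Sketch (assumption: the consolidated store exposes an S3-compatible endpoint
# and the s3a connector is available). Host names, bucket and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("consolidated-storage-demo")
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.local")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Before consolidation: a data pond on its own small HDFS/HDD cluster...
events_pond = spark.read.json("hdfs://pond-namenode:8020/marketing/events")

# ...after consolidation: the same code, pointed at the shared repository.
events_lake = spark.read.json("s3a://datalake/marketing/events")

events_lake.groupBy("campaign").count().show()
```

The application logic stays the same; only the URI and a couple of connector settings change, which is what makes this kind of consolidation relatively painless for the teams that own the analytics code.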
 
Furthermore, by leveraging containers instead of VMs, it is possible to deploy consistent application frameworks almost instantly. And, with the right integration with tools like OpenStack or Kubernetes, it isn’t difficult to control the whole cluster. This makes it possible to run many different data analytics tools and applications at the same time against the same storage infrastructure… even if these apps are developed on top of different Hadoop distributions, for example.
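
As one hedged example of what this can look like, here is a minimal sketch using Spark’s native Kubernetes support: the driver asks Kubernetes to spin up containerized executors, which appear in seconds and disappear when the job stops. The API server address, namespace and container image are hypothetical, and client-mode networking details are omitted:

```python
# Sketch (assumption: a Spark build with Kubernetes support, run in client
# mode from inside the cluster). Endpoint, namespace and image are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://k8s-apiserver.example.local:6443")
    .appName("containerized-analytics")
    .config("spark.kubernetes.namespace", "analytics")
    .config("spark.kubernetes.container.image", "registry.example.local/spark-py:latest")
    .config("spark.executor.instances", "4")   # executors start as pods in seconds
    .getOrCreate()
)

# The same shared storage serves every framework, whatever image it ships in.
print(spark.read.parquet("s3a://datalake/marketing/events").count())

spark.stop()   # executor pods are torn down and resources go back to the cluster
```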
 
Some storage vendors are going even further and are now capable of running these workloads (or part of them) directly on the storage system (Coho and Intel already have a joint end-to-end reference architecture for this, and Cohesity demonstrated similar functionality during one of the latest Storage Field Days). These kinds of features lead to a new class of storage systems that are much smarter and able to run specific applications in a purpose-built hyper-converged fashion.

Closing the circle

Building data lakes is not easy, but it’s not technology’s fault. With the right architecture it is possible to consolidate data in a single large repository and run many different workloads and application stacks at the same time, while giving end users the best possible experience.
 
At the same time, building a data lake through data consolidation (or “data pond” consolidation) without designing the right compute infrastructure will lead to all the inefficiencies discussed above.
 
There are only a few vendors able to provide this kind of architecture today; on my short list I see Coho and HDS (with its HSP) leading the pack, with solutions already deployed at actual end users… have I forgotten to mention anyone here?