This is the second article in a short series dedicated to data-aware storage, sponsored by DataGravity (you can find the first one here).
Data-awareness is a new concept and, as such, it can be misinterpreted or mean different things to different end users and vendors. It unleashes the power of data analytics applied to primary (active) data, but depending on the implementation it can produce very different results and serve very different use cases.
In order to better understand what data-awareness really means, we can divide modern primary storage systems into three categories according to their analytics capabilities:
Infrastructure analytics
The storage system has the capability to collect sensor data and logs and send them to an external analytics system (usually in the cloud). This data stream can be analyzed to obtain plenty of information about the state of a single storage array and compare it to all the others deployed in the field.
If the analytics engine is well implemented, the system understands application workloads, can give insights into real system usage and potential problems, and can automatically open support calls when needed. This approach is very useful for keeping full control over the infrastructure and it can also drive down TCO. On the other hand, the information collected is not detailed enough to dig into data content: most of the activity is confined to the data container (for example, a data volume or a virtual machine).
The local resources needed to implement this type of feature are fairly limited: the analytics are usually done externally, so there is no impact on performance whatsoever.
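To make the mechanism concrete, here is a minimal sketch of what array-side telemetry collection could look like: health counters are sampled on the controller, batched, and shipped to an external analytics service. The endpoint URL, payload fields and counter values are all hypothetical, not any vendor's actual API.

```python
import json
import time
import urllib.request

# Hypothetical cloud analytics endpoint; a real array would use the vendor's service.
ANALYTICS_ENDPOINT = "https://analytics.example.com/v1/telemetry"


def collect_sample(array_id):
    """Gather one point-in-time snapshot of array health counters (illustrative values)."""
    return {
        "array_id": array_id,
        "timestamp": time.time(),
        "iops": 18500,               # in a real system these come from the
        "latency_ms": 0.9,           # controller's internal counters
        "capacity_used_pct": 71.4,
        "open_alerts": 0,
    }


def ship_batch(samples):
    """Send a batch of samples to the external analytics service."""
    body = json.dumps({"samples": samples}).encode("utf-8")
    req = urllib.request.Request(
        ANALYTICS_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    batch = [collect_sample("array-01") for _ in range(3)]
    print(f"collected {len(batch)} samples")
    # ship_batch(batch)  # requires a reachable analytics endpoint
```

The heavy lifting (trend analysis, fleet-wide comparison, predictive support) happens on the receiving side, which is why the footprint on the array itself stays so small.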
The list of vendors implementing this type of functionality on their storage arrays is getting very long: Nimble Storage InfoSight was the first, but similar products are now available from Pure Storage, SolidFire and many others.
Metadata analytics
In this case, depending on the implementation, additional metadata is created and maintained either by the storage array or externally. Rich metadata can be created during the ingestion process (as happens in object storage systems with file indexing) or embedded in the file system itself (with much richer information about files and their utilization over time).
Use cases vary, but this type of indexing and metadata management enables strong search capabilities, which are very helpful for managing large repositories (in the order of several petabytes) or for data discovery.
Infrastructure TCO is still the main target, but data discovery and more sophisticated activities for specific applications in vertical markets are becoming more common. In the latter case, it's important to note that pre-packaged solutions are very rare: APIs and third-party visualization tools are still the most common way to interact with these kinds of systems. Examples of large scale-out file-based solutions that have implemented this type of functionality for vertical applications include Qumulo and Caringo FileFly with Kibana integration.
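As a rough illustration of metadata enrichment at ingest time, the sketch below records descriptive metadata for every file entering a repository and answers simple search queries against that index without touching the data itself. The in-memory dictionary, field names and the ingest/search helpers are assumptions made for the example, not a description of any specific product.

```python
import time
from pathlib import Path

# Stand-in for whatever metadata index the storage system maintains.
metadata_index = {}  # path -> metadata record


def ingest(path: Path, owner: str):
    """Record rich metadata for a file as it enters the repository."""
    st = path.stat()
    metadata_index[str(path)] = {
        "owner": owner,
        "size_bytes": st.st_size,
        "modified": st.st_mtime,
        "extension": path.suffix.lower(),
        "ingested_at": time.time(),
    }


def search(**filters):
    """Return paths whose metadata matches all given key/value filters."""
    return [
        path for path, meta in metadata_index.items()
        if all(meta.get(key) == value for key, value in filters.items())
    ]


if __name__ == "__main__":
    for f in Path(".").iterdir():
        if f.is_file():
            ingest(f, owner="alice")
    print(search(extension=".py", owner="alice"))
```

The point is that queries run against the metadata layer, which is why searches over multi-petabyte repositories remain practical even when the underlying data is never scanned again.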
Full Data-awareness
This is the most advanced solution and its scope reaches well beyond traditional infrastructure TCO, helping to take full advantage of stored data by turning it into an information asset. The storage array allocates specific resources to a full-featured analytics engine that provides detailed information about file contents and related workloads. The engine has to be embedded in the system to capture logs, sensor data and real data activity as well. Maintaining a separate metadata-enriched copy of the data and its associated workload allows for a better understanding of what is happening to the storage system as well as to the data saved in it.
This is the only way to analyze the array and its content simultaneously. Complex queries can be run without impacting production, while advanced data discovery and full content search represent two other key aspects. Depending on the implementation, this type of analytics can be leveraged on the contents of file shares (NAS volumes), as well as on file systems embedded in block-based volumes and virtual machines, for maximum granularity. Furthermore, thanks to integration with Active Directory and other user authentication services, users and groups can be matched to activity to gain a deeper understanding of user behavior and data access patterns.
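The following sketch shows, in a very simplified form, how content classification and per-user access events could be correlated: group membership stands in for what a real Active Directory lookup would provide, and the content tags, paths, users and the policy rule are all hypothetical examples rather than any vendor's actual engine.

```python
from collections import Counter, defaultdict

# Content tags produced by a hypothetical content-indexing pass.
content_tags = {
    "/finance/q3-forecast.xlsx": {"spreadsheet", "financial"},
    "/hr/salaries.csv": {"pii", "financial"},
    "/eng/design.md": {"document"},
}

# Directory-service view: user -> groups (normally resolved via AD/LDAP).
user_groups = {
    "alice": {"engineering"},
    "bob": {"finance"},
}

# Access events captured by the storage system: (user, path, operation).
access_log = [
    ("alice", "/hr/salaries.csv", "read"),
    ("bob", "/finance/q3-forecast.xlsx", "read"),
    ("alice", "/eng/design.md", "write"),
]


def summarize(events):
    """Count accesses per user and per group, flagging PII reads by users outside HR."""
    per_user = Counter()
    per_group = defaultdict(Counter)
    flagged = []
    for user, path, op in events:
        tags = content_tags.get(path, set())
        groups = user_groups.get(user, set())
        per_user[user] += 1
        for group in groups:
            per_group[group][path] += 1
        if "pii" in tags and "hr" not in groups:
            flagged.append((user, path, op))
    return per_user, dict(per_group), flagged


if __name__ == "__main__":
    users, groups, alerts = summarize(access_log)
    print("accesses per user:", users)
    print("possible policy violations:", alerts)
```

Combining the "what is in the file" view with the "who touched it" view is what distinguishes this category from plain infrastructure or metadata analytics.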
Reducing TCO is certainly a big benefit of this approach, but data-aware storage can also be thought of as a sort of auto-generated data lake, especially for SMB organizations: all data can be searched and analyzed to exploit its hidden value and uncover potential risks. This infrastructure component can automate and offload many tasks from different business units while improving processes and reducing costs.
Potential use cases range from advanced data recovery and discovery to auditing, security and compliance.
At the moment, the most advanced solution in this space is offered by DataGravity. Other solutions, like Cohesity for example, are focused on secondary storage, and their architecture does not guarantee consistent performance for primary data.