Gold digging: thinking about dark data and metadata

Last week I wrote an article about the amount of data in medium-sized enterprises, and I finished it wondering about the real value of stored data.
This time I would like to write down some ideas that have matured over the last few months as a consequence of my interest in object storage.

Let’s take for granted that most enterprise data are copies, which means pure cost. Other data have a hidden value. All these data, stored for many different reasons, stay there because of compliance or just because someone lost control of them (but no one dares to delete them or really knows if they are still important!)… Gartner has a name for that: Dark Data.

The basic ideas I’m going to present here could make the migration to object storage much more interesting and, probably, justifiable.

Why migrate to Object stores

Why should you migrate from a traditional file system, or NAS, to object storage?
There are indeed many reasons to do so, but they are all questionable.

Scalability? Yes, an object storage system can manage trillions of files, but only a few companies in the world have trillions of files (or hundreds of petabytes) to manage, don’t they?

RAS features (reliability, availability, serviceability)? Yes, but most of these features are also available with the major NAS platforms out there, aren’t they?

Better data protection mechanisms? Again, yes, but snapshots, replicas and integration with backup systems are all in place and working fine, aren’t they?

And it’s not even because you can have a single consolidated platform for all your unstructured data, with all the features described above… and more (even if this can dramatically drive down TCO).

So, why should you migrate to objects? Easy! For a much simpler reason: it’s because of the magic of metadata!

Metadata?

Metadata are data that describe data. This is the shortest possible definition, but it conveys the idea and the power of this kind of information.
Almost all object storage vendors are working on different ways to expand the metadata capabilities of their products. This opens a world of new opportunities and potential benefits for those end users who will be able to take full advantage of it.

Take a look at your data

How many files is your organization managing? Hundreds of thousands? Millions? More?
Well, have you ever figured out the real value of those files? And how can you transform them into a valuable asset for the company’s business?

If you look at those files, I’m sure it will be very hard to find out their value because you have no clue what they contain.
Any single user or team in your organization has a very limited view, and even at the higher levels it is impossible to estimate what you really have and how to use it!

Some of those data are in some sort of collaboration/content management platform (SharePoint, for example), but that’s not enough.

Adding metadata tags to your files is the way to go, but how can you do that?

Indexing and Searching

If you think about traditional file systems, indexing/searching is the only option.
Is this a viable solution for the traditional enterprise? Well, it has its pros and cons, but let me say that this approach doesn’t solve the problem.
Some search engine platforms leverage very good technology, but their software and hardware have to be managed, maintained and integrated with the rest of the infrastructure.

It’s also anachronistic! Indexing and searching give their best when you have many different repositories but, if you are working to build your private cloud, you are going in the opposite direction.
Object storage, seen as one of the pillars of your private cloud strategy, is a single consolidated environment which offers various data services. Remote data repositories are only front-end gateways to that service.

Moreover, an object storage system has two important characteristics:
1) Data are not identified by their position. It means that you don’t have to maintain something that continues to re-index your file systems all the time!
2) Not all files/data are equally searchable, but you can also add metadata tags to them if you know what is important in them.
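To make the two characteristics above more concrete, here is a minimal in-memory sketch. `ToyObjectStore` and its methods are hypothetical names invented for this illustration, and deriving the ID from a content hash is just one possible choice (real products may use keys or UUIDs instead). The point is only that objects are addressed by ID rather than by path, and that tags can be attached and queried directly.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data, metadata=None):
        # The ID is derived from the content, not from a path or location.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, dict(metadata or {}))
        return object_id

    def get(self, object_id):
        return self._objects[object_id][0]

    def tag(self, object_id, **tags):
        # Metadata can be enriched later, once you know what matters.
        self._objects[object_id][1].update(tags)

    def find(self, **criteria):
        # Select objects by what they are, not by where they live.
        return [
            oid
            for oid, (_, meta) in self._objects.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]


store = ToyObjectStore()
contract = store.put(b"Supply agreement with ACME", {"department": "legal"})
photo = store.put(b"\x89PNG...", {"department": "marketing"})
store.tag(contract, doc_type="contract")

matches = store.find(department="legal", doc_type="contract")
print(matches == [contract])
```

Nothing here needs to crawl a directory tree: moving or renaming has no meaning for the object, so the metadata stays valid for as long as the object exists.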

With this in mind you can start thinking about a different and smarter approach.

Tag, tag, tag and tag!

Adding metadata tags when you use APIs is easy; doing it when you use a file interface is not that simple.
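As an illustration of the API side, the sketch below assumes an S3-style HTTP interface, where user-defined metadata travels as `x-amz-meta-*` headers on the PUT request. The endpoint URL and the helper names are made up for the example, the request is only built and never sent, and a real service would also require authentication.

```python
from urllib.request import Request

META_PREFIX = "x-amz-meta-"  # S3 convention for user-defined metadata headers


def metadata_headers(tags):
    """Turn a dict of tags into S3-style user-metadata headers."""
    return {META_PREFIX + key.lower(): str(value) for key, value in tags.items()}


def build_put_request(url, body, tags):
    # Builds the request only; nothing is sent over the network here.
    # A real S3-compatible endpoint would also require authentication headers.
    return Request(url, data=body, headers=metadata_headers(tags), method="PUT")


tags = {"Department": "finance", "fiscal-year": 2014}
headers = metadata_headers(tags)
request = build_put_request(
    "https://objects.example.com/reports/q1.pdf",  # hypothetical endpoint
    b"%PDF-1.4 ...",
    tags,
)
print(request.get_method(), request.full_url)
print(headers)
```

The tags ride along with the upload itself, which is exactly what a plain file copy through a NAS gateway cannot express.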

The classic process, when you add new files to an object storage system through a NAS gateway, is quite linear. The gateway adds only basic system metadata and whatever other information it needs to emulate the file system (this depends on the implementation), and nothing more.
The primary objective is to emulate a traditional file system, and as such you don’t theoretically need more information. The consequence, though, is that you don’t have a clue about the contents.
On the other hand, if the gateway were smarter, it could add the needed information about the content of the file. But that’s the tricky part.

Data should be intercepted during the ingestion process and scanned in depth to squeeze out all the necessary information, which should then be added to the object as metadata.
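A very rough sketch of that intercept, scan and tag flow might look like the following. Everything in it is an assumption made for illustration: the `scan` and `ingest` functions, the invoice number format, and the plain dictionary standing in for the object storage back end. A real gateway would need far smarter analysis than two regular expressions.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INVOICE = re.compile(r"\bINV-\d{4,}\b")  # hypothetical invoice number format


def scan(data):
    """Extract a few illustrative facts from the raw bytes of a text file."""
    text = data.decode("utf-8", errors="replace")
    return {
        "word_count": len(text.split()),
        "emails": sorted(set(EMAIL.findall(text))),
        "invoices": sorted(set(INVOICE.findall(text))),
    }


def ingest(path, data, store):
    """Intercept a file write, scan it, and store data plus metadata together."""
    # Basic system metadata, as a plain gateway would record it.
    metadata = {"original_path": path, "size": len(data)}
    # Content-derived metadata, which is the extra step discussed above.
    metadata.update(scan(data))
    object_id = hashlib.sha256(data).hexdigest()
    store[object_id] = {"data": data, "metadata": metadata}
    return object_id


store = {}  # stands in for the object storage back end
payload = b"Invoice INV-20140 sent to billing@example.com on Monday."
oid = ingest("/finance/march.txt", payload, store)
print(store[oid]["metadata"])
```

The user still just copies a file to a share; the enrichment happens behind the file interface, at write time.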

How hard is it?

It’s hard but not impossible.
You need software capable of analyzing different kinds of data and smart enough to understand the important information in them.
For example, you have boatloads of Word documents in your company, but the content of those files is quite different from department to department. So the software should be clever enough to understand patterns, categorize documents and find the few important things that can pull the value out of the file!
Another example could be images. Faces, bar codes, things? You name it. Image recognition software exists; the only problems are the price and the kind of recognition it is able to do.

From my point of view the toughest part is finding the right information: discerning between good and bad isn’t a simple job. My personal opinion is that all this renewed interest in Artificial Intelligence from Google, Facebook and the like is all about finding useful information in a sea of clutter…
In any case, maybe I’m digressing a little. Enterprise documents are not as varied as they could be, and I think that most of the knowledge developed with index/search engines could be reused to perform this task. Using a search/index engine to produce metadata could be a good idea after all…
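As a toy illustration of reusing indexing techniques to produce metadata, the sketch below picks the most frequent meaningful words in a document and proposes them as tags. Plain term frequency is a crude stand-in for what a real search engine does, and the stop-word list, the `top_terms` function and the sample text are all invented for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real engine would use a much larger one.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "on", "with", "be",
}


def top_terms(text, limit=3):
    """Return the most frequent non-trivial words, usable as metadata tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # Sort by frequency, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


document = (
    "The supplier contract renewal is due in March. "
    "The contract lists the supplier penalties and the renewal terms for the contract."
)
tags = top_terms(document)
print(tags)  # ['contract', 'renewal', 'supplier']
```

The difference from classic indexing is where the result ends up: instead of living in a separate index that must be kept in sync, the tags would be stored with the object itself.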

Bottom line

Creating valuable custom metadata on top of your files and storing everything together in an object storage system has a massive advantage when compared to traditional storage. It allows you to create new storage policies, find and reuse data, recycle information and build new value upon it.

It’s hard to do and, probably, not affordable for most enterprises. At the same time, the metadata creation process could be sold as a service by the NAS gateway vendor, especially when the object storage system is hosted in the cloud. (Metadata as a Service sounds good, doesn’t it? 😉 )

I know that some object storage vendors are doing research in this field, and basic custom metadata creation is already possible.
Now, it will be fun to see if and when someone manages to implement (or integrate) a mechanism similar to what I have very briefly described in this post.

1 Comment

jmartins on 25/03/2014 at 7:49 pm

Glad to read that this topic has remained top of mind in the industry. We’ve been talking about it since 2002 and still view it as one of the hottest greenfield opportunities of the coming decade.

By the way Enrico, for years we’ve referred to this as “Information Alchemy”. Companies shouldn’t dig for gold so much as they should find ways to turn information lead (the information that just sort of sits there as a cost center) into gold (profitable information).