This topic often comes up in conversation, usually by the customer’s coffee machine, but its implications are all too often underestimated.

what happens to your data

if you do not stop and think about what happens inside your storage, you may end up with a distorted view of your needs, and this can lead to regrettable effects: higher infrastructure expenditures (TCA and TCO), performance that falls short of needs/expectations, management difficulties and, last but not least, a bad reputation among users. I’d like to clarify what happens to data/information from the moment it is created onwards. I’m mainly thinking about unstructured data, i.e. files, but some of the following considerations apply just as well to databases (and object storage).

Of course, no two companies are alike and internal processes are not easily comparable, but after reading this article you will likely recognize the patterns, see whether they match any problem in your own reality and, what matters most, know how to cope with them. What follows is based on the published results of research performed a couple of years ago by an American university; the data have been merged and averaged with first-hand experience from the companies I work with.

a file is indeed many files

whenever a new file is written, even a simple Word document, a whole chain reaction takes place, resulting in several files being written! Temporary files, automatic backups, working copies, and so on. Today’s software is quite complex, and you might not be aware of what lies behind even a trivial operation, hidden by high-level features and multiple layers of complexity.
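
If you are curious to see this chain reaction with your own eyes, a quick way is to watch the folder where you save a document and print every file that appears or disappears around a save. Below is a minimal, polling-based sketch using only the Python standard library; the watched path is just a placeholder you would replace with your own documents folder.

```python
import os
import time

WATCH_DIR = "/path/to/your/documents"  # placeholder: folder where you save the document

def snapshot(path):
    """Return the set of file names currently present in the folder."""
    return set(os.listdir(path))

before = snapshot(WATCH_DIR)
print("Now save/edit a document in that folder; press Ctrl+C to stop.")
try:
    while True:
        time.sleep(1)
        after = snapshot(WATCH_DIR)
        for name in sorted(after - before):
            print("created:", name)   # temp files, lock files, backup copies show up here
        for name in sorted(before - after):
            print("deleted:", name)   # many of them vanish as soon as you save or close
        before = after
except KeyboardInterrupt:
    pass
```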

some data have a very short lifespan

Yes, some data often has a very short life: just think of the temporary backup files I mentioned above. Files of this kind live only until the application is closed; as soon as I save my data they are deleted.

This can mean a lifespan of just a handful of seconds, yet it still generates intense storage activity.

data lifespan is often longer than 24 hours

Most of what we create lives longer than one day. This implies that my storage must be consistent over time, not just a temporary place: I must have a way to back up the data created by my users and give them a way to recover any “live” file or piece of information that might have been deleted.

just 15% of deleted files are older than 24 hours

It might seem a puzzling figure, but there is a simple explanation: most deleted files are less than 24 hours old, and only a small share of the files that survive their first day will ever be deleted. In other words, if a file is not erased within one day of its creation, it will most likely never be deleted at all! This has a direct consequence: the space needed on my storage will keep growing. Most data has a long life, or rather an eternal one (paper archiving is a disappearing practice; documents are increasingly stored only in their electronic form).
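
To get a feel for what this means for capacity planning, here is a back-of-the-envelope projection. All the figures are invented for illustration; plug in your own creation and deletion rates.

```python
# Hypothetical figures, purely illustrative -- replace them with your own measurements.
daily_new_gb = 50.0          # data created per day
deleted_within_a_day = 0.30  # fraction removed in the first 24 hours (assumption)
deleted_later = 0.05         # fraction of the survivors ever deleted afterwards (assumption)

survivors_gb = daily_new_gb * (1 - deleted_within_a_day)
permanent_gb_per_day = survivors_gb * (1 - deleted_later)

print(f"~{permanent_gb_per_day:.1f} GB/day becomes effectively permanent")
print(f"~{permanent_gb_per_day * 365 / 1024:.2f} TB of growth per year")
```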

first life hours are the most intense

Whenever a new file is born, it is very likely to be edited, appended to and saved many times in its first few hours. Over the following days editing activity becomes more and more scarce. The net result is that high write and read performance must be guaranteed at the beginning, while later on pure performance numbers matter somewhat less.

what happens after 24 hours?

Here are some interesting data:

  • 65% of the files will be re-opened just once!
  • 94% of the files will be accessed again fewer than 5 times!

Needless to say, the figures differ from one environment to another, but they remain comparable. The bottom line is that we usually create many files, yet very few are ever re-used, and those only a handful of times.

Moreover, many of those files are surely near-duplicates of each other! The percentages above are worrying even for a single user; multiply them by the number of users in your company and you will start to glimpse the real size of the problem.

more than 80% of your data stands still

It is likely that 80% is an old and even conservative figure; the real share of data that stands still is higher and steadily rising. Another trend we must consider is the increasing (average) size of the documents we create. They quickly and easily settle into an eternal state of stillness while occupying precious, high-quality storage space.
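
If you want a rough idea of how much of your own data stands still, you can walk a file share and bucket bytes by time since the last modification. The sketch below uses only the Python standard library; the root path and the 90-day threshold are arbitrary placeholders, and mtime is only a crude proxy for “last use”.

```python
import os
import time

ROOT = "/path/to/your/share"   # placeholder: the file share to analyse
COLD_AFTER_DAYS = 90           # arbitrary threshold for "standing still"

now = time.time()
cold_bytes = total_bytes = 0

for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        full = os.path.join(dirpath, name)
        try:
            st = os.stat(full)
        except OSError:
            continue  # skip files that vanish or are unreadable while scanning
        total_bytes += st.st_size
        age_days = (now - st.st_mtime) / 86400  # mtime as a rough proxy for last use
        if age_days > COLD_AFTER_DAYS:
            cold_bytes += st.st_size

if total_bytes:
    print(f"{cold_bytes / total_bytes:.0%} of {total_bytes / 2**30:.1f} GiB "
          f"has not been modified in the last {COLD_AFTER_DAYS} days")
```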

most shared data is actually used by one client only

Even if we use networked storage, most of the data we create is meant for a single user. That is what really happens in any company. How many files written to networked storage or to a server serve a single user’s purposes? How many of the documents we create day after day are actually shared?

The answer is more or less the same in more or less any company, and it points in the same direction: shared storage is mostly used as a personal data repository (frequently the real way documents get shared is e-mail, but that is a different topic). You might even wonder why networked storage should be considered worthwhile at all.

if a file is accessed many times, it will be read

It might sound weird, but this is actually what happens. Most documents are edited by only one user and shared after completion; other users will then usually just read them. This is a very important fact: it means that read performance has the highest importance, higher than write performance, and the larger the workgroup gets, the higher the need for read performance.
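
A quick, purely hypothetical back-of-the-envelope calculation shows why the balance tilts toward reads as the workgroup grows; every number below is invented for illustration.

```python
# Hypothetical workgroup, purely illustrative: one author, many readers.
saves_by_author = 20     # write operations while the document is being edited
readers = 50             # colleagues who open the finished document
reads_per_reader = 2

reads = readers * reads_per_reader
print(f"writes: {saves_by_author}, reads: {reads}, "
      f"read/write ratio: {reads / saves_by_author:.0f}:1")
# Double the workgroup and the ratio doubles too -- reads dominate as the group grows.
```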

recap

We create data/files, we use them for a short amount of time, then we cannot (or choose not to) delete them anymore. Where’s the problem?

Maintaining frozen data costs a lot! It might even cost as much as maintaining active data. Data volumes are exploding in every company, at an unstoppable and sometimes uncontrollable rate (controllability also depends on external causes). Being aware of this, as is often the case, is the first step toward the solution.

technological solutions

We basically need a storage system featuring:

  1. high performance for manipulating active data
  2. a very roomy and scalable capacity to host the growing amount of silent or seldom-read data

In theory such a storage system exists, but it is way too pricey.

That’s why the storage industry devised technologies to ease these issues while keeping costs at a fair level. The main solutions are:

  • automated tiered storage
  • SSD (used as cache)
  • data size reduction (deduplication and compression)

Each technology has its own pros and cons:

automated tiered storage

Automated tiered storage is the storage system’s ability to place the most frequently accessed data on the fastest disks, using a write-friendly RAID level such as RAID 10, and to migrate less hot data to slower disks with a space-efficient RAID level (RAID 5 or RAID 6). Most vendors offer an implementation of this technology; some are efficient and well-established, others less so. A lot depends on the storage system architecture, on the page/block/chunk size (every vendor has its own dialect) and on the algorithms used to monitor the data and move it around.
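
Just to make the principle concrete, here is a toy sketch of the placement decision: blocks with enough recent accesses stay on the fast tier, the rest are demoted to a capacity tier. Real arrays work on vendor-specific pages/chunks with far more sophisticated heat statistics and background movers; the thresholds and names below are invented.

```python
import time

# Toy access-heat tracker: real arrays do this per page/block/chunk, asynchronously.
class TieringPolicy:
    def __init__(self, hot_threshold=10, window_seconds=3600):
        self.hot_threshold = hot_threshold      # accesses needed to be considered "hot"
        self.window_seconds = window_seconds    # only recent accesses count
        self.accesses = {}                      # block_id -> list of access timestamps

    def record_access(self, block_id):
        self.accesses.setdefault(block_id, []).append(time.time())

    def tier_for(self, block_id):
        cutoff = time.time() - self.window_seconds
        recent = [t for t in self.accesses.get(block_id, []) if t > cutoff]
        return "fast (RAID 10 / SSD)" if len(recent) >= self.hot_threshold else "capacity (RAID 6)"

policy = TieringPolicy(hot_threshold=3)
for _ in range(5):
    policy.record_access("block-42")   # simulate a busy block
policy.record_access("block-99")       # and a quiet one

print("block-42 ->", policy.tier_for("block-42"))   # lands on the fast tier
print("block-99 ->", policy.tier_for("block-99"))   # stays on the capacity tier
```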

SSD as a cache

Some vendors chose to use latest-generation SSDs, or flash memory cards, as a big cache. Unlike automated tiering, this approach has very little impact on the existing architecture while giving good results; the only limits are the maximum cache size and, in some implementations, its suitability as a write cache.
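
The caching idea itself can be sketched in a few lines: a fixed-size, least-recently-used read cache sitting in front of the slow disks. It also illustrates the limit mentioned above, because once the “SSD” is full the coldest entries get evicted. Block names and the backend function are invented placeholders.

```python
from collections import OrderedDict

class ReadCache:
    """Tiny LRU read cache standing in for an SSD/flash layer in front of slow disks."""

    def __init__(self, capacity, backend_read):
        self.capacity = capacity          # how many blocks fit in the "SSD"
        self.backend_read = backend_read  # function that reads a block from slow storage
        self.cache = OrderedDict()

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # cache hit: mark as most recently used
            return self.cache[block_id]
        data = self.backend_read(block_id)     # cache miss: go to the slow disks
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used block
        return data

def slow_disk_read(block_id):
    return f"data-of-{block_id}"   # placeholder for a real (slow) disk read

cache = ReadCache(capacity=2, backend_read=slow_disk_read)
cache.read("a"); cache.read("b"); cache.read("a"); cache.read("c")  # "b" gets evicted
print(list(cache.cache))   # ['a', 'c']
```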

data size reduction

Compression and deduplication (data chunks are indexed and duplicates are discarded) are other smart ways to tackle the data growth issue. Adopting these techniques brings its own complexities: unlike the previous two solutions, data size reduction requires heavy CPU usage, but the results can be brilliant. In some real-world cases I have seen very high compression/dedupe ratios, potentially able to offset the physiological long-term growth.
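
As a rough illustration of how deduplication and compression bite into the numbers, the sketch below splits a byte stream into fixed-size chunks, stores only one compressed copy of each identical chunk (identified by its SHA-256 hash) and keeps a “recipe” of hashes to rebuild the original. Real products use variable-size chunking and far smarter metadata; the sample data is invented.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # fixed-size chunking; real products often use variable-size chunks

def dedupe_and_compress(data: bytes):
    store = {}      # chunk hash -> compressed unique chunk
    recipe = []     # ordered list of hashes needed to rebuild the original data
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return store, recipe

# Invented example: ten copies of the same document-like payload plus a unique tail.
data = (b"quarterly report, mostly boilerplate text " * 1000) * 10 + b"unique appendix"
store, recipe = dedupe_and_compress(data)

stored_bytes = sum(len(c) for c in store.values())
print(f"original: {len(data)} bytes, stored: {stored_bytes} bytes, "
      f"ratio: {len(data) / stored_bytes:.1f}x")
```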

in union there is strength

At least initially, vendors implemented just one of these technologies; then, to mitigate the cons and strengthen the pros, they started to combine two of them on the same data volume. This approach often seems to be the best way to get the highest performance together with the greatest space efficiency.

summing up

You can find data growing at an almost uncontrollable rate in any kind of IT business.

Small companies usually ignore the problem and simply buy a bigger storage system; this usually ends up being the worst of the possible choices, an expensive chase for ever larger capacity without satisfying outcomes.

Bigger companies are usually aware of the problem and look for real solutions, even though I still sometimes find myself involved in discussions where the price of a storage system is treated more as a function of its overall raw disk capacity than of its real ability to solve the real problems.

Managing a steadily growing amount of data is a challenge that has only just begun: if I take a look at my smartphone I count 950 pictures and videos stuffed into its internal storage! My first cell phone’s onboard camera had a resolution of 640×480 pixels and only a few images could be crammed into its tiny memory; now pictures are 5 megapixels, I can record HD video and the internal storage is 32 GB big… and every time I sync the smartphone, more gigabytes of data end up on our company server.