BLACK MOUNTAIN

In last week’s column I began considering opportunities that arise from the linkage between storage of data and the uses of that data. Beginning from a simple definition of storage,

“Storage is about saving a copy of something, which you are able to get later,”

I raised four questions:

  • What is “something”?
  • When and where does it get saved?
  • What does it mean to get it?
  • When is later and who are you?

    Last week’s column explored the first two questions and the associated links between storage and content, networking, and distribution. This week, we look at the last two questions.

    What does it mean to get it? The link to search and document management.

    In a world of Googling, it is no longer acceptable to have to search through folder after folder, or to wait for the results of an excruciatingly slow find, in order to retrieve a file. While relational database systems address this issue fairly well for highly structured data, today’s general storage systems leave the retrieval of information to the user, offering nothing more than a hierarchical organizational tool (wonderful for naturally hierarchical data, but unhelpful otherwise) and the possibility of hand-creating a few links via shortcuts. The same is true of data on the Internet: the facilities provided for maintaining lists of favorites are precisely the same hierarchical tools as the file system.

    As the amount of data that we want and need to access continues to grow, this traditional organization will become increasingly inadequate. Why should I have to know a file’s location, or even its name? Why can I not simply find it through a topic search within my own files, or through links from related documents?

    These needs will drive a trend that is already becoming visible in storage, namely the convergence of storage, search, and document management. The intersection of the first and last was signaled recently by EMC’s purchase of Documentum. Likewise, the user of information does not care about the difference between search and retrieval; all of it falls under the concept “get,” whether the mechanism behind it is a document management system or pre-indexing for fast search, performed at the time of storage or on an ongoing basis across all files in a system. Also of interest is the ability to make internally generated search indexes available for outside (web) use. Perhaps the next Google will be a distributed application that works internally and makes information available outside in a controlled way.
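The pre-indexing idea can be sketched in a few lines: build an inverted index as each file is stored, so that a later topic search is a lookup rather than a scan of every file. The file names and contents below are hypothetical examples, not part of any real product.

```python
import re
from collections import defaultdict

index = defaultdict(set)   # word -> names of documents containing it
documents = {}             # document name -> content

def store(name, text):
    """Save a document and index its words in the same step."""
    documents[name] = text
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(name)

def get(*words):
    """Retrieve the names of documents containing all the given words."""
    results = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*results) if results else set()

store("q3-report.txt", "Storage revenue grew in the third quarter")
store("memo.txt", "Archive the old storage logs before Friday")

print(get("storage"))             # both documents
print(get("storage", "archive"))  # only memo.txt
```

The point of the sketch is the coupling: indexing happens at storage time, so "get" never has to walk a folder hierarchy.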

    When is later and who are you? The link to security and archive management.

    This question cheats a bit, but I believe that the issues of protecting data from disaster, intrusion and corruption, and unauthorized access are all interrelated with one another and with the general question of archiving for long-term access.

    These are not new issues, of course. They have been around for a long time, but they have become much sharper due to the events of 9/11, the increasing exposure of vital corporate and social data to sophisticated digital attacks, and new requirements recently mandated by such legislation as HIPAA and Sarbanes-Oxley. These, along with cost and manageability, are the issues that are driving storage most strongly in the shorter term.

    There are three basic areas of opportunity.

    The first is targeted at disaster recovery and includes a wide variety of solutions, from RAID, to journaling and log-structured file systems, to on-site and remote mirroring. Opportunities will continue to arise here as distributed system tools, network bandwidth capacities and new data transfer standards continue to develop.
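The redundancy idea behind RAID can be illustrated with a toy parity calculation: a parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. Real arrays do this per-stripe in hardware or the volume manager; the block contents below are made up for illustration.

```python
def parity(blocks):
    """XOR byte-strings of equal length into one parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks
p = parity(data)                     # parity, stored on a fourth disk

# Disk holding the second block fails: rebuild it from the others plus parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])            # True
```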

    A second area of opportunity is that of long-term storage and archiving. On the hardware side, continuing increases in storage density and decreases in cost will make very-long-term storage of data technically feasible, as long as issues of managing the data can be addressed. Here, there are possibilities not only for traditional archive and document management techniques to play a role, but for these to be combined with innovative virtualization techniques and new indexing technologies for fast search.

    The final, and perhaps largest, opportunity is security, from controlling access to data to preventing and detecting corruption. Today, security seems to be a game of catch-up, trying to detect the latest exploitation of security holes before too much damage is done. In the long term, I believe, this is a losing game. Yes, there will need to be multiple levels of solution, but I suspect that a paradigm shift will also be necessary. In the nearer term, however, protecting storage that is increasingly created, duplicated and accessed across both local and wide area networks offers many opportunities: encryption and digital signature infrastructure; agents and virus-like network guardians; and methods of binding access control and corruption detection to the data objects themselves in ways that can cross operating systems.
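As one sketch of binding corruption detection to the data object itself, a keyed integrity tag (here an HMAC, built with Python's standard library) can travel with the object and be checked by any holder of the key, regardless of operating system or filesystem. The key and payload are hypothetical; a real system would need proper key management, or public-key signatures where outside parties must verify.

```python
import hmac, hashlib

KEY = b"shared-secret-key"  # hypothetical; real systems manage keys carefully

def seal(payload):
    """Return the object with a 32-byte integrity tag appended."""
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return payload + tag

def verify(sealed):
    """Split off the tag and check it; reject corrupted objects."""
    payload, tag = sealed[:-32], sealed[-32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected), payload

obj = seal(b"customer record 1138")
print(verify(obj)[0])                 # True

tampered = obj[:9] + b"X" + obj[10:]  # flip one byte of the payload
print(verify(tampered)[0])            # False
```

Because the tag is part of the stored object rather than a property of one operating system's access control lists, the check survives copying across systems.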


    In all the discussion above, the common theme is that of convergence of storage with other concerns. Even as the latest advances in basic storage infrastructure are more and more rapidly commoditized, innovations that tie the storage infrastructure into the value associated with the data and the ability to access that data are most likely to be winners in the longer term.

    Eric Jackson is the founder of DeepWeave. He has built his career pioneering software solutions to particularly large and difficult problems. In 2000, Eric co-founded Ibrix, Inc. He is the inventor of the Ibrix distributed file system, a parallel file storage system able to scale in size and performance to millions of terabytes.

