FARSIGHTED NEWS Archive

Read articles from past Farsighted News editions.



AUGUST 2006

Considerations for data discovery, classification and indexing tools

By Greg Schulz – Author "Resilient Storage Networks" and founder of the StorageIO group.

In the April 2006 issue of SNIA FarSighted, in the article "Storage vocabulary de-mystification and contextuality", I discussed issues involving storage-related vocabulary. That article stressed the importance of knowing the context in which a term is being used in order to understand what it refers to. For example, to some people SoA means service oriented architecture, where, depending upon the context, "service" could mean storage services, application software, or utility grid services among others. SoA has also been used to describe search oriented architectures, with examples ranging from Internet search engines (Google, MSN, Yahoo, etc.) to document management systems, storage resource management (SRM), data discovery, and email content management tools among others.

This article provides an overview that puts different discovery, classification, indexing and reporting technologies into context, to help you identify which technologies may apply to your particular needs in support of SRM, ILM, compliance, data protection and other initiatives. There are many different terms pertaining to data discovery, classification and indexing, and the SNIA dictionary can be of assistance with them. Data classification, discovery, search, indexing, content classification, analysis and other terms are being used to describe technologies that provide information about stored data, including deep content analysis. Information about a file, including file characteristics and information about the data itself, is known as meta data.

Data classification and search tools perform three basic functions: discovery, classification and indexing, as well as searching, reporting and taking action. Some vendors may break these out further to help articulate their particular story, such as separating classification and analysis, indexing, along with searching and taking action. Taking action refers to the ability to interface with various storage systems (on-line, near-line, and off-line), including object based CAS and archiving systems, to enable management and migration of data. Taking action can also mean interfacing with policy managers that are part of other solutions, whether host based, network based, or storage system based, to perform data movement, migration and other functions based on discovered and analyzed data and policies.

SRM and basic data discovery and classification tools perform file path and file meta data discovery. SRM tools have a vertical focus on storage and file identification for storage management purposes including allocation, performance, and reporting. Examples include traditional SRM tools such as HP/AppIQ, EMC Visual SRM and Tek-Tools/Profiler among others. Some tools provide basic SRM-like functionality along with more advanced capabilities including archiving, document management, email and data migration. Examples include integrated products from different vendors and family suites of products from a single vendor to support archiving and data movement. Deep content discovery, indexing, classification and analysis tools support features including word relativity, advanced language support, and search and discovery features for vertical markets. Features include litigation hold, lexicon, templates, and other definitions. Examples include Kazeon, StoredIQ, Abrevity, Scentric, Mathon, Nijni, Verity, Index Engines and FAST among others.

When looking at data discovery and indexing tools, keep in mind what your intended and primary usage for the technology will be. For example, are you planning on using the tools to perform deep content discovery for compliance, legal litigation, and intellectual property search? Perhaps you are looking to identify what files you have, when they were last accessed and which of them might be candidates for moving to different tiers of storage. By keeping your primary objectives in focus, you may find that different tools work better for some tasks than for others, and that you may end up with more than one tool.
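To make the tiering case concrete, here is a minimal sketch of a basic file meta data scan that lists files not accessed for a given number of days. It is only an illustration and not how any particular SRM product works; the function name find_tiering_candidates and the 90 day threshold in the usage note are assumptions made for this example. Keep in mind that some file systems are mounted so that access times are updated rarely or not at all, in which case last-access data is unreliable.

```python
import os
import time


def find_tiering_candidates(root, min_idle_days, now=None):
    """Return (path, size_bytes, idle_days) for files under root that
    have not been accessed for at least min_idle_days.

    Only file meta data is read (os.stat); file contents are never opened.
    """
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * 86400  # 86400 seconds per day
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime <= cutoff:
                idle_days = int((now - st.st_atime) // 86400)
                candidates.append((path, st.st_size, idle_days))
    # Longest-idle files first, since they are the strongest candidates
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates
```

For example, find_tiering_candidates("/data/projects", 90) would list the files under a hypothetical /data/projects directory that have been idle for 90 days or more, longest idle first.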

Performance, or how many documents can be processed in a given amount of time, is influenced by several factors, starting with whether only basic file meta data or directory type information is being captured or whether deeper document content discovery is being performed. Other factors include the size, type, content and composition of a document along with what is being done with the document. Graphics and images may yield less content-based meta data than a Word or PDF document that contains lots of text data.

You should be able to process very large numbers of files per minute if all you are doing is basic file and access meta data lookup and indexing. More involved discovery and coverage will take more time and result in more meta data being generated. Some vendors will provide a number for how many files can be processed per second or per minute, while others may provide you with a gauge of how many GBytes or TBytes can be processed per minute or hour. While interesting, take these numbers, if they exist, with a grain of salt and keep them in perspective in terms of what type of discovery and indexing activity is taking place for your specific needs and environment.

For example, when a vendor tells me that they can process thousands of documents per second, my first question is whether they are actually opening up the documents and performing lexical and taxonomy processing (deep content analysis) or simply looking at file meta data including access, modification, size, type and other attributes. If you are interested in deep processing, then ask questions that put things into a more appropriate context for your environment, such as how many files of a particular size and type (those that apply to your situation) can be processed. Looking at just one metric, like the number of files processed in a period, is similar to looking only at the number of I/O operations per second (IOPS) of a storage system without regard to the size or type of I/O: you are not seeing the full performance picture.
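A quick back-of-the-envelope calculation shows why the two kinds of vendor numbers are not interchangeable. The sketch below converts a files-per-second rate and a GBytes-per-hour rate into elapsed hours for the same set of files. All of the figures (10 million files, 0.5 MByte average size, 2,000 files per second, 100 GBytes per hour) are made up for illustration and are not measurements of any product.

```python
def hours_by_file_rate(file_count, files_per_second):
    """Elapsed hours when a tool is rated in files per second."""
    return file_count / files_per_second / 3600


def hours_by_data_rate(file_count, avg_file_mb, gb_per_hour):
    """Elapsed hours when a tool is rated in GBytes per hour."""
    total_gb = file_count * avg_file_mb / 1024
    return total_gb / gb_per_hour


# Hypothetical environment: 10 million files averaging 0.5 MBytes each
files = 10_000_000
avg_mb = 0.5

# Hypothetical meta data only scan rated at 2,000 files per second
basic = hours_by_file_rate(files, 2000)

# Hypothetical deep content scan rated at 100 GBytes per hour
deep = hours_by_data_rate(files, avg_mb, 100)

print(f"basic meta data scan: {basic:.1f} hours")
print(f"deep content scan:    {deep:.1f} hours")
```

With these assumed figures the same file population takes a little over an hour for the basic scan and roughly two days for the deep scan, which is why it pays to ask what kind of processing stands behind a quoted rate.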

There are different approaches to analysis and discovery of data, including in-line or in parallel while the data is being stored, or after the fact with a scheduled discovery scan and analysis. How much storage capacity will be required to store meta data and indices will be a function of how well optimized the indexing and repository are, along with the amount of meta data being retained per file, the size of the files and the number of files. The overhead of file processing to build indexes and parse documents can vary depending on a vendor's architecture and on the type of file processing being performed. Unless a separate copy of your data is being made, data discovery tools will need to perform read I/O operations against your storage systems.
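The capacity relationship described above can be sketched as a simple estimate. The model below is an assumption for illustration only: it treats the repository as a fixed amount of meta data per file plus, for deep content indexing, an index whose size is some fraction of the file size. The parameter names and the sample values are hypothetical, and real products will differ depending on how they optimize their indexes.

```python
def estimate_repository_bytes(file_count, avg_file_bytes,
                              fixed_bytes_per_file, content_ratio=0.0):
    """Rough size of a meta data repository.

    fixed_bytes_per_file: attributes kept for every file (path, size,
                          owner, timestamps and so on).
    content_ratio:        index size as a fraction of file size; 0.0
                          models a meta data only scan.
    """
    per_file = fixed_bytes_per_file + avg_file_bytes * content_ratio
    return file_count * per_file


# Hypothetical: 1 million files averaging 100,000 bytes each
basic = estimate_repository_bytes(1_000_000, 100_000, 500)
deep = estimate_repository_bytes(1_000_000, 100_000, 500, content_ratio=0.1)

print(f"basic: {basic / 1024**3:.2f} GBytes")
print(f"deep:  {deep / 1024**3:.2f} GBytes")
```

Asking a vendor for the two inputs, meta data retained per file and index size relative to document size, for the document types you actually have, lets you plug in real numbers instead of these placeholders.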

I laugh when vendors tell me that there is no impact and no overhead due to file processing, and I congratulate those vendors who say that there is some overhead, even if it is negligible and not noticeable other than as an increase in file opens or read operations during scan operations. The reality is that for lightweight basic file meta data and access type discovery, the overhead should be similar to or less than that of directory scan type operations. On the other hand, I would expect more in-depth full file discovery processing to generate I/O activity similar to or less than that of file based backup operations, when volumes of files are read.

Many discovery tools have their processing and analysis software running on appliances that access shared storage for processing, thus their only impact on production servers and systems would be I/O access to the files being analyzed. The compute-intensive lexical and taxonomy analysis and subsequent indexing are offloaded to processors on the appliances, reducing the impact on application servers. Some tools have the ability to partition work such that different files are worked on by different nodes in a cluster to boost parallel processing, along with distributed indexes. Some tools also support the notion of a unified or federated search that accesses the various distributed meta data bases and indices yet provides a single view of results.
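The idea of partitioning work across nodes and then presenting a single view of the results can be illustrated with a toy, in-memory sketch. This is not any vendor's architecture: the hash-based assignment, the simple word index and the names assign_node, NodeIndex, build_cluster and federated_search are all assumptions chosen to keep the example small.

```python
import hashlib
from collections import defaultdict


def assign_node(path, node_count):
    """Pick a node for a file using a stable hash of its path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class NodeIndex:
    """One node's private index: word -> set of file paths."""

    def __init__(self):
        self.terms = defaultdict(set)

    def add(self, path, text):
        for word in text.lower().split():
            self.terms[word].add(path)

    def search(self, word):
        return self.terms.get(word.lower(), set())


def build_cluster(documents, node_count):
    """Distribute documents (path -> text) so each is indexed by one node."""
    nodes = [NodeIndex() for _ in range(node_count)]
    for path, text in documents.items():
        nodes[assign_node(path, node_count)].add(path, text)
    return nodes


def federated_search(nodes, word):
    """Query every node's index and merge the hits into a single view."""
    hits = set()
    for node in nodes:
        hits |= node.search(word)
    return sorted(hits)
```

Each file is indexed by exactly one node, so indexing work is divided across the cluster, while federated_search hides that distribution from the person running the query.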

When looking at data discovery and content classifying technologies, keep the following in mind when planning for your environment. Learn where the meta data base or repository is stored and how it is backed up and protected. Also pertaining to the meta data base, find out how much space is required and what the overhead is (e.g. size of meta data) to store meta information for various types and sizes of documents. Look at what languages are supported, both in terms of documents and different language taxonomies, as well as internationalization of discovery tool interfaces and GUIs. Some other things to consider include what licensing options exist, such as base fees, site wide usage, amount of storage or number of documents under management, and other add-ons including plug-ins and personality modules. Take a look at how the solution scales in terms of supporting more users and documents under management, along with processing larger numbers of documents.

Architectural considerations include performance, capacity, and depth of coverage along with discovery, security and audit trails. Policy management should be considered along with policy execution, including interfaces with other policy managers and data migration tools. Some tools also support interfaces to different storage systems, including vendor specific APIs for archiving and compliance storage. Consider whether the candidate tools have embedded or built-in support for processing different templates, lexicons, syntax and taxonomies associated with different industries and regulations. For example, if you are dealing with financial documents, then the tool should support processing of data in the context of various financial taxonomies such as banking, trading, benefits and insurance among others. If you are processing legal documents, then support for legal taxonomies will be needed.

Look into what reporting and search capabilities exist, including the ability to save search results and search queries along with holding data for retention. An example is support for litigation hold, where discovered documents applicable to a particular legal action are put on hold and preserved, including via interfaces to object based storage systems. Consider what auditing and tracking tools you need in order to track who has reviewed sensitive data via discovery tools, and when. This should extend to cover not only when files were processed, but also who ran what queries, as a means of auditing the auditors. General questions include: When is the data classification, analysis and indexing performed? How long is the indexed and stored data retained? How will the technology integrate into your environment with your existing technologies? And how seamless are upgrades?
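As an illustration of "auditing the auditors", the following sketch wraps a search function so that every query is recorded with who ran it, when, and what it returned. It is a simplified assumption of how such tracking could work, not a description of any product; the names AuditRecord and AuditedSearch are invented for this example, and a real tool would write to protected, tamper-resistant storage rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    user: str
    query: str
    matched_paths: tuple
    timestamp: datetime


class AuditedSearch:
    """Wrap a search function so every query leaves an audit record."""

    def __init__(self, search_fn):
        self._search_fn = search_fn  # any callable: query -> iterable of paths
        self._log = []

    def search(self, user, query):
        results = tuple(self._search_fn(query))
        self._log.append(
            AuditRecord(user, query, results, datetime.now(timezone.utc))
        )
        return list(results)

    def queries_by(self, user):
        """Return the audit records for one user, oldest first."""
        return [record for record in self._log if record.user == user]
```

A reviewer could later call queries_by("some_user") to see which queries that person ran and which documents each query exposed.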

Trends that the StorageIO group sees pertaining to data discovery tools include more support for multilingual documents, the ability to handle more complex document formats and styles, and interfaces to fixed content, archiving and content based (CAS) storage systems for data preservation and compliance purposes. Discovery and indexing type technologies are a natural fit to be paired with content based data preservation and archiving storage systems.

Search, discovery and indexing solutions vary in nature, complexity and capabilities to meet different needs and purposes. Understanding what your target applications and needs are for discovery tools will help to ensure a positive and successful solution. If you are looking to understand what files exist on your system to help implement a tiered storage environment, start by looking at traditional SRM type tools. If, on the other hand, you need to perform deep data discovery to support litigation, compliance and other functions, then take a look at the more advanced deep data discovery tools. You may find that some tools can meet multiple objectives; however, be careful to understand what, if anything, is compromised by a system that can do lots of different things well vs. a tool that can do certain things very well.

There are many considerations and product specific details that could be covered, however for now, hopefully this gives you a better picture of the options and things to consider when looking at data discovery, classification, indexing and document management systems. Feel free to drop me a note with your comments and questions at greg@storageio.com.

About the author:

Greg Schulz is founder and Sr. analyst of the StorageIO group (www.storageio.com) and author of the book (SNIA recommended reading) "Resilient Storage Networks" (Elsevier).