There are three basic steps to the process of searching, and each presents a challenge as the amount of content grows. One of these steps is crawling. Crawling is a resource-intensive process that gathers, opens and breaks apart content in order to build an index to search against. As content grows, this step becomes slow and the index becomes stale, prompting users to abandon search.
Crawling in a Nutshell
The crawl process is the most time-consuming of all the search steps because it is made up of many steps of its own. The first step is to gather the content, and gathering has two parts. The first part is the enumeration of the content items that should be crawled: connectors connect to the content source and walk the URL of each content item. After a sufficient number of items have been enumerated, the second part of gathering starts. This part downloads each item and opens it.
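To make the two parts of gathering concrete, here is a minimal Python sketch. It is not how the SharePoint or FAST crawler is implemented; the `gather` function, the `fetch` callback and the `batch_size` threshold are all names I made up for illustration. It simply enumerates URLs until a batch fills up and then downloads the items in that batch.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def gather(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    batch_size: int = 100,  # hypothetical threshold, not a SharePoint setting
) -> Iterator[Tuple[str, bytes]]:
    """Enumerate URLs in batches, then download every item in each batch."""
    batch: List[str] = []
    for url in urls:
        # Part 1: enumeration. Only the address is collected here.
        batch.append(url)
        if len(batch) >= batch_size:
            # Part 2: enough items are enumerated, so download them.
            for item_url in batch:
                yield item_url, fetch(item_url)
            batch = []
    # Download whatever is left in the final, partially filled batch.
    for item_url in batch:
        yield item_url, fetch(item_url)
```

Passing `fetch` in as a callback keeps the sketch independent of any particular content source, so it can be exercised with a plain dictionary lookup.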
Once the document is opened, the second step, indexing, starts. Content is examined to detect its language and an IFilter is applied. The IFilter identifies and indexes the content, and there are specific IFilters for specific file types. Once the indexing is completed, the third step, word breaking, is applied. This step tokenizes the content by splitting it into words at spaces, punctuation and special characters. The next step is to add the indexed content to the index and any property metadata to the property database. Finally, the crawl process checks whether any more URLs are available to be crawled. If there are, the steps are repeated.
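The word breaking and index update steps can be sketched roughly as follows. This is an illustration under my own assumptions, not the actual SharePoint word breaker: real word breakers are language specific, while this one uses a single regular expression and lowercases every token. The `break_words` and `add_to_index` names, and the in-memory dictionaries standing in for the index and the property database, are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Runs of letters and digits; spaces, punctuation and special characters
# (including the underscore) act as separators.
WORD_PATTERN = re.compile(r"[^\W_]+")


def break_words(text: str) -> List[str]:
    """Split text into lowercase tokens (lowercasing is an assumption)."""
    return [word.lower() for word in WORD_PATTERN.findall(text)]


def add_to_index(
    index: Dict[str, Set[str]],
    properties: Dict[str, Dict[str, str]],
    url: str,
    text: str,
    metadata: Dict[str, str],
) -> None:
    """Add tokens to the index and metadata to the property store."""
    for token in break_words(text):
        index[token].add(url)
    properties[url] = dict(metadata)


index: Dict[str, Set[str]] = defaultdict(set)
properties: Dict[str, Dict[str, str]] = {}
add_to_index(index, properties, "http://sp/doc1",
             "Full crawl, re-indexed!", {"Author": "example"})
```

After this runs, `index` maps each of the tokens `full`, `crawl`, `re` and `indexed` to the set containing `http://sp/doc1`, and `properties` holds the metadata for that URL.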
Improving the Speed with a Catch
In the SharePoint 2010 December CU, a change was made to allow file types to be excluded from crawling and yet still be searched and retrieved based on SharePoint field metadata. This change applies only to FAST search. Using the FAST search content service application, you can exclude a file type from being crawled. Prior to the December CU, this would prevent that file type from showing up in search results. After the December CU, you can exclude the file type and files of that type are still available in search results, but their content (the binary) is not crawled and indexed. The metadata (managed properties) generated from SharePoint fields, however, is still available to be searched against. Eliminating the need to download, open and index the content of files can improve the speed of the crawl dramatically. This is great for keeping indexes fresh and scaling to hundreds of millions of documents. The catch is that this is useful only if you do not rely on finding documents based on what is in their content. This makes the process of tagging/indexing the document very important.
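Conceptually, the post-CU behavior looks something like the Python sketch below. This is only a model of the idea, not the FAST crawler's code or configuration: `EXCLUDED_EXTENSIONS`, `crawl_item`, `fetch` and `extract_text` are hypothetical names. An excluded file type is never downloaded or run through an IFilter, yet it still gets an entry built from its SharePoint field metadata.

```python
import os
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# Hypothetical exclusion list; in SharePoint this is configured on the
# FAST search content service application, not in code.
EXCLUDED_EXTENSIONS = {".pdf"}


def crawl_item(
    url: str,
    metadata: Dict[str, str],
    fetch: Callable[[str], bytes],
    extract_text: Callable[[bytes], str],
) -> Dict[str, object]:
    """Build an index entry, skipping the binary for excluded file types."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    content: Optional[str] = None
    if extension not in EXCLUDED_EXTENSIONS:
        # The expensive path: download the binary and extract its text.
        content = extract_text(fetch(url))
    # Metadata is always kept, so the item can still be found by its fields.
    return {"url": url, "properties": dict(metadata), "content": content}
```

In this model a PDF entry comes back with `content` set to `None` and `fetch` is never called for it, which is where the crawl time saving comes from.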
How much faster?
Preliminary testing shows full crawl times decreasing by approximately 30%. This was the result of excluding a small number of PDF documents. Times will vary based on the size of the binaries and the speed of the corresponding IFilter. I am sure my colleague Russ Houberg (SharePoint MCM) will have a substantial amount of information on this soon.