Thursday, April 26, 2012

Improving Crawl Speeds in FAST Search for SharePoint 2010


Searching involves three basic steps, and each becomes more challenging as the amount of content grows. One of these steps is crawling. Crawling is a resource-intensive process that gathers, opens and breaks apart content in order to build an index to search against. As content grows, crawls take longer and the index becomes stale, which prompts users to abandon search.

 

Crawling in a Nutshell

Crawling is the most time-consuming of the search steps because it involves several stages of its own. The first step is to gather the content, which has two parts. The first part is the enumeration of the content items that should be crawled: connectors connect to the content source and walk the URL of each content item. After a sufficient number of items has been enumerated, the second part of gathering starts, which downloads each item and opens it.

Once the document is opened, the second step, indexing, starts. The content is examined to detect its language and an IFilter is applied. The IFilter identifies and indexes the content; there are specific IFilters for specific file types. Once the indexing is completed, the third step, word breaking, is applied. This step tokenizes the content by splitting it into words at spaces, punctuation and special characters. The next step is to add the indexed content to the index and any property metadata to the property database. Finally, the crawl process checks whether any more URLs are available to be crawled. If there are, the steps are repeated.
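To make the word breaking step more concrete, here is a minimal sketch in plain C#. It is not the word breaker that SharePoint or FAST actually uses (real word breakers are language-specific); the class name SimpleWordBreaker and the rule "split on anything that is not a letter or digit" are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration only; not the SharePoint or FAST word breaker.
public static class SimpleWordBreaker
{
    // Assumption: any run of characters that is not a letter or a digit
    // (spaces, punctuation, special characters) separates two words.
    private static readonly Regex Separators = new Regex(@"[^\p{L}\p{Nd}]+");

    public static List<string> Tokenize(string text)
    {
        List<string> tokens = new List<string>();
        if (string.IsNullOrEmpty(text))
        {
            return tokens;
        }

        foreach (string part in Separators.Split(text))
        {
            // Split returns empty strings when the text starts or ends with a separator.
            if (part.Length > 0)
            {
                tokens.Add(part);
            }
        }

        return tokens;
    }
}
```

For example, SimpleWordBreaker.Tokenize("Crawl speed, in FAST-Search!") returns the five tokens Crawl, speed, in, FAST and Search.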

 

 

Improving the Speed with a Catch

In the SharePoint 2010 December CU, a change was made that allows file types to be excluded from crawling yet still be searched and retrieved based on SharePoint field metadata. This change applies only to FAST Search. Using the FAST Search content service application, you can exclude a file type from being crawled. Prior to the December CU, this prevented that file type from showing up in search results. After the December CU, an excluded file type still appears in search results, but its content (the binary) is not crawled and indexed, while the metadata (managed properties) generated from SharePoint fields can still be searched. Eliminating the need to download, open and index the content of files can improve crawl speed dramatically. This is great for keeping indexes fresh and scaling to hundreds of millions of documents. The catch is that this is useful only if you do not rely on finding documents based on what is in their content, which makes the process of tagging/indexing the documents very important.
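The decision described above can be modeled roughly as follows. This is only a conceptual sketch in plain C#, not how FAST Search is implemented or configured; the class CrawlDecision, the method ShouldIndexContent and the idea of passing the excluded extensions as a collection are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical model of the crawl decision; not part of any SharePoint or FAST API.
public static class CrawlDecision
{
    // Returns true when the item's binary should be downloaded and indexed,
    // false when only its SharePoint field metadata should be indexed.
    // Assumption: excludedExtensions holds lowercase extensions without the dot, e.g. "pdf".
    public static bool ShouldIndexContent(string itemUrl, ICollection<string> excludedExtensions)
    {
        // AbsolutePath drops the host and any query string, leaving the file path.
        string path = new Uri(itemUrl).AbsolutePath;
        string extension = Path.GetExtension(path).TrimStart('.').ToLowerInvariant();
        return !excludedExtensions.Contains(extension);
    }
}
```

With "pdf" excluded, an item such as http://server/docs/report.PDF would be indexed on metadata alone, while http://server/docs/notes.docx would still be downloaded and opened.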

How much faster?

Preliminary testing shows full crawl times decreasing by approximately 30%. This was the result of excluding a small number of PDF documents. Times will vary based on the size of the binaries and the speed of the corresponding IFilter. I am sure my colleague Russ Houberg (SharePoint MCM) will have a substantial amount of information on this soon.

Tuesday, April 17, 2012

SharePoint 2010 Code Tips – Determining if a Content Type is Published and other Tips


In this post I will show you several code snippets that may be useful when developing SharePoint applications. The code accomplishes various things, and I will give you an idea of how you may want to use it in your applications. The code is posted “AS IS” with no warranties and confers no rights.

Determining if a content type is published

The Content Type Publishing feature in SharePoint 2010 is very useful for the management and reuse of content types. It enables you to define a content type on one site collection and publish the content types to other site collections. More information: Content Type Publishing in SharePoint 2010.

Typically, published content types are read-only and can only be inherited from, so you may need some code to determine whether a content type is published. The following code does this by examining the SPContentType.XmlDocuments collection. If the collection contains an XML document with a key of “Microsoft.SharePoint.Taxonomy.ContentTypeSync”, then the content type is either a published content type or derived from one. To tell which is which, compare the content type ID listed as an attribute in the XML document with the ID of the content type you are interrogating. If the two match, the content type has been published from the publishing hub.

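// Requires: using System; using System.Xml; using Microsoft.SharePoint;
public static bool IsContentTypePublished(SPContentType contentType)
{
    // Published content types, and types derived from them, carry this XML document.
    string syncXml = contentType.XmlDocuments["Microsoft.SharePoint.Taxonomy.ContentTypeSync"];
    if (string.IsNullOrEmpty(syncXml))
    {
        return false;
    }

    XmlDocument document = new XmlDocument();
    document.LoadXml(syncXml);
    XmlAttribute attribute = document.DocumentElement.Attributes["ContentTypeId"];
    if (attribute == null)
    {
        return false;
    }

    try
    {
        // A match means this type itself was published;
        // a mismatch means it only derives from a published type.
        return new SPContentTypeId(attribute.Value).Equals(contentType.Id);
    }
    catch (ArgumentException)
    {
        // The attribute did not hold a valid content type ID.
        return false;
    }
}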

 

Retrieving schema information for SharePoint built-In fields

Many developers use Visual Studio to develop and deploy SharePoint solutions that create document libraries and content types. More information: Creating custom content types. The definitions for these can contain FieldRef elements that reference existing built-in SharePoint fields, and developers need a way to obtain the schema XML definition for such a field. There are tools you can buy or download for free to obtain this information. However, you can also run the following simple code in a console application to obtain it, given a site URL and the internal name of the field. The code uses reflection to obtain the GUID of the built-in field from its internal name and then uses the GUID to obtain the schema.

// Requires: using System; using System.Linq; using System.Reflection; using Microsoft.SharePoint;
public static string GetSchemaXmlForBuiltInField(string siteURL, string fieldName)
{
    // SPBuiltInFieldId exposes each built-in field GUID as a public static field.
    FieldInfo[] matches = typeof(SPBuiltInFieldId)
        .GetFields(BindingFlags.Public | BindingFlags.Static)
        .Where(f => string.Equals(f.Name, fieldName, StringComparison.OrdinalIgnoreCase))
        .ToArray();

    // Without exactly one match there is no GUID to look up.
    if (matches.Length != 1)
    {
        return string.Empty;
    }

    // Static fields are read with a null target object.
    Guid fieldID = (Guid)matches[0].GetValue(null);

    using (SPSite site = new SPSite(siteURL))
    using (SPWeb web = site.OpenWeb())
    {
        SPField builtInField = web.AvailableFields[fieldID];
        return builtInField.SchemaXml;
    }
}
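If you want to try the reflection part of this technique without a SharePoint server, the following sketch shows the same lookup in plain C#. The SampleFieldIds class is a hypothetical stand-in for SPBuiltInFieldId, and its GUID values are made up for the example; FieldIdLookup.FindGuid is likewise just an illustrative helper.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-in for SPBuiltInFieldId: GUIDs exposed as public static fields.
// The GUID values below are invented for this example.
public static class SampleFieldIds
{
    public static readonly Guid Title = new Guid("11111111-1111-1111-1111-111111111111");
    public static readonly Guid Author = new Guid("22222222-2222-2222-2222-222222222222");
}

public static class FieldIdLookup
{
    // Returns the GUID stored in the public static field whose name matches
    // fieldName (ignoring case), or null when no such Guid field exists.
    public static Guid? FindGuid(Type holder, string fieldName)
    {
        foreach (FieldInfo field in holder.GetFields(BindingFlags.Public | BindingFlags.Static))
        {
            if (field.FieldType == typeof(Guid) &&
                string.Equals(field.Name, fieldName, StringComparison.OrdinalIgnoreCase))
            {
                // Static fields are read with a null target object.
                return (Guid)field.GetValue(null);
            }
        }

        return null;
    }
}
```

Calling FieldIdLookup.FindGuid(typeof(SampleFieldIds), "title") returns the sample Title GUID, while an unknown name such as "Missing" returns null, which is a convenient point to stop before asking SharePoint for a field that does not exist.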

 

Setting default values for metadata fields and locations

A nice feature in SharePoint 2010 is being able to set default values for fields per location or folder. This is accessible from the list settings page under “column default value settings”.

 

A scenario may arise where you want a metadata field to have a different default value in each folder. Setting a default value for a managed metadata field is quite different from setting one for other types of SharePoint fields. The following code shows how to do this using the Microsoft.Office.DocumentManagement.MetadataDefaults class. The method takes the URL of the site, the list name, the relative URL of the folder and the metadata field name. The code can be modified to accept the term group name, term set name and the default term you want to use; this example uses hard-coded values for these.

// Requires: using Microsoft.SharePoint; using Microsoft.SharePoint.Taxonomy;
// using Microsoft.Office.DocumentManagement;
public static void AddDefaultMetaDataForLocation(string siteUrl,
    string listName, string folderLocation, string metadataFieldName)
{
    using (SPSite site = new SPSite(siteUrl))
    using (SPWeb web = site.OpenWeb())
    {
        SPList list = web.Lists[listName];
        string folder = web.GetFolder(folderLocation).ServerRelativeUrl;

        // Hard-coded term store, group, term set and term; replace with your own values.
        TaxonomySession session = new TaxonomySession(web.Site);
        TermStore store = session.TermStores["Managed Metadata Service"];
        Group group = store.Groups["Cars"];
        TermSet termSet = group.TermSets["Cadillac"];
        TermCollection terms = termSet.GetTerms("STS", true);
        Term defaultValue = terms[0];

        // The default is written as "-1;#", then the term label, the delimiter and the term GUID.
        string defaultValueText = "-1;#" + defaultValue.Labels[0].Value +
            TaxonomyField.TaxonomyGuidLabelDelimiter + defaultValue.Id.ToString();

        // Store the default for this folder and persist the change.
        MetadataDefaults defaults = new MetadataDefaults(list);
        defaults.SetFieldDefault(folder, metadataFieldName, defaultValueText);
        defaults.Update();
    }
}
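The least obvious part of the method above is the text value that gets stored as the default. The sketch below builds the same string in plain C# so you can see its shape without a term store. The class TaxonomyDefaultValue is hypothetical, and the '|' delimiter is restated here as a local constant on the assumption that it matches TaxonomyField.TaxonomyGuidLabelDelimiter.

```csharp
using System;

// Hypothetical helper; it only mirrors the string concatenation used in the method above.
public static class TaxonomyDefaultValue
{
    // Assumption: same character as TaxonomyField.TaxonomyGuidLabelDelimiter.
    private const char GuidLabelDelimiter = '|';

    // Builds "-1;#<label>|<term GUID>", where -1 takes the place of a lookup ID
    // that is not known yet.
    public static string Format(string label, Guid termId)
    {
        return "-1;#" + label + GuidLabelDelimiter + termId.ToString();
    }
}
```

For a term labeled STS with the sample ID 00000000-0000-0000-0000-000000000001, TaxonomyDefaultValue.Format returns "-1;#STS|00000000-0000-0000-0000-000000000001".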

 

More Code Tips Soon

SharePoint has many features, and figuring out how to use them in your application can take time. Hopefully these code tips help speed up that process for you. I will try to post more of these in the future.