Sitecore: Extract Indexed Content of Media Files using MediaItemContentExtractor
This post is a follow-up to my previous post on indexing associated content.
Here is a common scenario:
Your custom index configuration is set up to crawl all the content for your website, which your site search (keyword search) then uses to fetch results. In addition to your content item crawlers, you add a crawler for Media Library items. Sitecore does a great job of indexing PDF, DOCX, DOC, etc. files automatically, provided you have a valid IFilter installed, so your search now shows file items as results too.
Now consider the following scenario:
One of the lookup fields on your page points to a file in the media library and the new requirement is to show the page item in the search result when the search phrase matches the content in the associated file.
Solution (Lucene & Solr): Create a computed field called "related_content" that stores the crawled content of the associated file, and extend the query to search both the "_content" field (in Solr; I think it's simply "content" in Lucene) and the "related_content" field.
Here is the code:
using System;
using System.Text;
using System.Xml;
using OConnell.Domain.Models.OConnell.Intranet.Components;
using OConnell.SC.Extensions;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;
namespace OConnell.SC.Search.ComputedFields
{
    public class RelatedContent : IComputedIndexField
    {
        // PDF, File, Docx, Document, Doc template IDs (Unversioned).
        // Declared at class level: a field cannot be declared inside a method.
        private readonly string _mediaItemTemplates =
            "{0603F166-35B8-469F-8123-E8D87BEDC171}|{962B53C4-F93B-4DF9-9821-415C867B8903}|{7BB0411F-50CD-4C21-AD8F-1FCDE7C3AFFE}|{777F0C76-D712-46EA-9F40-371ACDA18A1C}|{16692733-9A61-45E6-B0D4-4C0C06F8DD3C}";

        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            Item item = indexable as SitecoreIndexableItem;
            if (item == null)
            {
                Log.Debug(string.Concat("Rejected item at path: ", indexable.AbsolutePath));
                return null;
            }

            Log.Debug(string.Concat("Getting related content for item at path: ", indexable.AbsolutePath));
            var sb = new StringBuilder();
            try
            {
                // Change the field name to your lookup field.
                var relatedFile = item.Fields["Related File"];

                // Items without the field, or with an empty value, have nothing to index.
                if (relatedFile == null || string.IsNullOrEmpty(relatedFile.Value))
                {
                    return null;
                }

                var mediaItem = item.Database.GetItem(relatedFile.Value);
                if (mediaItem != null && _mediaItemTemplates.Contains(mediaItem.TemplateID.ToString()))
                {
                    sb.Append(GetFileContent(mediaItem));
                }
            }
            catch (Exception ex)
            {
                // Pass the exception along so the stack trace ends up in the log.
                Log.Error(string.Concat("Failed to get related content for item at path: ", indexable.AbsolutePath), ex, this);
            }

            return sb.Length == 0 ? null : sb.ToString();
        }

        private string GetFileContent(SitecoreIndexableItem indexableMediaItem)
        {
            // This path targets the default Solr configuration; adjust it if your
            // index uses a different configuration (for example, a Lucene one).
            XmlNode configurationNode =
                Sitecore.Configuration.Factory.GetConfigNode(
                    "contentSearch/indexConfigurations/defaultSolrIndexConfiguration/mediaIndexing");
            if (configurationNode == null)
            {
                return string.Empty;
            }

            // MediaItemContentExtractor expects the full XML, including the
            // "mediaIndexing" node. GetConfigNode seems to omit the parent node,
            // hence loading it as an XML document before passing it to
            // MediaItemContentExtractor.
            var xmlDocument = new XmlDocument();
            xmlDocument.LoadXml(configurationNode.OuterXml);

            var extractor = new MediaItemContentExtractor(xmlDocument);
            var indexedContent = extractor.ComputeFieldValue(indexableMediaItem);
            return indexedContent == null ? string.Empty : indexedContent.ToString();
        }
    }
}
Don't forget to add the configuration element for your computed field (change the type and assembly name to match your own class):
<fields hint="raw:AddComputedIndexField">
  <field fieldName="related_content">Sitecore.SharedSource.ComputedFields.RelatedContent, Sitecore.SharedSource</field>
</fields>
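For the query side of the solution, here is a rough sketch of how the search could be extended to look at both fields. It assumes the Sitecore ContentSearch LINQ API; the RelatedContentSearchResultItem class, the SiteSearch helper, the indexName parameter and the limit of 20 results are all made up for illustration, so treat it as a starting point and not as the exact code from my project.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

namespace OConnell.SC.Search
{
    // Hypothetical result type: exposes the computed field next to the
    // stock SearchResultItem properties (Content maps to "_content").
    public class RelatedContentSearchResultItem : SearchResultItem
    {
        [IndexField("related_content")]
        public string RelatedContent { get; set; }
    }

    // Hypothetical helper, not part of Sitecore.
    public static class SiteSearch
    {
        // indexName is an assumption: pass the name of your own custom index.
        public static List<RelatedContentSearchResultItem> Search(string indexName, string phrase)
        {
            using (var context = ContentSearchManager.GetIndex(indexName).CreateSearchContext())
            {
                // Match the phrase against the item's own crawled content
                // or against the content of its associated media file.
                return context.GetQueryable<RelatedContentSearchResultItem>()
                    .Where(r => r.Content.Contains(phrase) || r.RelatedContent.Contains(phrase))
                    .Take(20)
                    .ToList();
            }
        }
    }
}
```

A page whose "Related File" points to a PDF containing the search phrase should then come back as a result, even if the page's own fields never mention it. You will need to rebuild the index after adding the computed field before the new field has any data.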