RESOLVED: Solr Exceptions - Document contains at least one immense term in field

If you implemented Solr with Sitecore using Solr 5.x, you may run into the following error when indexing extremely large content in string fields:

org.apache.solr.common.SolrException: Exception writing document id <xxxuniqueid> to the index; possible analysis error.
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="<xxxfieldname>" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 9, 9, 10, 32, 32, 32, 32, 32, 32, 82, 84, 82, 83, 32, 70, 97, 99, 105, 108, 105, 116, 121, 32, 10, 32, 32, 32, 32, 84]...', original message: bytes can be at most 32766 in length; got <intgreaterthan32766>

Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got <intgreaterthan32766> 


This is caused by a change in Lucene 4.8.x. Before then, the exception was silently swallowed; now Lucene throws it and does not index the document. The exception is thrown by Lucene, not Solr, as described in the Lucene issue "DefaultIndexingChain swallows useful information from MaxBytesLengthExceededException".

After hours of searching the web and turning up little or nothing to solve the issue, my inclination was to write my own pipeline processor to truncate the field being indexed whenever it exceeded Lucene's 32766-byte term limit. Then I struck gold: Solr has a built-in mechanism to truncate fields, and you can provide your own limit.

Solution:

Step 1: In solrconfig.xml, find the "updateRequestProcessorChain" element and add the following chain. If the element does not exist, simply add it directly under <config>.

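        <updateRequestProcessorChain name="mychain">
          <!-- Truncate every field whose type class is solr.StrField to 10000 characters -->
          <processor class="solr.TruncateFieldUpdateProcessorFactory">
            <str name="typeClass">solr.StrField</str>
            <int name="maxLength">10000</int>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory" />
          <!-- RunUpdateProcessorFactory must come last; it performs the actual index write -->
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>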


Step 2: Find the element <requestHandler name="/update" class="solr.UpdateRequestHandler"> in solrconfig.xml and insert the following element inside it:

       <lst name="defaults">
         <str name="update.chain">mychain</str>
       </lst>
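
For reference, this is roughly how the handler looks once the defaults are in place. It is only a sketch, assuming the /update handler is declared explicitly in your solrconfig.xml; the handler name and class are the ones quoted in Step 2, and "mychain" is the chain name from Step 1.

```xml
<!-- Sketch: the /update handler from Step 2 using the chain from Step 1 by default -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- Route updates sent to /update through the truncating chain -->
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>
```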

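Restart Solr and rebuild your index. String fields are now truncated to the maxLength set in the processor. That limit is counted in characters, and each character takes at most 3 bytes in UTF-8, so 10000 characters come to at most 30000 bytes, which is under the 32766-byte limit.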
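
As a possible alternative to Step 2, Solr also allows a chain to be marked as the default, so that it applies to update requests that do not name a chain. The following is an untested sketch under that assumption, reusing the "mychain" definition from Step 1 with only the default attribute added; check it against your own Solr version before relying on it.

```xml
<!-- Alternative sketch: default="true" makes this the chain used when a request names none -->
<updateRequestProcessorChain name="mychain" default="true">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="typeClass">solr.StrField</str>
    <int name="maxLength">10000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With this variant the update.chain default on the /update handler should not be needed, which may help when the handler is not declared explicitly in solrconfig.xml.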

Comments

  1. Thanks so much. This got me out of a fix.

  2. In Solr 5.4.1, solrconfig.xml has no entry for requestHandler name="/update" class="solr.UpdateRequestHandler".

    Should we add it manually?

    Replies
    1. You shouldn't need to add it. There should be a requestHandler element for "/update", but it may be commented out by default, with no child elements.
