author: James
date: 2015-11-17 21:41:09+00:00
post_meta: date
preview: /social/0b4b017b3fb6b36b3c48b72fd744ab10b01dc723d8c9ec4414e9aa308b8c5494.png
tags: watson, work
title: Spellchecking in retrieve and rank
type: posts
url: /2015/11/17/spellchecking-in-retrieve-and-rank/

Introduction

Being able to deal with typos and incorrect spellings is an absolute must in any modern search facility. Humans can be lazy and clumsy, and I personally often search for things with incorrect terms due to my sausage fingers. In this article I will explain how to turn on spelling suggestions in Retrieve and Rank so that when your users send your system a clumsy query, you can suggest spelling fixes and they can submit another, more fruitful question.

Spellchecking is a standard feature of Apache SOLR, but it is turned off by default in Retrieve and Rank. This post will walk through the process of turning it on for your instance and enabling spell checking suggestions to be returned as part of calls to rankers through fcselect. Massive shout out to David Duffett on Stack Overflow, who posted the answer from which most of this blog post is derived.

Enabling spell checking in your schema

The first thing we need to do is set up a spell checker field in our SOLR schema. For the sake of simplicity, the example schema used below only has a title and text field which are used in indexing and querying. However, this methodology can easily be extended to as many fields as your use case requires.

Set up field type

The first thing you need to do is define a “textSpell” field type which SOLR can use to build a field into which it can dump valid words from your corpus that have been preprocessed and made ready for use in the spell checker. Create the following element in your schema.xml file:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
</fieldType>

At index time, this field type drops any stopwords defined in stopwords.txt and runs a lower case filter over the remaining words before storing the output in the field. At query time it also expands any synonyms defined in synonyms.txt. This should give us a list of lower case words that are useful in search and spell checking.
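To make the index-time analysis chain a bit more concrete, here is a rough Python sketch of what it does to a piece of text. It is only an approximation: the regular expression is a stand-in for StandardTokenizerFactory, and the STOPWORDS set is a made-up substitute for whatever your stopwords.txt contains.

```python
import re

# Hypothetical stand-in for the contents of stopwords.txt.
STOPWORDS = {"the", "a", "of", "and", "in"}


def analyze_for_index(text, stopwords=STOPWORDS):
    """Approximate the index-time chain: tokenize, drop stopwords, lowercase."""
    # Rough stand-in for StandardTokenizerFactory: split on non-word characters.
    tokens = re.findall(r"\w+", text)
    # StopFilterFactory with ignoreCase="true": compare case-insensitively.
    kept = [token for token in tokens if token.lower() not in stopwords]
    # LowerCaseFilterFactory: normalise everything that survived.
    return [token.lower() for token in kept]


print(analyze_for_index("The Business of Search in Watson"))
# ['business', 'search', 'watson']
```

The words that come out of this chain are the ones the spell checker can later offer as corrections.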

Create a spellcheck copy field in your schema

The next step is to create a “spell” field of type “textSpell” in your SOLR schema. This field stores the words copied from the main content so that the spellchecker API can use them as suggestions.

The following XML defines the field in your schema and should be copied into schema.xml. It assumes that you have a couple of content fields called “title” and “text” from which content can be copied and filtered for use in the spell checker.

<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" />
 
 <copyField source="title" dest="spell"/>
 <copyField source="text" dest="spell"/>
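If it helps to picture what the two copyField rules do, the following Python sketch gathers the values of the source fields into one multi-valued list, which is roughly what ends up being fed to the spell field's analyzer. The function name and the sample document are invented for illustration; SOLR does this internally at index time.

```python
def build_spell_values(doc, sources=("title", "text")):
    """Collect the values of the source fields into one multi-valued list."""
    values = []
    for name in sources:
        value = doc.get(name)
        if value is None:
            continue  # a document does not have to contain every source field
        if isinstance(value, list):
            values.extend(value)  # multi-valued source field
        else:
            values.append(value)
    return values


# Made-up example document.
doc = {"id": "1", "title": "Running a business", "text": "How to run a small business."}
print(build_spell_values(doc))
# ['Running a business', 'How to run a small business.']
```

Because the field is declared with stored="false", these copied values are only indexed for the spell checker and are not returned with search results.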

Defining the spellcheck search component

Once you have finished setting up your schema, you can define the spellchecker parameters in solrconfig.xml.

The following XML defines two spell checkers. DirectSolrSpellChecker pulls candidate terms directly from the index ad hoc, which means that it does not need to be regularly reindexed or rebuilt and always has up-to-date spelling suggestions.

WordBreakSolrSpellChecker offers suggestions by combining adjacent query terms and/or breaking terms into multiple words. This means that it can provide suggestions that DirectSolrSpellChecker might not find where, for example, a user has missed out or added a space in a multi-word search term.

Notice that both lst elements contain a field entry set to spell. This must match the name of the spell field we defined in the step above, so if you used a different name for your field, substitute it in here.

The documentation provides more detail on how to configure the individual spell check components as well as some alternatives to Direct and Wordbreak which might be more useful depending on your own use case. Your mileage may vary.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>

  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
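To get a feel for what a few of these parameters mean, here is a small Python sketch of the two ideas. It is not how SOLR implements them: plain Levenshtein distance is used as a simplification, only maxEdits, minPrefix, minQueryLength, combineWords and breakWords are modelled, and the term frequencies and vocabulary are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, term_freqs, max_edits=2, min_prefix=1, min_query_length=4):
    """Index terms within max_edits of word, closest and most frequent first."""
    if len(word) < min_query_length:
        return []  # very short terms are not corrected
    prefix = word[:min_prefix]
    candidates = []
    for term, freq in term_freqs.items():
        if term == word or not term.startswith(prefix):
            continue  # a suggestion must share the first min_prefix characters
        distance = edit_distance(word, term)
        if distance <= max_edits:
            candidates.append((distance, -freq, term))
    return [term for _, _, term in sorted(candidates)]


def word_break(terms, vocabulary):
    """Suggest joining adjacent terms or splitting a term into two known words."""
    found = [a + b for a, b in zip(terms, terms[1:]) if a + b in vocabulary]
    for term in terms:
        for i in range(1, len(term)):
            if term[:i] in vocabulary and term[i:] in vocabulary:
                found.append(term[:i] + " " + term[i:])
    return found


# Made-up index statistics for illustration.
TERM_FREQS = {"business": 3, "busyness": 1, "buses": 2, "witness": 4}
print(suggest("businwss", TERM_FREQS))
# ['business', 'busyness']
print(word_break(["searchengine"], {"search", "engine"}))
# ['search engine']
```

Raising minPrefix or lowering maxEdits narrows the set of candidates, which tends to trade recall for speed and precision.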

Add spelling suggestions to your request handlers

The default SOLR approach is to add a new request handler that deals with searches on the /spell endpoint. However, there is no reason why you can't add spelling suggestions to any endpoint, including /select and, perhaps more relevantly in Retrieve and Rank, /fcselect. Below is a snippet of XML for a custom /spell endpoint:

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <!-- Solr will use suggestions from both the 'default' spellchecker
         and from the 'wordbreak' spellchecker and combine them.
         collations (re-written queries) can include a combination of
         corrections from both spellcheckers -->
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

The following snippet adds spellchecking suggestions to the /fcselect endpoint. Simply add the three spellcheck.* entries to the defaults list and the spellcheck entry to the last-components array of the existing handler.

<requestHandler name="/fcselect" class="com.ibm.watson.hector.plugins.ss.FCSearchHandler">
  <lst name="defaults">
    <str name="defType">fcQueryParser</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="last-components">
    <str>fcFeatureGenerator</str>
    <str>spellcheck</str>
  </arr>
</requestHandler>
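Note that, unlike the /spell handler, the /fcselect defaults shown here do not set spellcheck to on, so each request has to switch it on with a spellcheck=true parameter. A minimal Python sketch of building such a request URL might look like this. The function name and the placeholder cluster, collection and ranker IDs are invented; the base URL is the one used in the curl examples in this post.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"


def build_fcselect_url(cluster_id, collection, ranker_id, query, base_url=BASE_URL):
    """Build an /fcselect URL that asks for spelling suggestions."""
    params = {
        "ranker_id": ranker_id,
        "q": query,
        "wt": "json",
        "fl": "id,title,score",
        # Not part of the handler defaults above, so it is sent with every request.
        "spellcheck": "true",
    }
    path = f"{base_url}/solr_clusters/{quote(cluster_id)}/solr/{quote(collection)}/fcselect"
    return f"{path}?{urlencode(params)}"


# Placeholder identifiers, not real ones.
print(build_fcselect_url("my_cluster", "my_collection", "my_ranker", "businwss"))
```

Using urlencode means that spaces and other awkward characters in the user's query are escaped for you.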

Create and populate your SOLR index in Retrieve and Rank

If you haven't done this before, you should really read the official documentation, and you may also want to read my post about using Python to do it.

You should also train a ranker so that you can take advantage of the fcselect with spelling suggestions example below.

Test your new spelling suggester

Once you've got your collection up and running, you should be able to try out the new spelling suggester. First we'll inspect /spell:

$ curl -u $USER:$PASSWORD "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$CLUSTER_ID/solr/$COLLECTION_NAME/spell?q=businwss&wt=json"

{"responseHeader":{"status":0,"QTime":4},"response":{"numFound":0,"start":0,"docs":[]},"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,"endOffset":8,"origFreq":0,"suggestion":[{"word":"business","freq":3}]}],"correctlySpelled":false,"collations":["collation",{"collationQuery":"business","hits":3,"misspellingsAndCorrections":["businwss","business"]}]}}

As you can see, the system has not found any documents containing the word **businwss**. However, it has identified **businwss** (easily misspelt because e and w are next to each other) as a typo of business and has suggested business as a correction. This can be presented back to the user so that they can refine their search and be presented with more results.
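As a rough illustration of how you might turn that response into a "did you mean" prompt, here is a short Python sketch that pulls the collated query out of the JSON. The sample is the spellcheck-relevant part of the /spell response shown in this post; the function names are my own invention.

```python
import json

# Trimmed copy of the /spell response from this post.
SAMPLE = """
{"response": {"numFound": 0, "start": 0, "docs": []},
 "spellcheck": {
   "suggestions": ["businwss", {"numFound": 1, "startOffset": 0, "endOffset": 8,
                                "origFreq": 0,
                                "suggestion": [{"word": "business", "freq": 3}]}],
   "correctlySpelled": false,
   "collations": ["collation", {"collationQuery": "business", "hits": 3,
                                "misspellingsAndCorrections": ["businwss", "business"]}]}}
"""


def collation_queries(response):
    """Return the rewritten queries from the flat [name, value, ...] collations list."""
    flat = response.get("spellcheck", {}).get("collations", [])
    queries = []
    for name, value in zip(flat[0::2], flat[1::2]):
        if name != "collation":
            continue
        # With collateExtendedResults the value is an object, otherwise a plain string.
        queries.append(value["collationQuery"] if isinstance(value, dict) else value)
    return queries


def did_you_mean(response):
    """Suggest a corrected query only when the search found nothing."""
    if response["response"]["numFound"] > 0:
        return None
    queries = collation_queries(response)
    return queries[0] if queries else None


print(did_you_mean(json.loads(SAMPLE)))
# business
```

The hits value in each collation tells you how many documents the corrected query would match, which is handy if you want to show only corrections that actually return something.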

Now let's also look at how to use spellcheck with your ranker results.

$ curl -u $USER:$PASSWORD "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$CLUSTER_ID/solr/$COLLECTION_NAME/fcselect?ranker_id=$RANKER_ID&q=businwss&wt=json&fl=id,title,score&spellcheck=true"

{"responseHeader":{"status":0,"QTime":4},"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,"endOffset":8,"suggestion":["business"]}]}}

You should see something similar to the above. The SOLR search failed to return any results for the ranker to rank. However, it has come up with a spelling correction which should return more results for ranking next time.
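One way to act on this is to rewrite the user's query with the top suggestion for each misspelt term and send it to fcselect again. The Python sketch below does the rewriting step using the startOffset and endOffset values from the response. It is a simplified illustration with a function name of my own, and it always picks the first suggestion, which may not be what you want in a real application.

```python
def apply_suggestions(query, suggestions):
    """Replace each misspelt term in query with its first suggested correction."""
    # The suggestions list is flat: [term, info, term, info, ...].
    infos = [info for info in suggestions[1::2]
             if isinstance(info, dict) and "startOffset" in info]
    # Work from right to left so that earlier offsets stay valid after each edit.
    infos.sort(key=lambda info: info["startOffset"], reverse=True)
    corrected = query
    for info in infos:
        options = info.get("suggestion", [])
        if not options:
            continue
        best = options[0]
        if isinstance(best, dict):  # extendedResults format: {"word": ..., "freq": ...}
            best = best["word"]
        corrected = corrected[:info["startOffset"]] + best + corrected[info["endOffset"]:]
    return corrected


# The suggestions list from the fcselect response in this post.
suggestions = ["businwss", {"numFound": 1, "startOffset": 0, "endOffset": 8,
                            "suggestion": ["business"]}]
print(apply_suggestions("businwss", suggestions))
# business
```

You could then either re-run the search automatically with the corrected query or show it to the user as a suggestion and let them decide.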