---
author: James
date: 2015-11-17 21:41:09+00:00
post_meta:
- date
preview: /social/0b4b017b3fb6b36b3c48b72fd744ab10b01dc723d8c9ec4414e9aa308b8c5494.png
tags:
- watson
- work
title: Spellchecking in retrieve and rank
type: posts
url: /2015/11/17/spellchecking-in-retrieve-and-rank/
---

### Introduction

Being able to deal with typos and incorrect spellings is an absolute must in any modern search facility. Humans can be lazy and clumsy and I personally often search for things with incorrect terms due to my sausage fingers. In this article I will explain how to turn on spelling suggestions in retrieve and rank so that if your users ask your system for something with a clumsy query, you can suggest spelling fixes for them so that they can submit another, more fruitful question to the system.

Spellchecking is a standard feature of Apache SOLR which is turned off by default with Retrieve and Rank. This post will walk through the process of turning it on for your instance and enabling spell checking suggestions to be returned as part of calls to rankers through fcselect.

Massive shout out to David Duffett on Stack Overflow who posted [this answer][1] from which most of my blog post is derived.

### Enabling spell checking in your schema

The first thing we need to do is set up a spell checker field in our SOLR schema. For the sake of simplicity, the example schema used below only has a title and text field which are used in indexing and querying. However, this methodology can easily be extended to as many fields as your use case requires.

### Set up field type

Start by defining a “textSpell” field type which SOLR can use to build a field into which it can dump valid words from your corpus that have been preprocessed and made ready for use in the spell checker. Create the following element in **your schema.xml** file:
```xml
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <!-- Index time: tokenize, drop stopwords, lower-case -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
  <!-- Query time: same as above, but expand synonyms first -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
</fieldType>
```

At index time this field type ignores any stopwords defined in stopwords.txt and runs a lower case filter over the remaining words before storing the output in the field. At query time it also expands any synonyms defined in synonyms.txt. This should give us a list of lower case words that are useful in search and spell checking.

### Create a spellcheck copy field in your schema

The next step is to create a “spell” field of type “textSpell” in your SOLR schema that stores the “suggestions” from the main content to be used by the spellchecker API. The following XML defines the field in your schema and should be copied **into schema.xml.** It assumes that you have a couple of content fields called “title” and “text” from which content can be copied and filtered for use in the spell checker.
```xml
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" />
<copyField source="title" dest="spell"/>
<copyField source="text" dest="spell"/>
```

### Defining the spellcheck search component

Once you have finished setting up your schema, you can define the spellchecker parameters **in solrconfig.xml.**

The following XML defines two spelling analysers. The first is DirectSolrSpellChecker, which pulls search terms directly from the index ad hoc. This means that it does not need to be regularly reindexed or rebuilt and always has up-to-date spelling suggestions.

[`WordBreakSolrSpellChecker` offers suggestions by combining adjacent query terms and/or breaking terms into multiple words.][2] This means that it can provide suggestions that DirectSolrSpellChecker might not find, for example where a user has run two words together or split one word in two in a multi-word search term.

Notice that both lst elements contain a
```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
```

### Add spelling suggestions to your request handlers

The default SOLR approach is to add a new request handler that deals with searches on the **/spell** endpoint. However, there is no reason why you can’t add spelling suggestions to any endpoint including **/select** and, perhaps more relevantly in retrieve and rank, **/fcselect**. Below is a snippet of XML for a custom /spell endpoint:
```xml
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <!-- Solr will use suggestions from both the 'default' spellchecker
         and from the 'wordbreak' spellchecker and combine them.
         Collations (re-written queries) can include a combination of
         corrections from both spellcheckers -->
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

The following snippet adds spellchecking suggestions to the **/fcselect** endpoint. Simply append the XML inside the _**
```xml
<requestHandler name="/fcselect" class="com.ibm.watson.hector.plugins.ss.FCSearchHandler">
  <lst name="defaults">
    <str name="defType">fcQueryParser</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="last-components">
    <str>fcFeatureGenerator</str>
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

### Create and populate your SOLR index in Retrieve and Rank

If you haven’t done this before, you should really read the [official documentation][3] and may want to read [my post about using python to do it too.][4] You should also [train a ranker][5] so that you can take advantage of the fcselect with spelling suggestions example below.

### Test your new spelling suggester

Once you’ve got your collection up and running you should be able to try out the new spelling suggester. First we’ll inspect **/spell:**
```shell
$ curl -u "$USER:$PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$CLUSTER_ID/solr/$COLLECTION_NAME/spell?q=businwss&wt=json"
{"responseHeader":{"status":0,"QTime":4},"response":{"numFound":0,"start":0,"docs":[]},"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,"endOffset":8,"origFreq":0,"suggestion":[{"word":"business","freq":3}]}],"correctlySpelled":false,"collations":["collation",{"collationQuery":"business","hits":3,"misspellingsAndCorrections":["businwss","business"]}]}}
```

As you can see, the system has not found any documents containing the word **businwss**. However, it has identified **businwss** (easily misspelt because ‘e’ and ‘w’ are next to each other) as a typo of **business** and suggested **business** as a correction. This can be presented back to the user so that they can refine their search and be presented with more results.

Now let’s also look at how to use spellcheck with your ranker results.
```shell
$ curl -u "$USER:$PASSWORD" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$CLUSTER_ID/solr/$COLLECTION_NAME/fcselect?ranker_id=$RANKER_ID&q=businwss&wt=json&fl=id,title,score&spellcheck=true"
{"responseHeader":{"status":0,"QTime":4},"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,"endOffset":8,"suggestion":["business"]}]}}
```

You should see something similar to the above. The SOLR search failed to return any results for the ranker to rank. However, it has come up with a spelling correction which should return more results for ranking next time.

[1]: http://stackoverflow.com/questions/6653186/solr-suggester-not-returning-any-results
[2]: https://cwiki.apache.org/confluence/display/solr/Spell+Checking
[3]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml
[4]: https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/
[5]: https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/plugin_overview.shtml#generate_queries
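
To present these suggestions back to a user you first have to pick them out of the JSON. The following is only a rough sketch in Python of how that might be done. The helper names `pairs`, `extract_suggestions` and `extract_collations` are hypothetical and not part of SOLR or any Retrieve and Rank client. It assumes the response shape shown in this post, where `suggestions` and `collations` are flat lists of alternating keys and values; other SOLR versions or response writer settings may lay the data out differently.

```python
import json

# Sample response copied from the /spell example in this post.
SAMPLE_RESPONSE = (
    '{"responseHeader":{"status":0,"QTime":4},'
    '"response":{"numFound":0,"start":0,"docs":[]},'
    '"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,'
    '"endOffset":8,"origFreq":0,"suggestion":[{"word":"business","freq":3}]}],'
    '"correctlySpelled":false,'
    '"collations":["collation",{"collationQuery":"business","hits":3,'
    '"misspellingsAndCorrections":["businwss","business"]}]}}'
)


def pairs(flat):
    """Turn a flat [key, value, key, value, ...] list into (key, value) tuples."""
    return list(zip(flat[0::2], flat[1::2]))


def extract_suggestions(response):
    """Map each misspelt term to its list of suggested words."""
    spellcheck = response.get("spellcheck", {})
    result = {}
    for term, info in pairs(spellcheck.get("suggestions", [])):
        if not isinstance(info, dict):
            continue  # skip entries that are not per-term suggestion blocks
        words = []
        for suggestion in info.get("suggestion", []):
            # With extended results each suggestion is {"word": ..., "freq": ...};
            # without them (as in the /fcselect example) it is a plain string.
            words.append(suggestion["word"] if isinstance(suggestion, dict) else suggestion)
        result[term] = words
    return result


def extract_collations(response):
    """Return the re-written queries, if the response contains any."""
    spellcheck = response.get("spellcheck", {})
    queries = []
    for key, value in pairs(spellcheck.get("collations", [])):
        if key != "collation":
            continue
        queries.append(value["collationQuery"] if isinstance(value, dict) else value)
    return queries


if __name__ == "__main__":
    parsed = json.loads(SAMPLE_RESPONSE)
    print(extract_suggestions(parsed))  # {'businwss': ['business']}
    print(extract_collations(parsed))   # ['business']
```

The same `extract_suggestions` helper should also cope with the shorter **/fcselect** response, where each suggestion is a plain string and there is no `collations` entry, in which case `extract_collations` simply returns an empty list.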