---
author: James
date: 2015-11-16 18:25:39+00:00
post_meta:
- date
tags:
- work
- python
- watson
title: Retrieve and Rank and Python
type: posts
url: /2015/11/16/retrieve-and-rank-and-python/
---

## Introduction

Retrieve and Rank (R&R), if you haven't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has written a high-level summary of how it works [here][1].

R&R is based on the Apache SOLR search engine, with a machine learning ranking plugin that learns which answers are most relevant to a given input query and presents them in the learnt "relevance" order.

Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with [solrpy][2], a Python wrapper around Apache SOLR. Since R&R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R, given a few special tweaks.

## Getting Started

**You will need:** an R&R instance with a cluster and collection already configured. I'm using a schema which has three fields: id, title and text.

Firstly you'll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you'll need to install it from my GitHub repo:

<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>

The next step is to run python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:

<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;</pre>
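
If you keep the cluster ID and collection name in variables, the URL can be assembled with a small helper. This is only a sketch: `build_collection_url` is a hypothetical name, not part of solrpy, and the values passed in are placeholders.

```python
# Hypothetical helper (not part of solrpy): builds the collection URL
# from the pattern shown above.
BASE_URL = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"


def build_collection_url(cluster_id, collection_name, base_url=BASE_URL):
    """Return the SOLR endpoint URL for one R&R collection."""
    return "{0}/solr_clusters/{1}/solr/{2}".format(
        base_url.rstrip("/"), cluster_id, collection_name
    )


# Placeholder values for illustration only.
print(build_collection_url("my_cluster_id", "my_collection"))
```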

You will also need the credentials from your Bluemix service, which should look something like this:

<pre>{
 "credentials": {
 "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
 "username": "&lt;USERNAME&gt;",
 "password": "&lt;PASSWORD&gt;"
 }
}</pre>
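
Rather than copying the username and password by hand, you could read them out of that JSON with the standard `json` module. A minimal sketch, assuming the credentials have the shape shown above; `load_credentials` is a hypothetical helper and the values are placeholders.

```python
import json

# Example credentials document with placeholder values, in the shape shown above.
raw = """
{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "my-username",
    "password": "my-password"
  }
}
"""


def load_credentials(text):
    """Return (api_url, username, password) from a credentials JSON string."""
    creds = json.loads(text)["credentials"]
    return creds["url"], creds["username"], creds["password"]


api_url, username, password = load_credentials(raw)
print(api_url)
```

The `username` and `password` values are what you would pass as `http_user` and `http_pass` when creating the connection.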

In Python, try running the following (I am using the interactive Python shell [IDLE][3] for this example):

<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; s.select("hello world")
<em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em></pre>

If this worked, you will see something like _**&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;**_ as output. If you get an error response, check that you have substituted valid values for &lt;CLUSTER\_ID&gt;, &lt;COLLECTION\_NAME&gt;, &lt;USERNAME&gt; and &lt;PASSWORD&gt;.

From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.

To add a document you can use the code below:

<pre>&gt;&gt;&gt; s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;167&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong>
&gt;&gt;&gt; s.commit()
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;68&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>

The status of 0 in each XML response shows that the add and the subsequent commit were both successful.
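
If you want to check the outcome in code rather than by eye, the response string can be parsed with the standard library. This is a sketch under the assumption that responses have the same shape as the output above; `solr_status` is a hypothetical helper, not a solrpy function.

```python
import xml.etree.ElementTree as ET


def solr_status(response_xml):
    """Return the integer status from a SOLR XML update response."""
    root = ET.fromstring(response_xml)
    node = root.find("./lst[@name='responseHeader']/int[@name='status']")
    if node is None:
        raise ValueError("no status found in response")
    return int(node.text)


# A response in the same shape as the output shown above.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<response>\n'
    '<lst name="responseHeader"><int name="status">0</int>'
    '<int name="QTime">167</int></lst>\n</response>\n'
)

if solr_status(sample) == 0:
    print("update succeeded")
```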

## Content Management

You can also add a number of documents at once. This is especially useful if you have a large number of Python objects to insert into SOLR:

<pre>&gt;&gt;&gt; s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;20&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
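
For very large lists you may prefer to send the documents in several smaller requests. The sketch below only shows the batching itself; `chunks` is a hypothetical helper, the batch size of 2 is arbitrary, and the call to `s.add_many` is left as a comment because it needs a live connection.

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in data; in practice this would be your own list of dicts.
my_list_of_items = [
    {"title": "doc %d" % n, "text": "body of document %d" % n} for n in range(5)
]

docs = [
    {"title": x["title"], "text": x["text"], "id": i}
    for i, x in enumerate(my_list_of_items)
]

for batch in chunks(docs, 2):
    # With a live connection you would call s.add_many(batch) here,
    # followed by a single s.commit() after the loop.
    print(len(batch))
```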

Of course, you can also delete items by ID from Python:

<pre>&gt;&gt;&gt; s.delete(id=1)
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;43&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>

## Querying SOLR (unranked results)

You can run standard SOLR queries too. Note, importantly, that this does not use the Retrieve and Rank rankers; it only gives you access to SOLR's own ranking.

<pre>&gt;&gt;&gt; r = s.select("test")
&gt;&gt;&gt; r.numFound
<strong>1L</strong>
&gt;&gt;&gt; r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong>

</pre>
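
Notice that `title` and `text` come back as single-element lists in the output above, presumably because those fields are multi-valued in this schema. If that gets in the way, a small helper can unwrap them. This is a sketch only: `flatten` is a hypothetical name, and the sample result is copied from the output above with the Python 2 `u` prefixes and `L` suffix dropped.

```python
def flatten(doc):
    """Replace single-element list values with the element itself."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in doc.items()
    }


# A result in the shape shown above.
results = [
    {"_version_": 1518020997236654080, "text": ["this is a test"],
     "score": 0.0, "id": "1", "title": ["test"]}
]

for doc in results:
    flat = flatten(doc)
    print(flat["id"], flat["title"], flat["text"])
```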

## Querying Rankers

Provided you have [successfully trained a ranker][4] and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.

<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; fcselect = solr.SearchHandler(s, "/fcselect")
&gt;&gt;&gt; r = fcselect("my query text", ranker_id="&lt;RANKER-ID&gt;")</pre>

In this case **r** is the same type as in the non-ranker example above, so you can access the results via **r.results**.

## More information

For more information on how to use solrpy, see the documentation [here][5].

 [1]: http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/
 [2]: https://github.com/edsu/solrpy
 [3]: https://en.wikipedia.org/wiki/IDLE_(Python)
 [4]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train
 [5]: http://pythonhosted.org/solrpy/