author: James
date: 2015-11-16 18:25:39+00:00
post_meta: date
tags: work, python, watson
title: Retrieve and Rank and Python
type: posts
url: /2015/11/16/retrieve-and-rank-and-python/

Introduction

Retrieve and Rank (R&R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has written a high-level summary of how it works.

R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order.

Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with solrpy, a Python wrapper around Apache SOLR. Since R&R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R, given a few special tweaks.

Getting Started

**You will need**

An R&R instance with a cluster and collection already configured. I'm using a schema which has three fields: id, title and text.

Firstly you'll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you'll need to install it from my GitHub repo:

$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install

The next step is to run python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:

https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>
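
If you switch between clusters or collections, it can be handy to assemble this URL in code rather than by hand. The snippet below is a minimal sketch: the helper name `collection_url` and the example IDs are made up for illustration, and the only thing taken from this post is the URL structure shown above.

```python
# Hypothetical helper (not part of solrpy): fills the cluster ID and
# collection name into the URL structure shown above.
DEFAULT_API_ROOT = "https://gateway.watsonplatform.net/retrieve-and-rank/api"


def collection_url(cluster_id, collection_name, api_root=DEFAULT_API_ROOT):
    """Return the SOLR collection URL for an R&R cluster and collection."""
    return "{0}/v1/solr_clusters/{1}/solr/{2}".format(
        api_root.rstrip("/"), cluster_id, collection_name)


# Placeholder values; substitute your own cluster ID and collection name.
print(collection_url("my_cluster_id", "my_collection"))
```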

You will also need the credentials from your bluemix service which should look something like this:

{
 "credentials": {
 "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
 "username": "<USERNAME>",
 "password": "<PASSWORD>"
 }
}
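
Rather than pasting the username and password into your code, you could read them out of that JSON. This is only a sketch: it assumes the credentials have exactly the shape shown above, and `read_credentials` is an illustrative name rather than anything provided by Bluemix or solrpy.

```python
import json

# Same shape as the credentials block above, with placeholder values.
SAMPLE_CREDENTIALS = """
{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "<USERNAME>",
    "password": "<PASSWORD>"
  }
}
"""


def read_credentials(raw_json):
    """Return (url, username, password) from a credentials JSON string."""
    creds = json.loads(raw_json)["credentials"]
    return creds["url"], creds["username"], creds["password"]


api_root, username, password = read_credentials(SAMPLE_CREDENTIALS)
print(api_root)
```

The username and password can then be passed as `http_user` and `http_pass` when you create the connection.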

In Python you should try running the following (I am using the interactive Python shell IDLE for this example):

>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>", http_user="<USERNAME>", http_pass="<PASSWORD>")
>>> s.search("hello world")
<solr.core.Response object at 0x7ff77f91d7d0>

If this worked then you will see something like _**<solr.core.Response object at 0x7ff77f91d7d0>**_ as output here. If you get an error response, check that you have substituted in valid values for <CLUSTER_ID>, <COLLECTION_NAME>, <USERNAME> and <PASSWORD>.

From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.

To add a document you can use the code below:

>>> s.add({"title" : "test", "text" : "this is a test", "id" : 1})
'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">167</int></lst>\n</response>\n'
>>> s.commit()
'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">68</int></lst>\n</response>\n'

The XML output shows that the initial add and then commit operations were both successful.
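
If you are scripting this rather than typing into a shell, you may prefer to check the status in code instead of reading the XML by eye. The sketch below uses only the standard library; `response_status` is an illustrative helper, and it assumes the response always has the `responseHeader` layout shown in the output above.

```python
import xml.etree.ElementTree as ET


def response_status(xml_text):
    """Return (status, QTime) from a SOLR XML update response string."""
    root = ET.fromstring(xml_text)
    header = root.find("lst[@name='responseHeader']")
    # Each <int name="..."> child holds one numeric header value.
    values = dict((el.get("name"), int(el.text)) for el in header.findall("int"))
    return values.get("status"), values.get("QTime")


# The response string returned by s.add(...) in the example above.
sample = ('<?xml version="1.0" encoding="UTF-8"?>\n<response>\n'
          '<lst name="responseHeader"><int name="status">0</int>'
          '<int name="QTime">167</int></lst>\n</response>\n')

status, qtime = response_status(sample)
print(status, qtime)
```

A status of 0 is what the successful calls above returned; anything else is worth investigating.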

Content Management

You can also add a number of documents at once. This is especially useful if you have a large number of Python objects to insert into SOLR:

>>> s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">20</int></lst>\n</response>\n'
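
For very large lists you might not want to send everything in a single request. One option is to split the list into batches first. This is a sketch only: `chunks` is an illustrative helper, the batch size of 500 is an arbitrary assumption, and I have not measured what size works best with R&R.

```python
def chunks(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Intended use with the connection `s` from earlier (shown as comments
# so that this snippet runs without a live R&R instance):
#
#   for batch in chunks(docs, 500):
#       s.add_many(batch)
#   s.commit()

# Stand-in documents, just to show how the list is split up.
docs = [{"id": i, "title": "doc %d" % i, "text": "body %d" % i} for i in range(5)]
for batch in chunks(docs, 2):
    print([d["id"] for d in batch])
```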

Of course you can also delete items by their ID from Python:

>>> s.delete(id=1)
'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">43</int></lst>\n</response>\n'

Querying SOLR (unranked results)

You can use SOLR queries too. Importantly, note that this does not use the Retrieve and Rank rankers; it only gives you access to the standard SOLR ranking.

>>> r = s.select("test")
>>> r.numFound
1L
>>> r.results
[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]
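
Notice that title and text come back as single-element lists in the output above. If you would rather work with plain values, something like the following sketch could unwrap them. `flatten` is a made-up helper, and the sample document simply mirrors the result shown above.

```python
def flatten(doc):
    """Unwrap single-element list values; leave everything else untouched."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            flat[key] = value
    return flat


# Mirrors the document in r.results above.
sample_results = [{"_version_": 1518020997236654080,
                   "text": ["this is a test"],
                   "score": 0.0,
                   "id": "1",
                   "title": ["test"]}]

for doc in sample_results:
    flat = flatten(doc)
    print("%s: %s" % (flat["id"], flat["title"]))
```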

Querying Rankers

Provided you have successfully trained a ranker and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.

>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>", http_user="<USERNAME>", http_pass="<PASSWORD>")
>>> fcselect = solr.SearchHandler(s, "/fcselect")
>>> r = fcselect("my query text", ranker_id="<RANKER-ID>")

In this case **r** is the same type as in the non-ranker example above, so you can access the results via r.results.
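
When debugging a ranker query it can help to have the equivalent URL to hand, for example to try it with cURL. The sketch below builds one, assuming the /fcselect endpoint accepts q and ranker_id as query-string parameters, mirroring the keyword arguments in the call above. solrpy may well send the parameters differently (for instance in a POST body), so treat this as an approximation rather than a description of what the library does. `fcselect_url` is an illustrative name.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def fcselect_url(collection_url, query, ranker_id):
    """Build a GET URL for a ranker query (assumed parameter names)."""
    params = urlencode([("q", query), ("ranker_id", ranker_id)])
    return "{0}/fcselect?{1}".format(collection_url.rstrip("/"), params)


# Placeholder collection URL and ranker ID; substitute your own values.
print(fcselect_url("https://example.invalid/solr/my_collection",
                   "my query text", "my-ranker-id"))
```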

More information

For more information on how to use solrpy, visit the solrpy documentation page.