---
title: Retrieve and Rank and Python
author: James
type: post
date: 2015-11-16T18:25:39+00:00
url: /2015/11/16/retrieve-and-rank-and-python/
tags:
  - work
  - python
  - watson
---

## Introduction

Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level [here][1].

R&R is based on the Apache SOLR search engine, with a machine learning ranking plugin that learns which answers are most relevant to an input query and presents them in the learnt “relevance” order.

Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with [solrpy][2], a Python wrapper around Apache SOLR. Since R&R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R, given a few special tweaks.

## Getting Started

**You will need:** an R&R instance with a cluster and collection already configured. I’m using a schema which has three fields: id, title and text.

Firstly you’ll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you’ll need to install it from my GitHub repo:

<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>

The next step is to run Python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:

<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME></pre>

You will also need the credentials from your Bluemix service, which should look something like this:

<pre>{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "<USERNAME>",
    "password": "<PASSWORD>"
  }
}</pre>

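
The connection URL is simply the `url` value from those credentials with the cluster and collection path appended, so it can be assembled in code instead of pasted together by hand. Here is a minimal sketch, assuming the credentials JSON has the shape shown above. The helper name `build_solr_url` and the example cluster and collection names are my own illustrations, not part of solrpy or R&R:

```python
import json


def build_solr_url(credentials_json, cluster_id, collection_name):
    """Return (url, username, password) for one R&R collection.

    Hypothetical helper: assumes the credentials JSON shape shown above.
    """
    creds = json.loads(credentials_json)["credentials"]
    base = creds["url"].rstrip("/")
    url = "{0}/v1/solr_clusters/{1}/solr/{2}".format(
        base, cluster_id, collection_name)
    return url, creds["username"], creds["password"]


# Example with made-up placeholder values
example = json.dumps({"credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "my-user",
    "password": "my-pass"}})
url, user, password = build_solr_url(example, "sc_example", "my_collection")
print(url)
```
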
In Python, try running the following (I am using the interactive Python shell [IDLE][3] for this example):

<pre>>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>", http_user="<USERNAME>", http_pass="<PASSWORD>")
>>> s.search("hello world")
<em><strong><solr.core.Response object at 0x7ff77f91d7d0></strong></em></pre>

If this worked, you will see something like _**\<solr.core.Response object at 0x7ff77f91d7d0\>**_ as output. If you get an error response, check that you have substituted valid values for \<CLUSTER\_ID\>, \<COLLECTION\_NAME\>, \<USERNAME\> and \<PASSWORD\>.

From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.

To add a document you can use the code below:

<pre>>>> s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">167</int></lst>\n</response>\n'</strong>
>>> s.commit()
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">68</int></lst>\n</response>\n'</strong></pre>

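
Both calls hand back SOLR's raw XML response as a string. If you would rather check the status in code than by eye, something like the sketch below should do. It assumes the response has the same shape as the output above. The helper name `solr_status` is hypothetical and not part of solrpy:

```python
import xml.etree.ElementTree as ET


def solr_status(response_xml):
    """Return the integer status from a SOLR XML update response."""
    # Parse from bytes so the XML encoding declaration is honoured
    if not isinstance(response_xml, bytes):
        response_xml = response_xml.encode("utf-8")
    root = ET.fromstring(response_xml)
    node = root.find("./lst[@name='responseHeader']/int[@name='status']")
    if node is None:
        raise ValueError("no status found in response")
    return int(node.text)


# Sample response copied from the add() call above
sample = ('<?xml version="1.0" encoding="UTF-8"?>\n<response>\n'
          '<lst name="responseHeader"><int name="status">0</int>'
          '<int name="QTime">167</int></lst>\n</response>\n')
status = solr_status(sample)
print(status)
```
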
A status of 0 in each XML response shows that the add and commit operations were both successful.

## Content Management

You can also add a number of documents at once, which is especially useful if you have a large number of Python objects to insert into SOLR:

<pre>>>> s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">20</int></lst>\n</response>\n'</strong></pre>

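
For very large lists you may prefer to send documents in batches rather than in a single request. This is an assumption on my part, not something solrpy or R&R requires. In the sketch below `batched_docs` is a hypothetical helper and the batch size is arbitrary. It builds the same dictionaries as the list comprehension above, and with a live connection each batch could be passed to `s.add_many`, followed by one `s.commit()` at the end:

```python
def batched_docs(items, batch_size=100):
    """Yield lists of SOLR documents, at most batch_size per list."""
    batch = []
    for i, x in enumerate(items):
        batch.append({"title": x["title"], "text": x["text"], "id": i})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit whatever is left over after the last full batch
        yield batch


# Made-up input data standing in for my_list_of_items
my_list_of_items = [{"title": "doc %d" % n, "text": "body %d" % n}
                    for n in range(5)]
batches = list(batched_docs(my_list_of_items, batch_size=2))
print([len(b) for b in batches])
```
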
You can also delete items by their ID from Python:

<pre>>>> s.delete(id=1)
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">43</int></lst>\n</response>\n'</strong></pre>

## Querying SOLR (unranked results)

You can run SOLR queries too. Importantly, note that this does not use the Retrieve and Rank rankers; it only gives you access to SOLR's own ranking.

<pre>>>> r = s.select("test")
>>> r.numFound
<strong>1L</strong>
>>> r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong></pre>

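
Notice that title and text come back as single-item lists in the output above, presumably because those fields are multi-valued in my schema. If you would rather work with plain values, a small sketch like this can unwrap them. The helper name `flatten_result` is hypothetical, and the sample dictionary simply mirrors the result shown above:

```python
def flatten_result(doc):
    """Return a copy of a result dict with single-item lists unwrapped."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            flat[key] = value
    return flat


# Sample mirroring one entry of r.results from the example above
sample = {"_version_": 1518020997236654080, "text": ["this is a test"],
          "score": 0.0, "id": "1", "title": ["test"]}
flat = flatten_result(sample)
print(flat["title"], flat["text"])
```
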
## Querying Rankers

Provided you have [successfully trained a ranker][4] and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.

<pre>>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>", http_user="<USERNAME>", http_pass="<PASSWORD>")
>>> fcselect = solr.SearchHandler(s, "/fcselect")
>>> r = fcselect("my query text", ranker_id="<RANKER-ID>")</pre>

In this case **r** is the same type as in the non-ranker example above, and you can access the results via **r.results**.

## More information

For more information on how to use solrpy, visit its documentation page [here][5].

 [1]: http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/
 [2]: https://github.com/edsu/solrpy
 [3]: https://en.wikipedia.org/wiki/IDLE_(Python)
 [4]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train
 [5]: http://pythonhosted.org/solrpy/