---
title: Retrieve and Rank and Python
author: James
type: post
date: 2015-11-16T18:25:39+00:00
url: /2015/11/16/retrieve-and-rank-and-python/
categories:
- Work
tags:
- api
- cloud
- custom
- developer
- ecosystem
- fcselect
- ibm
- python
- query
- rank
- ranker
- retrieve
- services
- solr
- train
- watson
- wdc
---
## Introduction
Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has written a high-level summary of how it works [here][1].
R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns which answers are most relevant to an input query and presents them in the learnt “relevance” order.
Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with [solrpy][2], a Python wrapper around Apache SOLR. Since R&R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R, given a few special tweaks.
## Getting Started
**You will need** an R&R instance with a cluster and collection already configured. I’m using a schema which has three fields: id, title and text.
Firstly you’ll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you’ll need to install it from my GitHub repo:
<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>
The next step is to run Python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:
<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;</pre>
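Filling in that template by hand is easy to get wrong, so here is a minimal sketch of a helper that builds the URL for you. The function name `collection_url` and its arguments are my own illustration and not part of solrpy; only the URL layout comes from the structure shown above.

```python
# Hypothetical helper: the name and arguments are illustrative only.
# The URL layout itself is the one described in this post.
BASE_URL = "https://gateway.watsonplatform.net/retrieve-and-rank/api"


def collection_url(cluster_id, collection_name, base_url=BASE_URL):
    """Return the SOLR endpoint for one collection in one R&R cluster."""
    return "{0}/v1/solr_clusters/{1}/solr/{2}".format(
        base_url.rstrip("/"), cluster_id, collection_name
    )


# Placeholder values; substitute your own cluster ID and collection name.
print(collection_url("my_cluster_id", "my_collection"))
```

The resulting string is what you would pass as the first argument to `solr.Solr(...)`.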
You will also need the credentials from your Bluemix service, which should look something like this:
<pre>{
"credentials": {
"url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
"username": "&lt;USERNAME&gt;",
"password": "&lt;PASSWORD&gt;"
}
}</pre>
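If you would rather not paste the username and password into your code, you could read them out of that JSON instead. This is only a sketch: `load_credentials` is a hypothetical helper, and the values in `RAW` are placeholders in the same shape as the credentials above.

```python
import json

# Placeholder credentials in the same shape as the Bluemix JSON above.
RAW = """
{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "my-username",
    "password": "my-password"
  }
}
"""


def load_credentials(raw_json):
    """Hypothetical helper: return (url, username, password) from the JSON."""
    creds = json.loads(raw_json)["credentials"]
    return creds["url"], creds["username"], creds["password"]


url, username, password = load_credentials(RAW)
print(url)
```

The `username` and `password` values are what go into the `http_user` and `http_pass` arguments when you create the connection.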
In Python you should try running the following (I am using the interactive Python shell [IDLE][3] for this example):
<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; s.search("hello world")
<em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em></pre>
If this worked then you will see something like _**&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;**_ as output here. If you get an error response, try checking that you have substituted in valid values for &lt;CLUSTER_ID&gt;, &lt;COLLECTION_NAME&gt;, &lt;USERNAME&gt; and &lt;PASSWORD&gt;.
From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.
To add a document you can use the code below:
<pre>&gt;&gt;&gt; s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;167&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong>
&gt;&gt;&gt; s.commit()
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;68&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
The XML output shows that the initial add and then commit operations were both successful.
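Rather than reading the XML by eye, you could pull the status code out programmatically; SOLR reports a status of 0 in the response header when an operation succeeds. The sketch below uses only the standard library, and `solr_status` is a hypothetical helper name rather than anything solrpy provides. The sample string is the add response shown above.

```python
import xml.etree.ElementTree as ET

# The response string returned by the add call shown above.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<response>\n'
    '<lst name="responseHeader"><int name="status">0</int>'
    '<int name="QTime">167</int></lst>\n</response>\n'
)


def solr_status(response_xml):
    """Hypothetical helper: return the integer status from a SOLR XML response."""
    root = ET.fromstring(response_xml)
    for node in root.iter("int"):
        if node.get("name") == "status":
            return int(node.text)
    raise ValueError("no status field found in response")


print(solr_status(SAMPLE))  # prints 0
```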
## Content Management
You can also add a number of documents at once, which is particularly useful if you have a large number of Python objects to insert into SOLR:
<pre>&gt;&gt;&gt; s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;20&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
Of course you can also delete items by their ID from Python:
<pre>&gt;&gt;&gt; s.delete(id=1)
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;43&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
## Querying SOLR (unranked results)
You can run SOLR queries too. Importantly, note that this does not use the Retrieve and Rank rankers; it only gives you access to the SOLR rankers.
<pre>&gt;&gt;&gt; r = s.select("test")
&gt;&gt;&gt; r.numFound
<strong>1L</strong>
&gt;&gt;&gt; r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong></pre>
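Notice that the `title` and `text` fields come back as single-item lists in the output above. If that gets in your way, a small helper can unwrap them. This is only a sketch: `flatten_doc` is a hypothetical name, and the sample data is the result shown above with the Python 2 `u` and `L` markers dropped.

```python
def flatten_doc(doc):
    """Hypothetical helper: unwrap single-item list values in a result document."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            flat[key] = value
    return flat


# Sample data copied from the r.results output above.
sample_results = [
    {
        "_version_": 1518020997236654080,
        "text": ["this is a test"],
        "score": 0.0,
        "id": "1",
        "title": ["test"],
    }
]

for doc in sample_results:
    flat = flatten_doc(doc)
    print(flat["id"], flat["title"], flat["text"])
```

With a live connection you would loop over `r.results` instead of `sample_results`.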
## Querying Rankers
Provided you have [successfully trained a ranker][4] and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.
<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; fcselect = solr.SearchHandler(s, "/fcselect")
&gt;&gt;&gt; r = fcselect("my query text", ranker_id="&lt;RANKER-ID&gt;")</pre>
In this case **r** is the same type as in the non-ranker example above, and you can access the results via **r.results**.
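Since both handlers return the same kind of response, one way to see what your ranker is doing is to compare the order of document IDs from each. The sketch below is illustrative only: `compare_rankings` and `result_ids` are hypothetical helpers, the two `fake_` functions stand in for `s.select` and `fcselect` so the example runs offline, and their results are made up for the demonstration.

```python
from types import SimpleNamespace


def result_ids(response):
    """Return the document IDs from a response, in the order given."""
    return [doc["id"] for doc in response.results]


def compare_rankings(select, fcselect, query, ranker_id):
    """Hypothetical helper: run one query through both handlers and return both ID orders."""
    plain = result_ids(select(query))
    ranked = result_ids(fcselect(query, ranker_id=ranker_id))
    return plain, ranked


# Stand-ins for s.select and fcselect; the results are invented for this demo.
def fake_select(query):
    return SimpleNamespace(results=[{"id": "1"}, {"id": "2"}])


def fake_fcselect(query, ranker_id=None):
    return SimpleNamespace(results=[{"id": "2"}, {"id": "1"}])


plain, ranked = compare_rankings(
    fake_select, fake_fcselect, "my query text", "hypothetical-ranker-id"
)
print("SOLR order:  ", plain)
print("Ranker order:", ranked)
```

With a live connection you would pass `s.select` and your own `fcselect` handler in place of the stand-ins.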
## More information
For more information on how to use solrpy, visit its documentation page [here][5].
[1]: http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/
[2]: https://github.com/edsu/solrpy
[3]: https://en.wikipedia.org/wiki/IDLE_(Python)
[4]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train
[5]: http://pythonhosted.org/solrpy/