---
title: Retrieve and Rank and Python
author: James
type: post
date: 2015-11-16T18:25:39+00:00
url: /2015/11/16/retrieve-and-rank-and-python/
categories:
  - Work
tags:
  - api
  - cloud
  - custom
  - developer
  - ecosystem
  - fcselect
  - ibm
  - python
  - query
  - rank
  - ranker
  - retrieve
  - services
  - solr
  - train
  - watson
  - wdc
---

## Introduction

Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has written a high-level summary of how it works [here][1].

R&R is based on the Apache SOLR search engine, with a machine learning plugin that learns which answers are most relevant to an input query and presents results in that learnt “relevance” order.

Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with [solrpy][2], a Python wrapper around Apache SOLR. Since R&R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R, given a few special tweaks.

## Getting Started

**You will need:** an R&R instance with a cluster and collection already configured. I’m using a schema which has three fields: id, title and text.

Firstly you’ll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you’ll need to install it from my GitHub repo:

<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>

The next step is to run Python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:

<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME></pre>

You will also need the credentials from your Bluemix service, which should look something like this:

<pre>{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "<USERNAME>",
    "password": "<PASSWORD>"
  }
}</pre>

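The collection URL and the credentials block fit together mechanically, so the two can be combined in a few lines. The following is a minimal sketch, assuming the credentials JSON has the shape shown above. The helper name `collection_url` and the sample cluster and collection values are hypothetical, and the code is written for Python 3. The resulting `url`, together with the username and password from `creds`, are the three values needed to open the connection.

```python
import json


def collection_url(api_url, cluster_id, collection_name):
    """Build the collection URL from the structure shown above."""
    return "{}/v1/solr_clusters/{}/solr/{}".format(
        api_url.rstrip("/"), cluster_id, collection_name)


# Placeholder credentials in the same shape as the Bluemix block above.
raw = """
{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "<USERNAME>",
    "password": "<PASSWORD>"
  }
}
"""
creds = json.loads(raw)["credentials"]

# "my_cluster" and "my_collection" are made-up values for illustration.
url = collection_url(creds["url"], "my_cluster", "my_collection")
print(url)
```
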
In Python, try running the following (I am using the interactive Python shell [IDLE][3] for this example):

<pre>>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>", http_user="<USERNAME>", http_pass="<PASSWORD>")
>>> s.select("hello world")
<em><strong><solr.core.Response object at 0x7ff77f91d7d0></strong></em></pre>

If this worked then you will see something like _**<solr.core.Response object at 0x7ff77f91d7d0>**_ as output here. If you get an error response, try checking that you have substituted in valid values for <CLUSTER\_ID>, <COLLECTION\_NAME>, <USERNAME> and <PASSWORD>.

From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.

To add a document you can use the code below:

<pre>>>> s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">167</int></lst>\n</response>\n'</strong>
>>> s.commit()
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">68</int></lst>\n</response>\n'</strong></pre>

The XML output shows that the add and the subsequent commit were both successful: each response reports a status of 0.

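Since the schema used here has exactly three fields, it can help to build documents through one small function so that every dict passed to `s.add` has the same shape. This is only a sketch: `make_doc` is a hypothetical helper, not part of solrpy, and it assumes the id, title and text schema described earlier.

```python
def make_doc(doc_id, title, text):
    """Return a dict with the id, title and text fields used in this post."""
    if not title or not text:
        raise ValueError("title and text must both be non-empty")
    # The id is stored as a string, which is how it comes back from a query.
    return {"id": str(doc_id), "title": title, "text": text}


doc = make_doc(1, "test", "this is a test")
print(doc)
```

The resulting dict can be passed to `s.add(doc)`, followed by `s.commit()`, in the same way as the hand-written one.
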
## Content Management

You can also add a number of documents at once, which is especially useful if you have a large number of Python objects to insert into SOLR:

<pre>>>> s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">20</int></lst>\n</response>\n'</strong></pre>

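For very large lists it may be preferable to send the documents in several smaller requests rather than one big one. The following sketch shows one way to split a list into batches. The `batches` helper, the batch size and the stand-in `my_list_of_items` data are all assumptions for illustration, and the call to `s.add_many` is left as a comment because it needs a live connection.

```python
def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in data with the same keys as my_list_of_items above.
my_list_of_items = [
    {"title": "doc %d" % n, "text": "body of doc %d" % n} for n in range(5)
]

docs = [
    {"title": x["title"], "text": x["text"], "id": i}
    for i, x in enumerate(my_list_of_items)
]

sent = []
for batch in batches(docs, 2):
    # With a live connection this would be: s.add_many(batch)
    sent.append(len(batch))

# A single s.commit() after the loop would then make the documents searchable.
print(sent)
```
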
Of course, you can also delete items by ID from Python:

<pre>>>> s.delete(id=1)
<strong>'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n<lst name="responseHeader"><int name="status">0</int><int name="QTime">43</int></lst>\n</response>\n'</strong></pre>

## Querying SOLR (unranked results)

You can run SOLR queries too. Note, importantly, that this does not use the Retrieve and Rank rankers; it only gives you access to SOLR’s own ranking.

<pre>>>> r = s.select("test")
>>> r.numFound
<strong>1L</strong>
>>> r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong></pre>

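As the output shows, the title and text values come back wrapped in lists. If each field only ever holds a single value, a small helper can unwrap them. This is a sketch under that assumption: `flatten` is a hypothetical name, and the sample result is the one from the query above, rewritten as Python 3 literals.

```python
def flatten(result):
    """Unwrap single-item lists, leaving every other value untouched."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in result.items()
    }


# The result from the query above, written as Python 3 literals.
results = [{
    "_version_": 1518020997236654080,
    "text": ["this is a test"],
    "score": 0.0,
    "id": "1",
    "title": ["test"],
}]

for doc in results:
    flat = flatten(doc)
    print(flat["id"], flat["title"], flat["text"])
```
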
## Querying Rankers

Provided you have [successfully trained a ranker][4] and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.

<pre>>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/<CLUSTER_ID>/solr/<COLLECTION_NAME>", http_user="<USERNAME>", http_pass="<PASSWORD>")
>>> fcselect = solr.SearchHandler(s, "/fcselect")
>>> r = fcselect("my query text", ranker_id="<RANKER-ID>")</pre>

In this case **r** is the same type as in the non-ranker example above, so you can access the results via **r.results**.

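The `/fcselect` handler is an HTTP endpoint on the collection, so it can be useful to see roughly what URL a ranked query corresponds to, for instance when comparing against a cURL call. The sketch below builds such a URL with the standard library only (Python 3). It is not a description of what solrpy sends internally: the use of `q` and `wt` alongside `ranker_id` is my assumption about the request, and `fcselect_url` and the sample values are hypothetical.

```python
from urllib.parse import urlencode


def fcselect_url(collection_url, query, ranker_id):
    """Build a URL for a ranked query against the /fcselect handler."""
    params = urlencode({"ranker_id": ranker_id, "q": query, "wt": "json"})
    return "{}/fcselect?{}".format(collection_url.rstrip("/"), params)


# Made-up collection URL and ranker ID, purely for illustration.
base = ("https://gateway.watsonplatform.net/retrieve-and-rank/api"
        "/v1/solr_clusters/my_cluster/solr/my_collection")
ranked_url = fcselect_url(base, "my query text", "my-ranker-id")
print(ranked_url)
```
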
## More information

For more information on how to use solrpy, see its documentation [here][5].

[1]: http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/
[2]: https://github.com/edsu/solrpy
[3]: https://en.wikipedia.org/wiki/IDLE_(Python)
[4]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train
[5]: http://pythonhosted.org/solrpy/