<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Retrieve and Rank and Python - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="Retrieve and Rank and Python">
<meta itemprop="description" content="Introduction Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here.
R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order."><meta itemprop="datePublished" content="2015-11-16T18:25:39+00:00" />
<meta itemprop="dateModified" content="2015-11-16T18:25:39+00:00" />
<meta itemprop="wordCount" content="653">
<meta itemprop="keywords" content="api,cloud,custom,developer,ecosystem,fcselect,ibm,python,query,rank,ranker,retrieve,services,solr,train,watson,wdc," /><meta property="og:title" content="Retrieve and Rank and Python" />
<meta property="og:description" content="Introduction Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here.
R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2015-11-16T18:25:39+00:00" />
<meta property="article:modified_time" content="2015-11-16T18:25:39+00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Retrieve and Rank and Python"/>
<meta name="twitter:description" content="Introduction Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here.
R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament. My views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">16</span>
<span class="rest">Nov 2015</span>
</div>
</div>
<div class="matter">
<h1 class="title">Retrieve and Rank and Python</h1>
</div>
</div>
<div class="markdown">
<h2 id="introduction">Introduction</h2>
<p>Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level <a href="http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/">here</a>.</p>
<p>R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns which answers are most relevant given an input query and presents them in the learnt “relevance” order.</p>
<p>Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with <a href="https://github.com/edsu/solrpy">solrpy</a> – a Python wrapper around Apache SOLR. Since R&R and SOLR are API-compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R, with a few special tweaks.</p>
<h2 id="getting-started">Getting Started</h2>
<p><strong>You will need:</strong> an R&R instance with a cluster and collection already configured. I’m using a schema which has three fields – id, title and text.</p>
<p>Firstly you’ll want to install the library – normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you’ll need to install it from my github repo:</p>
<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>
<p>The next step is to run Python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:</p>
<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;</pre>
<p>You will also need the credentials from your bluemix service, which should look something like this:</p>
<pre>{
  "credentials": {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "&lt;USERNAME&gt;",
    "password": "&lt;PASSWORD&gt;"
  }
}</pre>
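<p>From those credentials you can assemble the full collection URL described above with plain string formatting. A minimal sketch – the cluster ID and collection name below are made-up placeholders, not real values:</p>

```python
# Build the full collection URL from the service credentials shown above.
# cluster_id and collection_name are hypothetical placeholders - substitute
# the values for your own cluster and collection.
credentials = {
    "url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
    "username": "<USERNAME>",
    "password": "<PASSWORD>",
}
cluster_id = "sc_example_cluster"        # hypothetical
collection_name = "example_collection"   # hypothetical

solr_url = "%s/v1/solr_clusters/%s/solr/%s" % (
    credentials["url"], cluster_id, collection_name)
print(solr_url)
```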
<p>In Python you should try running the following (I am using the interactive Python shell <a href="https://en.wikipedia.org/wiki/IDLE_(Python)">IDLE</a> for this example):</p>
<pre>>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
>>> s.search("hello world")
<em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em></pre>
<p>If this worked then you will see something like <em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em> as output here. If you get an error response, try checking that you have substituted in valid values for &lt;CLUSTER_ID&gt;, &lt;COLLECTION_NAME&gt;, &lt;USERNAME&gt; and &lt;PASSWORD&gt;.</p>
<p>From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.</p>
<p>To add a document you can use the code below:</p>
<pre>>>> s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;167&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong>
>>> s.commit()
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;68&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<p>The XML output shows that the initial add and then commit operations were both successful.</p>
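<p>Rather than eyeballing the raw XML, you can check the status code programmatically. A small helper of my own (not part of solrpy) using only the standard library, run here against the commit response shown above – a status of 0 means success:</p>

```python
import xml.etree.ElementTree as ET

def solr_status(response_xml):
    """Return the integer status from a SOLR update response (0 = success)."""
    root = ET.fromstring(response_xml)
    # The header is <lst name="responseHeader"> holding <int name="status">.
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    return int(status.text)

# The commit response string from the example above.
resp = ('<?xml version="1.0" encoding="UTF-8"?>\n<response>\n'
        '<lst name="responseHeader"><int name="status">0</int>'
        '<int name="QTime">68</int></lst>\n</response>\n')
print(solr_status(resp))  # 0
```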
<h2 id="content-management">Content Management</h2>
<p>You can also add a number of documents at once – this is particularly useful if you have a large number of Python objects to insert into SOLR:</p>
<pre>>>> s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;20&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
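<p>For very large document sets it can help to send add_many in batches rather than one enormous request, committing once at the end. A sketch of a batching helper (my own convenience function, not part of solrpy):</p>

```python
def batches(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Example: 2500 documents split into batches of 1000.
docs = [{"id": i, "title": "doc %d" % i, "text": "body %d" % i}
        for i in range(2500)]
batch_sizes = [len(b) for b in batches(docs, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

<p>With a live connection you would call <strong>s.add_many(batch)</strong> for each batch and then <strong>s.commit()</strong> once after the final batch.</p>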
<p>Of course you can also delete items via their ID from Python too:</p>
<pre>>>> s.delete(id=1)
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;43&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<h2 id="querying-solr-unranked-results">Querying SOLR (unranked results)</h2>
<p>You can use SOLR queries too – but note that this does not use the Retrieve and Rank rankers; it only gives you access to the standard SOLR ranking.</p>
<pre>>>> r = s.select("test")
>>> r.numFound
<strong>1L</strong>
>>> r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong></pre>
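<p>Note that SOLR returns multi-valued fields such as <strong>text</strong> and <strong>title</strong> as lists, as you can see in the output above. A small helper of my own (assuming your schema uses them as single-valued) to flatten each hit into plain values:</p>

```python
def flatten_hit(hit):
    """Collapse one-element list values so hits are easier to work with."""
    return {k: v[0] if isinstance(v, list) and len(v) == 1 else v
            for k, v in hit.items()}

# A hit shaped like the r.results entry shown above.
hit = {'_version_': 1518020997236654080, 'text': ['this is a test'],
       'score': 0.0, 'id': '1', 'title': ['test']}
print(flatten_hit(hit)['text'])  # this is a test
```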
<h2 id="querying-rankers">Querying Rankers</h2>
<p>Provided you have <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train">successfully trained a ranker</a> and have the ranker ID handy, you can also query your ranker directly from Python using solrpy too.</p>
<pre>>>> import solr
>>> s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
>>> fcselect = solr.SearchHandler(s, "/fcselect")
>>> r = fcselect("my query text", ranker_id="&lt;RANKER-ID&gt;")</pre>
<p>In this case <strong>r</strong> is the same type as in the non-ranker example above; you can access the results via <strong>r.results</strong>.</p>
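<p>Because the ranker returns hits in its learnt relevance order, the best answers are simply the first entries of <strong>r.results</strong>. A sketch over made-up hits – the field names match the schema used earlier, but the documents themselves are hypothetical:</p>

```python
def top_answers(hits, n=3):
    """Return (id, title) pairs for the first n ranked hits."""
    return [(h["id"], h["title"][0]) for h in hits[:n]]

# Hypothetical ranked results, best answer first.
hits = [
    {"id": "2", "title": ["Resetting your password"], "text": ["..."]},
    {"id": "5", "title": ["Billing FAQ"], "text": ["..."]},
]
print(top_answers(hits, 1))  # [('2', 'Resetting your password')]
```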
<h2 id="more-information">More information</h2>
<p>For more information on how to use solrpy, visit their documentation page <a href="http://pythonhosted.org/solrpy/">here</a>.</p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/api">api</a></li>
<li><a href="/tags/cloud">cloud</a></li>
<li><a href="/tags/custom">custom</a></li>
<li><a href="/tags/developer">developer</a></li>
<li><a href="/tags/ecosystem">ecosystem</a></li>
<li><a href="/tags/fcselect">fcselect</a></li>
<li><a href="/tags/ibm">ibm</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/query">query</a></li>
<li><a href="/tags/rank">rank</a></li>
<li><a href="/tags/ranker">ranker</a></li>
<li><a href="/tags/retrieve">retrieve</a></li>
<li><a href="/tags/services">services</a></li>
<li><a href="/tags/solr">solr</a></li>
<li><a href="/tags/train">train</a></li>
<li><a href="/tags/watson">watson</a></li>
<li><a href="/tags/wdc">wdc</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
    if (window.location.hostname == "localhost")
        return;

    var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
    var disqus_shortname = 'brainsteam';
    dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
    (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments powered by Disqus.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
    window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
    ga('create', 'UA-186263385-1', 'auto');
    ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>