<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Retrieve and Rank and Python - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="Retrieve and Rank and Python">
<meta itemprop="description" content="Introduction Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level here.
R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order."><meta itemprop="datePublished" content="2015-11-16T18:25:39&#43;00:00" />
<meta itemprop="dateModified" content="2015-11-16T18:25:39&#43;00:00" />
<meta itemprop="wordCount" content="653">
<meta itemprop="keywords" content="api,cloud,custom,developer,ecosystem,fcselect,ibm,python,query,rank,ranker,retrieve,services,solr,train,watson,wdc," /><meta property="og:title" content="Retrieve and Rank and Python" />
<meta property="og:description" content="Introduction Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level here.
R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2015-11-16T18:25:39&#43;00:00" />
<meta property="article:modified_time" content="2015-11-16T18:25:39&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Retrieve and Rank and Python"/>
<meta name="twitter:description" content="Introduction Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level here.
R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">16</span>
<span class="rest">Nov 2015</span>
</div>
</div>
<div class="matter">
<h1 class="title">Retrieve and Rank and Python</h1>
</div>
</div>
<div class="markdown">
<h2 id="introduction">Introduction</h2>
<p>Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level <a href="http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/">here</a>.</p>
<p>R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order.</p>
<p>Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with <a href="https://github.com/edsu/solrpy">solrpy</a>, a Python wrapper around Apache SOLR. Since R&amp;R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&amp;R, given a few special tweaks.</p>
<h2 id="getting-started">Getting Started</h2>
<p><strong>You will need</strong></p>
<p>An R&amp;R instance with a cluster and collection already configured. I'm using a schema which has three fields: id, title and text.</p>
<p>Firstly you'll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you'll need to install it from my GitHub repo:</p>
<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>
<p>The next step is to run Python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:</p>
<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;</pre>
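<p>If you would rather not paste the URL together by hand, a small helper along these lines could assemble it. This is only a sketch: the name <strong>collection_url</strong> and its parameters are made up for illustration and are not part of solrpy or the R&amp;R API.</p>
<pre>BASE_URL = "https://gateway.watsonplatform.net/retrieve-and-rank/api"


def collection_url(cluster_id, collection_name, base_url=BASE_URL):
    """Return the SOLR URL for one collection on a Retrieve and Rank cluster.

    Hypothetical helper: it only fills in the URL structure shown above.
    """
    return "{0}/v1/solr_clusters/{1}/solr/{2}".format(
        base_url.rstrip("/"), cluster_id, collection_name)</pre>
<p>The result can be passed as the first argument when creating the connection.</p>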
<p>You will also need the credentials from your Bluemix service, which should look something like this:</p>
<pre>{
"credentials": {
"url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
"username": "&lt;USERNAME&gt;",
"password": "&lt;PASSWORD&gt;"
}
}</pre>
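<p>Rather than copying the username and password by hand, you could read them out of that JSON. The sketch below assumes the credentials are available as a string with exactly the layout shown above; <strong>read_credentials</strong> is a hypothetical helper name.</p>
<pre>import json


def read_credentials(raw_json):
    """Return (username, password) from a credentials JSON string.

    Assumes the top-level "credentials" object shown above.
    """
    creds = json.loads(raw_json)["credentials"]
    return creds["username"], creds["password"]</pre>
<p>The two values map onto the <strong>http_user</strong> and <strong>http_pass</strong> arguments of the connection.</p>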
<p>In Python you should try running the following (I am using the interactive Python shell <a href="https://en.wikipedia.org/wiki/IDLE_(Python)">IDLE</a> for this example):</p>
<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; s.search("hello world")
<em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em></pre>
<p>If this worked then you will see something like <em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em> as output here. If you get an error response, try checking that you have substituted in valid values for &lt;CLUSTER_ID&gt;, &lt;COLLECTION_NAME&gt;, &lt;USERNAME&gt; and &lt;PASSWORD&gt;.</p>
<p>From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.</p>
<p>To add a document you can use the code below:</p>
<pre>&gt;&gt;&gt; s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;167&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong>
&gt;&gt;&gt; s.commit()
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;68&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<p>The XML output shows that the initial add and then commit operations were both successful.</p>
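<p>In these responses a <strong>status</strong> of 0 in the <strong>responseHeader</strong> indicates success. If you want to check that in code rather than by eye, a sketch like the following could parse the returned string. The helper name <strong>solr_status</strong> is hypothetical, and it assumes the response has the same shape as the output shown above.</p>
<pre>import xml.etree.ElementTree as ET


def solr_status(response_xml):
    """Return the integer status code from a SOLR XML response string."""
    root = ET.fromstring(response_xml)
    # The status lives inside the element named "responseHeader".
    status = root.find("lst[@name='responseHeader']/int[@name='status']")
    if status is None:
        raise ValueError("no status element found in response")
    return int(status.text)

# Example usage, assuming s is the connection created earlier:
# if solr_status(s.commit()) != 0:
#     raise RuntimeError("commit failed")</pre>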
<h2 id="content-management">Content Management</h2>
<p>You can also add a number of documents at once. This is especially useful if you have a large number of Python objects to insert into SOLR:</p>
<pre>&gt;&gt;&gt; s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;20&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
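<p>For very large lists you may prefer to send the documents in several smaller requests. The following is a sketch only: the helper names and the batch size of 500 are arbitrary choices for illustration, and it assumes each item is a dict with <strong>title</strong> and <strong>text</strong> keys, as in the example above.</p>
<pre>def to_solr_docs(items, start_id=0):
    """Turn dicts with 'title' and 'text' keys into documents with sequential ids."""
    return [{"id": i, "title": x["title"], "text": x["text"]}
            for i, x in enumerate(items, start_id)]


def batches(docs, size):
    """Yield successive slices of docs containing at most size documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

# Example usage, assuming s is the connection created earlier:
# for batch in batches(to_solr_docs(my_list_of_items), 500):
#     s.add_many(batch)
# s.commit()</pre>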
<p>Of course, you can also delete items by their ID from Python:</p>
<pre>&gt;&gt;&gt; s.delete(id=1)
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;43&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<h2 id="querying-solr-unranked-results">Querying SOLR (unranked results)</h2>
<p>You can use SOLR queries too. Note, importantly, that this does not use the Retrieve and Rank rankers; it only gives you access to SOLR's own ranking.</p>
<pre>&gt;&gt;&gt; r = s.select("test")
&gt;&gt;&gt; r.numFound
<strong>1L
</strong>&gt;&gt;&gt; r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong>
</pre>
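<p>Each result is a dictionary, and the <strong>title</strong> and <strong>text</strong> values come back as lists in the output above. A small helper could flatten the results into something easier to print. This is a sketch with hypothetical function names, assuming results shaped like the example output.</p>
<pre>def first_value(value):
    """Return the first element of a list-valued field, or the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def summarise(results):
    """Return (id, title, score) tuples for a list of SOLR result dicts."""
    return [(doc["id"], first_value(doc.get("title")), doc.get("score"))
            for doc in results]

# Example usage, assuming r is the response from the query above:
# for doc_id, title, score in summarise(r.results):
#     print(doc_id, title, score)</pre>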
<h2 id="querying-rankers">Querying Rankers</h2>
<p>Provided you have <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train">successfully trained a ranker</a> and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.</p>
<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; fcselect = solr.SearchHandler(s, "/fcselect")
&gt;&gt;&gt; r = fcselect("my query text", ranker_id="&lt;RANKER-ID&gt;")</pre>
<p>In this case <strong>r</strong> is the same type as in the non-ranker example above, so you can access the results via <strong>r.results</strong>.</p>
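<p>If you query the ranker from several places, it may be convenient to wrap the call. The sketch below uses a hypothetical helper name, <strong>ranked_ids</strong>, and works with any callable that behaves like the <strong>fcselect</strong> handler above, returning a response with a <strong>results</strong> list of dictionaries.</p>
<pre>def ranked_ids(handler, query, ranker_id):
    """Run a ranked query and return the document ids in ranked order."""
    response = handler(query, ranker_id=ranker_id)
    return [doc["id"] for doc in response.results]

# Example usage, assuming fcselect is the handler created above and
# RANKER_ID is a variable holding your own ranker ID:
# top_ids = ranked_ids(fcselect, "my query text", RANKER_ID)</pre>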
<h2 id="more-information">More information</h2>
<p>For more information on how to use solrpy, visit its documentation page <a href="http://pythonhosted.org/solrpy/">here</a>.</p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/api">api</a></li>
<li><a href="/tags/cloud">cloud</a></li>
<li><a href="/tags/custom">custom</a></li>
<li><a href="/tags/developer">developer</a></li>
<li><a href="/tags/ecosystem">ecosystem</a></li>
<li><a href="/tags/fcselect">fcselect</a></li>
<li><a href="/tags/ibm">ibm</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/query">query</a></li>
<li><a href="/tags/rank">rank</a></li>
<li><a href="/tags/ranker">ranker</a></li>
<li><a href="/tags/retrieve">retrieve</a></li>
<li><a href="/tags/services">services</a></li>
<li><a href="/tags/solr">solr</a></li>
<li><a href="/tags/train">train</a></li>
<li><a href="/tags/watson">watson</a></li>
<li><a href="/tags/wdc">wdc</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft 2020 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>