<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Retrieve and Rank and Python - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="Retrieve and Rank and Python">
<meta itemprop="description" content="Introduction Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here.
R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order."><meta itemprop="datePublished" content="2015-11-16T18:25:39&#43;00:00" />
<meta itemprop="dateModified" content="2015-11-16T18:25:39&#43;00:00" />
<meta itemprop="wordCount" content="653">
<meta itemprop="keywords" content="api,cloud,custom,developer,ecosystem,fcselect,ibm,python,query,rank,ranker,retrieve,services,solr,train,watson,wdc," /><meta property="og:title" content="Retrieve and Rank and Python" />
<meta property="og:description" content="Introduction Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here.
R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2015-11-16T18:25:39&#43;00:00" />
<meta property="article:modified_time" content="2015-11-16T18:25:39&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Retrieve and Rank and Python"/>
<meta name="twitter:description" content="Introduction Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here.
R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">16</span>
<span class="rest">Nov 2015</span>
</div>
</div>
<div class="matter">
<h1 class="title">Retrieve and Rank and Python</h1>
</div>
</div>
<div class="markdown">
<h2 id="introduction">Introduction</h2>
<p>Retrieve and Rank (R&amp;R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level <a href="http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/">here</a>.</p>
<p>R&amp;R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order.</p>
<p>Some of my partners have found that getting documents in and out of Retrieve and Rank is a little cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with <a href="https://github.com/edsu/solrpy">solrpy</a>, a Python wrapper around Apache SOLR. Since R&amp;R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&amp;R, given a few special tweaks.</p>
<h2 id="getting-started">Getting Started</h2>
<p><strong>You will need</strong></p>
<p>An R&amp;R instance with a cluster and collection already configured. I'm using a schema which has three fields: id, title and text.</p>
<p>First you'll want to install the library. Normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you'll need to install it from my GitHub repo:</p>
<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>
<p>The next step is to run Python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:</p>
<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;</pre>
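<p>A small helper can assemble that URL so the cluster ID and collection name are only typed once. This is a minimal sketch using only the Python 3 standard library: <code>collection_url</code> and <code>BASE_URL</code> are hypothetical names of my own, not part of solrpy, and the base address is simply copied from the structure shown above.</p>

```python
from urllib.parse import quote

# Base address copied from the URL structure shown above.
BASE_URL = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"


def collection_url(cluster_id, collection_name, base_url=BASE_URL):
    """Build the Solr collection URL for a cluster and collection (hypothetical helper)."""
    # Percent-encode both parts so unusual characters cannot break the path.
    cluster = quote(cluster_id, safe="")
    collection = quote(collection_name, safe="")
    return "{}/solr_clusters/{}/solr/{}".format(base_url, cluster, collection)
```

<p>The returned string is what you would hand to <code>solr.Solr(...)</code> as its first argument.</p>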
<p>You will also need the credentials from your Bluemix service, which should look something like this:</p>
<pre>{
"credentials": {
"url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
"username": "&lt;USERNAME&gt;",
"password": "&lt;PASSWORD&gt;"
}
}</pre>
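<p>If you keep that JSON in a file or an environment variable, the username and password can be read with the standard library instead of being pasted into the shell. A minimal sketch, assuming exactly the structure shown above; <code>read_credentials</code> is a hypothetical helper name.</p>

```python
import json


def read_credentials(json_text):
    """Return (url, username, password) from a credentials document like the one above."""
    # The three values sit under the top-level "credentials" key.
    creds = json.loads(json_text)["credentials"]
    return creds["url"], creds["username"], creds["password"]
```

<p>The username and password it returns map onto the <code>http_user</code> and <code>http_pass</code> arguments used when connecting.</p>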
<p>In Python you should try running the following (I am using the interactive Python shell <a href="https://en.wikipedia.org/wiki/IDLE_(Python)">IDLE</a> for this example):</p>
<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; s.search("hello world")
<em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em></pre>
<p>If this worked then you will see something like <em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em> as output here. If you get an error response, check that you have substituted in valid values for &lt;CLUSTER_ID&gt;, &lt;COLLECTION_NAME&gt;, &lt;USERNAME&gt; and &lt;PASSWORD&gt;.</p>
<p>From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.</p>
<p>To add a document you can use the code below:</p>
<pre>&gt;&gt;&gt; s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;167&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong>
&gt;&gt;&gt; s.commit()
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;68&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<p>The XML output shows that the initial add and then commit operations were both successful.</p>
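<p>Rather than reading the XML by eye, you could check the status value in code. The sketch below uses the Python 3 standard library and assumes the response has the same shape as the output above, where a status of 0 accompanied a successful operation; <code>solr_status</code> is a hypothetical helper, not a solrpy function.</p>

```python
import xml.etree.ElementTree as ET


def solr_status(response_xml):
    """Return the integer status from a Solr XML response, or None if it is absent."""
    root = ET.fromstring(response_xml)
    # The status is stored in an int element whose name attribute is "status".
    for node in root.iter("int"):
        if node.get("name") == "status":
            return int(node.text)
    return None
```

<p>Passing it the string returned by <code>s.add(...)</code> or <code>s.commit()</code> should then give 0 for responses like the ones shown.</p>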
<h2 id="content-management">Content Management</h2>
<p>You can also add a number of documents in one call. This is especially useful if you have a large number of Python objects to insert into SOLR:</p>
<pre>&gt;&gt;&gt; s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;20&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<p>You can also delete items by their ID from Python:</p>
<pre>&gt;&gt;&gt; s.delete(id=1)
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;43&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
<h2 id="querying-solr-unranked-results">Querying SOLR (unranked results)</h2>
<p>You can run SOLR queries too. Note, importantly, that this does not use the Retrieve and Rank rankers; it only gives you access to the standard SOLR ranking.</p>
<pre>&gt;&gt;&gt; r = s.select("test")
&gt;&gt;&gt; r.numFound
<strong>1L</strong>
&gt;&gt;&gt; r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong>
</pre>
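<p>Notice that in the output above the title and text values come back as one-item lists, presumably because of how my schema declares those fields. If you would rather work with plain values, a small helper can unwrap them. This is a sketch only; <code>flatten_result</code> is a hypothetical name.</p>

```python
def flatten_result(doc):
    """Return a copy of a result dict with one-item lists replaced by their item."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            # Scalars and longer lists are kept exactly as they were.
            flat[key] = value
    return flat
```

<p>You could apply it to every entry with <code>[flatten_result(d) for d in r.results]</code>.</p>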
<h2 id="querying-rankers">Querying Rankers</h2>
<p>Provided you have <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train">successfully trained a ranker</a> and have the ranker ID handy, you can also query your ranker directly from Python using solrpy.</p>
<pre>&gt;&gt;&gt; import solr
&gt;&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
&gt;&gt;&gt; fcselect = solr.SearchHandler(s, "/fcselect")
&gt;&gt;&gt; r = fcselect("my query text", ranker_id="&lt;RANKER-ID&gt;")</pre>
<p>In this case <strong>r</strong> is the same type as in the non-ranker example above, so you can access the results via <strong>r.results</strong>.</p>
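<p>If you want to see roughly what a ranker query looks like as a plain HTTP request, for example to try it in cURL, the sketch below builds one. It assumes that the <code>/fcselect</code> handler accepts the standard SOLR <code>q</code> parameter plus a <code>ranker_id</code> parameter in the query string; I have not shown what solrpy itself sends, and <code>fcselect_url</code> is a hypothetical helper.</p>

```python
from urllib.parse import urlencode


def fcselect_url(collection_url, query, ranker_id):
    """Build a GET URL for the /fcselect handler (hypothetical helper)."""
    # Assumption: q carries the query text and ranker_id names the trained ranker.
    params = urlencode({"q": query, "ranker_id": ranker_id})
    return "{}/fcselect?{}".format(collection_url.rstrip("/"), params)
```

<p>The first argument is the same collection URL that is passed to <code>solr.Solr(...)</code>.</p>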
<h2 id="more-information">More information</h2>
<p>For more information on how to use solrpy, visit its documentation page <a href="http://pythonhosted.org/solrpy/">here</a>.</p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/api">api</a></li>
<li><a href="/tags/cloud">cloud</a></li>
<li><a href="/tags/custom">custom</a></li>
<li><a href="/tags/developer">developer</a></li>
<li><a href="/tags/ecosystem">ecosystem</a></li>
<li><a href="/tags/fcselect">fcselect</a></li>
<li><a href="/tags/ibm">ibm</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/query">query</a></li>
<li><a href="/tags/rank">rank</a></li>
<li><a href="/tags/ranker">ranker</a></li>
<li><a href="/tags/retrieve">retrieve</a></li>
<li><a href="/tags/services">services</a></li>
<li><a href="/tags/solr">solr</a></li>
<li><a href="/tags/train">train</a></li>
<li><a href="/tags/watson">watson</a></li>
<li><a href="/tags/wdc">wdc</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft 2020 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>