brainsteam.co.uk/new_files/public/2015/11/21/scrolling-in-elasticsearch/index.html

155 lines
8.1 KiB
HTML
Raw Normal View History

2021-12-21 13:31:30 +00:00
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Scrolling in ElasticSearch - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="Scrolling in ElasticSearch">
<meta itemprop="description" content="I know Im doing a lot of flip-flopping between SOLR and Elastic at the moment Im trying to figure out key similarities and differences between them and where one is more suitable than the other.
The following is an example of how to map a function _**f **_onto an entire set of indexed data in elastic using the scroll API.
If you use elastic, it is possible to do paging by adding a size and a from parameter."><meta itemprop="datePublished" content="2015-11-21T09:41:19&#43;00:00" />
<meta itemprop="dateModified" content="2015-11-21T09:41:19&#43;00:00" />
<meta itemprop="wordCount" content="320">
<meta itemprop="keywords" content="elasticsearch,lucene,python,results,scan,scroll," /><meta property="og:title" content="Scrolling in ElasticSearch" />
<meta property="og:description" content="I know Im doing a lot of flip-flopping between SOLR and Elastic at the moment Im trying to figure out key similarities and differences between them and where one is more suitable than the other.
The following is an example of how to map a function _**f **_onto an entire set of indexed data in elastic using the scroll API.
If you use elastic, it is possible to do paging by adding a size and a from parameter." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2015/11/21/scrolling-in-elasticsearch/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2015-11-21T09:41:19&#43;00:00" />
<meta property="article:modified_time" content="2015-11-21T09:41:19&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Scrolling in ElasticSearch"/>
<meta name="twitter:description" content="I know Im doing a lot of flip-flopping between SOLR and Elastic at the moment Im trying to figure out key similarities and differences between them and where one is more suitable than the other.
The following is an example of how to map a function _**f **_onto an entire set of indexed data in elastic using the scroll API.
If you use elastic, it is possible to do paging by adding a size and a from parameter."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">21</span>
<span class="rest">Nov 2015</span>
</div>
</div>
<div class="matter">
<h1 class="title">Scrolling in ElasticSearch</h1>
</div>
</div>
<div class="markdown">
<p>I know Im doing a lot of flip-flopping between SOLR and Elastic at the moment Im trying to figure out key similarities and differences between them and where one is more suitable than the other.</p>
<p>The following is an example of how to map a function _**f **_onto an entire set of indexed data in elastic using the scroll API.</p>
<p>If you use elastic, it is possible to do paging by adding a size and a from parameter. For example if you wanted to retrieve results in pages of 5 starting from the 3rd page (i.e. show results 11-15) you would do:</p>
<pre><span class="pln">GET </span><span class="pun">/</span><span class="pln">_search</span><span class="pun">?</span><span class="pln">size</span><span class="pun">=</span><span class="lit">5</span><span class="pun">&</span><span class="pln">from</span><span class="pun">=</span><span class="lit">10</span></pre>
<p>However this becomes more expensive as you move further and further into the list of results. Each time you make one of these calls you are re-running the search operation forcing Lucene to go off and re-score all the results, rank them and then discard the first 10 (or 10000 if you get that far). There is an easier option: the scan and scroll API.</p>
<p>The idea is that you run your actual query once and then Elastic caches the result somewhere gives you an “access token” to go back in and get them. Then you call the scroll API endpoint with said token to get each page of results (a caveat of this is that each time you make a call your token updates and you need to use the new one. My code sample deals with this but it took me a while to figure out what was going on).</p>
<p>The below code uses the python elasticsearch library to make a scan and scroll call to an index and continues to load results until there are no more hits. For each page it maps a function <em><strong>f</strong></em>** **onto the results. It would not be hard to modify this code to work on multiple threads/processes using the Python multiprocessing API. Take a look!</p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/elasticsearch">elasticsearch</a></li>
<li><a href="/tags/lucene">lucene</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/results">results</a></li>
<li><a href="/tags/scan">scan</a></li>
<li><a href="/tags/scroll">scroll</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the </a></noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft 2020 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>