
author: James
date: 2015-11-21 09:41:19+00:00
post_meta: date
tags: elasticsearch, python, phd
title: Scrolling in ElasticSearch
type: posts
url: /2015/11/21/scrolling-in-elasticsearch/

I know I'm doing a lot of flip-flopping between SOLR and Elastic at the moment. I'm trying to figure out the key similarities and differences between them, and where one is more suitable than the other.

The following is an example of how to map a function _**f**_ onto an entire set of indexed data in Elastic using the scroll API.

If you use Elastic, you can page through results by adding a size and a from parameter. For example, if you wanted to retrieve results in pages of 5 starting from the 3rd page (i.e. show results 11-15), you would do:

GET /_search?size=5&from=10
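The offset arithmetic behind that request is easy to get wrong by one, so here is a small sketch of it. The helper name `page_query` is made up for illustration; it only builds the query string and does not talk to Elastic.

```python
from urllib.parse import urlencode


def page_query(page, size):
    """Build the size/from query string for a 1-indexed page of results."""
    if page < 1 or size < 1:
        raise ValueError("page and size must both be at least 1")
    # Page 1 starts at offset 0, page 2 at `size`, page 3 at 2 * size, ...
    offset = (page - 1) * size
    return urlencode({"size": size, "from": offset})


# Third page, five results per page (results 11-15).
print("GET /_search?" + page_query(3, 5))
```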

However, this becomes more expensive as you move further and further into the list of results. Each time you make one of these calls you are re-running the search operation, forcing Lucene to go off and re-score all the results, rank them, and then discard the first 10 (or 10000 if you get that far). There is an easier option: the scan and scroll API.
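To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the simplified cost model described above, where every request throws away exactly the first `from` hits; the function name `discarded_hits` is hypothetical and this is not a benchmark of Elastic itself.

```python
def discarded_hits(total_results, page_size):
    """Total hits ranked and then thrown away when walking through
    `total_results` hits with size/from paging (simplified model)."""
    # The `from` value of each successive page: 0, page_size, 2 * page_size, ...
    offsets = range(0, total_results, page_size)
    # Each request discards everything before its own offset.
    return sum(offsets)


print(discarded_hits(15, 5))      # three pages: 0 + 5 + 10
print(discarded_hits(10000, 5))   # two thousand pages
```

Under this model the wasted work grows roughly with the square of the number of results you page through, which is why deep paging hurts.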

The idea is that you run your actual query once, and then Elastic caches the result somewhere and gives you an “access token” to go back in and get them. Then you call the scroll API endpoint with said token to get each page of results. A caveat of this is that each time you make a call your token updates and you need to use the new one. My code sample deals with this, but it took me a while to figure out what was going on.
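To make the token hand-off concrete, here is a minimal sketch of the loop. It is an illustration under stated assumptions rather than the exact listing from this post: `map_over_index`, `page_size` and `keep_alive` are names I made up, `es` is assumed to be an elasticsearch-py client (anything with compatible `search` and `scroll` methods will do), and it assumes a plain scroll request rather than the `scan` search type, so the first response already contains hits.

```python
def map_over_index(es, index, f, page_size=100, keep_alive="2m"):
    """Apply f to every hit in `index`, one scroll page at a time.

    Returns the list of f's return values (a choice made for this sketch).
    """
    # Run the search once. No query body is passed, so it matches all
    # documents. Elastic keeps the search context alive for `keep_alive`
    # and hands back a scroll ID (the "access token").
    page = es.search(index=index, scroll=keep_alive, size=page_size)
    results = []
    while page["hits"]["hits"]:
        results.extend(f(hit) for hit in page["hits"]["hits"])
        # The token can change between calls, so always read it from the
        # most recent response before asking for the next page.
        scroll_id = page["_scroll_id"]
        page = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
    return results
```

With a real cluster you would pass in something like `Elasticsearch()` from the `elasticsearch` package as `es`; the index name and the function `f` are up to you.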

The code below uses the Python elasticsearch library to make a scan and scroll call to an index and continues to load results until there are no more hits. For each page it maps a function **f** onto the results. It would not be hard to modify this code to work on multiple threads/processes using the Python multiprocessing API. Take a look!