<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>ElasticSearch: Turning analysis off and why it's useful - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="ElasticSearch: Turning analysis off and why it's useful">
<meta itemprop="description" content="I have recently been playing with Elastic search a lot for my PhD and started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in a JSON format. One of the fields is “UOA” which contains the title of the unit of impact that the case study belongs to."><meta itemprop="datePublished" content="2015-11-29T14:59:06&#43;00:00" />
<meta itemprop="dateModified" content="2015-11-29T14:59:06&#43;00:00" />
<meta itemprop="wordCount" content="302">
<meta itemprop="keywords" content="analysis,elasticsearch,indexing,python,schema," /><meta property="og:title" content="ElasticSearch: Turning analysis off and why it's useful" />
<meta property="og:description" content="I have recently been playing with Elastic search a lot for my PhD and started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in a JSON format. One of the fields is “UOA” which contains the title of the unit of impact that the case study belongs to." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2015/11/29/elasticsearch-turning-analysis-off-and-why-its-useful/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2015-11-29T14:59:06&#43;00:00" />
<meta property="article:modified_time" content="2015-11-29T14:59:06&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="ElasticSearch: Turning analysis off and why it's useful"/>
<meta name="twitter:description" content="I have recently been playing with Elastic search a lot for my PhD and started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in a JSON format. One of the fields is “UOA” which contains the title of the unit of impact that the case study belongs to."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">29</span>
<span class="rest">Nov 2015</span>
</div>
</div>
<div class="matter">
<h1 class="title">ElasticSearch: Turning analysis off and why it's useful</h1>
</div>
</div>
<div class="markdown">
<p>I have recently been playing with Elasticsearch a lot for my PhD and started trying to do some more complicated queries and pattern matching using the query DSL. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in JSON format. One of the fields is “UOA”, which contains the title of the unit of assessment that the case study belongs to. We recently decided that we do not want to look at all units of assessment (my PhD is around impact in science, so domains such as Art History are largely irrelevant to me). Therefore I started trying to run queries like this:</p>
<pre>{
   "query":{
      "filtered":{
         "query":{
            "match_all":{}
         },
         "filter":{
            "term":{
               "UOA":"General Engineering"
            }
         }
      }
   }
}</pre>
<p>For some reason this returns zero results. It took me ages to find <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_term_filter_with_text">this page</a> in the Elasticsearch manual, which describes the exact phenomenon I'm running into above. It turns out that the default analyser tokenizes every string field, so Elasticsearch has no notion of UOA ever containing “General Engineering”. Instead it only knows of a UOA field that contains the term “general” and the term “engineering” independently of each other (bag-of-words). The term filter does not analyse its input, so it looks for the single exact term “General Engineering”, which was never indexed. To solve this you have to:</p>
<ul>
<li>Download the existing schema from Elasticsearch:</li>
</ul>
<pre>$ curl -XGET "http://localhost:9200/impact_studies/_mapping/study"
{"impact_studies":{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string"},"UnderpinningResearch":{"type":"string"}}}}}}</pre>
<ul>
<li>Delete the index (unfortunately you can't change how an existing field is analysed on the fly), so that it can be recreated with the analyser that tokenizes the field's values turned off:</li>
</ul>
<pre>$ curl -XDELETE "http://localhost:9200/impact_studies"</pre>
<ul>
<li>Then recreate the index with a mapping that sets "index": "not_analyzed" on the field you are interested in:</li>
</ul>
<pre>curl -XPUT "http://localhost:9200/impact_studies/" -d '{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string", "index" : "not_analyzed"},"UnderpinningResearch":{"type":"string"}}}}}'</pre>
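<p>Editing that one-line mapping by hand is error-prone, so the step from the GET response to the PUT body can also be scripted. The following is a minimal Python sketch rather than a definitive implementation: the function name mapping_with_exact_field and the trimmed-down response are made up for illustration, and it assumes the "string" / "not_analyzed" mapping syntax used in this post. It only builds the request body; sending it is left to curl or an HTTP client of your choice.</p>
<pre>
```python
import copy
import json


def mapping_with_exact_field(get_mapping_response, index, doc_type, field):
    """Build a create-index body from a GET _mapping response,
    marking one string field as not_analyzed.

    Hypothetical helper for illustration; not part of any library.
    """
    # The GET response nests the mappings under the index name,
    # whereas the create-index (PUT) body starts at "mappings".
    mappings = copy.deepcopy(get_mapping_response[index]["mappings"])
    properties = mappings[doc_type]["properties"]
    # Keep the field value as one exact term instead of tokenizing it.
    properties[field]["index"] = "not_analyzed"
    return {"mappings": mappings}


# Trimmed-down stand-in for the real response shown above (assumption).
response = {
    "impact_studies": {
        "mappings": {
            "study": {
                "properties": {
                    "Title": {"type": "string"},
                    "UOA": {"type": "string"},
                }
            }
        }
    }
}

body = mapping_with_exact_field(response, "impact_studies", "study", "UOA")
print(json.dumps(body, indent=2))
```
</pre>
<p>The body it prints can be passed to the PUT request in place of the hand-edited JSON. The original response is deep-copied, so it is left untouched.</p>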
<p>Once you've done this you're good to go: reingest your data and your filter queries should be much more fruitful.</p>
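<p>To make the difference concrete, here is a toy Python model of what ends up in the index in each case. It is a sketch under loud assumptions: toy_standard_analyzer is a hypothetical stand-in that just lowercases and splits on non-word characters, which only approximates the real default analyser, and none of this talks to Elasticsearch.</p>
<pre>
```python
import re


def toy_standard_analyzer(text):
    """Rough stand-in for the default analyser (assumption):
    lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def indexed_terms(value, analyzed):
    """Terms that would be stored in the index for one field value."""
    if analyzed:
        return set(toy_standard_analyzer(value))
    # not_analyzed: the whole value is kept as a single term
    return {value}


def term_filter_matches(terms, wanted):
    """A term filter does no analysis: it looks up the exact term."""
    return wanted in terms


value = "General Engineering"
analyzed = indexed_terms(value, analyzed=True)
exact = indexed_terms(value, analyzed=False)

print(sorted(analyzed))                                     # ['engineering', 'general']
print(term_filter_matches(analyzed, "General Engineering"))  # False
print(term_filter_matches(exact, "General Engineering"))     # True
print(term_filter_matches(exact, "general engineering"))     # False
```
</pre>
<p>The last line shows the flip side of not_analyzed: matching is exact, so the filter value has to match the stored string including its case.</p>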
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/analysis">analysis</a></li>
<li><a href="/tags/elasticsearch">elasticsearch</a></li>
<li><a href="/tags/indexing">indexing</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/schema">schema</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft 2020 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>