

author: James
date: 2015-11-29 14:59:06+00:00
post_meta:
- date
preview: /social/cc3038203e23cca40103212b3f8d7b2e535613b231adecd2b67d86b1ce6ba3d0.png
tags:
- elasticsearch
- python
- phd
title: "ElasticSearch: Turning analysis off and why it's useful"
type: posts
url: /2015/11/29/elasticsearch-turning-analysis-off-and-why-its-useful/

I have recently been playing with Elasticsearch a lot for my PhD and have started trying some more complicated queries and pattern matching using the query DSL. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in JSON format. One of the fields is “UOA”, which contains the title of the unit of assessment that the case study belongs to. We recently decided that we do not want to look at all units of assessment (my PhD is about impact in science, so domains such as Art History are largely irrelevant to me). I therefore started trying to run queries like this:

{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "UOA": "General Engineering"
            }
         }
      }
   }
}

For some reason this returns zero results. It took me ages to find the page in the Elasticsearch manual that describes the exact phenomenon I was running into. It turns out that the default analyser tokenizes and lowercases every string field at index time, so Elasticsearch has no notion of UOA ever containing “General Engineering”. Instead it only knows of a UOA field that contains the terms “general” and “engineering”, independently of each other (a bag of words). A term filter is not analysed, so it looks for the exact term “General Engineering”, which does not exist in the index. To solve this you have to:

Download the existing schema from elastic:

curl -XGET "http://localhost:9200/impact_studies/_mapping/study"

{"impact_studies":{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string"},"UnderpinningResearch":{"type":"string"}}}}}}

Delete the index (unfortunately you can’t change how an existing field is analysed on the fly). Note that this also deletes the documents stored in it:

curl -XDELETE "http://localhost:9200/impact_studies"

Then recreate the index, turning off the analyser that tokenizes the values by setting "index": "not_analyzed" on the field you are interested in:

curl -XPUT "http://localhost:9200/impact_studies/" -d '{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string", "index" : "not_analyzed"},"UnderpinningResearch":{"type":"string"}}}}}'
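Hand-editing a one-line mapping like the one above is easy to get wrong, so here is a minimal Python sketch of the same edit done in code. It assumes the mapping response has the shape shown earlier (index name, then "mappings", then the type name) and that the creation request wants a top-level "mappings" key, as in the curl command above. The helper name `mark_not_analyzed` and the trimmed two-field mapping are made up for illustration; the sketch only builds the request body and does not talk to Elasticsearch.

```python
import copy
import json


def mark_not_analyzed(mapping_response, index, doc_type, field):
    """Build an index-creation body with `field` set to not_analyzed.

    `mapping_response` is the parsed JSON returned by the _mapping request.
    The input is left untouched; a modified deep copy is returned.
    """
    # The GET response nests the mappings under the index name,
    # whereas the PUT body starts at the "mappings" key.
    mappings = copy.deepcopy(mapping_response[index]["mappings"])
    mappings[doc_type]["properties"][field]["index"] = "not_analyzed"
    return {"mappings": mappings}


# Trimmed, illustrative stand-in for the full mapping downloaded earlier.
downloaded = {
    "impact_studies": {
        "mappings": {
            "study": {
                "properties": {
                    "Title": {"type": "string"},
                    "UOA": {"type": "string"},
                }
            }
        }
    }
}

body = mark_not_analyzed(downloaded, "impact_studies", "study", "UOA")

# This string is what would go after -d in the curl -XPUT command.
print(json.dumps(body))
```

Only the UOA entry changes; every other field keeps its default analysis, which is what you want for free-text fields such as Title.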

Once you’ve done this you’re good to go: reingest your data and your term filter queries should be much more fruitful.
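To make the difference concrete, here is a toy Python model of what the index holds for an analysed field versus a not_analyzed one. It is a simplified sketch, not how Elasticsearch is implemented: `standard_like_analyze` is a rough stand-in for the default analyser (split on non-word characters, then lowercase), and the two documents are made up for illustration.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the default analyser: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(docs, field, analyzed=True):
    """Map each indexed term to the set of document ids containing it."""
    index = {}
    for doc_id, doc in docs.items():
        # An analysed field is stored as separate tokens;
        # a not_analyzed field is stored as one exact term.
        terms = standard_like_analyze(doc[field]) if analyzed else [doc[field]]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def term_filter(index, value):
    """Exact term lookup, with no analysis applied to the query value."""
    return sorted(index.get(value, set()))


# Hypothetical documents, not taken from the real data set.
docs = {
    1: {"UOA": "General Engineering"},
    2: {"UOA": "Art History"},
}

analyzed_index = build_index(docs, "UOA", analyzed=True)
exact_index = build_index(docs, "UOA", analyzed=False)

print(term_filter(analyzed_index, "General Engineering"))  # []
print(term_filter(analyzed_index, "engineering"))          # [1]
print(term_filter(exact_index, "General Engineering"))     # [1]
```

The first lookup mirrors the original problem: the analysed index only has the keys "general" and "engineering", so the exact phrase matches nothing. With the field left unanalysed, the whole title is a single term and the lookup succeeds.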
