---
author: James
date: 2017-07-25 11:02:42+00:00
medium_post:
- O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}
post_meta:
- date
preview: /social/0f9c30a5c78f1aa443ef6ca6603efeb50c4f22f7f162b26f5c7c46fb71a1cab4.png
tags:
- machine learning
- python
- topic model
- PhD
- open source
title: Dialect Sensitive Topic Models
type: posts
url: /2017/07/25/dialect-sensitive-topic-models/
---
As part of my PhD I’m currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialects, such as scientific papers versus newspaper articles? If you’re new to the concept of topic modelling then [this article][1] can give you a quick primer.
## Vanilla LDA
<figure id="attachment_175" aria-describedby="caption-attachment-175" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-175" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/300px-Latent_Dirichlet_allocation.png?resize=300%2C157&#038;ssl=1" alt="" width="300" height="157" data-recalc-dims="1" /><figcaption id="caption-attachment-175" class="wp-caption-text">A diagram of how latent variables in LDA model are connected</figcaption></figure>
Vanilla topic models such as [Blei’s LDA][2] are great but start to fall down when the wording around one particular concept varies too much. In a scientific paper you might expect to find words like “gastroenteritis”, “stomach” and “virus”, whereas in newspapers discussing the same topic you might find “tummy”, “sick” and “bug”. A vanilla LDA implementation might struggle to understand that these concepts are linked unless the contextual information around the words is similar (e.g. both articles have “uncooked meat” and “symptoms last 24 hours”).
We define a set of toy documents covering 3 main topics: sickness, health and going to the gym. Half of the documents are written in “layman’s” English and the other half in “scientific” English. The documents are shown below:
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']</pre>
Using a normal implementation of LDA with 3 topics, we get the following results after 30 iterations:
<figure id="attachment_174" aria-describedby="caption-attachment-174" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="size-medium wp-image-174" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&#038;ssl=1" alt="" width="300" height="209" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?w=482&ssl=1 482w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-174" class="wp-caption-text">Vanilla LDA results</figcaption></figure>
It is fair to say that vanilla LDA didn’t do a terrible job, but it did end up making some strange decisions, like putting “poisoning” (as in ‘food poisoning’) in with “cardio” and “calories”. The other two topics seem fairly consistent and sensible.
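If you want to try a run like this without installing anything, here is a minimal sketch of LDA trained with collapsed Gibbs sampling on the toy corpus. It is not the implementation that produced the screenshot above: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta` and the random seed are all my own assumptions, so expect the topics to come out differently.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=3, n_iter=30, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents.

    alpha, beta and seed are assumed values, not taken from the post.
    Returns one word -> probability dict per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_words = len(vocab)
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(topic = k | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic-word distributions.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
         for w in vocab}
        for k in range(n_topics)
    ]

docs = [
    ["tummy", "ache", "bad", "food", "poisoning", "sick"],
    ["pulled", "muscle", "gym", "workout", "exercise", "cardio"],
    ["diet", "exercise", "carbs", "protein", "food", "health"],
    ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"],
    ["muscle", "aerobic", "exercise", "cardiovascular", "calories"],
    ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"],
    ["gym", "food", "gainz", "protein", "cardio", "muscle"],
    ["stomach", "crunches", "muscle", "ache", "protein"],
    ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"],
    ["dehydrated", "water", "exercise", "cardiovascular"],
    ["drink", "water", "daily", "diet", "health"],
]

phi = lda_gibbs(docs)
for k, dist in enumerate(phi):
    top = sorted(dist, key=dist.get, reverse=True)[:5]
    print(f"topic {k}:", ", ".join(top))
```

With a corpus this small the result depends heavily on the seed and the priors, so it is worth trying a few values before reading much into any single topic.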
## DiaTM
Crain et al.’s 2010 paper [_**“Dialect topic modeling for improved consumer medical search”**_][3] proposes a modified LDA that they call “DiaTM”.
<figure id="attachment_176" aria-describedby="caption-attachment-176" style="width: 286px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-176" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&#038;ssl=1" alt="" width="286" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&ssl=1 286w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=768%2C805&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?w=910&ssl=1 910w" sizes="(max-width: 286px) 100vw, 286px" data-recalc-dims="1" /><figcaption id="caption-attachment-176" class="wp-caption-text">A diagram showing how the latent variables in DiaTM are linked together</figcaption></figure>
DiaTM works in the same way as LDA but also introduces the concept of collections and dialects. A collection defines a set of documents from the same source or written with a similar dialect – you can imagine having a collection of newspaper articles and a collection of scientific papers for example. Dialects are a bit like topics – each word is effectively “generated” from a dialect and the probability of a dialect being used is defined at collection level.
The handy thing is that words have a probability of appearing in every dialect which is learned by the model. This means that words common to all dialects (such as ‘diet’ or ‘food’) can be weighted as such in the model.
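To make the idea of words being “generated” from a dialect more concrete, here is a rough sketch of the generative story as I have described it: each word picks a topic from its document, a dialect from its collection, and then a word for that topic and dialect pair. This is a simplification based on the description in this post, not the exact specification from the paper. The word lists, the probabilities and the names `WORDS`, `DIALECT_PROBS` and `generate_document` are all made up for illustration.

```python
import random

rng = random.Random(0)

TOPICS = ["sickness", "exercise"]
DIALECTS = ["layman", "scientific"]

# Hypothetical word lists for each (topic, dialect) pair. A real model holds
# a learned probability for every vocabulary word, not a short uniform list.
WORDS = {
    ("sickness", "layman"): ["tummy", "sick", "bug", "food"],
    ("sickness", "scientific"): ["gastroenteritis", "stomach", "virus", "food"],
    ("exercise", "layman"): ["gym", "workout", "cardio", "muscle"],
    ("exercise", "scientific"): ["aerobic", "cardiovascular", "calories", "muscle"],
}

# Dialect probabilities are defined per collection (assumed values),
# in the same order as DIALECTS.
DIALECT_PROBS = {
    "newspapers": [0.9, 0.1],
    "papers": [0.1, 0.9],
}

def generate_document(collection, topic_probs, length):
    """Draw one document word by word: topic, then dialect, then word."""
    doc = []
    for _ in range(length):
        topic = rng.choices(TOPICS, weights=topic_probs)[0]
        dialect = rng.choices(DIALECTS, weights=DIALECT_PROBS[collection])[0]
        doc.append(rng.choice(WORDS[(topic, dialect)]))
    return doc

# Two documents with the same topic mix but from different collections.
news_doc = generate_document("newspapers", topic_probs=[0.8, 0.2], length=6)
paper_doc = generate_document("papers", topic_probs=[0.8, 0.2], length=6)
print(news_doc)
print(paper_doc)
```

Note that “food” and “muscle” sit in both dialects of their topic here, which is the toy equivalent of a word having a high probability in every dialect.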
Running DiaTM on the same corpus as above yields the following results:
<figure id="attachment_178" aria-describedby="caption-attachment-178" style="width: 660px" class="wp-caption alignright"><img loading="lazy" class="wp-image-178 size-large" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=660%2C177&#038;ssl=1" alt="" width="660" height="177" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=1024%2C275&ssl=1 1024w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=300%2C81&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=768%2C206&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?w=1334&ssl=1 1334w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption id="caption-attachment-178" class="wp-caption-text">Results of DiaTM on the sickness/exercise corpus</figcaption></figure>
You can see how the model has effectively identified the three key topics in the documents above but has also segmented the topics by dialect. Topic 2 is mainly concerned with food poisoning and sickness. In dialect 0 the words “sick” and “bad” appear but in dialect 1 the words “vomit” and “gastroenteritis” appear.
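A quick way to sanity check a split like this is to look at the raw vocabulary of each collection and see which words are shared and which only ever appear on one side. The sketch below uses the same layman/scientific assignment of documents as the library example later in this post; the variable names are my own.

```python
# 'layman' documents (doc1, doc2, doc3, doc7, doc11)
layman_docs = [
    ["tummy", "ache", "bad", "food", "poisoning", "sick"],
    ["pulled", "muscle", "gym", "workout", "exercise", "cardio"],
    ["diet", "exercise", "carbs", "protein", "food", "health"],
    ["gym", "food", "gainz", "protein", "cardio", "muscle"],
    ["drink", "water", "daily", "diet", "health"],
]
# 'scientific' documents (doc4, doc5, doc6, doc8, doc9, doc10)
scientific_docs = [
    ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"],
    ["muscle", "aerobic", "exercise", "cardiovascular", "calories"],
    ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"],
    ["stomach", "crunches", "muscle", "ache", "protein"],
    ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"],
    ["dehydrated", "water", "exercise", "cardiovascular"],
]

layman_vocab = {w for doc in layman_docs for w in doc}
scientific_vocab = {w for doc in scientific_docs for w in doc}

# Words used by both collections versus words unique to one of them.
shared = sorted(layman_vocab & scientific_vocab)
layman_only = sorted(layman_vocab - scientific_vocab)
scientific_only = sorted(scientific_vocab - layman_vocab)

print("shared:", shared)
print("layman only:", layman_only)
print("scientific only:", scientific_only)
```

Words like “diet” and “food” land in the shared set, while “sick” and “bad” only appear in the layman collection and “vomit” and “gastroenteritis” only in the scientific one, which lines up with the dialect split described above.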
## Open Source Implementation
I have tried to turn my experiment into a Python library that others can make use of. It is currently at an early stage and a little slow, but it works. The code is [available here][4] and pull requests are very welcome.
The library offers a ‘Scikit-Learn-like’ interface where you fit the model to your data like so:
<pre lang="python"># Assumed import path; check the repository README for the actual module name
from diatm import DiaTM
doc1 = ["tummy", "ache", "bad", "food", "poisoning", "sick"]
doc2 = ["pulled", "muscle", "gym", "workout", "exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food", "health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach", "crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ["drink", "water", "daily", "diet", "health"]
# 'layman' documents
collection1 = [doc1, doc2, doc3, doc7, doc11]
# 'scientific' documents
collection2 = [doc4, doc5, doc6, doc8, doc9, doc10]
collections = [collection1, collection2]
dtm = DiaTM(n_topic=3, n_dialects=2)
# Fit the model on the list of collections
dtm.fit(collections)
</pre>
Applying the fitted model to new documents using transform() will be available soon, as will finding the log probability of the current model parameters.
[1]: http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html
[2]: http://dl.acm.org/citation.cfm?id=2133826
[3]: http://www.ncbi.nlm.nih.gov/pubmed/21346955
[4]: https://github.com/ravenscroftj/diatm