---
title: Dialect Sensitive Topic Models
author: James
type: post
date: 2017-07-25T11:02:42+00:00
url: /2017/07/25/dialect-sensitive-topic-models/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Open Source
- PhD
tags:
- lda
- machine learning
- python
- topic model
---
As part of my PhD I&#8217;m currently interested in topic models that can take the dialect of a piece of writing into account. That is, how can we build a model that compares topics discussed in different dialectal styles, such as scientific papers versus newspaper articles? If you&#8217;re new to the concept of topic modelling then [this article][1] can give you a quick primer.
## Vanilla LDA
<figure id="attachment_175" aria-describedby="caption-attachment-175" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-175" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/300px-Latent_Dirichlet_allocation.png?resize=300%2C157&#038;ssl=1" alt="" width="300" height="157" data-recalc-dims="1" /><figcaption id="caption-attachment-175" class="wp-caption-text">A diagram of how latent variables in LDA model are connected</figcaption></figure>
Vanilla topic models such as [Blei&#8217;s LDA][2] are great but start to fall down when the wording around one particular concept varies too much. In a scientific paper you might expect to find words like &#8220;gastroenteritis&#8221;, &#8220;stomach&#8221; and &#8220;virus&#8221; whereas in newspapers discussing the same topic you might find &#8220;tummy&#8221;, &#8220;sick&#8221; and &#8220;bug&#8221;.  A vanilla LDA implementation might struggle to understand that these concepts are linked unless the contextual information around the words is similar (e.g. both articles have &#8220;uncooked meat&#8221; and &#8220;symptoms last 24 hours&#8221;).
We define a set of toy documents covering three main topics: sickness, diet/health, and exercising at the gym. Half of the documents are written in &#8220;layman&#8217;s&#8221; English and the other half in &#8220;scientific&#8221; English. The documents are shown below:
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']</pre>
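To make the mechanics concrete, here is a minimal collapsed Gibbs sampler for vanilla LDA over the toy corpus. This is an illustrative sketch of the standard algorithm, not the implementation used to produce the results below; the function name and hyperparameter defaults are my own.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=3, n_iter=30, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for vanilla LDA."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # z[d][i] is the topic currently assigned to word i of document d
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    ndk = [[0] * n_topics for _ in docs]               # document-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # remove this word's current assignment from the counts
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional: p(z=t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw  # the highest-count words in nkw[t] characterise topic t

docs = [
    ["tummy", "ache", "bad", "food", "poisoning", "sick"],
    ["pulled", "muscle", "gym", "workout", "exercise", "cardio"],
    ["diet", "exercise", "carbs", "protein", "food", "health"],
]
topics = lda_gibbs(docs, n_topics=3, n_iter=30)
```

Note that each word gets its topic from its document's topic mix alone; nothing in the model knows that &#8220;tummy&#8221; and &#8220;stomach&#8221; come from different registers, which is exactly the limitation discussed above.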
Using a normal implementation of LDA with 3 topics, we get the following results after 30 iterations:
<figure id="attachment_174" aria-describedby="caption-attachment-174" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="size-medium wp-image-174" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&#038;ssl=1" alt="" width="300" height="209" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?w=482&ssl=1 482w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-174" class="wp-caption-text">Vanilla LDA results</figcaption></figure>
It is fair to say that vanilla LDA didn&#8217;t do a terrible job, but it did end up making some strange decisions, like putting &#8220;poisoning&#8221; (as in &#8220;food poisoning&#8221;) in with &#8220;cardio&#8221; and &#8220;calories&#8221;. The other two topics seem fairly consistent and sensible.
## DiaTM
Crain et al.&#8217;s 2010 paper [_**&#8220;Dialect topic modeling for improved consumer medical search&#8221;**_][3] proposes a modified LDA that they call &#8220;DiaTM&#8221;.
<figure id="attachment_176" aria-describedby="caption-attachment-176" style="width: 286px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-176" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&#038;ssl=1" alt="" width="286" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&ssl=1 286w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=768%2C805&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?w=910&ssl=1 910w" sizes="(max-width: 286px) 100vw, 286px" data-recalc-dims="1" /><figcaption id="caption-attachment-176" class="wp-caption-text">A diagram showing how the latent variables in DiaTM are linked together</figcaption></figure>
DiaTM works in the same way as LDA but also introduces the concept of collections and dialects. A collection defines a set of documents from the same source or written with a similar dialect &#8211; you can imagine having a collection of newspaper articles and a collection of scientific papers for example. Dialects are a bit like topics &#8211; each word is effectively &#8220;generated&#8221; from a dialect and the probability of a dialect being used is defined at collection level.
The handy thing is that every word has a probability of appearing in every dialect, and these probabilities are learned by the model. This means that words common to all dialects (such as &#8216;diet&#8217; or &#8216;food&#8217;) can be weighted as such in the model.
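The generative story can be sketched as follows. This is my own toy paraphrase with made-up parameters (two topics, two dialects, a six-word vocabulary), not code from the paper or the library:

```python
import random

# Hypothetical parameters: 2 topics x 2 dialects over a tiny shared vocabulary.
vocab = ["tummy", "stomach", "sick", "vomit", "gym", "exercise"]

# phi[topic][dialect] is a word distribution; dialect 0 ~ layman, dialect 1 ~ scientific
phi = [
    [[0.4, 0.0, 0.4, 0.0, 0.1, 0.1],     # sickness topic, layman dialect
     [0.0, 0.4, 0.0, 0.4, 0.1, 0.1]],    # sickness topic, scientific dialect
    [[0.05, 0.0, 0.05, 0.0, 0.5, 0.4],   # exercise topic, layman dialect
     [0.0, 0.05, 0.0, 0.05, 0.4, 0.5]],  # exercise topic, scientific dialect
]
# pi[collection] is that collection's distribution over dialects
pi = {"newspapers": [0.9, 0.1], "papers": [0.1, 0.9]}

rng = random.Random(42)

def pick(weights):
    """Sample an index from an unnormalised categorical distribution."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def generate_doc(collection, theta, length=6):
    """Generate one document: each word draws a topic from the document's
    topic mix and a dialect from its collection's dialect mix."""
    words = []
    for _ in range(length):
        z = pick(theta)           # topic for this word
        d = pick(pi[collection])  # dialect for this word
        words.append(vocab[pick(phi[z][d])])
    return words

doc = generate_doc("newspapers", theta=[0.7, 0.3])
```

A newspaper document about sickness will mostly draw from the layman column of `phi` and so tends to say &#8220;tummy&#8221; and &#8220;sick&#8221;, while a scientific paper on the same topic tends to say &#8220;stomach&#8221; and &#8220;vomit&#8221; &#8211; inference just runs this story in reverse.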
Running DiaTM on the same corpus as above yields the following results:
<figure id="attachment_178" aria-describedby="caption-attachment-178" style="width: 660px" class="wp-caption alignright"><img loading="lazy" class="wp-image-178 size-large" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=660%2C177&#038;ssl=1" alt="" width="660" height="177" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=1024%2C275&ssl=1 1024w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=300%2C81&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=768%2C206&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?w=1334&ssl=1 1334w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption id="caption-attachment-178" class="wp-caption-text">Results of DiaTM on the sickness/exercise corpus</figcaption></figure>
You can see how the model has effectively identified the three key topics in the documents above but has also segmented the topics by dialect. Topic 2 is mainly concerned with food poisoning and sickness: in dialect 0 the words &#8220;sick&#8221; and &#8220;bad&#8221; appear, but in dialect 1 the words &#8220;vomit&#8221; and &#8220;gastroenteritis&#8221; appear.
## Open Source Implementation
I have tried to turn my experiment into a Python library that others can make use of. It is currently early-stage and a little slow, but it works. The code is [available here][4] and pull requests are very welcome.
The library offers a &#8216;Scikit-Learn-like&#8217; interface where you fit the model to your data like so:
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']
collection1 = [doc1,doc2,doc3, doc7, doc11]
# 'scientific' documents
collection2 = [doc4,doc5,doc6, doc8, doc9, doc10]
collections = [collection1, collection2]
dtm = DiaTM(n_topic=3, n_dialects=2)
dtm.fit(X)
</pre>
Fitting the model to new documents using transform() will be available soon, as will finding the log probability of the current model parameters.
[1]: http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html
[2]: http://dl.acm.org/citation.cfm?id=2133826
[3]: http://www.ncbi.nlm.nih.gov/pubmed/21346955
[4]: https://github.com/ravenscroftj/diatm