As part of my PhD I’m currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectal styles, such as scientific papers versus newspaper articles? If you’re new to the concept of topic modelling then [this article][1] can give you a quick primer.
## Vanilla LDA
<figure id="attachment_175" aria-describedby="caption-attachment-175" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-175" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/300px-Latent_Dirichlet_allocation.png?resize=300%2C157&ssl=1" alt="" width="300" height="157" data-recalc-dims="1" /><figcaption id="caption-attachment-175" class="wp-caption-text">A diagram of how the latent variables in an LDA model are connected</figcaption></figure>
Vanilla topic models such as [Blei’s LDA][2] are great but start to fall down when the wording around one particular concept varies too much. In a scientific paper you might expect to find words like “gastroenteritis”, “stomach” and “virus” whereas in newspapers discussing the same topic you might find “tummy”, “sick” and “bug”. A vanilla LDA implementation might struggle to understand that these concepts are linked unless the contextual information around the words is similar (e.g. both articles have “uncooked meat” and “symptoms last 24 hours”).
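To make the vocabulary gap concrete, here is a minimal Python sketch that measures how many content words two sentences about the same illness actually share. The two sentences and the tiny stop word list are made up for illustration and are not taken from the corpus used in this post.

```python
# Hypothetical example sentences, not the corpus used in this post.
layman = "I caught a tummy bug from uncooked meat and felt sick"
scientific = "gastroenteritis is a stomach virus often spread by uncooked meat"

# A tiny, hand-picked stop word list (an assumption for this example).
STOPWORDS = {"i", "a", "and", "is", "by", "from", "the", "often"}


def content_words(text):
    """Lower-case the text, split on whitespace and drop the stop words."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


layman_words = content_words(layman)
scientific_words = content_words(scientific)

# Words that a bag-of-words model could use to link the two sentences.
shared = layman_words & scientific_words

# Jaccard similarity: shared words divided by all distinct words.
jaccard = len(shared) / len(layman_words | scientific_words)

print(sorted(shared))     # ['meat', 'uncooked']
print(round(jaccard, 2))  # 0.18
```

The only overlap comes from the contextual phrase “uncooked meat”. The dialect-specific words (“tummy” versus “stomach”, “bug” versus “virus”) contribute nothing, which is the signal a bag-of-words model like LDA is missing.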
We define a set of toy documents that cover 3 main topics: sickness, health and going to the gym. Half of the documents are written in “layman’s” English and the other half in “scientific” English. The documents are shown below.
It is fair to say that vanilla LDA didn’t do a terrible job, but it did end up making some strange decisions, like putting “poisoning” (as in “food poisoning”) in with “cardio” and “calories”. The other two topics seem fairly consistent and sensible.
## DiaTM
Crain et al.’s 2010 paper [_**“Dialect topic modeling for improved consumer medical search”**_][3] proposes a modified LDA that they call “DiaTM”.
<figure id="attachment_176" aria-describedby="caption-attachment-176" style="width: 286px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-176" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&ssl=1" alt="" width="286" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&ssl=1 286w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=768%2C805&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?w=910&ssl=1 910w" sizes="(max-width: 286px) 100vw, 286px" data-recalc-dims="1" /><figcaption id="caption-attachment-176" class="wp-caption-text">A diagram showing how the latent variables in DiaTM are linked together</figcaption></figure>
DiaTM works in the same way as LDA but also introduces the concept of collections and dialects. A collection defines a set of documents from the same source or written with a similar dialect – you can imagine having a collection of newspaper articles and a collection of scientific papers for example. Dialects are a bit like topics – each word is effectively “generated” from a dialect and the probability of a dialect being used is defined at collection level.
The handy thing is that words have a probability of appearing in every dialect, which is learned by the model. This means that words common to all dialects (such as ‘diet’ or ‘food’) can be weighted as such in the model.
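The generative story described above can be sketched in a few lines of Python. This is a simplified reading of the description in this post, not the exact model from the Crain et al. paper and not the code from my library: the vocabulary, the hyperparameter values and all of the function names are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def dirichlet(alpha, size):
    """Sample a symmetric Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical vocabulary and model sizes.
VOCAB = ["tummy", "sick", "bug", "gastroenteritis",
         "stomach", "virus", "diet", "food"]
N_TOPICS, N_DIALECTS, N_COLLECTIONS = 3, 2, 2

# One word distribution per (topic, dialect) pair, so every word has
# some probability of appearing in every dialect.
phi = [[dirichlet(0.5, len(VOCAB)) for _ in range(N_DIALECTS)]
       for _ in range(N_TOPICS)]

# The probability of each dialect being used is set per collection.
psi = [dirichlet(0.5, N_DIALECTS) for _ in range(N_COLLECTIONS)]


def generate_document(collection, length=10):
    """Generate (topic, dialect, word) triples for one document."""
    theta = dirichlet(0.5, N_TOPICS)  # topic mix for this document, as in LDA
    tokens = []
    for _ in range(length):
        topic = categorical(theta)
        dialect = categorical(psi[collection])
        word = VOCAB[categorical(phi[topic][dialect])]
        tokens.append((topic, dialect, word))
    return tokens


doc = generate_document(collection=0)
print(doc)
```

Fitting the model means running this story in reverse: given only the words and the collection each document belongs to, infer the topic mixes, the dialect mixes and the per-topic, per-dialect word distributions.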
Running DiaTM on the same corpus as above yields the following results:
<figure id="attachment_178" aria-describedby="caption-attachment-178" style="width: 660px" class="wp-caption alignright"><img loading="lazy" class="wp-image-178 size-large" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=660%2C177&ssl=1" alt="" width="660" height="177" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=1024%2C275&ssl=1 1024w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=300%2C81&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=768%2C206&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?w=1334&ssl=1 1334w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption id="caption-attachment-178" class="wp-caption-text">Results of DiaTM on the sickness/exercise corpus</figcaption></figure>
You can see how the model has effectively identified the three key topics in the documents above but has also segmented the topics by dialect. Topic 2 is mainly concerned with food poisoning and sickness. In dialect 0 the words “sick” and “bad” appear but in dialect 1 the words “vomit” and “gastroenteritis” appear.
## Open Source Implementation
I have tried to turn my experiment into a Python library that others can make use of. It is currently at an early stage and a little slow, but it works. The code is [available here][4] and pull requests are very welcome.
The library offers a ‘Scikit-Learn-like’ interface where you fit the model to your data like so: