<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Dialect Sensitive Topic Models - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="Dialect Sensitive Topic Models">
<meta itemprop="description" content="As part of my PhD Im currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectical styles, such as scientific papers versus newspaper articles. If youre new to the concept of topic modelling then this article can give you a quick primer.
Vanilla LDA A diagram of how latent variables in LDA model are connected Vanilla topic models such as Bleis LDA are great but start to fall down when the wording around one particular concept varies too much."><meta itemprop="datePublished" content="2017-07-25T11:02:42&#43;00:00" />
<meta itemprop="dateModified" content="2017-07-25T11:02:42&#43;00:00" />
<meta itemprop="wordCount" content="751">
<meta itemprop="keywords" content="lda,machine learning,python,topic model," /><meta property="og:title" content="Dialect Sensitive Topic Models" />
<meta property="og:description" content="As part of my PhD Im currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectical styles, such as scientific papers versus newspaper articles. If youre new to the concept of topic modelling then this article can give you a quick primer.
Vanilla LDA A diagram of how latent variables in LDA model are connected Vanilla topic models such as Bleis LDA are great but start to fall down when the wording around one particular concept varies too much." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2017/07/25/dialect-sensitive-topic-models/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2017-07-25T11:02:42&#43;00:00" />
<meta property="article:modified_time" content="2017-07-25T11:02:42&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Dialect Sensitive Topic Models"/>
<meta name="twitter:description" content="As part of my PhD Im currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectical styles, such as scientific papers versus newspaper articles. If youre new to the concept of topic modelling then this article can give you a quick primer.
Vanilla LDA A diagram of how latent variables in LDA model are connected Vanilla topic models such as Bleis LDA are great but start to fall down when the wording around one particular concept varies too much."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">25</span>
<span class="rest">Jul 2017</span>
</div>
</div>
<div class="matter">
<h1 class="title">Dialect Sensitive Topic Models</h1>
</div>
</div>
<div class="markdown">
<p>As part of my PhD I'm currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectical styles, such as scientific papers versus newspaper articles? If you're new to the concept of topic modelling then <a href="http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html">this article</a> can give you a quick primer.</p>
<h2 id="vanilla-lda">Vanilla LDA</h2>
<figure id="attachment_175" aria-describedby="caption-attachment-175" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-175" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/300px-Latent_Dirichlet_allocation.png?resize=300%2C157&#038;ssl=1" alt="" width="300" height="157" data-recalc-dims="1" /><figcaption id="caption-attachment-175" class="wp-caption-text">A diagram of how latent variables in LDA model are connected</figcaption></figure>
<p>Vanilla topic models such as <a href="http://dl.acm.org/citation.cfm?id=2133826">Blei's LDA</a> are great but start to fall down when the wording around one particular concept varies too much. In a scientific paper you might expect to find words like “gastroenteritis”, “stomach” and “virus”, whereas in newspapers discussing the same topic you might find “tummy”, “sick” and “bug”. A vanilla LDA implementation might struggle to understand that these concepts are linked unless the contextual information around the words is similar (e.g. both articles have “uncooked meat” and “symptoms last 24 hours”).</p>
<p>We define a set of toy documents covering three main topics: sickness, health and going to the gym. Half of the documents are written in “layman's” English and the other half in “scientific” English. The documents are shown below:</p>
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']</pre>
<p>Using a normal implementation of LDA with 3 topics, we get the following results after 30 iterations:</p>
<figure id="attachment_174" aria-describedby="caption-attachment-174" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="size-medium wp-image-174" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&#038;ssl=1" alt="" width="300" height="209" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?w=482&ssl=1 482w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-174" class="wp-caption-text">Vanilla LDA results</figcaption></figure>
<p>It is fair to say that Vanilla LDA didn't do a terrible job, but it did end up making some strange decisions, like putting “poisoning” (as in food poisoning) in with “cardio” and “calories”. The other two topics seem fairly consistent and sensible.</p>
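<p>For reference, a minimal sketch of how such a vanilla run could be reproduced is shown below. It uses gensim's LdaModel purely as an illustration; this is an assumption rather than the exact code behind the screenshot:</p>
<pre lang="python"># A minimal vanilla LDA sketch over the toy corpus above, using gensim
# as an illustrative implementation (an assumption; any LDA library would do).
from gensim import corpora, models

docs = [doc1, doc2, doc3, doc4, doc5, doc6,
        doc7, doc8, doc9, doc10, doc11]

# map each word to an integer id and convert documents to bags-of-words
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# train with 3 topics for 30 iterations, matching the run above
lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=3, iterations=30)

# print the most probable words in each topic
for topic_id in range(3):
    print(lda.print_topic(topic_id))
</pre>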
<h2 id="diatm">DiaTM</h2>
<p>Crain et al.'s 2010 paper <a href="http://www.ncbi.nlm.nih.gov/pubmed/21346955"><em><strong>“Dialect topic modeling for improved consumer medical search”</strong></em></a> proposes a modified LDA model that they call “DiaTM”.</p>
<figure id="attachment_176" aria-describedby="caption-attachment-176" style="width: 286px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-176" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&#038;ssl=1" alt="" width="286" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&ssl=1 286w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=768%2C805&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?w=910&ssl=1 910w" sizes="(max-width: 286px) 100vw, 286px" data-recalc-dims="1" /><figcaption id="caption-attachment-176" class="wp-caption-text">A diagram showing how the latent variables in DiaTM are linked together</figcaption></figure>
<p>DiaTM works in the same way as LDA but also introduces the concepts of collections and dialects. A collection defines a set of documents from the same source or written in a similar dialect; you can imagine having a collection of newspaper articles and a collection of scientific papers, for example. Dialects are a bit like topics: each word is effectively “generated” from a dialect, and the probability of a dialect being used is defined at the collection level.</p>
<p>The handy thing is that every word has a probability of appearing in each dialect, which is learned by the model. This means that words common to all dialects (such as “diet” or “food”) can be weighted as such in the model.</p>
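<p>To make that generative story concrete, here is a toy sketch of how a single document would be generated under such a model. All of the names here are hypothetical illustrations, not the paper's notation or the library's internals:</p>
<pre lang="python"># Toy sketch of the DiaTM generative story described above
# (hypothetical names, for illustration only).
import numpy as np

rng = np.random.default_rng(42)

def generate_document(length, topic_probs, dialect_probs, word_probs, vocab):
    # topic_probs: the document's topic distribution
    # dialect_probs: the dialect distribution of the document's *collection*
    # word_probs[z][d]: word distribution for topic z under dialect d, so a
    #   word common to all dialects can score highly in every word_probs[z][d]
    words = []
    for _ in range(length):
        z = rng.choice(len(topic_probs), p=topic_probs)      # draw a topic
        d = rng.choice(len(dialect_probs), p=dialect_probs)  # draw a dialect
        w = rng.choice(len(vocab), p=word_probs[z][d])       # draw a word
        words.append(vocab[w])
    return words
</pre>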
<p>Running DiaTM on the same corpus as above yields the following results:</p>
<figure id="attachment_178" aria-describedby="caption-attachment-178" style="width: 660px" class="wp-caption alignright"><img loading="lazy" class="wp-image-178 size-large" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=660%2C177&#038;ssl=1" alt="" width="660" height="177" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=1024%2C275&ssl=1 1024w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=300%2C81&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=768%2C206&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?w=1334&ssl=1 1334w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption id="caption-attachment-178" class="wp-caption-text">Results of DiaTM on the sickness/exercise corpus</figcaption></figure>
<p>You can see how the model has effectively identified the three key topics in the documents above but has also segmented the topics by dialect. Topic 2 is mainly concerned with food poisoning and sickness. In dialect 0 the words “sick” and “bad” appear, but in dialect 1 the words “vomit” and “gastroenteritis” appear.</p>
<h2 id="open-source-implementation">Open Source Implementation</h2>
<p>I have tried to turn my experiment into a Python library that others can make use of. It is currently at an early stage and a little slow, but it works. The code is <a href="https://github.com/ravenscroftj/diatm">available here</a> and pull requests are very welcome.</p>
<p>The library offers a Scikit-Learn-like interface where you fit the model to your data like so:</p>
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']
collection1 = [doc1,doc2,doc3, doc7, doc11]
# 'scientific' documents
collection2 = [doc4,doc5,doc6, doc8, doc9, doc10]
collections = [collection1, collection2]
dtm = DiaTM(n_topic=3, n_dialects=2)
dtm.fit(X)
</pre>
<p>Fitting the model to new documents using transform() will be available soon, as will finding the log probability of the current model parameters.</p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/lda">lda</a></li>
<li><a href="/tags/machine-learning">machine learning</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/topic-model">topic model</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments powered by Disqus.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>© James Ravenscroft 2021 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>