---
author: James
date: 2017-07-25 11:02:42+00:00
medium_post:
- O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}
post_meta:
- date
tags:
- machine learning
- python
- topic model
- PhD
- open source
title: Dialect Sensitive Topic Models
type: posts
url: /2017/07/25/dialect-sensitive-topic-models/
---
As part of my PhD I’m currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectical styles, such as scientific papers versus newspaper articles. If you’re new to the concept of topic modelling then [this article][1] can give you a quick primer.
## Vanilla LDA
doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']
Using a normal implementation of LDA with 3 topics, we get the following results after 30 iterations:
doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']
collection1 = [doc1,doc2,doc3, doc7, doc11]
# 'scientific' documents
collection2 = [doc4,doc5,doc6, doc8, doc9, doc10]
collections = [collection1, collection2]
dtm = DiaTM(n_topic=3, n_dialects=2)
dtm.fit(X)
Fitting the model to new documents using transform() will be available soon as will finding the log probability of the current model parameters.
[1]: http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html
[2]: http://dl.acm.org/citation.cfm?id=2133826
[3]: http://www.ncbi.nlm.nih.gov/pubmed/21346955
[4]: https://github.com/ravenscroftj/diatm