<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>A week in Austin, TX Watson Labs - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="A week in Austin, TX Watson Labs">
<meta itemprop="description" content="At the beginning of the month, I was lucky enough to spend a week embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super secret bat-cave known as “The Garage” located on the IBM Austin site to which access is prohibited for normal IBMers unless accompanied by a labs team member.
During the week I was helping out with a couple of the internal projects but also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use."><meta itemprop="datePublished" content="2015-10-22T18:10:57&#43;00:00" />
<meta itemprop="dateModified" content="2015-10-22T18:10:57&#43;00:00" />
<meta itemprop="wordCount" content="957">
<meta itemprop="keywords" content="alchemy,taxonomy,watson," /><meta property="og:title" content="A week in Austin, TX Watson Labs" />
<meta property="og:description" content="At the beginning of the month, I was lucky enough to spend a week embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super secret bat-cave known as “The Garage” located on the IBM Austin site to which access is prohibited for normal IBMers unless accompanied by a labs team member.
During the week I was helping out with a couple of the internal projects but also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2015/10/22/a-week-in-austin-tx-watson-labs/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2015-10-22T18:10:57&#43;00:00" />
<meta property="article:modified_time" content="2015-10-22T18:10:57&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="A week in Austin, TX Watson Labs"/>
<meta name="twitter:description" content="At the beginning of the month, I was lucky enough to spend a week embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super secret bat-cave known as “The Garage” located on the IBM Austin site to which access is prohibited for normal IBMers unless accompanied by a labs team member.
During the week I was helping out with a couple of the internal projects but also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">22</span>
<span class="rest">Oct 2015</span>
</div>
</div>
<div class="matter">
<h1 class="title">A week in Austin, TX Watson Labs</h1>
</div>
</div>
<div class="markdown">
<p>At the beginning of the month, I was lucky enough to spend a week embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super-secret bat-cave known as “The Garage”, located on the IBM Austin site, to which access is prohibited for normal IBMers unless accompanied by a labs team member.</p>
<p>During the week I helped out with a couple of internal projects and also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use. The tools themselves are internal, but I can share a couple of the general techniques that I used, since I think they might be useful for a number of applications.</p>
<h2 id="technique-number-1-query-expansion-using-part-of-speech-tagging-and-the-concept-expansion-api">Technique number 1: query expansion using Part-of-speech tagging and the Concept Expansion API.</h2>
<h3 id="introduction">Introduction</h3>
<p>The idea here was to address the fact that a user might phrase their question using words that are synonymous with, but different from, the words used in the data being searched.</p>
<p>Our <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/">Retrieve And Rank</a> service makes use of Apache <a href="http://lucene.apache.org/solr/">SOLR</a>, which already offers <a href="http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/">synonym expansion within queries</a>. However, I found that adding the <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/concept-expansion.html">Concept Expansion</a> service (which builds a thesaurus from large corpora discussing related concepts) came up with some synonyms that SOLR didn’t. This might be because the SOLR query expansion system uses <a href="http://www.ncbi.nlm.nih.gov/mesh">MeSH</a>, which is a formal medical ontology, whereas Concept Expansion (or at least the demo) uses a corpus of Twitter data, which offers a lot more informal word pairings and implicit links. For example, feeding “<a href="https://en.wikipedia.org/wiki/Michael_Jackson">Michael Jackson</a>” into Concept Expansion will give you outputs like “<a href="https://en.wikipedia.org/wiki/Stevie_Nicks">Stevie Nicks</a>” and “<a href="https://en.wikipedia.org/wiki/Bruce_Springsteen">Bruce Springsteen</a>”, who are both musicians who released music in roughly the same era as Michael Jackson. By contrast, Michael Jackson is (perhaps unsurprisingly) not present in the MeSH ontology.</p>
<p>“Stevie Nicks” might not be directly relevant to those who are looking for “Michael Jackson”, but (and those of you who are music fans might know where I’m going next) the answer to the question “Who did Michael Jackson perform alongside at Bill Clinton’s 1993 inaugural ball?” is <a href="https://www.youtube.com/watch?v=h91glweLuBw">Fleetwood Mac</a>, for whom Stevie Nicks sings. That said, my question is specific enough that the keywords “bill clinton, 1993, inaugural ball, michael jackson” get you the right answer in Google, albeit at position 2 in the results. So there is definitely some value in using Concept Expansion for this purpose, even if you have to be very clever and careful about matching up context around queries.</p>
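<p>To make the idea concrete, here is a minimal sketch of how expanded terms might be folded into a Lucene/Solr-style query string. The <code>expansions</code> dictionary is a hand-written stand-in for whatever Concept Expansion returns, and the function name and the boost value of 0.5 are illustrative assumptions, not what I actually ran in Austin.</p>

```python
def build_expanded_query(terms, expansions, boost=0.5):
    """Build a Lucene/Solr-style query string from terms and related terms.

    `expansions` is a hypothetical mapping from a term to related terms,
    standing in for the output of a service such as Concept Expansion.
    """
    clauses = []
    for term in terms:
        # The original term is kept as an unboosted phrase.
        alternatives = ['"{}"'.format(term)]
        for related in expansions.get(term, []):
            # Related terms get a lower boost so they count for less
            # than the words the user actually typed.
            alternatives.append('"{}"^{}'.format(related, boost))
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


# Hand-written expansions, for illustration only.
expansions = {"michael jackson": ["stevie nicks", "bruce springsteen"]}
query = build_expanded_query(["michael jackson", "inaugural ball"], expansions)
print(query)
```

<p>Down-weighting the expanded terms is one way of being “careful” here: a document about Stevie Nicks can still match, but a document that mentions Michael Jackson directly should usually score higher.</p>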
<h3 id="implementation">Implementation</h3>
<p>The first problem you face using this approach is in choosing which words to send off to Concept Expansion and which ones not to bother with. We’re not interested in <a href="https://en.wikipedia.org/wiki/Stop_words">stopwords</a> or personal pronouns (putting “we” into Concept Expansion comes back with interesting results like “testinitialize”, “usaian” and “linux preinstallation” because of the vast amount of noise around pronouns on Twitter). We are more interested in nouns like “Chair”, entities and people like “Michael Jackson”, adjectives like “enigmatic” and verbs like “going”. All of these words and phrases are things that could be expanded upon in some way to make our information retrieval query more useful.</p>
<p><img loading="lazy" class="alignright size-medium wp-image-33" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-19-165725.png?resize=300%2C249&#038;ssl=1" alt="Screenshot from 2015-10-19 16:57:25" width="300" height="249" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-19-165725.png?resize=300%2C249&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-19-165725.png?w=781&ssl=1 781w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />To get around this problem I used the <a href="http://nlp.stanford.edu/software/tagger.shtml">Stanford Part of Speech Tagger</a> to annotate the queries and only sent words labelled as one of the above-mentioned types to the service. Asking “how much does the CEO earn?” yields something like the output to the right.</p>
<p>Another problem I ran into very quickly was dealing with nouns consisting of multiple words, for example “Michael Jackson”. In my code, I assume that any words tagged as nouns that sit next to each other are part of the same object and should be treated as a single phrase. This assumption seems to have worked so far for my limited set of test data.</p>
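<p>The filtering step can be sketched roughly as follows. This assumes the query has already been tagged with Penn Treebank tags (the tag set the Stanford tagger uses for English). The tags below are written by hand for illustration, and <code>STOPWORDS</code> is a tiny made-up list rather than a real stopword set.</p>

```python
# Penn Treebank tag prefixes for the word classes worth expanding:
# nouns (NN*), adjectives (JJ*) and verbs (VB*).
EXPANDABLE_PREFIXES = ("NN", "JJ", "VB")

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"how", "much", "does", "do", "is", "the", "a", "an", "of"}


def expandable_words(tagged_tokens):
    """Return words tagged as noun, adjective or verb that are not stopwords."""
    return [
        word
        for word, tag in tagged_tokens
        if tag.startswith(EXPANDABLE_PREFIXES) and word.lower() not in STOPWORDS
    ]


# Hand-written tags for illustration; a real tagger's output may differ.
tagged = [
    ("how", "WRB"), ("much", "JJ"), ("does", "VBZ"), ("the", "DT"),
    ("CEO", "NNP"), ("earn", "VB"), ("?", "."),
]
words = expandable_words(tagged)
print(words)
```

<p>Personal pronouns carry the tag PRP, so they never match any of the prefixes and are dropped without needing a special case.</p>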
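<p>A minimal sketch of that adjacent-noun assumption, again working on hand-tagged (word, tag) pairs with Penn Treebank tags. The function name is hypothetical, and keeping the tag of the first word in each run is just a simplification for the example.</p>

```python
def merge_adjacent_nouns(tagged_tokens):
    """Join runs of consecutive noun-tagged words into single phrases."""
    merged = []
    previous_was_noun = False
    for word, tag in tagged_tokens:
        is_noun = tag.startswith("NN")
        if is_noun and previous_was_noun:
            # Extend the phrase started by the previous noun,
            # keeping the tag of the first word in the run.
            last_word, last_tag = merged[-1]
            merged[-1] = (last_word + " " + word, last_tag)
        else:
            merged.append((word, tag))
        previous_was_noun = is_noun
    return merged


# Hand-written tags for illustration; a real tagger's output may differ.
tagged = [("Who", "WP"), ("is", "VBZ"), ("Michael", "NNP"), ("Jackson", "NNP"), ("?", ".")]
phrases = merge_adjacent_nouns(tagged)
print(phrases)
```

<p>The obvious weakness is that two unrelated nouns that happen to sit next to each other get glued together too, which is why I only claim it works on my limited test data.</p>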
<h2 id="alchemy-api-and-taxonomy-distance">Alchemy API and Taxonomy Distance</h2>
<p>Another small piece of work I carried out was around measuring how “similar” two documents are at a very high level, based on their distance in the Alchemy API taxonomy. If you didn’t know already, Alchemy has an API for classifying a document into a high-level taxonomy. This can often give you a very early indication of how likely that document is to contain information relevant to your use case or requirements. For example, a document tagged “automotive manufacturers” is unlikely to contain medical text or instructions on sewing and embroidery.</p>
<p><a href="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?ssl=1"><img loading="lazy" class="size-medium wp-image-36 alignleft" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?resize=300%2C193&#038;ssl=1" alt="Screenshot from 2015-10-22 19:04:56" width="300" height="193" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?resize=300%2C193&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?w=901&ssl=1 901w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /></a>The taxonomy is a tree structure which contains a huge list of <a href="http://www.alchemyapi.com/products/alchemylanguage/taxonomy">different categories and subcategories</a>. The idea here was to walk the tree from the category “node” assigned to one document to the category assigned to the second document and count the steps: more steps means further away. So for each document I made an Alchemy API call to get its taxonomy class. Then I split on “/” characters and counted how far away A is from B. It’s pretty straightforward. To the left you can see that a question about burgers and a question about salad dressings are roughly “2” categories away from each other: moving up to food from fast food counts as one jump and moving back down to condiments and dressing counts as another.</p>
<p><img loading="lazy" class="alignright size-medium wp-image-35" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-185815.png?resize=300%2C166&#038;ssl=1" alt="Screenshot from 2015-10-22 18:58:15" width="300" height="166" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-185815.png?resize=300%2C166&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-185815.png?w=536&ssl=1 536w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />Interestingly, the API did seem to struggle with some questions. I used “What was the market share of Ford in Australia?” for my first document and “What type of car should I buy?” as my second, and got <code>/automotive and vehicles/vehicle brands/ford</code> for my first classification and <code>/finance/personal finance/insurance/car</code> for my second. I have a suspicion that this API is not set up for dealing with short documents like questions and that this confused it, but I need to do some further testing.</p>
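<p>The split-and-count step can be sketched as below. The two labels are made up to match the burger and salad dressing example rather than copied from a real API response, and the function name is hypothetical.</p>

```python
def taxonomy_distance(label_a, label_b):
    """Count the steps needed to walk the tree from one label to the other."""
    # Split on "/" and drop the empty string produced by the leading slash.
    path_a = [part for part in label_a.split("/") if part]
    path_b = [part for part in label_b.split("/") if part]

    # Length of the shared prefix, i.e. the depth of the deepest common ancestor.
    shared = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        shared += 1

    # Steps up from A to the common ancestor, plus steps down to B.
    return (len(path_a) - shared) + (len(path_b) - shared)


# Illustrative labels, assumed for this example.
burgers = "/food and drink/food/fast food"
dressings = "/food and drink/food/condiments and dressing"
print(taxonomy_distance(burgers, dressings))
```

<p>Two labels that share no top-level category at all simply have a distance equal to the sum of their depths, since the walk has to go all the way up to the root and back down.</p>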
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/alchemy">alchemy</a></li>
<li><a href="/tags/taxonomy">taxonomy</a></li>
<li><a href="/tags/watson">watson</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft 2020 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>