add wordpress content

This commit is contained in:
James Ravenscroft 2020-12-28 11:39:11 +00:00
parent 86f2e104eb
commit 3dc3c77901
55 changed files with 4451 additions and 1 deletions

3
.gitignore vendored

@ -1,7 +1,8 @@
# ---> Hugo
# Generated files by hugo
/public/
/resources/_gen/
/brainsteam/public
/brainsteam/resources/_gen/
# Executable may be added to repository
hugo.exe

3
.gitmodules vendored Normal file

@ -0,0 +1,3 @@
[submodule "brainsteam/themes/hugo-ink"]
path = brainsteam/themes/hugo-ink
url = https://github.com/knadh/hugo-ink.git


@ -0,0 +1,6 @@
---
title: "{{ replace .Name "-" " " | title }}"
date: {{ .Date }}
draft: true
---

51
brainsteam/config.toml Normal file

@ -0,0 +1,51 @@
baseURL = "http://brainsteam.co.uk/"
languageCode = "en-us"
title = "Brainsteam"
theme='hugo-ink'
[markup.goldmark.renderer]
unsafe= true
[params]
subtitle = "The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way."
avatar = "/images/avatar.png"
[[menu.main]]
name = "Home"
url = "/"
weight = 1
[[menu.main]]
name = "All posts"
url = "/posts"
weight = 2
[[menu.main]]
name = "Tags"
url = "/tags"
weight = 4
[[menu.main]]
name = "My Home Page"
url = "https://jamesravey.me"
weight = 3
[[params.social]]
name = "Twitter"
icon = "twitter"
url = "https://twitter.com/jamesravey/"
[[params.social]]
name = "Github"
icon = "github"
url = "https://github.com/ravenscroftj"
[[params.social]]
name = "RSS"
icon = "rss"
url = "/index.xml"
[taxonomies]
tag = "tags"


@ -0,0 +1,28 @@
---
title: Bedford Place Vintage Festival
author: James
type: post
date: 2015-06-28T10:36:28+00:00
url: /2015/06/28/bedford-place-vintage-festival/
categories:
- Lindyhop
tags:
- bic
- festival
- lindyhop
- shimsham
- simone
- southampton
- vintage
format: video
---
Last week a bunch of my lindyhop group went and performed at the Bedford Place Vintage Festival in Southampton – it's an annual event that I've been to twice now and we had an absolute ball.
I think I enjoyed it that much more this year purely because I’ve been dancing twice as long now and I can hold my own on the social dance floor.
Here’s a video of our crew performing the Shim Sham to “Mama do the hump”
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/zMYHAuvuImw?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>


@ -0,0 +1,32 @@
---
title: Tidying up XML in one click
author: James
type: post
date: 2015-06-28T10:24:33+00:00
url: /2015/06/28/tidying-up-xml-in-one-click/
categories:
- PhD
- Work
tags:
- processing
- sapienta
- tidy
- xml
---
When I'm working on Partridge and SAPIENTA, I find myself dealing with a lot of badly formatted XML. I used to manually run _xmllint --format_ against every file before opening it, but that gets annoying very quickly (even if you have it saved in your bash history). So I decided to write a Nemo script that does it automatically for me.
<pre lang="bash">#!/bin/sh
for xmlfile in $NEMO_SCRIPT_SELECTED_FILE_PATHS; do
if [[ $xmlfile == *.xml ]]
then
xmllint --format $xmlfile > $xmlfile.tmp
rm $xmlfile
mv $xmlfile.tmp $xmlfile
fi
done
</pre>
Pop that in a file called "Tidy XML" in your ~/.local/share/nemo/scripts directory and when you inspect files with Nemo it should appear in the right-click menu.


@ -0,0 +1,77 @@
---
title: SSSplit Improvements
author: James
type: post
date: 2015-07-15T19:33:29+00:00
url: /2015/07/15/sssplit-improvements/
categories:
- PhD
- Work
tags:
- demo
- improvements
- java
- partridge
- python
- regex
- sapienta
- split
- sssplit
- test
---
## Introduction
As part of my continuing work on [Partridge][1], I've been working on improving the sentence splitting capability of SSSplit – the component used to split academic papers from PLoS ONE and PubMed Central into separate sentences.
Papers arrive in our system as big blocks of text with the occasional diagram, formula or citation, and in order to apply CoreSC annotations to the sentences we need to know where each sentence starts and ends. Of course that means we also have to take into account the other 'stuff' (listed above) floating around in the documents too. We can't just ignore formulae and citations – they're pretty important! That's what SSSplit does. It carves up papers into sentence (_<s>_) elements whilst also leaving the XML structure of the rest of the document intact.
The original SSSplit utility was written a number of years ago in Java and used regular expressions to parse XML (something that readers of [this StackOverflow article][2] will already know has a propensity to summon eldritch abominations from the otherworld). The old splitter was not particularly performant, especially given the complex nature of some of the expressions (if you're interested, check out one of the _simpler_ ones [here][3]).
Now, I can definitely see what the original authors were going for here. Regular expressions are very good for splitting sentences but not sentences inside complex XML documents. XML parsers are not particularly good for splitting sentences but are obviously good at parsing XML. I also understand that the original splitter was designed and then had new bits glued on to make it suitable for new and different XML standards, leading to gargantuan expressions like the one linked above. I think they did a pretty good job given the information available to them at the time of writing.
I decided that the splitter needed a rewrite and went straight to my comfort zone to get it done: Python. I'm very familiar with the language – to the point now that I can write a fairly complicated program in it in a day if I've had enough coffee and sugar.
## Writing SSSplit 2.0
I decided that we needed to try and minimise excessive use of regular expressions for both performance and maintenance/readability reasons. I decided to try and do as much of the parsing of the document structure as possible using a traditional XML parser. I'd heard good things about [etree][4], which is part of the standard Python library and provides an informal DOM-like interface. I used etree to inspect what I dubbed 'P-level' XML elements first. These are elements that I consider to be at a "paragraph" level. Any sentences inside these elements are completely contained – they do not cross the boundaries into the next container (unless the author is a poet/fiction writer/doesn't do English very well, I think it's a safe bet that they wouldn't finish a paragraph mid-sentence). Within the p-level containers, I sweep for any sort of XML node – we're interested in text nodes but also any sort of formatting like bold (<b>) elements.
When a text node is encountered, that's when regular expressions start to kick in. We do a very simple match for punctuation just in front of a space and a capital letter and run it over the text node – these are "potential" splits. We also look for full stops at the very end of the text.
<pre lang="python">pattern = re.compile('(\.|\?|\!)(?=\s*[A-Z0-9$])|\.$')
m = pattern.search(txt)
</pre>
Of course this generates lots of false positives – what if we've found a decimal point inside a number? What if it's an abbreviation like e.g. or i.e. or an initial like J. Ravenscroft? There is another regular expression check for decimal points, and the string around the punctuation is checked against a list of common abbreviations. There's also a list of authors – both the writers of the paper in question and those who are cited in it. The function checks that the full stop is not part of one of these authors' names.
There's an important factor to remember: a text node does not imply a finished sentence – text nodes are interspersed with formulae and references as explained above. Therefore we can't just finish the current sentence when we reach the end of a text node – only when we encounter a full stop (not part of an abbreviation or number), question mark or exclamation mark. We also know that we can complete the current sentence at the end of a p-level container, as I explained above.
Every time we start parsing a sentence, text nodes and other 'stuff' deemed to be inside that sentence are accumulated into a list. Once we encounter the end of the sentence, the list is glued together and turned into an XML <s> element.
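Stripped of the XML handling, the accumulation loop looks roughly like the sketch below – is_real_split stands in for the abbreviation/number/author-name checks described above, and each chunk is either plain text or an already-parsed inline element:
<pre lang="python">import re

SPLIT_PATTERN = re.compile(r'(\.|\?|\!)(?=\s*[A-Z0-9$])|\.$')

def split_chunks(chunks, is_real_split):
    """Accumulate chunks into sentences, only closing a sentence at a
    confirmed split point or at the end of the paragraph-level container."""
    sentences, current = [], []
    for chunk in chunks:
        if not isinstance(chunk, str):
            # formulae, citations and other inline elements stay in the sentence
            current.append(chunk)
            continue
        last = 0
        for match in SPLIT_PATTERN.finditer(chunk):
            # is_real_split() stands in for the abbreviation/number/author checks
            if is_real_split(chunk, match):
                current.append(chunk[last:match.end()])
                sentences.append(current)
                current, last = [], match.end()
        current.append(chunk[last:])
    if any(str(c).strip() for c in current):
        sentences.append(current)   # close the final sentence at container end
    return sentences
</pre>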
The next step was to see how effective the new splitter was against the old splitter and also manual annotation by professional scientific literature readers.
## Testing the splitter
To test the system I originally wrote a simple script that takes a set of manually annotated papers – strips them of their annotations so that the new splitter doesn't get any clues – runs the new routine over them and then compares the output. This was very rudimentary, as I was in a rush, and didn't tell me much about the success rate of my splitter. It did display the first and last words of each "detected" sentence for both manual and automatic annotation, so I could at least see how well (if at all) the two lined up. I had to run the script on a paper-by-paper basis.
I managed to get the splitter working really well on a number of papers (we&#8217;re talking a 100% match) using this tool. However I realised that the majority of papers were still not being matched and it was becoming more and more of a chore to find which ones weren&#8217;t matching.
That&#8217;s why I decided to write a web-based visualisation tool for checking the splitter. The idea is that it runs on all papers giving an overall percentage of how well the automated splitter is working vs the manual splitter but also gives a per-paper figure. If you want to see which papers the system is really struggling with you can inspect them by clicking on them. This brings up a list of all the sentences and whether or not they align.
The tool is pretty useful as it gives me a clue as to which papers I need to tune the splitter with next.
Here&#8217;s a quick demo video of me using the tool to find papers that don&#8217;t match very well.
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/o1EpJ_zJcno?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>
## Next steps
A lot of tuning has been done on how this system works but there&#8217;s still a long way to go yet. I&#8217;ll probably post another article talking about what further changes had to be made to make the parser effective!
[1]: http://papro.org.uk
[2]: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
[3]: https://www.debuggex.com/r/vEyxqRg6xgN9ui_P
[4]: https://docs.python.org/2/library/xml.etree.elementtree.html


@ -0,0 +1,53 @@
---
title: CUSP Challenge Week 2015
author: James
type: post
date: 2015-08-30T16:52:59+00:00
url: /2015/08/30/cusp-challenge-week-2015/
categories:
- Lindyhop
tags:
- cdt
- cusp
- phd
- warwick
---
<figure id="attachment_23" aria-describedby="caption-attachment-23" style="width: 300px" class="wp-caption alignright">[<img loading="lazy" class="wp-image-23 size-medium" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/Warwick-1.jpg?resize=300%2C200&#038;ssl=1" alt="" width="300" height="200" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/Warwick-1.jpg?resize=300%2C200&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/Warwick-1.jpg?resize=1024%2C683&ssl=1 1024w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/Warwick-1.jpg?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/Warwick-1.jpg?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />][1]<figcaption id="caption-attachment-23" class="wp-caption-text">Warwick CDT intake 2015: From left to right &#8211; at the front Jacques, Zakiyya, Corinne, Neha and myself. Rear: David, John, Stephen (CDT director), Mo, Vaggelis, Malkiat and Greg</figcaption></figure>
Hello again readers – those of you who follow me on other social media (Twitter, Instagram, Facebook, etc.) probably know that I've just returned from a week in New York City as part of my PhD. My reason for visiting was a kind of ice-breaking activity called the CUSP (Centre for Urban Science + Progress) Challenge Week. This consisted of working with my PhD cohort (photographed) as well as the 80-something NYU students starting their Urban Science masters courses at CUSP to tackle urban data problems.
We were split into 20 random teams of 4 or 5 people and assigned an 'Urban Science' task. These tasks involved taking data sets – usually collected by CUSP staff members – and doing analysis on them. Our group had to investigate "Street Quality in New York City", which turned out to be analysing data on the city's potholes. The problem may sound a little dull but once you get going it actually gets quite exciting! Potholes cost NYC millions of dollars per year in litigation, and getting them fixed before someone falls into one or damages their car could save the city lots of money.
<figure id="attachment_24" aria-describedby="caption-attachment-24" style="width: 300px" class="wp-caption alignleft">[<img loading="lazy" class="size-medium wp-image-24" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/20150830084943.jpg?resize=300%2C214&#038;ssl=1" alt="An amusing image found by one of my pothole challenge teammates" width="300" height="214" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/20150830084943.jpg?resize=300%2C214&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/20150830084943.jpg?w=424&ssl=1 424w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />][2]<figcaption id="caption-attachment-24" class="wp-caption-text">An amusing image found by one of my pothole challenge teammates</figcaption></figure>
We were given a set of accelerometer readings tied to photographs captured by a device designed by our mentors [Varun][3] and [Graham][4]. The device is fitted to the dashboard of a car and takes 3D accelerometer readings every second as well as a photo as you drive along. The result is a dataset that roughly records where the driver encounters the most &#8220;bumpy&#8221; roads in the city. The dataset only covered one neighbourhood in Brooklyn known as Cobble Hill. However, it was still extensive enough to give us an idea of how you might identify poor quality roads in other areas of the city.
The other dataset we were given was the GPS coordinates of 311 complaints made about street quality during 2015. For those unfamiliar, 311 is a non-emergency hotline you can call in NYC to have a moan about some aspect of the city – no one's been to pick up my rubbish, there's a pothole in the road by my house, someone's graffitied the bus stop – that sort of thing. I think the closest parallel we have to 311 in the UK is NHS Direct – i.e. you call 111 if you have a cold and 999 if you're having a heart attack. Thanks to New York's open data initiative, we had geo-plots for all 'street quality' related queries right at our fingertips.
It was time to get stuck in. The team had a brainstorm about what questions we should try and answer based on the data available – we came up with around 20, but since we only had about 10 hours to investigate in total, we decided to restrict ourselves to three:
1. Can we find a correlation between the accelerometer data and 311 complaints in order to show that 311 complaints could be used as a proxy for potholes?
2. Does the number of 311 complaints correlate with population density or the average salary of nearby residents? We hypothesised that more prosperous, well-travelled areas of the city might be more eager to complain about road quality.
3. Can we train a machine learning system to recognise the presence or lack of potholes given input images and accelerometer data?
Linfeng – a teammate and budding applied mathematician – answered question three first. Using the accelerometer data and associated images, he set about building a binary classifier – a system that could take an X,Y,Z reading from the accelerometer and spit out a "yes, that's a pothole" or "no, that's not a pothole" reading. He did this by manually eye-balling all 843 images that the sensor snapped and putting each of the accelerometer readings into one or the other of the two categories. After training using 5-fold cross validation, the final classifier worked with something like 73% accuracy. We felt that this was pretty good for a first run and that there may be other features that could help this classification. However, our initial finding was – yes – it is definitely possible to train a machine learning classifier to detect potholes.
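For a rough idea of what that kind of pipeline looks like in code, here is a sketch in scikit-learn – the arrays are placeholders rather than the real challenge data, and the choice of classifier is arbitrary:
<pre lang="python"># Illustrative only: X stands in for the (n, 3) array of X,Y,Z accelerometer
# readings and y for the hand-labelled pothole / not-pothole classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(843, 3)           # placeholder accelerometer readings
y = np.random.randint(0, 2, 843)     # placeholder labels from the eye-balling

clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5)    # 5-fold cross validation
print("mean accuracy: {:.2f}".format(scores.mean()))
</pre>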
<figure id="attachment_25" aria-describedby="caption-attachment-25" style="width: 300px" class="wp-caption alignleft">[<img loading="lazy" class="size-medium wp-image-25" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/cobble_hill_potholes.png?resize=300%2C296&#038;ssl=1" alt="Potholes and 311 complaints in cobble hills." width="300" height="296" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/cobble_hill_potholes.png?resize=300%2C296&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/cobble_hill_potholes.png?w=567&ssl=1 567w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />][5]<figcaption id="caption-attachment-25" class="wp-caption-text">Potholes(blue) and 311 complaints(orange) in cobble hills. Click to open interactive map.</figcaption></figure>
The next task was to see if there was any correlation between 311 complaints and the pothole data. We used our classifier to only consider points determined to be potholes. The sensor data was recorded on the 3rd of May. Therefore, we also decided to filter 311 complaints by time of report. We thought it was reasonable to assume that potholes found in May would have been reported in April at the earliest and fixed by June. Including 311 complaints from too far in the past or future would add noise to our investigation and slow things down.
We overlaid the 311 data onto the map of potholes to see how well they lined up. There was some loose correlation, but the two maps did not line up brilliantly well. Upon reflection we realised that the GPS location associated with a 311 complaint represents where the call was made from rather than the location of the pothole the call was about. It is a fair assertion that most people would wait to make such a call from a safe location rather than stopping in the road as soon as they encounter a hole. We also realised that multiple calls could be made regarding the same pothole but from different locations. These two assertions validate the need for a more granular data capture system like Graham and Varun's.
<figure id="attachment_26" aria-describedby="caption-attachment-26" style="width: 300px" class="wp-caption alignright">[<img loading="lazy" class="wp-image-26 size-medium" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/complaint_and_pop_density.png?resize=300%2C275&#038;ssl=1" alt="complaints and population density. Click for an interactive map" width="300" height="275" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/complaint_and_pop_density.png?resize=300%2C275&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/complaint_and_pop_density.png?w=816&ssl=1 816w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />][6]<figcaption id="caption-attachment-26" class="wp-caption-text">complaints and population density. Click for an interactive map</figcaption></figure>
Finally we looked into the issue of Population Density vs. Potholes. I struggled for a while to find a map of population density and ended up having to make my own. New York City has a map of what it calls Neighbourhood Tabulation Areas (NTAs). These are small geographical areas used to tabulate statistics for census data, i.e. each NTA has its own population density figure. I found a dataset for NTAs covering New York City and another dataset for population density by NTA. I was able to 'wrangle' the two datasets together and plot them on a map. I then did some geo-SQL – summing 311 pothole complaints for each NTA and storing the result in a database table. This allowed me to plot a map showing both population density and 311 complaint 'density' for the whole of New York. Interestingly (but perhaps not surprisingly) the map shows a strong positive correlation between population and 311 complaints. However, Staten Island as a borough serves as an outlier – having a lower population but a large number of complaints. I read that Staten Island has higher vehicle ownership per capita and that this might explain the discrepancy. However, I did not have time to investigate this further. The population density and pothole complaint density correlation serves to further validate the need for more granular data collection. More people are complaining, but are they all complaining about the same potholes or are they just better at finding potholes? Are there more potholes in the road or just more people to complain about them? These are questions that could be answered with more data.
I had a great week at CUSP and would like to thank all of their staff for hosting (and putting up with) us.
[1]: https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/Warwick-1.jpg?ssl=1
[2]: https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/20150830084943.jpg?ssl=1
[3]: https://www.linkedin.com/pub/varun-adibhatla/3/356/b60
[4]: https://www.linkedin.com/profile/view?id=AAIAAApppTwBBKcswj9pC3ehsTLjnd_POJHUgro&authType=name&authToken=IJJp&trk=Skyline_click_NBM&sl=NBM%3B152521238%3A1440949464555%3B0%3B9820152%3B
[5]: https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/08/cobble_hill_potholes.png?ssl=1
[6]: http://cdb.io/1Jlmpfn


@ -0,0 +1,62 @@
---
title: A week in Austin, TX Watson Labs
author: James
type: post
date: 2015-10-22T18:10:57+00:00
url: /2015/10/22/a-week-in-austin-tx-watson-labs/
categories:
- Uncategorized
tags:
- alchemy
- austin
- labs
- questions
- rank
- retrieve
- taxonomy
- watson
---
At the beginning of the month, I was lucky enough to spend a week embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super secret bat-cave known as "The Garage" located on the IBM Austin site – to which access is prohibited for normal IBMers unless accompanied by a labs team member.
During the week I was helping out with a couple of the internal projects but also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use. However, I can share with you a couple of the general techniques that I used, since I think they might be useful for a number of applications.
## Technique number 1: query expansion using Part-of-speech tagging and the Concept Expansion API.
### Introduction
The idea here was to address the fact that a user might phrase their question using language that is synonymous with, but different from, the wording of the data being searched or queried.
Our [Retrieve And Rank][1] service makes use of Apache [SOLR][2], which already offers [synonym expansion within queries][3]. However, I found that adding this further capability using the [Concept Expansion][4] service (which builds a thesaurus from large corpora discussing related concepts) came up with some synonyms that SOLR didn't. This might be because the SOLR query expansion system uses [MeSH][5], which is a formal medical ontology, while Concept Expansion (or at least the demo) uses a corpus of Twitter data which offers a lot more informal word pairings and implicit links. For example, feeding "[Michael Jackson][6]" into Concept Expansion will give you outputs like "[Stevie Nicks][7]" and "[Bruce Springsteen][8]", who are both musicians who released music around the same sort of era as Michael Jackson. By contrast, Michael Jackson is (perhaps unsurprisingly) not present in the MeSH ontology.
Although "Stevie Nicks" might not be directly relevant to those who are looking for "Michael Jackson" – and those of you who are music fans might know where I'm going next – the answer to the question "Who did Michael Jackson perform alongside at Bill Clinton's 1993 inaugural ball?" is [Fleetwood Mac][9] – for whom Stevie Nicks sings (that said, my question is specific enough that the keywords "bill clinton, 1993, inaugural ball, michael jackson" get you the right answer in Google – albeit at position 2 in the results). So there is definitely some value in using Concept Expansion for this purpose, even if you have to be very clever and careful about matching up context around queries.
### Implementation
The first problem you face using this approach is in choosing which words to send off to Concept Expansion and which ones not to bother with. We're not interested in [stopwords][10] or personal pronouns (putting "we" into Concept Expansion comes back with interesting results like "testinitialize", "usaian" and "linux preinstallation" because of the vast amount of noise around pronouns on Twitter). We are more interested in nouns like "Chair", entities and people like "Michael Jackson", adjectives like "enigmatic" and verbs like "going". All of these words and phrases are things that could be expanded upon in some way to make our information retrieval query more useful.
To <img loading="lazy" class="alignright size-medium wp-image-33" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-19-165725.png?resize=300%2C249&#038;ssl=1" alt="Screenshot from 2015-10-19 16:57:25" width="300" height="249" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-19-165725.png?resize=300%2C249&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-19-165725.png?w=781&ssl=1 781w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />get around this problem &#8211; I used the [Stanford Part of Speech Tagger ][11]to annotate the queries and only sent words labelled as one of the above mentioned types to the service. Asking &#8220;how much does the CEO earn?&#8221; yields something like the output to the right.
Another problem I ran into very quickly was dealing with nouns consisting of multiple words, for example "Michael Jackson". In my code, I assume that any words tagged as nouns that reside next to each other are the same object and should be treated as such. This assumption seems to have worked so far for my limited set of test data.
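Boiled right down, the filtering step looks something like the sketch below – NLTK's built-in tagger stands in for the Stanford tagger here, and the Concept Expansion call itself is left as a placeholder:
<pre lang="python">import nltk

def expandable_terms(query):
    # requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages
    tagged = nltk.pos_tag(nltk.word_tokenize(query))
    terms, noun_run = [], []
    for word, tag in tagged:
        if tag.startswith('NN'):
            noun_run.append(word)          # adjacent nouns form one entity
            continue
        if noun_run:
            terms.append(' '.join(noun_run))
            noun_run = []
        if tag.startswith(('JJ', 'VB')):   # keep adjectives and verbs as well
            terms.append(word)
    if noun_run:
        terms.append(' '.join(noun_run))
    return terms

# each surviving term would then be sent off to the Concept Expansion API
print(expandable_terms("how much does the CEO earn?"))
</pre>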
## Alchemy API and Taxonomy Distance
Another small piece of work I carried out was around measuring how "similar" two documents are, at a very high level, based on their distance in the Alchemy API taxonomy. If you didn't know already, Alchemy has an API for classifying a document into a high-level taxonomy. This can often give you a very early indication of how likely that document is to contain information relevant to your use case or requirements. For example, a document tagged "automotive manufacturers" is unlikely to contain medical text or instructions on sewing and embroidery.
[<img loading="lazy" class="size-medium wp-image-36 alignleft" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?resize=300%2C193&#038;ssl=1" alt="Screenshot from 2015-10-22 19:04:56" width="300" height="193" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?resize=300%2C193&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?w=901&ssl=1 901w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />][12]The taxonomy is a tree structure which contains a huge list of [different categories and subcategories.][13] The idea here was to walk the tree between the category &#8220;node&#8221; assigned to one document to the category assigned to the second document and count the steps &#8211; more steps means further away.  So for each document I made an Alchemy API call to get its taxonomy class. Then I split on &#8220;/&#8221; characters and count how far away A is from B. It&#8217;s pretty straight forward. To the left you can see that a question about burgers and a question about salad dressings are roughly &#8220;2&#8221; categories away from each other &#8211; moving up to food from fast food counts as one jump and moving back down to condiments and dressing counts as another.
<img loading="lazy" class="alignright size-medium wp-image-35" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-185815.png?resize=300%2C166&#038;ssl=1" alt="Screenshot from 2015-10-22 18:58:15" width="300" height="166" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-185815.png?resize=300%2C166&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-185815.png?w=536&ssl=1 536w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />Interestingly the API did seem to struggle with some questions. I used &#8220;What was the market share of Ford in Australia?&#8221; for my first document and &#8220;What type of car should I buy?&#8221; as my second doc and got /automative and vehicle/vehicle brands/ford for my first classification and /finance/personal finance/insurance/car for my second. I have a suspicion that this API is not set up for dealing with short documents like questions and that confused it but I need to do some further testing.
[1]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/
[2]: http://lucene.apache.org/solr/
[3]: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
[4]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/concept-expansion.html
[5]: http://www.ncbi.nlm.nih.gov/mesh
[6]: https://en.wikipedia.org/wiki/Michael_Jackson
[7]: https://en.wikipedia.org/wiki/Stevie_Nicks
[8]: https://en.wikipedia.org/wiki/Bruce_Springsteen
[9]: https://www.youtube.com/watch?v=h91glweLuBw
[10]: https://en.wikipedia.org/wiki/Stop_words
[11]: http://nlp.stanford.edu/software/tagger.shtml
[12]: https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2015/10/Screenshot-from-2015-10-22-190456.png?ssl=1
[13]: http://www.alchemyapi.com/products/alchemylanguage/taxonomy


@ -0,0 +1,31 @@
---
title: SAPIENTA Web Service and CLI
author: James
type: post
date: 2015-11-01T19:50:52+00:00
url: /2015/11/01/sapienta-web-service-and-cli/
categories:
- PhD
tags:
- docker
- partridge
- sapienta
- script
- web
- websockets
---
Hoorah! After a number of weeks I've finally managed to get SAPIENTA running inside Docker containers on our EBI cloud instance. You can try it out at <http://sapienta.papro.org.uk/>.
The project was previously running via a number of very precarious scripts that had a habit of stopping and not coming back up. Hopefully the new Docker environment should be a lot more stable.
Another improvement I've made is to create a websocket interface for calling the service and a Python-based command-line client. If you're interested, I'm using [socket.io][1] and the relevant Python libraries ([server][2] and [client][3]). This means that anyone who needs to can now request annotations in large batches. I'm planning on using socket.io to interface Partridge with SAPIENTA, since they are hosted on separate servers and this approach avoids any complicated firewall issues.
This is also very good news for some of the data scientists who have wanted to use SAPIENTA but haven't wanted (or been able) to take on the 2-3 hour installation process that the full-blown server and pipeline require (and that's on a good day). The websocket CLI is a very simple Python script with few dependencies, so it should be up and running in 5 minutes or your money back.
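If you're curious what a socket.io exchange looks like from Python, it's roughly the sketch below – the event names are invented for illustration, and the real ones live in the client linked below:
<pre lang="python"># Rough sketch using socketIO-client; 'annotate' and 'annotation_result'
# are hypothetical event names - see the repository below for the real client.
from socketIO_client import SocketIO

def on_result(payload):
    print(payload)

with SocketIO('sapienta.papro.org.uk', 80) as io:
    io.on('annotation_result', on_result)
    io.emit('annotate', {'paper': open('paper.xml').read()})
    io.wait(seconds=30)
</pre>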
To get your hands on this tool, see [this Bitbucket repository][4].
[1]: http://socket.io/
[2]: https://github.com/miguelgrinberg/Flask-SocketIO/
[3]: https://pypi.python.org/pypi/socketIO-client
[4]: https://bitbucket.org/partridge/sapienta_wsclient


@ -0,0 +1,65 @@
---
title: 'Keynote at YDS 2015: Information Discovery, Partridge and Watson'
author: James
type: post
date: 2015-11-02T21:07:28+00:00
url: /2015/11/02/keynote-at-yds-2015-information-discovery-partridge-and-watson/
categories:
- PhD
- Work
tags:
- extraction
- ibm
- information
- papers
- partridge
- retrieval
- scientific
- watson
- yds
---
<div dir="ltr">
Here is a recording of my recent keynote talk on the power of natural language processing through Watson and my academic/PhD topic – Partridge – at the York Doctoral Symposium.
</div>
<div dir="ltr">
</div>
<div dir="ltr">
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/L4g4F9UDK64?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>
</div>
<li dir="ltr">
0-11 minutes &#8211; history of mankind, invention and the acceleration of scientific progress (warming people to the idea that farming out your scientific reading to a computer is a much better idea than trying to read every paper written)
</li>
<li dir="ltr">
11-26 minutes &#8211; My personal academic work &#8211; scientific paper annotation and cognitive scientific research using NLP
</li>
<li dir="ltr">
26- 44 minutes &#8211; Watson &#8211; Jeopardy, MSK and Ecosystem
</li>
<li dir="ltr">
44 &#8211; 48 minutes Q&A on Watson and Partridge
</li>
Please don&#8217;t cringe too much at my technical explanation of Watson &#8211; especially those of you who know much more about WEA and the original DeepQA setup than I do! This was me after a few days of reading the original 2011 and 2012 papers and making copious notes!
<div dir="ltr">
</div>
<div dir="ltr">
(Equally, please don't cringe too much about my history of US Presidents at 37:30 – I got Roosevelt and Reagan mixed up in my head!)
</div>
<div dir="ltr">
</div>
<div dir="ltr">
</div>


@ -0,0 +1,125 @@
---
title: Retrieve and Rank and Python
author: James
type: post
date: 2015-11-16T18:25:39+00:00
url: /2015/11/16/retrieve-and-rank-and-python/
categories:
- Work
tags:
- api
- cloud
- custom
- developer
- ecosystem
- fcselect
- ibm
- python
- query
- rank
- ranker
- retrieve
- services
- solr
- train
- watson
- wdc
---
## Introduction
Retrieve and Rank (R&R), if you hadn't already heard about it, is IBM Watson's new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works at a high level [here][1].
R&R is based on the Apache SOLR search engine with a machine learning result-ranking plugin that learns which answers are most relevant given an input query and presents them in the learnt "relevance" order.
Some of my partners have found that getting documents in and out of Retrieve and Rank is a little bit cumbersome using cURL and JSON files from the command line. Here I want to demonstrate a much easier way of managing your SOLR documents with [solrpy][2] – a wrapper around Apache SOLR in Python. Since R&R and SOLR are API compatible (until you start using and training the custom ranker), it is perfectly fine to use solrpy with R&R with a few special tweaks.
## Getting Started
**You will need:** an R&R instance with a cluster and collection already configured. I'm using a schema which has three fields – id, title and text.
Firstly you'll want to install the library – normally you could do this with pip. Unfortunately I had to make a small change to get the library to work with Retrieve and Rank, so you'll need to install it from my GitHub repo:
<pre>$ git clone git@github.com:ravenscroftj/solrpy.git
$ cd solrpy
$ python setup.py install</pre>
The next step is to run python and initialise your connection. The URL you should use to initialise your SOLR connection has the following structure:
<pre>https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;</pre>
You will also need the credentials from your bluemix service which should look something like this:
<pre>{
"credentials": {
"url": "https://gateway.watsonplatform.net/retrieve-and-rank/api",
"username": "&lt;USERNAME&gt;",
"password": "&lt;PASSWORD&gt;"
}
}</pre>
In Python you should try running the following (I am using the interactive Python shell [IDLE][3] for this example):
<pre>&gt;&gt;&gt; import solr
>&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
>&gt;&gt; s.search("hello world")
<em><strong>&lt;solr.core.Response object at 0x7ff77f91d7d0&gt;</strong></em></pre>
If this worked then you will see something like _**<solr.core.Response object at 0x7ff77f91d7d0> **_as output here. If you get an error response &#8211; try checking that you have substituted in valid values for <CLUSTER\_ID>, <COLLECTION\_NAME>, <USERNAME> and <PASSWORD>.
From this point onwards things get very easy. solrpy has simple functions for creating, removing and searching items in the SOLR index.
To add a document you can use the code below:
<pre>&gt;&gt;&gt; s.add({"title" : "test", "text" : "this is a test", "id" : 1})
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;167&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong>
>&gt;&gt; s.commit()
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;68&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
The XML output shows that the initial add and then commit operations were both successful.
## Content Management
You can also add a number of documents &#8211; this is specifically useful if you have a large number of python objects to insert into SOLR:
<pre>&gt;&gt;&gt; s.add_many( [ { "title" : x['title'], "text" : x['text'], "id" : i } for i,x in enumerate(my_list_of_items) ] )
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;20&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
Of course you can also delete items via their ID from python too:
<pre>&gt;&gt;&gt; s.delete(id=1)
<strong>'&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n&lt;response&gt;\n&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;0&lt;/int&gt;&lt;int name="QTime"&gt;43&lt;/int&gt;&lt;/lst&gt;\n&lt;/response&gt;\n'</strong></pre>
## Querying SOLR (unranked results)
And you can use SOLR queries too (but importantly note that this does not use the retrieve and rank rankers &#8211; this only gives you access to the SOLR rankers.)
<pre>&gt;&gt;&gt; r = s.select("test")
>&gt;&gt; r.numFound
<strong>1L
</strong>&gt;&gt;&gt; r.results
<strong>[{u'_version_': 1518020997236654080L, u'text': [u'this is a test'], u'score': 0.0, u'id': u'1', u'title': [u'test']}]</strong>
</pre>
## Querying Rankers
Provided you have [successfully trained a ranker ][4] and have the ranker ID handy, you can also query your ranker directly from Python using solrpy too.
<pre>&gt;&gt;&gt; import solr
>&gt;&gt; s = solr.Solr("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/&lt;CLUSTER_ID&gt;/solr/&lt;COLLECTION_NAME&gt;", http_user="&lt;USERNAME&gt;", http_pass="&lt;PASSWORD&gt;")
>&gt;&gt; fcselect = solr.SearchHandler(s, "/fcselect")
>&gt;&gt; r = fcselect("my query text", ranker_id="&lt;RANKER-ID&gt;")</pre>
In this case **r** is the same type as in the non-ranker example above; you can access the results via **r.results**.
## More information
For more information on how to use solrpy, visit their documentation page [here][5]
[1]: http://cmadison.me/2015/10/23/introducing-ibms-retrieve-and-rank-service/
[2]: https://github.com/edsu/solrpy
[3]: https://en.wikipedia.org/wiki/IDLE_(Python)
[4]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml#create-train
[5]: http://pythonhosted.org/solrpy/


@ -0,0 +1,175 @@
---
title: Spellchecking in retrieve and rank
author: James
type: post
date: 2015-11-17T21:41:09+00:00
url: /2015/11/17/spellchecking-in-retrieve-and-rank/
categories:
- Work
tags:
- checker
- improvements
- rank
- retrieve
- search
- solr
- spell
- spelling
- suggestions
- tuning
- watson
---
### Introduction
Being able to deal with typos and incorrect spellings is an absolute must in any modern search facility. Humans can be lazy and clumsy and I personally often search for things with incorrect terms due to my sausage fingers. In this article I will explain how to turn on spelling suggestions in Retrieve and Rank, so that if your users ask your system something with a clumsy query, you can suggest spelling fixes and they can submit another, more fruitful question to the system.
Spellchecking is a standard feature of Apache SOLR which is turned off by default in Retrieve and Rank. This post will walk through the process of turning it on for your instance and enabling spell-checking suggestions to be returned as part of calls to rankers through fcselect. Massive shout out to David Duffett on Stack Overflow, who posted [this answer][1] from which most of my blog post is derived.
### Enabling spell checking in your schema
The first thing we need to do is set up a spell checker field in our SOLR schema. For the sake of simplicity, the example schema used below only has a title and text field which are used in indexing and querying. However, this methodology can easily be extended to as many fields as your use case requires.
### Set up field type
The first thing you need to do is define a &#8220;textSpell&#8221; field type which SOLR can use to build a field into which it can dump valid words from your corpus that have been preprocessed and made ready for use in the spell checker. Create the following element in **your schema.xml** file:
<pre>&lt;fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true"&gt;
&lt;analyzer type="index"&gt;
&lt;tokenizer class="solr.StandardTokenizerFactory" /&gt;
&lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /&gt;
&lt;filter class="solr.LowerCaseFilterFactory" /&gt;
&lt;filter class="solr.StandardFilterFactory" /&gt;
&lt;/analyzer&gt;
&lt;analyzer type="query"&gt;
&lt;tokenizer class="solr.StandardTokenizerFactory" /&gt;
&lt;filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" /&gt;
&lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /&gt;
&lt;filter class="solr.LowerCaseFilterFactory" /&gt;
&lt;filter class="solr.StandardFilterFactory" /&gt;
&lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
This field type runs a lower case filter over the words provided in the input and also expands any synonyms defined in synonyms.txt and ignores any stopwords defined in stopwords.txt before storing the output in the field. This should give us a list of lower case words that are useful in search and spell checking.
### Create a spellcheck copy field in your schema
The next step is to create a &#8220;textSpell&#8221; field in your SOLR schema that stores the &#8220;suggestions&#8221; from the main content to be used by the spellchecker API.
The following XML defines the field in your schema and should be copied **into schema.xml.** It assumes that you have a couple of content fields called &#8220;title&#8221; and &#8220;text&#8221; from which content can be copied and filtered for use in the spell checker.
<pre>&lt;field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" /&gt;
&lt;copyField source="title" dest="spell"/&gt;
&lt;copyField source="text" dest="spell"/&gt;</pre>
### Defining the spellcheck search component
Once you have finished setting up your schema, you can define the spellchecker parameters **in solrconfig.xml.**
The following XML defines two spelling analysers. The DirectSolrSpellChecker pulls search terms directly from the index ad hoc – this means that it does not need to be regularly reindexed/rebuilt and always has up-to-date spelling suggestions.
[`WordBreakSolrSpellChecker` offers suggestions by combining adjacent query terms and/or breaking terms into multiple words.][2] This means that it can provide suggestions that DirectSolrSpellChecker might not find where, for example, a user has a spelling mistake in one of the words in a multi-word search term.
Notice that both lst elements contain a `<str name="field">spell</str>` element. This must map to the spell field we defined in the step above, so if you used a different name for your field, substitute it here.
[The documentation][2] provides more detail on how to configure the individual spell check components as well as some alternatives to Direct and Wordbreak which might be more useful depending on your own use case. **Your mileage may vary.**
<pre>&lt;searchComponent name="spellcheck" class="solr.SpellCheckComponent"&gt;
&lt;lst name="spellchecker"&gt;
&lt;str name="name"&gt;default&lt;/str&gt;
&lt;str name="field"&gt;spell&lt;/str&gt;
&lt;str name="classname"&gt;solr.DirectSolrSpellChecker&lt;/str&gt;
&lt;str name="distanceMeasure"&gt;internal&lt;/str&gt;
&lt;float name="accuracy"&gt;0.5&lt;/float&gt;
&lt;int name="maxEdits"&gt;2&lt;/int&gt;
&lt;int name="minPrefix"&gt;1&lt;/int&gt;
&lt;int name="maxInspections"&gt;5&lt;/int&gt;
&lt;int name="minQueryLength"&gt;4&lt;/int&gt;
&lt;float name="maxQueryFrequency"&gt;0.01&lt;/float&gt;
&lt;float name="thresholdTokenFrequency"&gt;.01&lt;/float&gt;
&lt;/lst&gt;
&lt;lst name="spellchecker"&gt;
&lt;str name="name"&gt;wordbreak&lt;/str&gt;
&lt;str name="classname"&gt;solr.WordBreakSolrSpellChecker&lt;/str&gt;
&lt;str name="field"&gt;spell&lt;/str&gt;
&lt;str name="combineWords"&gt;true&lt;/str&gt;
&lt;str name="breakWords"&gt;true&lt;/str&gt;
&lt;int name="maxChanges"&gt;10&lt;/int&gt;
&lt;/lst&gt;
&lt;/searchComponent&gt;</pre>
### Add spelling suggestions to your request handlers
The default SOLR approach is to add a new request handler that deals with searches on the **/spell** endpoint. However, there is no reason why you can't add spelling suggestions to any endpoint, including **/select** and, perhaps more relevantly in Retrieve and Rank, **/fcselect**. Below is a snippet of XML for a custom /spell endpoint:
<pre>&lt;requestHandler name="/spell" class="solr.SearchHandler" startup="lazy"&gt;
&lt;lst name="defaults"&gt;
&lt;!-- Solr will use suggestions from both the 'default' spellchecker
and from the 'wordbreak' spellchecker and combine them.
collations (re-written queries) can include a combination of
corrections from both spellcheckers --&gt;
&lt;str name="spellcheck.dictionary"&gt;default&lt;/str&gt;
&lt;str name="spellcheck.dictionary"&gt;wordbreak&lt;/str&gt;
&lt;str name="spellcheck"&gt;on&lt;/str&gt;
&lt;str name="spellcheck.extendedResults"&gt;true&lt;/str&gt;
&lt;str name="spellcheck.count"&gt;10&lt;/str&gt;
&lt;str name="spellcheck.alternativeTermCount"&gt;5&lt;/str&gt;
&lt;str name="spellcheck.maxResultsForSuggest"&gt;5&lt;/str&gt;
&lt;str name="spellcheck.collate"&gt;true&lt;/str&gt;
&lt;str name="spellcheck.collateExtendedResults"&gt;true&lt;/str&gt;
&lt;str name="spellcheck.maxCollationTries"&gt;10&lt;/str&gt;
&lt;str name="spellcheck.maxCollations"&gt;5&lt;/str&gt;
&lt;/lst&gt;
&lt;arr name="last-components"&gt;
&lt;str&gt;spellcheck&lt;/str&gt;
&lt;/arr&gt;
&lt;/requestHandler&gt;</pre>
The following snippet adds spellchecking suggestions to the **/fcselect** endpoint. Simply append the XML inside the `<requestHandler name="/fcselect" class="com.ibm.watson.hector.plugins.ss.FCSearchHandler">...</requestHandler>` markup area.
<pre>&lt;requestHandler name="/fcselect" class="com.ibm.watson.hector.plugins.ss.FCSearchHandler"&gt;
&lt;lst name="defaults"&gt;
&lt;str name="defType"&gt;fcQueryParser&lt;/str&gt;
&lt;str name="spellcheck.dictionary"&gt;default&lt;/str&gt;
&lt;str name="spellcheck.dictionary"&gt;wordbreak&lt;/str&gt;
&lt;str name="spellcheck.count"&gt;20&lt;/str&gt;
&lt;/lst&gt;
&lt;arr name="last-components"&gt;
&lt;str&gt;fcFeatureGenerator&lt;/str&gt;
&lt;str&gt;spellcheck&lt;/str&gt;
&lt;/arr&gt;
&lt;/requestHandler&gt;</pre>
### Create and populate your SOLR index in Retrieve and Rank
If you haven&#8217;t done this before, you should really read the [official documentation][3] and may want to read [my post about using python to do it too.][4]
You should also [train a ranker][5] so that you can take advantage of the fcselect with spelling suggestions example below.
### Test your new spelling suggestor
Once you&#8217;ve got your collection up and running you should be able to try out the new spelling suggestor. First we&#8217;ll inspect **/spell:**
<pre>$ curl -u $USER:$PASSWORD "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$CLUSTER_ID/solr/$COLLECTION_NAME/spell?q=businwss&wt=json"
{"responseHeader":{"status":0,"QTime":4},"response":{"numFound":0,"start":0,"docs":[]},"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,"endOffset":8,"origFreq":0,"suggestion":[{"word":"business","freq":3}]}],"correctlySpelled":false,"collations":["collation",{"collationQuery":"business","hits":3,"misspellingsAndCorrections":["businwss","business"]}]}}
</pre>
As you can see, the system has not found any documents containing the word **businwss**. However, it has identified **businwss** (easily misspelt because 'e' and 'w' are next to each other) as a typo of **business**. It has also suggested business as a correction. This can be presented back to the user so that they can refine their search and be presented with more results.
Now let's also look at how to use spellcheck with your ranker results.
<pre>$ curl -u $USER:$PASSWORD "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/$CLUSTER_ID/solr/$COLLECTION_NAME/fcselect?ranker_id=$RANKER_ID&q=test+splling+mstaek&wt=json&fl=id,title,score&spellcheck=true"
{"responseHeader":{"status":0,"QTime":4},"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},"spellcheck":{"suggestions":["businwss",{"numFound":1,"startOffset":0,"endOffset":8,"suggestion":["business"]}]}}</pre>
You should see something similar to the above. The SOLR search failed to return any results for the ranker to rank. However, it has come up with a spelling correction which should return more results for ranking next time.
[1]: http://stackoverflow.com/questions/6653186/solr-suggester-not-returning-any-results
[2]: https://cwiki.apache.org/confluence/display/solr/Spell+Checking
[3]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/get_start.shtml
[4]: https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/
[5]: https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/plugin_overview.shtml#generate_queries


@ -0,0 +1,30 @@
---
title: Scrolling in ElasticSearch
author: James
type: post
date: 2015-11-21T09:41:19+00:00
url: /2015/11/21/scrolling-in-elasticsearch/
categories:
- PhD
tags:
- elasticsearch
- lucene
- python
- results
- scan
- scroll
---
I know I&#8217;m doing a lot of flip-flopping between SOLR and Elastic at the moment &#8211; I&#8217;m trying to figure out key similarities and differences between them and where one is more suitable than the other.
The following is an example of how to map a function _**f**_ onto an entire set of indexed data in Elastic using the scroll API.
If you use Elastic, it is possible to do paging by adding a size and a from parameter. For example, if you wanted to retrieve results in pages of 5 starting from the 3rd page (i.e. show results 11-15), you would do:
<pre><span class="pln">GET </span><span class="pun">/</span><span class="pln">_search</span><span class="pun">?</span><span class="pln">size</span><span class="pun">=</span><span class="lit">5</span><span class="pun">&</span><span class="pln">from</span><span class="pun">=</span><span class="lit">10</span></pre>
However this becomes more expensive as you move further and further into the list of results. Each time you make one of these calls you are re-running the search operation &#8211; forcing Lucene to go off and re-score all the results, rank them and then discard the first 10 (or 10000 if you get that far). There is an easier option: the scan and scroll API.
The idea is that you run your actual query once, then Elastic caches the result somewhere and gives you an "access token" to go back in and get them. Then you call the scroll API endpoint with said token to get each page of results (a caveat of this is that each time you make a call your token updates and you need to use the new one. My code sample deals with this but it took me a while to figure out what was going on).
The below code uses the Python elasticsearch library to make a scan and scroll call to an index and continues to load results until there are no more hits. For each page it maps a function _**f**_ onto the results. It would not be hard to modify this code to work on multiple threads/processes using the Python multiprocessing API. Take a look!
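In outline, the approach looks something like this (the index name and callback are just placeholders):
<pre lang="python"># Run the query once, then keep pulling pages back through the scroll API,
# applying f to every hit.
from elasticsearch import Elasticsearch

def map_over_index(f, index, query=None, page_size=100):
    es = Elasticsearch()
    body = {"query": query or {"match_all": {}}}
    resp = es.search(index=index, body=body, scroll='2m', size=page_size)
    scroll_id = resp['_scroll_id']
    while resp['hits']['hits']:
        for hit in resp['hits']['hits']:
            f(hit['_source'])
        # the scroll id can change between calls - always use the newest one
        resp = es.scroll(scroll_id=scroll_id, scroll='2m')
        scroll_id = resp['_scroll_id']

def show_uoa(doc):
    print(doc.get('UOA'))

map_over_index(show_uoa, 'impact_studies')
</pre>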


@ -0,0 +1,25 @@
---
title: Freecite python wrapper
author: James
type: post
date: 2015-11-22T19:20:19+00:00
url: /2015/11/22/freecite-python-wrapper/
categories:
- PhD
tags:
- citations
- freecite
- python
- rcuk
- ref
- references
---
I&#8217;ve written a simple wrapper around the Brown University Citation parser [FreeCite][1]. I&#8217;m planning to use the service to pull out author names from references in REF impact studies and try to link them back to investigators listed on RCUK funding applications.
The code is [here][2] and is MIT licensed. It provides a simple method which takes a string representing a reference and returns a dict with each field separated. There is also a parse_many function which takes an array of reference strings and returns an array of dicts.
[1]: http://freecite.library.brown.edu/
[2]: https://github.com/ravenscroftj/freecite


@ -0,0 +1,50 @@
---
title: Home automation with Raspberry Pi and Watson
author: James
type: post
date: 2015-11-28T10:57:14+00:00
url: /2015/11/28/watson-home-automation/
categories:
- Work
tags:
- automation
- home
- iot
- jasper
- pi
- raspberry
- speech
- speech-to-text
- stt
- text
- watson
---
I&#8217;ve recently been playing with trying to build a Watson powered home automation system using my Raspberry Pi and some other electronic bits that I have on hand.
There are already a lot of people doing work in this space. One of the most successful projects is [JASPER][1], which uses speech-to-text and an always-on, background-listening microphone to talk to you and carry out actions when you ask it things in natural language like &#8220;What&#8217;s the weather going to be like tomorrow?&#8221; and &#8220;What is the meaning of life?&#8221; Jasper works using a library called [Sphinx][2], developed by Carnegie Mellon University, to do speech recognition. However the models aren&#8217;t great &#8211; especially if you have a British accent.
Jasper also allows you to use other speech-to-text libraries and services, such as the [Google Speech service][3] and the [AT&T speech service][4]. However, there was no publicly available code for using the Watson speech-to-text API &#8211; until now.
The below code snippet can be added to your stt.py file in your jasper project.
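
(The snippet itself hasn&#8217;t made it into this version of the post. As a rough illustration of the core of it, the sketch below posts a WAV file to the Watson Speech to Text v1 REST endpoint with Basic Auth and pulls out the best transcript. The endpoint URL and response shape reflect the API as it was at the time and have long since changed; wrapping this in a Jasper STT engine class follows the same pattern as the existing Google and AT&T engines.)

<pre>
# Rough sketch only: the stream.watsonplatform.net endpoint and Basic Auth
# credentials reflect the circa-2015 Watson STT v1 API, which is now retired.
import requests

WATSON_STT_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

def transcribe_wav(path, username, password):
    """Send a WAV file to Watson STT and return the best transcript (or '')."""
    with open(path, "rb") as audio:
        response = requests.post(
            WATSON_STT_URL,
            auth=(username, password),
            headers={"Content-Type": "audio/wav"},
            data=audio,
        )
    response.raise_for_status()
    results = response.json().get("results", [])
    if not results:
        return ""
    return results[0]["alternatives"][0]["transcript"]
</pre>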
Then you need to create a Watson speech-to-text instance in Bluemix and add the following to your JASPER configuration:
<pre>stt_engine: watson
stt_passive_engine: sphinx
watson-stt:
    username: "&lt;speech-to-text-credentials-username&gt;"
    password: "&lt;speech-to-text-credentials-password&gt;"</pre>
This configuration will use the local Sphinx engine to listen out for &#8220;JASPER&#8221; or whatever you choose to call your companion (which it is actually pretty good at) and then send off 10-15s of audio to Watson STT to be analysed more accurately once the trigger word has been detected. Here&#8217;s a video of the system in action:
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/MBDaJDPKrYE?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>
[1]: http://jasperproject.github.io
[2]: http://cmusphinx.sourceforge.net/
[3]: http://jasperproject.github.io/documentation/configuration/#google-stt
[4]: http://jasperproject.github.io/documentation/configuration/#att-stt

View File

@ -0,0 +1,52 @@
---
title: 'ElasticSearch: Turning analysis off and why its useful'
author: James
type: post
date: 2015-11-29T14:59:06+00:00
url: /2015/11/29/elasticsearch-turning-analysis-off-and-why-its-useful/
categories:
- PhD
tags:
- analysis
- elasticsearch
- indexing
- python
- schema
---
I have recently been playing with Elasticsearch a lot for my PhD and have started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in JSON format. One of the fields is &#8220;UOA&#8221; which contains the title of the unit of assessment that the case study belongs to. We recently identified that we do not want to look at all units of assessment (my PhD is around impact in science, so domains such as Art History are largely irrelevant to me). Therefore I started trying to run queries like this:
<pre>{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "UOA": "General Engineering"
            }
         }
      }
   }
}</pre>
For some reason this returns zero results. Now it took me ages to find [this page][1] in the elastic manual which talks about the exact phenomenon I&#8217;m running into above. It turns out that the default analyser is tokenizing every text field and so Elastic has no notion of UOA ever containing &#8220;General Engineering&#8221;. Instead it only knows of a UOA field that contains the word &#8220;general&#8221; and the word &#8220;engineering&#8221; independently of each other in the model somewhere (bag-of-words). To solve this you have to
* Download the existing schema from elastic:
* <pre>curl -XGET "http://localhost:9200/impact_studies/_mapping/study"
{"impact_studies":{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string"},"UnderpinningResearch":{"type":"string"}}}}}}</pre>
* Delete the schema (unfortunately you can&#8217;t make this change on the fly) and then turn off the analyser which tokenizes the values in the field:
<pre>$ curl -XDELETE "http://localhost:9200/impact_studies"</pre>
* Then recreate the schema with &#8220;index&#8221;:&#8221;not_analyzed&#8221; on the field you are interested in:
<pre>curl -XPUT "http://localhost:9200/impact_studies/" -d '{"mappings":{"study":{"properties":{"CaseStudyId":{"type":"string"},"Continent":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Country":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"Funders":{"type":"string"},"ImpactDetails":{"type":"string"},"ImpactSummary":{"type":"string"},"ImpactType":{"type":"string"},"Institution":{"type":"string"},"Institutions":{"properties":{"AlternativeName":{"type":"string"},"InstitutionName":{"type":"string"},"PeerGroup":{"type":"string"},"Region":{"type":"string"},"UKPRN":{"type":"long"}}},"Panel":{"type":"string"},"PlaceName":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"References":{"type":"string"},"ResearchSubjectAreas":{"properties":{"Level1":{"type":"string"},"Level2":{"type":"string"},"Subject":{"type":"string"}}},"Sources":{"type":"string"},"Title":{"type":"string"},"UKLocation":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UKRegion":{"properties":{"GeoNamesId":{"type":"string"},"Name":{"type":"string"}}},"UOA":{"type":"string", "index" : "not_analyzed"},"UnderpinningResearch":{"type":"string"}}}}}'</pre>
Once you&#8217;ve done this you&#8217;re good to go reingesting your data and your filter queries should be much more fruitful.
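
As a quick sanity check after reingesting, re-running the filtered query from the top of the post (sketched here with the requests library rather than curl) should now return a non-zero hit count for General Engineering.

<pre>
# Sketch: re-run the original filtered/term query once UOA is not_analyzed.
import json
import requests

query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"term": {"UOA": "General Engineering"}},
        }
    }
}

resp = requests.post("http://localhost:9200/impact_studies/study/_search",
                     data=json.dumps(query))
print(resp.json()["hits"]["total"])  # should now be greater than zero
</pre>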
[1]: https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_term_filter_with_text

View File

@ -0,0 +1,184 @@
---
title: Cognitive Quality Assurance An Introduction
author: James
type: post
date: 2016-03-29T08:50:29+00:00
url: /2016/03/29/cognitive-quality-assurance-an-introduction/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"e20dc490dab8";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"22a2beb5a88a";s:6:"status";s:5:"draft";s:3:"url";s:43:"https://medium.com/@jamesravey/e20dc490dab8";}'
categories:
- Uncategorized
- Work
tags:
- assurance
- cognitive
- cqa
- machine learning
- qa
- quality
- watson
---
***EDIT: Hello readers, these articles are now 4 years old and many of the Watson services and APIs have moved or been changed. The concepts discussed in these articles are still relevant but I am working on 2nd editions of them.***
<div>
<strong><br /> This article has a slant towards the IBM Watson Developer Cloud Services but the principles and rules of thumb expressed here are applicable to most cognitive/machine learning problems.</strong>
</div>
## Introduction
<div>
<p>
<img loading="lazy" class="wp-image-94 alignleft" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/imagebot-com-2012042714194724316-800px.png?resize=146%2C147&#038;ssl=1" alt="imagebot-com-2012042714194724316-800px" width="146" height="147" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/imagebot-com-2012042714194724316-800px.png?resize=297%2C300&ssl=1 297w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/imagebot-com-2012042714194724316-800px.png?resize=150%2C150&ssl=1 150w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/imagebot-com-2012042714194724316-800px.png?resize=768%2C776&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/imagebot-com-2012042714194724316-800px.png?w=773&ssl=1 773w" sizes="(max-width: 146px) 100vw, 146px" data-recalc-dims="1" />Quality assurance is arguably one of the most important parts of the software development lifecycle. In order to release a product that is production ready, it must be put under, and pass, a number of tests &#8211; these include unit testing, boundary testing, stress testing and other practices that many software testers are no doubt familiar with. The ways in which traditional software are relatively clear.In a normal system, developers write deterministic functions, that is &#8211; if you put an input parameter in, unless there is a bug, you will always get the same output back. This principal makes it.. well not easy&#8230; but less difficult to write good test scripts and know that there is a bug or regression in your system if these scripts get a different answer back than usual.
</p>
<p>
Cognitive systems are not deterministic in nature. This means that you can receive different results from the same input data when training a system. Such systems tend to be randomly initialised and learn in different, nuanced ways every time they are trained &#8211; similar to how identical twins, despite sharing the same biology, still develop their own preferences, memories and skillsets.
</p>
<p>
Thus, a traditional unit testing approach with tests that pass and fail depending on how the output of the system compares to an expected result is not helpful.
</p>
<p>
This article is the first in a series on Cognitive Quality Assurance &#8211; or, in other words, how to test and validate the performance of non-deterministic, machine learning systems. In today&#8217;s article we look at how to build a good quality ground truth, how to carry out train/test/blind data segmentation, and how you can use your ground truth to verify that a cognitive system is doing its job.
</p>
<h2>
Ground Truth
</h2>
<p>
Let&#8217;s take a step back for a moment and make sure we&#8217;re ok with the concept of ground truth.
</p>
<p>
In machine learning/cognitive applications, the ground truth is the dataset which you use to train and test the system. You can think of it like a school textbook that the cognitive system treats as the absolute truth and first point of reference for learning the subject at hand. Its structure and layout can vary depending on the nature of the system you are trying to build but it will always abide by a number of rules. As I like to remember them: <strong>R-C-S!</strong>
</p>
<h3>
<strong>Representative of the problem</strong>
</h3>
<ul>
<li>
The ground truth must accurately reflect the problem you are trying to solve.
</li>
<li>
If you are building a question answering system, how sure are you that the questions in the ground truth are also the questions that end users will be asking?
</li>
<li>
If you are building an image classification system, are the images in your ground truth of a similar size and quality to the images that you will need to tag and classify in production? Do your positive and negative examples truly represent the problem (i.e. if all of your positive examples are black and white images but you are learning to find cats, the machine might learn to assume that black and white implies cat)?
</li>
<li>
The proportion of each class is also an important factor. If you have 10 classes of image or text and one particular class occurs 35% of the time in the field, you should try to reflect this in your ground truth too.
</li>
</ul>
<p>
<figure id="attachment_95" aria-describedby="caption-attachment-95" class="wp-caption alignright"><img loading="lazy" class="wp-image-95 size-medium" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vitruvian-man-800px.png?resize=300%2C300&#038;ssl=1" alt="vitruvian-man-800px" width="300" height="300" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vitruvian-man-800px.png?resize=300%2C300&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vitruvian-man-800px.png?resize=150%2C150&ssl=1 150w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vitruvian-man-800px.png?resize=768%2C768&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vitruvian-man-800px.png?w=800&ssl=1 800w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-95" class="wp-caption-text">Like Da Vinci when he drew the anatomically correct Vitruvian man, strive to represent the data as clearly and accurately as possible &#8211; errors make learning harder!</figcaption></figure>
</p>
</div>
### Consistent
* The data in your ground truth must follow a logical set of rules &#8211; even if these are
a bit &#8220;fuzzy&#8221; &#8211; after all if a human can&#8217;t decide on how to classify a set of data consistently, how can we expect a machine to do this?
* Building a ground truth can often be a very large task requiring a team of people. When working in groups it may be useful to build a set of guidelines that detail which data belongs to which class and lists some examples. I will cover this in more detail on my article on working in groups.
* Humans ourselves can be inconsistent in nature so if at all possible, try to automate some of the classification &#8211; using dictionaries or pattern matching rules.
<img loading="lazy" class="wp-image-98 alignleft" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=93%2C86&#038;ssl=1" alt="Warning-2400px" width="93" height="86" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=300%2C278&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=768%2C712&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=1024%2C950&ssl=1 1024w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?w=1980&ssl=1 1980w" sizes="(max-width: 93px) 100vw, 93px" data-recalc-dims="1" />
**Important: never use cognitive systems to generate ground truth or you run the risk of introducing compounding learning errors.**
### **Statistically Significant**
* The ground truth should be as large as is affordable. When you were a child and learned the concept of dog or cat, the chances are you learned that from seeing a large number of these animals and were able to draw up mental rules for what a dog entails (4 legs, furry, barks, wags tail) vs what cat entails (4 legs, sometimes furry, meows, retractable claws). The more  diverse examples of these animals you see, the better you are able to refine your mental model for what each animal entails. The same applies with machine learning and cognitive systems.
Some of the Watson APIs list minimum ground truth requirements and these vary from service to service. You should always be aiming as high as possible but, as an absolute minimum, collect at least 25% more than the service requirement so that you have some data left over for blind testing (all will be revealed below).
<figure id="attachment_97" aria-describedby="caption-attachment-97" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-97" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vector-x-2400px.png?resize=212%2C300&#038;ssl=1" alt="More data points means that the cognitive system has more to work with - don't skimp on ground truth - it will cost you your accuracy!" width="212" height="300" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vector-x-2400px.png?resize=212%2C300&ssl=1 212w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vector-x-2400px.png?resize=768%2C1086&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vector-x-2400px.png?resize=724%2C1024&ssl=1 724w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vector-x-2400px.png?w=1697&ssl=1 1697w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/vector-x-2400px.png?w=1320&ssl=1 1320w" sizes="(max-width: 212px) 100vw, 212px" data-recalc-dims="1" /><figcaption id="caption-attachment-97" class="wp-caption-text">More data points means that the cognitive system has more to work with &#8211; don&#8217;t skimp on ground truth &#8211; it will cost you your accuracy!</figcaption></figure>
There are some test techniques for dealing with testing smaller corpuses that I will cover in a follow up article.
## Training and Testing &#8211; Concepts
Once we are happy with our ground truth, we need to decide how best to train and test the system. In a standard software environment, you would want to test every combination of every function and make sure that all combinations work. It may be tempting to jump to this conclusion with Cognitive systems too. However, this is not the answer.
Taking a step back again, let&#8217;s remember when you were back at school. Over the course of a year you would learn about a topic and at the end there was an exam. We knew that the exam would test what we had learned during the year but we did not know:
* The exact questions that we would be tested on &#8211; you have some idea of the sorts of questions you might be tested on but if you knew what the exact questions were you could go and find out what the answers are ahead of time
* The exact exam answers that would get us the best results before we went into the exam room and took the test. That&#8217;d be cheating right?
With machine learning, this concept of learning and then blind testing is equally important. If we train the algorithm on all of the ground truth available to us and then test it, we are essentially asking it questions we already told it the answers to. We&#8217;re allowing it to cheat.

By splitting the ground truth into two datasets, training on one and then asking questions with the other &#8211; we are really demonstrating that the machine has learned the concepts we are trying to teach and not just memorised the answer sheet.
## Training and Testing &#8211; Best Practices
<img loading="lazy" class="size-medium wp-image-100 alignleft" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/SteveLambert-Dumbell-Lifter-800px.png?resize=300%2C243&#038;ssl=1" alt="SteveLambert-Dumbell-Lifter-800px" width="300" height="243" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/SteveLambert-Dumbell-Lifter-800px.png?resize=300%2C243&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/SteveLambert-Dumbell-Lifter-800px.png?resize=768%2C622&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/SteveLambert-Dumbell-Lifter-800px.png?w=800&ssl=1 800w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />Generally we split our data set into 80% training data and 20% testing data &#8211; this means that we are giving the cognitive system the larger chunk of information to learn from and testing it on a small subset of those concepts (in the same way that your professor gave you 12 weeks of lectures to lean from and then a 2 hour exam at the end of term).
It is important that the test questions are well represented in the train data (it would have been mean of your professors to ask you questions in the exam that were never taught in the lectures). Therefore, you should make sure to sample ground truth pairs from each class or concept that you are trying to teach.
You should not simply take the first 80% of the ground truth file and feed it into the algorithm and use the last 20% of the file to test the algorithm &#8211; this is making a huge assumption about how well each class is represented in the data. For example, you might find that all of the questions about car insurance come at the end of your banking FAQ ground truth resulting in:
* The algorithm never seeing what a car insurance question looks like and not learning this concept.
* The algorithm fails miserably at the test because most of the questions were on car insurance and it didn&#8217;t know much about that.
* The algorithm has examples of mortgage and credit card questions but is never tested on these &#8211; we can&#8217;t make any assertions about how well it has learned to classify these concepts.
The best way to divide test and training data for the above [NLC](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/nl-classifier.html) problem is as follows:
* Iterate over the ground truth &#8211; separating out each example into groups by class/concept
* Randomly select 80% of each of the groups to become the training data for that group/class
* Take the other 20% of each group and use this as the test data for that group/class
* Recombine the subgroups into two groups: test and train
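
As a concrete (if simplified) sketch of that procedure, the following assumes your ground truth is held as a list of (text, label) pairs:

<pre>
# Minimal sketch of a stratified 80/20 split over a list of (text, label) pairs.
import random
from collections import defaultdict

def split_ground_truth(examples, train_fraction=0.8, seed=42):
    random.seed(seed)
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append((text, label))

    train, test = [], []
    for label, items in by_class.items():
        random.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])   # 80% of this class goes into training
        test.extend(items[cut:])    # the remaining 20% becomes test data
    return train, test
</pre>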
With some of the other Watson cognitive APIs (I&#8217;m looking at you, <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/visual-recognition.html">Visual Recognition</a> and <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/retrieve-rank.html">Retrieve & Rank</a>) you will need to alter this process a little bit. However the key here is making sure that the test data set is a fair representation (and a fair test) of the information in the train dataset.
### Testing the model
Once you have your train set and test set, the next bit is easy. Train a classifier with the train set and then write a script that loads in your test set, asks the question (or shows the classifier the image) and then compare the answer that the classifier gives with the answer in the ground truth. If they match, increment a &#8220;correct&#8221; number. If they don&#8217;t match, too bad! You can then calculate the accuracy of your classifier &#8211; it is the percentage of the total number of answers that were marked as correct.
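
A sketch of that test harness might look like the following &#8211; classifier.classify() is just a stand-in for whichever API call you are actually evaluating:

<pre>
# Sketch: compute simple accuracy over a held-out test set.
# classifier.classify(text) is a placeholder for the service call under test.
def accuracy(classifier, test_set):
    correct = 0
    for text, expected in test_set:
        predicted = classifier.classify(text)
        if predicted == expected:
            correct += 1
    return 100.0 * correct / len(test_set)
</pre>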
### Blind Testing and Performance Reporting
<p>
<img loading="lazy" class="alignright size-medium wp-image-99" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Blindfolded-Darts-Player-800px.png?resize=300%2C247&#038;ssl=1" alt="Blindfolded-Darts-Player-800px" width="300" height="247" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Blindfolded-Darts-Player-800px.png?resize=300%2C247&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Blindfolded-Darts-Player-800px.png?w=680&ssl=1 680w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />In a typical work flow you may be training, testing, altering your ground truth to try and improve performance and re-training.  This is perfectly normal and it often takes some time to tune and tweak a model in order to get optimal performance.
</p>
<p>
However, in doing this, you may be inadvertently biasing your model towards the test data &#8211; which in itself may change how the model performs in the real world. When you are happy with your test performance, you may wish to benchmark against a third dataset &#8211; a blind test set that the model has not been &#8216;tweaked&#8217; to perform better against. This will give you the most accurate view, with respect to the data available, of how well your classifier is performing in the real world.
</p>
<p>
In the case of three data sets (test, train, blind) you should use a similar algorithm/workflow to the one described in the above section. The important thing is that the three sets must not overlap in any way and should all be representative of the problem you are trying to train on.
</p>
<p>
There are a lot of differing opinions on what proportions to separate the data set into. Some folks advocate 50%, 25%, 25% for train, test and blind respectively; others 70, 20, 10. I personally start with the latter and change these around if they don&#8217;t work &#8211; your mileage may vary depending on the type of model you are trying to build and the sort of problem you are trying to model.
</p>
<p>
<strong><br /> <img loading="lazy" class=" wp-image-98 alignleft" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=145%2C134&#038;ssl=1" alt="Warning-2400px" width="145" height="134" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=300%2C278&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=768%2C712&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?resize=1024%2C950&ssl=1 1024w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/03/Warning-2400px.png?w=1980&ssl=1 1980w" sizes="(max-width: 145px) 100vw, 145px" data-recalc-dims="1" />Important: once you have done your blind test to get an accurate idea of how well your model performs in the real world, you must not do any more tuning on the model.</strong><strong>If you do, your metrics will be meaningless since you are now biasing the new model towards the blind data set. You can of course, start from scratch and randomly initialise a new set of test, train and blind data sets from your ground truth at any time.</strong>
</p>
<h2>
<strong>Conclusion</strong>
</h2>
<p>
Hopefully, this article has given you some ideas about how best to start assessing the quality of your cognitive application.<a href="https://brainsteam.co.uk/2016/05/29/cognitive-quality-assurance-pt-2-performance-metrics/"> In the next article</a>, I cover some more in depth measurements that you can do on your model to find out where it is performing well and where it needs tuning beyond a simple accuracy rating. We will also discuss some other methods for segmenting test and train data for smaller corpuses in a future article.
</p>
</div>

View File

@ -0,0 +1,25 @@
---
title: IBM Watson Its for data scientists too!
author: James
type: post
date: 2016-05-01T11:28:13+00:00
url: /2016/05/01/ibm-watson-its-for-data-scientists-too/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Work
tags:
- data science
- ibm
- watson
---
Last week, my colleague Olly and I gave a talk at a data science meetup on how [IBM Watson can be used for data science applications][1].
We had an amazing time and got some really great feedback from the event. We will definitely be doing more talks at events like these in the near future so keep an eye out for us!
I will also be writing a little bit more about the experiment I did around Core Scientific Concepts and Watson Natural Language Classifier in a future blog post.
[1]: https://skillsmatter.com/skillscasts/8076-ibm-watson-it-s-for-data-scientists-too

View File

@ -0,0 +1,816 @@
---
title: 'Cognitive Quality Assurance Pt 2: Performance Metrics'
author: James
type: post
date: 2016-05-29T09:41:26+00:00
url: /2016/05/29/cognitive-quality-assurance-pt-2-performance-metrics/
featured_image: /wp-content/uploads/2016/05/Oma--825x510.png
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"1f1de4b3132e";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:96:"https://medium.com/@jamesravey/cognitive-quality-assurance-pt-2-performance-metrics-1f1de4b3132e";}'
categories:
- Work
tags:
- cognitive
- cqa
- evaluation
- learning
- machine
- rank
- retrieval
- retrieve
- supervised
- watson
---
***EDIT: Hello readers, these articles are now 4 years old and many of the Watson services and APIs have moved or been changed. The concepts discussed in these articles are still relevant but I am working on 2nd editions of them.***
[Last time][1] we discussed some good practices for collecting data and then splitting it into test and train in order to create a ground truth for your machine learning system. We then talked about calculating accuracy using test and blind data sets.
In this post we will talk about some more metrics you can do on your machine learning system including **Precision**, **Recall**, **F-measure** and **confusion matrices.** These metrics give you a much deeper level of insight into how your system is performing and provide hints at how you could improve performance too!
## A recap &#8211; Accuracy calculation
This is the most simple calculation but perhaps the least interesting. We are just looking at the percentage of times the classifier got it right versus the percentage of times it failed. Simply:
1. sum up the number of results (count the rows),
2. sum up the number of rows where the predicted label and the actual label match.
3. Calculate percentage accuracy: correct / total * 100.
This tells you how good the classifier is in general across all classes. It does not help you in understanding how that result is made up.
## Going above and beyond accuracy: why is it important?
<img loading="lazy" class="alignleft" src="https://i1.wp.com/openclipart.org/image/2400px/svg_to_png/13234/Anonymous-target-with-arrow.png?resize=268%2C250&#038;ssl=1" alt="target with arrow by Anonymous" width="268" height="250" data-recalc-dims="1" />Imagine that you are a hospital and it is critically important to be able to predict different types of cancer and how urgently they should be treated. Your classifier is 73% accurate overall but that does not tell you anything about it&#8217;s ability to predict any one type of cancer. What if the 27% of the answers it got wrong were the cancers that need urgent treatment? We wouldn&#8217;t know!
This is exactly why we need to use measurements like precision, recall and f-measure as well as confusion matrices in order to understand what is really going on inside the classifier and which particular classes (if any) it is really struggling with.
## Precision, Recall and F-measure and confusion matrices (Grandma&#8217;s Memory Game)
<img loading="lazy" class="alignright" src="https://i2.wp.com/openclipart.org/image/2400px/svg_to_png/213139/Oma-.png?resize=264%2C391&#038;ssl=1" alt="Grandma's face by frankes" width="264" height="391" data-recalc-dims="1" />Precision, Recall and F-measure are incredibly useful for getting a deeper understanding of which classes the classifier is struggling with. They can be a little bit tricky to get your head around so lets use a metaphor about Grandma&#8217;s memory.
Imagine Grandma has 24 grandchildren. As you can understand, it is particularly difficult to remember all of their names. Thankfully, her 6 children &#8211; the grandchildren&#8217;s parents &#8211; each had 4 kids and named them after themselves. Her son Steve has sons called Steve I, Steve II, Steve III and so on.
This makes things much easier for Grandma, she now only has to remember 6 names: Brian, Steve, Eliza, Diana, Nick and Reggie. The children do not like being called the wrong name so it is vitally important that she correctly classifies the child into the right name group when she sees them at the family reunion every Christmas.
I will now describe Precision, Recall, F-Measure and confusion matrices in terms of Grandma&#8217;s predicament.
### Some Terminology
Before we get on to precision and recall, I need to introduce the concepts of true positive, false positive, true negative and false negative. Every time Grandma gets an answer wrong or right, we can talk about it in terms of these labels and this will also help us get to grips with precision and recall later.
These phrases are in terms of each class &#8211; you have TP, FP, FN, TN for each class. In this case we can have TP,FP,FN,TN with respect to Brian, with respect to Steve, with respect to Eliza and so on.
This table shows how these four labels apply to the class &#8220;Brian&#8221; &#8211; you can create a similar table for each of the other names too:
<table border="0" cellspacing="0">
<colgroup width="197"></colgroup> <colgroup span="2" width="85"></colgroup> <tr>
<td align="left" height="17">
</td>
<td align="left">
Brian
</td>
<td align="left">
Not Brian
</td>
</tr>
<tr>
<td align="left" height="17">
Grandma says “Brian”
</td>
<td align="left">
True Positive
</td>
<td align="left">
False Positive
</td>
</tr>
<tr>
<td align="left" height="17">
Grandma says <not brian>
</td>
<td align="left">
False Negative
</td>
<td align="left">
True Negative
</td>
</tr>
</table>
* If Grandma calls a Brian, Brian then we have a true positive (with respect to the Brian class) &#8211; the answer is true in both senses- Brian&#8217;s name is indeed Brian AND Grandma said Brian &#8211; go Grandma!
* If Grandma calls a Brian, Steve then we have a false negative (with respect to the Brian class). Brian&#8217;s name is Brian and Grandma said Steve. This is also a false positive with respect to the Steve Class.
* If Grandma calls a Steve, Brian then we have a false positive (with respect to the Brian class). Steve&#8217;s name is Steve, Grandma wrongly said Brian (i.e. identified positively).
* If Grandma calls an Eliza, Eliza, or Steve, or Diana, or Nick &#8211; the result is the same &#8211; we have a true negative (with respect to the Brian class). Eliza,Eliza would obviously be a true positive with respect to the Eliza class but because we are only interested in Brian and what is or isn&#8217;t Brian at this point, we are not measuring this.
When you are recording results, it is helpful to store them in terms of each of these labels where applicable. For example:
Steve,Steve (TP Steve, TN everything else)
Brian,Steve (FN Brian, FP Steve)
### Precision and Recall
Grandma is in the kitchen, pouring herself a Christmas Sherry when three Brians and 2 Steves come in to top up their eggnogs.
Grandma correctly classifies 2 of the Brians but slips up and calls the third one Eliza. She only gets 1 of the Steves right and calls the other one Brian.
In terms of TP,FP,TN,FN we can say the following (true negative is the least interesting for us):
<table border="0" cellspacing="0">
<colgroup width="197"></colgroup> <colgroup span="3" width="85"></colgroup> <tr>
<td align="left" height="17">
</td>
<td align="left">
TP
</td>
<td align="left">
FP
</td>
<td align="left">
FN
</td>
</tr>
<tr>
<td align="left" height="17">
Brian
</td>
<td align="right">
2
</td>
<td align="right">
1
</td>
<td align="right">
1
</td>
</tr>
<tr>
<td align="left" height="17">
Eliza
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
</td>
</tr>
<tr>
<td align="left" height="17">
Steve
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
1
</td>
</tr>
</table>
* She has correctly identified 2 people who are truly called Brian as Brian (TP)
* She has falsely named someone Eliza when their name is not Eliza (FP)
* She has falsely named someone whose name is truly Steve something else (FN)
**True Positive, False Positive, True Negative and False negative are crucial to understand before you look at precision and recall so make sure you have fully understood this section before you move on.**
#### Precision
Precision, like our TP/FP labels, is expressed in terms of each class or name. It is the proportion of true positive name guesses divided by true positive + false positive guesses.
Put another way, precision is how many times Grandma correctly guessed Brian versus how many times she called other people (like Steve) Brian.
For Grandma to be precise, she needs to be very good at correctly guessing Brians **and also** never call anyone else (Elizas and Steves) Brian.
_**Important: If Grandma came to the conclusion that 70% of her grandchildren were named Brian and decided to just randomly say &#8220;Brian&#8221; most of the time, she could still achieve a high overall accuracy. However, her Precision &#8211; with respect to Brian would be poor because of all the Steves and Elizas she was mis-labelling. This is why precision is important.**_
<table border="0" cellspacing="0">
<colgroup width="197"></colgroup> <colgroup span="4" width="85"></colgroup> <tr>
<td align="left" height="17">
</td>
<td align="left">
TP
</td>
<td align="left">
FP
</td>
<td align="left">
FN
</td>
<td align="left">
Precision
</td>
</tr>
<tr>
<td align="left" height="17">
Brian
</td>
<td align="right">
2
</td>
<td align="right">
1
</td>
<td align="right">
1
</td>
<td align="right">
66%
</td>
</tr>
<tr>
<td align="left" height="17">
Eliza
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
N/A
</td>
</tr>
<tr>
<td align="left" height="17">
Steve
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
100%
</td>
</tr>
</table>
The results from this case are displayed above. As you can see, Grandma uses Brian to incorrectly label Steve so precision is only 66%. Despite only getting one of the Steves correct, Grandma has 100% precision for Steve simply by never using the name incorrectly. We can&#8217;t calculate for Eliza because there were no true positive guesses for that name ( 0 / 1 is still zero ).
So what about false negatives? Surely it&#8217;s important to note how often Grandma is inaccurately calling  Brian by other names? We&#8217;ll look at that now&#8230;
#### Recall
Continuing the theme, Recall is also expressed in terms of each class. It is the proportion of true positive name guesses divided by true positive + false negative guesses.
Another way to look at it is given a population of Brians, how many does Grandma correctly identify and how many does she give another name (i.e. Eliza or Steve)?
This tells us how &#8220;confusing&#8221; Brian is as a class. If Recall is high then its likely that Brians all have a very distinctive feature that distinguishes them as Brians (maybe they all have the same nose). If Recall is low, maybe Brians are very varied in appearance and perhaps look a lot like Elizas or Steves (this presents a problem of its own, check out confusion matrices below for more on this).
<table border="0" cellspacing="0">
<colgroup width="197"></colgroup> <colgroup span="4" width="85"></colgroup> <tr>
<td align="left" height="17">
</td>
<td align="left">
TP
</td>
<td align="left">
FP
</td>
<td align="left">
FN
</td>
<td align="left">
Recall
</td>
</tr>
<tr>
<td align="left" height="17">
Brian
</td>
<td align="right">
2
</td>
<td align="right">
1
</td>
<td align="right">
1
</td>
<td align="right">
66.6%
</td>
</tr>
<tr>
<td align="left" height="17">
Eliza
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
N/A
</td>
</tr>
<tr>
<td align="left" height="17">
Steve
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
50%
</td>
</tr>
</table>
You can see that recall for Brian remains the same (of the 3 Brians Grandma named, she only guessed incorrectly for one). Recall for Steve is 50% because Grandma guessed correctly for 1 and incorrectly for the other Steve. Again Eliza can&#8217;t be calculated because we end up trying to divide zero by zero.
#### F-Measure
F-measure is effectively a measurement of how accurate the classifier is per class once you factor in both precision and recall. This gives you a holistic view of your classifier&#8217;s performance on a particular class.
In terms of Grandma, F-measure gives us an aggregate metric of how good Grandma is at dealing with Brians in terms of both precision and recall.
It is very simple to calculate if you already have precision and recall:
![F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}][2]
Here are the F-Measure results for Brian, Steve and Eliza from above.
<table border="0" cellspacing="0">
<colgroup width="197"></colgroup> <colgroup span="6" width="85"></colgroup> <tr>
<td align="left" height="17">
</td>
<td align="left">
TP
</td>
<td align="left">
FP
</td>
<td align="left">
FN
</td>
<td align="left">
Precision
</td>
<td align="left">
Recall
</td>
<td align="left">
F-measure
</td>
</tr>
<tr>
<td align="left" height="17">
Brian
</td>
<td align="right">
2
</td>
<td align="right">
1
</td>
<td align="right">
1
</td>
<td align="right">
66.6%
</td>
<td align="right">
66.6%
</td>
<td align="right">
66.6%
</td>
</tr>
<tr>
<td align="left" height="17">
Eliza
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
N/A
</td>
<td align="right">
N/A
</td>
<td align="right">
N/A
</td>
</tr>
<tr>
<td align="left" height="17">
Steve
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
100%
</td>
<td align="right">
50%
</td>
<td align="right">
66.6%
</td>
</tr>
</table>
As you can see &#8211; the F-measure is the average ([harmonic mean][3]) of the two values &#8211; this can often give you a good overview of both precision and recall and is dramatically affected by one of the contributing measurements being poor.
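
If you have the raw true positive, false positive and false negative counts, all three metrics are only a few lines of code. The sketch below reproduces the kitchen example from the tables above; the only difference is that it reports Eliza&#8217;s precision as a numerical 0.0 where the tables show N/A.

<pre>
# Sketch: per-class precision, recall and F1 from raw TP/FP/FN counts (Python 3).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    f1 = None
    if precision and recall:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Grandma's kitchen example: (TP, FP, FN) per name, taken from the tables above.
counts = {"Brian": (2, 1, 1), "Eliza": (0, 1, 0), "Steve": (1, 0, 1)}
for name, c in counts.items():
    print(name, precision_recall_f1(*c))
</pre>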
### Confusion Matrices
When a class has a particularly low Recall or Precision, the next question should be why? Often you can improve a classifier&#8217;s performance by modifying  the data or (if you have control of the classifier) which features you are training on.
For example, what if we find out that Brians look a lot like Elizas? We could add a new feature (Grandma could start using their voice pitch to determine their gender and their gender to inform her name choice) or we could update the data (maybe we could make all Brians wear a blue jumper and all Elizas wear a green jumper).
Before we go down that road, we need to understand where there is confusion between classes  and where Grandma is doing well. This is where a confusion matrix helps.
A Confusion Matrix allows us to see which classes are being correctly predicted and which classes Grandma is struggling to predict and getting most confused about. It also crucially gives us insight into which classes Grandma is confusing as above. Here is an example of a confusion Matrix for Grandma&#8217;s family.
<table border="0" cellspacing="0">
<colgroup width="179"></colgroup> <colgroup span="7" width="85"></colgroup> <tr>
<td align="left" height="17">
</td>
<td align="left">
</td>
<td colspan="6" align="center" valign="middle">
<b>Predictions</b>
</td>
</tr>
<tr>
<td align="left" height="17">
</td>
<td align="left">
</td>
<td align="left">
Steve
</td>
<td align="left">
Brian
</td>
<td align="left">
Eliza
</td>
<td align="left">
Diana
</td>
<td align="left">
Nick
</td>
<td align="left">
Reggie
</td>
</tr>
<tr>
<td rowspan="6" align="center" valign="middle" height="102">
<b>Actual </b></p>
<p>
<b>Class</b></td>
<td align="left">
Steve
</td>
<td align="right">
<strong>4</strong>
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
</td></tr>
<tr>
<td align="left">
Brian
</td>
<td align="right">
1
</td>
<td align="right">
<strong>3</strong>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
1
</td>
<td align="right">
1
</td>
</tr>
<tr>
<td align="left">
Eliza
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
<strong>5</strong>
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
</td>
</tr>
<tr>
<td align="left">
Diana
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
5
</td>
<td align="right">
<strong>1</strong>
</td>
<td align="right">
</td>
<td align="right">
</td>
</tr>
<tr>
<td align="left">
Nick
</td>
<td align="right">
1
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
<strong>5</strong>
</td>
<td align="right">
</td>
</tr>
<tr>
<td align="left">
Reggie
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
</td>
<td align="right">
<strong>6</strong>
</td>
</tr></tbody> </table>
<p>
Ok so lets have a closer look at the above.
</p>
<p>
Reading across the rows left to right these are the actual examples of each class &#8211; in this case there are 6 children with each name so if you sum over the row you will find that they each add up to 6.
</p>
<p>
Reading down the columns top-to-bottom you will find the predictions &#8211; i.e. what Grandma thought each child&#8217;s name was.  You will find that these columns may add up to more than or less than 6 because Grandma may overfit for one particular name. In this case she seems to think that all her female Grandchildren are called Eliza (she predicted 5/6 Elizas are called Eliza and 5/6 Dianas are also called Eliza).
</p>
<p>
Reading diagonally where I&#8217;ve shaded things in bold gives you the number of correctly predicted examples. In this case Reggie was 100% accurately predicted with 6/6 children called &#8220;Reggie&#8221; actually being predicted &#8220;Reggie&#8221;. Diana is the poorest performer with only 1/6 children being correctly identified. This can be explained as above with Grandma over-generalising and calling all female relatives &#8220;Eliza&#8221;.
</p>
<p>
<figure id="attachment_118" aria-describedby="caption-attachment-118" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-118" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/05/FEN-Ponytail-800px.png?resize=259%2C300&#038;ssl=1" alt="Steve sings for a Rush tribute band - his Geddy Lee is impeccable." width="259" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/05/FEN-Ponytail-800px.png?resize=259%2C300&ssl=1 259w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2016/05/FEN-Ponytail-800px.png?w=690&ssl=1 690w" sizes="(max-width: 259px) 100vw, 259px" data-recalc-dims="1" /><figcaption id="caption-attachment-118" class="wp-caption-text">Steve sings for a Rush tribute band &#8211; his Geddy Lee is impeccable.</figcaption></figure>
</p>
<p>
Grandma seems to have gender nailed except in the case of one of the Steves (who in fairness does have a Pony Tail and can sing very high).  She is best at predicting Reggies and struggles with Brians (perhaps Brians have the most diverse appearance and look a lot like their respective male cousins). She is also pretty good at Nicks and Steves.
</p>
<p>
Grandma is terrible at her female grandchildren&#8217;s names. If this were a machine learning problem we would need to find a way to make it easier to identify the difference between Dianas and Elizas &#8211; through some kind of further feature extraction or weighting, or through the gathering of additional training data.
</p>
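
<p>
Building the matrix itself is straightforward once you have pairs of actual and predicted labels. Here is a small, dependency-free sketch; scikit-learn&#8217;s confusion_matrix function does the same job if you would rather not roll your own.
</p>

<pre>
# Sketch: build a confusion matrix as a nested dict of counts[actual][predicted].
def confusion_matrix(pairs, labels):
    counts = {actual: {predicted: 0 for predicted in labels} for actual in labels}
    for actual, predicted in pairs:
        counts[actual][predicted] += 1
    return counts

# example usage with (actual, predicted) name pairs
names = ["Steve", "Brian", "Eliza", "Diana", "Nick", "Reggie"]
observations = [("Steve", "Steve"), ("Steve", "Brian"), ("Diana", "Eliza")]
print(confusion_matrix(observations, names))
</pre>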
<h2>
Conclusion
</h2>
<p>
Machine learning is definitely no walk in the park. There are a lot of intricacies involved in assessing the effectiveness of a classifier. Accuracy is a great start if until now you&#8217;ve been praying to the gods and carrying four-leaf-clovers around with you to improve your cognitive system performance.
</p>
<p>
However, Precision, Recall, F-Measure and Confusion Matrices really give you the insight you need into which classes your system is struggling with and which classes confuse it the most.
</p>
<h4>
A Note for Document Retrieval (Watson Retrieve & Rank) Users
</h4>
<p>
This example is probably directly relevant to those building classification systems (i.e. extracting intent from questions or revealing whether an image contains a particular company&#8217;s logo). However all of this stuff works directly for document retrieval use cases too. Consider true positive to be when the first document returned from the query is the correct answer and false negative is when the first document returned is the wrong answer.
</p>
<p>
There are also variants on this that consider the top N retrieved answers (Precision@N). These tell you whether your system can return the correct answer in the top 1, 3, 5 or 10 results by simply counting a &#8220;True Positive&#8221; whenever the correct document turns up in the top N answers returned by the query.
</p>
<h3>
Finally&#8230;
</h3>
<p>
Overall I hope this tutorial has helped you to understand the ins and outs of machine learning evaluation.
</p>
<p>
Next time we look at cross-validation techniques and how to assess small corpuses where carving out a 30% chunk of the documents would seriously impact the learning. Stay tuned for more!
</p>
[1]: https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/
[2]: https://upload.wikimedia.org/math/9/9/1/991d55cc29b4867c88c6c22d438265f9.png
[3]: https://en.wikipedia.org/wiki/Harmonic_mean#Harmonic_mean_of_two_numbers

View File

@ -0,0 +1,47 @@
---
title: '#BlackgangPi a Raspberry Pi Hack at Blackgang Chine'
author: James
type: post
date: 2016-06-05T07:59:40+00:00
url: /2016/06/05/blackgangpi-a-raspberry-pi-hack-at-blackgang-chine/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"360de275805d";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:94:"https://medium.com/@jamesravey/blackgangpi-a-raspberry-pi-hack-at-blackgang-chine-360de275805d";}'
categories:
- Work
tags:
- cognitive
- hackathon
- ibm
- watson
---
I was very excited to be invited along with some other IBMers to the Blackgang Pi event run by Dr Lucy Rogers on a semi regular basis at the Blackgang Chine theme park on the Isle of Wight.
[Blackgang Chine][1] is a theme park on the southern tip of the Isle of Wight and holds the title of the oldest theme park in the United Kingdom. We were lucky enough to be invited along to help them modernise some of their animatronic exhibits, replacing some of the aging bespoke PCBs and controllers with Raspberry Pis running Node-RED and communicating using MQTT/Watson IoT.
Over the course of two days, my colleague [James Sutton][2] and I built a talking moose head using some of the IBM Watson Cognitive services.
We got it talking fairly quickly using IBM text to speech and had it listening for intents like &#8220;tell joke&#8221; or &#8220;check weather&#8221; via NLC.
<blockquote class="twitter-tweet" data-lang="en">
<p dir="ltr" lang="en">
So good so far! A talking Moose head powered by <a href="https://twitter.com/IBMIoT">@IBMIoT</a>, <a href="https://twitter.com/IBMWatson">@IBMWatson</a> & <a href="https://twitter.com/NodeRED">@NodeRED</a> <a href="https://twitter.com/hashtag/BlackgangPi?src=hash">#BlackgangPi</a> <a href="https://t.co/Vhgkr8q9cw">pic.twitter.com/Vhgkr8q9cw</a>
</p>
<p>
— James Sutton (@jpwsutton) <a href="https://twitter.com/jpwsutton/status/739075900021604352">June 4, 2016</a>
</p>
</blockquote>
I also built out a dialog that would monitor the state of the conversation and make the user comply with the knock knock joke format (i.e. if you say anything except &#8220;who&#8217;s there&#8221; it will moan and call you a spoil-sport).
Here&#8217;s a video we managed to capture before we had to pack up yesterday:
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/5IMS9VUll6g?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>
[1]: http://www.blackgangchine.com/
[2]: https://jsutton.co.uk/

View File

@ -0,0 +1,113 @@
---
title: The builder, the salesman and the property tycoon
author: James
type: post
date: 2016-11-12T11:43:24+00:00
url: /2016/11/12/the-builder-the-salesman-and-the-property-tycoon/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"45839adb0b2d";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:92:"https://medium.com/@jamesravey/the-builder-the-salesman-and-the-property-tycoon-45839adb0b2d";}'
categories:
- Uncategorized
tags:
- buzzwords
- funny
- machine learning
---
It is a testament to marketers around the world that the myth persists that their AI platform X, Y or Z can solve all of your problems with no effort. Perhaps it is this, combined with developers and data scientists often being hidden out of sight and out of mind, that leads people to think this way.
Unfortunately, the truth of the matter is that ML and AI involve blood sweat and tears &#8211; especially if you are building things from scratch rather than using APIs. If you are using third party APIs there are still challenges. The biggest players in the API space also have large pools of money. Pools of money that can be spent on marketing literature to convince you that their product will solve all your problems with no effort required. I think this is dishonest and is one of the reasons I have so many conversations like the one below.
The take home message is clear! We need to do way more to help clients to understand AI tech and what it can do in a more transparent way. Simply getting customers excited about buzzwords without explaining things in layman&#8217;s terms is a guaranteed way to lose trust and build a bad reputation.
At [Filament][1], we pride ourselves on being honest and transparent about what AI can do for your business and are happy to take the time to explain concepts and buzzwords in layman&#8217;s terms.
**The following is an amusing anecdote about what happens when AI experts get their messaging wrong.**
## The builder, the salesman and the property tycoon
Imagine that a property tycoon is visiting an experienced builder for advice on construction of a new house. _**This is a hugely exaggerated example and all of the people in it are caricatures. No likeness or similarity intended. Our &#8216;master builders&#8217; are patient, understanding and communicative and thankfully, have never met a &#8216;Mr Tycoon&#8217; in real life.**_
<pre>Salesman(SM): Welcome Mr Tycoon, please allow me to introduce to you our master builder. She has over 25 years in the construction industry and qualifications in bricklaying, plumbing and electrics.
Master Builder (MB): Hi Mr Tycoon, great to meet you *handshake*
Tycoon(TC): Lovely to meet you both. I'm here today because I want some advice on my latest building project. I've been buying blocks of apartments and letting them out for years. My portfolio is worth £125 Million. However, I want to get into the construction game.
MB: That's exciting. So how can we help?
TC: Ok I'm a direct kind of guy and I say things how I see them so I'll cut to the chase. I want to build a house. What tools shall I use?
MB: Good question... what kind of house are you looking to build?
TC: Well, whatever house I can build with tools.
MB: ok... well you know there are a lot of options and it depends on budget. I mean you must have something in mind? Bungalow? 2-Story family house? Manor house?
TC: Look, I don't see how this is relevant. I've been told that the tools can figure all this stuff out for me.
SM: Yes MB, we can provide tools that will help TC decide what house to build right?
MB: That's not really how it works but ok... Let's say for the sake of argument that we're going to help you build a 2 bedroom townhouse.
TC: Fantastic! Yes, great initiative MB, a townhouse. Will tools help me build a townhouse?
MB: Yeah... tools will help you build a townhouse...
TC: That's great!
MB: TC, do you have any experience building houses? You said you mainly buy houses, not build them.
TC: No, not really. However, SM told me that with the right tools, I don't need any experience, the tools will do all the work for me.
MB: Right... ok... SM did you say that?
SM: Well, with recent advances in building techniques and our latest generation of tools, anything is possible!
MB: Yes... that's true tools do make things easier. However, you really do need to know how to use the tools. They're not 'magic' - you should understand which ones are useful in different situations
TC: Oh, that's not the kind of answer I was looking for. SM, you said this wouldn't be a problem.
SM: It's not going to be a problem is it MB? I mean we can help TC figure out which tools to use?
MB: I suppose so...
SM: That's the attitude MB... Tell TC about our services
MB: Sure, I have had many years of experience building townhouses, we have a great architect at our firm who can design the house for you. My team will take care of the damp coursing, wooden frame, brickwork and plastering and then I will personally oversee the installation of the electrics and pipework.
TC: Let's not get bogged down in the detail here MB, I just want a townhouse... Now I have a question. Have you heard of mechanical excavators - I think you brits call them "diggers".
MB: Yes... I have used diggers a number of times in the past.
TC: Oh that's great. MB, do you think diggers can help me build a house faster?
MB: Urm, well maybe. It depends on the state of the terrain that you want to build on.
TC: Oh that's great, some of our potential tenants have heard of diggers and if I tell them we used diggers to build the house they will be so excited.
MB: Wonderful...
TC: I've put an order in for 25 diggers - if we have more that means we can build the house faster right?
MB: Are you serious?
SM: Of course TC is serious, that's how it works right?
MB: Not exactly but ok, if you have already ordered them that's fine *tries to figure out what to do with 24 spare diggers*
TC: Great, it's settled then. One more thing, I don't know if I want to do a townhouse. Can you use diggers to build townhouses? I'm only interested in building things that diggers can build.
MB: Yes don't worry, you can build townhouses with diggers. I've done it personally a number of times
TC: I'm not so sure. I've heard of this new type of house called a Ford Mustang. Everyone in the industry is talking about how we can drive up ROI by building Ford Mustangs instead of Townhouses. What are your thoughts MB?
MB: That's not a... diggers... I... I'm really sorry TC, I've just received an urgent text message from one of our foremen at a building site, I have to go and resolve this. Thanks for your time, SM can you wrap up here? *calmly leaves room and breathes into a paper bag*
SM: Sorry about that TC, anyway yes I really love the Ford mustang idea, what's your budget?
-FIN-</pre>
This post is supposed to raise a chuckle and it&#8217;s not supposed to offend anyone in particular. However, on a more serious note, there is definitely a problem with buzzwords in machine learning and industry. Let&#8217;s try and fix it.
[1]: http://filament.uk.com/


@ -0,0 +1,27 @@
---
title: timetrack a simple time tracking application for developers
author: James
type: post
date: 2016-11-23T14:43:58+00:00
url: /2016/11/23/timetrack-a-simple-time-tracking-application-for-developers/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Open Source
tags:
- phd
- projects
- python
- time
- tracking
---
I&#8217;ve written a small command line application for tracking my time on my PhD and other projects. We use Harvest at Filament which is great if you&#8217;ve got a huge team and want the complexity (and of course license charges) of an online cloud solution for time tracking.
If, like me, you&#8217;re just interested to see how much time you are spending on your different projects and you don&#8217;t have any requirement for fancy web interfaces or client billing, then [timetrack][1] might be for you. For me personally, I was wondering how much of my week is spent on my PhD as opposed to Filament client work. I know it&#8217;s a fair amount but I want some clear-cut numbers.
[timetrack][1] is a simple application that allows you to log what time you&#8217;ve spent and where from the command line with relative ease. It logs everything to a text file which is relatively easy to read by !machines. However it also provides filtering and reporting functions so that you can see how much time you spend on your projects, how much time you used today and how much of your working day is left.
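To give a flavour of the idea, here&#8217;s a toy sketch of the log-to-a-text-file approach (this is not the actual timetrack code &#8211; the file name and log format are invented for illustration):

<pre lang="python"># Toy sketch: append one line per chunk of work to a plain text file,
# then sum the minutes per project. Not the actual timetrack code -
# the file name and log format here are made up.
from collections import defaultdict
from datetime import date

LOG_FILE = "worklog.txt"

def log_time(project, minutes):
    with open(LOG_FILE, "a") as f:
        f.write("{} {} {}\n".format(date.today().isoformat(), project, minutes))

def report():
    totals = defaultdict(int)
    with open(LOG_FILE) as f:
        for line in f:
            _, project, minutes = line.split()
            totals[project] += int(minutes)
    for project, minutes in totals.items():
        print("{}: {}h {}m".format(project, minutes // 60, minutes % 60))

log_time("phd", 90)
log_time("filament", 45)
report()</pre>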
It&#8217;s written in python with minimal dependencies on external libraries (save for progressbar2 which is used for the live tracker). The code is open source and available under an MIT license. Download it from [GitHub][1]
[1]: https://github.com/ravenscroftj/timetrack


@ -0,0 +1,58 @@
---
title: We need to talk about push notifications (and why I stopped wearing my smartwatch)
author: James
type: post
date: 2016-11-27T12:59:22+00:00
url: /2016/11/27/we-need-to-talk-about-push-notifications-and-why-i-stopped-wearing-my-smartwatch/
featured_image: /wp-content/uploads/2016/11/IMG_20161127_130808-e1480252170130-576x510.jpg
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"3a1b15a3f469";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:124:"https://medium.com/@jamesravey/we-need-to-talk-about-push-notifications-and-why-i-stopped-wearing-my-smartwatch-3a1b15a3f469";}'
categories:
- Uncategorized
tags:
- multi-tasking
- notifications
- phd
- planning
- work
---
I own a Pebble Steel which I got for Christmas a couple of years ago. I&#8217;ve been very happy with it so far: I can control my music player from my wrist, get notifications and see a summary of my calendar. Recently, however, I&#8217;ve stopped wearing it. The reason is that constant streams of notifications stress me out and interrupt my workflow; not wearing it makes me feel calmer, more in control and more productive.
As you can imagine, trying to do a PhD and be a CTO at the same time has its challenges. I struggle with the cognitive dissonance between walling off my research days to focus on my PhD and making sure that the developers at work are getting on OK and being productive without me. I have thus far tended to compromise by leaving Slack running and fielding the odd question from colleagues even on my off days.
Conversely, when I&#8217;m working for [Filament][1], I often get requests from University colleagues to produce reports and posters, share research notes and to resolve problems with [SAPIENTA][2] or [Partridge][3] infrastructure (or even run experiments on behalf of other academics). Both of these scenarios play havoc with my prioritisation of todos when I get notified about them.
## Human Multitasking
Human Multitasking is something of a myth &#8211; as is [the myth that women can multitask and men can&#8217;t][4]. It turns out that we are all ([except for a small group of people scientists call &#8220;supertaskers&#8221;][5]) particularly rubbish at multi-tasking. I am no exception, however much I wish I was.
When we &#8220;multitask&#8221; we are actually context switching. Effectively, we&#8217;re switching between a number of different tasks very quickly, kind of like how a computer is able to run many applications on the same CPU core by executing different bits of each app &#8211; it might deal with an incoming email, then switch to rendering your Netflix movie, then switch back to continuing to download that email. It does this so quickly that it seems like both activities are happening at once. That&#8217;s obviously different for dual or quad core CPUs but that&#8217;s not really the point here since our brains are not &#8220;quad core&#8221;.
CPUs are really good at context switching very quickly. However, the human brain is really rubbish at this. [Joel Spolsky has written a really cool computer analogy on why][6] but if you don&#8217;t want to read a long article on it, lets just say that where a computer can context-switch in milliseconds, a human needs a few minutes.
It also logically follows that the more cognitively intensive a job is, the more time a brain needs to swap contexts. For example, you might be able to press the &#8220;next&#8221; button on your car stereo while driving at 70 MPH down the motorway, but (aside from the obvious practical implications) you wouldn&#8217;t be able to perform brain surgery and drive at the same time. If you consider studying for a PhD and writing machine learning software for a company to be roughly as complex as the above example, you can hopefully understand why I&#8217;d struggle.
## Push Notifications
The problem I find with &#8220;push&#8221; notifications is that they force you to context switch. We, as a society, have spent the last couple of decades training ourselves to stop what we are doing and check our phones as soon as that little vibration or bling noise comes through. If you are a paramedic or surgeon with a pager, that&#8217;s the best possible use case for this tech, and I&#8217;m not saying we should stop push notifications for emergency situations like that. However, when the notification is &#8220;check out this dank meme dude&#8221; but we are still stimulated into action this can have a very harmful effect on our concentration and ability to focus on the task at hand.
Mobile phone notifications are bad enough but occasionally, if your phone buzzes in your pocket and you are engrossed in another task, you won&#8217;t notice and you&#8217;ll check your phone later. Smartwatch notifications seem to get my attention 9 times out of 10 &#8211; I guess that&#8217;s what they&#8217;re designed for. Having something strapped directly to the skin on my wrist is much more distracting than something buzzing through a couple of layers of clothing on my leg.
I started to find that push notifications forcibly jolt me out of whatever task I&#8217;m doing and I immediately feel anxious until I&#8217;ve handled the new input stimulus. This means that I will often prioritise unimportant stuff like responding to memes that my colleague has posted in slack over the research paper I&#8217;m reading. Maybe this means I miss something crucial, or maybe I just have to go back to the start of the page I&#8217;m looking at. Either way, time is a&#8217;wastin&#8217;.
## The Solution
For me, it&#8217;s obvious. Push notifications need a huge re-think. I am currently reorganising the way I work, think and plan and ripping out as many push notification mechanisms as I can. [I&#8217;ve also started keeping track of how I&#8217;m spending my time using a tool I wrote last week.][7]
I can definitely see a use case for &#8220;machine learning&#8221; triage of notifications based on intent detection and personal priorities. If a relative is trying to get hold of me because there&#8217;s been an emergency, I wouldn&#8217;t mind being interrupted during a PhD reading session. If a notification asking for support on Sapienta or a work project comes through, that&#8217;s urgent but can probably wait half an hour until I finish my current reading session. If a colleague wants to send me a video of grumpy cat, that should wait in a list of things to check out after 5:30pm.
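In the absence of a trained intent classifier, even a crude rule-based filter captures the idea. Everything in this sketch &#8211; the senders, keywords and categories &#8211; is invented purely for illustration:

<pre lang="python"># Toy notification triage - purely illustrative, not a real intent classifier.
# The senders, keywords and categories are all invented for this sketch.
EMERGENCY_SENDERS = {"mum", "dad", "partner"}
WORK_KEYWORDS = {"sapienta", "outage", "deploy", "support"}

def triage(sender, text):
    text = text.lower()
    if sender in EMERGENCY_SENDERS and "urgent" in text:
        return "interrupt-now"    # buzz the watch/phone immediately
    if any(word in text for word in WORK_KEYWORDS):
        return "next-break"       # surface once the current task is finished
    return "after-hours"          # dump into an end-of-day digest

print(triage("colleague", "check out this dank meme dude"))  # after-hours
print(triage("mum", "URGENT - please call me back"))         # interrupt-now</pre>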
Until I, or someone with more time, builds a machine learning filter like this one, I&#8217;ve stopped wearing my smartwatch and my phone is on silent. If you need me and I&#8217;m ignoring you, don&#8217;t take it personally; I&#8217;ll get back to you when I&#8217;m done with my current task. If it&#8217;s urgent, you&#8217;ll just have to try phoning and hoping I notice the buzz in my pocket (until I find a more elegant way to screen urgent calls and messages).
[1]: http://filament.uk.com
[2]: http://sapienta.papro.org.uk
[3]: http://farnsworth.papro.org.uk/
[4]: http://link.springer.com/article/10.3758%2FPBR.17.4.479
[5]: http://link.springer.com/article/10.3758/PBR.17.4.479
[6]: http://www.joelonsoftware.com/articles/fog0000000022.html
[7]: https://brainsteam.co.uk/2016/11/23/timetrack-a-simple-time-tracking-application-for-developers/


@ -0,0 +1,56 @@
---
title: AI can't solve all our problems, but that doesn't mean it isn't intelligent
author: James
type: post
date: 2016-12-08T10:08:13+00:00
url: /2016/12/08/ai-cant-solve-all-our-problems-but-that-doesnt-mean-it-isnt-intelligent/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"e3e315592001";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"6fc55de34f53";s:6:"status";s:6:"public";s:3:"url";s:117:"https://medium.com/@jamesravey/ai-cant-solve-all-our-problems-but-that-doesn-t-mean-it-isn-t-intelligent-e3e315592001";}'
categories:
- PhD
- Work
tags:
- AI
- machine learning
- philosophy
---
<figure id="attachment_150" aria-describedby="caption-attachment-150" style="width: 285px" class="wp-caption alignright"><img loading="lazy" class="wp-image-150 size-medium" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/Thomas_Hobbes_portrait.jpg?resize=285%2C300&#038;ssl=1" width="285" height="300" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/Thomas_Hobbes_portrait.jpg?resize=285%2C300&ssl=1 285w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/Thomas_Hobbes_portrait.jpg?resize=768%2C810&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/Thomas_Hobbes_portrait.jpg?resize=971%2C1024&ssl=1 971w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/Thomas_Hobbes_portrait.jpg?w=1109&ssl=1 1109w" sizes="(max-width: 285px) 100vw, 285px" data-recalc-dims="1" /><figcaption id="caption-attachment-150" class="wp-caption-text">Thomas Hobbes, perhaps most famous for his thinking on western politics, was also thinking about how the human mind &#8220;computes things&#8221; 500 years ago.</figcaption></figure>
[A recent opinion piece I read on Wired][1] called for us to stop labelling our current specific machine learning models AI because they are not intelligent. I respectfully disagree.
AI is not a new concept. The idea that a computer could &#8216;think&#8217; like a human and one day pass for a human has been around since Turing and even in some form long before him. The inner workings of the human brain and how we carry out computational processes have even been discussed by great philosophers such as Thomas Hobbes, who wrote in his book De Corpore, in 1655, that _&#8220;by reasoning, I understand computation. And to compute is to collect the sum of many things added together at the same time, or to know the remainder when one thing has been taken from another. To reason therefore is the same as to add or to subtract.&#8221;_ Over the years, AI has continued to capture the hearts and minds of great thinkers, scientists and of course creatives and artists.
<figure id="attachment_151" aria-describedby="caption-attachment-151" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="wp-image-151 size-full" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/The_Matrix_soundtrack_cover.jpg?resize=300%2C300&#038;ssl=1" width="300" height="300" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/The_Matrix_soundtrack_cover.jpg?w=300&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/The_Matrix_soundtrack_cover.jpg?resize=150%2C150&ssl=1 150w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-151" class="wp-caption-text">The Matrix: a modern day telling of [Rene Descartes&#8217; &#8220;Evil Demon&#8221;][2] theorem</figcaption></figure>
Visionary science fiction authors of the 20th century &#8211; Arthur C Clarke, Isaac Asimov and Philip K Dick &#8211; built worlds of fantasy inhabited by self-aware artificial intelligence systems and robots, [some of whom could pass for humans unless subject to a very specific and complicated test][3]. Endless films have been released that &#8220;sex up&#8221; AI: the Terminator series, The Matrix, Ex Machina, the list goes on. However, like all good science fiction, these stories paint marvellous and thrilling visions of futures that are still in the future, even in 2016.
The science of AI is a hugely exciting place to be too (_I would say that, wouldn&#8217;t I?_). Over the last few decades we&#8217;ve mastered speech recognition, optical character recognition and machine translation good enough that I can visit Japan and communicate, via my mobile phone, with a local shopkeeper without either party having to learn the language of their counterpart. We have arrived at a point where we can train machine learning models to do some specific tasks better than people (including driving cars and [diagnostic oncology][4]). We call these current generation AI models &#8220;weak AI&#8221;. Computers that can solve any problem we throw at them (in other words, ones that have generalised intelligence and are known as &#8220;strong AI&#8221; systems) are a long way off. However, that shouldn&#8217;t detract from what we have already solved with weak AI.
One of the problems with living in a world of 24/7 news cycles and clickbait titles is that nothing is new or exciting any more. Every small incremental change in the world is reported straight away across the globe. Every new discovery, every fractional increase in performance from AI gets a blog post or a news article. It makes everything seem boring. _Oh, Tesla&#8217;s cars can drive themselves? So what? Google&#8217;s cracked Go? Whatever&#8230;_
<figure id="attachment_152" aria-describedby="caption-attachment-152" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="wp-image-152 size-medium" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?resize=300%2C300&#038;ssl=1" width="300" height="300" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?resize=300%2C300&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?resize=150%2C150&ssl=1 150w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?resize=768%2C769&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?resize=1024%2C1024&ssl=1 1024w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/tom-Bathroom-scale-2400px.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-152" class="wp-caption-text">If you lose 0.2Kg overnight, your spouse probably won&#8217;t notice. Lose 50 kg and I can guarantee they would</figcaption></figure>
If you lose 50kg in weight over 6 months, your spouse is only going to notice when you buy a new shirt that&#8217;s 2 sizes smaller or when they spot the change as you get out of the shower. A friend you meet up with once a year is going to see a huge change, because the last time they saw you, you were twice the size. In this day and age, technology moves on so quickly in tiny increments that we don&#8217;t notice the huge changes any more because we&#8217;re like the spouse &#8211; we constantly see the tiny changes.
What if we did see huge changes? What if we could cut ourselves off from the world for months at a time? If you went back in time to 1982 and told them that every day you talk to your phone using just your voice and it is able to tell you about your schedule and what restaurant to go to, would anyone question that what you describe is AI? If you told someone from 1995 that you can [buy a self driving car][5] via a small glass tablet you carry around in your pocket, are they not going to wonder at the world that we live in? We have come a long long way and we take it for granted. Most of us use AI on a day to day basis without even questioning it.
Another common criticism of current weak AI models is precisely the lack of general reasoning skills that would make them strong AI.
> <span class="lede" tabindex="-1">DEEPMIND HAS SURPASSED </span>the <a href="https://www.wired.com/2016/03/googles-ai-taking-one-worlds-top-go-players/" target="_blank">human mind</a> on the Go board. Watson <a href="https://www.wired.com/2014/01/watson-cloud/" target="_blank">has crushed</a> Americas trivia gods on _Jeopardy_. But ask DeepMind to play Monopoly or Watson to play _Family Feud_, and they wont even know where to start.
That&#8217;s absolutely true. The AI/compsci name for this constraint is the &#8220;no free lunch&#8221; theorem for optimisation: you don&#8217;t get something for nothing when you train a machine learning model. In training a weak AI model for a specific task, you are necessarily hampering its ability to perform well at other tasks. I guess a human analogy would be the education system.
<figure id="attachment_153" aria-describedby="caption-attachment-153" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="wp-image-153 size-medium" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/no_idea.jpg?resize=300%2C169&#038;ssl=1" width="300" height="169" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/no_idea.jpg?resize=300%2C169&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2016/12/no_idea.jpg?w=625&ssl=1 625w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-153" class="wp-caption-text">If you took away my laptop and told me to run cancer screening tests in a lab, I would look like this</figcaption></figure>
Aged 14 in a high school in the UK, I was asked which 11 GCSEs I wanted to take. At 16 I had to reduce this scope to 5 A levels, aged 18 I was asked to specify a single degree and aged 21 I had to decide which tiny part of AI/Robotics (which I&#8217;d studied at degree level) I wanted to specialise in at PhD level. Now that I&#8217;m half way through a PhD in Natural Language Processing in my late 20s, would you suddenly turn around and say &#8220;actually you&#8217;re not intelligent because if I asked you to diagnose lung cancer in a child you wouldn&#8217;t be able to&#8221;? Does what I&#8217;ve achieved become irrelevant and pale against that which I cannot achieve? I do not believe that any reasonable person would make this argument.
The AI Singularity has not happened yet and it&#8217;s definitely a few years away. However, does that detract from what we have achieved so far? No. No it does not.
&nbsp;
[1]: https://www.wired.com/2016/12/artificial-intelligence-artificial-intelligent/
[2]: https://en.wikipedia.org/wiki/Brain_in_a_vat
[3]: https://en.wikipedia.org/wiki/Do_Androids_Dream_of_Electric_Sheep%3F
[4]: https://www.top500.org/news/watson-proving-better-than-doctors-in-diagnosing-cancer/
[5]: https://www.tesla.com/en_GB/models


@ -0,0 +1,23 @@
---
title: timetrack improvements
author: James
type: post
date: 2016-12-10T09:33:41+00:00
url: /2016/12/10/timetrack-improvements/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Open Source
- PhD
tags:
- python
- timetrack
---
I&#8217;ve just added a couple of improvements to timetrack that allow you to append to existing time recordings (either with a fixed amount like `15m`, or by using `live` to time additional minutes spent and append them).
You can also remove entries using `timetrack rm` instead of `timetrack remove` &#8211; saving keystrokes is what programming is all about.
You can find the [updated code over at github.][1]
[1]: https://github.com/ravenscroftj/timetrack


@ -0,0 +1,114 @@
---
title: Exploring Web Archive Data CDX Files
author: James
type: post
date: 2017-06-05T07:24:22+00:00
url: /2017/06/05/exploring-web-archive-data-cdx-files/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- PhD
tags:
- cdx
- python
- webarchive
---
I have recently been working in partnership with [UK Web Archive][1] in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of [web archive dumps of the rest of the .UK top level domain][1].
## WARC and CDX Files
The Web Archive project has produced standardized file formats for describing historic web resources in a compressed archive. The website is scraped and the content is stored chronologically in a [WARC][2] file. A CDX index file is also produced, describing every URL scraped, the time it was retrieved and which WARC file the content is in, along with some other metadata.
Our first task is to identify news content in order to narrow down our search to a subset of WARC files (in order not to fill 60TB of storage or have to traverse that amount of data). The CDX files allow us to do this. These files are available for [free download from the Web Archive website.][3] They are compressed using Gzip compression down to around 10-20GB per file. If you try to expand these files locally, you&#8217;re looking at 60-120GB of uncompressed data &#8211; a great way to fill up your hard drive.
## Processing Huge Gzip Files
Ideally we want to explore these files without having to uncompress them explicitly. This is possible using Python 3&#8217;s gzip module but it took me a long time to find the right options.
Python file i/o typically allows you to read a file in line by line. If you have a text file, you can iterate over the lines using something like the following:
<pre lang="python">with open("my_text_file.txt", "r") as f:
for line in f:
print(line)
</pre>
Now clearly trying this approach with a .gz file isn&#8217;t going to work. Using the [gzip][4] module we can open and uncompress gz as a stream &#8211; examining parts of the file in memory and discarding data that we&#8217;ve already seen. This is the most efficient way of dealing with a file of this magnitude, which won&#8217;t fit into RAM on a modern machine and would fill a hard drive if uncompressed.
I tried a number of approaches using the gzip library, trying to run the gzip command line utility using [subprocess][5] and combinations of [TextIOWrapper][6] and [BufferedReader][7] but to no avail.
## The Solution
The solution is actually incredibly simple in Python 3 and I wasn&#8217;t far off the money with [TextIOWrapper][6]. The gzip library offers a file mode flag for accessing gzipped text in a buffered, line-by-line fashion, just as above for the uncompressed text file. Simply passing &#8220;rt&#8221; to the gzip.open() function will wrap the input stream from Gzip in a TextIOWrapper and allow you to read the file line by line.
<pre lang="python">import gzip
with gzip.open("2012.cdx.gz","rt") as gzipped:
    for i,line in enumerate(gzipped):
print(line)
# stop this thing running off and printing the whole file.
if i == 10:
break</pre>
If you&#8217;re using an older version of Python (2.7 for example) or you would prefer to see what&#8217;s going on beneath the covers here explicitly, you can also use the following code:
<pre lang="python">import io
import gzip
with io.TextIOWrapper(gzip.open("2012.cdx.gz","r")) as gzipped:
for i,line in enumerate(gzipped):
print(line)
# stop this thing running off and printing the whole file.
if i == 10:
break</pre>
And it&#8217;s as simple as that. You can now start to break down each line in the file using tools like [urllib][8] to identify content stored in the archive from domains of interest.
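For example, a minimal sketch that collects every archived URL for a single (hypothetical) domain of interest might look like this &#8211; the domain below is just a placeholder:

<pre lang="python"># Minimal sketch: collect archived URLs for one (hypothetical) domain of
# interest. The domain is a placeholder - swap in whatever you're after.
import gzip
from urllib.parse import urlparse

TARGET_DOMAIN = "www.bbc.co.uk"

matches = []
with gzip.open("2012.cdx.gz", "rt") as gzipped:
    for line in gzipped:
        parts = line.split(" ")
        # the URL sits in the third space-separated field of each CDX line
        if len(parts) > 2 and urlparse(parts[2]).netloc == TARGET_DOMAIN:
            matches.append(parts[2])

print(len(matches), "URLs found for", TARGET_DOMAIN)</pre>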
## Solving a problem
We may want to understand how much content is available in the archive for a given domain. To put this another way: which domains have the most pages stored in the web archive? In order to answer this, we can run a simple script that parses all of the URLs, examines the domain name and counts instances of each.
<pre lang="python">import gzip
from collections import Counter
from urllib.parse import urlparse

urlcounter = Counter()

with gzip.open("2012.cdx.gz", "rt") as gzipped:
    for i, line in enumerate(gzipped):

        parts = line.split(" ")

        urlbits = urlparse(parts[2])

        urlcounter[urlbits.netloc] += 1

# at the end we print out the top 10 URLs
print(urlcounter.most_common(10))</pre>
Just to quickly explain what is going on here:
1. We load up the CDX file in compressed text mode as described above
2. We split each line using space characters. This gives us an array of fields, the order and content of which are described by the WebArchive team [here.][3]
3. We parse the URL (which is at index 2) using the [urlparse][9] function which will break the URL up into things like domain, protocol (HTTP/HTTPS), path, query, fragment.
4. We increment the counter for the current domain (described in the &#8216;netloc&#8217; field of the parsed URL).
5. After iterating we print out the domains with the most URLs in the CDX file.
This will take a long time to complete since we&#8217;re iterating over ~60TB of text. I intend to investigate parallel processing of these CDX files as a next step.
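One approach I may try (untested, so treat this as a sketch with made-up file names) is to hand each compressed CDX file to its own worker process and merge the per-file counts at the end:

<pre lang="python"># Untested sketch: give each compressed CDX file to its own worker process
# and merge the per-file domain counts at the end. File names are examples.
import gzip
from collections import Counter
from multiprocessing import Pool
from urllib.parse import urlparse

def count_domains(cdx_path):
    """Count archived pages per domain in a single compressed CDX file."""
    counts = Counter()
    with gzip.open(cdx_path, "rt") as gzipped:
        for line in gzipped:
            parts = line.split(" ")
            if len(parts) > 2:
                counts[urlparse(parts[2]).netloc] += 1
    return counts

if __name__ == "__main__":
    files = ["2010.cdx.gz", "2011.cdx.gz", "2012.cdx.gz"]
    with Pool(processes=len(files)) as pool:
        totals = sum(pool.map(count_domains, files), Counter())
    print(totals.most_common(10))</pre>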
## Conclusion
We&#8217;ve looked into how to dynamically unzip and examine a CDX file in order to understand which domains host the most content. The next step is to identify which WARC files are of interest and request access to them from the Web Archive.
[1]: https://www.webarchive.org.uk/ukwa/
[2]: http://commoncrawl.org/2014/04/navigating-the-warc-file-format/
[3]: http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/
[4]: https://docs.python.org/3.6/library/gzip.html
[5]: https://docs.python.org/3/library/subprocess.html
[6]: https://docs.python.org/3/library/io.html#io.TextIOWrapper
[7]: https://docs.python.org/3/library/io.html#io.BufferedReader
[8]: https://docs.python.org/3/library/urllib.html
[9]: https://docs.python.org/3/library/urllib.parse.html


@ -0,0 +1,100 @@
---
title: Dialect Sensitive Topic Models
author: James
type: post
date: 2017-07-25T11:02:42+00:00
url: /2017/07/25/dialect-sensitive-topic-models/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Open Source
- PhD
tags:
- lda
- machine learning
- python
- topic model
---
As part of my PhD I&#8217;m currently interested in topic models that can take into account the dialect of the writing. That is: how can we build a model that can compare topics discussed in different dialectal styles, such as scientific papers versus newspaper articles? If you&#8217;re new to the concept of topic modelling then [this article][1] can give you a quick primer.
## Vanilla LDA
<figure id="attachment_175" aria-describedby="caption-attachment-175" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-175" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/300px-Latent_Dirichlet_allocation.png?resize=300%2C157&#038;ssl=1" alt="" width="300" height="157" data-recalc-dims="1" /><figcaption id="caption-attachment-175" class="wp-caption-text">A diagram of how latent variables in LDA model are connected</figcaption></figure>
Vanilla topic models such as [Blei&#8217;s LDA][2] are great but start to fall down when the wording around one particular concept varies too much. In a scientific paper you might expect to find words like &#8220;gastroenteritis&#8221;, &#8220;stomach&#8221; and &#8220;virus&#8221; whereas in newspapers discussing the same topic you might find &#8220;tummy&#8221;, &#8220;sick&#8221; and &#8220;bug&#8221;.  A vanilla LDA implementation might struggle to understand that these concepts are linked unless the contextual information around the words is similar (e.g. both articles have &#8220;uncooked meat&#8221; and &#8220;symptoms last 24 hours&#8221;).
&nbsp;
We define a set of toy documents that have 3 main topics, around sickness, health/diet and going to the gym. Half of the documents are written in &#8220;layman&#8217;s&#8221; English and the other half in &#8220;scientific&#8221; English. The documents are shown below:
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']</pre>
Using a normal implementation of LDA with 3 topics, we get the following results after 30 iterations:
<figure id="attachment_174" aria-describedby="caption-attachment-174" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="size-medium wp-image-174" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&#038;ssl=1" alt="" width="300" height="209" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?resize=300%2C209&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.31.20.png?w=482&ssl=1 482w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-174" class="wp-caption-text">Vanilla LDA results</figcaption></figure>
It is fair to say that vanilla LDA didn&#8217;t do a terrible job, but it did end up with some strange decisions, like putting &#8216;poisoning&#8217; (as in &#8216;food poisoning&#8217;) in with &#8216;cardio&#8217; and &#8216;calories&#8217;. The other two topics seem fairly consistent and sensible.
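For anyone who wants to reproduce a similar baseline, a vanilla LDA run over the toy documents can be put together in a few lines of gensim. This is illustrative only &#8211; not the implementation or hyperparameters behind the screenshot above:

<pre lang="python"># A minimal vanilla-LDA baseline over the toy documents using gensim.
# Illustrative only - not the implementation or settings used for the
# results in the screenshot above.
from gensim import corpora
from gensim.models import LdaModel

docs = [doc1, doc2, doc3, doc4, doc5, doc6,
        doc7, doc8, doc9, doc10, doc11]        # as defined above

dictionary = corpora.Dictionary(docs)               # maps words to integer ids
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=30)

for topic_id in range(3):
    print(lda.print_topic(topic_id))</pre>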
&nbsp;
## DiaTM
Crain et al.&#8217;s 2010 paper [_**&#8220;Dialect topic modeling for improved consumer medical search&#8221;**_][3] proposes a modified LDA that they call &#8220;DiaTM&#8221;.
<figure id="attachment_176" aria-describedby="caption-attachment-176" style="width: 286px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-176" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&#038;ssl=1" alt="" width="286" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=286%2C300&ssl=1 286w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?resize=768%2C805&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/dialda.png?w=910&ssl=1 910w" sizes="(max-width: 286px) 100vw, 286px" data-recalc-dims="1" /><figcaption id="caption-attachment-176" class="wp-caption-text">A diagram showing how the latent variables in DiaTM are linked together</figcaption></figure>
DiaTM works in the same way as LDA but also introduces the concept of collections and dialects. A collection defines a set of documents from the same source or written with a similar dialect &#8211; you can imagine having a collection of newspaper articles and a collection of scientific papers for example. Dialects are a bit like topics &#8211; each word is effectively &#8220;generated&#8221; from a dialect and the probability of a dialect being used is defined at collection level.
The handy thing is that each word has a probability of appearing in every dialect, which is learned by the model. This means that words common to all dialects (such as &#8216;diet&#8217; or &#8216;food&#8217;) can be weighted as such in the model.
Running DiaTM on the same corpus as above yields the following results:
<figure id="attachment_178" aria-describedby="caption-attachment-178" style="width: 660px" class="wp-caption alignright"><img loading="lazy" class="wp-image-178 size-large" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=660%2C177&#038;ssl=1" alt="" width="660" height="177" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=1024%2C275&ssl=1 1024w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=300%2C81&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?resize=768%2C206&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2017/07/Screen-Shot-2017-07-25-at-11.27.47.png?w=1334&ssl=1 1334w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption id="caption-attachment-178" class="wp-caption-text">Results of DiaTM on the sickness/exercise corpus</figcaption></figure>
You can see how the model has effectively identified the three key topics in the documents above but has also segmented the topics by dialect. Topic 2 is mainly concerned with food poisoning and sickness. In dialect 0 the words &#8220;sick&#8221; and &#8220;bad&#8221; appear but in dialect 1 the words &#8220;vomit&#8221; and &#8220;gastroenteritis&#8221; appear.
## Open Source Implementation
I have tried to turn my experiment into a Python library that others can make use of. It is currently early stage and a little slow but it works. The code is [available here][4] and pull requests are very welcome.
The library offers a &#8216;Scikit-Learn-like&#8217; interface where you fit the model to your data like so:
<pre lang="python">doc1 = ["tummy", "ache", "bad", "food","poisoning", "sick"]
doc2 = ["pulled","muscle","gym","workout","exercise", "cardio"]
doc3 = ["diet", "exercise", "carbs", "protein", "food","health"]
doc4 = ["stomach", "muscle", "ache", "food", "poisoning", "vomit", "nausea"]
doc5 = ["muscle", "aerobic", "exercise", "cardiovascular", "calories"]
doc6 = ["carbohydrates", "diet", "food", "ketogenic", "protein", "calories"]
doc7 = ["gym", "food", "gainz", "protein", "cardio", "muscle"]
doc8 = ["stomach","crunches", "muscle", "ache", "protein"]
doc9 = ["gastroenteritis", "stomach", "vomit", "nausea", "dehydrated"]
doc10 = ["dehydrated", "water", "exercise", "cardiovascular"]
doc11 = ['drink', 'water', 'daily','diet', 'health']
# 'layman's english' documents
collection1 = [doc1, doc2, doc3, doc7, doc11]

# 'scientific' documents
collection2 = [doc4, doc5, doc6, doc8, doc9, doc10]

collections = [collection1, collection2]

# note: assumes DiaTM has been imported from the library linked above
dtm = DiaTM(n_topic=3, n_dialects=2)
dtm.fit(collections)
</pre>
Fitting the model to new documents using transform() will be available soon as will finding the log probability of the current model parameters.
[1]: http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html
[2]: http://dl.acm.org/citation.cfm?id=2133826
[3]: http://www.ncbi.nlm.nih.gov/pubmed/21346955
[4]: https://github.com/ravenscroftj/diatm


@ -0,0 +1,266 @@
---
title: Why I keep going back to Evernote
author: James
type: post
date: 2017-08-03T08:27:53+00:00
url: /2017/08/03/182/
featured_image: /wp-content/uploads/2017/08/cahier-spirale-825x510.png
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"5ce618eb3174";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:139:"https://medium.com/@jamesravey/as-the-cto-for-a-london-machine-learning-startup-and-a-phd-student-at-warwick-institute-for-the-5ce618eb3174";}'
categories:
- PhD
- Work
tags:
- evernote
- filament
- knowledge management
- markdown
- phd
---
<div>
As the CTO for a <a href="http://filament.uk.com">London machine learning startup</a> and a <a href="https://www.wisc.warwick.ac.uk/">PhD student at Warwick Institute for the Science of Cities</a>, to say I'm busy is an understatement.
</div>
<div>
</div>
<div>
At any given point in time, my mind is awash with hundreds of ideas around Filament tech strategy, a cool app I'd like to build, ways to measure scientific impact, wondering what the name of that new song I heard on the radio was, or some combination thereof.
</div>
<div>
</div>
<div>
To be effective at my job (and to stay sane in general), I really have to stay on top of what I know and how I manage my knowledge. My brain just doesn't have enough room for everything I need to know and I'll be the first to admit that I'm one of those stereotypical absent-minded academic types, like <a href="http://www.bbc.co.uk/programmes/b06wk4ph">Harry Hill's portrayal of Professor Branestawm</a>. That's why I spend so much time trying to perfect the art of building a "second brain".
</div>
<div>
</div>
<div>
There are so many different ways I could be keeping my notes but they all have their flaws. That's why I keep coming back to Evernote. It is certainly not amazing, it just happens to be the best of a bad lot. Here are some strategies I've tried and why they failed.
</div>
<div>
</div>
## Approach 1: Keeping paper notebooks
<div>
I love stationery. There's nothing quite like writing your name on the front page of a brand new Moleskine notebook. I accumulate notebooks like some kind of eccentric collector. At the last count I had something like 28 of them under my desk. The sad thing is that most of them are at about 25% usage, if that. I am well aware of how wasteful that sounds, because it is wasteful and I am kind of ashamed. Here's the key problem though: physical paper notebooks are easy to forget.
</div>
### Downside 1: Forgetting your notebook
<div>
As I said above, I'm a real stereotypical absent-minded professor type. As my girlfriend often tells me, and my parents many years before her, I'd forget my head if it wasn't screwed on. It's a miracle I leave the house fully dressed with my wallet and my mobile phone. A miracle, or years of conditioning and last-minute "pat checks", anyway.
</div>
<div>
</div>
<div>
Joking aside, I am pretty well organised most of the time but every time I go through a physical notebook phase I really struggle to get into a routine of remembering to put my notebook in my bag on the way out of the door.
</div>
<div>
</div>
### Downside 2: Practicality
<div>
If I am able to remember to bring it, it doesn't take me long to run into obstacle number 2 to notebook adoption: practicality. I spend a great deal of time on busy London commuter trains. Extracting your notebook and pen from your satchel and having enough room to move your arms around when you're packed like a sardine on the 8.05 to Waterloo is particularly difficult, and even on the miraculous occasion when I get a seat, there usually isn't room to swing an amoeba, never mind a cat.
</div>
<div>
</div>
### Downside 3: Organising and Searching
<div>
Assuming I remember it and I have room to open it and read it, the next problem with a physical notebook is finding information in it. Given my two jobs and my general curiosity, I am always making notes, having ideas and context switching. If I carry only one physical notebook then organising things by idea or context becomes very difficult. I've tried a couple of approaches here: carrying a wedge of coloured sticky notes and sticking a different one to my notebook each time I context switch seems like a great idea until you realise it makes things even less practical and there are only so many colours. I also tried carrying different notebooks for different projects, or even just two: one for Filament and one for my PhD. Again, this is not overly practical and if I can't remember one notebook, remembering two is even less likely.
</div>
<div>
</div>
### Downside 4: Security and Backup
<div>
The last big flaw with paper is that it is easy to lose or steal and once it's gone, it's gone. Say what you will about people snooping and hacking on the internet, at least I can't leave "the cloud" on a train or in a restaurant. Again, thanks to years of conditioning, I am much less prone to leave big-ticket items like mobile phones and laptops lying around in public places, but even if I did, my devices are encrypted, password protected and enabled for remote wipe (assuming enough battery life and connectivity). You just can't remote wipe a sliver of tree.
</div>
<div>
</div>
## Approach 2: Personal Wiki
<div>
While I was at Uni, a friend of mine started keeping a personal wiki which was locked down and only readable to them. I thought that was a great idea: an online notebook, accessible from anywhere. I've been keeping a personal wiki for years (off and on) using <a href="https://www.dokuwiki.org/dokuwiki#">DokuWiki</a> and I rate the software highly as wikis go. However, there are a couple of reasons I struggle with personal wiki maintenance.
</div>
### Downside 1: Connectivity
<div>
One of the biggest problems with the personal wiki approach is offline access. The trains I typically catch don't have great wifi and even when I can get online, it is often very slow or patchy and unreliable. I have tried my hand at hosting dokuwiki on my laptop and using the <a href="https://www.dokuwiki.org/plugin:sync">XML-RPC sync plugin </a>to mirror everything to my personal web server but the plugin (while amazingly and lovingly maintained by volunteers whose time I am incredibly grateful for) is not particularly reliable and I have sometimes lost notes using this process. I also can't run a full LAMP stack on my phone, which brings me to my next point…
</div>
<div>
</div>
### Downside 2: Mobile Usability
<div>
When I'm crammed into a sardine tin or just want to make a quick 30-second note, the last thing I want to be doing is booting up my laptop to edit my personal wiki. Yes, I know I can edit my wiki from Chrome for Android and I have done on numerous occasions. However, I just don't find the experience to be particularly pleasant or even practical. What dokuwiki et al are missing is a really good mobile app for making quick adjustments to content and ideally syncing sections of the wiki offline for reference on the go.
</div>
<div>
</div>
### Downside 3: Dashboard or “making wikis sticky”
<div>
When I'm trying to build a new habit, what I really need is for whatever tool I am trying to get used to to be in my face as much as possible. That's one of the reasons dokuwiki on its own just isn't sticky. Evernote is in your face on your desktop and phone and offers notifications. Physical notebooks are in your face because they are physical books (if you remember to take them with you). Remembering to log into dokuwiki and read your todos is a bit like remembering to put your notebook in your bag. If it's not muscle memory, it just doesn't stick.
</div>
<div>
</div>
## Approach 3: Markdown Notes
<div>
</div>
<div>
Similarly to dokuwiki, I've been evangelised a few times by techie friends who use markdown notes and Dropbox or ownCloud. It's a great idea in principle &#8211; using open source markdown <-> html <-> pdf programs like <a href="http://johnmacfarlane.net/pandoc/installing.html">pandoc </a>to make my notes readable. Unfortunately markdown notes share a number of shortcomings with physical notebooks and personal wikis.
</div>
<div>
</div>
<div>
You'll see from my notes below that my main gripe with this workflow is that it requires me to configure lots of different moving parts and write bespoke tools where existing open (or even paid) ones don't do what I need.
</div>
<div>
</div>
### Downside 1: Usability
<div>
OK, so I'm a geek and arguably I should just feel happy writing and maintaining my notes from the command line. The problem for me is that it's not tremendously usable or "in your face", as I touched on above. I mean, sure, I love command lines as much as the next Linux power user; I do most of my development in vim and most of my source code management using the git CLI (as opposed to the GUIs that you can get for Eclipse and Atom etc). Personal knowledge management is one of the few places where I'd much rather just have an all-in-one tool that does everything I need, and does it well, than have to faff about with loads of moving parts and utilities. A few cases where this is particularly poignant:
</div>
<div>
</div>
1. Searching for notes in Evernote or dokuwiki is as easy as typing into a search box and hitting enter. Searching for notes in markdown requires me to read the grep man pages for the umpteenth time because I've forgotten which flag to set to turn on or off regex in the particular flavour I need.
2. Inserting images into markdown notes is a pain because I need to download the image, place it in a folder relatively near to my content and add an image markup section to my code. My assets tend to get very fragmented in this case too because I might use the same image in multiple notes and end up storing multiple copies of it in different "assets" folders. There is a [really good Atom plugin][1] for this that allows you to "paste" images in your clipboard buffer into your notes and will automatically save it to disk and generate a link. That one plugin, as good as it is, doesn't solve all my other gripes though.
3. Rendering my markdown is a pain &#8211; the Atom plugins for this are not perfect and, due to different flavours of markdown and rendering engines, I can render the notes in one tool and find that they render completely differently in another. The best combo I've found so far has been [Atom][2] + [Pandoc Markdown Plugin][3] + [Pandoc][4].
### Downside 2: Mobile Usability
<div>
Exactly the same problem as with personal wikis. There are no good apps for a holistic markdown-notes-based workflow for Android. Sure, there are a few good markdown editors and the best one I've found is <a href="https://play.google.com/store/apps/details?id=com.ekartoyev.enotes&hl=en_GB">Epsilon Notes</a>. However, I need something that has a widget for my home screen and notifications and lights and bells and whistles to keep my absent-minded academic brain on track. Epsilon is great if you remember to open it and check it (anyone else see a theme here?) but suffers from some of the same usability issues that desktop markdown editors do (inserting images etc.). It is still great though &#8211; if you're looking for a markdown editor for Android and don't have crazy demands like me, check it out!
</div>
## Major Common Downside: Web Clipper (or lack thereof)
<div>
</div>
<div>
All of the approaches I've listed above are great in different ways and each has its own drawbacks, but one thing that they are all missing is a good Web Clipper. A web clipper is typically a browser plugin or bookmarklet that allows you to save a web page's content offline for reading later. Think of it like bookmarking on steroids. I suppose the real-world simile would be: instead of putting a bookmark in Game of Thrones when you put it down to go and have a cup of tea, you photocopy the whole chapter in case someone steals your copy of the book. This is super useful for me as I travel on the train a lot with no wifi, but it offers other amazing advantages too. For example, I can annotate/highlight and comment on content in the page directly and if the author of the original page takes their content down or forgets to pay their hosting bill, I can still read it.
</div>
<div>
</div>
<div>
Evernote are by no means the only people on the market with one of these babies. There are a few standalone tools that do this too, and Microsoft's impressive-but-let's-be-honest-less-usable-and-more-clunky Evernote competitor, <a href="https://www.onenote.com/">OneNote</a>, has a web clipper too. The web clipper is an absolutely crucial part of my workflow and I've never found a replacement for Evernote's implementation that is quite as good. Honestly, I could probably live without and overlook some of the other things I moaned about earlier in this post, but the web clipper really is the be-all-and-end-all feature that I am looking for in a knowledge management suite.
</div>
## Downsides of Evernote
<div>
</div>
<div>
As I said at the beginning, I am not an Evernote fanboy and this is supposed to be a fair and representative post. Therefore, let's talk about Evernote's warts.
</div>
### Downside 1: Pricing Model
<div>
Look, I'm a professional. I know how much time and effort people put into software development and I am happy to pay for high quality products if they offer a genuine advantage to me. That's why I think that, at £30 a year for the Plus option, Evernote is a steal. What I don't like is the idea of artificially turning off some of the features just so that I drop another £20 for their Premium tier. 1GB of storage per month? Fine! I only need that. 99% of the time I don't need to scan business cards and, to be perfectly honest, I can live without indexing text in PDFs too. However, those are all "nice to have" things that I'm not going to drop £20 for. I actually prefer Microsoft's OneNote pricing model over Evernote&#8217;s: they don't charge for the tooling, they charge for the disk space you're using.
</div>
### Downside 2: No Math Markup
<div>
As a machine learning specialist, I spend a lot of time reading and writing mathematical formulae. Therefore it kind of sucks that Evernote still doesn’t have any support for math. I can write math in my paper notebooks (or anything else for that matter) and I can use <a href="https://www.mathjax.org/">MathJax</a> in dokuwiki and LaTeX math syntax in markdown. I mean jeez, even OneNote has math formulae. If Evernote had maths markup I think it’d be pretty difficult to get me to leave their platform.
</div>
### Downside 3: The Cloud is just someone elses computer
<div>
Perhaps one of the biggest drawbacks of Evernote (and OneNote) is that they both require you to trust that they are not doing naughty things with your data, and I don’t. Call me what you will: conspiracy theorist, foil hat wearer &#8211; I was chubby, spotty and wore thick glasses at high school, I can take it. However, the point still stands that if you are putting data in someone else’s “black box” then you either have to take their promises at face value or just assume that they can see everything you are sending. <a href="https://evernote.com/security/">Evernote make all the right noises</a> about securing your data against attackers and hackers and that’s good news, but as a “tin foil hat wearing conspiracy theorist” I would rather just keep all my private data in my own private network. I would much rather pay a monthly license cost to run Evernote’s server stack software on my own hardware (yes, I know running software compiled by someone else is scary, but it’s still less scary than trusting them blindly with my data and I could always <a href="https://www.wireshark.org/">Wireshark</a> the machine it’s running on and see if it’s sending any mysterious packets back to the mothership).
</div>
## So why \*do\* I keep going back?
<div>
Despite all its flaws and warts, in terms of pricing, features and security, Evernote keeps enticing me back because:
</div>
1. The user experience is great. The desktop app is clear and well designed, and it handles notifications and “in your face-ness” really well.
2. It doesn’t let me forget it &#8211; kind of the same point as 1. but specifically Evernote stays front of mind and in a mind like mine, that’s a real feat.
3. Mobile integration is really good and the Evernote android app is fantastic.
4. The web clipper is so great on both desktop and mobile. I can glance at websites and “download” them for later, then read them when I’m not busy &#8211; signal or not.
5. While the pricing tiers are obnoxious, the Plus plan isn’t unreasonable given how much I spend on coffee.
## What would entice me away?
<div>
If the open source community rallied around dokuwiki or markdown to create a high-quality web clipper and a decent mobile app I would SO be there. Or if ownCloud/NextCloud notes was a bit more mature and had these features.
</div>
## Concluding thoughts
<div>
Evernote is by no means perfect and I have deep concerns about my personal data when I use it. However, it seems to be the best option for keeping me organised on a practical day-to-day level and that’s why I keep going back. Here’s hoping that we get some good open source alternatives in this space sooner rather than later. If anyone has any suggestions for alternatives I could try, I’d love to hear about them in the comments section.
</div>
[1]: https://atom.io/packages/markdown-image-assistant
[2]: https://atom.io/
[3]: https://atom.io/packages/search?q=pandoc
[4]: http://johnmacfarlane.net/pandoc/installing.html

View File

@ -0,0 +1,68 @@
---
title: 'Cython: Some Top Tips'
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=191
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";N;}'
categories:
- Uncategorized
---
This week I&#8217;ve been using [Cython][1] to build &#8220;native&#8221; Python extensions. For the uninitiated, Cython is the secret love-child programming language of C and Python. A common misconception is that Cython is &#8220;an easy way for Python developers to write fast code using C&#8221;. In reality, using Cython requires familiarity with both Python and C and makes use of concepts from both languages. Therefore I&#8217;d highly recommend reading up on C a little bit before you start working on Cython code.
During the last few days I&#8217;ve been running into some interesting problems and solving a few of them. I&#8217;m hoping that this blog post will provide much-needed Google results for those who don&#8217;t want to waste hours on these issues like I did.
## Using Cython modules from Python
Cython compiles into a binary library that can be loaded natively with an import statement. However, getting it compiled is the tricky bit.
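If you just want a permanent, importable build, a minimal setup.py along these lines (a sketch &#8211; it assumes your Cython file is called test.pyx, as in the examples below) does the job:

<pre lang="python">from setuptools import setup
from Cython.Build import cythonize

# build the extension in place with: python setup.py build_ext --inplace
setup(ext_modules=cythonize("test.pyx"))
</pre>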
When you&#8217;re doing quick and dirty dev work and re-running your code to see if it will work every few minutes, I&#8217;d recommend making use of the _**pyximport**_ library that comes with Cython. This module makes importing cython libraries really convenient by wrapping the build process and making the import statement look for and build .pyx files. All you need to do to get it working is run:
<pre lang="python">import pyximport; pyximport.install()</pre>
Then you can literally just import your library. Imagine your Cython file is called test.pyx, you can just do:
<pre lang="python">import test</pre>
and off you go.
If, like me, you&#8217;re a big fan of Jupyter notebooks and using importlib reload to bring in new versions of models you&#8217;re developing, Cython and pyximport offer a hack that supports this. When you import pyximport, add reload_support=True to the install function call to enable this.
<pre lang="python">import pyximport; pyximport.install(reload_support=True)</pre>
I found this to be very hacky and that reloading often failed with this method unless preceded by another import statement. Something like this usually works:
<pre lang="python">from importlib import reload
import test
reload(test)
</pre>
## Optimising and Understanding Cython Code
Remember that Cython code is first &#8220;re-written&#8221; or &#8220;transpiled&#8221; to C code and then compiled to machine-readable binary by your system&#8217;s C compiler. Well-written C is still one of the fastest languages you can write an application in (but also complex and easy to cause a crash from). Since Python is an interpreted language that lives inside a virtual machine, each operation &#8211; such as adding together two numbers &#8211; actually translates to several C expressions.
Well written Cython code can be compiled down to a small number of instructions but badly optimised Cython will just result in lines and lines of C code. In these cases, the benefit you&#8217;re going to be getting from having written the module in Cython is likely to be negligible over standard interpreted Python code.
Cython comes with a handy tool which generates an HTML report showing how well optimised your code is. You can run it on your code by doing
<pre lang="bash">cython -a test.pyx</pre>
What you should now have is a test.c file and a test.html file in the directory. If you open the HTML file in the browser you&#8217;ll see your Cython code with yellow highlights. It&#8217;s pretty simple: the brighter/more intense the yellow colouring, the more likely it is that your code is interacting with normal Python objects rather than pure C ones and ergo the more likely it is that you can optimise that code and speed things up*.
*Of course this isn&#8217;t always the case. In some cases you will want to be interacting with the Python world &#8211; for example, code that passes the output from a highly optimised C function back into the world of the Python interpreter so that it can be used by normal Python code.
If you&#8217;re trying to squeeze loads of performance out of Cython, what you should be aiming for is to get to a point where all your variables have a C type (by using **cdef** to declare them before you use them) and to only apply C operations and functions wherever possible.
For example the code:
<pre>i = 0
while i &lt; 99:
    i += 1
</pre>
will result in lines and lines of generated C, because _i_ is still a regular Python object and every comparison and increment has to go through the Python C API.
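By contrast, a minimal sketch of the typed version (wrapping the loop in a function and declaring the counter with **cdef** as described above &#8211; the function name is just for illustration) should compile down to a plain C loop:

<pre lang="python">def count_to_99():
    # declaring i as a C int keeps the comparison and increment in pure C
    cdef int i = 0
    while i < 99:
        i += 1
    return i
</pre>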
[1]: http://cython.org/

View File

@ -0,0 +1,87 @@
---
title: Machine Learning and Hardware Requirements
author: James
type: post
date: 2017-08-11T17:22:12+00:00
draft: true
url: /?p=195
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"6e9abb882f26";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:86:"https://medium.com/@jamesravey/machine-learning-and-hardware-requirements-6e9abb882f26";}'
categories:
- Uncategorized
---
_**With recent advances in machine learning techniques, vendors like [Nvidia][1], [Intel][2], [AMD][3] and [IBM][3] are announcing hardware offerings specifically tailored around machine learning. In this post we examine the key differences between &#8220;traditional&#8221; software and machine learning software and why those differences necessitate a new type of hardware stack.**_
Most readers would certainly be forgiven for wondering why NVidia (NVDA on the stock market), a company that rose to prominence for manufacturing and distributing graphics processing chips to video games enthusiasts, are suddenly being mentioned in tandem with machine learning and AI products. You would also be forgiven for wondering why machine learning needs its own hardware at all. Surely a program is a program right? To understand how these things are connected, we need to talk a little bit about how software runs and the key differences between a procedural application that you&#8217;d run on your smart phone versus a deep neural network.
## How Traditional (Procedural) Software Works
&nbsp;<figure style="width: 293px" class="wp-caption alignright">
<img loading="lazy" src="https://i1.wp.com/openclipart.org/image/2400px/svg_to_png/28411/freephile-Cake.png?resize=293%2C210&#038;ssl=1" alt="Cake by freephile" width="293" height="210" data-recalc-dims="1" /><figcaption class="wp-caption-text">An algorithm is a lot like a cake recipe</figcaption></figure>
You can think of software as a series of instructions. In fact, that&#8217;s all an algorithm is. A cooking recipe that tells you how to make a cake step-by-step is a real world example of an algorithm that you carry out by hand every day.
Traditional software is very similar to a food recipe in principle.
1. First you define your variables (a recipe tells you what ingredients you need and how much you&#8217;ll need for each).
2. Then you follow a series of instructions. (Measure out the flour, add it to the mixing bowl, measure out the sugar, add that to the bowl).
3. Somewhere along the way you&#8217;re going to encounter conditions (mix in the butter until the mixture is smooth or whip the cream until it is stiff).
4. At the end you produce a result (i.e. you present the cake to the birthday girl or boy).
A traditional Central Processing Unit (CPU) that you&#8217;d find in your laptop, mobile phone or server is designed to process one instruction at a time. When you are baking a cake that&#8217;s fine because often the steps are dependent upon each other. You wouldn&#8217;t want to beat the eggs, put them in the oven and start pouring the flour all at the same time because that would make a huge mess. In the same way, it makes no sense to send each character in an email at the same time unless you want the recipient&#8217;s message to be garbled.
## Parallel Processing and &#8220;Dual Core&#8221;<figure style="width: 273px" class="wp-caption alignleft">
<img loading="lazy" src="https://i0.wp.com/openclipart.org/image/2400px/svg_to_png/25734/markroth8-Conveyor-Belt.png?resize=273%2C114&#038;ssl=1" alt="Conveyor Belt by markroth8" width="273" height="114" data-recalc-dims="1" /><figcaption class="wp-caption-text">CPUs have been getting faster at processing like more and more efficient cake making production lines</figcaption></figure>
Over the last 2 decades, the processing speed of CPUs has got faster and faster, which effectively means that they are able to get through more and more instructions, one at a time. Imagine moving from one person making a cake to a machine that makes cakes on a conveyor belt. However, consumer computing has also become more and more demanding and with many homes globally connected to high speed internet, multitasking &#8211; running more than one application on your laptop at the same time or looking at multiple tabs in your browser &#8211; is becoming more and more common.
Before Parallel Processing (machines that advertise being &#8220;dual core&#8221;, and more recently &#8220;quad core&#8221; and even &#8220;octo-core&#8221;), computers appeared to be running multiple applications at the same time by doing little bits of each of the applications and switching around. Continuing our cake analogy, this would be like putting a chocolate cake in the oven and then proceeding to mix the flour and eggs for a vanilla sponge, all the while periodically checking that the chocolate cake isn&#8217;t burning.
Multi-processing (dual/quad/octo core) allows your computer to really run multiple programs at the same time, rather than just appearing to. This is because each chip has 2 (dual), 4 (quad) or 8 (octo) CPU cores all working on the data at the same time. The cake analogy is that we now have 2 chefs or even 2 conveyor-belt factory machines.
## How [Deep] Neural Networks Work
Neural Networks are modelled around how the human brain processes and understands information. Like a brain, they consist of neurons which get excited under certain circumstances like observing a particular word or picture and synapses which pass messages between neurons. Training a neural network is about strengthening and weakening the synapses that connect the neurons to manipulate which neurons get excited based on particular inputs. This is more or less how humans learn too!
The thing about human thinking is that we don&#8217;t tend to process the things we see and hear in small chunks, one at a time, like a traditional processor would. We process a whole image in one go, or at least it feels that way, right? Our brains do a huge amount of parallel processing. Each neuron in our retinas receives a small part of the light coming in through our eyes and through communication via the synapses connecting our brain cells, we assemble a single coherent image.
<img loading="lazy" class="alignright size-medium wp-image-196" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2017/08/IMG_20170811_173437.jpg?resize=169%2C300&#038;ssl=1" alt="" width="169" height="300" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2017/08/IMG_20170811_173437.jpg?resize=169%2C300&ssl=1 169w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2017/08/IMG_20170811_173437.jpg?resize=768%2C1365&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2017/08/IMG_20170811_173437.jpg?resize=576%2C1024&ssl=1 576w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2017/08/IMG_20170811_173437.jpg?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2017/08/IMG_20170811_173437.jpg?w=1980&ssl=1 1980w" sizes="(max-width: 169px) 100vw, 169px" data-recalc-dims="1" />
Simulated neural networks work in the same way. In a model learning to recognise faces in an image, each neuron receives a small part of the picture &#8211; usually a single pixel &#8211; carries out some operation and passes the message along a synapse to the next neuron which carries out an operation. The calculations that each neuron makes are largely independent unless it is waiting for the output from a neuron in the next layer up. That means that while it is possible to simulate a neural network on a single CPU, it is very inefficient because it has to calculate what each neuron&#8217;s verdict about its pixel is independently. It&#8217;s a bit like the end of the Eurovision song contest where each country is asked for its own vote over the course of about an hour. Or if you&#8217;re unfamiliar with our wonderful but [obscure European talent contest][4], you could say it&#8217;s a bit like a government vote where each representative has to say &#8220;Yea&#8221; or &#8220;Nay&#8221; one after another. Even with a dual, quad or octo core machine, you can still only simulate a small number of neurons at a time. If only there was a way to do that&#8230;
## Not Just for Gaming: Enter NVidia and GPUs
&nbsp;<figure style="width: 273px" class="wp-caption alignright">
<img loading="lazy" src="https://i1.wp.com/openclipart.org/image/2400px/svg_to_png/213387/Video-card.png?resize=273%2C198&#038;ssl=1" alt="Video card by jhnri4" width="273" height="198" data-recalc-dims="1" /><figcaption class="wp-caption-text">GPUs with sporty go-faster stripes are quite common in the video gaming market.</figcaption></figure>
GPUs or Graphics Processing Units are microprocessors that were historically designed for running graphics-based workloads such as rendering 3D models in video games or animated movies like Toy Story or Shrek. Graphics workloads are also massively parallel in nature.
An image on a computer is made up of a series of pixels. In order to generate a coherent image, a traditional single-core CPU has to calculate what colour each pixel should be one-by-one. A modern (1280&#215;1024) laptop screen is made up of 1,310,720 pixels &#8211; that&#8217;s 1.3 million pixels. If we&#8217;re watching a video, which usually runs at 30 frames per second, we&#8217;re looking at nearly 40 million pixels per second that have to be processed. That is a LOT of processing. If we&#8217;re playing a video game, then on top of this your CPU has to deal with all the maths that comes with running around a virtual environment and the behaviours and actions of the in-game characters. You can see how things could quickly add up and your machine grind to a halt.
GPUs, unlike CPUs, are made up of thousands &#8211; that&#8217;s right, not dual or octo but thousands &#8211; of processing cores so that they can do a lot of that pixel rendering in parallel. The below video, which is also hosted on the [NVidia website][5], gives an amusing example of the differences here.
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/-P28LKWTzrI?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>
GPUs trade off speed at handling sequential functions for their massively parallel nature. Back to the cake analogy: a GPU is more like having 10 thousand human chefs versus a CPU which is like having 2 to 8 cake-factory conveyor machines. This is why traditional CPUs remain relevant for running traditional workloads today.
## GPUs and Neural Networks
In the same way that thousands of cores in a GPU can be leveraged to render an image by rendering all of the pixels at the same time, a GPU can also be used to simulate a very large number of neurons in a neural network at the same time. This is why NVidia et al., formerly famous for rendering cars and tracks in your favourite racing simulation, have moved on to steering real self-driving cars via simulated deep neural networks.
You don&#8217;t always need a GPU to run a Neural Network. When building a model, the training is the computationally expensive bit. This is where we expose the network to thousands of images and change the synapse weights according to whether the network provided the correct answer (e.g. is this a picture of a face? Yes or no?). Once the network has been trained, the weights are frozen and typically the throughput of images is a lot lower. Therefore, it can sometimes be feasible to train your neural network on more expensive GPU hardware and then query or run it on cheaper commodity CPUs. Again, this all depends on the amount of usage that your model is going to be getting.
## Final Thoughts
In a world where machine learning and artificial intelligence software are transforming the way we use computers, the underlying hardware is also shifting. In order to stay relevant, organisations must understand the difference between CPU and GPU workloads and as they integrate machine learning and AI into their businesses, they need to make sure that they have the right hardware available to run these tasks effectively.
[1]: http://www.nvidia.com/object/machine-learning.html
[2]: https://software.intel.com/en-us/ai-academy/training
[3]: https://medium.com/intuitionmachine/building-a-50-teraflops-amd-vega-deep-learning-box-for-under-3k-ebdd60d4a93c
[4]: https://www.youtube.com/watch?time_continue=45&v=hfjHJneVonE
[5]: http://www.nvidia.com/object/what-is-gpu-computing.html

View File

@ -0,0 +1,71 @@
---
title: Spacy, Spark and BIG NLP
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=212
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";N;}'
categories:
- Uncategorized
---
Recently I have been working on a project that involves trawling full text newspaper articles from the JISC UK Web Domain Dataset &#8211; covering all websites with a .uk domain suffix from 1996 to 2013. As you can imagine, this task is pretty gargantuan and the archives themselves are over 27 Terabytes in size (that&#8217;s enough space to store 5000 high definition movies).
I&#8217;ll be writing more about my work with the JISC dataset another time. This article focuses on getting started with Apache Spark and Spacy which has the potential to be a bit of a pain.
## Installing Spark + Hadoop
Installing Spark + Hadoop is actually relatively easy. Apache ship tarballs for [Windows, Mac and Linux][1] which you can simply download and extract (on Mac and Linux I recommend extracting to /usr/local/spark as a sensible home).
You&#8217;ll need Java and although Spark seems to ship with Python (in the bin folder you&#8217;ll find a script called pyspark which launches Spark with a Python 2.7 session and a SparkContext object already set up), I tend to use standalone Python and findspark, which I&#8217;ll explain now.
## FindSpark
[findspark][2] is a Python library for finding a Spark installation and adding it to your PYTHONPATH at runtime. This means you can use your existing Python install with a newly created Spark setup without any faff.
Firstly run
<pre>pip install findspark</pre>
Then you&#8217;ll want to export the SPARK_HOME environment variable so that findspark knows where to look for the libraries (if you don&#8217;t do this, you&#8217;ll get an error in your Python session).
<pre>export SPARK_HOME=/usr/local/spark</pre>
Obviously you&#8217;ll want to change this if you&#8217;re working with a Spark install at a different location &#8211; just point it to the root directory of the Spark installation that you unzipped above.
A pro-tip here is to actually add this line to your .bashrc or .profile files so that every time you start a new terminal instance, this information is already available.
## Python and Findspark first steps
If you did the above properly you can now launch python and start your first Spark job.
Try running the following:
<pre lang="python">import sys
from operator import add
from random import random

import findspark
findspark.init()

import pyspark  # if the above didn't work then you'll get an error here
from pyspark.sql import SQLContext


if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    sc = pyspark.SparkContext()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        # sample a random point in the unit square and check if it falls inside the unit circle
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()
</pre>
[1]: https://spark.apache.org/downloads.html
[2]: https://pypi.python.org/pypi/findspark

View File

@ -0,0 +1,110 @@
---
title: How I became a gopher over christmas
author: James
type: post
date: 2018-01-27T10:09:34+00:00
url: /2018/01/27/how-i-became-a-gopher/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"452cd617afb4";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:95:"https://medium.com/@jamesravey/how-i-became-a-gopher-and-learned-myself-an-angular-452cd617afb4";}'
categories:
- Uncategorized
tags:
- chatbots
- filament
- go
---
Happy new year to one and all. It&#8217;s been a while since I posted and life continues onwards at a crazy pace. I meant to publish this post just after Christmas but have only found time to sit down and write now.
If anyone is wondering what&#8217;s with the crazy title &#8211; a gopher is someone who practices the Go programming language (just as those who write in Python refer to themselves as pythonistas. There&#8217;s an interesting list of labels that programmers self-assign [here][1] if you&#8217;re interested).
Over Christmas I decided I was going to have a break from working as a [CTO][2] and a [PhD student][3] in order to do something relaxing. That&#8217;s why I thought I&#8217;d teach myself a new programming language. I also taught myself how to use Angular.js too but I&#8217;ll probably write about that separately.
## Go Go Gadget&#8230; keyboard?
First on my list is the Go programming language. I decided that I would spend 2 weeks over xmas (in between playing with my Nintendo Switch and of course spending time with my lovely fiancee and our families) building something useful and practical in the language because if there&#8217;s one thing I can&#8217;t stand it&#8217;s trying to learn to use a programming language by writing yet another [todo list][4].
<figure id="attachment_233" aria-describedby="caption-attachment-233" style="width: 221px" class="wp-caption alignright"><img loading="lazy" class="wp-image-233 size-medium" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Screenshot-from-2018-01-27-09-17-40.png?resize=221%2C300&#038;ssl=1" alt="" width="221" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Screenshot-from-2018-01-27-09-17-40.png?resize=221%2C300&ssl=1 221w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Screenshot-from-2018-01-27-09-17-40.png?w=622&ssl=1 622w" sizes="(max-width: 221px) 100vw, 221px" data-recalc-dims="1" /><figcaption id="caption-attachment-233" class="wp-caption-text">Chatty Cathy &#8211; she&#8217;s still very much a work in progress</figcaption></figure>
At Filament, we are fast becoming one of the UK&#8217;s leading chat-bot providers, working with brands like T-Mobile and Hiscox insurance. We have an excellent team of chat-bot builders run by our very own Mr Chat-bot, [Rory][5] and supported by  a very stringent but also very manual QA process. I decided I was going to try and help the team work smarter by building a PoC chatbot testing framework.
The general gist of my tool, named Chatty Cathy, is that I can &#8220;record&#8221; a conversation with a bot, go make some changes to the intents and flows and then &#8220;playback&#8221; my conversation to make sure the bot still responds in the way I&#8217;d like.
## Why Go?
<figure id="attachment_234" aria-describedby="caption-attachment-234" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="wp-image-234 size-medium" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Gopher-2400px.png?resize=300%2C168&#038;ssl=1" alt="" width="300" height="168" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Gopher-2400px.png?resize=300%2C168&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Gopher-2400px.png?resize=768%2C430&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Gopher-2400px.png?resize=1024%2C574&ssl=1 1024w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Gopher-2400px.png?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/01/Gopher-2400px.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-234" class="wp-caption-text">Gophers are pretty cool!</figcaption></figure>
If you hadn&#8217;t gathered, I&#8217;m into performance-sensitive compute applications: AI, statistical modelling, machine learning, natural language processing. All of these things need a lot of juice. We&#8217;re also in the business of writing large-scale web applications that serve these kinds of machine learning models to huge numbers of users. I love Python and node.js as much as the next man but most machine learning applications that have interfaces in those languages are written in something low level (C or C++) and bound to these higher level languages for ease of use. I know that Go is still higher level than C or something like Rust* but it outperforms all of the interpreted languages and Java in the [Benchmark Games][6] and is very easy to learn (see below). It is this trade-off &#8211; relatively high performance versus ease of use &#8211; that has me so excited.
*Incidentally I spent some time last year working in Rust and even though I loved it, found it very hard going  &#8211; I&#8217;ve been following [Julia Evans&#8217; Ruby Profiler][7] project and it&#8217;s got me excited about Rust again so maybe I&#8217;ll dive back in some day soon.
### Pros of Working with Go
<ul>
<li>
Go-lang is like a pared down, simplistic dialect of C. You could also think about it as a much simplified Java spin off.
</li>
<li>
The build tool is amazingly simple to use. Everything is done based on filenames and locations. No more over-engineered pom.xml or package.json files.
</li>
<li>
Like Python and node.js, Go has a huge number of &#8220;batteries included&#8221; modules and functions available as part of its standard library. Web applications and JSON are first class citizens in Go-lang and you can build an app that serves up serialized json in ~30 lines of code. (compare that to the 300 lines of code you need just to define your POJOs and your DTOs in Java and you can already see why this is so awesome).
</li>
<li>
Also like <a href="https://pypi.python.org/pypi">Python</a> and <a href="https://www.npmjs.com/">node.js</a> Go has a brilliant ecosystem. One of the perks of coming into existence in the golden age of the internet I guess. Go&#8217;s build tool can automatically grab other projects from git repositories without needing a centralised server like pip/npm/maven which in my eyes makes it even more awesome (no more <a href="https://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/">left pad chaos</a>).
</li>
<li>
Unit testing is also a core part of the language and build tool. Simply adding some files to your library with the suffix _test.go and running <em>go test</em> is all you need.
</li>
<li>
Go&#8217;s goroutine interface for parallelism is a super exciting feature. Again, having been designed very recently in the age of multi-core processors, the language treats multi-core processing as a first class citizen. Compared to Python which is held back from multi-processing bliss by the<a href="https://wiki.python.org/moin/GlobalInterpreterLock"> GIL </a>and node.js which uses a<a href="https://medium.com/@stevennatera/multithreading-multiprocessing-and-the-nodejs-event-loop-5b2929bd450b"> single event loop.</a>
</li>
<li>
<strong>Biggest pro: learning curve</strong>. For someone who&#8217;s written in any kind of imperative programming language before and has some level of understanding around pointers and referencing, Go is SOOO easy to pick up. I reckon the java developers at our company could be effective in Go within a week or less. (If you&#8217;re reading Rob, Max, Alex, that&#8217;s a challenge). It truly is like Java without the bulls***!
</li>
</ul>
### Cons of working with Go
* I still find workspaces kind of a strange concept. Instead of checking out one git project and popping it in your $HOME/workspace directory (yes this is a hangover from years of Eclipse development when I was back at big blue), code is stored hierarchically based on where you got it from. For example my workspace looks a bit like this:
<pre>go
└── src
├── github.com
│ ├── author1
│ │ └── project1
│ │ ├── LICENSE.txt
│ │ ├── README.md
│ │ ├── somefile.go
│ │ ├── more.go
│ ├── author2
│ │ └── project2
│ │ ├── LICENSE.txt
│ │ ├── README.md
│ │ ├── somefile.go
│ │ ├── more.go</pre>
Basically, each project gets stored somewhere hierarchically in the tree and all dependencies of your project end up somewhere in here. This can make it a little bit confusing when you&#8217;re working on really large projects but then perhaps this is no more confusing than a python virtualenv or the node_modules directory of a mature node.js app and I&#8217;m being silly?
* Some of the libraries are still a little bit immature. I don&#8217;t mean to be disparaging in any way towards the authors of these open source libraries that are doing a fantastic job for absolutely zero payment. However, a lot of the java tooling we use at work has been in development for 2 (in some cases nearly 3) decades and a lot of the crazier use cases I want to try and do are supported out of the box.
* I&#8217;m still not a massive fan of the way that [struct tag][8] syntax works. Again, perhaps this is because I am a programming luddite but I definitely prefer Java Annotations and Python Decorators a lot more.
### Summary
Go is a really exciting language for me personally but also for many of us still plagued by nightmarish visions of java boilerplate in our sleep. It has a fantastic ecosystem and first-class support for parallel programming and building web services via a really nice set of REST server libraries and JSON serialization libraries that are already built in. There are a few small issues that probably still need addressing but I&#8217;m sure they will come out in the wash.
The most exciting thing for me is how well Go slots into the toolbox of any competent imperative language programmer (e.g. C, C++, Java, Python, Javascript, C#, PHP etc.). Whichever language you come from there are likely to be a few minor changes to get used to but the syntax and concepts are familiar enough that you can be up and running in no time at all!
Is Go ready to be used in prime-time production-ready systems? I definitely think so but if you don&#8217;t believe me, why don&#8217;t you ask [the Docker Foundation?][9]
[1]: https://gist.github.com/coolaj86/9256619
[2]: http://filament.uk.com/
[3]: https://www.wisc.warwick.ac.uk/people/student-profiles/2015-intake/james-ravenscroft/
[4]: https://github.com/search?l=JavaScript&q=todo&type=Repositories&utf8=%E2%9C%93
[5]: https://disruptionhub.com/whats-crappy-chatbots/
[6]: http://benchmarksgame.alioth.debian.org/u64q/compare.php?lang=go&lang2=node
[7]: https://jvns.ca/blog/2017/12/02/taking-a-sabbatical-to-work-on-ruby-profiling-tools/
[8]: https://golang.org/ref/spec#Tag
[9]: https://www.slideshare.net/jpetazzo/docker-and-go-why-did-we-decide-to-write-docker-in-go

View File

@ -0,0 +1,13 @@
---
title: Upgrading from legacy ui-router
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=231
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";N;s:2:"id";N;s:21:"follower_notification";N;s:7:"license";N;s:14:"publication_id";N;s:6:"status";N;s:3:"url";N;}'
categories:
- Uncategorized
---

View File

@ -0,0 +1,133 @@
---
title: Re-using machine learning models and the “no free lunch” theorem
author: James
type: post
date: 2018-03-21T11:26:27+00:00
url: /2018/03/21/re-using-machine-learning-models-and-the-no-free-lunch-theorem/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"dd5196577b34";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:106:"https://medium.com/@jamesravey/re-using-machine-learning-models-and-the-no-free-lunch-theorem-dd5196577b34";}'
categories:
- Uncategorized
---
## Why re-use machine learning models?
<div>
<p>
<figure id="attachment_246" aria-describedby="caption-attachment-246" style="width: 302px" class="wp-caption alignleft"><img loading="lazy" class="wp-image-246 " src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/egore911-recycling-2400px.png?resize=302%2C320&#038;ssl=1" alt="" width="302" height="320" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/egore911-recycling-2400px.png?resize=283%2C300&ssl=1 283w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/egore911-recycling-2400px.png?resize=768%2C813&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/egore911-recycling-2400px.png?resize=967%2C1024&ssl=1 967w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/egore911-recycling-2400px.png?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/egore911-recycling-2400px.png?w=1980&ssl=1 1980w" sizes="(max-width: 302px) 100vw, 302px" data-recalc-dims="1" /><figcaption id="caption-attachment-246" class="wp-caption-text">Model re-use can be a huge cost saver when developing AI systems. But how well will your models perform in their new environment?</figcaption></figure>
</p>
<p>
You can get a lot of value out of training a machine learning model to solve a single use case, like predicting emotion in your customer chatbot transcripts and putting the angry ones through to real humans. However, you might be able to extract even more value out of your model by using it in more than one use case. You could use an emotion model to prioritise customer chat sessions but also to help monitor incoming email inquiries and social media channels too. A model can often be deployed across multiple channels and use cases with significantly less effort than creating a new, complete training set for each problem your business encounters. However, there are some caveats that you should be aware of &#8211; in particular, the “No Free Lunch Theorem”, which is concerned with the theoretical drawbacks of deploying a model across multiple use cases.
</p>
</div>
## No free lunch? What has food got to do with it?
<div>
Anyone who’s familiar with Artificial Intelligence and Machine Learning has probably come across the “No Free Lunch Theorem”.
</div>
<div>
<p>
<figure id="attachment_247" aria-describedby="caption-attachment-247" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="wp-image-247 size-medium" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Gerald-G-Fast-Food-Lunch-Dinner-FF-Menu-6-2400px.png?resize=300%2C260&#038;ssl=1" alt="" width="300" height="260" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Gerald-G-Fast-Food-Lunch-Dinner-FF-Menu-6-2400px.png?resize=300%2C260&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Gerald-G-Fast-Food-Lunch-Dinner-FF-Menu-6-2400px.png?resize=768%2C667&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Gerald-G-Fast-Food-Lunch-Dinner-FF-Menu-6-2400px.png?resize=1024%2C889&ssl=1 1024w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Gerald-G-Fast-Food-Lunch-Dinner-FF-Menu-6-2400px.png?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Gerald-G-Fast-Food-Lunch-Dinner-FF-Menu-6-2400px.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-247" class="wp-caption-text">You can&#8217;t get something for nothing but maybe you can get a massive discount!</figcaption></figure>
</p>
</div>
<div>
The theorem, posited by <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.390.9412&rep=rep1&type=pdf">David Wolpert in 1996</a>, is based upon the adage “there’s no such thing as a free lunch”, referring to the idea that it is unusual or even impossible to get something for nothing in life. Wolpert’s theorem states that no one machine learning model can be best for all problems. Another way of thinking about this is that there are no “silver bullets” in machine learning. A lot of the time when Filament consult on an ML problem we have to try a number of approaches depending on the type and quality of data we’re given to train the system as well as the sorts of output that our client is expecting. In traditional programming you can write an algorithm once and (assuming it is bug free) apply it to any number of new use cases. Hypothetically speaking, a good developer could build a bank accounting tool and deploy it for a bank in the UK and a bank in China with minimal changes. This is not necessarily the case in machine learning.
</div>
<div>
So what are the deeper implications of this theorem on modern machine learning use cases? According to Wolpert, there can be no guarantee that an algorithm trained on dataset A (10,000 product reviews on Amazon labelled with positive/negative sentiment) will work well on dataset B (1,000 social media posts that mention your company, also labelled with positive/negative sentiment). You could find that it works really well or you could find that it’s a flop. There’s just no theoretical way to know! That’s scary, right? Well in practice, with sensible reasoning, evaluation and by using approaches to adapt models between use cases (known as domain adaptation), maybe it’s not all that bad…
</div>
## So No Free Lunch means I can’t reuse my models?
<div>
<p>
<figure id="attachment_248" aria-describedby="caption-attachment-248" style="width: 137px" class="wp-caption alignright"><img loading="lazy" class="wp-image-248 size-medium" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Openclipart-could-be-so-useful-for-Africa-2016092547-2400px.png?resize=137%2C300&#038;ssl=1" alt="" width="137" height="300" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Openclipart-could-be-so-useful-for-Africa-2016092547-2400px.png?resize=137%2C300&ssl=1 137w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Openclipart-could-be-so-useful-for-Africa-2016092547-2400px.png?resize=768%2C1682&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Openclipart-could-be-so-useful-for-Africa-2016092547-2400px.png?resize=468%2C1024&ssl=1 468w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/Openclipart-could-be-so-useful-for-Africa-2016092547-2400px.png?w=1096&ssl=1 1096w" sizes="(max-width: 137px) 100vw, 137px" data-recalc-dims="1" /><figcaption id="caption-attachment-248" class="wp-caption-text">A doctor knows how to operate on people but would need additional training to be able to work with animals.</figcaption></figure>
</p>
<p>
Not exactly. The theorem says that there’s no correlation between your model’s performance in its intended environment and its performance in a completely new environment. However, it doesn’t rule out the possibility of there being correlations if we know the nature of the new problems and data. You can think about it in terms of human expertise and specialisation. Humans learn to specialise as they work their way through the education system. A medical doctor and a veterinarian both go through extensive training in order to be able to carry out medical procedures on humans and animals respectively. A veterinarian might be excellent at operating on different species of animals. However, a veterinarian would not be able to operate on an injured human to the same level of professionalism as a medical doctor without some additional training.
</p>
</div>
<div>
Intuitively, machines operate in a similar way. A model trained on sentiment in the context of customer inquiries will learn the vocabulary that customers use to describe how happy they are with the service they received from your company. Therefore it follows that you could deploy this model across a number of channels where customers are likely to communicate with your company. On the other hand, the language that a film critic uses when reviewing a movie is quite different to that of a customer complaining about their shopping experience. This means that your customer inquiries model probably shouldn’t be re-used for sentiment analysis in product reviews without some evaluation and potentially some domain adaptation. A model trained to detect a person’s age from high-definition digital images might work well running in a phone app where the phone has a suitably high-quality camera but may struggle with low-quality images from webcams.
</div>
## How do I know whether my model will work well for a new problem?
<div>
The first step in understanding the suitability of a model to a new domain is to try and understand how well the model performs on the new domain without any modification. We can build a small (but not too small) ground truth for our new domain and see how well the old model does at predicting the correct labels.
</div>
<div>
A good way to accelerate this process is to run your model on the new data and have humans mark the results like a teacher scores test results. As the number of possible labels in your model increases, the mental load of annotation increases. Asking the human “is this correct?” is a much easier task than making them label the data from scratch. Explosion.ai have some really good thoughts on data collection that <a href="https://explosion.ai/blog/supervised-learning-data-collection">suggest a similar viewpoint.</a>
</div>
<div>
Once we have a dataset in the new domain we can measure accuracy, recall and precision and make a judgement on whether our old model is performing well enough that we want to simply deploy it into the new environment without modification. Of course, if we do choose to go down this path, it is still good practice to monitor your model’s performance over time and retrain it periodically on examples where classification failed.
</div>
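To make that concrete, here’s a minimal sketch using scikit-learn (the example texts, labels and the stand-in model are all placeholders; in practice the “old model” is whatever classifier you have already trained):

<pre lang="python">from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

# stand-in for the model that was trained on the original domain
old_domain_texts = ["the phone is slow and unresponsive", "great battery, really happy with it"]
old_domain_labels = ["negative", "positive"]
old_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
old_model.fit(old_domain_texts, old_domain_labels)

# small, human-annotated ground truth from the *new* domain
new_domain_texts = ["the film was dull and slow-moving", "loved every minute of it"]
gold_labels = ["negative", "positive"]

# score the old model against the new-domain ground truth
predicted = old_model.predict(new_domain_texts)
accuracy = accuracy_score(gold_labels, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(gold_labels, predicted, average="macro")
print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f" % (accuracy, precision, recall, f1))
</pre>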
<div>
If the model didn’t do well, we’ve got another trick up our sleeve: domain adaptation.
</div>
## What can domain adaptation do for model reuse?
<div>
<p>
<figure id="attachment_249" aria-describedby="caption-attachment-249" style="width: 160px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-249" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/big-rocket-blast-off-fat-2400px.png?resize=160%2C300&#038;ssl=1" alt="" width="160" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/big-rocket-blast-off-fat-2400px.png?resize=160%2C300&ssl=1 160w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/big-rocket-blast-off-fat-2400px.png?resize=768%2C1445&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/big-rocket-blast-off-fat-2400px.png?resize=544%2C1024&ssl=1 544w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/03/big-rocket-blast-off-fat-2400px.png?w=1276&ssl=1 1276w" sizes="(max-width: 160px) 100vw, 160px" data-recalc-dims="1" /><figcaption id="caption-attachment-249" class="wp-caption-text">A communication block between teams working in imperial and metric measurements caused an expensive accident. Sometimes when the science is right but the language is wrong, all you need is a common dialect.</figcaption></figure>
</p>
<p>
On September 23rd, 1999, a $125 million Mars lander <a href="https://www.wired.com/2010/11/1110mars-climate-observer-report/">burned up in orbit</a> around the red planet, much to the dismay of the incredibly talented engineering teams who worked on it. The theory was sound. Diligent checks had been carried out. So why, then, had the mission gone up in flames? A review panel meeting showed that a miscommunication between teams working on the project was the cause. One set of engineers expressed force in pounds, the other team preferred to use newtons. Neither group was wrong but this small miscommunication had everything grinding to a halt.</div>
<div>
The point here is that the machine learning model that you trained for sentiment in one environment may not be that far off. It might just need “tweaking” to make it work well for the new domain. This exercise, known as “domain adaptation”, is about mapping features (words) that help the classifier understand sentiment in one domain onto features (words) that help it to understand the other domain. For example, a model trained on reviews for mobile phones might learn to use negative words like “unresponsive, slow, outdated” but reviews for movies might use negative words like “cliched, dull, slow-moving”. This mapping of features is not an exact science but good mappings can be found using approaches like that of <a href="http://www.icml-2011.org/papers/342_icmlpaper.pdf">Glorot, Bordes and Bengio (2011).</a>
</div>
## Conclusion
<div>
Building machine learning models is a valuable but time-consuming activity. It makes sense to build and reuse models where possible. However, it is important to be aware of the fact that models can become ultra-specialised at the task that they’re trained on and that some adaptation may be required to get them working in new environments. We have given some tips for evaluating a model’s performance on new tasks as well as some guidance on when re-using a model in a new environment is appropriate. We have also introduced the concept of domain adaptation, sometimes referred to as “transfer learning”, which allows your existing model to learn the “language” of the new problem space.
</div>
<div>
The No Free Lunch theorem can sound pretty scary when it comes to model reuse. We can’t guarantee that a model will work in a new space given what we know about its performance on an existing problem. However, using some of the skills and techniques discussed above, you can have a certain level of confidence that a model will work on a new problem and you can always “domain adapt” to make it better. Ultimately the proof that these techniques are effective lies with suppliers like IBM, Microsoft and Google. These titans of tech have developed widely used and respected general models for families of problems that are relevant across many sectors for NLP, Voice Recognition and Image Processing. Some of them are static, some trainable (although trainable often means domain-adaptation ready) but most of them work very well with little to no modification on a wide range of use cases. The most important thing is to do some due diligence around checking that they work for you and your business.
</div>

View File

@ -0,0 +1,16 @@
---
title: HarriGT and news coverage of science
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=255
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Uncategorized
---
A major theme of my PhD is around how scientific work is portrayed in the media. News articles that report on scientific papers serve a number of purposes for the research community. Firstly, they broadcast academic work to a much wider audience.
A scientific paper&#8217;s purpose is to be read and understood by scientists, engineers and other specialists who are interested in reproducing, rebutting or building on top of the work (or heck, maybe they&#8217;re just curious and have a spare half hour). News articles are supposed to inform and entertain (a cynic might place the latter before the former) the general public with regard to current affairs. This difference in purpose and target audience can lead to news articles and scientific papers that refer to the same study but use very different vocabularies and writing styles.

View File

@ -0,0 +1,34 @@
---
title: 'Part time PhD: Mini-Sabbaticals'
author: James
type: post
date: 2018-04-05T13:08:51+00:00
url: /2018/04/05/phd-mini-sabbaticals/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"78e62379c12b";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:74:"https://medium.com/@jamesravey/part-time-phd-mini-sabbaticals-78e62379c12b";}'
categories:
- PhD
tags:
- phd
- productivity
- sabbatical
---
Avid readers amongst you will know that I&#8217;m currently in the third year of my PhD in Computational Linguistics at the University of Warwick  whilst also serving as CTO at [Filament][1]. An incredibly exciting pair of positions that certainly have their challenges and would be untenable without an incredibly supportive set of PhD supervisors ([Amanda Clare][2] and [Maria Liakata][3]) and an equally supportive and understanding pair of company directors ([Phil and Doug][4]). Of course I have to shout out to my fiancee Amy who also puts up with a lot when I&#8217;m stressed out or I have to work weekends.
Until recently, I&#8217;d been working 3 days a week at Filament and 2 days a week (plus weekends where necessary) on my PhD. However I found that the context switching back and forth between my PhD and work life was incredibly disruptive and I was wasting a huge amount of time simply switching between projects. There&#8217;s a really good [article on human context switching][5] by Joel Spolsky that explains the harm it can cause.
Just after Christmas, I had an idea (in no small part inspired by Julia Evans&#8217; [rust sabbatical][6]). What if I could minimise context switching but maintain the same ratio of PhD to work time? My plan was simple: work for Filament for 4 days a week, 4 weeks at a time. This still gives me at least a day a week (sometimes I PhD on Saturdays too) to focus on smaller tasks, read papers, reply to emails and usually get half a day of productive coding done. Then there&#8217;s the clever bit: every 5th week, I take a mini-sabbatical from Filament. I block out my calendar, turn on my email auto responder and mute Slack. This means I can be super productive for a whole week and really get to grips with the more complex challenges of my PhD that take more than 1 or 2 days of focus. From Filament&#8217;s point of view, this arrangement is better too. Instead of lots of sporadic, short-term absences, everyone knows what my schedule is well in advance and we can plan around it.
I was incredibly grateful to Phil and Doug who welcomed this idea and let me trial it for the first time at the end of February when I was putting together a submission to ACL 2018 (paper&#8217;s in review, fingers crossed). The trial was a success as far as both parties were concerned and today I&#8217;m 4 days into my 2nd mini-sabbatical, having spent 3 very productive days on part of my current mini-project (details will be revealed soon) and today getting organised and starting to figure out where I need to go next.
It&#8217;s difficult to explain just how productive these mini-sabbs are compared to ad-hoc days off every week. I&#8217;ve decided that I&#8217;m going to start writing a blog summary each time I have a mini-sabb to remind myself just how much I can get done in a week of solid, focused PhD time.
If you&#8217;re already doing a part-time PhD or considering it, finding the perfect balance between work, play and study is a tricky task. If you&#8217;re in a fortunate enough position to have a great working relationship with your company and your PhD support staff (and make no mistake, I am aware how incredibly lucky I am in this respect), you might want to think about whether there&#8217;s a more efficient way to be splitting your time and whether it&#8217;s feasible to give it a go!
[1]: https://filament.ai/
[2]: https://www.aber.ac.uk/en/cs/staff-list/staff_profiles/?login=afc
[3]: https://warwick.ac.uk/fac/sci/dcs/people/maria_liakata/
[4]: https://filament.ai/about-us/
[5]: https://www.joelonsoftware.com/2001/02/12/human-task-switches-considered-harmful/
[6]: https://jvns.ca/blog/2017/12/02/taking-a-sabbatical-to-work-on-ruby-profiling-tools/

View File

@ -0,0 +1,31 @@
---
title: Programmatically Downloading Open Access Papers
author: James
type: post
date: 2018-04-13T16:04:47+00:00
url: /2018/04/13/programmatically-downloading-open-access-papers/
featured_image: /wp-content/uploads/2018/04/6216334720_54e29fc13c_o-825x510.jpg
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"9cbbb57ab932";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:91:"https://medium.com/@jamesravey/programmatically-downloading-open-access-papers-9cbbb57ab932";}'
categories:
- Open Source
- PhD
tags:
- open access
- scientific papers
- unpaywall
---
_<a href="https://www.flickr.com/photos/seanhobson/6216334720/in/photolist-atjkJQ-QuYgDA-cb9bGo-4o84DP-9GAeQ5-5dopRY-hyQV19-ngTMst-4rRwgg-qQr5Sy-e4XhCg-mQJpZ-6ttPLT-6zQxh2-dsE6bM-qQcUxd-6msKYB-4HRo5J-8W2ryV-4B5rRC-xj9C8-2V5HKa-7zS5wE-Ldsdy-bwMFxR-nibhxt-5mKLS5-5m2URM-7CsC9C-4nJ5jt-a4mQik-6GPYgf-cb9c8s-363XxR-8R4jGd-4qHxrv-T4A8wx-T1NyJG-4tR45P-f5bde-4tV62J-cDEZ9L-Te2m9S-NLeKd-orGJh5-4j53Za-T4Abnn-fqPY88-T1NwPE-7deVVp" target="_blank" rel="noopener">(Cover image &#8220;Unlocked&#8221; by Sean Hobson)</a>_
If you&#8217;re an academic or you&#8217;ve got an interest in reading scientific papers, you&#8217;ve probably run into paywalls that demand tens or even hundreds of £ just to read a scientific paper. It&#8217;s OK if you&#8217;re affiliated with a university that has access to that journal, but it can sometimes be the luck of the draw as to whether your institute has access, and even if it does, sometimes the SAML login processes don&#8217;t work and you still can&#8217;t see the paper. Thankfully, the guys at [Unpaywall][1] (actually built by [Impact Story][2]) have been doing a fantastic job of making open access papers much more easily available to interested academics in the browser. If you end up at a publisher paywall and Unpaywall know about a legitimate free copy of the paper you&#8217;re trying to read, they&#8217;ll link you straight to it for direct download. Problem solved.
For me, as someone interested in text mining on large volumes of scientific papers, getting hold of high quality, peer reviewed open access papers that I can analyse can be a pain. I previously wrote about [downloading batches of papers from PLOS One][3] for data mining purposes but I&#8217;m currently interested in downloading papers that get mentioned and linked to in the news and although that can sometimes include PLOS journals, it also includes many other publishers, both open access and closed. Thankfully, Unpaywall come to the rescue again.
Unpaywall.org provide a free API that takes in a DOI and spits out any and all known free versions of that paper. That makes my life a lot easier: all I have to do is find a long list of DOIs that I&#8217;m interested in analysing and run them through the API.
I&#8217;ve provided a gist of the python function I&#8217;ve written that wraps this API. I&#8217;ve been using it in a Jupyter notebook (which I&#8217;m not ready to publish just yet). Feel free to use it in your project. It might save you an hour or two of development time (it took me a while to work out what errors I needed to try and catch).
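For the impatient, here&#8217;s a rough sketch of what that kind of wrapper looks like. It assumes the Unpaywall v2 REST endpoint and the `requests` library, and the response field names are from my reading of their docs, so double check them before relying on this:

<pre lang="python">import requests


def get_open_access_url(doi, email="you@example.com"):
    """Ask the Unpaywall v2 API whether there's a free, legal copy of a paper.

    Returns a direct URL (PDF if available) or None. The email parameter is
    required by the API so that they can get in touch with heavy users.
    """
    try:
        resp = requests.get(
            "https://api.unpaywall.org/v2/{}".format(doi),
            params={"email": email},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
    except (requests.RequestException, ValueError):
        # network problems, 404s for unknown DOIs, or a response that isn't JSON
        return None

    location = data.get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")
</pre>

Feed it a list of DOIs in a loop and you&#8217;ll get back either a download link or None for each one.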
[1]: http://unpaywall.org/
[2]: http://impactstory.org/
[3]: https://papro.org.uk/2013/02/26/plosget-py/

View File

@ -0,0 +1,65 @@
---
title: GPUs are not just for images any more…
author: James
type: post
date: 2018-05-13T07:26:12+00:00
url: /2018/05/13/gpus-are-not-just-for-images-any-more/
featured_image: /wp-content/uploads/2018/05/Video-card-825x510.png
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"9ce53222d3c0";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"6fc55de34f53";s:6:"status";s:6:"public";s:3:"url";s:81:"https://medium.com/@jamesravey/gpus-are-not-just-for-images-any-more-9ce53222d3c0";}'
categories:
- Uncategorized
tags:
- gpu
- machine learning
---
As a machine learning professional specialising in computational linguistics (helping machines to extract meaning from human text), I have confused people on multiple occasions by suggesting that their document processing problem could be solved by neural networks trained using a Graphics Processing Unit (GPU). You&#8217;d be well within your rights to be confused. To the uninitiated what I just said was &#8220;Let&#8217;s solve this problem involving reading lots of text by building a system that runs on specialised computer chips designed specifically to render images at high speed&#8221;.
### _**&#8220;In the age of the neural network, Graphics Processing Unit (GPU) is one of the biggest misnomers of our time.&#8221;**_
Well it turns out that GPUs are good for more than playing Doom in high definition or rendering the latest Pixar movie. GPUs are great for doing maths. As it happens, they&#8217;re great for the kind of maths needed for training neural networks and [other kinds][1] of [machine learning][2] [models][3]. So what I&#8217;m trying to say here is that in the age of the neural network, Graphics Processing Unit is one of the biggest misnomers of our time. Really they should be called &#8220;Tensor-based linear algebra acceleration unit&#8221; or something like that (this is probably why I&#8217;m a data scientist and not a marketer).
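To make that a little more concrete, here&#8217;s a tiny illustration (assuming you have PyTorch installed and a CUDA-capable card &#8211; any tensor library with GPU support would do) of exactly the same piece of linear algebra running on a CPU and then on a GPU:

<pre lang="python">import torch

# two largish matrices of random numbers
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# on the CPU the multiply is crunched by a handful of general purpose cores
c_cpu = a @ b

# on the GPU exactly the same maths is spread across thousands of tiny cores
if torch.cuda.is_available():
    c_gpu = a.cuda() @ b.cuda()
    print(c_gpu.device)  # e.g. cuda:0
</pre>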
## Where did GPUs come from?
One of the earliest known uses of the term GPU is from a 1986 book called &#8220;Advances in Computer Graphics&#8221;. Originally, GPUs were designed to speed up the process of rendering computer games to the user&#8217;s display. Traditional processor chips used for running your computer&#8217;s operating system and applications process one instruction at a time in sequence. Digital images are made up of thousands or millions of pixels in a grid format. Traditional CPUs have to render images by running calculations on each pixel, one at a time: row by row, column by column. GPUs accelerate this process by building an image in parallel. The video below explains the key difference quite well:
<div class="jetpack-video-wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/-P28LKWTzrI?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div>
So I know what you&#8217;re thinking&#8230; &#8220;If GPUs are so magical and can do all this cool stuff in parallel, why don&#8217;t we just use them all the time instead of CPUs?&#8221; &#8211; am I right?
Well here&#8217;s the thing&#8230; GPUs are specialised for high-speed maths and CPUs are generalised for many tasks. That means that GPUs are actually pretty rubbish at a lot of things CPUs are good at &#8211; they&#8217;ve traded flexibility for speed. Let me try and explain with a metaphor.
## The Patisserie Chef and the Cake Factory
Let&#8217;s step away from the computer for a second and take a few moments to think about a subject very close to my heart&#8230; food.
&nbsp;
<img loading="lazy" class=" wp-image-273 alignright" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/chefs-hat.png?resize=187%2C229&#038;ssl=1" alt="" width="187" height="229" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/chefs-hat.png?resize=245%2C300&ssl=1 245w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/chefs-hat.png?resize=768%2C939&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/chefs-hat.png?resize=838%2C1024&ssl=1 838w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/chefs-hat.png?w=1880&ssl=1 1880w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/chefs-hat.png?w=1320&ssl=1 1320w" sizes="(max-width: 187px) 100vw, 187px" data-recalc-dims="1" />
A patisserie chef is highly efficient at making yummy cakes and pastries to delight their customers. They can only really pay attention to one cake at a time but they can switch between tasks. For example, when their meringue is in the oven they can focus on icing a cake they left to cool earlier. Trained chefs are typically very flexible and talented and they can make many different recipes &#8211; switching between tasks when they get time.
&nbsp;
<img loading="lazy" class="size-medium wp-image-271 alignleft" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Anonymous-Factory.png?resize=300%2C246&#038;ssl=1" alt="" width="300" height="246" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Anonymous-Factory.png?resize=300%2C246&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Anonymous-Factory.png?resize=768%2C631&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Anonymous-Factory.png?resize=1024%2C841&ssl=1 1024w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Anonymous-Factory.png?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Anonymous-Factory.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" />
At some point in history, [Mr Kipling][4] and [those guys who make Twinkies][5] got so many orders that human bakers would never be able to keep up with demand. They had to add cake machines. A factory contains machines that spit out thousands of cakes in parallel. The key difference is that these machines are not flexible in the way that a human chef would be. What if we&#8217;re churning out a batch of 10,000 [French Fancies][6] when we get a call to stop production and make [Country Slices][7] instead? Imagine how long it would take to go around and stop all the machines, put the new ingredients in and then start the process again! A human chef could just throw out the contents of their oven and get started on the new order right away! The factory probably can&#8217;t even handle doing lots of different jobs. I bet they have different machines for the different products or at the very least have to significantly alter the production line. In contrast, the patisserie chef can just change what they do with their hands! _**By the way, this post is not in any way sponsored by Hostess or Mr Kipling. I just like cake.**_
Did you spot the metaphor here? The slower but more flexible chef is a CPU &#8211; plodding along one order at a time, switching when they get some availability. The cake factory is a GPU &#8211; designed to churn out thousands of similar things as quickly as possible at the cost of flexibility. This is why GPUs aren&#8217;t a one-size-fits-all solution for all of our computing needs.
## Conclusion
Like I said earlier, GPUs are great at maths. They can be employed to draw really pretty pictures but they can also be used for all sorts of real-world mathematical operations where you need to run the same calculations on large batches of data. Training a neural network uses a lot of these streamlined mathematical operations regardless of whether they are trained to [detect cats][8], play Go or [detect nouns, verbs and adjectives in text][9]. Using a trained neural network to make predictions is less computationally expensive but you might still benefit from running it on a GPU if you are trying to make a lot of predictions very quickly!
[1]: https://devblogs.nvidia.com/gradient-boosting-decision-trees-xgboost-cuda/
[2]: https://github.com/zeyiwen/thundersvm
[3]: https://github.com/vincentfpgarcia/kNN-CUDA
[4]: http://www.mrkipling.co.uk/
[5]: http://hostesscakes.com/products
[6]: http://www.mrkipling.co.uk/range/favourites/french-fancies
[7]: http://www.mrkipling.co.uk/range/favourites/country-slices
[8]: https://medium.com/@curiousily/tensorflow-for-hackers-part-iii-convolutional-neural-networks-c077618e590b
[9]: https://ai.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

View File

@ -0,0 +1,14 @@
---
title: What next for AI in the UK?
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=287
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";N;s:2:"id";N;s:21:"follower_notification";N;s:7:"license";N;s:14:"publication_id";N;s:6:"status";N;s:3:"url";N;}'
categories:
- Uncategorized
---
In light of the ever-evolving AI landscape globally and within the UK, last year the House of Lords formed a Select Committee appointed to assess the UK&#8217;s ability to support Artificial Intelligence in the near and medium term. They spent the best part of a year collecting evidence and speaking to experts in the field, and some colleagues and I were lucky enough to have some evidence accepted and taken into consideration.

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,227 @@
---
title: 'Dont forget your life jacket: the dangers of diving in deep at the deep end with deep learning'
author: James
type: post
date: 2018-10-18T14:35:05+00:00
url: /2018/10/18/dont-forget-your-life-jacket-the-dangers-of-diving-in-deep-at-the-deep-end-with-deep-learning/
featured_image: /wp-content/uploads/2018/10/livesaver-825x510.png
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"735db0cf9d14";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"6fc55de34f53";s:6:"status";s:6:"public";s:3:"url";s:137:"https://medium.com/@jamesravey/dont-forget-your-life-jacket-the-dangers-of-diving-in-deep-at-the-deep-end-with-deep-learning-735db0cf9d14";}'
categories:
- PhD
- Work
tags:
- deep learning
- filament
- machine learning
- neural networks
---
<div>
<h1>
Deep Learning is a powerful technology but you might want to try some &#8220;shallow&#8221; approaches before you dive in.
</h1>
</div>
<div>
<p>
<figure id="attachment_321" aria-describedby="caption-attachment-321" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="wp-image-321 size-medium" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/nn1.png?resize=300%2C212&#038;ssl=1" alt="" width="300" height="212" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/nn1.png?resize=300%2C212&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/nn1.png?resize=768%2C543&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/nn1.png?resize=1024%2C724&ssl=1 1024w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/nn1.png?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/nn1.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-321" class="wp-caption-text">Neural networks are made up of neurones and synapses</figcaption></figure>
</p>
<p>
It&#8217;s unquestionable that over the last decade, deep learning has changed the machine learning landscape for the better. Deep Neural Networks (DNNs), first popularised by Yann LeCun, Yoshua Bengio and Geoffrey Hinton, are a family of machine learning models that are capable of learning to see and <a href="https://www.tensorflow.org/tutorials/images/image_recognition" target="_blank" rel="noopener" shape="rect">categorise objects</a>, <a href="https://towardsdatascience.com/stock-prediction-with-deep-learning-studio-545c28fddf5" target="_blank" rel="noopener" shape="rect">predict stock market trends</a>, <a href="http://neuralconvo.huggingface.co/" target="_blank" rel="noopener" shape="rect">understand written text</a> and even <a href="https://deepmind.com/blog/deepmind-and-blizzard-open-starcraft-ii-ai-research-environment/" target="_blank" rel="noopener" shape="rect">play video games</a>.</div>
<div>
</div>
<div>
<h3>
Buzzwords like “LSTM” and “GAN” sound very cool but are they the right fit for purpose for your business problem?
</h3>
</div>
<div>
Neural Networks are (very loosely) modelled on the human brain: a series of <a href="https://en.wikipedia.org/wiki/Neuron">neurones</a> that pass signals to each other through synapses. Given recent news about deep learning and AI, you'd be forgiven for thinking that Deep Learning can do anything and everything and make humans all but obsolete. However, there are still lots of things they can't master. Buzzwords like “LSTM” and “GAN” sound very cool but are they the right fit for purpose for your business problem?
</div>
<div>
</div>
<h2>
Why is Training Data Important for Deep Learning?
</h2>
<div>
Neural Networks learn by backpropagation: this is an iterative process whereby the system makes a prediction and gets feedback about whether it was right or not. Over time and with many examples, the system is able to learn the correct answer by adjusting its internal model. It&#8217;s actually very similar to how children learn. Over time, an infant will learn to associate sounds with sights. The more a parent says &#8220;Mummy&#8221; or &#8220;Daddy&#8221;, the more the child&#8217;s brain learns that these are important words. If the child points at their father and says &#8220;Mummy&#8221;, they will likely be corrected &#8211; &#8220;no, that&#8217;s your Daddy&#8221; &#8211; and over time they start to learn the correct association.
</div>
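As a throwaway sketch of that feedback loop (a single &#8220;neurone&#8221; with one weight, written in plain Python rather than with a deep learning library), the system guesses, gets told how wrong it was, and nudges its internal belief a little each time:

<pre lang="python"># learn y = 2 * x from examples, starting from a wrong guess for the weight
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
weight = 0.1            # the model's initial (bad) internal belief
learning_rate = 0.01

for epoch in range(200):                      # many repeated exposures, like a toddler
    for x, target in examples:
        prediction = weight * x
        error = prediction - target          # the "no, that's your Daddy" feedback signal
        weight -= learning_rate * error * x  # adjust the belief a little, not all at once

print(round(weight, 3))  # converges towards 2.0
</pre>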
<div>
</div>
<div>
<p>
<figure id="attachment_320" aria-describedby="caption-attachment-320" style="width: 218px" class="wp-caption alignright"><img loading="lazy" class="wp-image-320 size-medium" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=218%2C300&#038;ssl=1" alt="" width="218" height="300" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=218%2C300&ssl=1 218w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=768%2C1059&ssl=1 768w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?resize=743%2C1024&ssl=1 743w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?w=1741&ssl=1 1741w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/papapishu-Baby-boy-sitting.png?w=1320&ssl=1 1320w" sizes="(max-width: 218px) 100vw, 218px" data-recalc-dims="1" /><figcaption id="caption-attachment-320" class="wp-caption-text">machines learn by example &#8211; just like babies.</figcaption></figure>
</p>
</div>
<div>
</div>
<div>
The thing about neural network back-propagation (and human learning) is that it takes time and it takes lots of experience &#8211; just like human brains! Imagine if humans took everything we heard as the absolute truth the first time we heard it. We'd be a race stuck with some terrible, incorrect opinions and assumptions, or we'd flip-flop between different points of view as they are presented to us. We learn by sampling experiences from many different sources and trying to generalise across them. Babies need to hear “mummy” and “daddy” hundreds or even thousands of times before it starts to sink in that the noisy signals that their ears are receiving have some higher significance. It takes the average human 10-15 months to learn to walk and 18-36 months to learn to talk. We don't just “pick things up” after one exposure to a concept; it takes our brains time to connect the dots and to understand correlations. The same is true of deep neural networks. This “thirst” for data and its associated drawbacks, like the need for huge amounts of compute power, can make deep learning the sub-optimal solution in a number of cases.
</div>
<div>
</div>
<div>
But never fear! Classical machine learning approaches like <a href="https://en.wikipedia.org/wiki/Support_vector_machine">SVM</a>, <a href="https://en.wikipedia.org/wiki/Random_forest">Random Decision Forest</a> or <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayesian Classifiers</a> may have fallen out of vogue but they often present a viable and appropriate solution in cases where “deep learning” won't work.
</div>
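As a hedged illustration of just how little is involved in keeping things &#8220;shallow&#8221; (this uses scikit-learn and its bundled toy digits dataset purely for demonstration), a classical model can be trained and evaluated on a modest CPU in a handful of lines:

<pre lang="python">from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# a small dataset (~1,800 examples) - the kind of size where classical models shine
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))  # typically well above 0.9
</pre>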
<div>
</div>
<h2>
<b>Deep Learning or Classical Learning?</b>
</h2>
<div>
Deciding whether to use deep learning ultimately comes down to a trade-off between how much data and compute power you can get your hands on vs how much time your engineers have to spend on the problem and how well they understand the problem.
</div>
<h3>
<span>Most data scientists will prefer to KISS rather than charge in with a deep learning model. </span>
</h3>
<div>
</div>
<div>
Here are 3 rules of thumb for deciding whether to use deep learning or not. You should consider classical models if at least one of these is true:
</div>
<div>
</div>
<ol>
<li>
<div>
You have an experienced data science team who understand feature engineering and the data they're being asked to model, or at the very least can get hold of people who understand the data.
</div>
</li>
<li>
<div>
You don't have access to GPUs and large amounts of compute power, or hardware and computing power are at a premium
</div>
</li>
<li>
<div>
You don't have lots of data (i.e. you have 100 or 1,000 examples rather than 100k or 1 million)
</div>
</li>
</ol>
<div>
Before we dive into those, I also have a rule zero: KISS &#8211; keep it simple, stupid. Most data scientists will prefer to KISS rather than charge in with a deep learning model. If you start with a classical model and don't get the performance that you need, a neural network could be a great secondary avenue. <a href="https://developers.google.com/machine-learning/guides/rules-of-ml/#ml_phase_i_your_first_pipeline">Google's data science community hold the same point of view.</a> If you are considering DNNs then now's the time to consider our other rules of thumb.
</div>
<div>
</div>
<div>
</div>
<h2>
<b>1: Data Scientists, Feature Engineering and Understanding the data</b>
</h2>
<div>
Features are fundamental properties of data. Think of an apple: features include colour (e.g. red/green/brown &#8211; eww), size and sweetness (Granny Smiths are bitter, Golden Delicious is sweet). In traditional machine learning, a huge amount of manual work is invested in feature engineering. The data scientist needs to understand a) the problem the model is trying to solve, b) the data being fed into the system and c) the attributes or features of that data that are relevant for solving the problem. For example, a model that predicts house prices may need to know the number of bedrooms that the house has and the year that the house was built, but may not care about which way around the toilet roll holder was installed in the bathroom. The data scientist needs to tweak the data that the model receives, turning features on and off in order to generate the most accurate results.
</div>
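To illustrate what that looks like in practice (a toy sketch using pandas, with entirely made-up columns and numbers rather than a real house price dataset), manual feature engineering is often as mundane as choosing, transforming and dropping columns by hand:

<pre lang="python">import pandas as pd

# hypothetical raw data - in a real project this would come from your own records
houses = pd.DataFrame({
    "bedrooms": [2, 3, 4, 3],
    "year_built": [1930, 1995, 2005, 1972],
    "toilet_roll_orientation": ["over", "under", "over", "under"],  # probably not predictive!
    "price": [180000, 260000, 340000, 230000],
})

# the data scientist decides which features the model is allowed to see...
features = houses[["bedrooms", "year_built"]].copy()
# ...and engineers new ones based on domain knowledge
features["age_of_house"] = 2018 - features["year_built"]

target = houses["price"]
print(features)
</pre>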
<h3>
<span>&#8230;a deep learning model may be able to learn features of the data that data scientists can&#8217;t but if a hand-engineered model gets you to 90% accuracy, is the extra data gathering and compute power worth it&#8230;?</span>
</h3>
<div>
</div>
<div>
Traditionally, feature engineering is a very manual process that requires experienced data scientists who understand the data and can make good inferences about how the model might react to data changes. Even with experienced data scientists, this activity can be more of an art than a science and is often very time-consuming. It is also really important that the data scientist understands the classifier's purpose and is able to make good intuitions about which parts of the input might have a bigger effect on model accuracy. If the data scientist doesn't have this information then they typically work very closely with a domain expert. For example, Filament doesn't employ <a href="https://www.filament.ai/case-studies/improving-airport-operations-data-science/">aerospace logistics specialists</a> or <a href="https://www.filament.ai/case-studies/revolutionising-deal-origination-machine-learning/">private equity investors</a> but we were able to create useful models for our clients by working collaboratively with their teams of experts. For some problems it is possible to guess the most effective features and <a href="https://www.filament.ai/ai-suite/engine/">use software to automatically tune the model iteratively.</a>
</div>
<div>
</div>
<div>
<p>
<figure id="attachment_322" aria-describedby="caption-attachment-322" style="width: 300px" class="wp-caption alignright"><img loading="lazy" class="size-medium wp-image-322" src="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=300%2C215&#038;ssl=1" alt="" width="300" height="215" srcset="https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=300%2C215&ssl=1 300w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=768%2C549&ssl=1 768w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?resize=1024%2C733&ssl=1 1024w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?w=1320&ssl=1 1320w, https://i0.wp.com/brainsteam.co.uk/wp-content/uploads/2018/10/settings.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-322" class="wp-caption-text">feature engineering is traditionally a very manual process</figcaption></figure>
</p>
<p>
Conversely, one of the most exciting things about “deep learning” is that these models are able to learn complex features for themselves over time. Just like a human brain slowly assigns meaning to the seemingly random photons that hit our retinas, deep networks are able to receive a series of pixels from images and slowly learn which patterns of pixels are interesting or predictive. The caveat is that automatically deriving these features requires huge volumes of data to learn from (see point 3). Ultimately, a deep learning model may be able to implicitly learn features of the data that human data scientists are unable to isolate, but if a classical, hand-engineered model gets you to 90% accuracy, is the extra data gathering and compute power worth it for that 5-7% boost?
</p>
</div>
<div>
</div>
<h2>
<b>2. Compute Power Requirements </b>
</h2>
<div>
Deep learning models usually consist of a vast number of neurones and synapses connected together in layers stacked on top of each other (hence the &#8220;deep&#8221;). The more neurones, the more connections between them and the more calculations the neural network has to make during training and usage. Classical models are typically orders of magnitude simpler and thus much faster to train and use. DNNs are often so complex and resource-intensive that they <a href="https://brainsteam.co.uk/2018/05/13/gpus-are-not-just-for-images-any-more/">require special hardware </a> in order to train and run.
</div>
<div>
</div>
<div>
<p>
<figure id="attachment_274" aria-describedby="caption-attachment-274" style="width: 300px" class="wp-caption alignleft"><img loading="lazy" class="wp-image-274 size-medium" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=300%2C218&#038;ssl=1" alt="" width="300" height="218" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=300%2C218&ssl=1 300w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=768%2C557&ssl=1 768w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?resize=1024%2C743&ssl=1 1024w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?w=1320&ssl=1 1320w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/05/Video-card.png?w=1980&ssl=1 1980w" sizes="(max-width: 300px) 100vw, 300px" data-recalc-dims="1" /><figcaption id="caption-attachment-274" class="wp-caption-text">Deep Neural Networks tend to rely on GPUs for their computational requirements</figcaption></figure>
</p>
<p>
It often makes sense to prefer simpler models in cases where compute resource is at a premium or even not available and where classical models give “good enough” accuracy &#8211; for example, in an edge computing environment in a factory, or in an anti-fraud solution at a retail bank where millions of transactions must be examined in real time. It would either be impossible or obscenely expensive to run a complex deep learning model on millions of data records in real time. Or, it might not be practical to install a cluster of whirring servers into your working environment. On the other hand, if accuracy is what you need and you have lots of data then maybe it's time to buy those GPUs&#8230;
</p>
</div>
<div>
</div>
<h2>
3: Lack of data and difficulty gathering data
</h2>
<div>
One of the biggest challenges in supervised machine learning is gathering training data. In order to train a classification or regression model (deep neural network or otherwise) we need to have loads of examples of inputs and their desired matching outputs. For example: “Here's a picture of a cat, when you see it I want you to say cat”. <a href="https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/">I've previously written about some best practices for gathering these kinds of datasets</a>. For a classical machine learning model you need to collect a few hundred or thousand examples of representative, consistent data points.
</div>
<div>
</div>
<div>
To train a deep learning model from scratch you need a lot, lot more. We're not talking hundreds or even thousands. Academic state-of-the-art image recognition models like <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet</a> are typically trained on millions of examples of images. NLP models for <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> and chatbots rely on <a href="https://en.wikipedia.org/wiki/Word2vec">word vectors</a> trained on the entirety of Wikipedia or the Google News 3-billion-word news article corpus (thankfully word2vec is an unsupervised algorithm so you don't need to manually annotate those billions of words, but you do need to label documents for downstream tasks like sentiment analysis or topic classification). These are the sorts of datasets that only digital behemoths like Google and Facebook, who collect millions of documents per day over many years, are able to build and curate.
</div>
<div>
</div>
<div>
<a href="https://www.kaggle.com/c/word2vec-nlp-tutorial#part-4-comparing-deep-and-non-deep-learning-methods">Recent benchmarks</a> show that manually feature-engineered “classical” machine learning models like those mentioned above sometimes outperform deep learning systems where datasets are relatively small. In other cases, DNNs offer a <a href="https://www.kdnuggets.com/2018/07/overview-benchmark-deep-learning-models-text-classification.html">slight uplift in performance</a> of the order of a few percent.
</div>
<div>
<b> </b>
</div>
<h2>
<b>Conclusion</b>
</h2>
<div>
Deploying a machine learning product is a complex and multi-faceted problem with many trade-offs and decisions to be made. Deep Learning and DNNs are a very exciting family of technologies that truly are revolutionising the world around us but they're not always the best approach to a machine learning problem. You should always consider the complexity of the problem you're trying to solve, the amount of data you have, the human expertise and the compute power you have access to. If a simpler model works well then go with it and potentially plan to swap in a more complex deep learning model when you have enough data to make it worthwhile. Don't use Deep Learning, Recurrent Networks, LSTMs, Convolutional Networks or GANs because it's cool. Use them because simple methods didn't work. Use them because manual feature engineering isn't giving optimal results. Use them because even though your simple SVM model has been producing great results for the last 10 years, you think that the 10 million rows of data that you've collected could potentially feed a more powerful model that will increase performance by 30%.
</div>

View File

@ -0,0 +1,39 @@
---
title: Uploading HUGE files to Gitea
author: James
type: post
date: 2018-10-20T10:09:41+00:00
url: /2018/10/20/uploading-huge-files-to-gitea/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- PhD
- Work
tags:
- devops
- docker
- git
- lfs
---
I recently stumbled upon and fell in love with [Gitea][1] &#8211; a lightweight self-hosted GitHub and GitLab alternative written in the Go programming language. One of my favourite things about it &#8211; other than the speed and efficiency that mean [you can even run it on a Raspberry Pi][2] &#8211; is the built-in LFS support. For the unfamiliar, [LFS is a protocol initially introduced by GitHub][3] that allows users to version control large binary files &#8211; something that Git is traditionally pretty poor at.
Some of my projects have huge datasets that I want to store somewhere safe and keep under version control. LFS is not perfect but it is a reasonable solution for this particular problem.
When I installed Gitea I was initially disappointed that uploading large files to LFS seemed to result in errors. I was getting:
<pre>api error: Authentication required: Authorization error: &lt;REPO_URL&gt;/info/lfs/objects/batch
Check that you have proper access to the repository
batch response: Authentication required: Authorization error: &lt;REPO_URL&gt;/info/lfs/objects/batch
Check that you have proper access to the repository</pre>
Irritatingly, I couldn&#8217;t find any references to this particular error message, nor any documentation about it. But I had a hunch that the authentication on the LFS upload was timing out, because I was able to upload smaller files that don&#8217;t take as long to send.
It turns out that this is exactly what was happening. When you push to a Gitea SSH repository, the server gives your local machine an authorization token that it can use to upload the LFS files. This token has an expiry time which defaults to 20 minutes in the future. If you&#8217;re uploading 5GB of data over a 100Mb-down/10Mb-up DSL line then you&#8217;re gonna have a bad time&#8230;
I had a dig through the Gitea GitHub repository and came across an example [config file][4] which includes a variable called LFS\_HTTP\_AUTH\_EXPIRY with a default value of 20m. In your Gitea config file you can set this to 120m, and then you have 2 hours to get that file uploaded. Adjust as you see fit/require.
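For reference, the relevant bit of my config ended up looking something like the snippet below. I&#8217;m assuming here (based on the sample config linked above) that the LFS settings live in the `[server]` section of app.ini &#8211; double check against your own install before copying it:

<pre>[server]
; ...existing server settings...
LFS_START_SERVER = true
; allow up to two hours for large LFS uploads (the default is 20m)
LFS_HTTP_AUTH_EXPIRY = 120m</pre>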
[1]: https://gitea.io/en-us/
[2]: https://pimylifeup.com/raspberry-pi-gitea/
[3]: https://git-lfs.github.com/
[4]: https://github.com/go-gitea/gitea/blob/master/custom/conf/app.ini.sample

View File

@ -0,0 +1,25 @@
---
title: Why is Tmux crashing on start?
author: James
type: post
date: 2018-11-07T07:40:45+00:00
url: /2018/11/07/why-is-tmux-crashing-on-start/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}'
categories:
- Open Source
tags:
- linux
- script
- tmux
---
I spent several hours trying to get to the bottom of why tmux was crashing as soon as I ran it on Fedora. It turns out there&#8217;s a simple fix. When tmux starts it uses /dev/ptmx to create a new TTY (virtual terminal) that the user can interact with. If your user does not have permission to access this device then tmux will silently die. A good way to verify this is to try running [screen][1] too.
In my case I realised that my user was not a member of the user group &#8220;tty&#8221; on my system. The answer was therefore simple:
<pre>sudo usermod -a -G tty james</pre>
I hope this helps someone avoid spending hours searching for the right incantation.
[1]: https://en.wikipedia.org/wiki/GNU_Screen

View File

@ -0,0 +1,121 @@
---
title: 🤐🤐Can Bots Keep Secrets? The Future of Chatbot Security and Conversational “Hacks”
author: James
type: post
date: 2018-12-09T10:36:34+00:00
url: /2018/12/09/🤐🤐can-bots-keep-secrets-the-future-of-chatbot-security-and-conversational-hacks/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:3:"yes";s:2:"id";s:12:"8be78d43ff66";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"6fc55de34f53";s:6:"status";s:6:"public";s:3:"url";s:121:"https://medium.com/@jamesravey/can-bots-keep-secrets-the-future-of-chatbot-security-and-conversational-hacks-8be78d43ff66";}'
categories:
- Work
tags:
- bots
- chatbots
- nlu
- security
---
**As adoption of chatbots and conversational interfaces continues to grow, how will businesses keep their brand safe and their customers&#8217; data safer?**
From [deliberate infiltration of systems][1] to [bugs that cause accidental data leakage][2], these days, the exposure or loss of personal data is a large part of what occupies almost every self-respecting CIO&#8217;s mind. Especially since [the EU has just slapped its first defendant with a GDPR fine][3].
Over the last 10-15 years, through the rise of the &#8220;interactive&#8221; web and social media, many companies have learned the hard way about the importance of techniques like&nbsp;[hashing passwords][4] stored in databases and&nbsp;[sanitising user input before it is used for querying databases][5]. However as the use of chatbots continues to grow, conversational systems are almost certain to become an attractive method of attack for discerning hackers.
In this article I&#8217;m going to talk about some different types of chatbot attacks that we might start to see and what could be done to prevent them.
## Man in the Middle Attack
In a man in the middle attack, the adversary intercepts traffic in between the many components that make up a chatbot. Baddies might be able to [inject something into a library][6] that your beautiful UX uses that logs everything that your user is saying or they might not need to change the code at all&nbsp;[if you are not using HTTPS][7].<figure class="wp-block-image">
<img loading="lazy" width="660" height="218" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Secure-Chat-1.png?resize=660%2C218&#038;ssl=1" alt="" class="wp-image-348" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Secure-Chat-1.png?w=756&ssl=1 756w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Secure-Chat-1.png?resize=300%2C99&ssl=1 300w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption>The chat interface on your device communicates (hopefully securely over HTTPS) with a server that the developer operates and may in term communicate with an external NLU provider. If someone was able to build a man-in-the-middle attack between any of these components it could be a big problem.</figcaption></figure>
These sorts of attacks are clearly a serious problem for any chatbot that will be talking to users about personal information. Even if your chatbot is designed to answer frequently asked questions without any specific link to personal accounts, vulnerability to this attack could give away personal information that the user has inadvertently shared (from &#8220;Do you have kids&#8217; meals?&#8221; and &#8220;Do you deliver to Example Street&#8221; we can infer that the user has children and lives on Example Street).
### Mitigation
Developers of chatbots should make sure that bots are using the [latest security standards][8] &#8211; at a minimum [all communication should be encrypted at the transport layer (e.g. HTTPS)][9] but you might also consider [encrypting the actual messages][10] before they are transmitted as well. If you&#8217;re reliant on external open source libraries then make sure you [regularly run security checks on your codebase][11] to make sure that those external libraries can be trusted. If you are deploying a bot in a commercial context then you should definitely have independent security/penetration testing of chatbots as a key part of your quality assurance process.
## Exploitation of Third Party Services
The chatbot has often been seen as the &#8220;silver bullet&#8221; for quickly acquiring usage. No longer do you need to build an app that users have to install on their devices, simply integrate with the platforms that people already use e.g. Facebook, Google Home, Alexa and others. However, it&#8217;s important to remember the security consequences of this approach, especially in use cases with sensitive personal information and high stakes if there was ever a data leak.<figure class="wp-block-image">
<img loading="lazy" width="660" height="272" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Secure-Chat-Webhook.png?resize=660%2C272&#038;ssl=1" alt="" class="wp-image-350" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Secure-Chat-Webhook.png?w=756&ssl=1 756w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Secure-Chat-Webhook.png?resize=300%2C123&ssl=1 300w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /><figcaption>Facebook, Alexa, WhatsApp, Telegram, Google Home and other bots use this pattern: your device communicates with the chat service you are engaging with which in turn sends messages back to your service via a &#8220;WebHook&#8221;
</figcaption></figure>
In this scenario your bot&#8217;s security is heavily reliant on the security of the messaging platform that you deploy your system onto. For the most part, these platforms have [sensible security procedures][12]. However, it&#8217;s important to consider that large companies and platforms are desirable targets for hackers due to the huge potential personal data pay-off from a successful breach.
Of course it&#8217;s not just the &#8220;Messenger Platform&#8221; part of this system that&#8217;s of interest to attackers. The &#8220;External NLU provider&#8221; in our diagram above could also be the target of an attack and user utterances stolen. Remember that any external service, whilst useful in many use cases, should be regarded with a healthy scepticism where security is concerned.
### Mitigation
If you are building chatbots tied to third party platforms then you can try to mitigate risks by coding defensively and sharing information sparingly.&nbsp;For example, never have your chatbot ask the user for things like passwords or credit card numbers through one of these portals. Instead use your companion app or website to gather this information securely and tie the user&#8217;s Messenger ID to their user account within your infrastructure.
When it comes to using external NLU a good practice is to run some [anonymisation, removing things like names, addresses, phone numbers etc,][13] on input utterances before passing them on to the service. You might also consider using on-premise NLU solutions so that chat utterances never have to leave your secure environment once they&#8217;ve been received.
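As a very rough sketch of that kind of pre-processing (just regexes for email addresses and UK-style phone numbers &#8211; a real system would use a proper named entity recogniser as described in the link above):

<pre lang="python">import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"(\+44\s?|0)\d([\s-]?\d){8,10}")

def anonymise(utterance):
    """Strip obvious personal identifiers before an utterance leaves your network."""
    utterance = EMAIL.sub("[EMAIL]", utterance)
    utterance = PHONE.sub("[PHONE]", utterance)
    return utterance

print(anonymise("Hi, it's 07700 900123, please email me at jo@example.com"))
# Hi, it's [PHONE], please email me at [EMAIL]
</pre>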
## Webhook Exploits
When your bot relies on an external messaging platform as in the above scenario, the WebHook can be another point of weakness. If hackers can find the URL of your webhook then [they can probe it and they can send it messages][14] that look like they&#8217;re from the messaging platform.&nbsp;
### Mitigation
Make sure that your webhook requires authentication and make sure that you follow the guidelines of whichever messenger platform you are using in order to authenticate all incoming messages. Never process messages that fail these checks.&nbsp;
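By way of a hedged example, Facebook Messenger signs each webhook delivery with an `X-Hub-Signature` header (an HMAC-SHA1 of the raw request body, keyed with your app secret). Other platforms use similar but not identical schemes, so treat this little Flask sketch as illustrative and follow your own platform&#8217;s documentation:

<pre lang="python">import hashlib
import hmac

from flask import Flask, abort, request

APP_SECRET = b"replace-with-your-app-secret"  # placeholder - load this from secure config

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    received = request.headers.get("X-Hub-Signature", "")
    expected = "sha1=" + hmac.new(APP_SECRET, request.get_data(), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(received, expected):
        abort(403)  # never process messages that fail the signature check
    # ...safe to hand the payload over to the bot logic here...
    return "ok"
</pre>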
## Unprotected Device Attacks
Have you ever left your computer unlocked and gone to the water cooler? How about handing your mobile phone to a friend in order to make a call or look at a funny meme? Most people have done this at least once and if you haven&#8217;t, well done!
You should&nbsp;[be prepared for opportunistic attackers posing as other users when using your chatbot][15]. They might ask probing questions in order to get the user&#8217;s information &#8220;What delivery address do you have for me again?&#8221; or &#8220;What credit card am I using?&#8221;&nbsp;
### Mitigation
Remember to code and design defensively. Responding with something like &#8220;I&#8217;m sorry I don&#8217;t know that but you can find out by logging in to the secure preferences page [URL Here]&#8221; would be a relatively good response.
Of course there&#8217;s not much you can do if the user leaves their passwords written down on a sticky note next to the terminal or leaves their password manager app unlocked but by requiring users log in to get access to sensitive personal info we&#8217;ve taken some sensible precautions.
## Brand Poisoning Attacks {#mce_12}
<div class="wp-block-image">
<figure class="alignleft"><img loading="lazy" width="200" height="200" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Tay_bot_logo.jpg?resize=200%2C200&#038;ssl=1" alt="" class="wp-image-343" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Tay_bot_logo.jpg?w=200&ssl=1 200w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/Tay_bot_logo.jpg?resize=150%2C150&ssl=1 150w" sizes="(max-width: 200px) 100vw, 200px" data-recalc-dims="1" /><figcaption>Microsoft Tay is one of the most famous examples of a brand poisoning attack</figcaption></figure>
</div>
User data and proprietary information are clearly a high priority but there are other risks to your chatbot that you should also be mindful of. An adversary could poison the way that your chatbot responds in order to screen capture it saying something controversial and start a defamation campaign, poisoning your brand and putting you in a sticky situation.&nbsp;
In March 2016, Microsoft brought online an experimental chatbot called &#8220;Tay&#8221; which was designed to learn to respond in new ways by interacting with its users over time. From a technical perspective, Tay was an incredible piece of kit combining state of the art Natural Language Processing with Online Machine Learning. However, the developers didn&#8217;t bank on swathes of twitter trolls poisoning Tay&#8217;s memory bank and [turning her into a Holocaust denying racist.][16]
This attack was able to happen because of Tay&#8217;s state-of-the-art architecture, which allowed her to learn and change her vocabulary and responses over time. In 2018 most bots still use a combination of intent detection and static rules in order to work out how to reply to users. This means that most bots probably aren&#8217;t susceptible to this kind of attack.
<div class="wp-block-image">
<figure class="alignleft is-resized"><img loading="lazy" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/image-2.png?resize=262%2C475&#038;ssl=1" alt="" class="wp-image-351" width="262" height="475" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/image-2.png?w=391&ssl=1 391w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2018/12/image-2.png?resize=165%2C300&ssl=1 165w" sizes="(max-width: 262px) 100vw, 262px" data-recalc-dims="1" /></figure>
</div>
&nbsp;However, there are still ways that this kind of attack can trip you up. It all hinges on how your bot reacts to abusive messages and whether it&#8217;s allowed to reiterate stuff that the user has said.
Take the example conversation to the left here. It&#8217;s not exactly undeniable proof of wrongdoing by Joe&#8217;s Shoe Emporium but a well-timed social media post or BuzzFeed article with &#8220;#NotADenial #BoycottJoes #ChildLabour&#8221; could be enough to really do a number on Joe&#8217;s brand.
## Mitigation
So how can we avoid this kind of thing? Well a good start would be to check the user input for profanity as part of validation and then refuse to continue the conversation if things turn hairy. Think of this a bit like a real contact centre handler who has been trained to hang up the phone if the customer gets angry or aggressive. IBM advocate for [all chatbots being able to detect and react to profanity][17] and there&#8217;s a great post [here][18] about some approaches to doing that. Ultimately the way that your bot reacts to rude input &#8211; whether passive, humorous or a simple shut down &#8211; will depend on how you want your brand to come across.
I&#8217;d advocate for &#8220;dealing with aggressive/subversive user interactions&#8221; being high on the chatbot QA team&#8217;s todo list.
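A toy sketch of that kind of gate (a hand-rolled word list purely for illustration &#8211; in practice you&#8217;d use a maintained profanity list or a moderation API, and something smarter than whitespace splitting):

<pre lang="python">PROFANITY = {"badword1", "badword2"}  # stand-ins for a real, maintained word list

def is_abusive(utterance):
    words = (w.strip("!?.,").lower() for w in utterance.split())
    return any(w in PROFANITY for w in words)

def respond(utterance):
    if is_abusive(utterance):
        # shut the conversation down politely rather than echoing the user's words back
        return "I can't help with that here, but you can reach a human colleague at [URL Here]."
    return "Thanks! Let me look into that for you."

print(respond("badword1 where are my shoes?"))
</pre>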
[1]: https://www.bbc.co.uk/news/technology-46401890
[2]: https://www.eurogamer.net/articles/2018-12-06-bethesda-has-leaked-fallout-76-customer-names-addresses-contact-details
[3]: https://www.lexology.com/library/detail.aspx?g=d8d0c69a-620e-4f26-ab30-f44e270a0d2e
[4]: https://appleinsider.com/articles/18/05/03/twitter-urges-all-336m-users-to-reset-passwords-due-to-hashing-bug
[5]: https://nakedsecurity.sophos.com/2018/02/19/hackers-sentenced-for-sql-injections-that-cost-300-million/
[6]: https://char.gd/recharged/daily/npm-as-an-attack-vector
[7]: https://developers.google.com/web/fundamentals/security/encrypt-in-transit/why-https
[8]: https://chatbotsmagazine.com/5-tips-for-securing-conversational-apps-a-security-guide-to-the-innovative-cio-b55128e3bc89
[9]: https://dev.to/chiangs/theres-no-excuse-not-to-have-ssl-anymore-76f
[10]: https://jwt.io/
[11]: https://medium.com/intrinsic/common-node-js-attack-vectors-the-dangers-of-malicious-modules-863ae949e7e8
[12]: https://cloud.google.com/security/
[13]: https://medium.com/@dudsdu/named-entity-recognition-for-unstructured-documents-c325d47c7e3a
[14]: https://chatbotsmagazine.com/how-to-kill-a-bot-with-10-http-requests-ca7eb57c2ad1
[15]: https://www.securityroundtable.org/chatbots-rage-something-risk/
[16]: https://gizmodo.com/here-are-the-microsoft-twitter-bot-s-craziest-racist-ra-1766820160
[17]: https://www.ibm.com/blogs/watson/2017/10/the-code-of-ethics-for-ai-and-chatbots-that-every-brand-should-follow/
[18]: https://medium.com/@steve.worswick/the-curse-of-the-chatbot-users-b8af9e186d2e

View File

@ -0,0 +1,93 @@
---
title: Applied AI in 2019
author: James
type: post
date: 2019-01-06T09:52:35+00:00
url: /2019/01/06/applied-ai-in-2019/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:3:"yes";s:2:"id";s:12:"d1473c0a48ca";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"6fc55de34f53";s:6:"status";s:6:"public";s:3:"url";s:62:"https://medium.com/@jamesravey/applied-ai-in-2019-d1473c0a48ca";}'
categories:
- Uncategorized
tags:
- AI
- futurism
- nlp
- vision
---
<p style="font-size:0">
<strong>Looking back at some of the biggest AI and ML developments from 2018 and how they might influence applied AI in the coming year.</strong>
</p>
2018 was a pretty exciting year for AI developments. It&#8217;s true to say there is still a lot of hype in the space but it feels like people are beginning to really understand where AI can and can&#8217;t help them solve practical problems.
In this article we&#8217;ll take a look at some of the AI innovation that came out of academia and research teams in 2018 and how they might affect practical AI use cases in the coming year.
## More Accurate and Faster-to-Train NLP Models with Pre-trained Language Models
Imagine if instead of going to school and university you could be given a microchip implant that teaches you most things you need to know about your subject of choice. You&#8217;d still need to learn by doing when you landed a job with your &#8220;instant&#8221; degree and fine-tune the knowledge that had been given to you but hey, we&#8217;re talking about 6-12 months of learning instead of 12-18 years. That&#8217;s the essence of what Transfer Learning is all about within the Machine Learning and Deep Learning space.
BERT is a state-of-the-art neural NLP model [unveiled by Google][1] in November 2018. It, like a number of other models unveiled in 2018 such as [ELMo][2] and [ULMFiT][3], can be pre-trained on unlabelled text (think news articles, contracts and legal terms, research papers or even Wikipedia) and then used to support supervised/labelled tasks that require much smaller volumes of training data than an end-to-end supervised task. For example we might want to automate trading of stocks and shares based on sentiment about companies in the news. In the old days we&#8217;d have to spend weeks having armies of annotators read news articles and highlight companies and indicators of sentiment. A pre-trained language model may already have captured the underlying semantic relationships needed to understand company sentiment, so we only need to annotate a fraction of the data that we would if we were training from scratch.
Of course, another benefit of using pre-trained models is reduced training time and compute resources (read: server/energy costs). Like half-baked bread, the model still needs some time in the oven to make the connections it needs to perform its final task, but this is a relatively short amount of time compared to training from scratch.
In 2019 we&#8217;ll be training lots of NLP models a lot more quickly and effectively thanks to these techniques.
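To make the &#8220;half-baked bread&#8221; idea a little more concrete, here&#8217;s a deliberately generic PyTorch sketch (stand-in modules and random tensors rather than real pre-trained weights &#8211; in practice you&#8217;d load BERT, ELMo or a ULMFiT language model): the expensive pre-trained body is frozen and only a small task-specific head gets trained on your modest labelled dataset:

<pre lang="python">import torch
import torch.nn as nn

# stand-in for a large pre-trained language model (in reality: BERT, ELMo, ULMFiT...)
pretrained_encoder = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
for param in pretrained_encoder.parameters():
    param.requires_grad = False  # freeze the expensively pre-trained knowledge

# small task-specific head, e.g. positive/negative company sentiment
classifier_head = nn.Linear(768, 2)
optimiser = torch.optim.Adam(classifier_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# a tiny fake labelled batch - the point is that you need far fewer of these
features = torch.randn(32, 768)
labels = torch.randint(0, 2, (32,))

logits = classifier_head(pretrained_encoder(features))
loss = loss_fn(logits, labels)
loss.backward()
optimiser.step()
</pre>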
## Photo-realistic Image Creation
For those unfamiliar with GANs (Generative Adversarial Networks), we&#8217;re talking about unsupervised neural models that can learn to generate photo-realistic images of people, places and things that don&#8217;t really exist. Let that sink in for a moment!
Originally invented by [Ian Goodfellow in 2014][4], GANs back then were able to generate small, pixelated likenesses but they&#8217;ve come a long way. [StyleGAN][5] is a paper published by a research team at NVIDIA which came out in December and might have slipped under your radar in the festive mayhem of that month. However StyleGAN represents some serious progress in generated photo-realism.
Firstly, StyleGAN can generate images up to 1024&#215;1024 pixels. That&#8217;s still not huge in terms of modern photography but Instagram pictures are 1080&#215;1080 and most social media networks will chop your images down to this kind of ballpark in order to save bandwidth, so we&#8217;re pretty close to having GANs that can generate social-media-ready images.
The second major leap represented by StyleGAN is the ability to exercise tight control over the style of the image being generated. Previous GAN implementations generated their images at random. StyleGAN uses parameters to control the styles of the output images, changing things like hair colour, whether or not the person is wearing glasses, and other physical properties.
<div class="wp-block-image">
<figure class="aligncenter"><img loading="lazy" width="420" height="183" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2019/01/image.png?resize=420%2C183&#038;ssl=1" alt="" class="wp-image-357" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2019/01/image.png?w=420&ssl=1 420w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2019/01/image.png?resize=300%2C131&ssl=1 300w" sizes="(max-width: 420px) 100vw, 420px" data-recalc-dims="1" /><figcaption>Figure 8 from the StyleGan paper published <a href="https://arxiv.org/pdf/1812.04948.pdf">here</a> shows how manipulating one of the model parameters can give fine-grained control over the appearance of the output.</figcaption></figure>
</div>
Brands and digital marketing agencies are already seeing huge success with [CGI brand influencers][6] on instagram. GANs that can be tightly controlled in order to position items, clothing and products in the image could be the next logical evolution of these kinds of accounts.
In 2019 we think fake photos could be the next big thing in digital media.
## Hyper-Realistic Voice Assistants
In May 2018 Google showed us a glimpse of Google Duplex, an extension of their assistant product that used hyper-realistic speech recognition and generation [to phone a hair dresser and schedule an appointment][7]. There were a few pretty well argued and important [ethical concerns about having AIs pretend to be humans.][8] However, hyper-realistic voice assistant tech is coming.
There are huge advantages to these approaches, not just for end consumers, but for businesses too. Many businesses already have chatbots that allow users to chat to them via WhatsApp or Facebook and there are early-adopters building voice skills for Google Home and Amazon Alexa. Humans are always going to be a necessary and important part of customer interaction handling since machines will always make mistakes and need re-training. Nonetheless, automation can help reduce the stress and strain on contact-centre operators at peak times and allow humans to deal with the more interesting enquiries by handling the most common customer questions on their behalf.
In 2019 we expect the voice interface revolution to continue to pick up pace.
## GDPR and Model Interpretability
Ok so I&#8217;m cheating a bit here since GDPR was not a technical AI/ML development but a legal one. In May 2018, GDPR was enacted across Europe and, since the internet knows no borders, most web providers internationally started to adopt GDPR best practices like asking you if it&#8217;s OK to track your behaviour with cookies and telling you what data they store on you.
GDPR also grants individuals the following right:
<blockquote class="wp-block-quote">
<p>
not to be subject to a decision, which may include a measure, evaluating personal aspects relating to him or her which is based solely on automated processing and which produces legal effects concerning him or her or similarly significantly affects him or her, such as automatic refusal of an online credit application or e-recruiting practices without any human intervention.
</p>
<cite><a href="https://www.privacy-regulation.eu/en/r71.htm">GDPR Recital 71</a> </cite>
</blockquote>
This isn&#8217;t a clear-cut right to an explanation for all automated decisions but it does mean that extra diligence should be carried out where possible in order to understand automated decisions that affect users&#8217; legal rights. As the provision says, this could massively affect credit scoring bureaus and e-recruitment firms but could also affect car insurance firms who use telemetrics data as part of their decision making process when paying out for claims, or retailers that use algorithms to decide whether to accept returned digital or physical goods.
In 2018 the best practices for model interpretability lay in training a &#8220;meta model&#8221; that sits on top of your highly accurate deep neural network and tries to guess which features of the data caused it to make a particular decision. These meta-models are normally simple in implementation (e.g. automated decision trees) so that they themselves can be directly inspected and interpreted.
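As an illustration of the surrogate idea (a sketch only, assuming scikit-learn and a synthetic dataset rather than any particular production system), we can fit a shallow decision tree to a black-box model&#8217;s predictions and then read the tree&#8217;s rules directly:

<pre class="wp-block-code"><code>from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# The "highly accurate" black box we want to explain.
black_box = GradientBoostingClassifier().fit(X, y)

# Train the surrogate on the black box's *predictions*, not the true labels,
# so the shallow tree approximates the model's decision logic.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, black_box.predict(X))

print(export_text(surrogate))                    # human-readable rules
print(surrogate.score(X, black_box.predict(X)))  # how faithfully the tree mimics the model
</code></pre>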
Whether spurred on by the letter of the law or not, understanding why your model made a particular decision can be useful for diagnosing flaws and undesirable biases in your systems anyway.
In 2019 we expect that model interpretability will help providers and developers of AI to improve their approach and offer their users more transparency about decisions made.
[1]: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[2]: https://arxiv.org/abs/1802.05365
[3]: https://arxiv.org/abs/1801.06146
[4]: https://arxiv.org/abs/1406.2661
[5]: https://arxiv.org/abs/1812.04948
[6]: https://www.thecut.com/2018/05/lil-miquela-digital-avatar-instagram-influencer.html
[7]: https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
[8]: https://uk.pcmag.com/opinions/94828/google-duplex-is-classist-heres-how-to-fix-it

View File

@ -0,0 +1,37 @@
---
title: Spacy Link or “How not to keep downloading the same files over and over”
author: James
type: post
date: 2019-01-15T18:14:16+00:00
url: /2019/01/15/spacy-link-or-how-not-to-keep-downloading-the-same-files-over-and-over/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"11a44e1c247f";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:114:"https://medium.com/@jamesravey/spacy-link-or-how-not-to-keep-downloading-the-same-files-over-and-over-11a44e1c247f";}'
categories:
- Uncategorized
---
If you&#8217;re a frequent user of spacy and virtualenv you might well be all too familiar with the following:
<blockquote class="wp-block-quote">
<p>
python -m spacy download en_core_web_lg<br /> Collecting en_core_web_lg==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz#egg=en_core_web_lg==2.0.0<br /> Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)<br /> 5% |█▉ | 49.8MB 11.5MB/s eta 0:01:10
</p>
</blockquote>
If you&#8217;re lucky and you have a decent internet connection then great, if not it&#8217;s time to make a cup of tea.
Even if your internet connection is good, did you ever stop to look at how much disk space your python virtual environments were using up? I recently found that about 40GB of disk space on my laptop was being used by spacy models I&#8217;d downloaded and forgotten about.
Fear not &#8211; spacy link offers you salvation from this wasteful use of disk space.
Spacy link essentially allows you to link your virtualenv copy of spacy to a copy of the model you already downloaded. Say you installed your desired spacy model to your global python3 installation &#8211; somewhere like _/usr/lib/python3/site-packages/spacy/data_.
Spacy link will let you link your existing model into a virtualenv to save redownloading (and using extra disk space). From your virtualenv you can do:
`python -m spacy link /usr/lib/python3/site-packages/spacy/data/<name_of_model> <name of model>`
For example, if we wanted to make **en\_core\_web_lg** the default English model in our virtualenv we could do:
`python -m spacy link /usr/lib/python3/site-packages/spacy/data/en_core_web_lg en`
Presto! Now when we do **spacy.load(&#8216;en&#8217;)** inside our virtualenv we get the large model!
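A quick sanity check (a small sketch &#8211; the exact metadata fields depend on your spaCy version) to confirm that the link resolved to the large model:

<pre class="wp-block-code"><code>import spacy

nlp = spacy.load("en")  # resolves via the shortcut link we just created
print(nlp.meta["lang"], nlp.meta["name"])  # expect something like: en core_web_lg
</code></pre>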

View File

@ -0,0 +1,73 @@
---
title: 'Why Im excited about Kubernetes + Google Anthos: the Future of Enterprise AI deployment'
author: James
type: post
date: 2019-04-24T10:33:24+00:00
url: /2019/04/24/why-im-excited-about-kubernetes-google-anthos-the-future-of-enterprise-ai-deployment/
featured_image: /wp-content/uploads/2019/04/cargo-cargo-container-city-262353-825x510.jpg
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:3:"yes";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:12:"6fc55de34f53";s:6:"status";s:6:"public";s:3:"url";N;}'
categories:
- Uncategorized
tags:
- devops
- docker
- filament
- google
- kubernetes
---
### _Filament build and deploy enterprise AI applications on behalf of incumbent institutions in finance, biotech, facilities management and other sectors. James Ravenscroft, CTO at Filament, writes about the challenges of enterprise software deployment and the opportunities presented by Kubernetes and Google&#8217;s Anthos offering._
It is a big myth that bringing a software package to market starts and ends with developers and testers. One of the most important, complex and time consuming parts of enterprise software projects is around packaging up the code and making it run across lots of different systems: commonly and affectionately termed “DevOps” in many organisations.
## The Role of Devops in B2C and B2B Software companies
DevOps engineers for consumer-facing software have a pretty tough job. They engineer and maintain complex pipelines of automation that take the latest version of the developers&#8217; code, run tests on it and then build a smorgasbord of installers and self-extracting packages to work on Android, iOS, Mac and Windows (and probably some common flavours of Linux). There are sometimes platform-specific tricks that need to be carried out in order to get things running smoothly but thankfully these are less common when you&#8217;re supporting a limited set of operating systems.
DevOps for enterprise software is a different ballgame. Most enterprise software will need to interact with a number of external systems such as a relational database server for data storage or an enterprise directory service for company-wide security. Furthermore, these systems are often configured differently or running completely different products from organisation to organisation. When deploying enterprise software there are very real legal and internal/organisational rules about which systems can access each other, the flow of data between components and even whether new systems can be installed at all. Sometimes it takes months or years for DevOps to work with the customer and the development team to ensure that each environment is supported.
## Docker and Kubernetes: A step in the right direction.
Docker + Kubernetes have gone a long way towards fixing this problem. Docker allows you to pack away your application code and all its dependencies into a self-contained, neatly packed little shipping container that is pretty much plug-and-play on any other system running the docker environment. If docker provides shipping containers, Kubernetes is the crew of the cargo ship that makes sure all the cargo gets to its destination and [keeps the fox from eating the goose][1]. Kubernetes organises containers so that those that need to communicate can do so and those that need to be kept isolated are secured.
Just as a consumer software package can be shipped as an installer package, Kubernetes + Docker allow multi-faceted enterprise applications with a number of moving parts to be shipped and deployed in a standard way too. Anyone running a compatible Kubernetes engine cluster can easily deploy self-contained applications to their systems and flexibly swap out components as per company policy (for example, they might switch the packaged database system for one already provided by their organisation).
Unfortunately, customizability and configurability are something of a catch-22 in this scenario. Kubernetes configurations can vary widely from organisation to organisation and even vary quite significantly between cloud providers (Google Kubernetes Engine and Amazon&#8217;s Elastic Kubernetes Service are vastly different in implementation and configuration).
At Filament we follow the [Kubernetes community guidelines][2] but many organisations still benefit from running Kubernetes without conforming to these guidelines. Some organisations may modify their setup due to governance and internal policies; others may be early adopters who started using Kubernetes before best practices were established and for whom changing their process may be an expense they&#8217;d rather not bear.
So, aside from making enterprise application deployment easier, why should organisations standardise their Kubernetes implementations?
## Anthos: a great reason to standardise.
[Anthos][3] is Google&#8217;s new hybrid/multi-cloud system that promises to allow developers to build and package enterprise apps once and then run them in a hybrid cloud environment.
The benefits of a hybrid cloud are numerous. One of the biggest is that you can centralise and coordinate the management of your on-premise and cloud-based IT resources &#8211; reducing overhead in IT project implementations. Your high-volume systems can leverage rapid and dynamic scaling in the cloud whilst securely communicating with systems that keep sensitive data safely on premise.
With Google&#8217;s Cloud offering you get access to the Kubernetes marketplace and a vast number of instantly deployable enterprise solutions &#8211; as long, of course, as your Kubernetes setup is roughly in line with the community best practices.
Google have also mentioned that Anthos [will support other cloud providers such as AWS and Azure][4], which is really exciting. Although the specific details of this statement aren&#8217;t clear yet, it may mean that even if your organisation uses AWS you might be able to leverage Google Kubernetes Marketplace applications. The flip-side of this is that organisations that provide Google-Marketplace-compatible applications **_might_** get deployment onto AWS and Azure for free.
Open Kubernetes standardisation, likely driven by Anthos uptake, is an exciting opportunity for enterprise software vendors and enterprise software consumers. With a standard Kubernetes deployment you&#8217;ll be able to quickly deploy vastly complex enterprise applications across a number of cloud and on-premise environments and save days or weeks of headaches.
Filament are standardising all of our applications to use Kubernetes for deployment and we&#8217;ve seen some incredible time savings that we only anticipate getting bigger!
I think we&#8217;re likely to see most of the enterprise IT sector work towards standard Kubernetes deployment strategies in the next 5 years.
[1]: https://en.wikipedia.org/wiki/Fox,_goose_and_bag_of_beans_puzzle
[2]: https://kubernetes.io/docs/concepts/configuration/overview/
[3]: https://cloud.google.com/anthos/
[4]: https://techcrunch.com/2019/04/09/googles-anthos-hybrid-cloud-platform-is-coming-to-aws-and-azure/

View File

@ -0,0 +1,99 @@
---
title: How can AI practitioners reduce our carbon footprint?
author: James
type: post
date: 2019-06-20T09:18:40+00:00
url: /2019/06/20/how-can-ai-practitioners-reduce-our-carbon-footprint/
featured_image: /wp-content/uploads/2019/06/ash-blaze-burn-266487-825x510.jpg
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";N;s:2:"id";N;s:21:"follower_notification";N;s:7:"license";N;s:14:"publication_id";N;s:6:"status";N;s:3:"url";N;}'
categories:
- Uncategorized
tags:
- AI
- climate catastrophe
- climate change
- machine learning
- nlp
---
In recent weeks and months the impending global climate catastrophe has been at the forefront of many people&#8217;s minds. Thanks to movements like [Extinction Rebellion][1] and high profile environmentalists like [Greta Thunberg][2] and [David Attenborough][3], as well as damning reports from the [IPCC][4], it finally feels like momentum is building behind significant reduction of carbon emissions. That said, knowing how we can help on an individual level beyond driving and flying less still feels very overwhelming.
### The Energy Issue
A recent study by [Strubell et al. (2019)][5] gave insight into exactly how much energy certain neural architectures require to train. Their findings show that training some of the largest and most complex neural models, and neural architecture search (in which multiple models are trained and measured against a fitness function to find the most performant model for a given task), consumes huge amounts of energy. Assuming that energy came from fossil-fuel power plants &#8211; a fair assumption since most researchers are using cloud providers like AWS and GCP, which rely largely on carbon-generated electricity &#8211; the models are producing more CO2 pollution than a car produces in its lifetime.
Predictably, mainstream media misconstrued the findings and articles proposing abandonment of deep learning as a field started to surface (see Charles Radclyffe&#8217;s Forbes article [AI&#8217;s Dirty Secret][6]: &#8220;if AI really does burn this much electricity, then maybe we should just pull the plug if we&#8217;re serious about climate change?&#8221;).
My biggest objection to this conclusion is that it is based upon the notion that all AI is this power hungry. As I said above, Strubell&#8217;s study is based on some of the biggest and most complex models in the field today. My intuition would be that most data scientists and AI researchers are not training models anywhere near this big, and for many data problems it is not even necessary to use deep learning (as I discuss below).
My second objection to the notion that we should scrap AI is that it dismisses any and all potential benefits of continuing to develop models that reduce energy consumption by optimising [data centres][7], [logistics routes][8] and even [energy grids][9]. In the future the mass adoption of self-driving tech could save vast amounts of energy by removing erratic human drivers from the road with their fuel-hungry acceleration and braking behaviours. No more human drivers? Less need for traffic control measures which force millions of us to slow down and speed up every day &#8211; burning large amounts of fuel that wouldn&#8217;t be needed if we maintained a steady speed. None of this would be possible if we just stopped trying to improve deep learning approaches overnight.
The BERT language model, one of Strubell&#8217;s worst offenders, is, at the time of writing, the state-of-the-art approach for a number of natural language processing tasks. What if BERT-based models powering chatbots and smart speakers could help consumers to make better purchasing decisions and prevent thousands of packages from being shipped and then returned on gas-guzzling lorries, planes and cargo ships?
20 years ago most of us had power-hungry CRT monitors and TVs that we&#8217;ve since replaced with [more efficient LCD and LED displays][10]. We were using incandescent lightbulbs that [use 6x more electricity than a modern bulb and need replacing an order of magnitude more frequently][11]. Our renewable generation technology has come on leaps and bounds, with solar panels becoming [significantly cheaper and more efficient over the last 20 years.][12] My point here is that humans are pretty good at improving the energy efficiency of our inventions. I&#8217;m sure most readers who sit in their electrically lit living room at 10pm watching a flat screen TV or scrolling on an OLED touch screen on their smartphone are glad that we didn&#8217;t give up on these technologies because CRT screens and incandescent bulbs were too energy hungry.
### What can the AI community do?
There are a number of things that the AI community can do to help reduce their carbon footprint. Some are simpler and more straightforward; others are a little more involved.
#### KISS &#8211; Keep it simple stupid!
When you&#8217;re building ML models always start with a simple model first. It may be tempting to charge in with a deep learning model immediately but these models are slow to train, prone to overfitting due to their complexity and, of course, energy hungry. Aside from appeasing the marketing department, there is absolutely no advantage to using a deep learning model before you&#8217;ve even tried Logistic Regression or, whoa don&#8217;t go too crazy now, a random decision forest!
Even if you train a few different &#8216;simple&#8217; models with different data folds and hyperparameters, you&#8217;ll probably find them a quicker and less energy-hungry starting point. Of course, if simple models don&#8217;t work, deep learning is a good option.
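As a rough illustration (a sketch with synthetic data, not a benchmark), cross-validating a couple of cheap baselines takes a few lines of scikit-learn and gives you a yardstick before any deep learning is attempted:

<pre class="wp-block-code"><code>from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for your real tabular dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=100))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))
</code></pre>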
#### Pre-trained models and transfer learning
This could apply to both simple models (well kinda) and deep learning models.
It is well known by now that the best way to get near state of the art performance for classification tasks in NLP and computer vision is to take a pre-trained model like [BERT][13] or [ResNet][14] and &#8220;continue&#8221; training by updating the last few layers of the neural model with new weights.
Unless you&#8217;re a multi-national or a top-tier research institute with lots of money and data to throw at training, trying to train one of these systems from scratch may be a waste of time and energy anyway (I said &#8216;may be,&#8217; not &#8216;always&#8217;. If you&#8217;re working on new state-of-the-art models then I salute you! We should always strive to better ourselves!).
You can also combine the KISS approach with pre-trained weights. You can achieve some [really great text classification results][15] by using pre-trained word embeddings like [GloVe][16], [word2vec][17] or [fastText][18] with a linear classification model like SVM.
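A minimal sketch of that approach (assuming you have a local copy of the GloVe vectors; the file path and the two toy documents are placeholders) might look like this:

<pre class="wp-block-code"><code>import numpy as np
from sklearn.svm import LinearSVC

def load_glove(path):
    """Load GloVe vectors from their plain-text format into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def doc_vector(text, vectors, dim=100):
    """Represent a document as the average of its word vectors."""
    words = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(dim)

glove = load_glove("glove.6B.100d.txt")  # hypothetical local path to the GloVe download
texts = ["great fun and a lovely cast", "a dull and lifeless film"]
labels = [1, 0]

X = np.stack([doc_vector(t, glove) for t in texts])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
</code></pre>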
#### Scale down big data
If you&#8217;re developing a model and working with a massive dataset, you might consider training on a small but representative subset of the data. You&#8217;ll need to be very careful about this, especially if your dataset is not well balanced or has very rare features (in NLP this could be words that are important but only occur in a tiny proportion of documents). However, if you know that you&#8217;re likely to need to change the model 10 more times before you calculate your final performance metrics, it might (but won&#8217;t always) make sense to train it on 10,000 samples instead of 100,000 samples.
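A stratified sample is usually the safest way to do this; here is a small sketch with made-up, imbalanced data:

<pre class="wp-block-code"><code>from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a large, imbalanced dataset (95:5 class split).
X_full, y_full = make_classification(n_samples=100000, weights=[0.95, 0.05], random_state=0)

# Develop against a stratified 10k-row sample so the rare class keeps its proportion.
X_dev, _, y_dev, _ = train_test_split(
    X_full, y_full, train_size=10000, stratify=y_full, random_state=0)

print(X_dev.shape, y_dev.mean())
</code></pre>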
If you&#8217;re building models that use a gradient descent or evolutionary training approach then you could also limit the number of epochs during development of your model.
#### Give patronage to &#8220;green&#8221; hosting providers
Big companies are not always the most transparent, so this suggestion could be trickier. That said, taking your money where the ethical hosting is could be a good way to reduce your model&#8217;s carbon footprint, especially if you are one of the pioneers working on massive models that use a lot of electricity. Hardware is an important consideration too. GPUs have been a key tool in the evolution of deep learning over the last 10 years but it turns out that [TPUs are better suited to deep learning and much less energy][19] hungry too.
#### Controversial Suggestion: Carbon Reporting in AI and ML Scientific Publications
This one&#8217;s probably going to be a divisive suggestion, but what if we could get all the big ML academic conferences to require some basic calculation of energy usage with all new model architecture submissions? The idea is to introduce a race to the bottom for AI model power consumption. A model that uses 100x less electricity and achieves near state-of-the-art performance would be much more interesting than one that improves state-of-the-art performance by 0.1%.
I&#8217;m well aware that this solution is far from perfect given cloud hosting transparency concerns (see above) and conference organisers would have to think carefully about how to set up peer reviews in a way that avoids always rewarding energy efficiency at the expense of model task performance.
I guess another approach could be an international conference for energy-efficient machine learning systems. If there&#8217;s enough interest from the academic community, I&#8217;d seriously consider organising such an event &#8211; and if one already exists, I&#8217;d be interested in participating.
If you&#8217;d like to discuss the above I&#8217;m on twitter [@jamesravey][20]
## Conclusion
In closing, I&#8217;m really glad that Strubell et al. have brought this issue to the forefront of our minds and that the work has picked up so much attention. Rather than panicking and downing our tools, I think it&#8217;s important that we remain optimistic about AI and the huge advantages that it can bring, and that we try to be as considerate as possible of environmental factors whenever we develop new approaches.
[1]: https://www.standard.co.uk/news/london/extinction-rebellion-activists-block-major-roads-in-north-london-in-latest-stunt-a4171456.html
[2]: https://www.theguardian.com/world/2019/mar/11/greta-thunberg-schoolgirl-climate-change-warrior-some-people-can-let-things-go-i-cant
[3]: https://www.bbc.co.uk/news/entertainment-arts-47988337
[4]: https://www.theguardian.com/environment/2018/oct/08/global-warming-must-not-exceed-15c-warns-landmark-un-report
[5]: https://arxiv.org/abs/1906.02243
[6]: https://www.forbes.com/sites/charlesradclyffe/2019/06/10/ais-dirty-secret/#3ba5ac665331
[7]: https://www.datacenterdynamics.com/analysis/the-machine-itself-the-state-of-ai-in-the-data-center/
[8]: https://www.logistics.dhl/content/dam/dhl/global/core/documents/pdf/glo-artificial-intelligence-in-logistics-trend-report.pdf
[9]: https://www.newtonx.com/insights/2018/08/21/ai-machine-learning-power-grid/
[10]: http://energyusecalculator.com/electricity_lcdleddisplay.htm
[11]: http://energyusecalculator.com/electricity_ledlightbulb.htm
[12]: http://sitn.hms.harvard.edu/flash/2019/future-solar-bright/
[13]: https://github.com/google-research/bert
[14]: https://github.com/tensorflow/models/tree/master/official/resnet
[15]: https://arxiv.org/abs/1805.09843
[16]: https://nlp.stanford.edu/projects/glove/
[17]: https://code.google.com/archive/p/word2vec/
[18]: https://fasttext.cc/docs/en/english-vectors.html
[19]: https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
[20]: https://twitter.com/jamesravey

View File

@ -0,0 +1,35 @@
---
title: PyTorch 1.X.X and Pipenv and Specific versions of CUDA
author: James
type: post
date: 2020-02-02T14:40:46+00:00
url: /2020/02/02/pytorch-1-x-x-and-pipenv-and-specific-versions-of-cuda/
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"8e038847a808";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:98:"https://medium.com/@jamesravey/pytorch-1-x-x-and-pipenv-and-specific-versions-of-cuda-8e038847a808";}'
categories:
- Uncategorized
tags:
- developer
- projects
- python
---
I recently ran into an issue where the newest version of Torch (as of writing 1.4.0) requires a newer version of CUDA/Nvidia Drivers than I have installed.
Last time I tried to upgrade my CUDA version it took me several hours/days so I didn&#8217;t really want to have to spend lots of time on that.
As it happens PyTorch has an archive of compiled python whl objects for different combinations of Python version (3.5, 3.6, 3.7, 3.8 &#8211; heck even 2.X which is no longer officially supported), CUDA Version (9.2, 10.0, 10.1) and Torch version (from 0.1 to 1.4). You can specify which you want to install if you know the right incantation. The full index is available [here][1]
If you&#8217;re using pip and virtualenvs to manage your python environment you can just run:
<pre class="wp-block-preformatted" lang="shell">pip install torch==1.4.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html </pre>
This will install torch 1.4.0 with cuda 10.0 support and it&#8217;ll work out which version of Python you&#8217;re running for you.
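Once installed, a quick sanity check from Python confirms which build actually landed in the environment (a small sketch &#8211; the printed values will depend on your install):

<pre class="wp-block-code"><code>import torch

print(torch.__version__)          # e.g. 1.4.0+cu100
print(torch.version.cuda)         # CUDA toolkit version the wheel was built against
print(torch.cuda.is_available())  # whether the local driver can actually use it
</code></pre>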
Now if you&#8217;re using Pipenv, which tries to simplify virtualenv management and package versioning, you&#8217;ll quickly see that there is no way to run the above with `pipenv install`. Currently the only solution I can find is to manually run the above command prefixed with `pipenv run`.
So far I&#8217;ve only found one other person who&#8217;s asked about this particular issue on [stackoverflow][2]. I&#8217;ve also [opened a github ticket][3] in the pipenv project. I am curious to know if anyone else has run into this issue or has a solution.
[1]: https://download.pytorch.org/whl/torch_stable.html
[2]: https://stackoverflow.com/questions/59752559/how-to-specify-pytorch-cuda-version-in-pipenv
[3]: https://github.com/pypa/pipenv/issues/4121

View File

@ -0,0 +1,107 @@
---
title: 'Dark Recommendation Engines: Algorithmic curation as part of a healthy information diet.'
author: James
type: post
date: 2020-09-04T15:30:19+00:00
url: /2020/09/04/dark-recommendation-engines-algorithmic-curation-as-part-of-a-healthy-information-diet/
featured_image: /wp-content/uploads/2020/09/maxresdefault-825x510.jpg
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:3:"yes";s:2:"id";s:12:"2969b63de7ec";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:130:"https://medium.com/@jamesravey/dark-recommendation-engines-algorithmic-curation-as-part-of-a-healthy-information-diet-2969b63de7ec";}'
categories:
- Uncategorized
---
### In an ever-growing digital landscape filled with more content than a person can consume in their lifetime, recommendation engines are a blessing but can also be a curse, and understanding their strengths and weaknesses is a vital skill as part of a balanced media diet.
If you remember when connecting to the internet involved a squawking modem and images that took 5 minutes to load, then you probably discovered your favourite musician after hearing them on the radio, reading about them in NME or being told about them by a friend. Likewise you probably discovered your favourite TV show by watching live terrestrial TV, your favourite book by taking a chance at your local library and your favourite movie at a cinema. You only saw the movies that had cool TV ads or rave reviews &#8211; you couldn&#8217;t afford to take a chance on a dud when one ticket, plus bus fare, plus popcorn and a drink cost more than two weeks&#8217; pocket money.
In the year 2020 you can plug your phone into your car, load up Spotify and instantly access over 40 million songs at the touch of a button. You can watch almost any TV show or movie from the last 60 years from your couch. You can read almost any book ever written for free or next to nothing online (especially if your library has free ebook access [like mine][1]). In the space of a few years, our media consumption habits have COMPLETELY changed and that is wonderful and amazing in a utopian, Star Trek &#8220;land of plenty&#8221; kind of way.
Unfortunately there&#8217;s a downside to having access to the entirety of humanity&#8217;s collective knowledge at the click of a button. With so much choice and [3 weeks of video content being added to youtube every minute][2] it is easy to become overwhelmed. Humans aren&#8217;t good at making choices when there are too many options. We are overcome with [analysis paralysis][3] and, if it is left unchecked, we can waste hours of our lives scrolling Netflix, reading show synopses but never watching any shows. After all, time is precious and a 90 minute movie is a sizeable, non-refundable investment. What if you don&#8217;t like it when there are thousands of hours of other movies that you could be watching instead that might be better? Solving this problem across all sorts of media (news articles, movies, songs, video games) was the original motivation behind recommendation systems.
## Recommender Systems 101
Recommendation engines are all about driving people towards a certain type of content &#8211; in the use case above, it&#8217;s about driving people towards stuff they&#8217;ll like so that they feel like they&#8217;re getting value out of the platform they&#8217;re paying for and they continue to use the platform. There are a few different ways that recommender systems work but here are the basics:
#### Collaborative Recommendation
**_If Bob buys nappies (diapers) and Fred buys diapers AND powdered milk, then maybe we should recommend powdered milk to Bob._**
The above sentence summarises the underlying theory behind collaborative recommenders. We can build a big table of all of our customers and the products that they bought (or movies that they watched) and use a technique called [matrix factorization][4] to find sets of products that commonly get consumed together, then find users who have already consumed a subset of those products and recommend the missing piece. The below video explains this concept in more detail.<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio">
<div class="wp-block-embed__wrapper">
<span class="embed-youtube" style="text-align:center; display: block;"><iframe class='youtube-player' width='660' height='372' src='https://www.youtube.com/embed/ZspR5PZemcs?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent' allowfullscreen='true' style='border:0;' sandbox='allow-scripts allow-same-origin allow-popups allow-presentation'></iframe></span>
</div></figure>
Collaborative filtering has a neat little surprise up its sleeve: emergent novelty. The chances are that someone you don&#8217;t know who has similar taste to you is in a good position to introduce you to new content that you didn&#8217;t know you liked. If Bob buys a coffee machine and we recommend it to Fred, the latter user might go &#8220;oh wow, I am pretty tired, I hadn&#8217;t considered a coffee machine &#8211; neat!&#8221; Of course this can have the opposite effect too.
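As a toy illustration of the matrix factorisation idea (a sketch only &#8211; the ratings matrix is made up, and real recommender systems treat missing entries more carefully than this), we can factorise a tiny user-by-product table and read candidate recommendations off the reconstruction:

<pre class="wp-block-code"><code>import numpy as np
from sklearn.decomposition import NMF

#                nappies  milk  coffee   (0 = not purchased yet)
ratings = np.array([
    [5, 0, 0],   # Bob
    [5, 4, 3],   # Fred
    [0, 0, 5],   # Janet
], dtype=float)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
user_factors = model.fit_transform(ratings)   # latent factors per user
item_factors = model.components_              # latent factors per product

# High values in previously-empty cells are candidate recommendations.
predicted = user_factors @ item_factors
print(np.round(predicted, 1))
</code></pre>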
#### Content-based Recommendation
**_Bob likes Terminator 2, which has the properties &#8216;science fiction&#8217;, &#8217;80s movie&#8217; and &#8216;directed-by-James-Cameron&#8217;; he might also therefore like &#8220;Aliens&#8221;._**
Content-based recommenders, as the summary above suggests, are all about taking properties of the content and using them to draw similarities with other content that might interest the user. Content-based recommendation is more computationally expensive than collaborative filtering since you need to extract &#8216;features&#8217; of the things you&#8217;re recommending at scale (e.g. you might build an algorithm that looks at every frame of every movie in your collection and checks for cyborgs). It&#8217;s also very hard to do feature extraction on physical products, so e-commerce sites tend to stick to collaborative approaches.
Content-based recommenders can sometimes get stuck in an echo-chamber mode of recommending very &#8216;samey&#8217; stuff all the time &#8211; there&#8217;s no element of surprise or novelty like you&#8217;d get with collaborative filtering.
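A minimal sketch of the content-based idea (with a few hand-picked, illustrative binary features rather than anything extracted automatically):

<pre class="wp-block-code"><code>import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

films = ["Terminator 2", "Aliens", "Ghostbusters"]
#                     sci-fi  80s  comedy  cameron
features = np.array([[1,      1,   0,      1],
                     [1,      1,   0,      1],
                     [1,      1,   1,      0]])

sims = cosine_similarity(features)
liked = 0  # Bob likes Terminator 2
ranked = np.argsort(-sims[liked])
print([films[i] for i in ranked if i != liked])  # most similar films first
</code></pre>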
#### Hybrid Content-Collaborative Recommendation
**Bob likes Terminator &#8211; an 80s sci-fi movie. Fred likes Terminator &#8211; an 80s sci-fi movie &#8211; and Aliens. Janet likes Ghostbusters, an 80s sci-fi comedy. Recommend Aliens and Terminator to Janet and Ghostbusters to Bob and Fred.**
In this mode of operating, we get the best of both worlds. Terminator and Aliens have a very different tone to Ghostbusters but there&#8217;s a decent chance that Bob and Fred would like it and there&#8217;s some &#8216;feature&#8217; overlap between the three movies (80s, sci-fi).
Hybrid recommendation is also pretty useful when you have limited information about your users because they only just joined or they didn&#8217;t use your system very much yet (This is known as the cold start problem). For example, if a new user, Rachael, comes along we can&#8217;t use collaborative filtering because we don&#8217;t know what films she likes and what other users with her taste have watched. However, we could give her an on-boarding questionnaire and if she tells us she likes 80s sci-fi but not comedy then we can recommend Aliens, Terminator and not Ghostbusters. The more we learn about her, the better these recommendations will get.
## Manipulation and ulterior motive: the dark side of recommendation engines
Recommendation engines are a great way to introduce people to movies, songs, news articles and even physical products that they might be interested in. But what if the motivation behind your recommendation system is no longer to make the user happy? As long as we have a large, consistent set of data relating products (movies/songs/books etc) to users we can train a recommendation engine to optimise itself towards that end. We could train a recommender that always makes terrible recommendations by flipping the dataset we collected about what users like &#8211; not a particularly useful exercise but it could be fun.
What if the recommendation engine serving up your news articles isn&#8217;t optimised to show you what you like but in fact is optimised to show you more of what keeps you engaged? There may be some overlap here but the distinction is **key**. All the system&#8217;s owner would need to do is collect a table of content that the user likes or comments on or shares.
The phrase &#8220;there&#8217;s no such thing as bad press&#8221; is a lot older than social media but has never been more relevant. For decades, traditional print media outlets have used bad news and emotive content to sell more papers. Journalists have become experts at politicising and polarising everything from [avocados][5] to [gen z][6]. Online news outlets use a similar mechanism.
Online news outlets don&#8217;t make money from selling print media but from selling space on their websites for showing adverts and they get paid for every person who clicks on an advert. It&#8217;s probably only 1 in 1000 people that clicks on an ad but if 100,000 people read your article then maybe you&#8217;ll get 100 clicks. This has given rise to [&#8220;clickbait&#8221;][7] headlines that use misleading exaggeration to pull users in to what is more often than not an article of dubious or no interest. Clickbait, at least, is usually fairly easy to detect since the headlines are pretty formulaic and open ended (that&#8217;s my one neat trick that journalists hate me for).
Social networks, like online news outlets, also make money from driving users towards adverts. Most people would read a news article once and close the page, 1 in 1000 of them might click a relevant advert while they&#8217;re at it. However, users typically spend a lot more time on a social network site, liking their neighbour&#8217;s cat picture, wishing their great aunt a happy birthday, getting into arguments and crucially clicking adverts. The longer you spend on the social network site, the more adverts you&#8217;re exposed to and maybe, just maybe, if you see the picture of the new coffee machine enough times you&#8217;ll finally click buy.
So how can social networks keep users clicking around for as long as possible? Maybe by showing them content that piques their interest, that they respond emotionally to, that they want to share with their friends and comment on. How can they make sure that the content that they show is relevant and engaging? Well they can use recommendation engines!
#### A recipe for a &#8220;dark&#8221; recommendation engine
In order to train a pretty good hybrid recommendation engine that can combine social recommendations with &#8220;features&#8221; of the content to get relevant data we need:
1. Information about users &#8211; what they like, what they dislike, what they had for breakfast (they know it was a muffin and a latte from that cute selfie uploaded at Starbucks this morning), what their political alignment is (from when they joined the &#8220;Socialist memes for marxist teens&#8221; facebook group) &#8211; **CHECK**
2. Information about the content &#8211; what&#8217;s the topic? Does it use emotive words/swears? Does it have a strong political alignment either way? &#8211; using Natural Language Processing they can automatically find all of this information for millions of articles at a time &#8211; **CHECK**
3. Information about users who interact with certain content &#8211; they know who commented on what. They know that the photo of your breakfast got 25 likes 2 comments and that the news article in the Washington Post about Trump got 1500 likes, 240 angry reacts and 300 comments. They also know that 250 of the 300 comments were left by people from the left-wing of politics &#8211; **CHECK**
That&#8217;s all they need to optimise for &#8220;engagement&#8221;. A hybrid recommendation engine can learn that putting pro-Trump articles in front of people who like &#8220;Bernie 2020&#8221; is going to drive a lot of &#8220;engagement&#8221; and it can learn that displaying articles branding millenials as lazy and workshy in front of 20-to-30-somethings is going to drive a lot of &#8220;engagement&#8221; too.
Recommendation engines can learn to only ever share left wing content with left wing people, likewise for right-wingers &#8211; creating an echo-chamber effect. Even worse, articles containing misinformation can be promoted to the top of everyone&#8217;s &#8220;to read&#8221; list because of the controversial response they will receive.
These effects contribute to the often depressing and exhausting experience of spending time on a social media site in 2020. You might come away miserable but the algorithm has done its job &#8211; it&#8217;s kept a large number of people engaged with the site and exposed them to lots of adverts.
## Good news everyone!
Let&#8217;s face it &#8211; it&#8217;s not all bad &#8211; I love pictures of cats sat in boxes and the algorithms have learned this. Spotify has exposed me to a number of bands that I absolutely love and that would never get played on the local terrestrial radio station I periodically listen to in the car. I&#8217;ve found shows and books I adore on Netflix and Kindle. I&#8217;ve found loads of scientific papers that were very relevant for my research into NLP using sites like [Semantic Scholar][8].
I guess it&#8217;s also worth noting that the motivation of media platforms like Netflix and Spotify is to help you enjoy yourself so that you pay your subscription, as opposed to &#8216;free&#8217; social sites that are happy to make you miserable if it means that you&#8217;ll use them for longer.
The aim of this article was to show you how recommendation engines work and why the motivation for building them is **SO IMPORTANT.** Secondly, I wanted to show you that it&#8217;s important for us to diversify our information intake beyond what the big social media platforms spoon feed us.
You can use sites like reddit where content is aggregated by human votes rather than machines (although fair warning, controversial material can still be disproportionately represented and certain subreddits might depress you more than your social media feed).
You can use chronological social media systems like [mastodon][9] that don&#8217;t shuffle content around hoping to get you to bite on something juicy. I can also recommend the use of RSS reader systems like [Feedly][10] which aggregate content from blog sites in chronological order with minimal interference.
Finally I want to issue a rallying cry to fellow machine learning engineers and data scientists to really think about the recommendation systems that you&#8217;re building and the optimisation mission you&#8217;ve been set. Would you let your family use it or would it make them miserable? Be responsible and be kind.
[1]: https://www.hants.gov.uk/librariesandarchives/library/whatyoucanborrow/ebooksaudiobooks
[2]: https://tubularinsights.com/youtube-300-hours/
[3]: https://www.ted.com/talks/barry_schwartz_the_paradox_of_choice
[4]: https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)
[5]: https://www.theguardian.com/lifeandstyle/2017/may/15/australian-millionaire-millennials-avocado-toast-house
[6]: https://www.benzinga.com/fintech/20/09/17377335/the-pandemic-is-contributing-to-financial-scams-and-generation-z-is-especially-vulnerable
[7]: https://www.merriam-webster.com/dictionary/clickbait
[8]: http://semanticscholar.org/
[9]: https://mastodon.social/about
[10]: https://feedly.com/

View File

@ -0,0 +1,29 @@
---
title: Do more than kick the tires of your NLP model
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=498
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";N;s:2:"id";N;s:21:"follower_notification";N;s:7:"license";N;s:14:"publication_id";N;s:6:"status";N;s:3:"url";N;}'
categories:
- Uncategorized
---
### _We&#8217;ve known for a while that &#8216;accuracy&#8217; doesn&#8217;t tell you much about your machine learning models but now we have a better alternative!_
&#8220;So how accurate is it?&#8221; &#8211; a phrase that many data scientists like myself fear and dread being asked by business stakeholders. It&#8217;s not that I fear I&#8217;ve done a bad job, but that evaluation of model performance is complex and multi-faceted and summarising it with a single number usually doesn&#8217;t do it justice. Accuracy can also be a communications hurdle &#8211; it is not an intuitive concept and it can lead to friction and misunderstanding if you&#8217;re not &#8216;in&#8217; with the AI crowd. 50% accuracy from a model that has 1500 possible answers could be considered pretty good. 80% accuracy in a task setting where the data is split 90:10 across two classes is meaningless (simply always guessing the majority class would beat the model).
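To see why, here is a tiny sketch (with made-up labels) of a &#8220;model&#8221; that always predicts the majority class: its headline accuracy looks respectable while its F1-score on the minority class is zero.

<pre class="wp-block-code"><code>from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10   # 90:10 class split
y_pred = [0] * 100             # always guess the majority class

print(accuracy_score(y_true, y_pred))             # 0.9 - looks "accurate"
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 - useless on the minority class
</code></pre>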
I&#8217;ve written before about [how we can use finer-grained metrics like Recall, Precision and F1-score to evaluate machine learning models][1]. However, many of us in the AI/NLP community still feel that these metrics are too simplistic and do not adequately describe the characteristics of trained ML models. Unfortunately, we didn&#8217;t have many other options for evaluating model performance&#8230; until now that is&#8230;
## Checklist &#8211; When machine learning met test automation
At the Annual Meeting of the Association for Computational Linguistics 2020 &#8211; a very popular academic conference on NLP &#8211; [Ribeiro et al presented a new method for evaluating NLP models,][2] inspired by principles and techniques that software quality assurance (QA) specialists have been using for years.
The idea is that we should design and implement test cases for NLP models that reflect the tasks that the model will be required to perform &#8220;in the wild&#8221;. Like software QA, these test cases should include tricky edge cases that may trip the model up in order to understand the practical limitations of the model.
For example, we might train a named entity recognition model that
[1]: https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/
[2]: https://www.aclweb.org/anthology/2020.acl-main.442.pdf

View File

@ -0,0 +1,161 @@
---
title: 'DVC and Backblaze B2 for Reliable & Reproducible Data Science'
author: James
type: post
date: 2020-11-27T15:43:48+00:00
url: /2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/
featured_image: /wp-content/uploads/2020/11/pexels-panumas-nikhomkhai-1148820-825x510.jpg
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:3:"yes";s:2:"id";s:12:"d44d231b648f";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:103:"https://medium.com/@jamesravey/dvc-and-backblaze-b2-for-reliable-reproducible-data-science-d44d231b648f";}'
categories:
- Uncategorized
tags:
- data science
- devops
- machine learning
---
## Introduction
When you&#8217;re working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously not <a href="https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7" data-type="URL" data-id="https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7">really suited to large files</a> and, whilst general purpose solutions exist ([Git LFS][1] being perhaps the most famous and widely adopted), [DVC][2] is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems as well as traditional NFS and SFTP-backed filestores, all listed out [here][3].
Another point in DVC&#8217;s favour is its [powerful dependency system][4] and [the ability to precisely recreate data science projects down to the command line flag][5] &#8211; particularly desirable in academic and commercial R&D settings.
I use data buckets like S3 and Google Cloud Storage at work frequently and they&#8217;re very useful as an off-site backup for large quantities of training data. However, in my personal life my favourite S3-like vendor is [BackBlaze][6], who offer a professional, reliable service with [cheaper data access rates than Amazon and Google][7] and [an S3-compatible API][8] which you can use in many places &#8211; including DVC. If you&#8217;re new to remote storage buckets or you want to try-before-you-buy, BackBlaze offer 10GB of remote storage free &#8211; plenty of room for a few hundred thousand pictures of [dogs and chicken nuggets][9] to train your classifier with.
## Setting up your DVC Project
Configuring DVC to use B2 instead of S3 is actually a breeze once you find the right incantation in the documentation. Our first step, if you haven&#8217;t done it already, is to install DVC. You can download an installer bundle/debian package/RPM package from [their website][2] or, if you prefer, you can install it inside python via `pip install dvc[all]` &#8211; the [all] on the end pulls in all the various DVC remote storage libraries &#8211; you could swap this for [s3] if you just want to use that.
Next you will want to create your data science project &#8211; I usually set mine up like this:
<pre class="wp-block-code"><code>- README.md
- .gitignore &lt;-- prefilled with pythonic ignore rules
- environment.yml &lt;-- my conda environment yaml
- data/
- raw/ &lt;-- raw unprocessed data assets go here
- processed/ &lt;-- partially processed and pre-processed data assets go here
</code></pre>
Now we can initialize git and dvc:
<pre class="wp-block-code"><code>git init
dvc init</code></pre>
## Setting up your Backblaze Bucket and Credentials
Now we&#8217;re going to create our bucket in backblaze. Assuming you&#8217;ve registered an account, you&#8217;ll want to go to &#8220;My Account&#8221; in the top right hand corner, then click &#8220;Create a new bucket&#8221;<figure class="wp-block-image size-large">
<img loading="lazy" width="660" height="210" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=660%2C210&#038;ssl=1" alt="" class="wp-image-515" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?w=1008&ssl=1 1008w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=300%2C96&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=768%2C245&ssl=1 768w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /></figure>
Enter a bucket name (little gotcha: the name must be unique across the whole of backblaze &#8211; not just your account) and click &#8220;Create a Bucket&#8221; taking the default options on the rest of the fields.<figure class="wp-block-image size-large">
<img loading="lazy" width="577" height="647" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?resize=577%2C647&#038;ssl=1" alt="" class="wp-image-517" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?w=577&ssl=1 577w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?resize=268%2C300&ssl=1 268w" sizes="(max-width: 577px) 100vw, 577px" data-recalc-dims="1" /></figure>
Once your bucket is created you&#8217;ll also need to copy down the &#8220;endpoint&#8221; value that shows up in the information box &#8211; we&#8217;ll need this later when we set up DVC.<figure class="wp-block-image size-large">
<img loading="lazy" width="631" height="281" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?resize=631%2C281&#038;ssl=1" alt="" class="wp-image-522" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?w=631&ssl=1 631w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?resize=300%2C134&ssl=1 300w" sizes="(max-width: 631px) 100vw, 631px" data-recalc-dims="1" /></figure>
We&#8217;re also going to need to create credentials for accessing the bucket. Go back to &#8220;My Account&#8221; and then &#8220;App Keys&#8221; and go for &#8220;Add a New Application Key&#8221;<figure class="wp-block-image size-large">
<img loading="lazy" width="660" height="145" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?resize=660%2C145&#038;ssl=1" alt="" class="wp-image-518" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?w=678&ssl=1 678w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?resize=300%2C66&ssl=1 300w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /></figure>
Here you can enter a memorable name for this key &#8211; by convention I normally use the name of the experiment or model that I&#8217;m training. <figure class="wp-block-image size-large">
<img loading="lazy" width="570" height="574" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=570%2C574&#038;ssl=1" alt="" class="wp-image-519" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?w=570&ssl=1 570w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=298%2C300&ssl=1 298w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=150%2C150&ssl=1 150w" sizes="(max-width: 570px) 100vw, 570px" data-recalc-dims="1" /></figure>
You can leave all of the remaining options with default/empty values or you can use these to lock down your security if you have multiple users accessing your account (or in the event that your key got committed to a public github repo) &#8211; for example we could limit this key to only the bucket we just created or only folders with a certain prefix within this bucket. For this tutorial I&#8217;m assuming you left these as they were and if you change them, your mileage may vary.<figure class="wp-block-image size-large">
<img loading="lazy" width="598" height="208" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?resize=598%2C208&#038;ssl=1" alt="" class="wp-image-521" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?w=598&ssl=1 598w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?resize=300%2C104&ssl=1 300w" sizes="(max-width: 598px) 100vw, 598px" data-recalc-dims="1" /></figure>
Once you create the key you will need to copy down the keyID and applicationKey values &#8211; heed the warning &#8211; they will only appear once and as soon as you move off this page it will be gone forever unless you copy the values somewhere safe. It&#8217;s not the end of the world since we can create more keys but still a bit annoying to have to go through again.
If you&#8217;ve got the name of your bucket, your endpoint, your keyID and applicationKey values stored somewhere safe then we&#8217;re done here and we can move on to the next step.
## Configuring your DVC &#8216;remote&#8217;
With our bucket all set up, we can configure DVC to talk to Backblaze. First we add a new remote to DVC. The `-d` flag sets this remote as the default, so that when we push, the data is sent to this location without us having to name it explicitly.
<pre class="wp-block-code"><code>dvc remote add b2 s3://your-bucket-name/</code></pre>
So DVC knows about our bucket but unless we tell it otherwise it will assume that it&#8217;s an Amazon S3 bucket rather than a B2 bucket. We need to tell it our endpoint value:
<pre class="wp-block-code"><code>dbc remote modify b2 endpointurl https://s3.us-west-002.backblazeb2.com</code></pre>
You&#8217;ll see that I&#8217;ve copied and pasted the endpoint from when I set up my bucket and stuck &#8220;https://&#8221; on the front, which DVC needs in order to form a valid URL.
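If you want to sanity-check what those two commands did, you can peek at the DVC config file. The snippet below is a rough sketch of what it should now contain &#8211; the exact layout can vary between DVC versions, and the bucket name and endpoint will be whatever you entered:

<pre class="wp-block-code"><code># show the DVC configuration we just created
cat .dvc/config

# expected output, roughly:
# [core]
#     remote = b2
# ['remote "b2"']
#     url = s3://your-bucket-name/
#     endpointurl = https://s3.us-west-002.backblazeb2.com</code></pre>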
## Authenticating DVC
Next we need to tell DVC about our auth keys. [In the DVC manual][10] they show that you can use the `dvc remote modify` command to permanently store your access credentials in the DVC config file. However, this stores your super-duper secret credentials in plain text in a file called `.dvc/config`, which gets committed to your git repository &#8211; meaning that if you&#8217;re storing your work on GitHub, Joe Public could come along and start messing with your private bucket.
Instead I advocate the following approach. Firstly, in our `.gitignore` file at the top level of the project (create one if it doesn&#8217;t exist), add a line that says `.env`.
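If you prefer to do that from the terminal, a one-liner along these lines will do it (assuming you&#8217;re in the project root):

<pre class="wp-block-code"><code>echo ".env" >> .gitignore</code></pre>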
Now we&#8217;re going to create a new file &#8211; again at the top level of our project directory &#8211; called `.env`, and paste in the following:
<pre class="wp-block-code"><code>export AWS_ACCESS_KEY_ID='&lt;keyID>'
export AWS_SECRET_ACCESS_KEY='&lt;applicationKey>'</code></pre>
Replace `<keyID>` and `<applicationKey>` with the values from the Backblaze web UI that we copied earlier.
What we&#8217;ve just done is create a local file containing our credentials that git is not permitted to store in the repository, and it&#8217;s easy enough to use those credentials with DVC from the terminal by running `source .env` first &#8211; don&#8217;t worry, I&#8217;ll show you how in a moment.
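If you want a quick sanity check that the file is being picked up, something like this should print both variables (purely an illustrative check, not a required step):

<pre class="wp-block-code"><code>source .env
env | grep AWS_    # should list AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY</code></pre>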
Finally we can run `git add .dvc` followed by a `git commit` to lock in our dvc configuration in this git repository.
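That step looks roughly like this (the commit message is just an example):

<pre class="wp-block-code"><code>git add .dvc .gitignore
git commit -m "Configure DVC remote for Backblaze B2"</code></pre>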
## Adding files to DVC
Ok, so imagine you have a folder full of images for your neural model to train on, stored in `data/raw/training-data`. We&#8217;re going to add this to DVC with:
<pre class="wp-block-code"><code>dvc add data/raw/training-data</code></pre>
After you run this, you&#8217;ll get a message along these lines:
<pre class="wp-block-code"><code>100% Add|████████████████████████████████████████████████████████████|1/1 &#91;00:01, 1.36s/file]
To track the changes with git, run:
git add data/raw/.gitignore data/raw/training-data/001.jpg</code></pre>
Go ahead and execute the git command now. This will update your git repository so that the actual data (the pictures of dogs and chicken nuggets) is gitignored, while the .dvc files &#8211; which contain metadata about those files and where to find them &#8211; are added to the repository. When you&#8217;re ready, `git commit` to save the metadata about the data to git permanently.
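As a rough sketch, following the suggestion DVC printed above (your file list will almost certainly differ):

<pre class="wp-block-code"><code>git add data/raw/.gitignore data/raw/training-data/001.jpg
git commit -m "Track training data with DVC"</code></pre>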
## Storing DVC data in Backblaze
Now for the acid test: this next step will push your data to your Backblaze bucket if we have everything configured correctly. Simply run:
<pre class="wp-block-code"><code>source .env
dvc push</code></pre>
At this point you&#8217;ll either get error messages or a bunch of progress bars that fill up as the images in your folder are uploaded. Once the process is finished you&#8217;ll see a summary that says `N files pushed`, where N is the number of pictures you had in your folder. If that happened then congratulations &#8211; you&#8217;ve successfully configured DVC and Backblaze.
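If you want to double-check that everything made it, `dvc status -c` compares your local cache against the remote (again, just an optional sanity check):

<pre class="wp-block-code"><code>source .env
dvc status -c    # should report that the cache and the remote are in sync</code></pre>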
## Getting the data back on another machine
If you want to work on this project with your friends, or you want to check out the project on your other laptop, then you or they will need to install git and DVC before checking out your project from GitHub (or wherever your project is hosted). Once they have a local copy they should be able to go into the `data/raw/training-data` folder and they will see all of the `*.dvc` files describing where the training data is.
Your git repository should already have all of your DVC configuration in it, including the endpoint URL for your bucket. However, in order to fetch the data itself they will first need to create a `.env` file of their own containing a key pair (ideally one that you&#8217;ve generated for them and locked down as much as possible to just the project you&#8217;d like to collaborate on). Then they will need to run:
<pre class="wp-block-code"><code>source .env
dvc pull</code></pre>
This should begin the process of downloading your files from Backblaze and making local copies of them in `data/raw/training-data` (`dvc pull` fetches the data from the remote and checks it out into the workspace in one step).
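Putting the collaborator&#8217;s side together, the whole flow looks something like this (the repository URL is a placeholder):

<pre class="wp-block-code"><code>git clone https://github.com/your-username/your-project.git
cd your-project
# create a .env file containing the key pair you generated for them, then:
source .env
dvc pull</code></pre>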
## Streamlining Workflows
One final tip I&#8217;d offer is `dvc install`, which adds git hooks so that `dvc push` is triggered automatically whenever you `git push`, and `dvc checkout` runs automatically after `git checkout` &#8211; saving you from manually running those steps and keeping your data assets in step when you&#8217;re working with different data on different project branches.
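For completeness, it&#8217;s a single command run from inside the repository:

<pre class="wp-block-code"><code>dvc install    # installs git hooks (pre-commit, post-checkout, pre-push) into .git/hooks</code></pre>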
## Final Thoughts
Congratulations &#8211; if you got this far it means you&#8217;ve configured DVC and Backblaze B2 and have a perfectly reproducible data science workflow at your fingertips. This workflow is well optimised for teams of people working on data science experiments that need to be repeatable, or that have large volumes of unwieldy data that needs a better home than git.
_If you found this post useful please leave claps and comments or follow me on twitter [@jamesravey][11] for more._
[1]: https://git-lfs.github.com/
[2]: https://dvc.org/
[3]: https://dvc.org/doc/command-reference/remote/add
[4]: https://dvc.org/doc/command-reference/dag
[5]: https://dvc.org/doc/command-reference/run
[6]: https://www.backblaze.com/
[7]: https://www.backblaze.com/b2/cloud-storage.html
[8]: https://www.backblaze.com/b2/docs/s3_compatible_api.html
[9]: http://www.mtv.com/news/2752312/bagel-or-dog-or-fried-chicken-or-dog/
[10]: https://dvc.org/doc/command-reference/remote/modify#example-customize-an-s3-remote
[11]: https://twitter.com/jamesravey

View File

@ -0,0 +1,19 @@
---
title: Easy MLFlow Server Hosting with Docker-Compose
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=532
medium_post:
- 'O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";N;s:2:"id";N;s:21:"follower_notification";N;s:7:"license";N;s:14:"publication_id";N;s:6:"status";N;s:3:"url";N;}'
categories:
- Uncategorized
---
At Filament we&#8217;re really big fans of MLFlow for managing our ML model lifecycle from experiment to deployment. I won&#8217;t go into the [many advantages][1] of using this software since [many others][2] have done a good job of this before me.
If you&#8217;re bought in
[1]: https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
[2]: https://towardsdatascience.com/5-tips-for-mlflow-experiment-tracking-c70ae117b03f

Binary file not shown.

After

Width:  |  Height:  |  Size: 515 KiB