Brainsteam https://brainsteam.co.uk/ Recent content on Brainsteam Hugo -- gohugo.io en-us © James Ravenscroft 2020 Mon, 12 Apr 2021 20:21:11 +0000 An opinionated guide to Python environments in 2021 https://brainsteam.co.uk/2021/04/01/opinionated-guide-to-virtualenvs/ Mon, 12 Apr 2021 20:21:11 +0000 https://brainsteam.co.uk/2021/04/01/opinionated-guide-to-virtualenvs/ A person overwhelmed by boxes by Cottonbro Note: If you don’t want to read the blah-blah context and history stuff then you can jump to the recommendations The Problem The need for virtual Python environments becomes fairly obvious early in most Python developers' careers when they switch between two projects and realise that they have incompatible dependencies (e.g. project1 needs scikit-learn-0.21 and project2 needs scikit-learn-0.24). Unlike other mainstream languages like Javascript(Node. Reproducing 'ancient' experiments with Pytorch inside docker https://brainsteam.co.uk/2021/03/01/running-old-pytorch-docker/ Mon, 01 Mar 2021 20:21:11 +0000 https://brainsteam.co.uk/2021/03/01/running-old-pytorch-docker/ A beige analog compass by Ylanite Koppens Introduction Open machine learning research is undergoing something of a reproducibility crisis. In fairness it’s not usually the authors' fault - or at least not entirely. We’re a fickle industry and the tools and frameworks that were ‘in vogue’ and state of the art a couple of years ago are now obsolete. Furthermore, academics and open source contributors are under no obligation to keep their code up to date. Pickle 5 Madness with MLFlow and Python 3.6/3.7 https://brainsteam.co.uk/2021/01/14/pickle-5-madness-with-mlflow/ Thu, 14 Jan 2021 11:42:28 +0000 https://brainsteam.co.uk/2021/01/14/pickle-5-madness-with-mlflow/ A jar of pickles by Ksenia Charnaya I recently came across an infuriating problem where an MLFlow python model I had trained on one system using Python 3.6 would not load on another system with an identical version of Python.
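As the post title hints, this kind of failure usually comes down to the pickle protocol a payload was written with. Protocol 5 arrived in Python 3.8, and a stock Python 3.6 or 3.7 interpreter reads only up to protocol 4 unless a backport is installed. The following is a minimal sketch, not the fix from the original post: the helper names `pickle_protocol_of` and `dump_compatible` are made up for illustration, and the sketch uses only the standard library `pickle` module.

```python
import pickle


def pickle_protocol_of(blob):
    """Return the protocol number declared at the start of a pickle byte string.

    Pickles written with protocol 2 or newer begin with the PROTO opcode
    (0x80) followed by one byte holding the protocol number. Protocols 0
    and 1 carry no such header, so None is returned for them.
    """
    if len(blob) >= 2 and blob[0] == 0x80:
        return blob[1]
    return None


def dump_compatible(obj, max_protocol=4):
    """Serialise obj without exceeding max_protocol (hypothetical helper)."""
    protocol = min(max_protocol, pickle.HIGHEST_PROTOCOL)
    return pickle.dumps(obj, protocol=protocol)


if __name__ == "__main__":
    blob = dump_compatible({"weights": [0.1, 0.2]})
    print("declared protocol:", pickle_protocol_of(blob))
    print("highest protocol this interpreter can write:", pickle.HIGHEST_PROTOCOL)
```

Running the check against a saved model file on both machines is a quick way to see whether the two environments really agree, even when the Python version numbers match.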
The exact problem was that when I ran mlflow models serve -m <url/to/model/in/bucket> the service would crash, saying that the model could not be unserialized: ValueError: unsupported pickle protocol: 5. Serving NLP Models with MLflow https://brainsteam.co.uk/2020/12/29/serving-nlp-models-with-mlflow/ Tue, 29 Dec 2020 09:50:28 +0000 https://brainsteam.co.uk/2020/12/29/serving-nlp-models-with-mlflow/ MLFlow is a powerful open source MLOps platform with a built-in framework for serving your trained ML models as REST APIs. The REST framework will load data provided in a JSON or CSV format compatible with pandas and pass this directly into your model. This can be handy when your model is expecting a tabular list of numerical and categorical features. However, it is less clear how to serve models and pipelines that are expecting unstructured text data as their primary input. DVC and Backblaze B2 for Reliable & Reproducible Data Science https://brainsteam.co.uk/2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/ Fri, 27 Nov 2020 15:43:48 +0000 https://brainsteam.co.uk/2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/ Introduction When you’re working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously not really suited to large files and whilst general purpose solutions exist (Git LFS being perhaps the most famous and widely adopted solution), DVC is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems as well as traditional NFS and SFTP-backed filestores, all listed out here. ‘Dark’ Recommendation Engines: Algorithmic curation as part of a ‘healthy’ information diet. 
https://brainsteam.co.uk/2020/09/04/dark-recommendation-engines-algorithmic-curation-as-part-of-a-healthy-information-diet/ Fri, 04 Sep 2020 15:30:19 +0000 https://brainsteam.co.uk/2020/09/04/dark-recommendation-engines-algorithmic-curation-as-part-of-a-healthy-information-diet/ In an ever-growing digital landscape filled with more content than a person can consume in their lifetime, recommendation engines are a blessing but can also be a curse, and understanding their strengths and weaknesses is a vital skill as part of a balanced media diet. If you remember when connecting to the internet involved a squawking modem and images that took 5 minutes to load then you probably discovered your favourite musician after hearing them on the radio, reading about them in NME or being told about them by a friend. PyTorch 1.X.X and Pipenv and Specific versions of CUDA https://brainsteam.co.uk/2020/02/02/pytorch-1-x-x-and-pipenv-and-specific-versions-of-cuda/ Sun, 02 Feb 2020 14:40:46 +0000 https://brainsteam.co.uk/2020/02/02/pytorch-1-x-x-and-pipenv-and-specific-versions-of-cuda/ I recently ran into an issue where the newest version of Torch (as of writing 1.4.0) requires a newer version of CUDA/Nvidia Drivers than I have installed. Last time I tried to upgrade my CUDA version it took me several hours/days so I didn’t really want to have to spend lots of time on that. As it happens PyTorch has an archive of compiled python whl objects for different combinations of Python version (3. How can AI practitioners reduce our carbon footprint? https://brainsteam.co.uk/2019/06/20/how-can-ai-practitioners-reduce-our-carbon-footprint/ Thu, 20 Jun 2019 09:18:40 +0000 https://brainsteam.co.uk/2019/06/20/how-can-ai-practitioners-reduce-our-carbon-footprint/ In recent weeks and months the impending global climate catastrophe has been at the forefront of many people’s minds. 
Thanks to movements like Extinction Rebellion and high profile environmentalists like Greta Thunberg and David Attenborough as well as damning reports from the IPCC, it finally feels like momentum is building behind significant reduction of carbon emissions. That said, knowing how we can help on an individual level beyond driving and flying less still feels very overwhelming. Why I’m excited about Kubernetes + Google Anthos: the Future of Enterprise AI deployment https://brainsteam.co.uk/2019/04/24/why-im-excited-about-kubernetes-google-anthos-the-future-of-enterprise-ai-deployment/ Wed, 24 Apr 2019 10:33:24 +0000 https://brainsteam.co.uk/2019/04/24/why-im-excited-about-kubernetes-google-anthos-the-future-of-enterprise-ai-deployment/ Filament build and deploy enterprise AI applications on behalf of incumbent institutions in finance, biotech, facilities management and other sectors. James Ravenscroft, CTO at Filament, writes about the challenges of enterprise software deployment and the opportunities presented by Kubernetes and Google’s Anthos offering. It is a big myth that bringing a software package to market starts and ends with developers and testers. One of the most important, complex and time consuming parts of enterprise software projects is around packaging up the code and making it run across lots of different systems: commonly and affectionately termed “DevOps” in many organisations. 
Spacy Link or “How not to keep downloading the same files over and over” https://brainsteam.co.uk/2019/01/15/spacy-link-or-how-not-to-keep-downloading-the-same-files-over-and-over/ Tue, 15 Jan 2019 18:14:16 +0000 https://brainsteam.co.uk/2019/01/15/spacy-link-or-how-not-to-keep-downloading-the-same-files-over-and-over/ If you’re a frequent user of spacy and virtualenv you might well be all too familiar with the following: python -m spacy download en_core_web_lg Collecting en_core_web_lg==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz#egg=en_core_web_lg==2.0.0 Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB) 5% |█▉ | 49.8MB 11.5MB/s eta 0:01:10 If you’re lucky and you have a decent internet connection then great; if not, it’s time to make a cup of tea. Even if your internet connection is good, did you ever stop to look at how much disk space your python virtual environments were using up? Applied AI in 2019 https://brainsteam.co.uk/2019/01/06/applied-ai-in-2019/ Sun, 06 Jan 2019 09:52:35 +0000 https://brainsteam.co.uk/2019/01/06/applied-ai-in-2019/ Looking back at some of the biggest AI and ML developments from 2018 and how they might influence applied AI in the coming year. 2018 was a pretty exciting year for AI developments. It’s true to say there is still a lot of hype in the space but it feels like people are beginning to really understand where AI can and can’t help them solve practical problems. In this article we’ll take a look at some of the AI innovations that came out of academia and research teams in 2018 and how they might affect practical AI use cases in the coming year. 🤐🤐Can Bots Keep Secrets? 
The Future of Chatbot Security and Conversational “Hacks” https://brainsteam.co.uk/2018/12/09/%F0%9F%A4%90%F0%9F%A4%90can-bots-keep-secrets-the-future-of-chatbot-security-and-conversational-hacks/ Sun, 09 Dec 2018 10:36:34 +0000 https://brainsteam.co.uk/2018/12/09/%F0%9F%A4%90%F0%9F%A4%90can-bots-keep-secrets-the-future-of-chatbot-security-and-conversational-hacks/ As adoption of chatbots and conversational interfaces continues to grow, how will businesses keep their brand safe and their customers’ data safer? From deliberate infiltration of systems to bugs that cause accidental data leakage, these days, the exposure or loss of personal data is a large part of what occupies almost every self-respecting CIO’s mind. Especially since the EU has just slapped its first defendant with a GDPR fine. Over the last 10-15 years, through the rise of the “interactive” web and social media, many companies have learned the hard way about the importance of techniques like hashing passwords stored in databases and sanitising user input before it is used for querying databases. Why is Tmux crashing on start? https://brainsteam.co.uk/2018/11/07/why-is-tmux-crashing-on-start/ Wed, 07 Nov 2018 07:40:45 +0000 https://brainsteam.co.uk/2018/11/07/why-is-tmux-crashing-on-start/ I spent several hours trying to get to the bottom of why tmux was crashing as soon as I ran it on Fedora. It turns out there’s a simple fix. When tmux starts it uses /dev/ptmx to create a new TTY (virtual terminal) that the user can interact with. If your user does not have permission to access this device then tmux will silently die. A good way to verify this is to try running screen too. 
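The permission check on /dev/ptmx described above can also be done directly rather than by trial and error. Here is a minimal sketch that reports whether the current user can read and write the device. The function name `check_pty_master` is made up for illustration, and the sketch assumes a Linux system where the pseudo-terminal master lives at /dev/ptmx, as in the post.

```python
import os
import stat


def check_pty_master(path="/dev/ptmx"):
    """Report whether the current user can open the pseudo-terminal master.

    Returns a dict describing the device so the result can be printed or
    inspected. Nothing is opened or modified.
    """
    if not os.path.exists(path):
        return {"path": path, "exists": False,
                "readable": False, "writable": False, "mode": None}
    return {
        "path": path,
        "exists": True,
        # os.access checks permissions for the real user id of the process
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
        # Human-readable mode string, in the same style as `ls -l`
        "mode": stat.filemode(os.stat(path).st_mode),
    }


if __name__ == "__main__":
    report = check_pty_master()
    for key, value in report.items():
        print(f"{key}: {value}")
```

If either `readable` or `writable` comes back as False for your user, that would be consistent with the silent crash described in the post.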
Uploading HUGE files to Gitea https://brainsteam.co.uk/2018/10/20/uploading-huge-files-to-gitea/ Sat, 20 Oct 2018 10:09:41 +0000 https://brainsteam.co.uk/2018/10/20/uploading-huge-files-to-gitea/ I recently stumbled upon and fell in love with Gitea – a lightweight self-hosted Github and Gitlab alternative written in the Go programming language. One of my favourite things about it – other than the speed and efficiency that mean you can even run it on a raspberry pi – is the built in LFS support. For the unfamiliar, LFS is a protocol initially introduced by GitHub that allows users to version control large binary files – something that Git is traditionally pretty poor at. Don’t forget your life jacket: the ‘dangers’ of diving in deep at the deep end with deep learning https://brainsteam.co.uk/2018/10/18/dont-forget-your-life-jacket-the-dangers-of-diving-in-deep-at-the-deep-end-with-deep-learning/ Thu, 18 Oct 2018 14:35:05 +0000 https://brainsteam.co.uk/2018/10/18/dont-forget-your-life-jacket-the-dangers-of-diving-in-deep-at-the-deep-end-with-deep-learning/ Deep Learning is a powerful technology but you might want to try some “shallow” approaches before you dive in. Neural networks are made up of neurones and synapses It’s unquestionable that over the last decade, deep learning has changed the machine learning landscape for the better. Deep Neural Networks (DNNs), first popularised by Yann LeCun, Yoshua Bengio and Geoffrey Hinton, are a family of machine learning models that are capable of learning to see and categorise objects, predict stock market trends, understand written text and even play video games. 
GPUs are not just for images any more… https://brainsteam.co.uk/2018/05/13/gpus-are-not-just-for-images-any-more/ Sun, 13 May 2018 07:26:12 +0000 https://brainsteam.co.uk/2018/05/13/gpus-are-not-just-for-images-any-more/ As a machine learning professional specialising in computational linguistics (helping machines to extract meaning from human text), I have confused people on multiple occasions by suggesting that their document processing problem could be solved by neural networks trained using a Graphics Processing Unit (GPU). You’d be well within your rights to be confused. To the uninitiated what I just said was “Let’s solve this problem involving reading lots of text by building a system that runs on specialised computer chips designed specifically to render images at high speed”. Programmatically Downloading Open Access Papers https://brainsteam.co.uk/2018/04/13/programmatically-downloading-open-access-papers/ Fri, 13 Apr 2018 16:04:47 +0000 https://brainsteam.co.uk/2018/04/13/programmatically-downloading-open-access-papers/ (Cover image “Unlocked” by Sean Hobson) If you’re an academic or you’ve got an interest in reading scientific papers, you’ve probably run into paywalls that demand tens or even hundreds of £ just to read a scientific paper. It’s ok if you’re affiliated with a university that has access to that journal but it can sometimes be luck of the draw as to whether your institute has access and even if they do, sometimes the SAML login processes don’t work and you still can’t see the paper. Part time PhD: Mini-Sabbaticals https://brainsteam.co.uk/2018/04/05/phd-mini-sabbaticals/ Thu, 05 Apr 2018 13:08:51 +0000 https://brainsteam.co.uk/2018/04/05/phd-mini-sabbaticals/ Avid readers amongst you will know that I’m currently in the third year of my PhD in Computational Linguistics at the University of Warwick whilst also serving as CTO at Filament. 
An incredibly exciting pair of positions that certainly have their challenges and would be untenable without an incredibly supportive set of PhD supervisors (Amanda Clare and Maria Liakata) and an equally supportive and understanding pair of company directors (Phil and Doug). Re-using machine learning models and the “no free lunch” theorem https://brainsteam.co.uk/2018/03/21/re-using-machine-learning-models-and-the-no-free-lunch-theorem/ Wed, 21 Mar 2018 11:26:27 +0000 https://brainsteam.co.uk/2018/03/21/re-using-machine-learning-models-and-the-no-free-lunch-theorem/ Why re-use machine learning models? Model re-use can be a huge cost saver when developing AI systems. But how well will your models perform in their new environment? You can get a lot of value out of training a machine learning model to solve a single use case, like predicting emotion in your customer chatbot transcripts and putting the angry ones through to real humans. However, you might be able to extract even more value out of your model by using it in more than one use case. How I became a gopher over christmas https://brainsteam.co.uk/2018/01/27/how-i-became-a-gopher/ Sat, 27 Jan 2018 10:09:34 +0000 https://brainsteam.co.uk/2018/01/27/how-i-became-a-gopher/ Happy new year to one and all. It’s been a while since I posted and life continues onwards at a crazy pace. I meant to publish this post just after Christmas but have only found time to sit down and write now. If anyone is wondering what’s with the crazy title – a gopher is someone who practices the Go programming language (just as those who write in Python refer to themselves as pythonistas. Why I keep going back to Evernote https://brainsteam.co.uk/2017/08/03/182/ Thu, 03 Aug 2017 08:27:53 +0000 https://brainsteam.co.uk/2017/08/03/182/ As the CTO for a London machine learning startup and a PhD student at Warwick Institute for the Science of Cities, to say I’m busy is an understatement. 
At any given point in time, my mind is awash with hundreds of ideas around Filament tech strategy, a cool app I’d like to build, ways to measure scientific impact, wondering what the name of that new song I heard on the radio was or some combination thereof. Dialect Sensitive Topic Models https://brainsteam.co.uk/2017/07/25/dialect-sensitive-topic-models/ Tue, 25 Jul 2017 11:02:42 +0000 https://brainsteam.co.uk/2017/07/25/dialect-sensitive-topic-models/ As part of my PhD I’m currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectical styles, such as scientific papers versus newspaper articles. If you’re new to the concept of topic modelling then this article can give you a quick primer. Vanilla LDA A diagram of how latent variables in LDA model are connected Vanilla topic models such as Blei’s LDA are great but start to fall down when the wording around one particular concept varies too much. Exploring Web Archive Data – CDX Files https://brainsteam.co.uk/2017/06/05/exploring-web-archive-data-cdx-files/ Mon, 05 Jun 2017 07:24:22 +0000 https://brainsteam.co.uk/2017/06/05/exploring-web-archive-data-cdx-files/ I have recently been working in partnership with UK Web Archive in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps of the rest of the . 
timetrack improvements https://brainsteam.co.uk/2016/12/10/timetrack-improvements/ Sat, 10 Dec 2016 09:33:41 +0000 https://brainsteam.co.uk/2016/12/10/timetrack-improvements/ I’ve just added a couple of improvements to timetrack that allow you to append to existing time recordings (either with an amount like 15m or using live to time additional minutes spent and append them). You can also remove entries using timetrack rm instead of remove – saving keystrokes is what programming is all about. You can find the updated code over at github. AI can’t solve all our problems, but that doesn’t mean it isn’t intelligent https://brainsteam.co.uk/2016/12/08/ai-cant-solve-all-our-problems-but-that-doesnt-mean-it-isnt-intelligent/ Thu, 08 Dec 2016 10:08:13 +0000 https://brainsteam.co.uk/2016/12/08/ai-cant-solve-all-our-problems-but-that-doesnt-mean-it-isnt-intelligent/ Thomas Hobbes, perhaps most famous for his thinking on western politics, was also thinking about how the human mind “computes things” 500 years ago. A recent opinion piece I read on Wired called for us to stop labelling our current specific machine learning models AI because they are not intelligent. I respectfully disagree. AI is not a new concept. The idea that a computer could ‘think’ like a human and one day pass for a human has been around since Turing and even in some form long before him. We need to talk about push notifications (and why I stopped wearing my smartwatch) https://brainsteam.co.uk/2016/11/27/we-need-to-talk-about-push-notifications-and-why-i-stopped-wearing-my-smartwatch/ Sun, 27 Nov 2016 12:59:22 +0000 https://brainsteam.co.uk/2016/11/27/we-need-to-talk-about-push-notifications-and-why-i-stopped-wearing-my-smartwatch/ I own a Pebble Steel which I got for Christmas a couple of years ago. I’ve been very happy with it so far. I can control my music player from my wrist, get notifications and a summary of my calendar. Recently, however, I’ve stopped wearing it. 
The reason is that constant streams of notifications stress me out and interrupt my workflow; not wearing it makes me feel calmer and more in control, and allows me to be more productive. timetrack – a simple time tracking application for developers https://brainsteam.co.uk/2016/11/23/timetrack-a-simple-time-tracking-application-for-developers/ Wed, 23 Nov 2016 14:43:58 +0000 https://brainsteam.co.uk/2016/11/23/timetrack-a-simple-time-tracking-application-for-developers/ I’ve written a small command line application for tracking my time on my PhD and other projects. We use Harvest at Filament which is great if you’ve got a huge team and want the complexity (and of course license charges) of an online cloud solution for time tracking. If, like me, you’re just interested to see how much time you are spending on your different projects and you don’t have any requirement for fancy web interfaces or client billing, then timetrack might be for you. The builder, the salesman and the property tycoon https://brainsteam.co.uk/2016/11/12/the-builder-the-salesman-and-the-property-tycoon/ Sat, 12 Nov 2016 11:43:24 +0000 https://brainsteam.co.uk/2016/11/12/the-builder-the-salesman-and-the-property-tycoon/ A testament to marketers around the world is the myth that their AI platform X, Y or Z can solve all your problems with no effort. Perhaps it is this, combined with developers and data scientists often being hidden out of sight and out of mind that leads people to think this way. Unfortunately, the truth of the matter is that ML and AI involve blood, sweat and tears – especially if you are building things from scratch rather than using APIs. 
#BlackgangPi – a Raspberry Pi Hack at Blackgang Chine https://brainsteam.co.uk/2016/06/05/blackgangpi-a-raspberry-pi-hack-at-blackgang-chine/ Sun, 05 Jun 2016 07:59:40 +0000 https://brainsteam.co.uk/2016/06/05/blackgangpi-a-raspberry-pi-hack-at-blackgang-chine/ I was very excited to be invited along with some other IBMers to the Blackgang Pi event run by Dr Lucy Rogers on a semi regular basis at the Blackgang Chine theme park on the Isle of Wight. Blackgang Chine is a theme park on the southern tip of the Isle of Wight and holds the title of oldest theme park in the United Kingdom. We were lucky enough to be invited along to help them modernise some of their animatronic exhibits, replacing some of the aging bespoke PCBs and controllers with Raspberry Pis running Node-RED and communicating using MQTT/Watson IOT. Cognitive Quality Assurance Pt 2: Performance Metrics https://brainsteam.co.uk/2016/05/29/cognitive-quality-assurance-pt-2-performance-metrics/ Sun, 29 May 2016 09:41:26 +0000 https://brainsteam.co.uk/2016/05/29/cognitive-quality-assurance-pt-2-performance-metrics/ EDIT: Hello readers, these articles are now 4 years old and many of the Watson services and APIs have moved or been changed. The concepts discussed in these articles are still relevant but I am working on 2nd editions of them. Last time we discussed some good practices for collecting data and then splitting it into test and train in order to create a ground truth for your machine learning system. IBM Watson – It’s for data scientists too! https://brainsteam.co.uk/2016/05/01/ibm-watson-its-for-data-scientists-too/ Sun, 01 May 2016 11:28:13 +0000 https://brainsteam.co.uk/2016/05/01/ibm-watson-its-for-data-scientists-too/ Last week, my colleague Olly and I gave a talk at a data science meetup on how IBM Watson can be used for data science applications. We had an amazing time and got some really great feedback from the event. 
We will definitely be doing more talks at events like these in the near future so keep an eye out for us! I will also be writing a little bit more about the experiment I did around Core Scientific Concepts and Watson Natural Language Classifier in a future blog post. Cognitive Quality Assurance – An Introduction https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/ Tue, 29 Mar 2016 08:50:29 +0000 https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/ EDIT: Hello readers, these articles are now 4 years old and many of the Watson services and APIs have moved or been changed. The concepts discussed in these articles are still relevant but I am working on 2nd editions of them. This article has a slant towards the IBM Watson Developer Cloud Services but the principles and rules of thumb expressed here are applicable to most cognitive/machine learning problems. ElasticSearch: Turning analysis off and why its useful https://brainsteam.co.uk/2015/11/29/elasticsearch-turning-analysis-off-and-why-its-useful/ Sun, 29 Nov 2015 14:59:06 +0000 https://brainsteam.co.uk/2015/11/29/elasticsearch-turning-analysis-off-and-why-its-useful/ I have recently been playing with Elastic search a lot for my PhD and started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in a JSON format. One of the fields is “UOA” which contains the title of the unit of impact that the case study belongs to. Home automation with Raspberry Pi and Watson https://brainsteam.co.uk/2015/11/28/watson-home-automation/ Sat, 28 Nov 2015 10:57:14 +0000 https://brainsteam.co.uk/2015/11/28/watson-home-automation/ I’ve recently been playing with trying to build a Watson powered home automation system using my Raspberry Pi and some other electronic bits that I have on hand. 
There are already a lot of people doing work in this space. One of the most successful projects is JASPER, which uses speech to text and an always-on background listening microphone to talk to you and carry out actions when you ask it things in natural language like “What’s the weather going to be like tomorrow? Freecite python wrapper https://brainsteam.co.uk/2015/11/22/freecite-python-wrapper/ Sun, 22 Nov 2015 19:20:19 +0000 https://brainsteam.co.uk/2015/11/22/freecite-python-wrapper/ I’ve written a simple wrapper around the Brown University Citation parser FreeCite. I’m planning to use the service to pull out author names from references in REF impact studies and try to link them back to investigators listed on RCUK funding applications. The code is here and is MIT licensed. It provides a simple method which takes a string representing a reference and returns a dict with each field separated. There is also a parse_many function which takes an array of reference strings and returns an array of dicts. Scrolling in ElasticSearch https://brainsteam.co.uk/2015/11/21/scrolling-in-elasticsearch/ Sat, 21 Nov 2015 09:41:19 +0000 https://brainsteam.co.uk/2015/11/21/scrolling-in-elasticsearch/ I know I’m doing a lot of flip-flopping between SOLR and Elastic at the moment – I’m trying to figure out key similarities and differences between them and where one is more suitable than the other. The following is an example of how to map a function f onto an entire set of indexed data in elastic using the scroll API. If you use elastic, it is possible to do paging by adding a size and a from parameter. Spellchecking in retrieve and rank https://brainsteam.co.uk/2015/11/17/spellchecking-in-retrieve-and-rank/ Tue, 17 Nov 2015 21:41:09 +0000 https://brainsteam.co.uk/2015/11/17/spellchecking-in-retrieve-and-rank/ Introduction Being able to deal with typos and incorrect spellings is an absolute must in any modern search facility. 
Humans can be lazy and clumsy and I personally often search for things with incorrect terms due to my sausage fingers. In this article I will explain how to turn on spelling suggestions in retrieve and rank so that if your users ask your system for something with a clumsy query, you can suggest spelling fixes for them so that they can submit another, more fruitful question to the system. Retrieve and Rank and Python https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/ Mon, 16 Nov 2015 18:25:39 +0000 https://brainsteam.co.uk/2015/11/16/retrieve-and-rank-and-python/ Introduction Retrieve and Rank (R&R), if you hadn’t already heard about it, is IBM Watson’s new web service component for information retrieval and question answering. My colleague Chris Madison has summarised how it works in a high level way here. R&R is based on the Apache SOLR search engine with a machine learning result ranking plugin that learns what answers are most relevant given an input query and presents them in the learnt “relevance” order. Keynote at YDS 2015: Information Discovery, Partridge and Watson https://brainsteam.co.uk/2015/11/02/keynote-at-yds-2015-information-discovery-partridge-and-watson/ Mon, 02 Nov 2015 21:07:28 +0000 https://brainsteam.co.uk/2015/11/02/keynote-at-yds-2015-information-discovery-partridge-and-watson/ Here is a recording of my recent keynote talk on the power of Natural Language processing through Watson and my academic/PhD topic – Partridge – at York Doctoral Symposium. 
0-11 minutes – history of mankind, invention and the acceleration of scientific progress (warming people to the idea that farming out your scientific reading to a computer is a much better idea than trying to read every paper written) 11-26 minutes – My personal academic work – scientific paper annotation and cognitive scientific research using NLP 26-44 minutes – Watson – Jeopardy, MSK and Ecosystem 44-48 minutes – Q&A on Watson and Partridge Please don’t cringe too much at my technical explanation of Watson – especially those of you who know much more about WEA and the original DeepQA setup than I do! SAPIENTA Web Service and CLI https://brainsteam.co.uk/2015/11/01/sapienta-web-service-and-cli/ Sun, 01 Nov 2015 19:50:52 +0000 https://brainsteam.co.uk/2015/11/01/sapienta-web-service-and-cli/ Hoorah! After a number of weeks I’ve finally managed to get SAPIENTA running inside docker containers on our EBI cloud instance. You can try it out at http://sapienta.papro.org.uk/. The project was previously running via a number of very precarious scripts that had a habit of stopping and not coming back up. Hopefully the new docker environment should be a lot more stable. Another improvement I’ve made is to create a websocket interface for calling the service and a Python-based commandline client. A week in Austin, TX – Watson Labs https://brainsteam.co.uk/2015/10/22/a-week-in-austin-tx-watson-labs/ Thu, 22 Oct 2015 18:10:57 +0000 https://brainsteam.co.uk/2015/10/22/a-week-in-austin-tx-watson-labs/ At the beginning of the month, I was lucky enough to spend a week embedded in the Watson Labs team in Austin, TX. These mysterious and enigmatic members of the Watson family have a super secret bat-cave known as “The Garage” located on the IBM Austin site – to which access is prohibited for normal IBMers unless accompanied by a labs team member. 
During the week I was helping out with a couple of the internal projects but also got the chance to experiment with some of the new Watson Developer Cloud APIs to create some new tools for internal use. CUSP Challenge Week 2015 https://brainsteam.co.uk/2015/08/30/cusp-challenge-week-2015/ Sun, 30 Aug 2015 16:52:59 +0000 https://brainsteam.co.uk/2015/08/30/cusp-challenge-week-2015/ Warwick CDT intake 2015: From left to right – at the front Jacques, Zakiyya, Corinne, Neha and myself. Rear: David, John, Stephen (CDT director), Mo, Vaggelis, Malkiat and Greg Hello again readers – those of you who follow me on other social media (twitter, instagram, facebook etc) probably know that I’ve just returned from a week in New York City as part of my PhD. My reason for visiting was a kind of ice-breaking activity called the CUSP (Centre for Urban Science + Progress) Challenge Week. SSSplit Improvements https://brainsteam.co.uk/2015/07/15/sssplit-improvements/ Wed, 15 Jul 2015 19:33:29 +0000 https://brainsteam.co.uk/2015/07/15/sssplit-improvements/ Introduction As part of my continuing work on Partridge, I’ve been working on improving the sentence splitting capability of SSSplit – the component used to split academic papers from PLosOne and PubMedCentral into separate sentences. Papers arrive in our system as big blocks of text with the occasional diagram or formula and in order to apply CoreSC annotations to the sentences we need to know where each sentence starts and ends. Bedford Place Vintage Festival https://brainsteam.co.uk/2015/06/28/bedford-place-vintage-festival/ Sun, 28 Jun 2015 10:36:28 +0000 https://brainsteam.co.uk/2015/06/28/bedford-place-vintage-festival/ Last week a bunch of my lindyhop group went and performed at the Bedford Place Vintage Festival in Southampton – it’s an annual event that I’ve been to twice now and we had an absolute ball. 
I think I enjoyed it that much more this year purely because I’ve been dancing twice as long now and I can hold my own on the social dance floor. Here’s a video of our crew performing the Shim Sham to “Mama do the hump” Tidying up XML in one click https://brainsteam.co.uk/2015/06/28/tidying-up-xml-in-one-click/ Sun, 28 Jun 2015 10:24:33 +0000 https://brainsteam.co.uk/2015/06/28/tidying-up-xml-in-one-click/ When I’m working on Partridge and SAPIENTA, I find myself dealing with a lot of badly formatted XML. I used to manually run xmllint --format against every file before opening it but that gets annoying very quickly (even if you have it saved in your bash history). So I decided to write a Nemo script that does it automatically for me. #!/bin/sh for xmlfile in $NEMO_SCRIPT_SELECTED_FILE_PATHS; do if [[ $xmlfile == *.