PhD on Brainsteam
https://brainsteam.co.uk/categories/phd/
Recent content in PhD on Brainsteam
© James Ravenscroft 2020
Tue, 15 Jan 2019 18:14:16 +0000

Spacy Link or “How not to keep downloading the same files over and over”
https://brainsteam.co.uk/2019/01/15/spacy-link-or-how-not-to-keep-downloading-the-same-files-over-and-over/
Tue, 15 Jan 2019 18:14:16 +0000

If you’re a frequent user of spacy and virtualenv you might well be all too familiar with the following:

python -m spacy download en_core_web_lg
Collecting en_core_web_lg==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz#egg=en_core_web_lg==2.0.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)
5% |█▉ | 49.8MB 11.5MB/s eta 0:01:10

If you’re lucky and you have a decent internet connection then great; if not, it’s time to make a cup of tea. Even if your internet connection is good, did you ever stop to look at how much disk space your Python virtual environments were using up?

Uploading HUGE files to Gitea
https://brainsteam.co.uk/2018/10/20/uploading-huge-files-to-gitea/
Sat, 20 Oct 2018 10:09:41 +0000

I recently stumbled upon and fell in love with Gitea – a lightweight self-hosted GitHub and GitLab alternative written in the Go programming language. One of my favourite things about it – other than the speed and efficiency that mean you can even run it on a Raspberry Pi – is the built-in LFS support. For the unfamiliar, LFS is a protocol initially introduced by GitHub that allows users to version control large binary files – something that Git is traditionally pretty poor at.

Don’t forget your life jacket: the ‘dangers’ of diving in deep at the deep end with deep learning
https://brainsteam.co.uk/2018/10/18/dont-forget-your-life-jacket-the-dangers-of-diving-in-deep-at-the-deep-end-with-deep-learning/
Thu, 18 Oct 2018 14:35:05 +0000

Deep Learning is a powerful technology but you might want to try some “shallow” approaches before you dive in.

Neural networks are made up of neurones and synapses

It’s unquestionable that over the last decade, deep learning has changed the machine learning landscape for the better. Deep Neural Networks (DNNs), first popularised by Yann LeCun, Yoshua Bengio and Geoffrey Hinton, are a family of machine learning models that are capable of learning to see and categorise objects, predict stock market trends, understand written text and even play video games.

Programmatically Downloading Open Access Papers
https://brainsteam.co.uk/2018/04/13/programmatically-downloading-open-access-papers/
Fri, 13 Apr 2018 16:04:47 +0000

(Cover image “Unlocked” by Sean Hobson)

If you’re an academic or you’ve got an interest in reading scientific papers, you’ve probably run into paywalls that demand tens or even hundreds of £ just to read a scientific paper. It’s ok if you’re affiliated with a university that has access to that journal, but it can sometimes be luck of the draw as to whether your institute has access, and even if it does, sometimes the SAML login processes don’t work and you still can’t see the paper.

Part time PhD: Mini-Sabbaticals
https://brainsteam.co.uk/2018/04/05/phd-mini-sabbaticals/
Thu, 05 Apr 2018 13:08:51 +0000

Avid readers amongst you will know that I’m currently in the third year of my PhD in Computational Linguistics at the University of Warwick whilst also serving as CTO at Filament. It is an incredibly exciting pair of positions that certainly has its challenges and would be untenable without an incredibly supportive set of PhD supervisors (Amanda Clare and Maria Liakata) and an equally supportive and understanding pair of company directors (Phil and Doug).

Why I keep going back to Evernote
https://brainsteam.co.uk/2017/08/03/182/
Thu, 03 Aug 2017 08:27:53 +0000

As the CTO for a London machine learning startup and a PhD student at Warwick Institute for the Science of Cities, to say I’m busy is an understatement. At any given point in time, my mind is awash with hundreds of ideas around Filament tech strategy, a cool app I’d like to build, ways to measure scientific impact, wondering what the name of that new song I heard on the radio was or some combination thereof.

Dialect Sensitive Topic Models
https://brainsteam.co.uk/2017/07/25/dialect-sensitive-topic-models/
Tue, 25 Jul 2017 11:02:42 +0000

As part of my PhD I’m currently interested in topic models that can take into account the dialect of the writing. That is, how can we build a model that can compare topics discussed in different dialectal styles, such as scientific papers versus newspaper articles? If you’re new to the concept of topic modelling then this article can give you a quick primer.

Vanilla LDA

A diagram of how the latent variables in an LDA model are connected

Vanilla topic models such as Blei’s LDA are great but start to fall down when the wording around one particular concept varies too much.

Exploring Web Archive Data – CDX Files
https://brainsteam.co.uk/2017/06/05/exploring-web-archive-data-cdx-files/
Mon, 05 Jun 2017 07:24:22 +0000

I have recently been working in partnership with UK Web Archive in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps of the rest of the .

timetrack improvements
https://brainsteam.co.uk/2016/12/10/timetrack-improvements/
Sat, 10 Dec 2016 09:33:41 +0000

I’ve just added a couple of improvements to timetrack that allow you to append to existing time recordings (either with an amount like 15m or using live to time additional minutes spent and append them). You can also remove entries using timetrack rm instead of remove – saving keystrokes is what programming is all about. You can find the updated code over at github.

AI can’t solve all our problems, but that doesn’t mean it isn’t intelligent
https://brainsteam.co.uk/2016/12/08/ai-cant-solve-all-our-problems-but-that-doesnt-mean-it-isnt-intelligent/
Thu, 08 Dec 2016 10:08:13 +0000

Thomas Hobbes, perhaps most famous for his thinking on western politics, was also thinking about how the human mind “computes things” 500 years ago.

A recent opinion piece I read on Wired called for us to stop labelling our current specific machine learning models AI because they are not intelligent. I respectfully disagree. AI is not a new concept. The idea that a computer could ‘think’ like a human and one day pass for a human has been around since Turing and even in some form long before him.

We need to talk about push notifications (and why I stopped wearing my smartwatch)
https://brainsteam.co.uk/2016/11/27/we-need-to-talk-about-push-notifications-and-why-i-stopped-wearing-my-smartwatch/
Sun, 27 Nov 2016 12:59:22 +0000

I own a Pebble Steel which I got for Christmas a couple of years ago. I’ve been very happy with it so far. I can control my music player from my wrist, get notifications and see a summary of my calendar. Recently, however, I’ve stopped wearing it. The reason is that constant streams of notifications stress me out and interrupt my workflow; not wearing it makes me feel calmer and more in control and allows me to be more productive.

ElasticSearch: Turning analysis off and why it’s useful
https://brainsteam.co.uk/2015/11/29/elasticsearch-turning-analysis-off-and-why-its-useful/
Sun, 29 Nov 2015 14:59:06 +0000

I have recently been playing with Elasticsearch a lot for my PhD and started trying to do some more complicated queries and pattern matching using the DSL syntax. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in a JSON format. One of the fields is “UOA”, which contains the title of the unit of assessment that the case study belongs to.

Freecite python wrapper
https://brainsteam.co.uk/2015/11/22/freecite-python-wrapper/
Sun, 22 Nov 2015 19:20:19 +0000

I’ve written a simple wrapper around the Brown University Citation parser FreeCite. I’m planning to use the service to pull out author names from references in REF impact studies and try to link them back to investigators listed on RCUK funding applications. The code is here and is MIT licensed.
It provides a simple method which takes a string representing a reference and returns a dict with each field separated. There is also a parse_many function which takes an array of reference strings and returns an array of dicts.

Scrolling in ElasticSearch
https://brainsteam.co.uk/2015/11/21/scrolling-in-elasticsearch/
Sat, 21 Nov 2015 09:41:19 +0000

I know I’m doing a lot of flip-flopping between SOLR and Elastic at the moment – I’m trying to figure out key similarities and differences between them and where one is more suitable than the other. The following is an example of how to map a function f onto an entire set of indexed data in Elastic using the scroll API. If you use Elastic, it is possible to do paging by adding a size and a from parameter.

Keynote at YDS 2015: Information Discovery, Partridge and Watson
https://brainsteam.co.uk/2015/11/02/keynote-at-yds-2015-information-discovery-partridge-and-watson/
Mon, 02 Nov 2015 21:07:28 +0000

Here is a recording of my recent keynote talk on the power of Natural Language Processing through Watson and my academic/PhD topic – Partridge – at York Doctoral Symposium.

0-11 minutes – history of mankind, invention and the acceleration of scientific progress (warming people to the idea that farming out your scientific reading to a computer is a much better idea than trying to read every paper written)
11-26 minutes – My personal academic work – scientific paper annotation and cognitive scientific research using NLP
26-44 minutes – Watson – Jeopardy, MSK and Ecosystem
44-48 minutes – Q&A on Watson and Partridge

Please don’t cringe too much at my technical explanation of Watson – especially those of you who know much more about WEA and the original DeepQA setup than I do!

SAPIENTA Web Service and CLI
https://brainsteam.co.uk/2015/11/01/sapienta-web-service-and-cli/
Sun, 01 Nov 2015 19:50:52 +0000

Hoorah! After a number of weeks I’ve finally managed to get SAPIENTA running inside docker containers on our EBI cloud instance. You can try it out at http://sapienta.papro.org.uk/. The project was previously running via a number of very precarious scripts that had a habit of stopping and not coming back up. Hopefully the new docker environment should be a lot more stable. Another improvement I’ve made is to create a websocket interface for calling the service and a Python-based command-line client.

CUSP Challenge Week 2015
https://brainsteam.co.uk/2015/08/30/cusp-challenge-week-2015/
Sun, 30 Aug 2015 16:52:59 +0000

Warwick CDT intake 2015: From left to right – at the front Jacques, Zakiyya, Corinne, Neha and myself. Rear: David, John, Stephen (CDT director), Mo, Vaggelis, Malkiat and Greg.

Hello again readers – those of you who follow me on other social media (Twitter, Instagram, Facebook etc.) probably know that I’ve just returned from a week in New York City as part of my PhD. My reason for visiting was a kind of ice-breaking activity called the CUSP (Centre for Urban Science + Progress) Challenge Week.

SSSplit Improvements
https://brainsteam.co.uk/2015/07/15/sssplit-improvements/
Wed, 15 Jul 2015 19:33:29 +0000

Introduction

As part of my continuing work on Partridge, I’ve been working on improving the sentence splitting capability of SSSplit – the component used to split academic papers from PLOS ONE and PubMed Central into separate sentences. Papers arrive in our system as big blocks of text with the occasional diagram or formula, and in order to apply CoreSC annotations to the sentences we need to know where each sentence starts and ends.

Tidying up XML in one click
https://brainsteam.co.uk/2015/06/28/tidying-up-xml-in-one-click/
Sun, 28 Jun 2015 10:24:33 +0000

When I’m working on Partridge and SAPIENTA, I find myself dealing with a lot of badly formatted XML. I used to manually run xmllint --format against every file before opening it but that gets annoying very quickly (even if you have it saved in your bash history). So I decided to write a Nemo script that does it automatically for me.

#!/bin/bash
for xmlfile in $NEMO_SCRIPT_SELECTED_FILE_PATHS; do
if [[ $xmlfile == *.
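
The script above is cut off in this feed excerpt, so here is a rough Python sketch of the same idea rather than a reconstruction of it. It swaps xmllint for the standard library's xml.dom.minidom, so the output will not match xmllint --format byte for byte. The function names tidy_xml and tidy_files are made up for illustration, the ".xml" suffix check is an assumption (the original test is truncated), and it assumes Nemo passes the selected paths newline-separated in NEMO_SCRIPT_SELECTED_FILE_PATHS.

```python
import os
import xml.dom.minidom


def tidy_xml(text, indent="  "):
    """Return an indented copy of an XML document (str or bytes).

    Uses minidom as a stand-in for `xmllint --format`.
    """
    dom = xml.dom.minidom.parseString(text)
    pretty = dom.toprettyxml(indent=indent)
    # Input that was already partly indented leaves whitespace-only lines
    # behind, so drop them.
    return "\n".join(line for line in pretty.splitlines() if line.strip()) + "\n"


def tidy_files(paths):
    """Rewrite each file ending in .xml in place; skip everything else."""
    for path in paths:
        if not path.endswith(".xml"):  # assumed suffix, see note above
            continue
        with open(path, "rb") as handle:
            tidied = tidy_xml(handle.read())
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(tidied)


if __name__ == "__main__":
    # Assumption: one selected path per line in this environment variable.
    selected = os.environ.get("NEMO_SCRIPT_SELECTED_FILE_PATHS", "")
    tidy_files(selected.splitlines())
```

Saved as an executable file in Nemo's scripts folder, something like this should appear in the right-click Scripts menu in the same way as the shell version. Note that it overwrites files in place, so it is worth trying on a copy first.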