---
date: 2022-05-15 08:05:06+01:00
description: 'A quick summary of where I''ve been hiding since April: working on my
  PhD thesis'
draft: true
post_meta:
- date
preview: /social/7fdd082da17ae97daaba8704f4d539792698effe76219d9f721eb1a7c469bce9.png
tags:
- personal
- work
- phd
title: 'Where I''ve Been Recently: My PhD Thesis'
type: posts
url: /2022/5/15/where-ive-been-recently
---

After committing to [weekly notes](/2022/3/20/20-03-2022-weeknote-week11/) in March, I promptly fell off the radar. So what happened? Well, basically I've been putting all my time and energy outside of work into my PhD thesis, which is due in September. My thesis is still very much a work in progress, but I'm happy enough with the broad structure that I can summarise it.

## What's it all about?

My PhD is about using natural language processing to analyse and better understand news coverage of scientific works. The obvious application of this sort of thing in a post-Trump, post-Brexit, Covid-endemic world is to identify and fight misinformation and exaggeration. However, my original motivation, which is still at the heart of what I'm working on, is twofold: to help scientists better summarise what they've been building in their abstracts and conclusions, and to help journalists translate precise-but-dry science speak into accessible news articles that captivate the general public.


### Understanding The Impact of Science on the World Around Us 

In 2017 [I published an article](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173152) about academia's focus on counting citations (how many times other researchers reference your work in theirs) and altmetrics, crudely summarised as citation counting but with tweets, likes and shares. This attitude is very widespread and has led to the saying ["publish or perish"](https://en.wikipedia.org/wiki/Publish_or_perish), which is widely used in academic circles.

In my article I compared these metrics to the UK's [REF](https://ref.ac.uk/) evaluation process, which includes a qualitative study of real-world outputs (news articles published, patents filed, companies and jobs created, laws and government rulings influenced, health treatments adopted by hospitals) from all UK universities every 4 years. I found that there's very little correlation between 'citationy' metrics and REF impact, which makes sense: a very theoretical work may receive many citations without making any direct changes to society. Conversely, the invention of a new, very specific chemical process might not attract a lot of academic interest but could lead to patents, new companies and new jobs.

The problem with REF is that it's a super manual, time-intensive process with panels, discussions and months of prep work for all involved. I'm interested in ways we could forecast real-world impacts automatically. Following some further analysis of the REF data, I noticed that [news articles are a strong indicator of REF impact](https://arxiv.org/abs/2007.14454). This inspired me to take a closer look at the relationship between scientific articles and the news articles written about them.

### Linking News and Science

I needed a way to link 'sciency' news articles to the scientific papers that they discuss. Sometimes that's easy: the journalist might just give a hyperlink to the work. Sometimes though, particularly in tabloids, they might just say "Boffins at ABC University have invented a method for XYZ" and we need to do some clever matching. I spent a few months developing tools for linking news articles and scientific papers together intelligently and published a [paper on it](https://aclanthology.org/P18-4004.pdf) in 2018. We worked with the British Library Internet Archive, which contained over 52 **terabytes** of website snapshots taken over about a decade for all websites with a .uk TLD. Not only was getting access to this data an issue (copyright concerns etc.), but the practicalities of working with it were pretty crazy too: we used "big data" tools like Apache Spark to run batch jobs that took days before announcing they had failed.

Eventually I was able to pull together a dataset of news articles linked to scientific papers, which we subsequently published as an open-access/open-source dataset. We also made [the semi-supervised tool that I built to curate it](https://github.com/ravenscroftj/harri_gttool) available as open source software.
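
To give a flavour of what "clever matching" can mean, here is a deliberately naive sketch. It is not the method from the paper or from the tool linked above: it just scores candidate papers by how many words their title and affiliation share with the news text (Jaccard similarity). The function names, the stopword list and the example data are all made up for illustration.

```python
import re

# Tiny, hypothetical stopword list; a real system would use something richer.
STOPWORDS = {"a", "an", "the", "of", "for", "at", "in", "on", "to", "and", "have", "has"}


def tokens(text):
    """Lowercase word tokens with common stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(news_text, papers):
    """Return (score, paper) for the candidate that best overlaps the news text."""
    news = tokens(news_text)
    scored = [
        (jaccard(news, tokens(p["title"] + " " + p["affiliation"])), p)
        for p in papers
    ]
    # Compare on the score only, so the paper dicts are never compared directly.
    return max(scored, key=lambda pair: pair[0])


# Invented example data, loosely modelled on the tabloid sentence above.
news = "Boffins at ABC University have invented a method for detecting plastic in seawater"
candidates = [
    {"title": "A method for detecting plastic particles in seawater", "affiliation": "ABC University"},
    {"title": "Deep learning for protein folding", "affiliation": "XYZ Institute"},
]

score, paper = best_match(news, candidates)
print(paper["title"], round(score, 2))
```

Word overlap like this falls over as soon as the journalist paraphrases heavily, which is exactly why the real problem needed more than a few lines of Python.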

### Looking at the Differences Between News Speak and Science Speak

Once I was able to connect news articles and science articles, it was time to explore how the two relate. News articles will typically quote bits of scientific papers and paraphrase other bits. The journalist might try to use creative metaphors and similes to help non-techie readers get their heads around ideas. Compared to the original authors, journalists may also be a little less wary and a little more imaginative about making bold claims regarding the impact the scientific work will have on society. If I could identify sections of the two documents that talk about the same thing in different ways, I could use that knowledge to develop tools that help both parties write better: suggesting ways for scientists to make their conclusions a little more exciting and ways for journalists to keep things factual.

So the first challenge was finding these aligned chunks of text. I experimented a bit with