author | date | post_meta | preview | tags | title | type | url
---|---|---|---|---|---|---|---
James | 2015-07-15 19:33:29+00:00 | | /social/62d3f438a593c937683d801ed10407122bcd28a35483824cd76103b18dc8610c.png | | SSSplit Improvements | posts | /2015/07/15/sssplit-improvements/
Introduction
As part of my continuing work on Partridge, I’ve been improving the sentence-splitting capability of SSSplit – the component used to split academic papers from PLOS ONE and PubMed Central into separate sentences.
Papers arrive in our system as big blocks of text with the occasional diagram or formula, and in order to apply CoreSC annotations to the sentences we need to know where each sentence starts and ends. Of course, that means we also have to take into account the other ‘stuff’ (listed above) floating around in the documents. We can’t just ignore formulae and citations – they’re pretty important! That’s what SSSplit does: it carves up papers into sentence elements whilst leaving the XML structure of the rest of the document intact.
The original SSSplit utility was written a number of years ago in Java and used regular expressions to parse XML (something that readers of this StackOverflow article will already know has a propensity to summon eldritch abominations from the otherworld). Because of the complexity of some of those expressions, the old splitter was not particularly performant (if you’re interested, check out one of the simpler ones here).
Now, I can definitely see what the original authors were going for here. Regular expressions are very good for splitting sentences, but not sentences inside complex XML documents. XML parsers are not particularly good at splitting sentences but are obviously good at parsing XML. I also understand that the original splitter was designed first and then had new bits glued on to make it suitable for new and different standards of XML, leading to gargantuan expressions like the one linked to above. I think they did a pretty good job given the information available to them at the time of writing.
I decided that the splitter needed a rewrite and went straight to my comfort zone to get it done: Python. I’m very familiar with the language – to the point now that I can write a fairly complicated program in it in a day if I’ve had enough coffee and sugar.
Writing SSSplit 2.0
I decided that we needed to minimise the use of regular expressions, for both performance and maintenance/readability reasons, and to do as much of the document-structure parsing as possible with a traditional XML parser. I’d heard good things about etree, which is part of the standard Python library and provides an informal DOM-like interface.
I used etree to inspect what I dubbed ‘P-level’ XML elements first. These are elements that I consider to be at a “paragraph” level. Any sentences inside these elements are completely contained – they do not cross the boundary into the next container (unless the author is a poet or fiction writer, or doesn’t write English very well, I think it’s a safe bet that they wouldn’t finish a paragraph mid-sentence). Within the P-level containers, I sweep for any sort of XML node – we’re interested in text nodes but also in any sort of formatting, such as bold elements.
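To make the P-level idea a bit more concrete, here is a minimal sketch using `xml.etree.ElementTree`. The tag names in `P_LEVEL_TAGS` and the helper names `iter_p_level` and `iter_nodes` are assumptions made up for illustration; the real splitter's list of container tags isn't shown in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of "paragraph-level" tags; the real list is an assumption.
P_LEVEL_TAGS = {"p", "title"}


def iter_p_level(root):
    """Yield the outermost descendants of root whose tag is P-level."""
    for child in root:
        if child.tag in P_LEVEL_TAGS:
            yield child
        else:
            # Not a container itself, so look inside it.
            yield from iter_p_level(child)


def iter_nodes(elem):
    """Yield ("text", str) and ("elem", Element) items in document order."""
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield ("elem", child)
        # ElementTree stores the text that follows a child in child.tail.
        if child.tail:
            yield ("text", child.tail)


doc = ET.fromstring(
    "<article><body><p>Cells were <b>not</b> viable. See Fig 1.</p>"
    "<sec><p>Second paragraph.</p></sec></body></article>"
)
containers = list(iter_p_level(doc))
nodes = [(kind, v if kind == "text" else v.tag)
         for kind, v in iter_nodes(containers[0])]
print(nodes)
```

The thing to watch with etree is that there are no standalone text nodes: text lives in `.text` and `.tail`, so a sweep has to visit both to see everything in order.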
When a text node is encountered, that’s when regular expressions start to kick in. We run a very simple match over the text node for a full stop, question mark or exclamation mark followed by optional whitespace and then a capital letter, digit or dollar sign – these are “potential” splits. We also look for a full stop at the very end of the text.
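import re
# Candidate split: . ? or ! followed by optional whitespace and a capital, digit or $, or a full stop at the very end.
pattern = re.compile(r'(\.|\?|\!)(?=\s*[A-Z0-9$])|\.$')
m = pattern.search(txt)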
Of course, this generates lots of false positives – what if we’ve found a decimal point inside a number? What if it’s an abbreviation like e.g. or i.e., or an initial as in J. Ravenscroft? A second regular expression checks for decimal points, and the string around the punctuation is checked against a list of common abbreviations. There’s also a list of authors, covering both the writers of the paper in question and those cited in it, and the function checks that the full stop is not part of one of these authors’ names.
There’s an important factor to remember: a text node does not imply a finished sentence – text nodes are interspersed with formulae and references, as explained above. Therefore we can’t just finish the current sentence when we reach the end of a text node – only when we encounter a full stop (not part of an abbreviation or number), question mark or exclamation mark. We also know that we can complete the current sentence at the end of a P-level container, as I explained above.
Every time we start parsing a sentence, text nodes and other ‘stuff’ deemed to be inside that sentence are accumulated into a list. Once we encounter the end of the sentence, the list is glued together and turned into an XML element.
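The filtering described above could look roughly like the sketch below. It reuses the candidate pattern from the post, but the abbreviation and author lists are tiny made-up stand-ins, the helper names are hypothetical, and the decimal check is done with plain string tests rather than the second regular expression the real splitter uses.

```python
import re

# Same candidate pattern as in the post, written as a raw string.
CANDIDATE = re.compile(r'(\.|\?|\!)(?=\s*[A-Z0-9$])|\.$')

# Hypothetical, deliberately tiny lists; the real ones are not shown here.
ABBREVIATIONS = {"e.g.", "i.e.", "al.", "Fig."}
AUTHOR_INITIALS = {"J."}


def is_false_positive(txt, pos):
    """Return True if the punctuation at txt[pos] should not end a sentence."""
    if txt[pos] != ".":
        return False  # ? and ! are always accepted in this sketch
    before = txt[pos - 1] if pos > 0 else ""
    after = txt[pos + 1] if pos + 1 < len(txt) else ""
    # Decimal point: a digit on both sides of the full stop.
    if before.isdigit() and after.isdigit():
        return True
    # Whitespace-delimited token that ends at this full stop.
    start = txt.rfind(" ", 0, pos) + 1
    token = txt[start:pos + 1].lstrip("([")
    return token in ABBREVIATIONS or token in AUTHOR_INITIALS


def find_splits(txt):
    """Return the index just past each accepted sentence-ending mark."""
    return [m.end() for m in CANDIDATE.finditer(txt)
            if not is_false_positive(txt, m.start())]


def split_sentences(txt):
    """Cut txt at the accepted split points and strip the pieces."""
    pieces, start = [], 0
    for end in find_splits(txt):
        pieces.append(txt[start:end].strip())
        start = end
    if txt[start:].strip():
        pieces.append(txt[start:].strip())
    return pieces


example = ("We used 3.5 ml, e.g. Sample A, as in J. Ravenscroft. "
           "It worked! Did it? Yes.")
print(split_sentences(example))
```

On the example string this keeps "3.5", "e.g." and "J." inside the first sentence and yields four sentences in total. A real abbreviation list would need to be far longer, and matching whole author names would be more robust than matching bare initials.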
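As a rough illustration of the "glue the list together" step, here is a sketch that turns a list of strings and etree elements into a single element. The function name `build_sentence` and the default tag name are placeholders chosen for this example, not necessarily what SSSplit emits.

```python
import copy
import xml.etree.ElementTree as ET


def build_sentence(parts, tag="sentence"):
    """Glue a list of str / Element parts into one new element.

    `tag` is a placeholder name picked for this sketch.
    """
    sent = ET.Element(tag)
    last = None  # most recently appended child element
    for part in parts:
        if isinstance(part, str):
            if last is None:
                # Text before the first child lives in .text
                sent.text = (sent.text or "") + part
            else:
                # Text after a child lives in that child's .tail
                last.tail = (last.tail or "") + part
        else:
            last = copy.deepcopy(part)  # leave the source tree untouched
            last.tail = None            # tail text arrives as its own part
            sent.append(last)
    return sent


bold = ET.Element("b")
bold.text = "not"
sentence = build_sentence(["Cells were ", bold, " viable."])
xml = ET.tostring(sentence, encoding="unicode")
print(xml)
```

Copying each element before appending it means the original document tree isn't modified while sentences are being assembled, which makes it easier to compare input and output afterwards.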
The next step was to see how effective the new splitter was against the old splitter and also manual annotation by professional scientific literature readers.
Testing the splitter
To test the system I originally wrote a simple script that takes a set of manually annotated papers, strips them of their annotations so that the new splitter doesn’t get any clues, runs the new routine over them and then compares the output. I wrote it in a rush, so it was very rudimentary and didn’t tell me much about the success rate of my splitter. It did display the first and last words of each “detected” sentence for both manual and automatic annotation, so I could at least see how well (if at all) the two lined up. I had to run the script on a paper-by-paper basis.
I managed to get the splitter working really well on a number of papers (we’re talking a 100% match) using this tool. However, I realised that the majority of papers were still not being matched, and it was becoming more and more of a chore to find which ones weren’t matching.
That’s why I decided to write a web-based visualisation tool for checking the splitter. The idea is that it runs on all papers, giving an overall percentage of how well the automated splitter is working vs the manual annotation, but also gives a per-paper figure. If you want to see which papers the system is really struggling with, you can inspect them by clicking on them. This brings up a list of all the sentences and whether or not they align.
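A first-and-last-word comparison like the one described above might be sketched as follows. The function names and the positional pairing of sentences are assumptions for illustration; the original script may well have lined things up differently.

```python
from itertools import zip_longest


def edge_words(sentence):
    """Return the (first, last) words of a sentence, or empty strings."""
    words = sentence.split()
    return (words[0], words[-1]) if words else ("", "")


def side_by_side(manual, automatic):
    """Pair sentences by position and flag whether their edge words agree."""
    rows = []
    for m, a in zip_longest(manual, automatic, fillvalue=""):
        rows.append((edge_words(m), edge_words(a),
                     edge_words(m) == edge_words(a)))
    return rows


manual = ["The cells were viable.", "We saw no change."]
automatic = ["The cells were viable.", "We saw no", "change."]
rows = side_by_side(manual, automatic)
for man, auto, ok in rows:
    print(man, auto, "OK" if ok else "MISMATCH")
```

Pairing by position is the weak spot: once one boundary drifts, every later row is reported as a mismatch, which is part of why this kind of output says little about the overall success rate.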
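The overall and per-paper figures could be computed along these lines. This is only a sketch: it assumes a sentence "aligns" when its whitespace-normalised text appears in the automatic output, and the names `count_hits` and `corpus_report` are invented for the example.

```python
def normalise(sentence):
    """Collapse runs of whitespace so formatting differences don't matter."""
    return " ".join(sentence.split())


def count_hits(manual, automatic):
    """Count manual sentences reproduced exactly by the splitter."""
    auto = {normalise(s) for s in automatic}
    return sum(1 for s in manual if normalise(s) in auto)


def corpus_report(papers):
    """papers maps a paper id to (manual_sentences, automatic_sentences).

    Returns the overall percentage and a per-paper list, worst first.
    """
    per_paper = {}
    total_hits = total = 0
    for pid, (manual, automatic) in papers.items():
        hits = count_hits(manual, automatic)
        per_paper[pid] = 100.0 * hits / len(manual) if manual else 100.0
        total_hits += hits
        total += len(manual)
    overall = 100.0 * total_hits / total if total else 100.0
    # Worst-aligned papers first, so they are easy to pick out for tuning.
    ranked = sorted(per_paper.items(), key=lambda kv: kv[1])
    return overall, ranked


papers = {
    "a": (["One.", "Two."], ["One.", "Two."]),
    "b": (["A b. C d.", "E."], ["A b.", "C d.", "E."]),
}
overall, ranked = corpus_report(papers)
print(overall, ranked)
```

Sorting the per-paper scores worst-first gives the same "which paper should I look at next" view that the web tool provides by clicking through.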
The tool is pretty useful as it gives me a clue as to which papers I need to tune the splitter with next.
Here’s a quick demo video of me using the tool to find papers that don’t match very well.
Next steps
A lot of tuning has been done on how this system works but there’s still a long way to go yet. I’ll probably post another article talking about what further changes had to be made to make the parser effective!