---
title: 'Cognitive Quality Assurance Pt 2: Performance Metrics'
author: James
type: post
date: 2016-05-29T09:41:26+00:00
url: /2016/05/29/cognitive-quality-assurance-pt-2-performance-metrics/
featured_image: /wp-content/uploads/2016/05/Oma--825x510.png
medium_post:
  - 'O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";s:12:"1f1de4b3132e";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:96:"https://medium.com/@jamesravey/cognitive-quality-assurance-pt-2-performance-metrics-1f1de4b3132e";}'
tags:
  - quality assurance
  - machine learning
  - watson
  - work
---

***EDIT: Hello readers, these articles are now 4 years old and many of the Watson services and APIs have moved or been changed. The concepts discussed in these articles are still relevant but I am working on 2nd editions of them.***

[Last time][1] we discussed some good practices for collecting data and then splitting it into test and train sets in order to create a ground truth for your machine learning system. We then talked about calculating accuracy using test and blind data sets. In this post we will talk about some more metrics you can calculate for your machine learning system, including **Precision**, **Recall**, **F-measure** and **confusion matrices**. These metrics give you a much deeper level of insight into how your system is performing and provide hints at how you could improve performance too!

## A recap – Accuracy calculation

This is the simplest calculation but perhaps the least interesting. We are just looking at the percentage of times the classifier got it right versus the percentage of times it failed. Simply:

1. Sum up the number of results (count the rows).
2. Sum up the number of rows where the predicted label and the actual label match.
3. Calculate percentage accuracy: correct / total * 100.

This tells you how good the classifier is in general across all classes. It does not help you understand how that result is made up.

## Going above and beyond accuracy: why is it important?

Imagine that you are a hospital and it is critically important to be able to predict different types of cancer and how urgently they should be treated. Your classifier is 73% accurate overall, but that does not tell you anything about its ability to predict any one type of cancer. What if the 27% of the answers it got wrong were the cancers that need urgent treatment? We wouldn’t know!

This is exactly why we need to use measurements like precision, recall and F-measure, as well as confusion matrices, in order to understand what is really going on inside the classifier and which particular classes (if any) it is really struggling with.

## Precision, Recall and F-measure and confusion matrices (Grandma’s Memory Game)

Precision, Recall and F-measure are incredibly useful for getting a deeper understanding of which classes the classifier is struggling with. They can be a little bit tricky to get your head around, so let’s use a metaphor about Grandma’s memory.

Imagine Grandma has 24 grandchildren. As you can imagine, it is particularly difficult for her to remember all their names. Thankfully, her 6 children (the grandchildren’s parents) each had 4 kids and named them after themselves. Her son Steve, for example, has sons named Steve I, Steve II, Steve III and so on.

This makes things much easier for Grandma, who now only has to remember 6 names: Brian, Steve, Eliza, Diana, Nick and Reggie. The children do not like being called the wrong name, so it is vitally important that she correctly classifies each child into the right name group when she sees them at the family reunion every Christmas.

I will now describe Precision, Recall, F-Measure and confusion matrices in terms of Grandma’s predicament.

### Some Terminology

Before we get on to precision and recall, I need to introduce the concepts of true positive (TP), false positive (FP), true negative (TN) and false negative (FN). Every time Grandma gets an answer right or wrong, we can talk about it in terms of these labels, and this will also help us get to grips with precision and recall later.

These labels are defined with respect to each class: you have TP, FP, FN and TN for each class. In this case we can have TP, FP, FN and TN with respect to Brian, with respect to Steve, with respect to Eliza and so on.

This table shows how these four labels apply to the class “Brian”; you can create a table like this for each class.

| | Brian | Not Brian |
|---|---|---|
| Grandma says “Brian” | True Positive | False Positive |
| Grandma says another name | False Negative | True Negative |

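To make the four labels concrete, here is a minimal Python sketch that labels a single guess with respect to one class. The function name `outcome_for_class` is made up for this illustration and is not part of any library.

```python
def outcome_for_class(actual: str, predicted: str, cls: str) -> str:
    """Label one guess as TP, FP, FN or TN with respect to the class `cls`."""
    if predicted == cls:
        # Grandma said this name: right if the child really has it, wrong otherwise.
        return "TP" if actual == cls else "FP"
    # Grandma said some other name: a miss only if the child really is a `cls`.
    return "FN" if actual == cls else "TN"


# A Steve who gets called "Brian" is judged differently for each class.
print(outcome_for_class("Steve", "Brian", "Brian"))  # FP
print(outcome_for_class("Steve", "Brian", "Steve"))  # FN
print(outcome_for_class("Steve", "Brian", "Eliza"))  # TN
```

Note that the same mistake counts as a false positive for one class and a false negative for another, which is why the labels only make sense per class.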

| | TP | FP | FN |
|---|---|---|---|
| Brian | 2 | 1 | 1 |
| Eliza | | 1 | |
| Steve | 1 | | 1 |

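Counts like these can be tallied mechanically from a list of (actual, predicted) pairs. The sketch below is one way to do it in Python; the `tally` helper and the list of guesses are invented for illustration and are not taken from the original scenario.

```python
from collections import Counter


def tally(pairs, classes):
    """Count TP, FP and FN per class from (actual, predicted) pairs."""
    counts = {cls: Counter() for cls in classes}
    for actual, predicted in pairs:
        if actual == predicted:
            counts[actual]["TP"] += 1
        else:
            counts[predicted]["FP"] += 1  # the name Grandma wrongly said
            counts[actual]["FN"] += 1     # the name she should have said
    return counts


# Hypothetical guesses, chosen only for illustration.
guesses = [
    ("Brian", "Brian"),
    ("Brian", "Brian"),
    ("Steve", "Brian"),
    ("Brian", "Eliza"),
    ("Steve", "Steve"),
]
counts = tally(guesses, ["Brian", "Eliza", "Steve"])
for name, c in counts.items():
    print(name, c["TP"], c["FP"], c["FN"])
```

Every wrong guess adds exactly one false positive and one false negative, but to two different classes.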

| | TP | FP | FN | Precision |
|---|---|---|---|---|
| Brian | 2 | 1 | 1 | 66.6% |
| Eliza | | 1 | | N/A |
| Steve | 1 | | 1 | 100% |

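The Brian and Steve figures are consistent with the standard definition of precision, TP / (TP + FP): of all the times Grandma said a name, how often was she right? A minimal Python sketch, assuming an empty cell means zero and using a hypothetical `precision` helper:

```python
from typing import Optional


def precision(tp: int, fp: int) -> Optional[float]:
    """Of all the times a name was said, the fraction that were right."""
    said = tp + fp
    if said == 0:
        return None  # the name was never said, so precision is undefined
    return tp / said


print(precision(tp=2, fp=1))  # Brian: 2 of 3 guesses were right
print(precision(tp=1, fp=0))  # Steve: the only "Steve" guess was right
```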

| | TP | FP | FN | Recall |
|---|---|---|---|---|
| Brian | 2 | 1 | 1 | 66.6% |
| Eliza | | 1 | | N/A |
| Steve | 1 | | 1 | 50% |

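Recall looks at the same counts from the other direction, TP / (TP + FN): of all the children who really have a given name, how many did Grandma get right? A small Python sketch under the same assumptions (empty cells read as zero, `recall` is a hypothetical helper):

```python
from typing import Optional


def recall(tp: int, fn: int) -> Optional[float]:
    """Of all the children with a name, the fraction Grandma named correctly."""
    present = tp + fn
    if present == 0:
        return None  # no child with this name turned up, so recall is undefined
    return tp / present


print(recall(tp=2, fn=1))  # Brian: 2 of the 3 Brians were recognised
print(recall(tp=1, fn=1))  # Steve: 1 of the 2 Steves was recognised
```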

| | TP | FP | FN | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Brian | 2 | 1 | 1 | 66.6% | 66.6% | 66.6% |
| Eliza | | 1 | | N/A | N/A | N/A |
| Steve | 1 | | 1 | 100% | 50% | 66.6% |

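The F-measure values in the table match the harmonic mean of precision and recall, 2PR / (P + R). A minimal Python sketch (the function name `f_measure` is my own, and returning 0 when both inputs are 0 is a convention chosen here to avoid dividing by zero):

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (often called F1)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(1.0, 0.5), 4))      # Steve
print(round(f_measure(2 / 3, 2 / 3), 4))  # Brian
```

Because it is a harmonic mean, the F-measure is pulled towards the lower of the two values: Steve's perfect precision cannot make up for recall of only 50%.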

| Actual class (rows) / Predictions (columns) | Steve | Brian | Eliza | Diana | Nick | Reggie |
|---|---|---|---|---|---|---|
| Steve | **4** | 1 | 1 | | | |
| Brian | 1 | **3** | | | 1 | 1 |
| Eliza | | | **5** | 1 | | |
| Diana | | | 5 | **1** | | |
| Nick | 1 | | | | **5** | |
| Reggie | | | | | | **6** |

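A confusion matrix is just a count of how often each actual class was given each predicted label. As a rough sketch, it can be built in Python from (actual, predicted) pairs; the `confusion_matrix` helper and the sample guesses below are hypothetical and far smaller than Grandma's full table.

```python
from collections import defaultdict


def confusion_matrix(pairs):
    """Nested dict: matrix[actual][predicted] -> number of times it happened."""
    matrix = defaultdict(lambda: defaultdict(int))
    for actual, predicted in pairs:
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical guesses, chosen only for illustration.
pairs = [
    ("Diana", "Eliza"),
    ("Diana", "Eliza"),
    ("Diana", "Diana"),
    ("Eliza", "Eliza"),
]
m = confusion_matrix(pairs)
print(m["Diana"]["Eliza"])  # Dianas who were called Eliza
print(m["Diana"]["Diana"])  # Dianas who were named correctly
print(m["Eliza"]["Diana"])  # combinations that never happened count as 0
```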
OK, so let’s have a closer look at the above.

Reading across the rows left to right, you see the actual examples of each class. In this case there are 6 children with each name, so if you sum over each row you will find that they all add up to 6.

Reading down the columns top to bottom, you will find the predictions, i.e. what Grandma thought each child’s name was. These columns may add up to more or less than 6 because Grandma may overfit for one particular name. In this case she seems to think that all her female grandchildren are called Eliza (she predicted that 5/6 Elizas are called Eliza and that 5/6 Dianas are also called Eliza).
Reading diagonally where I’ve shaded things in bold gives you the number of correctly predicted examples. In this case Reggie was 100% accurately predicted with 6/6 children called “Reggie” actually being predicted “Reggie”. Diana is the poorest performer with only 1/6 children being correctly identified. This can be explained as above with Grandma over-generalising and calling all female relatives “Eliza”.
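The three ways of reading the matrix (rows, columns, diagonal) can be checked with a few lines of Python. In this sketch the diagonal and the row totals follow the figures quoted in this post, but exactly which column each off-diagonal mistake in the Steve, Brian and Nick rows falls in is an assumption made for illustration.

```python
names = ["Steve", "Brian", "Eliza", "Diana", "Nick", "Reggie"]

# Rows are actual names, columns are Grandma's guesses, both in `names` order.
# Off-diagonal placement in the Steve, Brian and Nick rows is assumed.
matrix = [
    [4, 1, 1, 0, 0, 0],
    [1, 3, 0, 0, 1, 1],
    [0, 0, 5, 1, 0, 0],
    [0, 0, 5, 1, 0, 0],
    [1, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 6],
]

row_totals = [sum(row) for row in matrix]        # children per actual name
col_totals = [sum(col) for col in zip(*matrix)]  # times each name was said
correct = sum(matrix[i][i] for i in range(len(names)))  # the diagonal
accuracy = correct / sum(row_totals)

print(dict(zip(names, row_totals)))
print(dict(zip(names, col_totals)))
print(f"overall accuracy: {accuracy:.1%}")
for i, name in enumerate(names):
    print(f"{name}: {matrix[i][i]}/{row_totals[i]} correctly identified")
```

The overall accuracy and the per-name "correct out of 6" figures depend only on the diagonal and the row totals, so they hold regardless of the assumed placement of the mistakes; the column totals do not.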
Grandma seems to have gender nailed, except in the case of one of the Steves (who in fairness does have a ponytail and can sing very high). She is best at predicting Reggies and struggles with Brians (perhaps Brians have the most diverse appearance and look a lot like their respective male cousins). She is also pretty good at Nicks and Steves.

Grandma is terrible at her female grandchildren’s names. If this were a machine learning problem, we would need to find a way to make it easier to tell Dianas and Elizas apart, through some kind of further feature extraction or weighting, or by gathering additional training data.
Machine learning is definitely no walk in the park. There are a lot of intricacies involved in assessing the effectiveness of a classifier. Accuracy is a great start if, until now, you’ve been praying to the gods and carrying four-leaf clovers around with you to improve your cognitive system’s performance.
However, Precision, Recall, F-Measure and Confusion Matrices really give you the insight you need into which classes your system is struggling with and which classes confuse it the most.
This example is probably most directly relevant to those building classification systems (e.g. extracting intent from questions or detecting whether an image contains a particular company’s logo). However, all of this works for document retrieval use cases too. Consider a true positive to be when the first document returned from the query is the correct answer, and a false negative to be when the first document returned is the wrong answer.

There are also variants of this that consider the top N retrieved answers (Precision@N). These tell you whether your system can return the correct answer within the top 1, 3, 5 or 10 answers, simply by defining “True Positive” as the correct document turning up in the top N answers returned by the query.
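As a rough illustration of the top-N idea described above, the Python sketch below counts a query as a hit when the correct document appears anywhere in the first N results. The function names and the document ids are hypothetical.

```python
def hit_at_n(ranked_ids, correct_id, n):
    """True if the correct document is among the first n results."""
    return correct_id in ranked_ids[:n]


def hit_rate_at_n(queries, n):
    """Fraction of queries whose correct document appears in the top n."""
    hits = sum(hit_at_n(ranked, correct, n) for ranked, correct in queries)
    return hits / len(queries)


# Hypothetical queries: (ranked result ids, id of the correct document).
queries = [
    (["d3", "d1", "d7"], "d3"),  # correct answer ranked first
    (["d2", "d9", "d4"], "d4"),  # correct answer ranked third
    (["d5", "d6", "d8"], "d0"),  # correct answer not returned at all
]
print(hit_rate_at_n(queries, 1))
print(hit_rate_at_n(queries, 3))
```

Raising N can only keep the score the same or improve it, so it is worth reporting the value of N alongside the figure.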
Overall I hope this tutorial has helped you to understand the ins and outs of machine learning evaluation.
Next time we will look at cross-validation techniques and how to assess small corpora where carving out a 30% chunk of the documents would seriously impact the learning. Stay tuned for more!

[1]: https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/
[2]: https://upload.wikimedia.org/math/9/9/1/991d55cc29b4867c88c6c22d438265f9.png
[3]: https://en.wikipedia.org/wiki/Harmonic_mean#Harmonic_mean_of_two_numbers