---
title: Do more than ‘kick the tires’ of your NLP model
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=498
categories:
  - Uncategorized
---
### _We’ve known for a while that ‘accuracy’ doesn’t tell you much about your machine learning models, but now we have a better alternative!_

“So how accurate is it?” – a question that many data scientists like myself dread being asked by business stakeholders. It’s not that I fear I’ve done a bad job, but that evaluating model performance is complex and multi-faceted, and summarising it with a single number usually doesn’t do it justice.

Accuracy can also be a communication hurdle – it is not an intuitive concept, and it can lead to friction and misunderstanding if you’re not ‘in’ with the AI crowd. 50% accuracy from a model that has 1,500 possible answers could be considered pretty good. 80% accuracy on a task where the data is split 90:10 across two classes is meaningless (a naive baseline that always predicts the majority class would beat the model).

I’ve written before about [how we can use finer-grained metrics like Recall, Precision and F1-score to evaluate machine learning models][1]. However, many of us in the AI/NLP community still feel that these metrics are too simplistic and do not adequately describe the characteristics of trained ML models. Unfortunately, we haven’t had many other options for evaluating model performance… until now, that is…

## CheckList – When machine learning met test automation

At the 2020 Annual Meeting of the Association for Computational Linguistics – a very popular academic conference on NLP – [Ribeiro et al. presented a new method for evaluating NLP models][2], inspired by principles and techniques that software quality assurance (QA) specialists have been using for years.

The idea is that we should design and implement test cases for NLP models that reflect the tasks the model will be required to perform “in the wild”. As in software QA, these test cases should include tricky edge cases that may trip the model up, in order to understand the model’s practical limitations. For example, we might train a named entity recognition model that

 [1]: https://brainsteam.co.uk/2016/03/29/cognitive-quality-assurance-an-introduction/
 [2]: https://www.aclweb.org/anthology/2020.acl-main.442.pdf
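To give a flavour of what this looks like in practice, here is a minimal, library-free sketch of a CheckList-style behavioural test for a sentiment model. Everything in it – the `predict_sentiment` stub, the example sentences and the expected labels – is made up for illustration; Ribeiro et al. also release a `checklist` Python package with templating and perturbation helpers for building these tests at scale.

```python
# A minimal, library-free sketch of the CheckList idea: treat an NLP model
# like software under test and write explicit behavioural test cases for it.
# `predict_sentiment` is a hypothetical stand-in for your real model.

def predict_sentiment(text: str) -> str:
    # Stub model: a naive keyword matcher, just so the script runs end to end.
    positive_words = ["good", "great", "love"]
    return "positive" if any(w in text.lower() for w in positive_words) else "negative"

# A 'minimum functionality' style test: simple negation should flip sentiment.
negation_cases = [
    ("This is not a good film.", "negative"),
    ("I don't love this phone.", "negative"),
    ("The hotel wasn't great.", "negative"),
]

# An 'invariance' style test: swapping a person's name should not change the prediction.
invariance_pairs = [
    ("John said the service was great.", "Maria said the service was great."),
    ("Anna thought the food was good.", "Omar thought the food was good."),
]

failures = []

for text, expected in negation_cases:
    got = predict_sentiment(text)
    if got != expected:
        failures.append(f"MFT failure: {text!r} -> {got}, expected {expected}")

for original, perturbed in invariance_pairs:
    if predict_sentiment(original) != predict_sentiment(perturbed):
        failures.append(f"INV failure: prediction changed between {original!r} and {perturbed!r}")

print(f"{len(failures)} failing test case(s)")
for failure in failures:
    print(" -", failure)
```

The naive keyword-matching stub fails all three negation cases, which is exactly the kind of behavioural blind spot that a single headline accuracy figure would never reveal.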
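Coming back to the class-imbalance point from earlier, here is a quick toy example in Python. The label counts and predictions are made up purely to reproduce the 90:10 scenario described above:

```python
# Toy illustration of why a headline accuracy number can be misleading on
# imbalanced data: a binary task with a 90:10 class split and a 'model'
# that scores a respectable-sounding 80% accuracy.

y_true = [0] * 90 + [1] * 10                        # 90:10 split across two classes
y_model = [0] * 75 + [1] * 15 + [1] * 5 + [0] * 5   # 80 out of 100 correct
y_baseline = [0] * len(y_true)                      # always predict the majority class

def accuracy(truth, preds):
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)

print(f"model accuracy:    {accuracy(y_true, y_model):.0%}")     # 80%
print(f"baseline accuracy: {accuracy(y_true, y_baseline):.0%}")  # 90%
```

Despite its 80% accuracy, the model loses to a baseline that never even looks at the input.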