---
title: Serving NLP Models with MLflow
author: James
type: post
resources:
- name: feature
  src: images/feature.jpg
date: 2020-12-29T09:50:28+00:00
url: /2020/12/29/serving-nlp-models-with-mlflow/
description: Serving NLP models with MLflow is a little trickier than serving models expecting tabular input. In this post we explore one possible solution with code examples.
categories:
- Work
- Open Source
tags:
- machine-learning
- python
- ai
- devops
- mlops
- nlp
- spacy
---
{{<figure src="images/feature.jpg">}}
[MLflow](https://www.mlflow.org/) is a powerful open source MLOps platform with a [built-in framework for serving your trained ML models as REST APIs](https://www.mlflow.org/docs/latest/models.html#deploy-mlflow-models). The REST framework will load data provided in a JSON or CSV format compatible with [pandas](https://pandas.pydata.org/) and pass this directly into your model. This can be handy when your model is expecting a tabular list of numerical and categorical features. However, it is less clear how to serve models and pipelines that expect unstructured text data as their primary input. In this post we will explore how to train and then serve an NLP model using MLflow, [scikit-learn](https://scikit-learn.org/) and [spaCy](https://spacy.io/).
## What you'll need and Installing dependencies
In order to use MLflow and to train our NLP model you're going to need Python 3.6+. I'm a big fan of using [miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage Python dependencies and MLflow uses conda to manage ML server environments. Therefore, it's the logical choice for managing our project and for the remainder of this post I will provide instructions for this. If you're handy with pip or pip-based dependency managers like [Poetry](https://python-poetry.org/) or [pipenv](https://pypi.org/project/pipenv/) then you should find it's easy enough to follow along but YMMV (especially when it comes to the environments MLflow generates).
First I'm going to create a new conda environment with the requirements we need installed already:
```shell
conda create -n mlflow-nlp-model -c conda-forge python==3.7 pandas scikit-learn mlflow spacy pip notebook
```
This may take a couple of minutes to resolve but you should be able to accept (type 'y' when prompted) and wait for conda to download and install the requirements.
Now we can activate our environment by running `conda activate mlflow-nlp-model`.
## Collecting and preparing our data
We are going to train a model to classify email messages from the [20 newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset) dataset provided as part of scikit-learn. Of course, the techniques we use here could be applied to other real world datasets too.
Firstly (assuming you have a Jupyter notebook or Python prompt ready), we're going to download the data and turn it into a pandas dataframe:
```python
from sklearn.datasets import fetch_20newsgroups
import pandas as pd


def df_from_20ng(subset):
    """Load one partition ('train' or 'test') of 20 newsgroups as a dataframe."""
    # Use the requested subset rather than always loading 'train'
    newsgroups = fetch_20newsgroups(subset=subset)
    ngdata = {"text": newsgroups.data, "target": newsgroups.target}
    df = pd.DataFrame.from_dict(ngdata)
    # Map the integer class index to its human-readable newsgroup name
    df['target_name'] = df.target.apply(lambda x: newsgroups.target_names[x])
    return df


df_train = df_from_20ng('train')
df_test = df_from_20ng('test')

# Keep only the 'text' column as model input; the label is the newsgroup name
X_train = df_train.drop(columns=['target', 'target_name'])
y_train = df_train['target_name']
X_test = df_test.drop(columns=['target', 'target_name'])
y_test = df_test['target_name']
```
The above code will automatically download the example dataset (or use a local cache after the first time you run it). We then load each partition into a pandas dataframe and add a `target_name` column holding the human-readable newsgroup name.
The data is already conveniently partitioned into *test* and *train* sets but if you are using your own data you could generate a single dataframe and then use [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to partition it - this function works fine on dataframes.
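If you are bringing your own data, the split might look something like the minimal sketch below. The toy dataframe, the variable names and the 25% test size are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for your own labelled text
own_df = pd.DataFrame({
    "text": [
        "bike for sale", "new GPU drivers", "selling my old car",
        "graphics card benchmarks", "sofa for sale, good condition",
        "which monitor should I buy?", "guitar for sale", "OpenGL question",
    ],
    "target_name": ["misc.forsale", "comp.graphics"] * 4,
})

# stratify keeps the class balance the same in both partitions
own_train, own_test = train_test_split(
    own_df, test_size=0.25, stratify=own_df["target_name"], random_state=42
)

# Double brackets keep X as a one-column dataframe rather than a Series
own_X_train, own_y_train = own_train[["text"]], own_train["target_name"]
own_X_test, own_y_test = own_test[["text"]], own_test["target_name"]
```

Stratifying on the label is optional, but it helps avoid a test set that is missing one of the classes when the dataset is small.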
We end up with `X_train` and `X_test`, which are pandas dataframes containing just the text from each email, and `y_train` and `y_test`, which are pandas Series containing the corresponding ground truth class labels for the emails.
You might have noticed that our `X_train` and `X_test` dataframes only contain one column and you might wonder why we bother using a dataframe here when a one-dimensional array or list would suffice. Well, the reason is that using a dataframe makes it possible for us to simply pass in CSV and JSON data to the REST API - hopefully it will become a bit clearer below.
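To make the connection a little more concrete, here is a small sketch of how a JSON payload in "records" orientation maps onto a one-column dataframe. It uses pandas directly and only mimics the shape of the data the model receives; it is not MLflow's own parsing code, and the example messages are made up.

```python
import json

import pandas as pd

# A JSON payload in "records" orientation: one object per row
payload = (
    '[{"text": "hey, I have an old bicycle for sale"},'
    ' {"text": "which graphics card should I buy?"}]'
)

# Illustration only: build the same shape of dataframe that we trained on
df_request = pd.DataFrame.from_records(json.loads(payload))

print(df_request.columns.tolist())
print(df_request.shape)
```

Each JSON object becomes a row and each key becomes a column, so a request with a `text` key lines up with the `text` column in `X_train`.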
## Defining our ML pipeline
The next step is to define our feature transformer and model pipeline. We're going to use Scikit-learn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) construct which allows us to easily define the components that we want to chain together.
For our first experiment we are going to keep things simple by using a [TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) which models each word (up to a vocabulary limit) as a separate sparse feature and takes into account the ratio of each word's term frequency (how many times it appears in a document) divided by word document frequency (how many documents each word appears in). You can read more about TF-IDF in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting). TF-IDF is older and simpler than current state of the art feature extraction methods but it can often work well as a lightweight baseline for text representation. We'll look at more complicated techniques in our next experiment.
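As a quick illustration of the weighting, the sketch below fits a `TfidfVectorizer` on three made-up sentences and compares the inverse document frequency of a few words. Note that scikit-learn's default is a smoothed, logarithmic IDF rather than a plain ratio, so treat the description above as the intuition rather than the exact formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration, not the newsgroups data)
docs = [
    "the bike is for sale",
    "the car is for sale",
    "the graphics card is fast",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# idf_ is aligned with the column indices stored in vocabulary_
idf = {word: vectorizer.idf_[col] for word, col in vectorizer.vocabulary_.items()}

print(matrix.shape)  # one row per document, one column per distinct word
for word in ("the", "sale", "bike"):
    print(word, round(idf[word], 3))
```

Words that appear in every document (like "the") get the lowest weight, while words that appear in only one document (like "bike") get the highest.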
We're also going to use a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for our classification model. Again, RF models serve as a relatively low-compute-intensity baseline and a starting point for our modelling.
The final component that you may not recognise is the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html). This provides a user-friendly way for scikit-learn to interact with pandas dataframes and it offers some very powerful matching for larger dataframes. In this case we are just using it to extract the `text` column from the emails, which is then passed to our TF-IDF Vectorizer for feature extraction and finally to the classifier for training or prediction.
The code looks like this:
```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
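# Select the 'text' column and turn it into (at most) 5000 TF-IDF features
ct = ColumnTransformer([
    ('tfidf', TfidfVectorizer(max_features=5000), 'text')
])

# Chain feature extraction and the classifier into a single estimator
pipe = Pipeline([
    ('ctransformer', ct),
    ('clf', RandomForestClassifier(n_estimators=10, max_depth=20))
])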
```
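Before training on the full dataset it can be worth a quick sanity check of how the ColumnTransformer behaves. The sketch below rebuilds a tiny version of the same pipeline on made-up rows (the rows, the `sender` column and the `toy_` names are assumptions for illustration) to show that only the `text` column reaches the vectorizer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical rows with an extra column the model should ignore
toy = pd.DataFrame({
    "text": [
        "bike for sale",
        "selling my old car",
        "new GPU drivers released",
        "graphics card benchmarks",
    ],
    "sender": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
})
toy_labels = ["misc.forsale", "misc.forsale", "comp.graphics", "comp.graphics"]

# Passing the column name as a plain string (not a list) hands the vectorizer
# a 1-D sequence of documents, which is what TfidfVectorizer expects
toy_ct = ColumnTransformer([("tfidf", TfidfVectorizer(), "text")])
toy_pipe = Pipeline([
    ("ctransformer", toy_ct),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipe.fit(toy, toy_labels)

# The default remainder="drop" means 'sender' never reaches the classifier
features = toy_pipe.named_steps["ctransformer"].transform(toy)
print(features.shape)
print(toy_pipe.predict(toy))
```

The feature matrix has one column per distinct word in `text` and nothing derived from `sender`, which is the behaviour we rely on later when the REST server hands the model a dataframe.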
Next we can train our model and log it, along with our initial evaluation metrics, to MLflow:
```python
import json
import os
import tempfile

import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.metrics import f1_score, classification_report

mlflow.set_experiment("My NLP Model")

with mlflow.start_run(run_name="TFIDF + Random Forest"):
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    mlflow.set_tag('client', 'That Email Company')

    # Record the expected input and output schema alongside the model
    signature = infer_signature(X_test, y_test)

    mlflow.log_metric('f1', f1_score(y_test, y_pred, average='micro'))
    mlflow.sklearn.log_model(pipe, "model", signature=signature)

    # Write the per-class report to a temporary directory and log it as an artifact
    with tempfile.TemporaryDirectory() as tmpdir:
        report = classification_report(y_test, y_pred, output_dict=True)
        with open(os.path.join(tmpdir, "classification_report.json"), 'w') as f:
            json.dump(report, f, indent=2)
        mlflow.log_artifacts(tmpdir, "reporting")
```
We train the model with `pipe.fit()` and then get predictions on the test set with `pipe.predict(X_test)`. This allows us to generate our classification report detailing Precision and Recall per class by comparing `y_pred` and `y_test` - the predicted and actual labels for our test set respectively. We also report the overall micro-averaged F1 score for the model to give a high level indication of how it is performing.
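One thing worth knowing about the micro-averaged F1 score: when every example has exactly one label, as it does here, it works out the same as plain accuracy. The sketch below uses a handful of made-up labels (not real model output) to show this, with the macro average alongside for contrast.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions, for illustration only
y_true = ["misc.forsale", "comp.graphics", "sci.space", "misc.forsale", "sci.space"]
y_hat = ["misc.forsale", "sci.space", "sci.space", "comp.graphics", "sci.space"]

micro = f1_score(y_true, y_hat, average="micro")
macro = f1_score(y_true, y_hat, average="macro")
accuracy = accuracy_score(y_true, y_hat)

print(f"micro F1: {micro:.3f}")
print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro:.3f}")
```

The macro average treats every class equally, so it drops when a small class is handled badly. That makes it a useful second number to log if your classes are imbalanced.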
The `infer_signature()` function is quite important here. This is where we tell MLflow what the inputs and outputs for this model look like. By passing in our `X_test` and `y_test` variables, MLflow will identify that it should expect a dataframe with a column called *text*.
You can verify that the signature was captured correctly by opening the run in the MLflow server GUI (run `mlflow server` and navigate to http://localhost:5000) and viewing the MLmodel file. You should see something like this:
```yaml
...
signature:
  inputs: '[{"name": "text", "type": "string"}]'
  outputs: '[{"type": "string"}]'
```
## Running our model
Now we are going to run our model as a REST API and make some API calls to it. Firstly, you are going to need to find the full URI of the model that we just trained. The easiest way is to open up the MLflow server GUI (run `mlflow server` and navigate to http://localhost:5000), open up the run we just created and copy the path from there:
{{<figure src="images/model-select.jpg" title="The full path to the models directory within the run is what we need - if it is shortened with ellipses you may need to expand your browser window to make sure you copy all of it.">}}
Now we can simply run the MLflow model server script in order to test it. The first time you run this it might take a few minutes to initialize since it will try to create a new conda environment for each model (based on the `run_id`). However, you should find it's pretty speedy for subsequent loads. Note that `mlflow server` and `mlflow models serve` both listen on port 5000 by default, so stop the tracking server first or pass a different `--port` to one of them.
FYI if you are using MLflow with cloud-backed storage (e.g. S3 or GCS instead of the local filesystem) then this should work but you will need to set environment variables so that the script can find the relevant security tokens etc. as [documented here](https://www.mlflow.org/docs/latest/tracking.html#artifact-stores). You can just substitute the `file:///` URI with the relevant string from your model run (e.g. `gs://`).
You should see some output like this:
```shell
> mlflow models serve -m file:///home/james/workspace/mlflow-example-project/notebooks/mlruns/1/872d6cd4b0874c99808c5259d9eb823b/artifacts/model
2020/12/29 14:00:28 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2020/12/29 14:00:29 INFO mlflow.pyfunc.backend: === Running command 'source /home/james/miniconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-6fd5007aa398d705b7ced4118b6b9ddf2ad4c4e4 1>&2 && gunicorn --timeout=60 -b 127.0.0.1:5000 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2020-12-29 14:00:29 +0000] [1063058] [INFO] Starting gunicorn 20.0.4
[2020-12-29 14:00:29 +0000] [1063058] [INFO] Listening at: http://127.0.0.1:5000 (1063058)
[2020-12-29 14:00:29 +0000] [1063058] [INFO] Using worker: sync
[2020-12-29 14:00:29 +0000] [1063064] [INFO] Booting worker with pid: 1063064
```
## Using the model
Now we should be able to test the model. Here's where it all comes together! Since we used the column transformer and used `infer_signature` when we logged our model, the server should:
* provide a basic level of input validation and provide user errors if columns the model doesn't know about are submitted
* understand that the unstructured text input will come from a column named `text` in a dataframe provided via CSV or JSON.
Without using the ColumnTransformer, the model may have behaved incorrectly or unpredictably by interpreting the first column in the input as the text input regardless of what it contained. The ColumnTransformer lets us specify an explicit contract with the REST server and the model signature provides clear instructions to the user (via validation error messages) on how to format the model input.
Using curl, you can run the following in your shell session:
```shell
curl --request POST \
--url http://127.0.0.1:5000/invocations \
--header 'Content-Type: application/json; format=pandas-records' \
--data '[
{"text":"hey, I have an old bicycle for sale in the Southampton area"}
]'
```
Hopefully you will see the following response:
```json
["misc.forsale"]
```
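If you would rather call the API from Python than from the shell, a minimal client sketch using only the standard library might look like this. The function names are hypothetical, and the URL assumes the model server is running locally on its default port as in the output above.

```python
import json
import urllib.request

# Assumed address of the local model server; adjust to your setup
INVOCATIONS_URL = "http://127.0.0.1:5000/invocations"


def build_request(texts, url=INVOCATIONS_URL):
    """Build (but do not send) a POST request in the pandas-records format."""
    body = json.dumps([{"text": t} for t in texts]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; format=pandas-records"},
        method="POST",
    )


def classify(texts, url=INVOCATIONS_URL):
    """Send the request to the model server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(texts, url)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(["hey, I have an old bicycle for sale in the Southampton area"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, calling `classify(["hey, I have an old bicycle for sale in the Southampton area"])` should return a list of labels like the curl response above.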
It looks like our model worked. Hooray! Now look what happens when we have a typo in our input data:
```shell
curl --request POST \
--url http://127.0.0.1:5000/invocations \
--header 'Content-Type: application/json; format=pandas-records' \
--data '[
{"txt":"hey, I have an old bicycle for sale in the Southampton area"}
]'
```
We get a response like so:
```json
{
"error_code": "BAD_REQUEST",
"message": "Model input is missing columns ['text']. Note that there were extra columns: ['txt']"
}
```
As you can see, we get an error because the 'text' column is missing. We also get a hint that 'txt' is an unexpected column. If we were to pass in extra columns alongside a correct 'text' column (e.g. a 'from' column containing the email address of the sender), the server would still return a prediction, silently discarding any columns that it does not recognise. It only warns about extra columns when a required field is missing.
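In client code you will probably want to tell these two kinds of response apart. The helper below is a hypothetical sketch based only on the two response shapes shown in this post (a JSON list on success, a JSON object with `error_code` and `message` on failure); other server versions may format their responses differently.

```python
import json


def labels_or_error(response_body):
    """Return the predicted labels, or raise ValueError with the server's message."""
    parsed = json.loads(response_body)
    # Error responses are JSON objects carrying 'error_code' and 'message';
    # successful predictions come back as a JSON list of labels
    if isinstance(parsed, dict) and "error_code" in parsed:
        raise ValueError(f"{parsed['error_code']}: {parsed.get('message', '')}")
    return parsed


good = '["misc.forsale"]'
bad = (
    '{"error_code": "BAD_REQUEST", "message": "Model input is missing columns '
    "['text']. Note that there were extra columns: ['txt']\"}"
)

print(labels_or_error(good))
try:
    labels_or_error(bad)
except ValueError as err:
    print("Request rejected:", err)
```

Raising an exception here means a misspelt column name fails loudly in the caller rather than being mistaken for a prediction.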
## Conclusion
In this post we've built an end-to-end script that trains and stores an NLP classification model in MLflow and we've also looked at serving the model using MLflow's built-in deployment tools. There are many ways to skin a cat, as the saying goes, but this is one tried and tested method for getting MLflow's built-in REST server to play ball.
I've provided the training script as a [GitHub gist](https://gist.github.com/ravenscroftj/1167487c0262b8dd1d92bcf4c2b7efd2).
Tune in next time when we will be showing how to use spaCy in our MLflow NLP pipeline.