---
title: Pickle 5 Madness with MLFlow and Python 3.6/3.7
author: James
type: post
resources:
- name: feature
  src: images/feature.jpg
date: 2021-01-14T11:42:28+00:00
url: /2021/01/14/pickle-5-madness-with-mlflow/
description: "Solving 'unsupported pickle protocol: 5' when trying to load mlflow models"
categories:
- Work
- Open Source
tags:
- machine-learning
- python
- ai
- devops
- mlops
---
{{<figure src="images/feature.jpg" caption="A jar of pickles by <a href='https://www.pexels.com/photo/crop-unrecognizable-person-with-jar-of-pickled-zucchini-3952045/'>Ksenia Charnaya</a>">}}
I recently came across an infuriating problem where an [MLFlow Python model](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html) I had trained on one system using Python `3.6` would not load on another system with an identical version of Python.
The exact problem was that when I ran `mlflow models serve -m <url/to/model/in/bucket>`, the service would crash, saying that the model could not be deserialized because of `ValueError: unsupported pickle protocol: 5`.
A quick bit of searching shows that this error happens when something is pickled with protocol 5, which was introduced in Python 3.8, and then loaded by a system running an earlier version of Python 3 (3.6 or 3.7), which only supports pickle protocols up to v4.
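If you want to confirm which protocol a pickled file was actually written with, the payload itself can tell you. The snippet below is a minimal sketch using only the standard library; `pickle_protocol` is a hypothetical helper name of my own, not something MLFlow or cloudpickle provides. It only reads the first opcode, so it should also work on an interpreter that cannot load the rest of the payload.

```python
import pickle
import pickletools
from typing import Optional


def pickle_protocol(payload: bytes) -> Optional[int]:
    """Return the protocol declared at the start of a pickle payload.

    Protocols 2 and above start with a PROTO opcode that carries the
    version number. Protocols 0 and 1 have no such header, so None is
    returned for them.
    """
    opcode, arg, _pos = next(pickletools.genops(payload))
    if opcode.name == "PROTO":
        return arg
    return None


if __name__ == "__main__":
    data = pickle.dumps({"answer": 42}, protocol=4)
    print(pickle_protocol(data))
```

For a model file on disk you would read the bytes first, for example with `open(path, "rb").read()`, and pass them to the helper.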
Under the covers, mlflow uses [cloudpickle](https://github.com/cloudpipe/cloudpickle), a library that provides extended pickle support, including the ability to pickle lambda functions and functions/classes defined interactively in the `__main__` module of your program or in a Jupyter notebook. By default, `cloudpickle` uses the highest pickle protocol available in your Python implementation (by checking the [pickle.HIGHEST_PROTOCOL](https://docs.python.org/3/library/pickle.html#pickle.HIGHEST_PROTOCOL) constant). This makes sense for most use cases, where you want to serialize objects and pass them around within the same Python setup: as a rule of thumb, more recent protocols are more efficient.
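To make the difference between "highest" and a pinned protocol concrete, here is a small sketch using the standard `pickle` module rather than cloudpickle itself. The object being pickled is made up for illustration. The second byte of any payload written with protocol 2 or above records the protocol number, which is what an older interpreter trips over.

```python
import pickle

# A stand-in object; in practice this would be your trained model.
obj = {"weights": [0.1, 0.2, 0.3]}

# Pinned to protocol 4, which Python 3.4 and later can all read.
portable = pickle.dumps(obj, protocol=4)

# Whatever the running interpreter considers newest (5 on Python 3.8+).
newest = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

if __name__ == "__main__":
    print("highest protocol on this interpreter:", pickle.HIGHEST_PROTOCOL)
    print("protocol byte in pinned payload:", portable[1])
    print("protocol byte in newest payload:", newest[1])
```

The pinned payload can be passed between interpreters of different ages, while the "newest" payload is only safe to load where that protocol is supported.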
However, this is a mystery because I'm running Python `3.6.12` on both systems, which does not support protocol 5, so how is it that cloudpickle is using this version to write the models? I still haven't worked this out, and if anyone knows, please get in touch because it is driving me mad!
Luckily for us, although the use of v5 is puzzling, there is a solution. The [pickle5](https://pypi.org/project/pickle5/) library backports protocol 5 support to Python 3.6 and 3.7. Furthermore, [cloudpickle will automatically detect and load this library if it is available](https://github.com/dask/distributed/pull/3849). Therefore, all we need to do is install `pickle5` in our MLFlow serving environment to make this issue go away.
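The "detect and load if available" behaviour is a common import fallback pattern. The sketch below shows the general idea; it is not copied from cloudpickle's source, and `load_payload` is a hypothetical helper name. It assumes `pickle5` exposes the same `loads` interface as the standard module, which is the point of the backport.

```python
try:
    # Backport of protocol 5 for older interpreters, if it is installed.
    import pickle5 as pickle
except ImportError:
    # Fall back to the standard library module.
    import pickle


def load_payload(payload: bytes):
    """Deserialize a payload with whichever pickle module was imported."""
    return pickle.loads(payload)


if __name__ == "__main__":
    data = pickle.dumps({"answer": 42}, protocol=4)
    print(load_payload(data))
    print("using module:", pickle.__name__)
```

On Python 3.8 and above the fallback branch is normally the one that runs, since protocol 5 is already built in there.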
The easiest way to make sure pickle5 is available to your server is by adding it to your conda env when you save your model to MLFlow:
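```python
import mlflow.pyfunc
import mlflow.sklearn

# SomeScikitLearnModel, X and y are placeholders for your own estimator and data
model = SomeScikitLearnModel()
model.fit(X, y)

conda_env = mlflow.pyfunc.get_default_conda_env()
# add pickle5 (and any other pip dependencies) to the serving environment
conda_env['dependencies'].append({'pip': [
    'pickle5',
    'scikit-learn==0.23.2',
    # ... some other dependencies
]})

mlflow.sklearn.log_model(model, "model", conda_env=conda_env)
```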
Note: I have already checked, and `pickle5` is not installed in the first environment. However, the Conda base version of Python on that system is `3.8.3`, so I suspect there is some weird leakage of the conda paths going on when I train my model.
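One way to test that suspicion is to print what the interpreter running the training code reports about itself. This is a rough diagnostic sketch using only the standard library; `describe_interpreter` is a hypothetical helper name, and it will not by itself explain the leakage, only show whether the interpreter is the one you expect.

```python
import importlib.util
import pickle
import sys


def describe_interpreter() -> dict:
    """Collect the details relevant to the pickle protocol mismatch."""
    return {
        # Path of the Python binary actually running this code.
        "executable": sys.executable,
        # Version string such as "3.6.12".
        "version": sys.version.split()[0],
        # 4 on Python 3.6/3.7, 5 on Python 3.8 and above.
        "highest_pickle_protocol": pickle.HIGHEST_PROTOCOL,
        # True if the pickle5 backport can be imported from this environment.
        "pickle5_importable": importlib.util.find_spec("pickle5") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running this from the same notebook or script that trains the model, and again in the serving environment, should make it obvious if the two are not really the same Python.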