---
title: Pickle 5 Madness with MLFlow and Python 3.6/3.7
author: James
type: post
resources:
  - name: feature
    src: images/feature.jpg
date: 2021-01-14T11:42:28+00:00
url: /2021/01/14/pickle-5-madness-with-mlflow/
description: "Solving 'unsupported pickle protocol: 5' when trying to load mlflow models"
tags:
  - Work
  - Open Source
  - machine-learning
  - python
  - devops
  - mlops
---


I recently came across an infuriating problem where an MLFlow Python model that I had trained on one system using Python 3.6 would not load on another system running an identical version of Python.

The exact problem was that when I ran `mlflow models serve -m <url/to/model/in/bucket>` the service would crash, saying that the model could not be unserialized because of `ValueError: unsupported pickle protocol: 5`.

A quick bit of searching shows that this error happens when something is pickled with protocol 5, which was introduced in Python 3.8, and then loaded by a system running an earlier version of Python 3 (3.6 or 3.7), which only supports pickle protocols up to v4. Note that Python 3.8 still defaults to protocol 4; protocol 5 is only used when a library explicitly asks for it.
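To confirm which protocol a saved model actually uses, it can help to inspect the pickle header. The sketch below is a minimal illustration and not part of MLFlow: `detect_pickle_protocol` is a hypothetical helper name, and it relies on the fact that pickles written with protocol 2 or later begin with a `PROTO` opcode (byte `0x80`) followed by the protocol number.

```python
import pickle
from typing import Optional


def detect_pickle_protocol(payload: bytes) -> Optional[int]:
    """Return the protocol number declared in a pickle header.

    Protocols 2 and above start with the PROTO opcode (0x80) followed by
    a single byte holding the protocol number. Protocols 0 and 1 have no
    such header, so None is returned for them.
    """
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    return None


# Example: serialise the same object with two different protocols.
obj = {"weights": [0.1, 0.2, 0.3]}
print(detect_pickle_protocol(pickle.dumps(obj, protocol=4)))  # 4
print(detect_pickle_protocol(pickle.dumps(obj, protocol=2)))  # 2
```

To check a real model, read the bytes of the pickled file from the model's artifact directory and pass them to the helper. The exact filename depends on how the model was saved.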

Under the covers mlflow uses cloudpickle, a library that provides extended pickle support, including the ability to pickle lambda functions and functions/classes defined interactively in the `__main__` module of your program or in a Jupyter notebook. By default cloudpickle uses the highest version of the pickle protocol available in your Python implementation (by checking the `pickle.HIGHEST_PROTOCOL` constant). This makes sense for most use cases, where you want to serialize objects and pass them around within the same Python setup. As a rule of thumb, more recent protocols are better performing and more efficient.
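The difference between "highest available" and a pinned protocol can be seen with the standard library alone. This is a sketch under assumptions: `dumps_portable` and `MAX_PORTABLE_PROTOCOL` are hypothetical names, and it uses plain `pickle` rather than cloudpickle so that it runs without extra dependencies. `cloudpickle.dumps` accepts a `protocol` argument in the same way, but this sketch does not assume MLFlow exposes it, so treat it as an illustration of the mechanism rather than a fix.

```python
import pickle

# Protocol 4 is the newest protocol that Python 3.6 and 3.7 can read.
MAX_PORTABLE_PROTOCOL = 4


def dumps_portable(obj) -> bytes:
    """Pickle obj with a protocol that older Python 3 interpreters can load."""
    protocol = min(pickle.HIGHEST_PROTOCOL, MAX_PORTABLE_PROTOCOL)
    return pickle.dumps(obj, protocol=protocol)


obj = {"coef": [1.0, 2.0], "intercept": 0.5}

newest = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
portable = dumps_portable(obj)

# For protocol 2 and later, the second byte of the payload is the protocol number.
print("highest protocol payload declares:", newest[1])
print("portable payload declares:", portable[1])
```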

However, this is a mystery: I'm running Python 3.6.12 on both systems, and that version does not support protocol 5, so how is it that cloudpickle is using this version to write the models? I still haven't worked this out, and if anyone knows please get in touch because it is driving me mad!

Luckily for us, although the use of v5 is puzzling, there is a solution. The pickle5 library is a backport that provides protocol 5 support on Python 3.6 and 3.7. Furthermore, cloudpickle will automatically detect and load this library if it is available. Therefore all we need to do is install pickle5 in our MLFlow serving environment to make this issue go away.
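The automatic detection amounts to a guarded import. The following is a minimal sketch of the general pattern. It is not copied from cloudpickle's source, so treat the exact form as an assumption:

```python
import sys

# On Python 3.6/3.7, prefer the pickle5 backport if it is installed;
# otherwise fall back to the standard library module.
if sys.version_info < (3, 8):
    try:
        import pickle5 as pickle
    except ImportError:
        import pickle
else:
    import pickle

# Report which module ended up being used and the newest protocol it supports.
print("pickle module in use:", pickle.__name__)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)
```

Running this inside the serving environment is a quick way to check that installing pickle5 had the intended effect.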

The easiest way to make sure pickle5 is available to your server is by adding it to your conda env when you save your model to MLFlow:


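```python
import mlflow
import mlflow.pyfunc
import mlflow.sklearn

# SomeScikitLearnModel, X and y are placeholders for your own estimator and data
model = SomeScikitLearnModel()
model.fit(X, y)

# Start from MLFlow's default environment and add extra pip requirements
conda_env = mlflow.pyfunc.get_default_conda_env()
conda_env['dependencies'].append({'pip': [
    'pickle5',  # backport of pickle protocol 5 for Python 3.6/3.7
    'scikit-learn==0.23.2',
    #... some other dependencies
    ]})

mlflow.sklearn.log_model(model, "model", conda_env=conda_env)
```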

Note: I have already checked, and pickle5 is not installed in the first environment. However, the Conda base version of Python on that system is 3.8.3, so I think there must be some weird leakage of the conda paths going on when I train my model.
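One way to investigate this kind of leakage is to print what the interpreter that trains the model actually sees. The snippet below is a diagnostic sketch using only the standard library; `describe_pickle_environment` is a hypothetical helper name, not an MLFlow or cloudpickle function.

```python
import importlib.util
import pickle
import sys


def describe_pickle_environment() -> dict:
    """Collect facts that determine which pickle protocol may get written."""
    return {
        # Which interpreter is really running this code
        "executable": sys.executable,
        "python_version": sys.version.split()[0],
        "highest_protocol": pickle.HIGHEST_PROTOCOL,
        # True if a pickle5 package is importable from the current sys.path
        "pickle5_importable": importlib.util.find_spec("pickle5") is not None,
        # Look here for entries that belong to a different environment
        "sys_path": list(sys.path),
    }


for key, value in describe_pickle_environment().items():
    print(f"{key}: {value}")
```

Running it in both the training and serving environments and comparing the output should show whether a Python 3.8 path or a stray pickle5 install is involved.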
