Merge pull request 'Add virtualenvs post' (#2) from james/feature/virtualenv into main

Reviewed-on: #2
ravenscroftj 2021-04-13 14:05:50 +00:00
commit 0b47525955
3 changed files with 167 additions and 0 deletions

@ -0,0 +1,167 @@
---
title: An opinionated guide to Python environments in 2021
author: James
type: post
resources:
- name: feature
src: images/feature.jpg
date: 2021-04-12T20:21:11+00:00
url: /2021/04/01/running-old-pytorch-docker/
description: A fairly thorough explanation and exploration of python package and environment managers as of April 2021 with some opinionated setups proposed for different user types at the end.
categories:
- Work
- Open Source
tags:
- python
- devops
---
{{<figure src="images/feature.jpg" caption="A person overwhelmed by boxes by <a href='https://www.pexels.com/@cottonbro?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Cottonbro</a>">}}
### Note: If you don't want to read the blah-blah context and history stuff then you can [jump to the recommendations](#recommended-setups-for-various-use-cases)
## The Problem
The need for virtual Python environments becomes fairly obvious early in most Python developers' careers when they switch between two projects and realise that they have incompatible dependencies (e.g. project1 needs `scikit-learn-0.21` and project2 needs `scikit-learn-0.24`). Unlike other mainstream languages like JavaScript (Node.js) and Java (with Maven) where dependencies are stored locally to the project, Python dependencies are installed at system or environment level and affect all projects that are using the same environment.
When you run into this problem you have two choices: you can either play around with the libraries you have installed and risk breaking things for one of your projects and not being able to get it back, or you can search "python incompatible dependencies install both?" with a hopeful glimmer in your eye as you hit enter.
## Virtual Environments and Package Managers
Virtual environments are the community-approved way to manage this issue and likely the first hit you'll get if you run the above search in your favourite search engine.
There are two related but distinct activities that we need to manage here:
1. If I'm working on different projects, I want to be able to quickly switch between them and their supporting Python runtime environments without breaking my setup for other projects. This is what an environment manager does.
2. I want to be able to quickly and easily install new Python libraries without worrying about their inter-dependencies. Furthermore, I'd like to be able to package up my project's list of dependencies so that others can quickly and easily use my code. This is what a package manager does.
There are lots of options available to you for both tasks and some tools try to solve both of them for you. Unfortunately this means that there are a large number of partially compatible standards.
## Virtual Environments: What are they?
A virtual environment is a copy of a Python interpreter, bundled away into a folder with project-specific libraries and dependencies. This allows you to keep your project runtimes logically separated and avoids inter-project dependency conflicts.
![virtualenv.png](:./images/virtualenv.png)
Simply: when you install a library into a virtual environment the files are literally in a separate folder to the dependencies of your other projects. Beautiful simplicity.
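
To make the "separate folder" idea concrete, here is a minimal sketch using the standard library's `venv` module. The folder name `demo-env` is made up for illustration, and the environment is created without pip purely to keep the example quick.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_env(env_dir: Path) -> Path:
    """Create a bare virtual environment and return the path to its config file."""
    # with_pip=False keeps this quick; pass with_pip=True to bootstrap pip as well.
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the Python install it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(Path(tmp) / "demo-env")  # hypothetical folder name
        print(cfg.read_text())  # records where the parent interpreter lives
    print("inside a virtualenv:", in_virtualenv())
```

Everything the environment needs lives under that one folder, so deleting the folder removes the environment without touching any other project.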
So then, how do we manage which libraries are installed in the environment and make sure that they are compatible with each other and the software that we're writing/using?
## pip: The Original Python Package Manager
`pip` is the official [Python Packaging Authority](https://www.pypa.io/en/latest/index.html) package management tool. It's been a recognisable part of a Python developer's arsenal for at least the last 10 years and has been bundled with the standard Python distribution since v3.4 in 2014 (although most operating systems shipped it as standard, or as a very easily installable extra, long before then).
Whilst it's the official option, `pip` is very bare bones. It doesn't know or care which environment it is being run in so you have to make sure that you take care of that by using tools like [venv](https://docs.python.org/3/library/venv.html) or [virtualenv](https://virtualenv.pypa.io/en/latest/). Furthermore `pip` doesn't store its list of dependencies in a file by default (you have to manually call `pip freeze > requirements.txt` to store your pip environment state in a text file every time you install or uninstall stuff) so this is yet another overhead.
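
As a rough illustration of what a `pip freeze` snapshot contains, the sketch below lists every distribution visible to the current interpreter as `name==version` lines using the standard library's `importlib.metadata` (Python 3.8+). This is an approximation for illustration only and not how pip itself implements `freeze`; the function name `freeze_lines` is made up.

```python
from importlib import metadata
from typing import List


def freeze_lines() -> List[str]:
    """Return 'name==version' lines for every distribution this interpreter can see."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Redirecting this output to requirements.txt gives a similar snapshot
    # to the manual `pip freeze > requirements.txt` step described above.
    print("\n".join(freeze_lines()))
```

Run it inside two different environments and you'll get two different lists, which is exactly why the snapshot has to be refreshed by hand whenever the environment changes.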
Another potential problem with pip is its [lack of deterministic builds](https://github.com/pypa/pip/issues/5102) - simply put: if you don't explicitly ask `pip` to install a particular version of a package or one of that package's dependencies it will download the *latest* version of that package. That means that there might be a bug introduced because a dependency-of-a-dependency that I installed on my system last month is a different version to the same package for someone who just installed my software today. What a headache!
None of this is particularly ideal - **more manual steps = more stuff you can forget about**
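
One cheap guard against the "latest version" surprise is to check that every line in a requirements file pins an exact version. The helper below is a minimal sketch with a deliberately simple regular expression; it is not a full requirement-specifier parser, and the sample package lines are made up. Note that it only inspects the lines you give it, so dependencies-of-dependencies still need to be listed (e.g. via a freeze) to be covered.

```python
import re
from typing import Iterable, List

# Matches an exact pin such as "scikit-learn==0.24.1" (optionally with extras).
# Wildcard pins like "pkg==1.*" are deliberately treated as unpinned.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;*]+\s*(;|$)"
)


def unpinned(requirements: Iterable[str]) -> List[str]:
    """Return the requirement lines that do not pin an exact version."""
    loose = []
    for raw in requirements:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and options like "-r other.txt"
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    sample = [
        "scikit-learn==0.24.1",
        "numpy>=1.19",
        "requests",
        "# a comment",
    ]
    print(unpinned(sample))  # ['numpy>=1.19', 'requests']
```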
## Pipenv: Environment + Package Management Swiss Army Knife
[pipenv](https://pipenv.pypa.io/en/latest/) is a tool that tries to solve many of the shortcomings of pip above:
* pipenv generates a `Pipfile` in your project which is conceptually similar to a [package.json](https://nodejs.dev/learn/the-package-json-guide) file in Node.js land. That is, a manifest at the top level of your project that describes which dependencies and Python version it requires. This file is maintained as you add/remove packages (no more manual `pip freeze` steps).
* Pipenv also maintains a `Pipfile.lock` file - this is a machine readable list of all of your dependencies and subdependencies allowing Pipenv to handle deterministic builds and avoid confusing dependency issues.
* Pipenv will transparently take care of your virtualenv management for you. You can run your commands as normal but prefixed with `pipenv run` and the library will make sure you're using the environment associated with whatever project you're trying to use.
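
Since `Pipfile.lock` is plain JSON, it's easy to peek at what has been pinned. The sketch below reads a hand-written, heavily trimmed stand-in for a lock file (real ones also carry hashes and other metadata); the package names and versions are made up for illustration.

```python
import json
from typing import Dict

# A trimmed, hypothetical Pipfile.lock. Real files also include a "hashes"
# list per package and extra "_meta" fields.
SAMPLE_LOCK = """
{
  "_meta": {"requires": {"python_version": "3.8"}},
  "default": {
    "requests": {"version": "==2.25.1"},
    "urllib3": {"version": "==1.26.4"}
  },
  "develop": {
    "pytest": {"version": "==6.2.3"}
  }
}
"""


def locked_versions(lock_text: str, section: str = "default") -> Dict[str, str]:
    """Map each package in one section of a Pipfile.lock to its pinned version."""
    lock = json.loads(lock_text)
    return {
        # Entries without a "version" key (e.g. git sources) map to "".
        name: entry.get("version", "").lstrip("=")
        for name, entry in lock.get(section, {}).items()
    }


if __name__ == "__main__":
    print(locked_versions(SAMPLE_LOCK))
    print(locked_versions(SAMPLE_LOCK, section="develop"))
```

Note that `urllib3` appears in the lock even though it is a dependency of `requests` rather than something you asked for directly: that is the sub-dependency pinning that makes builds repeatable.
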
Many people stopped using pipenv when they believed the project to have been abandoned in 2019. However, it turns out pipenv is still under active development. At the time of writing the most recent release was [v2020.11.15](https://github.com/pypa/pipenv/releases/tag/v2020.11.15).
## Poetry: A Challenger Environment + Package Management Option
[Poetry](https://python-poetry.org/) is yet another all-in-one virtualenv and package manager which offers similar functionality to `pipenv`. It gained a lot of users during the pipenv project hiatus mentioned above and has [similar performance and functionality](https://dev.to/frostming/a-review-pipenv-vs-poetry-vs-pdm-39b4).
The main reason I prefer poetry over pipenv today is its ability to natively generate "standard" Python packages (wheels, source distributions) that are fully PyPA-compliant (you can do this with `pipenv` but it requires manual maintenance of [setup.py and requirements.txt](https://greut.medium.com/building-a-python-package-a-docker-image-using-pipenv-233d8793b6cc) files which is another moving part that could go wrong in a big project).
Poetry also stores its project information in a `pyproject.toml` file, the same file that [PEP-621](https://www.python.org/dev/peps/pep-0621/#abstract) standardises for core project metadata. At the time of writing Poetry keeps its settings in its own `[tool.poetry]` table rather than the PEP-621 `[project]` table, but sharing the file is a step towards core metadata compatibility with other dependency management tools and indeed PyPA's own [setuptools](https://github.com/pypa/setuptools/issues/1688) toolkit.
## Where do [Mini/Ana]conda Fit Into All of This?
[Anaconda](https://docs.continuum.io/anaconda/install/) and its slimmed down cousin [Miniconda](https://docs.conda.io/projects/continuumio-conda/en/latest/user-guide/install/download.html) are alternatives to the standard CPython/PyPA Python distribution, published by Anaconda, Inc. (formerly Continuum Analytics). Both distributions use the [conda](https://docs.conda.io/projects/continuumio-conda/en/latest/index.html#) package + venv management tool.
Conda is open source but not directly compatible with PyPI packages. However, almost every package you can think of is available on [conda-forge](https://conda-forge.org/) - a community driven conda-compatible package repository. Furthermore, if something is missing from conda you can run `pip` inside your conda virtual environment and get it the normal way from PyPI.
### What About Deterministic Builds and Distributing Software Using Conda?
Well, conda environments and requirements can be stored in an [environment.yml](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) and the file format allows you to specify both packages installed via conda and pip. Furthermore, using the `conda env export` command to generate an `environment.yml` file dumps all of the packages installed in your current environment including their version information for deterministic recreation. Happy days!
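
To show what the mixed conda + pip format looks like, here is a small sketch that renders a minimal `environment.yml` by hand. The environment name, channel and package versions are made up for illustration; in practice you would write the file yourself or let `conda env export` generate it.

```python
from typing import Iterable, List


def render_environment_yml(
    name: str,
    conda_deps: Iterable[str],
    pip_deps: Iterable[str] = (),
    channels: Iterable[str] = ("conda-forge",),
) -> str:
    """Render a minimal environment.yml mixing conda and pip dependencies."""
    lines: List[str] = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in conda_deps]
    pip_deps = list(pip_deps)
    if pip_deps:
        # pip itself is a conda dependency; pip-installed packages then sit
        # in a nested list under a "pip:" key.
        lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {dep}" for dep in pip_deps]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_environment_yml(
        "demo",                          # hypothetical environment name
        conda_deps=["python=3.8", "numpy=1.19.2"],
        pip_deps=["requests==2.25.1"],
    ))
```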
### But Wait, There's More!
One feature of conda that is both controversial and convenient - the latter especially if you're a data scientist - is its management of system libraries and dependencies beyond Python. Conda can install C libraries that your Python packages depend on for you - including Nvidia's CUDA runtime libraries needed for tensorflow and torch.
If you've ever had the pleasure of trying to manually configure Nvidia drivers and CUDA runtime libraries on Linux you'll know how much of a pain this is. Even with pip/virtualenv environments, torch and tensorflow will try to link against and load whichever version of CUDA is installed system wide and that means that switching between versions of these libraries for different projects could mean messing with which system libraries are installed. Assuming you even have permission to do that (you might be on a shared GPU cluster), we're back at square one with tightly coupled inter-project dependencies - the very problem that virtualenv is supposed to fix for us but can't because of the dependency on CUDA. As usual there are of course [manual workarounds](https://towardsdatascience.com/installing-multiple-cuda-cudnn-versions-in-ubuntu-fcb6aa5194e2) but to me this is another moving part that could fail or go wrong - especially in a team environment.
As for the controversy? Well, purists don't tend to like the fact that `conda` also messes with system libraries - even if those libraries, as with pip/virtualenv-based environments, are copies isolated in folders.
### If Conda is So Good Why Don't You Marry It?
Conda is great but it has its downsides too:
* Its incompatibility with the pip/PyPA universe requires extra faff when building pip-compatible software (or you can accept that your software is doomed to only be run by conda-users).
* `environment.yml` files can be **too** deterministic and this is exacerbated by the system libraries issue. If I generate an `environment.yml` file on my Linux desktop and create a conda environment from it on my Mac it will usually fail because the Linux libraries are not compatible with the Mac libraries.
* Running conda inside Docker containers is a bit weird and, again, controversial: you always have permission to install whichever libraries you need inside a container, and there shouldn't be any use case where you'd need two conflicting environments/libraries inside one. [Again, it's perfectly possible](https://pythonspeed.com/articles/activate-conda-dockerfile/) but, in my opinion, it's another weak link.
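
On the "too deterministic" point: exported conda specs usually look like `name=version=build`, and it's the trailing build string that ties them to one platform. The sketch below strips that part from a list of specs; the sample specs are made up. Conda can also do this for you with `conda env export --no-builds`, and `conda env export --from-history` lists only the packages you explicitly asked for. Dropping build strings helps but is not a guarantee, since some packages only exist on certain platforms.

```python
from typing import Iterable, List


def strip_build(spec: str) -> str:
    """Turn 'name=version=build' into 'name=version'; leave other specs alone."""
    if "==" in spec:  # pip-style pins have no build string to remove
        return spec
    parts = spec.split("=")
    return "=".join(parts[:2]) if len(parts) == 3 else spec


def portable_specs(specs: Iterable[str]) -> List[str]:
    """Apply strip_build to every spec in an exported dependency list."""
    return [strip_build(spec) for spec in specs]


if __name__ == "__main__":
    exported = [
        "numpy=1.19.2=py38h54aff64_0",  # hypothetical Linux build string
        "python=3.8",
    ]
    print(portable_specs(exported))  # ['numpy=1.19.2', 'python=3.8']
```
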
## Best of Both Worlds: Conda + Pip-based Package Managers
Both [poetry](https://python-poetry.org/docs/managing-environments/) and [pipenv](https://pipenv.pypa.io/en/latest/advanced/#pipenv-and-other-python-distributions) can be used in combination with other virtual environment managers.
This, in my opinion, offers the best of both worlds: we can take the speed and ease-of-use of conda and team it up with the flexibility and compatibility offered by these pip-based package management offerings.
To use pipenv or poetry inside a conda-based environment you can simply activate the environment you want to use and then run `pip install poetry` or `pip install pipenv` - the tool of your choice will then be available for use whenever you have that environment active in the future.
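
When you start layering tools like this it's easy to lose track of which interpreter you're actually running. The snippet below is a small sanity check, assuming conda's activation scripts have exported the usual `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` variables; the function name is made up.

```python
import os
import sys
from typing import Dict


def describe_environment() -> Dict[str, object]:
    """Report which interpreter is running and which environment, if any, owns it."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # venv/virtualenv point sys.prefix at the environment folder while
        # sys.base_prefix stays on the parent Python install.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it with plain `python` and again via `poetry run python` or `pipenv run python` shows whether the tool is reusing the active conda environment or quietly creating its own.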
# Recommended Setups for Various Use Cases
## Some Principles for Use With The Recommendations Below
* **K.I.S.S** Keep it simple, stupid - these suggestions get more complicated for more nuanced use cases. My general philosophy, as mentioned earlier in the post, is to minimise moving parts so I definitely don't think everyone should be maintaining a `pyproject.toml`, a `requirements.txt` file, an `environment.yml` file for Windows users and an `environment.yml` file for Linux users. You know your use case and you can judge for yourself what is appropriate.
* **If I say *or* then it's up to you**. Pick one and be consistent. Quite a lot of the time `poetry` and `pipenv` offer very similar feature sets and which one you want to use is just a personal preference. They're not directly compatible though so if you pick `poetry` and your colleague picks `pipenv` you're going to have a bad time.
## I'm new to Python (Mac, Windows or Linux)
Firstly, if you're ***really really*** new to Python you might want to consider just getting familiar with the language without having to deal with virtual environments - most modern Linux distributions have Python 3.x pre-installed and if you're on a Mac you can get it trivially if you use [brew](https://brew.sh/). That said, virtualenvs are likely to be something that you'll need once you get into intermediate Python development so it might be better to dive in sooner rather than later.
- **If you're new to Python and you're running Mac, Windows or Linux** you might find [Anaconda](https://docs.anaconda.com/anaconda/install/index.html) to be the most intuitive, lowest barrier to entry option for getting started.
- **If you're on Windows**, Conda-based distributions definitely represent the lowest barrier to entry since you don't have to worry about setting up compilers and libraries. That said, if you are running [WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10) you probably already have Python 3 installed and can make use of some [excellent existing resources](https://towardsdatascience.com/python-and-the-wsl-597fbe05659f).
- **If you're new to deep learning**, again conda-based distributions are probably the lowest barrier to entry since conda can handle installing CUDA and dependencies for you.
## I'm an experienced Python developer and no one else needs to run my code
My suggestions assume that even though you're not planning to share your code with others, you're still interested in version controlling it and your dependencies in case your laptop breaks/gets stolen/spontaneously combusts and you need to re-create your project.
- **If you're on Linux or Mac and you don't need CUDA** then, assuming you have root permissions, you'll probably find that `pipenv` or `poetry` work well for you. I'm not suggesting conda as a first stop since most of the time Python 3.X is already available in modern *Nix environments so you might not need to install anything (except your chosen package manager via `pip`).
- **If you're on Linux or Mac and you need CUDA** then `conda` is likely the lowest barrier to entry. If you've never done it, try installing and using Tensorflow/PyTorch without conda once - for your own edification. Then you'll be able to feel the benefit.
- **If you're working on Windows outside of WSL** my default suggestion would still be conda due to its management of compiler toolchains and external libraries. If you're on Windows inside WSL then see above for Linux/Mac.
## I'm writing private/proprietary Python code that friends/colleagues need to use
* **If you all run the same OS** (for example you're all on the same analytics team in an organisation that uses Windows 10 company-wide) then K.I.S.S and use conda. If everyone is using the same OS you can probably safely mix `conda install` and `pip install` commands and version control your `environment.yml` file without worrying about cross-platform compatibility issues.
* **If you are writing code that needs to work cross-platform but you don't need CUDA** (e.g. you run macOS, your colleague runs Linux) then use `pipenv` or `poetry`. This will allow you to provide cross-platform deterministic builds/dependency resolution. Keep the `pyproject.toml` or `Pipfile` and respective lock files version controlled. If you or one of your colleagues runs Windows, they might find that the easiest way to interact with you is to install Anaconda and then run `pipenv` or `poetry` inside a conda-managed environment.
* **If you are writing code that needs to work cross-platform and uses CUDA** (e.g. you're building a PyTorch model on Linux and your friend wants to run it on Windows) then you're probably going to want to use `conda` to manage the environment (i.e. pull in specific versions of CUDA runtime libraries) and `poetry` or `pipenv` to manage pythonic dependencies. You could version control a hand written `environment.yml` with the specific versions of the CUDA runtime that your model is expecting (but without OS-specific build tags) and you will definitely want to version control your `pyproject.toml` or `Pipfile` as above. Alternatively, document the `conda install` commands the user should run in the project readme.
* **If you are writing code that you need to package as a wheel or egg for others to use** (e.g. it's a proprietary Python package you ship to customers) then I refer you to the section below, but use the `--repository` option of [poetry publish](https://python-poetry.org/docs/cli/#publish) to specify a private package repository.
## I'm writing Python code that I want to share with the community
* **If you don't need CUDA** then my suggestion would be standalone `poetry` since it has build/distribution tools built in and you can produce wheels and source distributions from the command line and submit them to [PyPI](https://pypi.org/). Version control your `pyproject.toml` and `poetry.lock` files.
* **If you need CUDA** then my suggestion is to use conda to create and manage your virtual environment and install CUDA and then use `poetry` to manage packages and the PyPI build (or use standalone `poetry` and manually manage your CUDA libraries - you masochist you!). You might want to version control your `environment.yml` but this file won't be needed for building or uploading your package to PyPI - it's just for you (and other developers) to use to quickly spin up your project locally in a development context.
* **If you want your package to be available in conda** then you'll need to use [conda-build](https://conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs-skeleton.html) to generate conda-specific package files and metadata for your project.
# PEP-582, PDM and the Future of Python Dependencies?
Without wishing to confuse matters further, I wanted to give [PEP-582](https://www.python.org/dev/peps/pep-0582/) an honourable mention.
This is a Python Enhancement Proposal that will allow the Python runtime to support `npm`-esque loading of dependencies from a directory inside the project (`__pypackages__`, much like `node_modules`). There is already a package manager, [PDM](https://pdm.fming.dev/), in development for working with these local directories.
This is an interesting and exciting paradigm shift that should simplify python packaging and remove the need completely for virtual environments. However, there are many issues to solve and the proposal is only for Python 3.8 with no plans to backport the functionality to earlier versions of the language runtime.
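
To give a feel for the mechanism, the sketch below imitates the lookup by hand: it puts a project-local `__pypackages__/X.Y/lib` directory (the layout described in the PEP) at the front of `sys.path` if it exists. This is only an illustration of the idea, not how the interpreter or PDM would implement it, and the function names are made up.

```python
import sys
import tempfile
from pathlib import Path
from typing import Optional


def pypackages_lib(project_dir: Path) -> Path:
    """Path PEP-582 proposes for project-local packages: __pypackages__/X.Y/lib."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return project_dir / "__pypackages__" / version / "lib"


def enable_local_packages(project_dir: Path) -> Optional[Path]:
    """Put the project's __pypackages__ lib directory first on sys.path, if present."""
    lib = pypackages_lib(project_dir)
    if not lib.is_dir():
        return None  # nothing to do: the project has no local packages
    if str(lib) not in sys.path:
        sys.path.insert(0, str(lib))
    return lib


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        pypackages_lib(project).mkdir(parents=True)
        print("now importing from:", enable_local_packages(project))
```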
Given how long it's taken some users to make the jump from Python 2.X to Python 3.X, it is likely that virtual environments are going to be around for a few more years to come.