Matt Triano
July 13, 2023
This post will demonstrate how to reproduce an old conda env (one that wasn’t exported to an environment.yml file at the time of analysis/usage) needed to rerun old analysis.¹
Install conda and configure it as shown in steps 3 & 4 here.
Look at the old analysis and any available metadata to determine: (1) an upper-bound date for package versions, and (2) which packages were used.
For this demonstration, I’m reproducing an env I used to analyze crime and prison data back in 2018. Specifically, I want to produce an env that enables me to rerun these notebooks:
Looking at the latest commits for these notebooks, we can set an upper bound on versions used. The latest commits for these notebooks are:
From the sidequest described in Section 4.1.1, I’ve decided on using June 15, 2018 as the upper-bound date for analysis.
For this, I simply look at the import statements, which are compiled below.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
import os
from bokeh.sampledata.us_states import data as states
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.models import LinearColorMapper, ColorBar, BasicTicker
This boils down to [pandas, numpy, seaborn, matplotlib, IPython, and bokeh] (there’s also os, but that’s a python built-in).
First, I want to determine the version of python to use. Looking at the release dates of python versions, we see that python v3.6 was released on 2016-12-23 and python v3.7 was released on 2018-06-27, so it’s most likely that python v3.6 was used.
Looking at the raw file, specifically a few lines from the very bottom of the document, the metadata block indicates the kernel used python v3.6.4.
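The date reasoning above can be sketched as a quick sanity check (release dates as cited above, cutoff from Section 4.1.1):

```python
from datetime import date

# Release dates cited in the post
py36_release = date(2016, 12, 23)
py37_release = date(2018, 6, 27)
cutoff = date(2018, 6, 15)  # upper-bound date chosen in Section 4.1.1

# python 3.7 came out after the cutoff, so 3.6 is the newest viable minor version
assert py36_release <= cutoff < py37_release
print("use python 3.6")
```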
Next, I want to determine max versions for pandas, numpy, seaborn, matplotlib, IPython, and bokeh. I know pandas uses numpy and seaborn uses matplotlib, so I can ignore numpy and matplotlib.
I’ll look at each package’s releases page to see the last version before the cutoff date.
Looking at the raw file, specifically by ctrl+f searching “version”, we see that bokeh v0.12.16 was used.
Also, IPython was included as it’s a dependency of the (jupyter) notebook package (which I used to develop the notebooks). There will probably be several other infrastructural packages pulled in as dependencies as well.
From the prior step, we determined the following version constraints.
python=3.6.4
pandas<=0.24.2
seaborn<=0.8.1
bokeh==0.12.16
notebook<=5.5.0
I’ll run the command below to create a conda env named prisons_post_env that meets those constraints.
conda create --name prisons_post_env "python=3.6.4" "pandas<=0.24.2" "seaborn<=0.8.1" "bokeh=0.12.16" "notebook<=5.5.0"
Activate that conda env and register it as a notebook kernel.
Now you can start up a notebook server (I’ve specified a port number as I’m already running a jupyterlab server on the default port, 8888).
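Those steps look like the following (the activate and kernel-registration commands are the same ones used again later in the post; the port number here is an assumption, as the exact value isn’t shown):

```
conda activate prisons_post_env
python -m ipykernel install --user --name prisons_post_env --display-name "(prisons_post_env)"
jupyter notebook --port 8889   # 8889 is an assumed port; pick any free non-default port
```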
While trying to open up the part 2 notebook, the connection attempt hung and the terminal showed an error.
...
~/miniconda3/envs/prisons_post_env/lib/python3.6/site-packages/notebook/base/zmqhandlers.py:284: RuntimeWarning: coroutine 'WebSocketHandler.get' was never awaited
Googling the error took me to a Stack Overflow question indicating that tornado v6+ caused the issue, so let’s downgrade tornado in our env, then restart our notebook server (press ctrl+c in the terminal, shut it down, then start it back up with the earlier jupyter notebook command). When you reopen the Crime_and_Prisons_part2.ipynb notebook, you should find that it successfully connects to the kernel and you can run through cells, at least up until the cell that calls the plot_male_v_female_by_state_sea() function.
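For reference, the tornado downgrade mentioned above can be sketched as follows (an assumption on my part, but the `"tornado<6"` pin matches the constraint used when the env is later recreated):

```
conda install "tornado<6"
```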
Upon attempting to run that cell, you will see another error message.
~/miniconda3/envs/prisons_post_env/lib/python3.6/site-packages/matplotlib/artist.py in update(self, props)
...
AttributeError: 'Rectangle' object has no property 'normed'
After a few minutes of googling the error message along with the word matplotlib, I’ve determined that the problem is that the installed seaborn version’s distplot() function calls matplotlib’s hist() plotter function using a keyword argument, normed, that was changed in the matplotlib v3.2.0 release. And by running this I see this env has matplotlib v3.3.2 installed. So let’s downgrade matplotlib.
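The exact commands didn’t survive extraction; a plausible sketch (the version pin is my assumption, based on the normed discussion above):

```
conda list matplotlib            # check the installed version (shows 3.3.2 here)
conda install "matplotlib<3.2"   # review the proposed installation plan before confirming
```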
Looking at the installation plan, I see that conda wants to upgrade a lot of packages in violation of the earlier constraints. I also tried adding the --freeze-installed option, but conda still wanted to make updates, including these:
bokeh 0.12.16-py36_0 --> 2.3.3-py36h5fab9bb_0
notebook 5.5.0-py36_0 --> 6.3.0-py36h5fab9bb_0
pandas 0.24.2-py36hb3f55d8_1 --> 1.1.5-py36h284efc9_0
python 3.6.4-0 --> 3.6.15-hb7a2778_0_cpython
seaborn 0.8.1-py_1 --> 0.11.2-hd8ed1ab_0
tornado 5.1.1-py36h14c3975_1000 --> 6.1-py36h8f6f2f9_1
...
So let’s just completely remove and remake the env with all of our constraints, old and new.
After shutting down the jupyter notebook server and ensuring the env is not activated in any open terminal, remove the env directory, then recreate the env with our additional constraints. Through a fair bit of trial and error, I determined that one of my preferred configs (namely, prioritizing the conda-forge channel) was making it impossible to reconcile these constraints, so I overrode the configured channels in favor of the default channel that I was probably using 5 years ago. I’ll also add on the xlrd package, as the part3 notebook loads a .xls file.
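The removal step can be done a few ways; a sketch using conda’s own subcommand (the post mentions deleting the env directory, so removing that directory directly would also work):

```
conda deactivate
conda env remove --name prisons_post_env
```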
conda create --name prisons_post_env --override-channels --channel defaults "python=3.6.4" "pandas<=0.24.2" "seaborn<=0.8.1" "bokeh=0.12.16" "notebook<=5.5.0" "tornado<6" "matplotlib=2.2.2" xlrd
Then activate and re-register the env
conda activate prisons_post_env
(prisons_post_env) ...$ python -m ipykernel install --user --name prisons_post_env --display-name "(prisons_post_env)"
and restart the notebook server.
Now all three of those old notebooks can be run successfully (after collecting and locating the data in the right places).
You may wonder “How could changing the package source (aka ‘channel’) make the env solvable? The package versions were the same!”
That’s a good observation and intuition! If converting a python package into a conda package were impossible to mess up, there wouldn’t be any difference in conda packages for a given python package version across channels. But conda isn’t just a tool for packaging python code; it’s a tool for packaging and distributing any executable, and that often means instructions for building the package and for resolving dependencies are needed. In essence, you need a recipe for making the package. In conda terms, that recipe is a package’s meta.yaml file, and the file provides places to point to build scripts and define dependencies. Each conda channel is maintained separately, so each can have a different meta.yaml file for a given python package version. Consequently, if dependencies are inconsistent across channels, an env that’s consistent when pulling exclusively from one channel may be unresolvable when pulling exclusively from another channel.
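To make that concrete, here’s a hypothetical fragment of a recipe’s meta.yaml (illustrative only, not an actual recipe from either channel) showing where channel maintainers can declare different dependencies for the same package version:

```yaml
package:
  name: somepkg
  version: "1.2.3"

requirements:
  run:
    # One channel's recipe might pin a dependency loosely...
    - numpy >=1.11
    # ...while another channel's recipe for the same version pins tightly:
    # - numpy >=1.15,<1.16
```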
Now that we have a working env, let’s export both the full specification and a cross-platform specification (which only includes the explicitly requested packages).
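A sketch of the two exports (the `--from-history` flag limits the spec to the explicitly requested packages; the second filename is my choice):

```
# Full, OS-specific spec (pins every package, including transitive deps)
conda env export -n prisons_post_env > environment.yml
# Cross-platform spec: only the packages explicitly requested at create-time
conda env export -n prisons_post_env --from-history > environment_from_history.yml
```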
This post showed how to reverse engineer the conda env that was used to run old notebooks.
This post also showed a concrete example of a valid conda env seeming inconsistent due to conda weirdness, as well as a troubleshooting strategy (albeit not a very generalizable one) for resolving the problem.
While working through technical projects, little problems tangential to the main task often pop up and block progress. Often these side quests can be ignored, but this one needed resolving: I had to pin down a trustworthy upper-bound date for my version constraints.
I don’t recall why I updated Parts 1 and 2 after Part 3. I doubt I made substantive changes, but as I’m using the metadata of git commits to determine changes, it only makes sense to look at the diffs. Unfortunately, while github indicates a relatively small number of lines were modified, the diffs are too large to display in the browser and I have to review them locally. (This is a well-known drawback of jupyter notebooks; plots get represented by very long plaintext strings and rerunning a notebook often changes every line in version control, so diffs can be hard to review.)
So I cloned the repo, copied down the hash of the commit I’m interested in (commit 0857e6c), and looked at the diffs of that file in that commit with git.
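The specific command got lost in extraction; a plausible equivalent (the notebook filename and its location at the repo root are assumptions from context):

```
git show 0857e6c -- Crime_and_Prisons_part2.ipynb
```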
Most of the changes only changed the cell execution-order number or uuid-looking tags. There may also have been changes to the extremely long string representations used to render plots, but they were too long to crosscheck. In fact, those long strings took so long to page through that I stopped reviewing that way and just compared the rendered notebooks (pre-commit vs commit) and concluded there weren’t any substantive changes, so the timestamp from the earlier commit is adequate.
Context: Over the years, I’ve written up a number of posts for a number of different personal blogs, and I want to consolidate those posts into one platform. Many of my posts involved leveraging the capabilities of jupyter notebooks, and while I’ve always used conda envs to avoid polluting my base python environment, I didn’t reliably export my envs or keep separate envs for each project or purpose. So I occasionally run into a situation where I want to rerun old code on a new machine, but I have to go through extra steps to recreate the env.↩︎