Matt Triano
July 13, 2023
This post will demonstrate how to reproduce an old conda env (one that wasn’t exported to an environment.yml file at the time of analysis/usage) needed to rerun old analysis.¹
Install conda and configure it as shown in steps 3 & 4 here.
Look at the old analysis and any available metadata to determine: (1) an upper-bound date for package versions, and (2) which packages were used.
For this demonstration, I’m reproducing an env I used to analyze crime and prison data back in 2018. Specifically, I want to produce an env that enables me to rerun these notebooks:
Looking at the latest commits for these notebooks, we can set an upper bound on versions used. The latest commits for these notebooks are:
From the sidequest described in Section 4.1.1, I’ve decided on using June 15, 2018 as the upper-bound date for analysis.
For this, I simply look at the import statements, which are compiled below.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
import os
from bokeh.sampledata.us_states import data as states
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.models import LinearColorMapper, ColorBar, BasicTicker
This boils down to [pandas, numpy, seaborn, matplotlib, IPython, and bokeh] (there’s also os, but that’s a python built-in).
First, I want to determine the version of python to use. Looking at the release dates of python versions, we see that python v3.6 was released on 2016-12-23 and python v3.7 was released on 2018-06-27, so it’s most likely that python v3.6 was used.
Looking at the raw file, specifically a few lines from the very bottom of the document, the metadata block indicates the kernel used python v3.6.4.
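The date reasoning above can be sketched as a quick sanity check (release dates as cited above, cutoff from Section 4.1.1):

```python
from datetime import date

# Release dates cited in the post
py36_release = date(2016, 12, 23)
py37_release = date(2018, 6, 27)
cutoff = date(2018, 6, 15)  # upper-bound date chosen in Section 4.1.1

# python 3.7 came out after the cutoff, so 3.6 is the newest viable minor version
assert py36_release <= cutoff < py37_release
print("use python 3.6")
```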
Next, I want to determine max versions for pandas, numpy, seaborn, matplotlib, IPython, and bokeh. I know pandas uses numpy and seaborn uses matplotlib, so I can ignore numpy and matplotlib.
I’ll look at each package’s releases page to see the last version before the cutoff date.
Looking at the raw file, specifically by ctrl+f searching “version”, we see that bokeh v0.12.16 was used.
Also, IPython was included as it’s a dependency of the (jupyter) notebook package (which I used to develop the notebooks). There will probably be several other infrastructural packages pulled in as dependencies as well.
From the prior step, we determined the following version constraints.
python=3.6.4
pandas<=0.24.2
seaborn<=0.8.1
bokeh==0.12.16
notebook<=5.5.0
I’ll run the command below to create a conda env named prisons_post_env that meets those constraints.
conda create --name prisons_post_env "python=3.6.4" "pandas<=0.24.2" "seaborn<=0.8.1" "bokeh=0.12.16" "notebook<=5.5.0"
Activate that conda env and register it as a notebook kernel.
Now you can start up a notebook server (I’ve specified a port number as I’m already running a jupyterlab server on the default port, 8888).
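Those steps look like the following (the activate and kernel-registration commands are the same ones used again later in the post; the port number here is an assumption, as the exact value isn’t shown):

```
conda activate prisons_post_env
python -m ipykernel install --user --name prisons_post_env --display-name "(prisons_post_env)"
jupyter notebook --port 8889   # 8889 is an assumed port; pick any free non-default port
```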
While trying to open up the part 2 notebook, the connection attempt hung and the terminal showed an error.
...
~/miniconda3/envs/prisons_post_env/lib/python3.6/site-packages/notebook/base/zmqhandlers.py:284: RuntimeWarning: coroutine 'WebSocketHandler.get' was never awaited
Googling the error took me to a Stack Overflow question indicating that tornado v6+ caused the issue, so let’s downgrade tornado in our env, then restart our notebook server (press ctrl+c in the terminal, shut it down, then start it back up with the earlier jupyter notebook command). When you reopen the Crime_and_Prisons_part2.ipynb notebook, you should find that it successfully connects to the kernel and you can run through cells, at least up until the cell that calls the plot_male_v_female_by_state_sea() function.
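For reference, the tornado downgrade mentioned above can be sketched as follows (an assumption on my part, but the `"tornado<6"` pin matches the constraint used when the env is later recreated):

```
conda install "tornado<6"
```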
Upon attempting to run that cell, you will see another error message.
~/miniconda3/envs/prisons_post_env/lib/python3.6/site-packages/matplotlib/artist.py in update(self, props)
...
AttributeError: 'Rectangle' object has no property 'normed'
After a few minutes of googling the error message along with the word matplotlib, I’ve determined that the problem is that the installed seaborn version’s distplot() function calls matplotlib’s hist() plotter function using a keyword argument, normed, that was changed in the matplotlib v3.2.0 release. And by running this I see this env has matplotlib v3.3.2 installed. So let’s downgrade matplotlib.
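The exact commands didn’t survive extraction; a plausible sketch (the version pin is my assumption, based on the normed discussion above):

```
conda list matplotlib            # check the installed version (shows 3.3.2 here)
conda install "matplotlib<3.2"   # review the proposed installation plan before confirming
```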
Looking at the installation plan, I see that conda wants to upgrade a lot of packages in violation of the earlier constraints. I also tried adding the --freeze-installed option, but conda still wanted to make updates, including these:
bokeh 0.12.16-py36_0 --> 2.3.3-py36h5fab9bb_0
notebook 5.5.0-py36_0 --> 6.3.0-py36h5fab9bb_0
pandas 0.24.2-py36hb3f55d8_1 --> 1.1.5-py36h284efc9_0
python 3.6.4-0 --> 3.6.15-hb7a2778_0_cpython
seaborn 0.8.1-py_1 --> 0.11.2-hd8ed1ab_0
tornado 5.1.1-py36h14c3975_1000 --> 6.1-py36h8f6f2f9_1
...
So let’s just completely remove and remake the env with all of our constraints, old and new.
After shutting down the jupyter notebook server and ensuring the env is not activated in any open terminal, remove the env directory, then recreate the env with our additional constraints. Through a fair bit of trial and error, I determined that one of my preferred configs (namely, prioritizing the conda-forge channel) was making it impossible to reconcile these constraints, so I overrode the configured channels in favor of the default channel that I was probably using 5 years ago. I’ll also add on the xlrd package, as the part3 notebook loads a .xls file.
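The removal step can be done a few ways; a sketch using conda’s own subcommand (the post mentions deleting the env directory, so removing that directory directly would also work):

```
conda deactivate
conda env remove --name prisons_post_env
```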
conda create --name prisons_post_env --override-channels --channel defaults "python=3.6.4" "pandas<=0.24.2" "seaborn<=0.8.1" "bokeh=0.12.16" "notebook<=5.5.0" "tornado<6" "matplotlib=2.2.2" xlrd
Then activate and re-register the env
conda activate prisons_post_env
(prisons_post_env) ...$ python -m ipykernel install --user --name prisons_post_env --display-name "(prisons_post_env)"
and restart the notebook server.
Now all three of those old notebooks can be run successfully (after collecting and locating the data in the right places).
You may wonder “How could changing the package source (aka ‘channel’) make the env solvable? The package versions were the same!”
That’s a good observation and intuition! If converting a python package into a conda package were impossible to mess up, there wouldn’t be any difference in conda packages for a given python package version across channels. But conda isn’t just a tool for packaging python code; it’s a tool for packaging and distributing any executable, and that often means instructions for building the package and for resolving dependencies are needed. In essence, you need a recipe for making the package. In conda terms, that recipe is a package’s meta.yaml file, and the file provides places to point to build scripts and define dependencies. Each conda channel is maintained separately, so each can have a different meta.yaml file for a given python package version. Consequently, if dependencies are inconsistent across channels, an env that’s consistent when pulling exclusively from one channel may be unresolvable when pulling exclusively from another channel.
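To make that concrete, here’s a hypothetical fragment of a recipe’s meta.yaml (illustrative only, not an actual recipe from either channel) showing where channel maintainers can declare different dependencies for the same package version:

```yaml
package:
  name: somepkg
  version: "1.2.3"

requirements:
  run:
    # One channel's recipe might pin a dependency loosely...
    - numpy >=1.11
    # ...while another channel's recipe for the same version pins tightly:
    # - numpy >=1.15,<1.16
```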
Now that we have a working env, let’s export both the full specification and a cross-platform specification (which only includes the explicitly requested packages).
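A sketch of the two exports (the `--from-history` flag limits the spec to the explicitly requested packages; the second filename is my choice):

```
# Full, OS-specific spec (pins every package, including transitive deps)
conda env export -n prisons_post_env > environment.yml
# Cross-platform spec: only the packages explicitly requested at create-time
conda env export -n prisons_post_env --from-history > environment_from_history.yml
```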
This post showed how to reverse engineer the conda env that was used to run old notebooks.
This post also showed a concrete example of a valid conda env seeming inconsistent due to conda weirdness, as well as a troubleshooting strategy (albeit not a very generalizable one) for resolving the problem.
While working through technical projects, little problems tangential to the main task often pop up and block progress. Often these side quests can be ignored, but this one needed resolving: I had to pin down a trustworthy upper-bound date for my version constraints.
I don’t recall why I updated Parts 1 and 2 after Part 3. I doubt I made substantive changes, but as I’m using the metadata of git commits to determine changes, it only makes sense to look at the diffs. Unfortunately, while github indicates a relatively small number of lines were modified, the diffs are too large to display in the browser and I have to review them locally. (This is a well-known drawback of jupyter notebooks; plots get represented by very long plaintext strings and rerunning a notebook often changes every line in version control, so diffs can be hard to review.)
So I cloned the repo, copied down the hash of the commit I’m interested in (commit 0857e6c), and looked at the diffs of that file in that commit with git.
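The specific command got lost in extraction; a plausible equivalent (the notebook filename and its location at the repo root are assumptions from context):

```
git show 0857e6c -- Crime_and_Prisons_part2.ipynb
```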
Most of the changes only changed the cell execution-order number or uuid-looking tags. There may also have been changes to the extremely long string representations used to render plots, but they were too long to crosscheck. In fact, those long strings took so long to page through that I stopped reviewing that way and just compared the rendered notebooks (pre-commit vs commit) and concluded there weren’t any substantive changes, so the timestamp from the earlier commit is adequate.
Context: Over the years, I’ve written up a number of posts for a number of different personal blogs, and I want to consolidate those posts into one platform. Many of my posts involved leveraging the capabilities of jupyter notebooks, and while I’ve always used conda envs to avoid polluting my base python environment, I didn’t reliably export my envs or keep separate envs for each project or purpose. So I occasionally run into a situation where I want to rerun old code on a new machine, but I have to go through extra steps to recreate the env.↩︎