Update Note

The majority of code and analysis in this post was originally written back in mid-2018 (in this notebook). I’ve consolidated some cells, automated data retrieval, and added cell labels to make things render nicely, but otherwise I’ve left the work as-is. Compared to my current work products, this old code is very messy and unpolished, but like Eric Ma (creator of the networkx graph data analysis package), I believe in showing newer data analysts and scientists I mentor that no one in this field started off with mastery of git, pandas, bash, etc, and that everyone who lasts loves to keep learning and improving.

After I integrate these old posts into this blog, I’ll write an EDA post that starts from scratch using up-to-date data.

1 Exploration in Criminal Justice and Corrections Data

I’ve repeatedly read the statistic that the US makes up around 4% of the global population, but makes up more than 20% of the prison population. Well defined questions are crucial tools for making sense of raw data, and this massive asymmetry is provokes some important questions, such as those listed below.

Why isn’t our prison population proportional to our regular population?
- Do other countries have fewer criminals?
- Do we have a drastically different approach to criminal justice than the rest of the world?
- Are our prison sentences just much longer?
What are the most common crimes?
- How many people are in for those crimes?
How many people enter the prison system each month?
- How many exit?
Recidivism
- How many prisoners have been to prison at least once before?
- What is the recidivism rate for 1st time prisoners who served in public prisons?
- And for private prisons?
- Which prisons have the lowest recidivism rates?

1.1 Table of Contents

Exploration in Criminal Justice and Corrections Data
Datasets
Total Imprisonment Rates
3a. Observations
Imprisonment by Race
4a. Observations
Imprisonment by Gender
5a. Observations
To Be Continued

1.2 Datasets

With some questions in hand, I can start trying to gather data that could shed light on the situation. I may not be able to get data with the resolution needed to answer some of these questions (e.g. I may not be able to find data broken down by month).

The government makes some data publicly available. While government data is rarely current, often involves a bit of cleanup, and always involves hunting through old government sites, it’s the only source for much of this extremely valuable information. From the Bureau of Justice Statistics, I found a government project called the Prisoner Series and I downloaded the most recent data.

Imports, styling, and path definition

import os
from urllib.request import urlretrieve

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
%matplotlib inline

# Notebook Styling
pd.options.display.float_format = lambda x: "%.5f" % x
pd.options.display.max_columns = None
plt.rcParams['figure.figsize'] = 10,10


DATA_DIR_PATH = os.path.join('data', 'prison')
os.makedirs(DATA_DIR_PATH, exist_ok=True)

@ticker.FuncFormatter
def y_formatter(x, pos):
    return '{:4.0f}'.format(x/1000)

1.3 Total Imprisonment Rates (Table p16f01)

The values in the by-race dataset are [per 100k population].

Downloading the relevant data and unzipping it

import zipfile
import requests

url = "https://bjs.ojp.gov/redirect-legacy/content/pub/sheets/p16.zip"
file_path = os.path.join(DATA_DIR_PATH, "p16.zip")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

if not os.path.isfile(file_path):
    response = requests.get(url, headers=headers, stream=True)
    if response.status_code == 200:
        with open(file_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)

    with zipfile.ZipFile(file_path, 'r') as zf:
        zf.extractall(DATA_DIR_PATH)

Loading and preprocessing the f01 dataset, Total Counts

CSV_PATH = os.path.join('data', 'prison', 'p16f01.csv')
total_df = pd.read_csv(CSV_PATH, encoding='latin1', header=12, index_col='Year', parse_dates=['Year'])
total_df.dropna(inplace=True)
total_df.index = total_df.index.values.astype(int)

print(total_df.shape)
total_df.head()

(39, 2)

	All ages	Age 18 or older
1978	131.00000	183.00000
1979	133.00000	185.00000
1980	138.00000	191.00000
1981	153.00000	211.00000
1982	170.00000	232.00000

Plotting out the f01 data

with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10,7))
    ax.plot(total_df['All ages'])
    ax.plot(total_df['Age 18 or older'])
    ax.set_title('Total Imprisonment Rates (table: p16f01)')
    ax.set_xlabel('Year')
    ax.set_ylabel('People imprisoned (per relevant 100k US population)')
    ax.legend()
    ax.set_ylim([0, 1.1*max([total_df['All ages'].max(), 
                             total_df['Age 18 or older'].max()])])

1.4 Observations

The imprisonment rate normalized to the entire population of US residents is lower than the imprisonment rate normalized to the population of US residents that are 18 or older. This indicates that the imprisonment rate for people under age 18 is much lower than for people 18 or older. That fits with my intuition.

We also see that the imprisonment rate climbs steadily from 1980 up through 1999, dips, and peaks around 2007-2008, at which point it starts trending down. In 1978, 183 people were in prison per every 100k US residents 18 or older. In 2007, 670 people were in prison per every 100k US residents 18 or older. That’s a 266% increase in the imprisonment rate over that 29 year span. That’s huge.

1.4.1 New Questions:

What was responsible for the increase in the rate of imprisonment? What was responsible for the decrease?
- Was it proportional to the actual crime rates?
- Was it a product of different enforcement policies?

To answer these new questions, we will probably have to look at other sets of data.

1.5 Imprisonment by Race (Table p16f02)

Imprisonment rate of sentenced prisoners under the jurisdiction of state or federal correctional authorities, per 100,000 U.S. residents age 18 or older, by race and Hispanic origin, December 31, 2006–2016

Loading and preprocessing the f02 data, Counts by Race

CSV_PATH = os.path.join('data', 'prison', 'p16f02.csv')
race_df = pd.read_csv(CSV_PATH, encoding='latin1', header=12)
race_df.dropna(inplace=True)
race_df.rename(columns={'Unnamed: 0': 'Year'}, inplace=True)
race_df.set_index('Year', inplace=True)
race_df

	White/*	Black/*	Hispanic
Year
2006	324.00000	2261.00000	1073.00000
2007	317.00000	2233.00000	1094.00000
2008	316.00000	2196.00000	1057.00000
2009	308.00000	2134.00000	1060.00000
2010	307.00000	2059.00000	1014.00000
2011	299.00000	1973.00000	990.00000
2012	293.00000	1873.00000	949.00000
2013	291.00000	1817.00000	922.00000
2014	289.00000	1754.00000	893.00000
2015	281.00000	1670.00000	862.00000
2016	274.00000	1608.00000	856.00000

Plotting out the f02 data, Counts by Race

with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10,7))
    ax.plot(race_df['White/*'])
    ax.plot(race_df['Black/*'])
    ax.plot(race_df['Hispanic'])
    ax.set_title('Imprisonment Rates by Race (table: p16f02)')
    ax.set_xlabel('Year')
    ax.set_ylabel('People imprisoned (per 100k US population)')
    ax.legend()
    ax.set_ylim([0, 1.1*max([race_df['White/*'].max(), 
                         race_df['Black/*'].max(), 
                         race_df['Hispanic'].max()])])

Printing out summary stats for f02 data, Counts by Race

print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}'
      .format('White/*', race_df['White/*'].max(), race_df['White/*'].min()))
print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}'
      .format('Black/*', race_df['Black/*'].max(), race_df['Black/*'].min()))
print('{:>8s} imprisonment per 100k US pop: max: {}, min: {}'
      .format('Hispanic', race_df['Hispanic'].max(), race_df['Hispanic'].min()))
print('*: non-Hispanic')

 White/* imprisonment per 100k US pop: max: 324.0, min: 274.0
 Black/* imprisonment per 100k US pop: max: 2261.0, min: 1608.0
Hispanic imprisonment per 100k US pop: max: 1094.0, min: 856.0
*: non-Hispanic

1.6 Observations

This is very striking. We see that there is a very significant difference in the rates of white (non-Hispanic), black (non-Hispanic), and Hispanic imprisonment. We also see that rates for all three have dropped over this time period.

1.6.1 New Questions:

What is responsible for this difference in imprisonment rates for different demographic groups?

Based on prior research, I suspect that this is the result of many systemic factors, but let’s continue exploring the data.

1.7 Breakdown by Gender (Table p16t01)

Loading and preprocessing the t01 data, Total Counts by Gender and State/Federal

CSV_PATH = os.path.join('data', 'prison', 'p16t01.csv')
sex_df = pd.read_csv(CSV_PATH, encoding='latin1', header=11, thousands=r',')
sex_df.dropna(inplace=True, thresh=3)
sex_df.dropna(inplace=True, axis=1, thresh=3)
sex_df.dropna(inplace=True)
fix = lambda x: x.split('/')[0]
sex_df['Year'] = sex_df['Year'].apply(fix)
sex_df['Year'] = sex_df['Year'].astype(int)
sex_df.set_index('Year', inplace=True)
sex_df

	Total	Federal/a	State	Male	Female
Year
2006	1568674.00000	193046.00000	1375628.00000	1456366.00000	112308.00000
2007	1596835.00000	199618.00000	1397217.00000	1482524.00000	114311.00000
2008	1608282.00000	201280.00000	1407002.00000	1493670.00000	114612.00000
2009	1615487.00000	208118.00000	1407369.00000	1502002.00000	113485.00000
2010	1613803.00000	209771.00000	1404032.00000	1500936.00000	112867.00000
2011	1598968.00000	216362.00000	1382606.00000	1487561.00000	111407.00000
2012	1570397.00000	217815.00000	1352582.00000	1461625.00000	108772.00000
2013	1576950.00000	215866.00000	1361084.00000	1465592.00000	111358.00000
2014	1562319.00000	210567.00000	1351752.00000	1449291.00000	113028.00000
2015	1526603.00000	196455.00000	1330148.00000	1415112.00000	111491.00000
2016	1505397.00000	189192.00000	1316205.00000	1393975.00000	111422.00000

Plotting out the t01 data

with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10,7))
    ax.plot(sex_df['Male'])
    ax.plot(sex_df['Female'])
    ax.set_title('Imprisonment Counts by Gender (table: p16t01)')
    ax.set_xlabel('Year')
    ax.yaxis.set_major_formatter(y_formatter)
    ax.set_ylabel('People Imprisoned [in thousands of people]')
    ax.legend()
    ax.set_ylim([0, 1.1*max(sex_df['Male'].max(),
                            sex_df['Female'].max())])

1.8 Observations

The first thing that I notice is that the number of men in prison is much higher than the number of females imprisoned. Per the chart below, over the entire span of the data set (2006 to 2016), there are at least 12 men in prison for each woman in prison. This is a massive asymmetry. It doesn’t feel very controversial, but should it? According to the 2010 US Census, the US population is 50.8% female and 49.2% male.

1.8.1 New Questions

Why are men so much more likely to be in prison?
- What are the relevant differences between men and women?
- What is the gender breakdown of crimes?

Plotting out Male to Female imprisonment ratio

sex_df['m_f_ratio'] = sex_df['Male'] / sex_df['Female']
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(7,5))
    ax.plot(sex_df['m_f_ratio'])
    ax.set_title('Male to Female Imprisonment Ratio (table: p16t01)')
    ax.set_xlabel('Year')
    ax.set_ylabel('Average Number of Males imprisoned per Female')
    ax.legend()
    ax.set_ylim([0, 1.1*sex_df['m_f_ratio'].max()])

Plotting out Counts of State and Federal imprisonment

with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10,7))
    ax.plot(sex_df['Federal/a'])
    ax.plot(sex_df['State'])
    ax.set_title('Imprisonment Counts in State and Federal Prisons (table: p16t01)')
    ax.set_xlabel('Year')
    ax.yaxis.set_major_formatter(y_formatter)
    ax.set_ylabel('People Imprisoned [in thousands of people]')
    ax.legend()
    ax.set_ylim([0, 1.1*max(sex_df['Federal/a'].max(),
                            sex_df['State'].max())])

1.8.2 Observations

Far more people are in State prisons than are in Federal prisons. That isn’t very controversial and at this level, I won’t dig much deeper. It may be interesting to dig further into imprisonment counts broken down by state.

2 To Be Continued

There are still 20 more tables that I haven’t looked at yet, but so far, we’ve seen * The imprisonment rates increased by just over 266% between 1978 and 2007. * Black people are imprisoned at a far higher rate than either Hispanic people or non-Hispanic white people. * Hispanic people are imprisoned at a far higher rate than non-Hispanic white people. * Far more men are imprisoned than women. * Far more people are in state prisons than in federal prisons.

Continued in the next notebook.