Exploration in Criminal Justice and Corrections Data, Part 2

Exploring prisoner data up through 2016
pandas
bokeh
analysis
old
Author

Matt Triano

Published

June 27, 2018

Modified

June 27, 2018

The majority of code and analysis in this post was originally written back in mid-2018 (in this notebook). I’ve consolidated some cells, automated data retrieval, and added cell labels to make things render nicely, but otherwise I’ve left the work as-is. Compared to my current work products, this old code is very messy and unpolished, but like Eric Ma (creator of the networkx graph data analysis package), I believe in showing newer data analysts and scientists I mentor that no one in this field started off with mastery of git, pandas, bash, etc, and that everyone who lasts loves to keep learning and improving.

After I integrate these old posts into this blog, I’ll write an EDA post that starts from scratch using up-to-date data.

1 Criminal Justice and Corrections Data EDA, Part 2

1.1 Table of Contents

  1. Exploration in Criminal Justice and Corrections Data
  2. Datasets
  3. Total Imprisonment Rates
    3a. Visualization Pitfall
    3b. Population Dataset
    3c. Visualization Breakthrough, Observations, and interactive maps
  4. State Imprisonment Distributions by Gender
    4a. Static Scatter Plots with Distributions
    4b. Interactive Scatter Plots

  5. Summary

2 Datasets

As in part 1, I’m exploring the most recent data from the Bureau of Justice Statistics’ Prisoner Series datasets.

3 Imprisonment Counts and Rates by State (Table p16t02)

In the previous post, Exploratory Data Analysis Part 1, we saw that the vast majority of people in prisons are in state prisons, but we didn’t discover anything about imprisonment rates in individual states. One of my favorite tools for examining geographical data is the choropleth map (which you may recognize from any recent election night). In this post, I use the Bokeh python package to generate some state-level choropleths to show rates of imprisonment and how they vary from state to state.

Scroll to the bottom if you want to jump straight to the maps.

Imports, styling, path definitions, and util funcs
import os
from urllib.request import urlretrieve

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.core.display import display, HTML
%matplotlib inline

# Notebook Styling
pd.options.display.float_format = lambda x: "%.5f" % x
pd.options.display.max_columns = None
plt.rcParams['figure.figsize'] = 10,10

DATA_DIR_PATH = os.path.join("..", "010_crime_and_prisons_p1", 'data')
PRISON_DATA_DIR = os.path.join(DATA_DIR_PATH, 'prison')
POP_DATA_DIR = os.path.join(DATA_DIR_PATH, "population")
os.makedirs(POP_DATA_DIR, exist_ok=True)

# Importing modules from a visualization package.
from bokeh.sampledata.us_states import data as states
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.models import LinearColorMapper, ColorBar, BasicTicker

try:
    del states["HI"]
    del states["AK"]
except KeyError:
    pass

# This dictionary maps State names to State Codes
def get_state_code():
    state_code = {
        'United States': 'US',
        'Alabama': 'AL',
        'Alaska': 'AK',
        'American Samoa': 'AS',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'District of Columbia': 'DC',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Guam': 'GU',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'National': 'NA',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Northern Mariana Islands': 'MP',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Puerto Rico': 'PR',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virgin Islands': 'VI',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'
    }
    return state_code
Loading and preprocessing the t02 dataset, Counts by State and Gender
# Breakdown by State and Gender (Table p16t02)
CSV_PATH = os.path.join(PRISON_DATA_DIR, 'p16t02.csv')
state_sex_df = pd.read_csv(CSV_PATH, encoding='latin1',
                           header=[11,12], na_values=':',
                           thousands=r',')
print(f"Initial data format")
display(state_sex_df.head())
state_sex_df.dropna(axis=0, thresh=5, inplace=True)
state_sex_df.dropna(axis=1, thresh=5, inplace=True)
# state_sex_df.dropna(axis=0, inplace=True)
state_sex_df.drop(['Percent change, 2015–2016', 'Unnamed: 12_level_0',
                   'Unnamed: 14_level_0'],
                  axis=1, inplace=True)
state_sex_df.drop([0,1,2], axis=0, inplace=True)

# Fixing the headers for the dataframe
tmp1 = ['Area']
tmp1.extend(3*[state_sex_df.columns.values[1][0]])
tmp1.extend(3*[state_sex_df.columns.values[4][0]])
tmp2 = [el[1] for el in state_sex_df.columns.values]
tmp2[0] = ''
state_sex_df.columns = pd.MultiIndex.from_arrays([tmp1,tmp2], names=['', ''])
state_sex_df.reset_index(drop=True, inplace=True)

# To clean up the '/_' footnote indicators
fix = lambda x: x.split('/')[0]
state_sex_df['Area'] = state_sex_df['Area'].apply(fix)

# This creates a new column in the DataFrame with the state code for each state
state_code = get_state_code()
map_state_codes = lambda x: state_code[x]
state_sex_df['Code'] = state_sex_df['Area'].apply(map_state_codes)
state_sex_df.drop(['Area'], axis=1, inplace=True)  # drops the row with aggregated values
# This sets the state code as the index
state_sex_df.set_index('Code', inplace=True)

# This fixes the names of the columns (to facilitate merging later)
state_sex_df.columns.set_names([None,'SEX'], inplace=True)
state_sex_df.index.name = 'Code'

print(f"\n\nData format after preprocessing")
display(state_sex_df.head())
Initial data format


Data format after preprocessing
Unnamed: 0_level_0 Unnamed: 1_level_0 2015 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 2016 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 Percent change, 2015–2016 Unnamed: 11_level_0 Unnamed: 12_level_0 Unnamed: 13_level_0 Unnamed: 14_level_0 Unnamed: 15_level_0
Jurisdiction Unnamed: 1_level_1 Total Male Female Unnamed: 5_level_1 Total Male Female Unnamed: 9_level_1 Total Unnamed: 11_level_1 Male Unnamed: 13_level_1 Female Unnamed: 15_level_1
0 NaN U.S. total/a 1526603.00000 1415112.00000 111491.00000 nan 1505397.00000 1393975.00000 111422.00000 nan -1.40000 % -1.50000 % -0.10000 %
1 Federal/b NaN 196455.00000 183502.00000 12953.00000 nan 189192.00000 176495.00000 12697.00000 nan -3.70000 % -3.80000 % -2.00000 %
2 State/a NaN 1330148.00000 1231610.00000 98538.00000 nan 1316205.00000 1217480.00000 98725.00000 nan -1.00000 % -1.10000 % 0.20000 %
3 NaN Alabama 30810.00000 28220.00000 2590.00000 nan 28883.00000 26506.00000 2377.00000 nan -6.30000 NaN -6.10000 NaN -8.20000 NaN
4 NaN Alaska/c 5338.00000 4761.00000 577.00000 nan 4434.00000 4024.00000 410.00000 nan -16.90000 NaN -15.50000 NaN -28.90000 NaN
/home/matt/miniconda3/envs/prisons_post_env/lib/python3.6/site-packages/pandas/core/generic.py:3812: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  new_axis = axis.drop(labels, errors=errors)
2015 2016
SEX Total Male Female Total Male Female
Code
AL 30810.00000 28220.00000 2590.00000 28883.00000 26506.00000 2377.00000
AK 5338.00000 4761.00000 577.00000 4434.00000 4024.00000 410.00000
AZ 42719.00000 38738.00000 3981.00000 42320.00000 38323.00000 3997.00000
AR 17707.00000 16305.00000 1402.00000 17537.00000 16161.00000 1376.00000
CA 129593.00000 123808.00000 5785.00000 130390.00000 124487.00000 5903.00000

In the above cell, I’ve cleaned up the DataFrame so that the column header now consists of: * column labels: (‘2015’, ‘2016’, ‘Total’, ‘Male’, ‘Female’) * column label names: (’‘, ’SEX’) * index name: (’Code)

Setting variables to help style plots and locate tooltips
# The color levels to use in the map
COLORS = ['#feedde', '#fdd0a2', '#fdae6b', '#fd8d3c', '#f16913', '#d94801', '#8c2d04']
state_xs = [states[code]["lons"] for code in states]
state_ys = [states[code]["lats"] for code in states]

One of the great things about coding in Jupyter Notebooks is that I can prototype things as I go, and then encapsulate that prototype into a function. I built the prototype for my visualization in the cell below, and after I tune the visualization, I can make it into a function and easily generate more visualizations.

Prototyping Bokeh plotting code
state_names = []
prisoner_counts = []
yr = '2015'
sex = 'Male'
df = state_sex_df
min_val = df.loc[:,yr][sex].min()
max_val = df.loc[:,yr][sex].max()
bins = len(COLORS)
TOOLS = "pan,wheel_zoom,reset,hover,save"
for state_code in states:
    try:
        num_imprisoned = df.loc[state_code,yr][sex]
        state_names.append(state_code)
        prisoner_counts.append(num_imprisoned)
    except KeyError:
        state_names.append(' ')
        prisoner_counts.append(0)

color_mapper = LinearColorMapper(palette=COLORS,
                                 low=min(prisoner_counts),
                                 high=max(prisoner_counts))

source = ColumnDataSource(data=dict(
    x=state_xs,
    y=state_ys,
    name=state_names,
    imp_count = prisoner_counts
))

p = figure(title="(Weak Visualization) Number of Imprisoned Males by State (2015)",
           toolbar_location="left",
           plot_width=800, plot_height=450, tools=TOOLS)
p.patches('x', 'y', source=source, 
          fill_color={'field': 'imp_count', 'transform': color_mapper}, fill_alpha=1.0,
          line_color="#884444", line_width=2, line_alpha=0.3)
color_bar = ColorBar(color_mapper=color_mapper, ticker=BasicTicker(),
                     label_standoff=12, border_line_color=None, location=(0,0))
p.add_layout(color_bar, 'right')
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
        ("State", "@name"),
        ("No. Imprisoned in this State", "@imp_count")
    ]
output_notebook()
show(p)
Loading BokehJS ...

3.1 Visualization Pitfall!

From this map, we see that the most populated states have the most prisoners (you can see the actual number of people in prison by hovering the mouse over a state). That’s approximately what we would expect, but it doesn’t really help us answer questions about the differences in imprisonment rates between States (with different policies on criminal justice). This information is useful for investigating some questions (e.g. Where could prison reform impact the most lives?), but it’s not as useful if we’re investigating the relationship between policies, crime rates, and imprisonment rates. If we want to actually compare imprisonment rates between states, we need to normalize (or scale) the prisoner counts to the population of the state. This requires getting a different data set.

3.1.1 Population Dataset

The US Census tracks and publishes population data, including a data set that tracks estimates of annual state populations (by gender) from 2010 to 2016. I’ll use this data to normalize the raw prisoner counts by state.

(NEW CODE) automating data collection
pop_file_path = os.path.join(POP_DATA_DIR, "sc-est2016-agesex-civ.csv")
url = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2016/state/asrh/sc-est2016-agesex-civ.csv"
if not os.path.isfile(pop_file_path):
    urlretrieve(url=url, filename=pop_file_path)
Loading and preprocessing the Census State and Gender pop. estimates
state_pop_sex_df = pd.read_csv(pop_file_path, encoding='latin1', header=0)
print('\nRaw Population Data Set Format')
display(state_pop_sex_df.head())
state_pop_sex_df = state_pop_sex_df[state_pop_sex_df['SEX'].isin([1,2])]
state_code = get_state_code()
map_state_codes = lambda x: state_code[x]
fix_sex = lambda x: 'Male' if x == 1 else 'Female'
state_pop_sex_df['SEX'] = state_pop_sex_df['SEX'].apply(fix_sex)
state_pop_sex_df.drop(['SUMLEV','REGION','DIVISION','STATE','AGE', 
                       'ESTBASE2010_CIV', 'POPEST2010_CIV', 'POPEST2011_CIV',
                       'POPEST2012_CIV', 'POPEST2013_CIV', 'POPEST2014_CIV'], 
                      axis=1, inplace=True)
state_pop_sex_df.rename({'POPEST2015_CIV':'2015_pop',
                         'POPEST2016_CIV':'2016_pop'},
                        axis=1, inplace=True)
state_pop_sex_df['Code'] = state_pop_sex_df['NAME'].apply(map_state_codes)
state_pop_sex_df = state_pop_sex_df.groupby(['SEX', 'Code']).sum()
state_pop_sex_df = state_pop_sex_df.unstack(0)
print('\n\nPopulation Data Set Format After Reformatting')
state_pop_sex_df.head()

Raw Population Data Set Format


Population Data Set Format After Reformatting
SUMLEV REGION DIVISION STATE NAME SEX AGE ESTBASE2010_CIV POPEST2010_CIV POPEST2011_CIV POPEST2012_CIV POPEST2013_CIV POPEST2014_CIV POPEST2015_CIV POPEST2016_CIV
0 10 0 0 0 United States 0 0 3944160 3951400 3963239 3926677 3931346 3955374 3975414 3970145
1 10 0 0 0 United States 0 1 3978090 3957847 3966617 3978101 3943114 3950083 3974980 3995008
2 10 0 0 0 United States 0 2 4096939 4090856 3971363 3980016 3992752 3959663 3967361 3992154
3 10 0 0 0 United States 0 3 4119051 4111929 4102483 3982920 3992660 4006960 3974468 3982074
4 10 0 0 0 United States 0 4 4063186 4077557 4122286 4112795 3994261 4005464 4020276 3987656
2015_pop 2016_pop
SEX Female Male Female Male
Code
AK 697258 734122 701214 739176
AL 4998694 4681748 5011344 4687956
AR 3028428 2916226 3039182 2926812
AZ 6854628 6744392 6970496 6853282
CA 39205974 38470174 39466330 38716576

Now that I’ve reformatted the population data to match the format of the state prisoner counts, I want to merge these two DataFrames together so that I can calculate the normalized rates of imprisonment.

Joining state pop. estimates with state inmate counts
state_sex_join_df = state_sex_df.join(state_pop_sex_df)
display(state_sex_join_df.head())
2015 2016 2015_pop 2016_pop
SEX Total Male Female Total Male Female Female Male Female Male
Code
AL 30810.00000 28220.00000 2590.00000 28883.00000 26506.00000 2377.00000 4998694 4681748 5011344 4687956
AK 5338.00000 4761.00000 577.00000 4434.00000 4024.00000 410.00000 697258 734122 701214 739176
AZ 42719.00000 38738.00000 3981.00000 42320.00000 38323.00000 3997.00000 6854628 6744392 6970496 6853282
AR 17707.00000 16305.00000 1402.00000 17537.00000 16161.00000 1376.00000 3028428 2916226 3039182 2926812
CA 129593.00000 123808.00000 5785.00000 130390.00000 124487.00000 5903.00000 39205974 38470174 39466330 38716576

Now that I’ve merged the state prisoner counts with population data, I have to divide male prisoner counts by state population (of that gender) and store that information for use later.

Engineering per-capita inmate count features
# This Cell normalizes prisoner counts to be per 100k
#   state population for that year.
yrs = ['2015','2016']
sexes = ['Male','Female']

for yr in yrs:
    for sex in sexes:
        state_sex_join_df.loc[:,(yr, sex + '_rate')] = \
            state_sex_join_df.loc[:,(yr, sex)]\
            .divide(state_sex_join_df.loc[:,(yr + '_pop', sex)]/100000)
display(state_sex_join_df.head(3))
2015 2016 2015_pop 2016_pop 2015 2016
SEX Total Male Female Total Male Female Female Male Female Male Male_rate Female_rate Male_rate Female_rate
Code
AL 30810.00000 28220.00000 2590.00000 28883.00000 26506.00000 2377.00000 4998694 4681748 5011344 4687956 602.76632 51.81353 565.40633 47.43239
AK 5338.00000 4761.00000 577.00000 4434.00000 4024.00000 410.00000 697258 734122 701214 739176 648.52981 82.75273 544.38997 58.47002
AZ 42719.00000 38738.00000 3981.00000 42320.00000 38323.00000 3997.00000 6854628 6744392 6970496 6853282 574.37349 58.07755 559.19193 57.34169
Encapsulating Bokeh Choropleth Plotting code in a function
# I'm going to generate more choropleths, so I'll encapsulate the code
#   I prototyped above into this function.
def make_choropleth(df, level1, level2, title, colors=COLORS):
    """Generates a choropleth with tooltips

    Args:
        df:        The DataFrame containing the relevant data. df should be
                 formatted such that it has a 2 level column-header
        level1:    The top column-header name (eg '2015' in examples above)
        level2:    The next-lower column header name (eg 'Male)
        title:     A title for the Choropleth
        colors:    An ordered list of color codes to use in mapping data.
    """
    state_names = []
    data = []
    min_val = df.loc[:,level1][level2].min()
    max_val = df.loc[:,level1][level2].max()
    bins = len(colors)
    TOOLS = "pan,wheel_zoom,reset,hover,save"
    for state_code in states:
        try:
            num_imprisoned = df.loc[state_code,level1][level2]
            state_names.append(state_code)
            data.append(num_imprisoned)
        except KeyError:
            state_names.append(' ')
            data.append(0)

    # Loading up Bokeh's special data structure
    source = ColumnDataSource(data=dict(
        x=state_xs,
        y=state_ys,
        name=state_names,
        imp_count = data
    ))

    # Maps the data value to a color
    color_mapper = LinearColorMapper(palette=colors,
                                     low=min(data),
                                     high=max(data))

    # Instantiating the figure
    p = figure(
        title=title, toolbar_location="left",
        plot_width=800, plot_height=450, tools=TOOLS
    )

    # Plotting the state-polygons on the figure
    p.patches('x', 'y', source=source,
              fill_color={'field': 'imp_count',
                          'transform': color_mapper},
              fill_alpha=1.0,
              line_color="#884444",
              line_width=2,
              line_alpha=0.3)

    # Setting the legend for colors
    color_bar = ColorBar(color_mapper=color_mapper,
                         ticker=BasicTicker(),
                         label_standoff=12,
                         border_line_color=None,
                         location=(0,0))

    p.add_layout(color_bar, 'right')
    hover = p.select_one(HoverTool)
    hover.point_policy = "follow_mouse"
    hover.tooltips = [
        ("State", "@name"),
        ("Imprisonment rate (per 100k state pop)", "@imp_count")
    ]
    output_notebook()
    show(p)

3.2 Visualization Breakthrough!

Now that we’ve normalized the number of imprisoned people to the state populations, we can compare rates between states. From looking at these choropleths, it’s clear that there are massive differences between states. ### Observations * For both men and women over the two included years, Massachusetts has the lowest imprisonment rates (below 145 men per 100k state male pop. and below 10 women per 100k state female pop). * For men, Louisiana has the highest imprisonment rate (at least 740 men per 100k state male pop.). * Delaware stands out as having a much higher imprisonment rate that its neighbors. Delaware imprisons at least 660 men per 100k state male pop., which is 2 to 3 times more than any neighboring state. * Oklahoma has the highest imprisonment rate for women at over 75 women per 100k state female pop. * In general, southern states have higher male imprisonment rates than the north. * Except for Delaware, New England and the Midwest have lower than average female imprisonment rates. * Bokeh’s built-in state geometries are convenient, but they’re not great (just look at Michigan and the Great Lakes).

Calling our Choropleth-plotting function
title = "Number of Imprisoned Females per 100k Female State Population (2015)"
make_choropleth(state_sex_join_df, '2015', 'Female_rate', title)
Loading BokehJS ...
Calling our Choropleth-plotting function
title = "Number of Imprisoned Females per 100k Female State Population (2016)"
make_choropleth(state_sex_join_df, '2016', 'Female_rate', title)
Loading BokehJS ...
Calling our Choropleth-plotting function
title = "Number of Imprisoned Males per 100k State Male Population (2015)"
make_choropleth(state_sex_join_df, '2015', 'Male_rate', title)
Loading BokehJS ...
Calling our Choropleth-plotting function
title = "Number of Imprisoned Males per 100k State Male Population (2016)"
make_choropleth(state_sex_join_df, '2016', 'Male_rate', title)
Loading BokehJS ...