Exploration in Criminal Justice and Corrections Data, Part 3

Exploring prisoner data up through 2016
pandas
analysis
old
Author

Matt Triano

Published

June 28, 2018

Modified

June 28, 2018

The majority of code and analysis in this post was originally written back in mid-2018 (in this notebook). I’ve consolidated some cells, automated data retrieval, and added cell labels to make things render nicely, but otherwise I’ve left the work as-is. Compared to my current work products, this old code is very messy and unpolished, but like Eric Ma (creator of the networkx graph data analysis package), I believe in showing newer data analysts and scientists I mentor that no one in this field started off with mastery of git, pandas, bash, etc, and that everyone who lasts loves to keep learning and improving.

After I integrate these old posts into this blog, I’ll write an EDA post that starts from scratch using up-to-date data.

1 Imprisonment by Race

This notebook explores the racial breakdown of the 3 largest racial groups in US prisons (from 2006 to 2016): non-Hispanic white people, non-Hispanic black people, and Hispanic people.

In this notebook, I show that, over the entire span of data (2006-2016), black people are the largest racial prison population both in raw numbers and as a proportion of the total same-race US population.

Average Number of Prisoners (from 2006-2016)

Race        Average Number in Prison
Black       551209
White       476136
Hispanic    336500

Average Percent of US Racial Population in Prison (from 2006-2016)

Race        % of US Racial Pop. in Prison
Black       1.440 %
White       0.241 %
Hispanic    0.656 %

Change in Percent of US Racial Population in Prison (2006 to 2016)

Race        Change in % of US Racial Pop. in Prison (percentage points)
Black       -0.406
White       -0.035
Hispanic    -0.113

That is fairly astonishing. A randomly selected black person is (on average) nearly 6 times as likely to be in prison as a randomly selected white person. There is clearly a statistically significant difference between these rates, so there are more questions to answer.

1.1 Data Source

This data is from file p16t03.csv of the Bureau of Justice Statistics Prisoner Series data.

Imports, styling, and path definitions
import os
from urllib.request import urlretrieve
import zipfile

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from IPython.display import display, HTML
%matplotlib inline

# Importing modules from a visualization package.
# from bokeh.sampledata.us_states import data as states
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.models import LinearColorMapper, ColorBar, BasicTicker

# styling
pd.options.display.max_columns = None
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.float_format',lambda x: '%.3f' % x)
plt.rcParams['figure.figsize'] = 10,10

DATA_DIR_PATH = os.path.join("..", "010_crime_and_prisons_p1", 'data')
PRISON_DATA_DIR = os.path.join(DATA_DIR_PATH, 'prison')
POP_DATA_DIR = os.path.join(DATA_DIR_PATH, "population")
Loading and preprocessing the t03 dataset, Counts by Gender, Race, and Jurisdiction
CSV_PATH = os.path.join(PRISON_DATA_DIR, 'p16t03.csv')
race_sex_raw = pd.read_csv(CSV_PATH, 
                           encoding='latin1',
                           header=11, 
                           na_values=':',
                           thousands=r',')
race_sex_raw.dropna(axis=0, thresh=3, inplace=True)
race_sex_raw.dropna(axis=1, thresh=3, inplace=True)
race_sex_raw.dropna(axis=0, inplace=True)
fix = lambda x: x.split('/')[0]
race_sex_raw['Year'] = race_sex_raw['Year'].apply(fix)
race_sex_raw.columns = [x.split('/')[0] for x in race_sex_raw.columns]
race_sex_raw.set_index('Year', inplace=True)

display(race_sex_raw)
Total Federal State Male Female White Black Hispanic
Year
2006 1504598.000 173533.000 1331065.000 1401261.000 103337.000 507100.000 590300.000 313600.000
2007 1532851.000 179204.000 1353647.000 1427088.000 105763.000 499800.000 592900.000 330400.000
2008 1547742.000 182333.000 1365409.000 1441384.000 106358.000 499900.000 592800.000 329800.000
2009 1553574.000 187886.000 1365688.000 1448239.000 105335.000 490000.000 584800.000 341200.000
2010 1552669.000 190641.000 1362028.000 1447766.000 104903.000 484400.000 572700.000 345800.000
2011 1538847.000 197050.000 1341797.000 1435141.000 103706.000 474300.000 557100.000 347800.000
2012 1512430.000 196574.000 1315856.000 1411076.000 101354.000 466600.000 537800.000 340300.000
2013 1520403.000 195098.000 1325305.000 1416102.000 104301.000 463900.000 529900.000 341200.000
2014 1507781.000 191374.000 1316407.000 1401685.000 106096.000 461500.000 518700.000 338900.000
2015 1476847.000 178688.000 1298159.000 1371879.000 104968.000 450200.000 499400.000 333200.000
2016 1458173.000 171482.000 1286691.000 1352684.000 105489.000 439800.000 486900.000 339300.000
Mean annual inmate count by race
print('Average number of {:>8s} people in prison from 2006 to 2016: {:6.0f}'
      .format('black', race_sex_raw['Black'].mean()))
print('Average number of {:>8s} people in prison from 2006 to 2016: {:6.0f}'
      .format('white', race_sex_raw['White'].mean()))
print('Average number of {:>8s} people in prison from 2006 to 2016: {:6.0f}'
      .format('Hispanic', race_sex_raw['Hispanic'].mean()))
Average number of    black people in prison from 2006 to 2016: 551209
Average number of    white people in prison from 2006 to 2016: 476136
Average number of Hispanic people in prison from 2006 to 2016: 336500
Plotting annual inmate counts by race
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10,7))
    # Explicit labels so that ax.legend() has entries to show
    ax.plot(race_sex_raw['White'], label='White')
    ax.plot(race_sex_raw['Black'], label='Black')
    ax.plot(race_sex_raw['Hispanic'], label='Hispanic')
    ax.set_title('Total Imprisonment Counts by Race (table: p16t03)',fontsize=14)
    ax.set_xlabel('Year', fontsize=14)
    ax.set_ylabel('Number of People Imprisoned', fontsize=14)
    ax.legend(fontsize=14)
    ax.set_ylim([0, 1.1*max([race_sex_raw['White'].max(), 
                             race_sex_raw['Black'].max(),
                             race_sex_raw['Hispanic'].max()])])

Looking at this plot of total imprisonment counts, we see that:

  • At any given time, there are more non-Hispanic black people in prison than people of any other race.

  • At any given time, there are more non-Hispanic white people than Hispanic people in prison.

  • The numbers of imprisoned black people and white people decreased steadily from 2008 to 2016, while the number of imprisoned Hispanic people has been fairly flat over that time.
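As a rough check on the 2008-to-2016 trend described above, here is a minimal sketch that computes the relative change in prisoner counts between those two years. The counts are copied by hand from the p16t03 table shown earlier; the variable names are my own and are not part of the original notebook. It only compares the two endpoints, so it says nothing about how steady the decline was in between.

```python
# Prisoner counts copied from the p16t03 table shown earlier in this post
counts_2008 = {"White": 499900, "Black": 592800, "Hispanic": 329800}
counts_2016 = {"White": 439800, "Black": 486900, "Hispanic": 339300}

# Relative change from 2008 to 2016, as a percent of the 2008 count
pct_change_2008_2016 = {
    group: 100 * (counts_2016[group] - counts_2008[group]) / counts_2008[group]
    for group in counts_2008
}

for group, change in pct_change_2008_2016.items():
    print(f"{group:>8s}: {change:+.1f}%")
```

Under these numbers, the black and white counts fall by roughly 18% and 12% respectively, while the Hispanic count rises by about 3%.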

These numbers are raw counts, so they don’t account for the fact that these races make up different proportions of the entire US population. This motivates questions like:

  • What percent of the population of each race is in prison?

To answer that, I’ll have to find population data broken down by race.

1.1.1 Census Data

The Constitution requires that a full, national census be performed every 10 years, reaching every resident on American soil. This data is used to determine how much federal money is allocated to each district for things like schools, roadways, police, etc., and it determines how many congressional representatives are allocated to each state. With the help of smaller surveys, the US Census Bureau estimates populations for the years between censuses.

data source

(NEW CODE) automating data collection
zip_pop_file_path = os.path.join(POP_DATA_DIR, "PEP_2016_PEPSR6H.zip")
url = "https://www2.census.gov/programs-surveys/popest/tables/2010-2016/state/asrh/PEP_2016_PEPSR6H.zip"
# Make sure the target directory exists before downloading into it
os.makedirs(POP_DATA_DIR, exist_ok=True)
if not os.path.isfile(zip_pop_file_path):
    urlretrieve(url=url, filename=zip_pop_file_path)
    with zipfile.ZipFile(zip_pop_file_path, 'r') as zf:
        zf.extractall(POP_DATA_DIR)
Loading, filtering, and preprocessing annual national pop. estimates by race and ethnicity
CSV_PATH = os.path.join(POP_DATA_DIR, 'PEP_2016_PEPSR6H_with_ann.csv')
race_pop = pd.read_csv(CSV_PATH, header=[1], encoding='latin1')
print('\nInitial Data Format:')
display(race_pop.head(3))
# Only looking for national values
race_pop = race_pop[race_pop['Geography'] == 'United States']
# Eliminating the actual census values
race_pop = race_pop[~race_pop['Year'].str.contains('April')]
# Eliminating aggregated rows
race_pop = race_pop[race_pop['Hispanic Origin'] != 'Total']
# race_pop = race_pop[race_pop['Sex'] != 'Both Sexes']  # for later
race_pop = race_pop[~race_pop['Sex'].isin(['Male','Female'])]

drop_cols = ['Id', 'Id.1', 'Id.2', 'Id2', 'Id.3', 'Geography', 'Sex',
             'Race Alone - American Indian and Alaska Native', 'Race Alone - Asian',
             'Race Alone - Native Hawaiian and Other Pacific Islander']
race_pop.drop(drop_cols, axis=1, inplace=True)

# Reducing the size of long column names
col_map = {'Race Alone - Black or African American':'black_only_pop',
           'Race Alone - White': 'white_only_pop'}
race_pop.rename(col_map, axis=1, inplace=True)

print('\n\nData Format after Processing:')
display(race_pop.head(3))

Initial Data Format:
Id Year Id.1 Sex Id.2 Hispanic Origin Id.3 Id2 Geography Total Race Alone - White Race Alone - Black or African American Race Alone - American Indian and Alaska Native Race Alone - Asian Race Alone - Native Hawaiian and Other Pacific Islander Two or More Races
0 cen42010 April 1, 2010 Census female Female hisp Hispanic 0100000US nan United States 24858794 21936806 1191984 702309 249346 85203 693146
1 cen42010 April 1, 2010 Census female Female nhisp Not Hispanic 0100000US nan United States 132105418 100301335 19853611 1147502 7691693 246518 2864759
2 cen42010 April 1, 2010 Census female Female tothisp Total 0100000US nan United States 156964212 122238141 21045595 1849811 7941039 331721 3557905

Data Format after Processing:
Year Hispanic Origin Total white_only_pop black_only_pop Two or More Races
24 July 1, 2010 Hispanic 50754069 44855529 2343053 1392607
25 July 1, 2010 Not Hispanic 258594124 197394319 38015461 5647760
33 July 1, 2011 Hispanic 51906353 45841771 2410490 1445820

Based on the aggregation of the data I pulled, there is a rather vague ‘Two or More Races’ column. The prisoner data I’ve been working with does not differentiate between multiracial people and single-race people; it only breaks race down into non-Hispanic white, non-Hispanic black, and Hispanic. It’s entirely possible for someone to be non-Hispanic, black, and multiracial, which would make it much more difficult and less accurate to use this population data with the prison data. Fortunately, the documentation for the prison data includes footnotes indicating that the counts for white and black prisoners exclude persons of two or more races, so I can conveniently exclude that column from the population data as well.

Engineering counts for Hispanic origin (any race), Non Hispanic White only, and Non Hispanic Black only
race_pop.drop('Two or More Races', axis=1, inplace=True)
hisp = race_pop[race_pop['Hispanic Origin'] == 'Hispanic'].copy()
non_hisp = race_pop[race_pop['Hispanic Origin'] != 'Hispanic'].copy()
non_hisp.drop(['Hispanic Origin', 'Total'], axis=1, inplace=True)
hisp.drop(['white_only_pop','black_only_pop', 'Hispanic Origin'], axis=1, inplace=True)
hisp.rename({'Total':'Hispanic_pop'}, axis=1, inplace=True)
us_race_pop = hisp.merge(non_hisp, on=['Year'])
fix_yr = lambda x: x.split(' ')[-1]
us_race_pop['Year'] = us_race_pop['Year'].apply(fix_yr)
us_race_pop.set_index('Year', inplace=True)

display(us_race_pop.head(3))
Hispanic_pop white_only_pop black_only_pop
Year
2010 50754069 197394319 38015461
2011 51906353 197519026 38393758
2012 52993496 197701109 38776276

This provides population data from 2010 on, but not for the earlier years. To deal with earlier years, I need to handle another data set.
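To make the gap concrete, here is a small sketch of the coverage check behind that statement. It assumes the prison table spans 2006 through 2016 and the July 1 population estimates loaded so far span 2010 through 2016, as the tables in this post suggest. The variable names are illustrative and not from the original notebook.

```python
# Years covered by each data set (assumed from the tables shown in this post)
prison_years = {str(year) for year in range(2006, 2017)}
pop_years = {str(year) for year in range(2010, 2017)}

# Years with prisoner counts but no population estimate yet
missing_years = sorted(prison_years - pop_years)
print(missing_years)
```

With those assumed ranges, the years 2006 through 2009 still need population estimates from another source.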

data source

(NEW CODE) automating data collection
pop_file_path = os.path.join(POP_DATA_DIR, "us-est00int-02.xls")
url = "https://www2.census.gov/programs-surveys/popest/tables/2000-2010/intercensal/national/us-est00int-02.xls"
if not os.path.isfile(pop_file_path):
    urlretrieve(url=url, filename=pop_file_path)
Loading, filtering, and preprocessing annual national pop. estimates by race and ethnicity
race_pop0010 = pd.read_excel(pop_file_path, header=None)
print('\nInitial Data Format:')
display(race_pop0010.head())

race_pop0010.dropna(axis=0, thresh=6, inplace=True)
race_pop0010.drop([1],axis=1, inplace=True)
race_pop0010 = race_pop0010.T
race_pop0010.iloc[0,0] = 'Year'
race_pop0010.columns = race_pop0010.loc[0]
race_pop0010.drop(0, axis=0, inplace=True)
race_pop0010.dropna(axis=0, inplace=True)
race_pop0010['Year'] = race_pop0010['Year'].astype(int)
race_pop0010['Year'] = race_pop0010['Year'].astype(str)
race_pop0010.set_index('Year', inplace=True)

# Code to help drop columns I'm not interested in
drop_cols = []
drop_stumps = ['AIAN','Asian','NHPI', 'One Race', 'Two or']
for col in race_pop0010.columns:
    if any(x in col for x in drop_stumps):
        drop_cols.append(col)
race_pop0010.drop(drop_cols, axis=1, inplace=True)

# Code to facilitate merging this DataFrame with the other population DataFrame
both0010 = race_pop0010.iloc[:,4:7].copy()
name_map = {'...White'  : 'white_only_pop',
            '...Black'  : 'black_only_pop',
            '.HISPANIC' : 'Hispanic_pop'}
both0010.rename(name_map, axis=1, inplace=True)
both0010 = both0010.astype(int)
pop_span = pd.concat([both0010, us_race_pop], join='inner')
race_sex_pop = race_sex_raw.join(pop_span)

print('\n\nData Format after Processing:')
display(race_pop0010.head())

Initial Data Format:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 table with row headers in column A and column ... NaN NaN nan nan nan nan nan nan nan nan nan NaN NaN
1 Table 2. Intercensal Estimates of the Resident... NaN NaN nan nan nan nan nan nan nan nan nan NaN NaN
2 Sex, Race, and Hispanic Origin April 1, 20001 Intercensal Estimates (as of July 1) nan nan nan nan nan nan nan nan nan April 1, 20102 July 1, 20103
3 NaN NaN 2000 2001.000 2002.000 2003.000 2004.000 2005.000 2006.000 2007.000 2008.000 2009.000 NaN NaN
4 BOTH SEXES 281424600 282162411 284968955.000 287625193.000 290107933.000 292805298.000 295516599.000 298379912.000 301231207.000 304093966.000 306771529.000 308745538 309349689

Data Format after Processing:
BOTH SEXES ..White ..Black .NOT HISPANIC ...White ...Black .HISPANIC ...White ...Black MALE ..White ..Black .NOT HISPANIC ...White ...Black .HISPANIC ...White ...Black FEMALE ..White ..Black .NOT HISPANIC ...White ...Black .HISPANIC ...White ...Black
Year
2000 282162411 228530479 35814706 246500526 195701752 34405800 35661885 32828727 1408906 138443407 112710885 17027514 120101363 95778530 16340639 18342044 16932355 686875 143719004 115819594 18787192 126399163 99923222 18065161 17319841 15896372 722031
2001 284968955.000 230049196.000 36263029.000 247824859.000 195974813.000 34780280.000 37144096.000 34074383.000 1482749.000 139891492.000 113534140.000 17249678.000 120806379.000 95977450.000 16526760.000 19085113.000 17556690.000 722918.000 145077463.000 116515056.000 19013351.000 127018480.000 99997363.000 18253520.000 18058983.000 16517693.000 759831.000
2002 287625193.000 231446915.000 36684650.000 249007573.000 196140540.000 35130061.000 38617620.000 35306375.000 1554589.000 141230559.000 114269447.000 17454795.000 121412514.000 96101267.000 16697032.000 19818045.000 18168180.000 757763.000 146394634.000 117177468.000 19229855.000 127595059.000 100039273.000 18433029.000 18799575.000 17138195.000 796826.000
2003 290107933.000 232717191.000 37066096.000 250058504.000 196232760.000 35438251.000 40049429.000 36484431.000 1627845.000 142428897.000 114897973.000 17631747.000 121910556.000 96155748.000 16839027.000 20518341.000 18742225.000 792720.000 147679036.000 117819218.000 19434349.000 128147948.000 100077012.000 18599224.000 19531088.000 17742206.000 835125.000
2004 292805298.000 234120447.000 37510582.000 251303923.000 196461761.000 35797599.000 41501375.000 37658686.000 1712983.000 143828012.000 115664854.000 17856753.000 122592198.000 96345151.000 17022156.000 21235814.000 19319703.000 834597.000 148977286.000 118455593.000 19653829.000 128711725.000 100116610.000 18775443.000 20265561.000 18338983.000 878386.000

Now that I’ve got a full population data set, I can normalize the prisoner data.
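Here, normalizing means dividing each group's prisoner count by that group's population and multiplying by 100. As a hedged, hand-worked sketch of that arithmetic for a single year, the snippet below uses the 2006 prisoner counts and 2006 population estimates that appear in the tables in this post. The dictionary names are my own, and the snippet avoids pandas only to stay self-contained.

```python
# 2006 prisoner counts and population estimates, copied from the tables in this post
prisoners_2006 = {"White": 507100, "Black": 590300, "Hispanic": 313600}
population_2006 = {"White": 196832697, "Black": 36520961, "Hispanic": 44606305}

# Percent of each group's population in prison: 100 * prisoners / population
pct_in_prison_2006 = {
    group: 100 * prisoners_2006[group] / population_2006[group]
    for group in prisoners_2006
}

for group, pct in pct_in_prison_2006.items():
    print(f"{group:>8s}: {pct:0.3f}%")
```

These values should agree with the 2006 row of the percent columns computed with pandas in the next cell.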

Engineering percent of total group population incarcerated features
race_sex_pop.loc[:,'White_pct'] = race_sex_pop.loc[:,'White']\
                .divide(race_sex_pop.loc[:,'white_only_pop']) * 100
race_sex_pop.loc[:,'Black_pct'] = race_sex_pop.loc[:,'Black']\
                .divide(race_sex_pop.loc[:,'black_only_pop']) * 100
race_sex_pop.loc[:,'Hispanic_pct'] = race_sex_pop.loc[:,'Hispanic']\
                .divide(race_sex_pop.loc[:,'Hispanic_pop']) * 100

display(race_sex_pop)
Total Federal State Male Female White Black Hispanic white_only_pop black_only_pop Hispanic_pop White_pct Black_pct Hispanic_pct
Year
2006 1504598.000 173533.000 1331065.000 1401261.000 103337.000 507100.000 590300.000 313600.000 196832697 36520961 44606305 0.258 1.616 0.703
2007 1532851.000 179204.000 1353647.000 1427088.000 105763.000 499800.000 592900.000 330400.000 197011394 36905758 46196853 0.254 1.607 0.715
2008 1547742.000 182333.000 1365409.000 1441384.000 106358.000 499900.000 592800.000 329800.000 197183535 37290709 47793785 0.254 1.590 0.690
2009 1553574.000 187886.000 1365688.000 1448239.000 105335.000 490000.000 584800.000 341200.000 197274549 37656592 49327489 0.248 1.553 0.692
2010 1552669.000 190641.000 1362028.000 1447766.000 104903.000 484400.000 572700.000 345800.000 197394319 38015461 50754069 0.245 1.506 0.681
2011 1538847.000 197050.000 1341797.000 1435141.000 103706.000 474300.000 557100.000 347800.000 197519026 38393758 51906353 0.240 1.451 0.670
2012 1512430.000 196574.000 1315856.000 1411076.000 101354.000 466600.000 537800.000 340300.000 197701109 38776276 52993496 0.236 1.387 0.642
2013 1520403.000 195098.000 1325305.000 1416102.000 104301.000 463900.000 529900.000 341200.000 197777454 39135988 54064149 0.235 1.354 0.631
2014 1507781.000 191374.000 1316407.000 1401685.000 106096.000 461500.000 518700.000 338900.000 197902336 39507913 55189962 0.233 1.313 0.614
2015 1476847.000 178688.000 1298159.000 1371879.000 104968.000 450200.000 499400.000 333200.000 197964402 39876758 56338521 0.227 1.252 0.591
2016 1458173.000 171482.000 1286691.000 1352684.000 105489.000 439800.000 486900.000 339300.000 197969608 40229236 57470287 0.222 1.210 0.590
Plotting percent of total group population incarcerated by year
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10,7))
    # Explicit labels so that ax.legend() has entries to show
    ax.plot(race_sex_pop['White_pct'], label='White')
    ax.plot(race_sex_pop['Black_pct'], label='Black')
    ax.plot(race_sex_pop['Hispanic_pct'], label='Hispanic')
    ax.set_title('% of Total US Racial Population in Prison',fontsize=14)
    ax.set_xlabel('Year', fontsize=14)
    ax.set_ylabel('Percent of Racial Population In Prison [%]', fontsize=14)
    ax.legend(fontsize=14)
    ax.set_ylim([0, 1.1*max([race_sex_pop['White_pct'].max(), 
                             race_sex_pop['Black_pct'].max(),
                             race_sex_pop['Hispanic_pct'].max()])])

Descriptive stats for percent of group incarcerated
print('Average percentage of total {:>8s} population in prison: {:0.3f}%'
      .format('black', race_sex_pop['Black_pct'].mean()))
print('Average percentage of total {:8s} population in prison: {:0.3f}%'
      .format('Hispanic', race_sex_pop['Hispanic_pct'].mean()))
print('Average percentage of total {:>8s} population in prison: {:0.3f}%'
      .format('white', race_sex_pop['White_pct'].mean()))

print('Change in percentage of total {:>8s} population in prison between 2006 to 2016: {:0.3f}%'
      .format('black', race_sex_pop['Black_pct']['2016'] - race_sex_pop['Black_pct']['2006']))
print('Change in percentage of total {:>8s} population in prison between 2006 to 2016: {:0.3f}%'
      .format('Hispanic', race_sex_pop['Hispanic_pct']['2016'] - race_sex_pop['Hispanic_pct']['2006']))
print('Change in percentage of total {:>8s} population in prison between 2006 to 2016: {:0.3f}%'
      .format('white', race_sex_pop['White_pct']['2016'] - race_sex_pop['White_pct']['2006']))
Average percentage of total    black population in prison: 1.440%
Average percentage of total Hispanic population in prison: 0.656%
Average percentage of total    white population in prison: 0.241%
Change in percentage of total    black population in prison between 2006 to 2016: -0.406%
Change in percentage of total Hispanic population in prison between 2006 to 2016: -0.113%
Change in percentage of total    white population in prison between 2006 to 2016: -0.035%

That’s an extremely large difference. Averaged over 2006-2016, a randomly selected black person is nearly 6 times as likely to be in prison as a randomly selected white person and more than twice as likely as a randomly selected Hispanic person.
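The ratios behind that statement can be checked with a few lines of arithmetic. This is a minimal sketch that uses the three average percentages printed above, rounded to three decimals, so the ratios are approximate. The variable names are illustrative.

```python
# Average percent of each group in prison (2006-2016), from the output above
avg_pct_in_prison = {"Black": 1.440, "Hispanic": 0.656, "White": 0.241}

# Ratio of the black rate to each of the other two rates
black_to_white = avg_pct_in_prison["Black"] / avg_pct_in_prison["White"]
black_to_hispanic = avg_pct_in_prison["Black"] / avg_pct_in_prison["Hispanic"]

print(f"Black vs. white:    {black_to_white:0.2f}x")
print(f"Black vs. Hispanic: {black_to_hispanic:0.2f}x")
```

The first ratio comes out just under 6 and the second a little above 2, matching the "nearly 6 times" and "more than twice" figures.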

That motivates the question: Why is there such a large difference between these populations?

To investigate that question, it would be useful to investigate other questions, such as:

  • Does the criminal justice system treat different racial groups differently?

  • Is the average quality of education comparable for different racial populations?

  • Is the distribution of quality of education comparable for different racial populations?

  • Is the average quality of employment opportunity comparable for different racial populations?

  • Is the distribution of quality of employment opportunities comparable for different racial populations?

  • Is there a causal relationship between the pervasive housing segregation across the US and these disparities?

  • What drove the significant reduction in the incarcerated proportion of black Americans from 2006-2016?

To Be Continued?