Census TIGER Dataset Collector Dev

Documenting the synthesis stage of development
Author

Matt Triano

Published

June 14, 2023

Modified

June 14, 2023

Code
import textwrap
from typing import Dict, List, Union, Optional
from urllib.request import urlretrieve

from bs4 import BeautifulSoup
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import requests

The US Census Bureau (USCB) runs a massive array of surveys of residents of the United States. After aggregating responses into demographic and/or geographic groups, the USCB publishes those aggregated datasets to its vast public data catalog. I’ve already developed some tooling to collect and ingest Census datasets into my personal data warehouse and analytics environment, but to map out geographic Census data or join in other geospatial datasets, I also need to collect and ingest spatial files describing the geographic boundaries of Census groupings. Fortunately, the USCB provides shapefiles with these geometries in their TIGER (Topologically Integrated Geographic Encoding and Referencing) data offerings, and in this notebook, I’ll synthesize the findings from the experiments and research in my last notebook into a data model and data collection tools.

Each year, the USCB can add or alter geographic boundaries or connected data features, so it organizes its TIGER data offerings into annual vintages, which can be downloaded either through the USCB web interface or from its file server interface (https://www2.census.gov/geo/tiger/).

The functions in the cell below aid in scraping the table off of any of the pages linked in that file-tree interface.

Code
def request_page(metadata_url: str) -> requests.models.Response:
    resp = requests.get(metadata_url)
    if resp.status_code == 200:
        return resp
    else:
        raise Exception(f"Couldn't get page metadata for url {metadata_url}")

def scrape_census_ftp_metadata_page(metadata_url: str) -> pd.DataFrame:
    resp = request_page(metadata_url=metadata_url)
    soup = BeautifulSoup(resp.content, "html.parser")
    table = soup.find("table")
    rows = table.find_all("tr")
    table_contents = []
    for row in rows:
        cols = row.find_all("td")
        cols = [col.text.strip() for col in cols]
        table_contents.append(cols)
    # Header rows only contain <th> elements, so they parse to empty lists; drop them.
    table_rows = [el for el in table_contents if len(el) > 0]

    # Drop the leading icon column and label the remaining columns.
    metadata_df = pd.DataFrame(
        [row[1:] for row in table_rows],
        columns=["name", "last_modified", "size", "description"],
    )
    metadata_df["last_modified"] = pd.to_datetime(metadata_df["last_modified"])
    # Directories end with a slash on these listing pages.
    metadata_df["is_dir"] = metadata_df["name"].str.endswith("/")
    metadata_df["clean_name"] = metadata_df["name"].str.replace("/$", "", regex=True)
    metadata_df["is_file"] = (~metadata_df["is_dir"]) & (
        metadata_df["clean_name"] != "Parent Directory"
    )
    # Normalize the page URL so joined child URLs don't get a double slash.
    while metadata_url.strip().endswith("/"):
        metadata_url = metadata_url[:-1]
    mask = metadata_df["is_file"] | metadata_df["is_dir"]
    metadata_df = metadata_df.loc[mask].copy()
    metadata_df["metadata_url"] = metadata_url + "/" + metadata_df["clean_name"]
    return metadata_df

request_page() makes a GET request for the content at the URL metadata_url and returns the response if the request succeeds (i.e. if the response has the 200 “success” HTTP status code); otherwise it raises an exception. This function is only called by scrape_census_ftp_metadata_page(), but it’s separated out to make scrape_census_ftp_metadata_page() easier to test.

Code
def request_page(metadata_url: str) -> requests.models.Response:
    resp = requests.get(metadata_url)
    if resp.status_code == 200:
        return resp
    else:
        raise Exception(f"Couldn't get page metadata for url {metadata_url}")

scrape_census_ftp_metadata_page() gets the content from a given Census file-tree page (at URL metadata_url) and parses it into a convenient, well-structured format (a pandas DataFrame) with features that aid in filtering to the desired rows.

Code
resp = request_page(metadata_url="https://www2.census.gov/geo/tiger/")
soup = BeautifulSoup(resp.content, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
rows[0:7]
[<tr><th valign="top"><img alt="[ICO]" src="/icons/blank.gif"/></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>,
 <tr><th colspan="5"><hr/></th></tr>,
 <tr><td valign="top"><img alt="[PARENTDIR]" src="/icons/back.gif"/></td><td><a href="/geo/">Parent Directory</a></td><td> </td><td align="right">  - </td><td> </td></tr>,
 <tr><td valign="top"><img alt="[   ]" src="/icons/layout.gif"/></td><td><a href="Directory_Contents_ReadMe.pdf">Directory_Contents_ReadMe.pdf</a></td><td align="right">2019-06-25 09:13  </td><td align="right">439K</td><td> </td></tr>,
 <tr><td valign="top"><img alt="[DIR]" src="/icons/folder.gif"/></td><td><a href="GENZ2010/">GENZ2010/</a></td><td align="right">2013-07-24 12:46  </td><td align="right">  - </td><td> </td></tr>,
 <tr><td valign="top"><img alt="[DIR]" src="/icons/folder.gif"/></td><td><a href="GENZ2012/">GENZ2012/</a></td><td align="right">2013-07-24 12:47  </td><td align="right">  - </td><td> </td></tr>,
 <tr><td valign="top"><img alt="[DIR]" src="/icons/folder.gif"/></td><td><a href="GENZ2013/">GENZ2013/</a></td><td align="right">2014-07-02 08:28  </td><td align="right">  - </td><td> </td></tr>]

The cell above shows the code that extracts HTML tr (table-row) elements from the page (sorry for showing something so ugly!), and the cell below shows the (beautiful) end product of scrape_census_ftp_metadata_page().

Code
all_tiger_vintages_df = scrape_census_ftp_metadata_page(
    metadata_url="https://www2.census.gov/geo/tiger/"
)
display(all_tiger_vintages_df.head(4))
name last_modified size description is_dir clean_name is_file metadata_url
1 Directory_Contents_ReadMe.pdf 2019-06-25 09:13:00 439K False Directory_Contents_ReadMe.pdf True https://www2.census.gov/geo/tiger/Directory_Co...
2 GENZ2010/ 2013-07-24 12:46:00 - True GENZ2010 False https://www2.census.gov/geo/tiger/GENZ2010
3 GENZ2012/ 2013-07-24 12:47:00 - True GENZ2012 False https://www2.census.gov/geo/tiger/GENZ2012
4 GENZ2013/ 2014-07-02 08:28:00 - True GENZ2013 False https://www2.census.gov/geo/tiger/GENZ2013
Code
print(f"Files on the top-level TIGER dataset page:                {all_tiger_vintages_df['is_file'].sum():>3}")
print(f"TIGER data offerings on the top-level TIGER dataset page: {all_tiger_vintages_df['is_dir'].sum():>3}")
Files on the top-level TIGER dataset page:                  1
TIGER data offerings on the top-level TIGER dataset page:  59

The GENZyyyy TIGER data offerings are interesting, but the real wealth of geospatial files is in the offerings with names matching the TIGERyyyy pattern. I implemented the get_tiger_vintages_metadata() function to get the metadata for these offerings, or vintages.

Code
def get_tiger_vintages_metadata() -> pd.DataFrame:
    all_tiger_vintages_df = scrape_census_ftp_metadata_page(
        metadata_url="https://www2.census.gov/geo/tiger/"
    )
    tiger_vintages_df = all_tiger_vintages_df.loc[
        all_tiger_vintages_df["name"].str.contains("^TIGER\d{4}/", regex=True)
    ].copy()
    tiger_vintages_df = tiger_vintages_df.sort_values(by="name", ignore_index=True)
    return tiger_vintages_df

tiger_vintages_df = get_tiger_vintages_metadata()
print(f"Available TIGER vintages:")
names = textwrap.wrap(", ".join(list(tiger_vintages_df["clean_name"].values)), width=95)
for line in names:
    print(f"    {line}")
display(tiger_vintages_df.head(3))
Available TIGER vintages:
    TIGER1992, TIGER1999, TIGER2002, TIGER2003, TIGER2008, TIGER2009, TIGER2010, TIGER2011,
    TIGER2012, TIGER2013, TIGER2014, TIGER2015, TIGER2016, TIGER2017, TIGER2018, TIGER2019,
    TIGER2020, TIGER2021, TIGER2022
name last_modified size description is_dir clean_name is_file metadata_url
0 TIGER1992/ 2011-12-19 07:56:00 - True TIGER1992 False https://www2.census.gov/geo/tiger/TIGER1992
1 TIGER1999/ 2012-02-13 11:19:00 - True TIGER1999 False https://www2.census.gov/geo/tiger/TIGER1999
2 TIGER2002/ 2015-05-05 18:37:00 - True TIGER2002 False https://www2.census.gov/geo/tiger/TIGER2002

Now that we have tools for scraping any page on the Census’s file server, we can use those tools to retrieve data on the geographic entities in a given TIGER vintage.

Code
class TIGERCatalog:
    def __init__(self):
        self.dataset_vintages = get_tiger_vintages_metadata()

    def get_vintage_metadata(self, year: str) -> pd.DataFrame:
        return self.dataset_vintages.loc[self.dataset_vintages["name"] == f"TIGER{year}/"].copy()

class TIGERVintageCatalog:
    def __init__(self, year: str, catalog: Optional[TIGERCatalog] = None):
        self.year = str(year)
        self.set_catalog(catalog=catalog)

    def set_catalog(self, catalog: Optional[TIGERCatalog]) -> None:
        if catalog is None:
            self.catalog = TIGERCatalog()
        else:
            self.catalog = catalog

    @property
    def vintage_metadata(self):
        return self.catalog.get_vintage_metadata(year=self.year)

    @property
    def vintage_entities(self):
        if len(self.vintage_metadata) == 1:
            tiger_vintage_url = self.vintage_metadata["metadata_url"].values[0]
            return scrape_census_ftp_metadata_page(metadata_url=tiger_vintage_url)
        else:
            raise Exception(
                f"Failed to get unambiguous metadata (got {self.vintage_metadata})"
            )

    def get_entity_metadata(self, entity_name: str) -> pd.DataFrame:
        return self.vintage_entities.loc[self.vintage_entities["clean_name"] == entity_name].copy()

    def print_entity_names(self):
        entity_names = self.vintage_entities.loc[
            self.vintage_entities["is_dir"], "clean_name"
        ].values
        print(f"TIGER Entity options for the {self.year} TIGER vintage:")
        for entity_name in entity_names:
            print(f"  - {entity_name}")
        print(f"Entity count: {len(entity_names)}")
Code
tiger_catalog = TIGERCatalog()
vintage_entity_catalog = TIGERVintageCatalog(year="2022", catalog=tiger_catalog)
vintage_entity_catalog.print_entity_names()
TIGER Entity options for the 2022 TIGER vintage:
  - ADDR
  - ADDRFEAT
  - ADDRFN
  - AIANNH
  - AITSN
  - ANRC
  - AREALM
  - AREAWATER
  - BG
  - CD
  - COASTLINE
  - CONCITY
  - COUNTY
  - COUSUB
  - EDGES
  - ELSD
  - ESTATE
  - FACES
  - FACESAH
  - FACESAL
  - FACESMIL
  - FEATNAMES
  - LINEARWATER
  - MIL
  - PLACE
  - POINTLM
  - PRIMARYROADS
  - PRISECROADS
  - PUMA
  - RAILS
  - ROADS
  - SCSD
  - SDADM
  - SLDL
  - SLDU
  - STATE
  - SUBBARRIO
  - TABBLOCK20
  - TBG
  - TRACT
  - TTRACT
  - UAC
  - UNSD
  - ZCTA520
Entity count: 44

Let’s examine Census Tracts.

I’ll need some tooling to collect information on a given entity in a given TIGER vintage. I know this object will need data that’s in the relevant TIGERVintageCatalog instance, so I’ll make that an attribute of the entity vintage class. I could require that the user passes in a TIGERVintageCatalog instance (which would reduce the number of calls to the same Census resource in the use case where a user is working interactively with these classes), but ultimately I’m going to build out pipelines that only collect one TIGER entity at a time from a given vintage (as I don’t need most of the available entities listed above).

Code
class TIGERGeographicEntityVintage:
    def __init__(self, entity_name: str, year: str, catalog: Optional[TIGERCatalog] = None):
        self.entity_name = entity_name
        self.year = str(year)
        self.vintage_catalog = TIGERVintageCatalog(year=year, catalog=catalog)
        self.entity_metadata = self.vintage_catalog.get_entity_metadata(entity_name=self.entity_name)

    @property
    def entity_url(self):
        return self.entity_metadata["metadata_url"].values[0]

tiger_tract22_obj = TIGERGeographicEntityVintage(entity_name="TRACT", year="2022", catalog=tiger_catalog)
display(tiger_tract22_obj.entity_metadata)
entity_df = scrape_census_ftp_metadata_page(metadata_url=tiger_tract22_obj.entity_url)
print(entity_df.shape)
display(entity_df.head(2))
name last_modified size description is_dir clean_name is_file metadata_url
41 TRACT/ 2022-09-30 22:39:00 - True TRACT False https://www2.census.gov/geo/tiger/TIGER2022/TRACT
(56, 8)
name last_modified size description is_dir clean_name is_file metadata_url
1 tl_2022_01_tract.zip 2022-10-31 19:42:00 11M False tl_2022_01_tract.zip True https://www2.census.gov/geo/tiger/TIGER2022/TR...
2 tl_2022_02_tract.zip 2022-10-31 19:42:00 3.0M False tl_2022_02_tract.zip True https://www2.census.gov/geo/tiger/TIGER2022/TR...

Looking at the names, I see that the Census groups tracts by state FIPS (Federal Information Processing Series) code. I know the FIPS code for Illinois is “17” (you can review the FIPS codes for other geographic entities here). I also know I don’t always want to pull data for all states, so I need to add a method that lets the user filter the entity files. I should also build in the step of getting the entity files’ metadata.

Code
class TIGERGeographicEntityVintage:
    def __init__(self, entity_name: str, year: str, catalog: Optional[TIGERCatalog] = None):
        self.entity_name = entity_name
        self.year = str(year)
        self.vintage_catalog = TIGERVintageCatalog(year=year, catalog=catalog)
        self.entity_metadata = self.vintage_catalog.get_entity_metadata(entity_name=self.entity_name)

    @property
    def entity_url(self):
        return self.entity_metadata["metadata_url"].values[0]

    @property
    def entity_files_metadata(self):
        return scrape_census_ftp_metadata_page(metadata_url=self.entity_url)

    def get_entity_file_metadata(self, filter_str: str) -> pd.DataFrame:
        return self.entity_files_metadata.loc[self.entity_files_metadata["name"].str.contains(filter_str)].copy()

tiger_tract22_obj = TIGERGeographicEntityVintage(entity_name="TRACT", year="2022", catalog=tiger_catalog)
il_tracts22_metadata = tiger_tract22_obj.get_entity_file_metadata(filter_str="_17_")
display(il_tracts22_metadata)
print(f"IL Tracts archive download Url: {il_tracts22_metadata['metadata_url'].values[0]}")
name last_modified size description is_dir clean_name is_file metadata_url
14 tl_2022_17_tract.zip 2022-10-31 19:43:00 9.5M False tl_2022_17_tract.zip True https://www2.census.gov/geo/tiger/TIGER2022/TR...
IL Tracts archive download Url: https://www2.census.gov/geo/tiger/TIGER2022/TRACT/tl_2022_17_tract.zip
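
Because pandas’ str.contains() treats the pattern as a regular expression by default, the same filter_str hook can select several states at once. A quick sketch (the FIPS codes for Indiana, 18, and Wisconsin, 55, are my own additions, not values from the listing above):

Code
# One regex matches several state FIPS codes at once:
# 17 = Illinois, 18 = Indiana, 55 = Wisconsin.
midwest_tracts22_metadata = tiger_tract22_obj.get_entity_file_metadata(filter_str="_17_|_18_|_55_")
print(midwest_tracts22_metadata["name"].to_list())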

GeoPandas provides some extremely convenient functionality for loading data. I can provide the URL of a zipped archive (of geospatial files) to the geopandas read_file() function and it handles the network request and unzipping for me.

Code
il_tracts_gdf = gpd.read_file(il_tracts22_metadata["metadata_url"].values[0])
Code
print(il_tracts_gdf.shape)
il_tracts_gdf.head(2)
(3265, 13)
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry
0 17 019 010701 17019010701 107.01 Census Tract 107.01 G5020 S 5266000 30553 +40.1150269 -088.0329549 POLYGON ((-88.05240 40.11923, -88.05238 40.119...
1 17 019 005902 17019005902 59.02 Census Tract 59.02 G5020 S 962402 4892 +40.1087344 -088.2247204 POLYGON ((-88.22891 40.11271, -88.22882 40.112...
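
Reading straight from the URL does re-download the roughly 9.5M archive on every run, though. For repeated runs it can be worth caching the archive locally first with urlretrieve (imported at the top of this notebook) and pointing read_file() at the local copy instead. A minimal sketch, with a cache directory of my own choosing:

Code
from pathlib import Path

# Download the archive once and reuse the local copy on later runs.
archive_url = il_tracts22_metadata["metadata_url"].values[0]
cache_path = Path("data_raw") / Path(archive_url).name
cache_path.parent.mkdir(parents=True, exist_ok=True)
if not cache_path.is_file():
    urlretrieve(archive_url, cache_path)
il_tracts_gdf = gpd.read_file(cache_path)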

And now I have plottable, spatially joinable geospatial data, ready to ingest into a data warehouse table or map out.

Code
fig_width = 14

fig, ax = plt.subplots(figsize=(fig_width, fig_width))
ax = il_tracts_gdf.plot(facecolor="none", edgecolor="black", linewidth=0.015 * fig_width, ax=ax)
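
The cell above handles the mapping half; for the warehouse half, GeoPandas can write a GeoDataFrame straight to PostGIS with to_postgis(). Below is a minimal sketch assuming a PostGIS-backed warehouse; the connection string, schema, and table name are placeholders I’ve made up, not details from this notebook (to_postgis() also requires the geoalchemy2 package).

Code
from sqlalchemy import create_engine

# Placeholder connection details; swap in real warehouse credentials.
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")
il_tracts_gdf.to_postgis(
    name="tiger_il_tracts_2022",
    con=engine,
    schema="data_raw",
    if_exists="replace",
    index=False,
)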