How to set up and play with a Postgres db using Docker

A full tutorial of a basic workflow
docker
Postgres
PostGIS
sandbox
tutorial
Author

Matt Triano

Published

August 1, 2023

Modified

August 1, 2023

Code
from pathlib import Path

import pandas as pd

pd.options.display.max_columns = None

The importance of sandboxes

I’ve worked with SQL regularly over the past decade, but over the first half of that decade (and despite extensive formal knowledge of normalization, data modeling, and other database-related topics) I essentially only used SQL to extract data and use Python or R to do analysis. I only had access to production databases and I didn’t know enough about how database management system work to risk experiments, and as a result, I learned at a glacial pace. But in 2018, I stumbled into a project interesting enough to motivate me to install PostGIS on a personal machine and freed from the fear of accidentally taking down a production system (and the power of a superuser), I was able to lift up the hood and see how the machine worked, and this enabled me to learn and build with Postgres + PostGIS (a geospatial Postgres extension) far faster.

In this post, I show how to use Docker to set up a PostGIS database and experiment with it.

What is Docker

Docker and Docker Compose1 enable you to import and run complicated applications in just a few lines of code.

The internals of docker are really interesting, but for this post, all you need to know these things:

  1. A docker image is like a blueprint of an application, and it’s defined in Dockerfiles.
  2. A docker container is a runnable instance of the application, built from the instructions in the blueprint.
  3. You can configure your application in a docker-compose.yml file.

Our Dockerfile

The first one may sound complicated, as the contents of our Dockerfile show, this can only take one line of code (from us). That’s possible because Dockerfiles can build an image based on another image, and the developers of many open source projects (e.g. Ubuntu, PostgreSQL/PostGIS, MySQL, Go, nginx, etc) public official images on Docker Hub. Our Dockerfile pulls the postgis/postgis image (translated, it pulls the postgis base image from the postgis organization), and then adds nothing else. So in one line, we indicate that we want the postgis/postgis image that has the tag “15-3.3” (which they’ve intuitively used to indicate the image provides a PostGIS database that has version 15-3.3).

The full contents of our Dockerfile
!cat db_context/Dockerfile
FROM postgis/postgis:15-3.3

Our docker-compose.yml file

And this is our docker-compose.yml file. The docker-compose.yml file defines the services, and any networks, volumes, configs, and/or secrets/environment variables your system needs to work. The file below defines one service (named postgis) and one volume (named sandbox_postgis_data).

The contents of our docker-compose.yml
!cat docker-compose.yml
version: '3.9'

services:
  postgis:
    image: sandbox_postgis:15.3.3
    build:
      context: ./db_context
      dockerfile: Dockerfile
    ports:
      - 54321:5432
    environment:
      POSTGRES_DB: db_name
      POSTGRES_USER: db_username
      POSTGRES_PASSWORD: db_password
    volumes:
      - sandbox_postgis_data:/var/lib/postgresql/data

volumes:
  sandbox_postgis_data:

There are 3 top level elements in this file: version, services, and volumes.

  • version indicates the version of the Docker Compose specification; I don’t think I’ve ever set it to anything other than 3.9.
  • services defines configurations for each container your system needs.
  • volumes defines persistent data stores that can be shared by different services.

Services

The postgis service has five elements: image, build, ports, environment, and volumes.

The image and build elements define the docker image to use; image defines both the name (“sandbox_postgis”) and tag (“15.5.3”) for the docker image, and build defines the Dockerfile to build into an image as well as the context to build into the docker image.

The port element defines a connection from a port on the host machine (host post 54321) to a port into the container (container port 5432, the default for PostgreSQL). We’ll use that later to connect to the database.

The volumes element (in the postgis service) defines a persistent storage volume that will hold the data in our database. Without this, our database would reset every time we restart the system2.

And the environment element enables you to set environment variables in the container. Here, we pass in environment variables POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD which are used to name the database (in this case, it’s named db_name) and create a superuser (in this case having username db_username and password db_password) when the database is first created in a new volume.

Building the image(s) used in our system

The first time you build the image defined by your Dockerfile, docker will read your Dockerfile(s), download all layers of the base image(s) (defined in lines starting with FROM), process each subsequent instruction into a layer, cache layers, and then bind the layers into an image. This is a template for creating containers.

The second time you run this command (assuming no changes have been made to the docker-compose.yml file, Dockerfile(s), or any other files the Dockerfile references), all layers will just be pulled from cache, producing much smaller output (like what’s shown below).

The output produced while building our images
!docker compose build
[+] Building 0.0s (0/2)                                                         
[+] Building 0.2s (3/4)                                                         
 => [postgis internal] load build definition from Dockerfile               0.0s
 => => transferring dockerfile: 64B                                        0.0s
 => [postgis internal] load .dockerignore                                  0.0s
 => => transferring context: 2B                                            0.0s
 => [postgis internal] load metadata for docker.io/postgis/postgis:15-3.3  0.1s
 => [postgis auth] postgis/postgis:pull token for registry-1.docker.io     0.0s
[+] Building 0.3s (3/4)                                                         
 => [postgis internal] load build definition from Dockerfile               0.0s
 => => transferring dockerfile: 64B                                        0.0s
 => [postgis internal] load .dockerignore                                  0.0s
 => => transferring context: 2B                                            0.0s
 => [postgis internal] load metadata for docker.io/postgis/postgis:15-3.3  0.3s
 => [postgis auth] postgis/postgis:pull token for registry-1.docker.io     0.0s
[+] Building 0.4s (6/6) FINISHED                                                
 => [postgis internal] load build definition from Dockerfile               0.0s
 => => transferring dockerfile: 64B                                        0.0s
 => [postgis internal] load .dockerignore                                  0.0s
 => => transferring context: 2B                                            0.0s
 => [postgis internal] load metadata for docker.io/postgis/postgis:15-3.3  0.3s
 => [postgis auth] postgis/postgis:pull token for registry-1.docker.io     0.0s
 => CACHED [postgis 1/1] FROM docker.io/postgis/postgis:15-3.3@sha256:55a  0.0s
 => [postgis] exporting to image                                           0.0s
 => => exporting layers                                                    0.0s
 => => writing image sha256:1460a9b1410a939c0fd0035d534c14cd5fb7d051c13a4  0.0s
 => => naming to docker.io/library/sandbox_postgis:15.3.3                  0.0s
Shows that our system’s one container is not running yet
!docker ps -f name=sandbox-postgis
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

docker ps shows running containers, and the -f name=... option allows us to filter to running containers with a name containing the entered string. Currently, “sandbox-postgis” isn’t part of the name of any running container, so let’s spin one up.

!docker compose up -d
[+] Running 2/0
 ✔ Network 015_docker_postgres_sandbox_default                Created      0.0s 
 ✔ Volume "015_docker_postgres_sandbox_sandbox_postgis_data"  Created      0.0s 
 ⠿ Container 015_docker_postgres_sandbox-postgis-1            Starting     0.1s 
[+] Running 2/3
 ✔ Network 015_docker_postgres_sandbox_default                Created      0.0s 
 ✔ Volume "015_docker_postgres_sandbox_sandbox_postgis_data"  Created      0.0s 
 ⠿ Container 015_docker_postgres_sandbox-postgis-1            Starting     0.2s 
[+] Running 2/3
 ✔ Network 015_docker_postgres_sandbox_default                Created      0.0s 
 ✔ Volume "015_docker_postgres_sandbox_sandbox_postgis_data"  Created      0.0s 
 ⠿ Container 015_docker_postgres_sandbox-postgis-1            Starting     0.3s 
[+] Running 2/3
 ✔ Network 015_docker_postgres_sandbox_default                Created      0.0s 
 ✔ Volume "015_docker_postgres_sandbox_sandbox_postgis_data"  Created      0.0s 
 ⠿ Container 015_docker_postgres_sandbox-postgis-1            Starting     0.4s 
[+] Running 3/3
 ✔ Network 015_docker_postgres_sandbox_default                Created      0.0s 
 ✔ Volume "015_docker_postgres_sandbox_sandbox_postgis_data"  Created      0.0s 
 ✔ Container 015_docker_postgres_sandbox-postgis-1            Started      0.4s 
Shows that our system’s one container is running now
!docker ps -f name=sandbox-postgis
CONTAINER ID   IMAGE                    COMMAND                  CREATED        STATUS        PORTS                                         NAMES
cf845e91c61a   sandbox_postgis:15.3.3   "docker-entrypoint.s…"   1 second ago   Up 1 second   0.0.0.0:54321->5432/tcp, :::54321->5432/tcp   015_docker_postgres_sandbox-postgis-1

Using our sandbox

Sending commands to our database

Now we have our database up and running on our local system, but now we need to connect to it from this jupyter notebook. I want to focus on experimenting with SQL, so I’ve implemented some functions than handle connecting to the database and executing SQL queries/commands.

Helper functions for our main python-to-postgres connector code
from contextlib import contextmanager
from typing import Tuple, Union, Optional

import autopep8
import pandas as pd
import psycopg2

def add_indentation(query, spaces=4):
    lines = query.splitlines()
    indented_lines = [(" " * spaces) + line for line in lines]
    indented_query = "\n".join(indented_lines)
    return indented_query

def execute_query_w_existing_conn(
    query: str, conn: psycopg2.extensions.connection
) -> Tuple[str, Union[pd.DataFrame, None]]:
    with conn.cursor() as cur:
        try:
            cur.execute(query)
            rows = cur.fetchall()
            col_names = [desc[0] for desc in cur.description]
            return (cur.statusmessage, pd.DataFrame(rows, columns=col_names))
        except psycopg2.errors.InsufficientPrivilege as err:
            return (add_indentation(f"\n{err.pgerror}  Error type: {type(err)}"), None)
        except psycopg2.ProgrammingError as err:
            if "no results to fetch" in str(err):
                return (cur.statusmessage, None)
            else:
                raise psycopg2.ProgrammingError(err)

def execute_query(
    query: str, conn: Optional[psycopg2.extensions.connection] = None
) -> Union[pd.DataFrame, None]:
    if conn is None:
        with get_db_connection() as new_conn:
            return execute_query_w_existing_conn(query=query, conn=new_conn)
    else:
        return execute_query_w_existing_conn(query=query, conn=conn)

def show_transaction_results(result: Tuple[str, str, Union[pd.DataFrame, None]]) -> None:
    query, status, result_df = result
    dedented_query = add_indentation(query=autopep8.fix_code(query))
    print(f"""\nQuery:\n{dedented_query}""")
    print(f"Database response message: '{status}'")
    if result_df is not None:
        print(f"records in result: {len(result_df)}")
        display(result_df.head())

The get_db_connection() function provides a connection to our database. Note that get_db_connection() uses the POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD environment variables as well as port number 54321, all of which we set in our docker-compose.yml file.

The execute_transaction() function takes in a string containing one or more semicolon-separated SQL queries, executes each query using the same connection, displays the query and response from the database, and returns a list containing the query, database response, and result_set for each query from the input.

@contextmanager
def get_db_connection():
    conn = psycopg2.connect(
        dbname="db_name",
        user="db_username",
        password="db_password",
        host="localhost",
        port=54321
    )
    conn.autocommit = True
    try:
        yield conn
    finally:
        conn.close()

def execute_transaction(
    query: str, print_results: bool = True
) -> Tuple[str, str, Union[pd.DataFrame, None]]:
    queries = [el.strip() for el in query.split(";") if el.strip() != ""]
    with get_db_connection() as conn:
        results = []
        for q in queries:
            status_msg, result_df = execute_query(query=q, conn=conn)
            result = (q, status_msg, result_df)
            results.append(result)
            if print_results:
                show_transaction_results(result=result)
    return results

Learning exercise

Imagine you’re designing a data warehousing platform and you want to let other analysts or data scientists query data in the warehouse. Some users need to be able to update records or insert new records, but others only need to be able to view some datasets.

PostgreSQL has a concept called Roles that enables you to define permissions for users or groups of users. If you aren’t already familiar with how roles and permissions work, you don’t have a skilled database administrator looking over your work, and you only have access to a database that other systems depend on, it would be extremely intimidating to figure out the correct permissions. Our sandbox takes off that weight by eliminating the cost of mistakes.

Let’s specify the needs for a data analyst role and then figure out how to meet our specification in our sandbox.

  • We want data analysts to be able to view data in tables in the clean_data schema.
  • We don’t want data analysts to be able to modify data in any table, or view data in any schema other than the public schema.
  • We want to be able to easily grant the data-analyst permissions to new users.

Mocking up a table

To test whether our role meets the specification, we need to mock up some test fixtures.

We’ll need a clean_data schema, and another non-public schema to test that our role can only view data in the clean_data schema.

We’ll also need tables in these schemas to attempt to view.

Creating raw_data and clean_data schemas and inserting a data table in each
results = execute_transaction(
    query="""
        CREATE SCHEMA raw_data;
        CREATE TABLE raw_data.customers (
            customer_id int PRIMARY KEY,
            customer_name text,
            contact_name text,
            country text,
            email text
        );
        INSERT INTO raw_data.customers (customer_id, customer_name, contact_name, country, email)
        VALUES
            (1, 'Customer A', 'Contact A', 'Country A', 'contactA@email.com'),
            (2, 'Customer B', 'Contact B', 'Country B', 'contactB@email.com'),
            (3, 'Customer C', 'Contact C', 'Country C', 'contactC@email.com');

        CREATE SCHEMA clean_data;
        CREATE TABLE clean_data.customers (
            customer_id int PRIMARY KEY,
            customer_name text,
            contact_name text,
            country text,
            email text
        );
        INSERT INTO clean_data.customers (customer_id, customer_name, contact_name, country, email)
        SELECT
            customer_id,
            customer_name,
            contact_name,
            country,
            lower(email) AS email
        FROM raw_data.customers;
    """,
    print_results=True,
)

Query:
    CREATE SCHEMA raw_data
Database response message: 'CREATE SCHEMA'

Query:
    CREATE TABLE raw_data.customers(
        customer_id int PRIMARY KEY,
        customer_name text,
        contact_name text,
        country text,
        email text
    )
Database response message: 'CREATE TABLE'

Query:
    INSERT INTO raw_data.customers(customer_id, customer_name, contact_name, country, email)
    VALUES
    (1, 'Customer A', 'Contact A', 'Country A', 'contactA@email.com'),
    (2, 'Customer B', 'Contact B', 'Country B', 'contactB@email.com'),
    (3, 'Customer C', 'Contact C', 'Country C', 'contactC@email.com')
Database response message: 'INSERT 0 3'

Query:
    CREATE SCHEMA clean_data
Database response message: 'CREATE SCHEMA'

Query:
    CREATE TABLE clean_data.customers(
        customer_id int PRIMARY KEY,
        customer_name text,
        contact_name text,
        country text,
        email text
    )
Database response message: 'CREATE TABLE'

Query:
    INSERT INTO clean_data.customers(customer_id, customer_name, contact_name, country, email)
       SELECT
           customer_id,
            customer_name,
            contact_name,
            country,
            lower(email) AS email
        FROM raw_data.customers
Database response message: 'INSERT 0 3'
Checking that my role can see a table
results = execute_transaction(
    query="""
        SELECT current_user;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM clean_data.customers
Database response message: 'SELECT 3'
records in result: 3
current_user
0 db_username
customer_id customer_name contact_name country email
0 1 Customer A Contact A Country A contacta@email.com
1 2 Customer B Contact B Country B contactb@email.com
2 3 Customer C Contact C Country C contactc@email.com

Creating a role to set permissions for a group of users

If we had to individually set the permissions for each database user, it wouldn’t take long before some user(s) were mistakenly granted privileges they shouldn’t have, and it would also be a chore for the data governance/IT team to have to manage this. Fortunately, Postgres allows us to create a role for a group of users (e.g. data analysts, data engineers, data scientists, etc), define the permissions that class should have, and then grant the role to relevant users.

Let’s create a role for data analysts.

Creating a new role and seeing what it can see before granting it permissions
results = execute_transaction(
    query="""
        CREATE ROLE data_analyst;

        SET ROLE data_analyst;
        SELECT current_user;

        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname = 'data_analyst';
    """,
    print_results=True,
)

Query:
    CREATE ROLE data_analyst
Database response message: 'CREATE ROLE'

Query:
    SET ROLE data_analyst
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.customers
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.customers
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.customers
Database response message: '    
    ERROR:  permission denied for schema clean_data
    LINE 1: SELECT * FROM clean_data.customers
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
    FROM pg_roles
    WHERE rolname = 'data_analyst'
Database response message: 'SELECT 1'
records in result: 1
current_user
0 data_analyst
rolname rolsuper rolcreaterole rolcreatedb
0 data_analyst False False False

Good. We created that role, switched into that role, confirmed we were in the data_analyst role, and then were stopped from accessing tables in schema’s we didn’t have permission to use.

From the result above, we see that the data_analyst doesn’t have permission to view the only tables the raw_data or clean_data schemas, but it is able to view the pg_roles table (which is in the public schema).

The error message also helps us see what we don’t have permission to access: the clean_data schema.

A few privileges have to be granted before the data_analyst role can view that table.

Because I implemented my database connector as a context manager, it closes the connection after every transaction. This is generally a good practice, as ensures that process that was listening to the connection is released (along with all of the process’s resources). In this situation however, when we execute another transaction, get_db_connection() will create a new connection that will have the original role: db_username, and we’ll have to set the role to data_analyst in each transaction.

Necessary Permission 1: GRANT USAGE ON SCHEMA
Granting the role permission to use a schema and checking again
results = execute_transaction(
    query="""
        GRANT USAGE ON SCHEMA clean_data TO data_analyst;
        SELECT current_user;
        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)

Query:
    GRANT USAGE ON SCHEMA clean_data TO data_analyst
Database response message: 'GRANT'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SET ROLE data_analyst
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.customers
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.customers
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.customers
Database response message: '    
    ERROR:  permission denied for table customers
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
current_user
0 db_username
current_user
0 data_analyst
Necessary Permission 2: GRANT privilege ON ALL TABLES IN SCHEMA

Our role still doesn’t have enough privileges to see the customers table in the clean_data schema. We haven’t specified whether the role should be allowed to delete tables, insert data into tables, select data from tables, etc, and postgres defaults to the more secure choice when there’s ambiguity. So we have to specify what we want to allow the role to do.

There are many privileges we could grant, but we only want to grant SELECT privileges.

Granting the data_analyst role privileges to SELECT from tables in the clean_data schema
results = execute_transaction(
    query="""
        GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst;

        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)

Query:
    GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst
Database response message: 'GRANT'

Query:
    SET ROLE data_analyst
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.customers
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.customers
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.customers
Database response message: 'SELECT 3'
records in result: 3
current_user
0 data_analyst
customer_id customer_name contact_name country email
0 1 Customer A Contact A Country A contacta@email.com
1 2 Customer B Contact B Country B contactb@email.com
2 3 Customer C Contact C Country C contactc@email.com

Now we see that the data_analyst role can select data from the clean_data.customers and can’t select from the raw_data.customers table. That seems to meet parts of our specification.

Let’s try creating some users and granting them the data_analyst role.

Creating a new user and granting it the data_analyst role
results = execute_transaction(
    query="""
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';

        CREATE USER analyst1;
        GRANT data_analyst TO analyst1;

        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';
    """,
    print_results=True,
)

Query:
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
    FROM pg_roles
    WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 1'
records in result: 1

Query:
    CREATE USER analyst1
Database response message: 'CREATE ROLE'

Query:
    GRANT data_analyst TO analyst1
Database response message: 'GRANT ROLE'

Query:
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
    FROM pg_roles
    WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 2'
records in result: 2
rolname rolsuper rolcreaterole rolcreatedb
0 data_analyst False False False
rolname rolsuper rolcreaterole rolcreatedb
0 data_analyst False False False
1 analyst1 False False False

Let’s confirm that our new user can view the clean_data.customers table and can’t view the raw_data.customers table.

Checking if our new user can view the tables the data_analyst role can view.
results = execute_transaction(
    query="""
        SET ROLE analyst1;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)

Query:
    SET ROLE analyst1
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.customers
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.customers
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.customers
Database response message: 'SELECT 3'
records in result: 3
current_user
0 analyst1
customer_id customer_name contact_name country email
0 1 Customer A Contact A Country A contacta@email.com
1 2 Customer B Contact B Country B contactb@email.com
2 3 Customer C Contact C Country C contactc@email.com

Confirmed, our new user has the ability to select from a table in the data_clean schema. Let’s add another table and confirm the data_analyst and analyst1 roles have permissions to view it, as well.

Mocking up new tables in both the raw_data and clean_data schemas
results = execute_transaction(
    query="""
        CREATE TABLE raw_data.orders (
            order_id int PRIMARY KEY,
            customer_id int,
            order_date text,
            product text,
            quantity int,
            FOREIGN KEY (customer_id) REFERENCES raw_data.customers(customer_id)
        );
        INSERT INTO raw_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10),
            (2, 1, '2023-01-02', 'Product B', 15),
            (3, 2, '2023-02-01', 'Product A', 5),
            (4, 3, '2023-02-02', 'Product C', 20);
        CREATE TABLE clean_data.orders (
            order_id int PRIMARY KEY,
            customer_id int,
            order_date date,
            product text,
            quantity int,
            FOREIGN KEY (customer_id) REFERENCES clean_data.customers(customer_id)
        );
        INSERT INTO clean_data.orders (order_id, customer_id, order_date, product, quantity)
        SELECT
            order_id,
            customer_id,
            order_date::date AS order_date,
            product,
            quantity
        FROM raw_data.orders;
    """,
    print_results=True,
)

Query:
    CREATE TABLE raw_data.orders(
        order_id int PRIMARY KEY,
        customer_id int,
        order_date text,
        product text,
        quantity int,
        FOREIGN KEY(customer_id) REFERENCES raw_data.customers(customer_id)
    )
Database response message: 'CREATE TABLE'

Query:
    INSERT INTO raw_data.orders(order_id, customer_id, order_date, product, quantity)
    VALUES
    (1, 1, '2023-01-01', 'Product A', 10),
    (2, 1, '2023-01-02', 'Product B', 15),
    (3, 2, '2023-02-01', 'Product A', 5),
    (4, 3, '2023-02-02', 'Product C', 20)
Database response message: 'INSERT 0 4'

Query:
    CREATE TABLE clean_data.orders(
        order_id int PRIMARY KEY,
        customer_id int,
        order_date date,
        product text,
        quantity int,
        FOREIGN KEY(customer_id) REFERENCES clean_data.customers(customer_id)
    )
Database response message: 'CREATE TABLE'

Query:
    INSERT INTO clean_data.orders(order_id, customer_id, order_date, product, quantity)
       SELECT
           order_id,
            customer_id,
            order_date: : date AS order_date,
            product,
            quantity
        FROM raw_data.orders
Database response message: 'INSERT 0 4'
Checking if our new user can view the tables the data_analyst role can view.
results = execute_transaction(
    query="""
        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;

        SET ROLE analyst1;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;
    """,
    print_results=True,
)

Query:
    SET ROLE data_analyst
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.orders
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.orders
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.orders
Database response message: '    
    ERROR:  permission denied for table orders
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SET ROLE analyst1
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.orders
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.orders
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.orders
Database response message: '    
    ERROR:  permission denied for table orders
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
current_user
0 data_analyst
current_user
0 analyst1

Wait, our neither the data_analyst nor the analyst1 roles have sufficient privileges to select from the new table in the clean_data table? Good thing we checked, as now we know we need to do more before our data_analyst role delivers the behavior we specified.

Necessary Permission 2 part 2: ALTER DEFAULT PRIVILEGES

GRANT SELECT ON ALL TABLES IN clean_data gave the data_analyst role (and any role inheriting from data_analyst at grant-execution-time) permission to select from any table in clean_data, but doesn’t set that as the default for users granted the data_analyst role later on. We have to ALTER DEFAULT PRIVILEGES (which will apply to future grantings), and then we have to run the GRANT SELECT ... command again (to apply to the already-existing analyst1).

Altering the default permissions for any user granted role data_analyst in the future
results = execute_transaction(
    query="""
        ALTER DEFAULT PRIVILEGES IN SCHEMA clean_data
            GRANT SELECT ON TABLES TO data_analyst;
        GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst;

        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;

        SET ROLE analyst1;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;
    """,
    print_results=True,
)

Query:
    ALTER DEFAULT PRIVILEGES IN SCHEMA clean_data
    GRANT SELECT ON TABLES TO data_analyst
Database response message: 'ALTER DEFAULT PRIVILEGES'

Query:
    GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst
Database response message: 'GRANT'

Query:
    SET ROLE data_analyst
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.orders
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.orders
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.orders
Database response message: 'SELECT 4'
records in result: 4

Query:
    SET ROLE analyst1
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.orders
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.orders
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.orders
Database response message: 'SELECT 4'
records in result: 4
current_user
0 data_analyst
order_id customer_id order_date product quantity
0 1 1 2023-01-01 Product A 10
1 2 1 2023-01-02 Product B 15
2 3 2 2023-02-01 Product A 5
3 4 3 2023-02-02 Product C 20
current_user
0 analyst1
order_id customer_id order_date product quantity
0 1 1 2023-01-01 Product A 10
1 2 1 2023-01-02 Product B 15
2 3 2 2023-02-01 Product A 5
3 4 3 2023-02-02 Product C 20

And as a final check, let’s confirm that we can create another user and that they’ll have the ability to select from clean_data schema tables.

Creating another user, granting them the data_analyst role, and checking permissions
results = execute_transaction(
    query="""
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';

        CREATE USER analyst2;
        GRANT data_analyst TO analyst2;

        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';

        SET ROLE analyst2;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;
    """,
    print_results=True,
)

Query:
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
    FROM pg_roles
    WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 2'
records in result: 2

Query:
    CREATE USER analyst2
Database response message: 'CREATE ROLE'

Query:
    GRANT data_analyst TO analyst2
Database response message: 'GRANT ROLE'

Query:
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
    FROM pg_roles
    WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 3'
records in result: 3

Query:
    SET ROLE analyst2
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    SELECT * FROM raw_data.orders
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: SELECT * FROM raw_data.orders
                          ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    SELECT * FROM clean_data.orders
Database response message: 'SELECT 4'
records in result: 4
rolname rolsuper rolcreaterole rolcreatedb
0 data_analyst False False False
1 analyst1 False False False
rolname rolsuper rolcreaterole rolcreatedb
0 data_analyst False False False
1 analyst1 False False False
2 analyst2 False False False
current_user
0 analyst2
order_id customer_id order_date product quantity
0 1 1 2023-01-01 Product A 10
1 2 1 2023-01-02 Product B 15
2 3 2 2023-02-01 Product A 5
3 4 3 2023-02-02 Product C 20
Attempting to insert data into tables in both the raw_data and clean_data schemas
results = execute_transaction(
    query="""
        SET ROLE analyst2;
        SELECT current_user;

        INSERT INTO raw_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);

        INSERT INTO clean_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);
    """,
    print_results=True,
)

Query:
    SET ROLE analyst2
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    INSERT INTO raw_data.orders(order_id, customer_id, order_date, product, quantity)
    VALUES
    (1, 1, '2023-01-01', 'Product A', 10)
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: INSERT INTO raw_data.orders (order_id, customer_id, order_da...
                        ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    INSERT INTO clean_data.orders(order_id, customer_id, order_date, product, quantity)
    VALUES
    (1, 1, '2023-01-01', 'Product A', 10)
Database response message: '    
    ERROR:  permission denied for table orders
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
current_user
0 analyst2
Attempting to insert data into tables in both the raw_data and clean_data schemas
results = execute_transaction(
    query="""
        SET ROLE data_analyst;
        SELECT current_user;

        INSERT INTO raw_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);

        INSERT INTO clean_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);
    """,
    print_results=True,
)

Query:
    SET ROLE data_analyst
Database response message: 'SET'

Query:
    SELECT current_user
Database response message: 'SELECT 1'
records in result: 1

Query:
    INSERT INTO raw_data.orders(order_id, customer_id, order_date, product, quantity)
    VALUES
    (1, 1, '2023-01-01', 'Product A', 10)
Database response message: '    
    ERROR:  permission denied for schema raw_data
    LINE 1: INSERT INTO raw_data.orders (order_id, customer_id, order_da...
                        ^
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'

Query:
    INSERT INTO clean_data.orders(order_id, customer_id, order_date, product, quantity)
    VALUES
    (1, 1, '2023-01-01', 'Product A', 10)
Database response message: '    
    ERROR:  permission denied for table orders
      Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
current_user
0 data_analyst

Specification Met

It looks like we have a role that we can assign to new users to grant them permission to select data from tables in only the clean_data schema. And we’ve also confirmed our roles can’t insert data into tables (and likely can’t update or delete either). So we’ve developed and tested a solution that can meet our specification.

We create our data_analyst role, grant it privileges, and set its default privileges

GRANT USAGE ON SCHEMA clean_data TO data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst;
ALTER DEFAULT PRIVILEGES IN SCHEMA clean_data
    GRANT SELECT ON TABLES TO data_analyst;

and we can grant new analyst users those privileges

GRANT data_analyst TO new_analyst;

Success!

Cleaning up the sandbox

When we’re done with our experiments, we can tell docker to shut down our sandbox container and delete the volume storing our sandbox’s database via this command.

docker compose down -v

To shut down our container without deleting the volume, just leave off the -v flag.

Shutting down our sandbox and deleting our volume
!docker compose down -v
[+] Running 0/0
 ⠋ Container 015_docker_postgres_sandbox-postgis-1  Stopping               0.1s 
[+] Running 0/1
 ⠙ Container 015_docker_postgres_sandbox-postgis-1  Stopping               0.2s 
[+] Running 0/1
 ⠹ Container 015_docker_postgres_sandbox-postgis-1  Stopping               0.3s 
[+] Running 2/1
 ✔ Container 015_docker_postgres_sandbox-postgis-1          Removed        0.3s 
 ✔ Volume 015_docker_postgres_sandbox_sandbox_postgis_data  Removed        0.0s 
 ⠋ Network 015_docker_postgres_sandbox_default              Removing       0.1s 
[+] Running 2/3
 ✔ Container 015_docker_postgres_sandbox-postgis-1          Removed        0.3s 
 ✔ Volume 015_docker_postgres_sandbox_sandbox_postgis_data  Removed        0.0s 
 ⠙ Network 015_docker_postgres_sandbox_default              Removing       0.2s 
[+] Running 3/3
 ✔ Container 015_docker_postgres_sandbox-postgis-1          Removed        0.3s 
 ✔ Volume 015_docker_postgres_sandbox_sandbox_postgis_data  Removed        0.0s 
 ✔ Network 015_docker_postgres_sandbox_default              Removed        0.2s 
Shows that our system’s one container has shut down
!docker ps -a -f name=sandbox-postgis
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Footnotes

  1. Before you can use docker and compose, you have to install docker and compose. I installed the docker engine and compose, but it looks like the developers at Docker Inc. guide people towards installing their Docker Desktop client. In either setup, docker should still become available to you via the command line, so instructions in this post should work.↩︎

  2. Docker containers are designed to be a replicable instance of an image, so when you shut down your application, your containers are removed and new ones are created next time you start it up. This is great for reproducibility (you always get a new, clean instance based on your image), but you don’t want the data you ingest into your database to get wiped every time you shut down your system, so you can define a persistent volume that will live on in the host system.↩︎