from pathlib import Path
import pandas as pd
pd.options.display.max_columns = None
Matt Triano
August 1, 2023
I’ve worked with SQL regularly over the past decade, but for the first half of that decade (and despite extensive formal knowledge of normalization, data modeling, and other database-related topics), I essentially only used SQL to extract data, doing my analysis in Python or R. I only had access to production databases, and I didn’t know enough about how database management systems work to risk experiments, so I learned at a glacial pace. But in 2018, I stumbled into a project interesting enough to motivate me to install PostGIS on a personal machine. Freed from the fear of accidentally taking down a production system (and armed with the power of a superuser), I was able to lift up the hood and see how the machine worked, which enabled me to learn and build with Postgres + PostGIS (a geospatial Postgres extension) far faster.
In this post, I show how to use Docker to set up a PostGIS database and experiment with it.
Docker and Docker Compose¹ enable you to define and run complicated applications in just a few lines of code.
The internals of docker are really interesting, but for this post, all you need to know is these two things:
- A Dockerfile defines the instructions for building a docker image.
- Your system's services are configured in a docker-compose.yml file.

The first one may sound complicated, but as the contents of our Dockerfile show, it can take as little as one line of code (from us). That’s possible because a Dockerfile can build an image on top of another image, and the developers of many open source projects (e.g. Ubuntu, PostgreSQL/PostGIS, MySQL, Go, nginx, etc.) publish official images on Docker Hub. Our Dockerfile pulls the postgis/postgis image (translated: it pulls the postgis base image from the postgis organization) and adds nothing else. So in one line, we indicate that we want the postgis/postgis image with the tag “15-3.3” (which they’ve intuitively used to indicate the image provides PostgreSQL 15 with PostGIS 3.3).
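The Dockerfile itself isn’t reproduced inline in this rendering, but per the description above, it contains just:

# Build on the official PostGIS image: PostgreSQL 15 with PostGIS 3.3
FROM postgis/postgis:15-3.3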
And this is our docker-compose.yml file. The docker-compose.yml file defines the services, along with any networks, volumes, configs, and/or secrets/environment variables your system needs to work. The file below defines one service (named postgis) and one volume (named sandbox_postgis_data).
version: '3.9'
services:
  postgis:
    image: sandbox_postgis:15.3.3
    build:
      context: ./db_context
      dockerfile: Dockerfile
    ports:
      - 54321:5432
    environment:
      POSTGRES_DB: db_name
      POSTGRES_USER: db_username
      POSTGRES_PASSWORD: db_password
    volumes:
      - sandbox_postgis_data:/var/lib/postgresql/data
volumes:
  sandbox_postgis_data:
There are 3 top-level elements in this file: version, services, and volumes.
- version indicates the version of the Docker Compose specification; I don’t think I’ve ever set it to anything other than 3.9.
- services defines configurations for each container your system needs.
- volumes defines persistent data stores that can be shared by different services.

The postgis service has five elements: image, build, ports, environment, and volumes.
The image and build elements define the docker image to use; image sets both the name (“sandbox_postgis”) and tag (“15.3.3”) for the docker image, and build specifies the Dockerfile to build as well as the build context (the directory of files available during the build).
The ports element defines a connection from a port on the host machine (host port 54321) to a port in the container (container port 5432, the default for PostgreSQL). We’ll use that later to connect to the database.
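For instance, once the container is up, you could connect from the host with psql (assuming the psql client is installed on the host; the credentials are the ones set in the compose file):

# Connect through the mapped host port; enter db_password when prompted
psql -h localhost -p 54321 -U db_username -d db_name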
The volumes element (in the postgis service) defines a persistent storage volume that will hold the data in our database. Without this, our database would reset every time we restart the system².
And the environment element enables you to set environment variables in the container. Here, we pass in the environment variables POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD, which are used to name the database (in this case, it’s named db_name) and create a superuser (in this case having username db_username and password db_password) when the database is first created in a new volume.
The first time you build the image defined by your Dockerfile, docker will read your Dockerfile(s), download all layers of the base image(s) (defined in lines starting with FROM), process each subsequent instruction into a layer, cache the layers, and then bind the layers into an image. The resulting image is a template for creating containers.
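The build command itself isn’t shown in this rendering; with this compose file, it would presumably be:

# Build the image for the postgis service defined in docker-compose.yml
docker compose build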
The second time you run this command (assuming no changes have been made to the docker-compose.yml
file, Dockerfile(s), or any other files the Dockerfile references), all layers will just be pulled from cache, producing much smaller output (like what’s shown below).
[+] Building 0.4s (6/6) FINISHED
=> [postgis internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 64B 0.0s
=> [postgis internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [postgis internal] load metadata for docker.io/postgis/postgis:15-3.3 0.3s
=> [postgis auth] postgis/postgis:pull token for registry-1.docker.io 0.0s
=> CACHED [postgis 1/1] FROM docker.io/postgis/postgis:15-3.3@sha256:55a 0.0s
=> [postgis] exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:1460a9b1410a939c0fd0035d534c14cd5fb7d051c13a4 0.0s
=> => naming to docker.io/library/sandbox_postgis:15.3.3 0.0s
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
docker ps shows running containers, and the -f name=... option allows us to filter to running containers with a name containing the entered string. Currently, “sandbox-postgis” isn’t part of the name of any running container, so let’s spin one up.
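The command cell is collapsed here; judging from the output below, it was presumably:

# Create and start the postgis service in the background (detached mode)
docker compose up -d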
[+] Running 3/3
✔ Network 015_docker_postgres_sandbox_default Created 0.0s
✔ Volume "015_docker_postgres_sandbox_sandbox_postgis_data" Created 0.0s
✔ Container 015_docker_postgres_sandbox-postgis-1 Started 0.4s
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cf845e91c61a sandbox_postgis:15.3.3 "docker-entrypoint.s…" 1 second ago Up 1 second 0.0.0.0:54321->5432/tcp, :::54321->5432/tcp 015_docker_postgres_sandbox-postgis-1
Now we have our database up and running on our local system, but we still need to connect to it from this jupyter notebook. I want to focus on experimenting with SQL, so I’ve implemented some functions that handle connecting to the database and executing SQL queries/commands.
from contextlib import contextmanager
from typing import List, Optional, Tuple, Union

import autopep8
import pandas as pd
import psycopg2
from IPython.display import display


def add_indentation(query, spaces=4):
    lines = query.splitlines()
    indented_lines = [(" " * spaces) + line for line in lines]
    indented_query = "\n".join(indented_lines)
    return indented_query


def execute_query_w_existing_conn(
    query: str, conn: psycopg2.extensions.connection
) -> Tuple[str, Union[pd.DataFrame, None]]:
    with conn.cursor() as cur:
        try:
            cur.execute(query)
            rows = cur.fetchall()
            col_names = [desc[0] for desc in cur.description]
            return (cur.statusmessage, pd.DataFrame(rows, columns=col_names))
        except psycopg2.errors.InsufficientPrivilege as err:
            return (add_indentation(f"\n{err.pgerror} Error type: {type(err)}"), None)
        except psycopg2.ProgrammingError as err:
            if "no results to fetch" in str(err):
                # Statements like CREATE or GRANT produce no result set.
                return (cur.statusmessage, None)
            else:
                raise


def execute_query(
    query: str, conn: Optional[psycopg2.extensions.connection] = None
) -> Tuple[str, Union[pd.DataFrame, None]]:
    if conn is None:
        with get_db_connection() as new_conn:
            return execute_query_w_existing_conn(query=query, conn=new_conn)
    else:
        return execute_query_w_existing_conn(query=query, conn=conn)


def show_transaction_results(result: Tuple[str, str, Union[pd.DataFrame, None]]) -> None:
    query, status, result_df = result
    formatted_query = add_indentation(query=autopep8.fix_code(query))
    print(f"""\nQuery:\n{formatted_query}""")
    print(f"Database response message: '{status}'")
    if result_df is not None:
        print(f"records in result: {len(result_df)}")
        display(result_df.head())
The get_db_connection() function provides a connection to our database. Note that get_db_connection() uses the POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD values, as well as host port number 54321, all of which we set in our docker-compose.yml file.
The execute_transaction() function takes in a string containing one or more semicolon-separated SQL statements, executes each one over the same connection, displays each query and the database’s response, and returns a list containing the (query, database response, result_set) tuple for each query from the input. (Splitting naively on semicolons would break on string literals that contain semicolons, but it’s fine for a sandbox.)
@contextmanager
def get_db_connection():
    conn = psycopg2.connect(
        dbname="db_name",
        user="db_username",
        password="db_password",
        host="localhost",
        port=54321,
    )
    conn.autocommit = True
    try:
        yield conn
    finally:
        conn.close()


def execute_transaction(
    query: str, print_results: bool = True
) -> List[Tuple[str, str, Union[pd.DataFrame, None]]]:
    queries = [el.strip() for el in query.split(";") if el.strip() != ""]
    with get_db_connection() as conn:
        results = []
        for q in queries:
            status_msg, result_df = execute_query(query=q, conn=conn)
            result = (q, status_msg, result_df)
            results.append(result)
            if print_results:
                show_transaction_results(result=result)
        return results
Imagine you’re designing a data warehousing platform and you want to let other analysts or data scientists query data in the warehouse. Some users need to be able to update records or insert new records, but others only need to be able to view some datasets.
PostgreSQL has a concept called Roles that enables you to define permissions for users or groups of users. If you aren’t already familiar with how roles and permissions work, you don’t have a skilled database administrator looking over your work, and you only have access to a database that other systems depend on, it would be extremely intimidating to figure out the correct permissions. Our sandbox takes off that weight by eliminating the cost of mistakes.
Let’s specify the needs for a data analyst role and then figure out how to meet our specification in our sandbox.
A data_analyst role should:
- be able to view (i.e., SELECT from) tables in the clean_data schema, and
- not be able to access data in any schema other than clean_data (aside from the default public schema).

To test whether our role meets the specification, we need to mock up some test fixtures. We’ll need a clean_data schema, and another non-public schema, to test that our role can only view data in the clean_data schema.
We’ll also need tables in these schemas to attempt to view.
results = execute_transaction(
    query="""
        CREATE SCHEMA raw_data;
        CREATE TABLE raw_data.customers (
            customer_id int PRIMARY KEY,
            customer_name text,
            contact_name text,
            country text,
            email text
        );
        INSERT INTO raw_data.customers (customer_id, customer_name, contact_name, country, email)
        VALUES
            (1, 'Customer A', 'Contact A', 'Country A', 'contactA@email.com'),
            (2, 'Customer B', 'Contact B', 'Country B', 'contactB@email.com'),
            (3, 'Customer C', 'Contact C', 'Country C', 'contactC@email.com');
        CREATE SCHEMA clean_data;
        CREATE TABLE clean_data.customers (
            customer_id int PRIMARY KEY,
            customer_name text,
            contact_name text,
            country text,
            email text
        );
        INSERT INTO clean_data.customers (customer_id, customer_name, contact_name, country, email)
        SELECT
            customer_id,
            customer_name,
            contact_name,
            country,
            lower(email) AS email
        FROM raw_data.customers;
    """,
    print_results=True,
)
Query:
CREATE SCHEMA raw_data
Database response message: 'CREATE SCHEMA'
Query:
CREATE TABLE raw_data.customers(
customer_id int PRIMARY KEY,
customer_name text,
contact_name text,
country text,
email text
)
Database response message: 'CREATE TABLE'
Query:
INSERT INTO raw_data.customers(customer_id, customer_name, contact_name, country, email)
VALUES
(1, 'Customer A', 'Contact A', 'Country A', 'contactA@email.com'),
(2, 'Customer B', 'Contact B', 'Country B', 'contactB@email.com'),
(3, 'Customer C', 'Contact C', 'Country C', 'contactC@email.com')
Database response message: 'INSERT 0 3'
Query:
CREATE SCHEMA clean_data
Database response message: 'CREATE SCHEMA'
Query:
CREATE TABLE clean_data.customers(
customer_id int PRIMARY KEY,
customer_name text,
contact_name text,
country text,
email text
)
Database response message: 'CREATE TABLE'
Query:
INSERT INTO clean_data.customers(customer_id, customer_name, contact_name, country, email)
SELECT
customer_id,
customer_name,
contact_name,
country,
lower(email) AS email
FROM raw_data.customers
Database response message: 'INSERT 0 3'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM clean_data.customers
Database response message: 'SELECT 3'
records in result: 3
 | current_user
---|---
0 | db_username

 | customer_id | customer_name | contact_name | country | email
---|---|---|---|---|---
0 | 1 | Customer A | Contact A | Country A | contacta@email.com
1 | 2 | Customer B | Contact B | Country B | contactb@email.com
2 | 3 | Customer C | Contact C | Country C | contactc@email.com
If we had to individually set the permissions for each database user, it wouldn’t take long before some user(s) were mistakenly granted privileges they shouldn’t have, and it would also be a chore for the data governance/IT team to have to manage this. Fortunately, Postgres allows us to create a role for a group of users (e.g. data analysts, data engineers, data scientists, etc), define the permissions that class should have, and then grant the role to relevant users.
Let’s create a role for data analysts.
results = execute_transaction(
    query="""
        CREATE ROLE data_analyst;
        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname = 'data_analyst';
    """,
    print_results=True,
)
Query:
CREATE ROLE data_analyst
Database response message: 'CREATE ROLE'
Query:
SET ROLE data_analyst
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.customers
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.customers
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.customers
Database response message: '
ERROR: permission denied for schema clean_data
LINE 1: SELECT * FROM clean_data.customers
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname = 'data_analyst'
Database response message: 'SELECT 1'
records in result: 1
 | current_user
---|---
0 | data_analyst

 | rolname | rolsuper | rolcreaterole | rolcreatedb
---|---|---|---|---
0 | data_analyst | False | False | False
Good. We created that role, switched into it, confirmed we were in the data_analyst role, and then were stopped from accessing tables in schemas we didn’t have permission to use.
From the result above, we see that the data_analyst role doesn’t have permission to view the tables in the raw_data or clean_data schemas, but it is able to query pg_roles (a system catalog view that any role can read).
The error message also helps us see what we don’t have permission to access: the clean_data schema. A few privileges have to be granted before the data_analyst role can view that table.
Because I implemented my database connector as a context manager, it closes the connection after every transaction. This is generally a good practice, as it ensures that the process that was listening to the connection is released (along with all of that process’s resources). In this situation, however, when we execute another transaction, get_db_connection() will create a new connection with the original role (db_username), and we’ll have to set the role to data_analyst in each transaction.
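The code cell for this step is collapsed in the rendered post, but judging from the echoed queries below, it was roughly:

results = execute_transaction(
    query="""
        GRANT USAGE ON SCHEMA clean_data TO data_analyst;
        SELECT current_user;
        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)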
Query:
GRANT USAGE ON SCHEMA clean_data TO data_analyst
Database response message: 'GRANT'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SET ROLE data_analyst
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.customers
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.customers
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.customers
Database response message: '
ERROR: permission denied for table customers
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
 | current_user
---|---
0 | db_username

 | current_user
---|---
0 | data_analyst
Our role still doesn’t have enough privileges to see the customers table in the clean_data schema. We haven’t specified whether the role should be allowed to delete tables, insert data into tables, select data from tables, etc., and postgres defaults to the more secure choice when there’s ambiguity. So we have to specify what we want to allow the role to do.
There are many privileges we could grant, but we only want to grant SELECT privileges.
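Again, the code cell is collapsed; from the echoed queries below, it was presumably:

results = execute_transaction(
    query="""
        GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst;
        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)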
Query:
GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst
Database response message: 'GRANT'
Query:
SET ROLE data_analyst
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.customers
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.customers
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.customers
Database response message: 'SELECT 3'
records in result: 3
 | current_user
---|---
0 | data_analyst

 | customer_id | customer_name | contact_name | country | email
---|---|---|---|---|---
0 | 1 | Customer A | Contact A | Country A | contacta@email.com
1 | 2 | Customer B | Contact B | Country B | contactb@email.com
2 | 3 | Customer C | Contact C | Country C | contactc@email.com
Now we see that the data_analyst role can select data from the clean_data.customers table and can’t select from the raw_data.customers table. That seems to meet parts of our specification.
Let’s try creating some users and granting them the data_analyst role.
results = execute_transaction(
    query="""
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';
        CREATE USER analyst1;
        GRANT data_analyst TO analyst1;
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';
    """,
    print_results=True,
)
Query:
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 1'
records in result: 1
Query:
CREATE USER analyst1
Database response message: 'CREATE ROLE'
Query:
GRANT data_analyst TO analyst1
Database response message: 'GRANT ROLE'
Query:
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 2'
records in result: 2
 | rolname | rolsuper | rolcreaterole | rolcreatedb
---|---|---|---|---
0 | data_analyst | False | False | False

 | rolname | rolsuper | rolcreaterole | rolcreatedb
---|---|---|---|---
0 | data_analyst | False | False | False
1 | analyst1 | False | False | False
Let’s confirm that our new user can view the clean_data.customers table and can’t view the raw_data.customers table.
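The collapsed cell, reconstructed from the echoed queries below:

results = execute_transaction(
    query="""
        SET ROLE analyst1;
        SELECT current_user;
        SELECT * FROM raw_data.customers;
        SELECT * FROM clean_data.customers;
    """,
    print_results=True,
)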
Query:
SET ROLE analyst1
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.customers
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.customers
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.customers
Database response message: 'SELECT 3'
records in result: 3
 | current_user
---|---
0 | analyst1

 | customer_id | customer_name | contact_name | country | email
---|---|---|---|---|---
0 | 1 | Customer A | Contact A | Country A | contacta@email.com
1 | 2 | Customer B | Contact B | Country B | contactb@email.com
2 | 3 | Customer C | Contact C | Country C | contactc@email.com
Confirmed: our new user has the ability to select from a table in the clean_data schema. Let’s add another table and confirm the data_analyst and analyst1 roles have permission to view it, as well.
results = execute_transaction(
    query="""
        CREATE TABLE raw_data.orders (
            order_id int PRIMARY KEY,
            customer_id int,
            order_date text,
            product text,
            quantity int,
            FOREIGN KEY (customer_id) REFERENCES raw_data.customers(customer_id)
        );
        INSERT INTO raw_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10),
            (2, 1, '2023-01-02', 'Product B', 15),
            (3, 2, '2023-02-01', 'Product A', 5),
            (4, 3, '2023-02-02', 'Product C', 20);
        CREATE TABLE clean_data.orders (
            order_id int PRIMARY KEY,
            customer_id int,
            order_date date,
            product text,
            quantity int,
            FOREIGN KEY (customer_id) REFERENCES clean_data.customers(customer_id)
        );
        INSERT INTO clean_data.orders (order_id, customer_id, order_date, product, quantity)
        SELECT
            order_id,
            customer_id,
            order_date::date AS order_date,
            product,
            quantity
        FROM raw_data.orders;
    """,
    print_results=True,
)
Query:
CREATE TABLE raw_data.orders(
order_id int PRIMARY KEY,
customer_id int,
order_date text,
product text,
quantity int,
FOREIGN KEY(customer_id) REFERENCES raw_data.customers(customer_id)
)
Database response message: 'CREATE TABLE'
Query:
INSERT INTO raw_data.orders(order_id, customer_id, order_date, product, quantity)
VALUES
(1, 1, '2023-01-01', 'Product A', 10),
(2, 1, '2023-01-02', 'Product B', 15),
(3, 2, '2023-02-01', 'Product A', 5),
(4, 3, '2023-02-02', 'Product C', 20)
Database response message: 'INSERT 0 4'
Query:
CREATE TABLE clean_data.orders(
order_id int PRIMARY KEY,
customer_id int,
order_date date,
product text,
quantity int,
FOREIGN KEY(customer_id) REFERENCES clean_data.customers(customer_id)
)
Database response message: 'CREATE TABLE'
Query:
INSERT INTO clean_data.orders(order_id, customer_id, order_date, product, quantity)
SELECT
order_id,
customer_id,
order_date: : date AS order_date,
product,
quantity
FROM raw_data.orders
Database response message: 'INSERT 0 4'
Query:
SET ROLE data_analyst
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.orders
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.orders
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.orders
Database response message: '
ERROR: permission denied for table orders
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SET ROLE analyst1
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.orders
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.orders
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.orders
Database response message: '
ERROR: permission denied for table orders
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
 | current_user
---|---
0 | data_analyst

 | current_user
---|---
0 | analyst1
Wait, neither the data_analyst nor the analyst1 role has sufficient privileges to select from the new table in the clean_data schema? Good thing we checked, as now we know we need to do more before our data_analyst role delivers the behavior we specified.
GRANT SELECT ON ALL TABLES IN SCHEMA clean_data gave the data_analyst role (and, through role membership, analyst1) permission to select from every table that existed in clean_data when the grant ran, but it says nothing about tables created afterward. We have to ALTER DEFAULT PRIVILEGES (which applies to tables created in the schema from then on), and then run the GRANT SELECT ... command again (to cover the already-existing clean_data.orders table).
results = execute_transaction(
    query="""
        ALTER DEFAULT PRIVILEGES IN SCHEMA clean_data
        GRANT SELECT ON TABLES TO data_analyst;
        GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst;
        SET ROLE data_analyst;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;
        SET ROLE analyst1;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;
    """,
    print_results=True,
)
Query:
ALTER DEFAULT PRIVILEGES IN SCHEMA clean_data
GRANT SELECT ON TABLES TO data_analyst
Database response message: 'ALTER DEFAULT PRIVILEGES'
Query:
GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst
Database response message: 'GRANT'
Query:
SET ROLE data_analyst
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.orders
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.orders
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.orders
Database response message: 'SELECT 4'
records in result: 4
Query:
SET ROLE analyst1
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.orders
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.orders
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.orders
Database response message: 'SELECT 4'
records in result: 4
 | current_user
---|---
0 | data_analyst

 | order_id | customer_id | order_date | product | quantity
---|---|---|---|---|---
0 | 1 | 1 | 2023-01-01 | Product A | 10
1 | 2 | 1 | 2023-01-02 | Product B | 15
2 | 3 | 2 | 2023-02-01 | Product A | 5
3 | 4 | 3 | 2023-02-02 | Product C | 20

 | current_user
---|---
0 | analyst1

 | order_id | customer_id | order_date | product | quantity
---|---|---|---|---|---
0 | 1 | 1 | 2023-01-01 | Product A | 10
1 | 2 | 1 | 2023-01-02 | Product B | 15
2 | 3 | 2 | 2023-02-01 | Product A | 5
3 | 4 | 3 | 2023-02-02 | Product C | 20
And as a final check, let’s confirm that we can create another user and that they’ll have the ability to select from clean_data schema tables.
results = execute_transaction(
    query="""
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';
        CREATE USER analyst2;
        GRANT data_analyst TO analyst2;
        SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
        FROM pg_roles
        WHERE rolname LIKE '%analyst%';
        SET ROLE analyst2;
        SELECT current_user;
        SELECT * FROM raw_data.orders;
        SELECT * FROM clean_data.orders;
    """,
    print_results=True,
)
Query:
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 2'
records in result: 2
Query:
CREATE USER analyst2
Database response message: 'CREATE ROLE'
Query:
GRANT data_analyst TO analyst2
Database response message: 'GRANT ROLE'
Query:
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname LIKE '%analyst%'
Database response message: 'SELECT 3'
records in result: 3
Query:
SET ROLE analyst2
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
SELECT * FROM raw_data.orders
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: SELECT * FROM raw_data.orders
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
SELECT * FROM clean_data.orders
Database response message: 'SELECT 4'
records in result: 4
 | rolname | rolsuper | rolcreaterole | rolcreatedb
---|---|---|---|---
0 | data_analyst | False | False | False
1 | analyst1 | False | False | False

 | rolname | rolsuper | rolcreaterole | rolcreatedb
---|---|---|---|---
0 | data_analyst | False | False | False
1 | analyst1 | False | False | False
2 | analyst2 | False | False | False

 | current_user
---|---
0 | analyst2

 | order_id | customer_id | order_date | product | quantity
---|---|---|---|---|---
0 | 1 | 1 | 2023-01-01 | Product A | 10
1 | 2 | 1 | 2023-01-02 | Product B | 15
2 | 3 | 2 | 2023-02-01 | Product A | 5
3 | 4 | 3 | 2023-02-02 | Product C | 20
Let’s also confirm that members of the data_analyst role can’t insert data into these tables.

results = execute_transaction(
    query="""
        SET ROLE analyst2;
        SELECT current_user;
        INSERT INTO raw_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);
        INSERT INTO clean_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);
    """,
    print_results=True,
)
Query:
SET ROLE analyst2
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
INSERT INTO raw_data.orders(order_id, customer_id, order_date, product, quantity)
VALUES
(1, 1, '2023-01-01', 'Product A', 10)
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: INSERT INTO raw_data.orders (order_id, customer_id, order_da...
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
INSERT INTO clean_data.orders(order_id, customer_id, order_date, product, quantity)
VALUES
(1, 1, '2023-01-01', 'Product A', 10)
Database response message: '
ERROR: permission denied for table orders
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
 | current_user
---|---
0 | analyst2
results = execute_transaction(
    query="""
        SET ROLE data_analyst;
        SELECT current_user;
        INSERT INTO raw_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);
        INSERT INTO clean_data.orders (order_id, customer_id, order_date, product, quantity)
        VALUES
            (1, 1, '2023-01-01', 'Product A', 10);
    """,
    print_results=True,
)
Query:
SET ROLE data_analyst
Database response message: 'SET'
Query:
SELECT current_user
Database response message: 'SELECT 1'
records in result: 1
Query:
INSERT INTO raw_data.orders(order_id, customer_id, order_date, product, quantity)
VALUES
(1, 1, '2023-01-01', 'Product A', 10)
Database response message: '
ERROR: permission denied for schema raw_data
LINE 1: INSERT INTO raw_data.orders (order_id, customer_id, order_da...
^
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
Query:
INSERT INTO clean_data.orders(order_id, customer_id, order_date, product, quantity)
VALUES
(1, 1, '2023-01-01', 'Product A', 10)
Database response message: '
ERROR: permission denied for table orders
Error type: <class 'psycopg2.errors.InsufficientPrivilege'>'
 | current_user
---|---
0 | data_analyst
It looks like we have a role that we can assign to new users to grant them permission to select data from tables in only the clean_data schema. And we’ve also confirmed our roles can’t insert data into tables (and likely can’t update or delete either). So we’ve developed and tested a solution that meets our specification.
We create our data_analyst role, grant it privileges, and set its default privileges:

CREATE ROLE data_analyst;
GRANT USAGE ON SCHEMA clean_data TO data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA clean_data TO data_analyst;
ALTER DEFAULT PRIVILEGES IN SCHEMA clean_data
    GRANT SELECT ON TABLES TO data_analyst;
and we can grant those privileges to new analyst users:
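That snippet is collapsed in this rendering; following the pattern used for analyst1 and analyst2 above, it would be (analyst3 is just an illustrative name):

-- Create a new user and grant them the data_analyst role's privileges
CREATE USER analyst3;
GRANT data_analyst TO analyst3;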
Success!
When we’re done with our experiments, we can tell docker to shut down our sandbox container and delete the volume storing our sandbox’s database via this command.
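The command itself is collapsed in this rendering; given the -v note below, it would have been:

# Stop and remove the container and network, and delete the named volume
docker compose down -v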
To shut down our container without deleting the volume, just leave off the -v flag.
[+] Running 3/3
✔ Container 015_docker_postgres_sandbox-postgis-1 Removed 0.3s
✔ Volume 015_docker_postgres_sandbox_sandbox_postgis_data Removed 0.0s
✔ Network 015_docker_postgres_sandbox_default Removed 0.2s
1. Before you can use docker and compose, you have to install docker and compose. I installed the docker engine and compose, but it looks like the developers at Docker Inc. guide people towards installing their Docker Desktop client. In either setup, docker should become available to you via the command line, so the instructions in this post should work.↩︎
2. Docker containers are designed to be replicable instances of an image: when you shut down your application, your containers are removed, and new ones are created the next time you start it up. This is great for reproducibility (you always get a new, clean instance based on your image), but you don’t want the data you’ve ingested into your database to get wiped every time you shut down your system, so you can define a persistent volume that lives on in the host system.↩︎