Toulouse, 1 June 2022
psycopg is a PostgreSQL database adapter for the Python programming language. As such, one of its main goals is to automatically adapt PostgreSQL and Python types to each other when issuing queries and retrieving results. While this conversion works fine for built-in types, psycopg also exposes a native way to handle domain-specific data types: row factories.
These domain models are typically defined by application developers in the form of data classes in Python, mapped to one or more relations (usually tables) in PostgreSQL. This is often where ORMs come into play by providing an abstract layer to write queries from model classes and map results to model instances. In psycopg 3, row factories provide a simple, yet powerful, way to handle domain-specific models without any extra dependency. In addition, we’ll see how this lightweight feature can be leveraged to write safer applications, through static typing in particular.
What’s a row factory?
Let’s assume we are working with weather data, collecting daily reports for cities into a weather table:
CREATE TABLE weather (
    city    varchar(80),
    temp_lo int,
    temp_hi int,
    prcp    real,
    date    date
);
When querying this table with psycopg, we get rows as tuple values, each column having the correct Python type matching the PostgreSQL one:
>>> with psycopg.connect() as conn:
...     print(conn.execute("SELECT * FROM weather").fetchone())
('San Francisco', 46, 50, 0.25, datetime.date(1994, 11, 27))
The simplest way to get something else is to use built-in row factories, such as dict_row, which produces a dictionary mapping result set column names to values:
>>> with psycopg.connect() as conn, conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
...     print(cur.execute("SELECT * FROM weather").fetchone())
{'city': 'San Francisco', 'temp_lo': 46, 'temp_hi': 50, 'prcp': 0.25, 'date': datetime.date(1994, 11, 27)}
The row factory is passed upon cursor initialization, and the row obtained from the fetchone() call is indeed a dictionary.
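Other built-in factories are available in the psycopg.rows module. For instance, namedtuple_row produces named tuples whose fields follow the result set column names; here is a quick sketch (the exact repr of the returned named tuple may differ slightly):

>>> from psycopg.rows import namedtuple_row
>>> with psycopg.connect() as conn, conn.cursor(row_factory=namedtuple_row) as cur:
...     print(cur.execute("SELECT * FROM weather").fetchone())
Row(city='San Francisco', temp_lo=46, temp_hi=50, prcp=0.25, date=datetime.date(1994, 11, 27))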
Now more interestingly, let’s assume we have the following domain-specific model in our Python application:
from dataclasses import dataclass
from datetime import date


@dataclass
class Weather:
    city: str
    temperature: tuple[int, int]
    precipitation: float
    date: date
The goal would be to get instances of this Weather class when executing queries such as SELECT * FROM weather. That’s where custom row factories come into play.
A row factory is typically a function that processes raw Python data (adapted by psycopg from SQL values) into some final type, here an instance of the Weather class above. We can write such a function as:
def weather_from_row(city, temp_lo, temp_hi, prcp, date):
    return Weather(
        city=city, temperature=(temp_lo, temp_hi), precipitation=prcp, date=date
    )
However, this is not enough, because psycopg lacks the information about which columns of the SQL result set to use with this function. That’s why a row factory additionally makes use of the cursor being used for the current query. Accordingly, this function must be wrapped into a factory as:
def weather_row_factory(cursor):
    # Extract result set column names.
    columns = [column.name for column in cursor.description]

    def make_row(values):
        # Map column names to values.
        row = dict(zip(columns, values))
        return weather_from_row(**row)

    return make_row
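Assuming the weather table above, this factory can already be passed to a cursor; a quick usage sketch (output consistent with the example shown later):

>>> with psycopg.connect() as conn:
...     with conn.cursor(row_factory=weather_row_factory) as cur:
...         print(cur.execute("SELECT * FROM weather").fetchone())
Weather(city='San Francisco', temperature=(46, 50), precipitation=0.25, date=datetime.date(1994, 11, 27))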
Next, we’d arguably get a better design if these functions were class methods of our Weather class above; let’s do it:
@dataclass
class Weather:
    city: str
    ...

    @classmethod
    def from_row(cls, *, city, temp_lo, temp_hi, prcp, date):
        return cls(
            city=city, temperature=(temp_lo, temp_hi), precipitation=prcp, date=date
        )

    @classmethod
    def row_factory(cls, cursor):
        columns = [column.name for column in cursor.description]

        def make_row(values):
            row = dict(zip(columns, values))
            return cls.from_row(**row)

        return make_row
And now that all is in place, let’s see how it works:
>>> with psycopg.connect() as conn:
...     with conn.cursor(row_factory=Weather.row_factory) as cur:
...         cur.execute("SELECT * FROM weather")
...         row = cur.fetchone()
>>> print(row)
Weather(city='San Francisco', temperature=(46, 50), precipitation=0.25, date=datetime.date(1994, 11, 27))
As you can see, the row variable above is a Weather instance, so we can manipulate it as such, e.g. by accessing attributes:
>>> import statistics
>>> statistics.mean(row.temperature)
48
How is this safer?
In the context of databases, or I/O in general, the code working at the I/O boundary, where data is transformed from one type system to another, is typically quite “unsafe” and requires special care, such as validation when decoding or encoding.
By using row factories when issuing queries from psycopg, as shown in the previous section, we perform the transformation from “raw” Python types to domain-specific models early, thus reducing the time spent at this I/O boundary. Typically, one would take great care to write correct SQL queries and their corresponding row factories, test them carefully, and the rest of the application code would work nicely without being concerned about data coming from the database. This is typically where an ORM also helps.
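For example, a minimal sketch of such care could be a sanity check at the I/O boundary; the rule enforced below (the low temperature should not exceed the high one) is an illustrative assumption, not part of the original model:

@dataclass
class Weather:
    city: str
    ...

    @classmethod
    def from_row(cls, *, city, temp_lo, temp_hi, prcp, date):
        # Illustrative validation when decoding database rows
        # (assumed rule: temp_lo should not exceed temp_hi).
        if temp_lo > temp_hi:
            raise ValueError(f"inconsistent temperatures for {city}: {temp_lo} > {temp_hi}")
        return cls(
            city=city, temperature=(temp_lo, temp_hi), precipitation=prcp, date=date
        )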
Another great benefit comes when combining row factories with Python static typing. Static typing, with type checkers such as mypy, provides additional safety guarantees without impacting runtime performance. However, leveraging this at the I/O boundary might be harder because this typically involves “weak” typing (i.e. str or dict types, as opposed to, e.g., Weather, considered a “strong” type). In this context, the row factory feature of psycopg helps a lot because it brings strong typing into the I/O boundary. As such, one would typically use model types with strong validation capabilities (for instance, libraries such as pydantic in the Python ecosystem).
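For illustration only (this is not part of the original example), such a validating model could be sketched with pydantic as follows; the WeatherReport name and the accompanying factory are assumptions made for the sketch:

from datetime import date

from pydantic import BaseModel


class WeatherReport(BaseModel):
    city: str
    temperature: tuple[int, int]
    precipitation: float
    date: date


def weather_report_factory(cursor):
    columns = [column.name for column in cursor.description]

    def make_row(values):
        row = dict(zip(columns, values))
        # pydantic validates (and coerces) the raw values here.
        return WeatherReport(
            city=row["city"],
            temperature=(row["temp_lo"], row["temp_hi"]),
            precipitation=row["prcp"],
            date=row["date"],
        )

    return make_row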
So let’s get back to our previous example to add type annotations:
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Sequence

from psycopg import Cursor


@dataclass
class Weather:
    city: str
    ...

    @classmethod
    def from_row(cls, *, city: str, temp_lo: int, temp_hi: int, prcp: float, date: date) -> Weather:
        return cls(
            city=city, temperature=(temp_lo, temp_hi), precipitation=prcp, date=date
        )

    @classmethod
    def row_factory(cls, cursor: Cursor[Any]) -> Callable[[Sequence[Any]], Weather]:
        columns = [column.name for column in cursor.description]

        def make_row(values: Sequence[Any]) -> Weather:
            row = dict(zip(columns, values))
            return cls.from_row(**row)

        return make_row
With these annotations, the following example type-checks fine:
def get_weather_reports(conn: Connection[Any]) -> list[Weather]:
    with conn.cursor(row_factory=Weather.row_factory) as cur:
        cur.execute("SELECT * FROM weather")
        return cur.fetchall()
By passing a typed row factory to the cursor being used to fetch queries, the value obtained from this cursor’s fetchall() call is inferred to have the corresponding row type. Namely, here, the cursor has a row factory returning rows of type Weather, so fetchall() returns a list[Weather] value, which matches what get_weather_reports() declares (and would be verified by a type checker such as mypy).
This is safer because data comes from the database in the form of strong types (e.g. the Weather type), not a tuple (or dict) of anonymous values (int, date, etc.).
This is also safer because, as we pass those strongly typed values around, we avoid typical runtime errors (such as IndexError for tuple or KeyError for dict) that a type checker cannot verify; on the other hand, a typical AttributeError occurring with strong types can be checked.
For instance, the following IndexError is impossible to catch until runtime:
>>> with conn.cursor() as cur:
...     rows = cur.execute("SELECT * FROM weather").fetchall()  # type is list[tuple[Any, ...]]
>>> rows[0][5]
Traceback (most recent call last):
  ...
IndexError: tuple index out of range
whereas the following AttributeError gets caught by a type checker (and can thus be avoided at runtime):
>>> with conn.cursor(row_factory=Weather.row_factory) as cur:
...     rows = cur.execute("SELECT * FROM weather").fetchall()  # type is list[Weather]
>>> rows[0].prcp
Traceback (most recent call last):
  ...
AttributeError: 'Weather' object has no attribute 'prcp'
Finally, we get safer by using row factories, a built-in feature of psycopg, and we also gain expressiveness by avoiding clumsy dict (or similar) handling when querying the database.
How does this work, under the hood?
Now, the last part of the previous section, demonstrating that types defined in the row factory propagate through Cursor methods, may seem a little magical. Let’s take back the last example and add a few reveal_type() calls (with results shown as # note: comments):
def get_weather_reports(conn: Connection[Any]) -> list[Weather]:
    with conn.cursor(row_factory=Weather.row_factory) as cur:
        reveal_type(cur)
        # note: Revealed type is "psycopg.cursor.Cursor[Weather]"
        cur.execute("SELECT * FROM weather")
        rset = cur.fetchall()
        reveal_type(rset)
        # note: Revealed type is "builtins.list[Weather]"
        return rset
We can see that the cur: Cursor[Weather] value is parametrized on the “row” type Weather, as declared in Weather.row_factory(cursor: Cursor[Any]) -> Callable[[Sequence[Any]], Weather] (the important bit is the Weather type declared as the return value of the callable returned by the row factory). Likewise, rset: list[Weather] is indeed inferred as a list of Weather objects.
To understand how it works, let’s first look at the definition of Connection in psycopg (simplified):
Row = TypeVar("Row", covariant=True)
RowFactory = Callable[[Sequence[Any]], Row]  # simplified


class Connection:
    @overload
    def cursor(self) -> Cursor[Tuple[Any, ...]]:
        ...

    @overload
    def cursor(self, *, row_factory: RowFactory[Row]) -> Cursor[Row]:
        ...

    def cursor(self, *, row_factory: Optional[RowFactory[Any]] = None) -> Cursor[Any]:
        # implementation here
Method Connection.cursor() is defined as an overload depending on the value of the row_factory parameter and returning a Cursor object with a specific type variable Row. In other words, that Row type variable is bound from the row factory parameter to the returned Cursor value. The same Row type variable is then used to define a generic Cursor class, which in turn allows fetch*() methods to return Row values (still simplified):
class Cursor(Generic[Row]):
    def fetchone(self) -> Optional[Row]:
        ...

    def fetchall(self) -> List[Row]:
        ...
So if we go back to our last example:
>>> cur = conn.cursor(row_factory=Weather.row_factory)
>>> cur.execute("SELECT * FROM weather")
>>> rows = cur.fetchall()
- cur is a Cursor[Weather] because Weather.row_factory has type RowFactory[Weather], and,
- rows is a list of Weather because cur is a Cursor[Weather].
That’s all for this overview of row factories in psycopg. Next, check out the psycopg documentation, especially the page about static typing (which explains how Cursor is generic on Row), the main row factories page, as well as the rows API module, which provides nice row factory generator helpers.
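For instance, the kwargs_row helper from psycopg.rows builds a row factory that calls the given function with column names as keyword arguments; as a closing sketch, it could replace our hand-written Weather.row_factory along these lines:

>>> from psycopg.rows import kwargs_row
>>> with psycopg.connect() as conn:
...     with conn.cursor(row_factory=kwargs_row(Weather.from_row)) as cur:
...         print(cur.execute("SELECT * FROM weather").fetchone())
Weather(city='San Francisco', temperature=(46, 50), precipitation=0.25, date=datetime.date(1994, 11, 27))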