Database Migrations In A Pluggable Module System Using A Graph Algorithm

By Sudheer S

February 3, 2021

In this blog post, I will explain how I implemented a graph algorithm to solve the database migration problem in an application pluggable module system.

Prerequisites:

Working knowledge of Python
Working knowledge of Graph Theory. Familiar with the terms: Edge, vertex, path, sink, source, digraph, path graph, etc.

Gavika Web Framework has a pluggable module system. The modules can be developed independently. They can be installed, upgraded and removed from the main application. Gavika Web Framework is written using Python, Flask, SQLAlchemy and a bunch of other related technologies and libraries.

The main application uses Alembic to manage database migrations.

Each module has its own database schema that resides alongside of the main application’s database schema. Typically, modules have their own tables. As a module evolves it has needs to add or remove tables or columns. There is a migration system in place to handle such database schema changes on a module level. The framework uses more or less the same file system structure and conventions as Alembic.

migrations/
├── __init__.py
└── versions
    ├── 1_.py
    ├── 2_.py
    ├── 3_.py
    └── __init__.py

This is the directory structure inside a module. The files 1_.py, 2_.py and 3_.py contain the actual migrations.

Contents of 1_.py:

revision = "1"
down_revision = None

def upgrade():
  ...

Contents of 2_.py

revision = "2"
down_revision = "1"
def upgrade():
  ...

Every revision file contains revision and down_revision at minimum. The revision file 2_.py indicates that the revision is “2” and it comes after “1”. Notice that the revision is a string. The number is used in the string for convenience. The revision could be something like aasdeeeedd or some kind of UUID or hash that do not follow a sequence. A sequence of numbers is not suitable for software that uses different branches. For the purpose of brevity, we demonstrate a simple use case. Therefore, the revisions are 1, 2 and 3. In addition to the revision information, the revision file contains the upgrade method. This method contains the actual code to modify the schema. Typically, they are calls to op.create_table, op.alter_table, etc.

Implementation Of The Migration Algorithm

Networkx library is used to implement the graph algorithm.

Scan the versions directory for revision files. Create a digraph using nx.DiGraph() method.
Use importlib.machinery.SourceFileLoader to load the Python source code of the revision file.
From the loaded source file, read down_revision. Use down_revision and revision to form an edge in the graph. Add the edge to the edges list.
Add the edges to the graph.
Find the sink of the graph.
Find the shortest path from last revision mark to sink.
Iterate the path and call the upgrade method

    module_revisions_path = (
            f"{{module_name}/migrations/versions/"
        )
        if os.path.exists(module_revisions_path):
            G = nx.DiGraph()
            edges = []
            revision_modules = {}
            for revision_file in os.listdir(module_revisions_path):
                if revision_file in ["__init__.py", "__pycache__"]:
                    continue
                loader = importlib.machinery.SourceFileLoader(
                    "revision",
                    f"{module_name}/migrations/versions/{revision_file}",
                )
                python_module = types.ModuleType(loader.name)
                loader.exec_module(python_module)
                down_revision = python_module.down_revision
                if down_revision is None:
                    down_revision = "START"
                edges.append((down_revision, python_module.revision))
                revision_modules[python_module.revision] = python_module
            G.add_edges_from(edges)
            sink = list(topological_sort(G))[-1:][0]
            if not module.last_revision:
                last_revision = "START"
            else:
                last_revision = module.last_revision
            revision_path = shortest_path(G, last_revision, sink)
            revision_path = revision_path[
                1:
            ]  # The first node is current/source node, omit it
            for revision_path_item in revision_path:
                revision_python_module = revision_modules[revision_path_item]
                revision_python_module.upgrade()
                module.last_revision = revision_python_module.revision
        db.session.commit()

Further optimization

Mathematically, the revisions can be thought of as a directed path graph. The shortest path from source to sink becomes obvious in the directed path graph. There is only one way to traverse the path. There is no need to compute shortest path. All we need is a path. The closest solution I found in NetworkX was the shortest_path method. I do not know whether there is a better alternative to compute the path of the path graph in NetworkX or otherwise. That is an open question I would like to pursue at some point in the future.