Database Migrations In A Pluggable Module System Using A Graph Algorithm
By Sudheer S
In this blog post, I will explain how I implemented a graph algorithm to solve the database migration problem in an application pluggable module system.
Prerequisites:
- Working knowledge of Python
- Working knowledge of Graph Theory. Familiar with the terms: Edge, vertex, path, sink, source, digraph, path graph, etc.
Gavika Web Framework has a pluggable module system. The modules can be developed independently. They can be installed, upgraded and removed from the main application. Gavika Web Framework is written using Python, Flask, SQLAlchemy and a bunch of other related technologies and libraries.
The main application uses Alembic to manage database migrations.
Each module has its own database schema that resides alongside of the main application’s database schema. Typically, modules have their own tables. As a module evolves it has needs to add or remove tables or columns. There is a migration system in place to handle such database schema changes on a module level. The framework uses more or less the same file system structure and conventions as Alembic.
migrations/
├── __init__.py
└── versions
├── 1_.py
├── 2_.py
├── 3_.py
└── __init__.py
This is the directory structure inside a module. The files 1_.py
, 2_.py
and 3_.py
contain the actual migrations.
Contents of 1_.py
:
revision = "1"
down_revision = None
def upgrade():
...
Contents of 2_.py
revision = "2"
down_revision = "1"
def upgrade():
...
Every revision file contains revision
and down_revision
at minimum. The revision file 2_.py
indicates that the
revision is “2” and it comes after “1”. Notice that the revision is a string. The number is used in the string for
convenience. The revision could be something like aasdeeeedd
or some kind of UUID or hash that do not follow a
sequence. A sequence of numbers is not suitable for software that uses different branches. For the purpose of brevity,
we demonstrate a simple use case. Therefore, the revisions are 1
, 2
and 3
. In addition to the revision
information, the revision file contains the upgrade
method. This method contains the actual code to modify the schema.
Typically, they are calls to op.create_table
, op.alter_table
, etc.
Implementation Of The Migration Algorithm
Networkx
library is used to implement the graph algorithm.
- Scan the
versions
directory for revision files. Create a digraph using nx.DiGraph() method. - Use
importlib.machinery.SourceFileLoader
to load the Python source code of the revision file. - From the loaded source file, read
down_revision
. Usedown_revision
andrevision
to form an edge in the graph. Add the edge to theedges
list. - Add the edges to the graph.
- Find the sink of the graph.
- Find the shortest path from last revision mark to sink.
- Iterate the path and call the
upgrade
method
module_revisions_path = (
f"{{module_name}/migrations/versions/"
)
if os.path.exists(module_revisions_path):
G = nx.DiGraph()
edges = []
revision_modules = {}
for revision_file in os.listdir(module_revisions_path):
if revision_file in ["__init__.py", "__pycache__"]:
continue
loader = importlib.machinery.SourceFileLoader(
"revision",
f"{module_name}/migrations/versions/{revision_file}",
)
python_module = types.ModuleType(loader.name)
loader.exec_module(python_module)
down_revision = python_module.down_revision
if down_revision is None:
down_revision = "START"
edges.append((down_revision, python_module.revision))
revision_modules[python_module.revision] = python_module
G.add_edges_from(edges)
sink = list(topological_sort(G))[-1:][0]
if not module.last_revision:
last_revision = "START"
else:
last_revision = module.last_revision
revision_path = shortest_path(G, last_revision, sink)
revision_path = revision_path[
1:
] # The first node is current/source node, omit it
for revision_path_item in revision_path:
revision_python_module = revision_modules[revision_path_item]
revision_python_module.upgrade()
module.last_revision = revision_python_module.revision
db.session.commit()
Further optimization
Mathematically, the revisions can be thought of as a directed path graph. The shortest path from source to sink becomes
obvious in the directed path graph. There is only one way to traverse the path. There is no need to compute shortest path.
All we need is a path. The closest solution I found in NetworkX
was the shortest_path
method. I do not know whether
there is a better alternative to compute the path of the path graph in NetworkX or otherwise. That is an open question
I would like to pursue at some point in the future.