The Calculation of Electric Power Flow

Author: Ruizhi Yu

Power flow is fundamental in electric power system analysis. Given electric load and generation, we want to calculate the bus voltage and the power distribution[1].

In this example, we illustrate how to use Solverz to symbolically model the power flow and then perform the power flow analysis.

Model

The power flow models are equations describing the injection power of electric buses, with formulae

\[\begin{split}\left\{ \begin{aligned} &p_h=v_h\sum_{k}v_k\qty(g_{hk}\cos\theta_{hk}+b_{hk}\sin\theta_{hk}),\quad i\in\mathbb{B}_\text{pv, pq}\\ &q_h=v_h\sum_{k}v_k\qty(g_{hk}\sin\theta_{hk}-b_{hk}\cos\theta_{hk}),\quad i\in\mathbb{B}_\text{pv}\\ \end{aligned} \right.\end{split}\]

where \(p_h\) and \(q_h\) are respectively the active and reactive injection power of bus \(h\); \(v_h\) is the voltage magnitude of bus \(h\); \(\theta_{hk}=\theta_h-\theta_k\) is the voltage angle difference between bus \(h\) and \(k\); \(g_{hk}\) and \(b_{hk}\) are the \((h, k)\)-th entry of the conductance and susceptance matrices; \(\mathbb{B}\) is the set of bus indices with the subscripts being the bus types.

The buses in electric power systems are typically sparsely connected, and hence the Jacobian of power flow models are always sparse. In what follows, we will set the sparse flag to be True.

Implementation in Solverz

Power flow modelling

We use the case30 from the matpower library. The required data for verification can be found in case file directory of the source repo.

We first perform the symbolic modelling of the case30 power flow.

import matplotlib.pyplot as plt
from scipy.io import loadmat
import os
import numpy as np

from Solverz import Var, Eqn, Model, sin, cos, Param, module_printer

current_dir = os.getcwd()
file_path = os.path.join(current_dir, "test_pf_jac", "pf.mat")
sys = loadmat(file_path)
file_path = os.path.join(current_dir, "test_pf_jac", "pq.mat")
PQ = loadmat(file_path)
# %% model
V = sys["V"].reshape((-1,))
nb = V.shape[0]
Ybus = sys["Ybus"]
G = Ybus.real.toarray()
B = Ybus.imag.toarray()
ref = (sys["ref"] - 1).reshape((-1,)).tolist()
pv = (sys["pv"] - 1).reshape((-1,)).tolist()
pq = (sys["pq"] - 1).reshape((-1,)).tolist()
npv = len(pv)
npq = len(pq)
mbase = 100  # MVA

m = Model()
m.Va = Var("Va", np.angle(V)[pv + pq])
m.Vm = Var("Vm", np.abs(V)[pq])
m.Pg = Param("Pg", PQ["Pg"].reshape(-1, ) / mbase)
m.Qg = Param("Qg", PQ["Qg"].reshape(-1, ) / mbase)
m.Pd = Param("Pd", PQ["Pd"].reshape(-1, ) / mbase)
m.Qd = Param("Qd", PQ["Qd"].reshape(-1, ) / mbase)


def get_Vm(idx):
    if idx in ref + pv:
        return np.abs(V)[idx]
    elif idx in pq:
        return m.Vm[pq.index(idx)]


def get_Va(idx):
    if idx in ref:
        return np.angle(V)[idx]
    elif idx in pv + pq:
        return m.Va[(pv + pq).index(idx)]


for i in pv + pq:
    expr = 0
    Vmi = get_Vm(i)
    Vai = get_Va(i)
    for j in range(nb):
        Vmj = get_Vm(j)
        Vaj = get_Va(j)
        expr += Vmi * Vmj * (G[i, j] * cos(Vai - Vaj) + B[i, j] * sin(Vai - Vaj))
    m.__dict__[f"P_eqn_{i}"] = Eqn(f"P_eqn_{i}", expr + m.Pd[i] - m.Pg[i])

for i in pq:
    expr = 0
    Vmi = get_Vm(i)
    Vai = get_Va(i)
    for j in range(nb):
        Vmj = get_Vm(j)
        Vaj = get_Va(j)
        expr += Vmi * Vmj * (G[i, j] * sin(Vai - Vaj) - B[i, j] * cos(Vai - Vaj))
    m.__dict__[f"Q_eqn_{i}"] = Eqn(f"Q_eqn_{i}", expr + m.Qd[i] - m.Qg[i])
# %% create instance
spf, y0 = m.create_instance()
pyprinter = module_printer(spf, y0, "powerflow", make_hvp=True, jit=True)
pyprinter_njit = module_printer(spf, y0, "powerflow_njit", make_hvp=True)
pyprinter.render()
pyprinter_njit.render()

We use the module_printer to generate two independent python modules powerflow and powerflow_njit with the jit flag being True and False respectively.

After printing the modules, we can just import these modules and call the F and J functions to evaluate the power flow model and its Jacobian.

from powerflow import mdl as pf, y as y0
F0 = pf.F(y0, pf.p)
J0 = pf.J(y0, pf.p)

Jit acceleration

The above power models have the \(\sum\) symbols, which bring about burdensome for-loops. As a potent way to eliminate the for-loops and accelerate calculations, we ought to use the llvm-based Numba package to fully take advantage of the SMID in CPUs. It should be noted that Solverz can print numerical codes compatible for Numba integration by setting jit=True.

We show the computation overhead between two jit settings using the following figures.

omega

omega

On a laptop equipped with Ryzen 5800H CPU, it took hundreds of seconds to compile the module powerflow. However, the post-compiled F and J function evaluations were one magnitude faster than those without jit-compilation.

The compiled results are cached locally, so that only one compilation is required for each model. We recommend that one debug one’s models without jit and compile the models in efficiency-demanding cases.

Matrix-form power flow with Mat_Mul

Since version 0.8.0, Solverz supports a compact matrix-vector formulation of power flow using Mat_Mul. Instead of element-wise for-loops, we express the power injection equations in rectangular coordinates (\(e\), \(f\)) using matrix-vector products:

\[\begin{split}\left\{ \begin{aligned} &\mathbf{p}=\mathbf{e}\odot(\mathbf{G}\mathbf{e}-\mathbf{B}\mathbf{f})+\mathbf{f}\odot(\mathbf{B}\mathbf{e}+\mathbf{G}\mathbf{f})\\ &\mathbf{q}=\mathbf{f}\odot(\mathbf{G}\mathbf{e}-\mathbf{B}\mathbf{f})-\mathbf{e}\odot(\mathbf{B}\mathbf{e}+\mathbf{G}\mathbf{f}) \end{aligned} \right.\end{split}\]

where \(\mathbf{G}\) and \(\mathbf{B}\) are the conductance and susceptance matrices (sparse), and \(\odot\) denotes element-wise multiplication.

Solverz automatically computes the symbolic Jacobian via matrix calculus. For example, the Jacobian of the active power equation w.r.t. \(\mathbf{e}\) is:

\[\frac{\partial\mathbf{p}}{\partial\mathbf{e}}=\operatorname{diag}(\mathbf{G}\mathbf{e}-\mathbf{B}\mathbf{f})+\operatorname{diag}(\mathbf{e})\mathbf{G}+\operatorname{diag}(\mathbf{f})\mathbf{B}\]

The implementation for the case30 system:

"""
Power flow modelling using Mat_Mul (rectangular coordinates).

Instead of element-wise for-loops, this formulation uses Mat_Mul to express
the power injection equations in compact matrix-vector form:

    P = e * (G@e - B@f) + f * (B@e + G@f)
    Q = f * (G@e - B@f) - e * (B@e + G@f)

where G and B are the conductance and susceptance matrices (sparse),
and e, f are the real and imaginary parts of the bus voltage.

Solverz automatically computes the symbolic Jacobian via matrix calculus.
"""
from scipy.io import loadmat
from scipy.sparse import csc_array
import os
import numpy as np

from Solverz import Var, Eqn, Model, Param, Mat_Mul, made_numerical, nr_method

current_dir = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_dir, "test_pf_jac", "pf.mat")
sys_data = loadmat(file_path)
file_path = os.path.join(current_dir, "test_pf_jac", "pq.mat")
PQ = loadmat(file_path)

# %% parse system data
V = sys_data["V"].reshape((-1,))
nb = V.shape[0]
Ybus = sys_data["Ybus"].tocsc()
G_full = Ybus.real
B_full = Ybus.imag
ref = (sys_data["ref"] - 1).reshape((-1,)).tolist()
pv = (sys_data["pv"] - 1).reshape((-1,)).tolist()
pq = (sys_data["pq"] - 1).reshape((-1,)).tolist()
non_ref = pv + pq
mbase = 100

# Initial voltage in rectangular coordinates
e0 = V.real
f0 = V.imag
Pg = PQ["Pg"].reshape(-1) / mbase
Qg = PQ["Qg"].reshape(-1) / mbase
Pd = PQ["Pd"].reshape(-1) / mbase
Qd = PQ["Qd"].reshape(-1) / mbase
Pinj = Pg - Pd
Qinj = Qg - Qd

# %% Submatrices for non-reference buses
#
# P equations:  P_i = e_i * sum_j(G_ij*e_j - B_ij*f_j)
#                   + f_i * sum_j(B_ij*e_j + G_ij*f_j)
#
# In matrix form (only non-ref buses):
#   P = e * (G_nr@e - B_nr@f + p_ref) + f * (B_nr@e + G_nr@f + q_ref)
#
# where p_ref, q_ref absorb the reference bus contributions.

n_nr = len(non_ref)
n_pq = len(pq)
n_pv = len(pv)

G_nr = csc_array(G_full[np.ix_(non_ref, non_ref)])
B_nr = csc_array(B_full[np.ix_(non_ref, non_ref)])

# Reference bus contribution (constant)
e_ref = e0[ref[0]]
f_ref = f0[ref[0]]
G_ref_col = G_full[non_ref, ref[0]].toarray().ravel()
B_ref_col = B_full[non_ref, ref[0]].toarray().ravel()
p_ref = G_ref_col * e_ref - B_ref_col * f_ref
q_ref = B_ref_col * e_ref + G_ref_col * f_ref

# %% Q equations use a separate submatrix (PQ rows only)
pq_in_nr = [non_ref.index(i) for i in pq]

G_pq = csc_array(G_full[np.ix_(pq, non_ref)])
B_pq = csc_array(B_full[np.ix_(pq, non_ref)])

G_pq_ref_col = G_full[pq, ref[0]].toarray().ravel()
B_pq_ref_col = B_full[pq, ref[0]].toarray().ravel()
p_ref_pq = G_pq_ref_col * e_ref - B_pq_ref_col * f_ref
q_ref_pq = B_pq_ref_col * e_ref + G_pq_ref_col * f_ref

# %% Build model with Mat_Mul
m = Model()

# Variables: e and f at non-reference buses (flat start)
m.e = Var("e", np.ones(n_nr))
m.f = Var("f", np.zeros(n_nr))

# Parameters
m.G_nr = Param("G_nr", G_nr, dim=2, sparse=True)
m.B_nr = Param("B_nr", B_nr, dim=2, sparse=True)
m.G_pq = Param("G_pq", G_pq, dim=2, sparse=True)
m.B_pq = Param("B_pq", B_pq, dim=2, sparse=True)
m.p_ref = Param("p_ref", p_ref)
m.q_ref = Param("q_ref", q_ref)
m.p_ref_pq = Param("p_ref_pq", p_ref_pq)
m.q_ref_pq = Param("q_ref_pq", q_ref_pq)
m.Pinj = Param("Pinj", Pinj[non_ref])
m.Qinj = Param("Qinj", Qinj[pq])

# Active power balance (all non-ref buses): n_nr equations
m.P_eqn = Eqn("P_balance",
               m.e * (Mat_Mul(m.G_nr, m.e) - Mat_Mul(m.B_nr, m.f) + m.p_ref)
               + m.f * (Mat_Mul(m.B_nr, m.e) + Mat_Mul(m.G_nr, m.f) + m.q_ref)
               - m.Pinj)

# Reactive power balance (PQ buses only): n_pq equations
# Uses G_pq/B_pq submatrices (rows = pq buses, cols = non-ref buses)
e_pq = m.e[pq_in_nr[0]:pq_in_nr[-1] + 1]
f_pq = m.f[pq_in_nr[0]:pq_in_nr[-1] + 1]
m.Q_eqn = Eqn("Q_balance",
               f_pq * (Mat_Mul(m.G_pq, m.e) - Mat_Mul(m.B_pq, m.f) + m.p_ref_pq)
               - e_pq * (Mat_Mul(m.B_pq, m.e) + Mat_Mul(m.G_pq, m.f) + m.q_ref_pq)
               - m.Qinj)

# Voltage magnitude constraints (PV buses): n_pv equations
pv_in_nr = [non_ref.index(i) for i in pv]
Vm_pv_sq = np.abs(V[pv]) ** 2
m.Vm_sq = Param("Vm_sq", Vm_pv_sq)
e_pv = m.e[pv_in_nr[0]:pv_in_nr[-1] + 1]
f_pv = m.f[pv_in_nr[0]:pv_in_nr[-1] + 1]
m.V_eqn = Eqn("V_pv", e_pv ** 2 + f_pv ** 2 - m.Vm_sq)

# %% Solve
spf, y0 = m.create_instance()
mdl = made_numerical(spf, y0, sparse=True)
sol = nr_method(mdl, y0)

# Reconstruct full voltage
e_sol = np.zeros(nb)
f_sol = np.zeros(nb)
e_sol[ref] = e_ref
f_sol[ref] = f_ref
e_sol[non_ref] = sol.y["e"]
f_sol[non_ref] = sol.y["f"]
V_sol = e_sol + 1j * f_sol

print(f"Converged in {sol.stats.nstep} iterations")
print(f"|V| = {np.abs(V_sol[non_ref])}")
print(f"Va  = {np.angle(V_sol[non_ref])}")

Starting from a flat start (\(e=1\), \(f=0\)), the Newton-Raphson method converges in 4 iterations.

Note

The Mat_Mul formulation is much more compact than the for-loop approach — the entire power flow model is defined in just a few lines. The Jacobian is computed automatically by the matrix calculus engine, eliminating the need for manual derivation.

Compact polar power flow with LoopEqn

The for-loop polar form emits one scalar Eqn per bus, so a model with nb buses generates nb inner_F kernels and hundreds of per-non-zero Jacobian kernels. Since version 0.9.0, the same polar equations can be written as a small number of LoopEqn blocks over a flat, full-bus Vm / Va state, keeping the readable polar formulation while collapsing the code-generation explosion.

The idea is to iterate the active-power balance over the pv+pq buses and the reactive-power balance over the pq buses with two LoopEqns, and to pin the ref+pv voltage magnitudes and ref voltage angles to their setpoints with two further LoopEqns. The inner \(\sum_k\) over all buses reads the sparse conductance / susceptance rows directly, so each LoopEqn compiles to a single vector kernel:

"""Polar power flow modelled with LoopEqn instead of the per-bus
scalar expansion used in ``pf_mdl.py``.

The LoopEqn variant keeps a flat ``Vm_full`` / ``Va_full`` over every
bus and pins the ref / pv portions with two *additional* LoopEqn
blocks (``Vm_pin`` over ref+pv, ``Va_pin`` over ref). Every
sub-function in the generated module is therefore either a LoopEqn
kernel or the F_/J_ wrapper — there are no per-scalar
``inner_F<N>``'s emitted for the pin equations.

This module is used by ``bench_pf_loopeqn_vs_legacy.py`` to measure the
cold-cache module_printer compile-time delta.
"""
import os

import numpy as np
from scipy.io import loadmat
from scipy.sparse import csc_array

from Solverz import Var, Model, sin, cos, Param, module_printer
from Solverz import Idx, LoopEqn, Sum, Set


def build_loopeqn_pf_model(datadir=None):
    """Build the LoopEqn polar power flow model for case30.

    Parameters
    ----------
    datadir : str or None
        Directory containing ``pf.mat`` and ``pq.mat``. Defaults to the
        ``test_pf_jac`` folder next to this file.

    Returns
    -------
    m : Model
        Fully assembled Solverz ``Model`` with flat ``Vm_full`` /
        ``Va_full`` Vars, two ``LoopEqn``s (``P_eqn`` / ``Q_eqn``), and
        scalar pin Eqns for the ref / pv buses.
    """
    if datadir is None:
        datadir = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                               'test_pf_jac')
    param = loadmat(os.path.join(datadir, 'pf.mat'))
    PQ = loadmat(os.path.join(datadir, 'pq.mat'))

    V = param['V'].reshape(-1)
    nb = V.shape[0]
    Ybus = param['Ybus']
    ref = (param['ref'] - 1).reshape(-1).tolist()
    pv = (param['pv'] - 1).reshape(-1).tolist()
    pq = (param['pq'] - 1).reshape(-1).tolist()

    mbase = 100  # MVA
    Pg = PQ['Pg'].reshape(-1) / mbase
    Qg = PQ['Qg'].reshape(-1) / mbase
    Pd = PQ['Pd'].reshape(-1) / mbase
    Qd = PQ['Qd'].reshape(-1) / mbase

    Vm_init = np.abs(V)
    Va_init = np.angle(V)

    pv_pq_arr = np.array(pv + pq, dtype=int)  # free buses for P balance
    pq_arr = np.array(pq, dtype=int)          # free buses for Q balance
    ref_pv_arr = np.array(ref + pv, dtype=int)
    ref_arr = np.array(ref, dtype=int)

    npvpq = len(pv_pq_arr)
    npq = len(pq_arr)

    m = Model()
    m.Vm_full = Var('Vm_full', Vm_init)
    m.Va_full = Var('Va_full', Va_init)

    # Sparse admittance matrices. The indirect-outer CSR walker
    # ``Gbus[pv_pq_idx[i], j]`` reads row ``pv_pq_idx[i]`` from the
    # CSR skeleton at each outer iteration — identical effort to the
    # direct-outer walker but over the subset of buses the LoopEqn
    # iterates. Both F-side emission and J-side sparsity analysis
    # honour the row map.
    m.Gbus = Param('Gbus', csc_array(Ybus.real), dim=2, sparse=True)
    m.Bbus = Param('Bbus', csc_array(Ybus.imag), dim=2, sparse=True)

    m.Pg = Param('Pg', Pg)
    m.Qg = Param('Qg', Qg)
    m.Pd = Param('Pd', Pd)
    m.Qd = Param('Qd', Qd)

    # Four index sets: the full bus set (``j`` sums), the free-bus
    # subsets (``pv_pq`` for P balance, ``pq`` for Q balance), and
    # the pinned-bus subsets (``ref_pv`` for Vm pins, ``ref`` for Va
    # pins). ``Set`` replaces the ``Param + Idx + m.map[i]`` dance
    # with a single named object — ``Set.idx(...)`` produces a
    # bounded sympy.Idx whose uses in the body are transparently
    # gathered via the set's auxiliary Param.
    m.Bus = Set('Bus', nb)
    m.PVPQ = Set('PVPQ', pv_pq_arr)
    m.PQ = Set('PQ', pq_arr)
    m.RefPV = Set('RefPV', ref_pv_arr)
    m.Ref = Set('Ref', ref_arr)

    m.Vm_pinned = Param('Vm_pinned', Vm_init[ref_pv_arr])
    m.Va_pinned = Param('Va_pinned', Va_init[ref_arr])

    i_p = m.PVPQ.idx('i_p')
    i_q = m.PQ.idx('i_q')
    j = m.Bus.idx('j')

    # Split into separate Sums, one per sparse walker (Case A only —
    # ``_collect_sparse_walkers`` rejects multi-walker Sums).
    body_P = (
        m.Vm_full[i_p] * Sum(
            m.Vm_full[j] * m.Gbus[i_p, j]
            * cos(m.Va_full[i_p] - m.Va_full[j]),
            j,
        )
        + m.Vm_full[i_p] * Sum(
            m.Vm_full[j] * m.Bbus[i_p, j]
            * sin(m.Va_full[i_p] - m.Va_full[j]),
            j,
        )
        + m.Pd[i_p] - m.Pg[i_p]
    )
    m.P_eqn = LoopEqn('P_eqn', outer_index=i_p, body=body_P, model=m)

    body_Q = (
        m.Vm_full[i_q] * Sum(
            m.Vm_full[j] * m.Gbus[i_q, j]
            * sin(m.Va_full[i_q] - m.Va_full[j]),
            j,
        )
        - m.Vm_full[i_q] * Sum(
            m.Vm_full[j] * m.Bbus[i_q, j]
            * cos(m.Va_full[i_q] - m.Va_full[j]),
            j,
        )
        + m.Qd[i_q] - m.Qg[i_q]
    )
    m.Q_eqn = LoopEqn('Q_eqn', outer_index=i_q, body=body_Q, model=m)

    # Pin ref+pv buses' Vm and ref buses' Va to their known values
    # via two tiny LoopEqns over the ``RefPV`` / ``Ref`` sets. The
    # body uses the set-tagged ``Idx`` directly; the rewriter
    # inserts the gather to the underlying bus index automatically.
    i_vp = m.RefPV.idx('i_vp')
    i_vr = m.Ref.idx('i_vr')

    m.Vm_pin = LoopEqn(
        'Vm_pin', outer_index=i_vp,
        body=m.Vm_full[i_vp] - m.Vm_pinned[i_vp],
        model=m,
    )
    m.Va_pin = LoopEqn(
        'Va_pin', outer_index=i_vr,
        body=m.Va_full[i_vr] - m.Va_pinned[i_vr],
        model=m,
    )

    return m


if __name__ == '__main__':
    m = build_loopeqn_pf_model()
    spf, y0 = m.create_instance()
    pyprinter = module_printer(spf, y0, 'powerflow_loopeqn',
                               jit=True)
    pyprinter.render()

The four index sets are declared with the Set primitive: Bus for the inner sum, PVPQ / PQ for the free buses, and RefPV / Ref for the pinned buses. Set.idx(...) produces a bounded index whose uses in the body are gathered to the underlying bus index automatically, so m.Vm_full[i_p] and m.Gbus[i_p, j] resolve to the correct rows without a manual index map.

The LoopEqn form is algebraically identical to the per-bus scalar form, so the two converge to the same solution. The cross-check in src/verify_loopeqn_pf.py confirms the complex bus voltages agree to better than 1e-6 relative error. Its main advantage is compile time: like Mat_Mul, it emits a handful of vector kernels instead of O(nb) scalar ones, so module_printer cold-compile scales with the number of equation blocks rather than the number of buses. The same LoopEqn polar formulation is now the default in SolUtil.PowerFlow.

Performance comparison: Mat_Mul vs. for-loop

The two formulations above model the same physical system (case30) but pay very different costs at different phases of the workflow. The benchmark script is in src/bench_pf_matmul_vs_polar.py and can be re-run on any hardware:

cd docs/source/ae/pf/src
python bench_pf_matmul_vs_polar.py

It times seven phases end-to-end in a single run, with the module cold-compile phase executed in a fresh subprocess (and with __pycache__ / Numba .nbi/.nbc caches wiped beforehand) so the “first-time user” compile cost is measured honestly.

Note

Terminology used in this section

  • SpMVSparse Matrix-Vector multiplication, computing \(y = M\,x\) with \(M\) in a sparse format.

  • SpMMSparse Matrix-Matrix multiplication, computing \(C = M\,B\) with \(B\) multi-column.

  • CSC / CSRCompressed Sparse Column / Compressed Sparse Row, the two scipy.sparse storage layouts Solverz consumes.

  • @njit — Numba’s just-in-time compilation decorator. @njit(cache=True) saves the compiled code to disk and reuses it across processes.

  • scatter-add — an assembly pattern where many indexed reads are accumulated (+=) into a fixed output buffer; the mutable-matrix Jacobian kernels are pure scatter-add loops over precomputed index arrays.

  • fancy indexing — NumPy / scipy.sparse indexing with arrays of indices, e.g. M[[r0, r1, r2], [c0, c1, c2]].

  • fast path — Solverz’s code-gen path used for Mat_Mul(A, x) where A is a plain sparse dim=2 Param: the matvec runs inside @njit inner_F via the SolCF.csc_matvec Numba helper, with zero scipy.sparse calls per evaluation. Mutable-matrix Jacobian blocks have a parallel fast path of typed scatter-add kernels.

  • fallback path — the slower-but-correct path used when a Mat_Mul placeholder or a Jacobian block doesn’t fit the fast-path shape (negated matrices, nested Mat_Mul, dense dim=2, matrix expressions). The wrapper falls back to scipy.sparse @ and (for Jacobian blocks) fancy indexing into a freshly-built sparse matrix. Both paths co-exist; the runtime picks per placeholder / per block based on what the symbolic classifier recognised at code-gen time.

  • lambdify — SymPy’s symbolic-to-callable converter. Solverz’s “inline” mode uses it to turn each equation’s symbolic expression into a Python function on every call (no JIT).

  • hot F / hot J — steady-state per-call wall-clock time for F_(y, p) / J_(y, p) after warm-up. Distinguished from cold compile (the first call into a freshly-rendered module, which pays the Numba @njit compile cost for every decorated helper).

  • LICMLoop-Invariant Code Motion, an LLVM optimisation pass that hoists computations out of inner loops when their inputs do not change inside the loop.

The same terms (with longer definitions) appear in the Solverz Matrix-Vector Calculus glossary.

Benchmark environment

All numbers below were measured under:

  • Hardware: 2025 MacBook Air, Apple M4, AC power, no thermal throttling observed during the runs.

  • OS: macOS 26.4 (build 25E246).

  • Python: 3.11.13.

  • Library versions: numpy==2.3.3, scipy==1.16.0, numba==0.65.0, sympy==1.13.3, Solverz==0.8.1 (the post-csc_matvec fast path) or later.

  • Methodology: for each per-call number, 10 warm-up calls (to bake the Numba caches and prime the CPU branch predictor) followed by 5000–20000 timed iterations, median of three repeats. The cold-compile measurement is run in a fresh Python subprocess with __pycache__ and Numba .nbi/.nbc caches wiped beforehand.

  • Reproduce: cd docs/source/ae/pf/src && python bench_pf_matmul_vs_polar.py — same script the numbers were captured from.

Numbers below are averaged across two consecutive runs after the 0.8.1 SolCF.csc_matvec hot-F fast path (see Mat_Mul hot-F breakdown below):

Phase

for-loop (polar)

Mat_Mul (rect.)

Mat_Mul wins by

1. Model() create_instance()

≈ 2.0 s

≈ 0.07 s

~28×

2. FormJac(y0)

≈ 0.05 s

≈ 0.006 s

~9×

3. Inline compile (made_numerical)

≈ 0.27 s

≈ 0.01 s

~27×

4. Inline hot F (per call)

≈ 165 µs

≈ 17 µs

~10×

4. Inline hot J (per call)

≈ 820 µs

≈ 285 µs

~3×

5. Module render (module_printer.render)

≈ 0.53 s

≈ 0.02 s

~33×

6. Module cold compile (import + Numba)

≈ 47 s

≈ 2.7 s

~17×

7. Module hot F (per call)

≈ 1.1 µs

≈ 3.2 µs

0.34× (loses)

7. Module hot J (per call)

≈ 59 µs

≈ 57 µs

~1.0×

Shapes: the polar form has 53 unknowns / 53 scalar Eqns (Va at PV+PQ buses, Vm at PQ buses); the Mat_Mul form has 58 unknowns / 3 vector Eqns (e, f at non-ref buses, with P balance + Q balance + V² at PV). The comparison is not strictly equi-dimensional but close enough that the differences are driven by the formulation, not the unknown count.

Compile-time cost

Mat_Mul compiles ~17× faster on a cold import. This is the headline number and scales directly with the number of @njit functions the code generator emits:

  • for-loop (polar) — 1 dispatcher inner_F + 53 per-equation inner_F{i} + 1 dispatcher inner_J + 361 per-non-zero inner_J{k}. That’s ~416 Numba kernels to compile on the first run, each one a scalar trig expression. Each individual kernel is cheap to compile but the fixed overhead per kernel (LLVM instantiation, symbol table, cache write) adds up to ~47 seconds.

  • Mat_Mul (rectangular) — 1 dispatcher inner_F + 3 per-vector-equation inner_F{i} + 1 dispatcher inner_J + a handful of per-block inner_J{k} + 4 per-mutable-matrix-block _mut_block_N scatter-add kernels. Total: ~10 kernels. Cold compile is dominated by import + Numba startup (~2 s) rather than by per-kernel compilation.

Every earlier phase (model construction, FormJac, made_numerical, render) follows the same 20–40× scaling, because they all traverse the same explosion of scalar equations.

Runtime cost

Once the modules are compiled, the picture is more nuanced. Module hot F is the one place the for-loop path still wins — by ~2.9× after the 0.8.1 fast path, down from ~10× before. The regression and the fix are worth understanding:

Before 0.8.1 fast path — hot F was dispatch-bound at ~14 µs

Drilling into what a pre-fast-path Mat_Mul F_ call was spending its time on:

Step

Cost

% of total

8 scipy.sparse SpMVs in the wrapper (G_nr@e, B_nr@f, B_nr@e, G_nr@f, G_pq@e, B_pq@f, B_pq@e, G_pq@f)

~11.7 µs

83 %

@njit inner_F call (dispatch + 3 vector equations)

~1.6 µs

11 %

30 dict lookups on p_

~0.6 µs

4 %

Total F_(y, p) (0.8.0 and 0.8.1 baseline)

~14.1 µs

100 %

A single G_nr @ e scipy SpMV on case30 (29×29 with ~125 nnz) costs ~1.5 µs, of which virtually 100 % is Python→C dispatch overhead — the actual matvec arithmetic is well under 100 ns. Eight of those SpMVs added up to ~12 µs, and that was what dominated hot F.

0.8.1 fast path — matvec moves into inner_F via SolCF.csc_matvec

The 0.8.1 perf commit rewrites the precompute emission:

  • Fast path — when the Mat_Mul matrix operand is a plain sparse dim=2 Param (the common case), the matvec is emitted inside inner_F via the existing SolCF.csc_matvec Numba helper (Solverz/num_api/custom_function.py). No new helper function, no new setting entries — the CSC decomposition (<name>_data / _indices / _indptr / _shape0) that model/basic.py already emits for every sparse dim=2 Param is plugged straight into SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand) inside inner_F. The F_ wrapper no longer touches scipy at all for these placeholders.

  • Fallback path — when the matrix operand is not a plain sparse Para (negated matrix -G, nested Mat_Mul, dense dim=2 matrix, or a matrix expression), the wrapper still emits the scipy SpMV. These placeholders pay the full dispatch cost per call. Dense dim=2 params fire a one-shot UserWarning at FormJac time pointing users at the fallback cost.

Impact on the same case30 breakdown:

Step

Cost

% of total

8 SolCF.csc_matvec calls inside @njit inner_F

~2.7 µs

~85 %

Sub-function dispatch + 3 vector-equation bodies

~0.4 µs

~12 %

14 dict lookups on p_ (non-matrix params)

~0.1 µs

~3 %

Total F_(y, p) (0.8.1 fast path)

~3.2 µs

100 %

The 4.4× speedup (14.1 → 3.2 µs) comes entirely from eliminating the scipy dispatch barrier: SolCF.csc_matvec is a @njit(cache=True) Numba helper and inner_F calls it as an intra-Numba function call, which LLVM typically inlines into the caller’s body.

Remaining gap to polar — and the performance regression

Even after the fast path, case30 hot F is still ≈ 2.9× slower than the for-loop form (3.2 µs vs 1.1 µs). The gap is structural:

  • polar — 53 scalar kernels, fully inlined by Numba into a single inner_F function body. One Python→Numba boundary crossing, zero sub-function calls at runtime.

  • Mat_Mul — 8 SolCF.csc_matvec calls inside inner_F + dispatch to 3 sub-functions (inner_F0, inner_F1, inner_F2) + the inner_F dispatcher itself. Numba can inline some but not all of these layers, so you pay ~100 ns per sub-function call on top of the matvec work.

The 2 µs gap is usually invisible next to the J call (~55 µs) and the linear solve on anything non-trivial, but if your workload is millions of pure F evaluations on a small network and you don’t rebuild the module, the for-loop form still wins on hot F.

  • Module hot J is ~1.0× — a tie after the fast path. (The Mat_Mul J uses the vectorised scatter-add in the Matrix-Vector Calculus chapter and is independent of the hot F change.)

  • Inline hot F/J — without Numba, the for-loop form is slower across the board (~10× on F, ~3× on J) because lambdify has to walk 53 large scalar expressions + 361 Jacobian sub-expressions on every call. Mat_Mul’s 3 vector equations + scipy SpMVs finish in a fraction of the time.

Which formulation should I use?

Use Mat_Mul when any of the following hold:

  1. You iterate on the model. Compile-time savings are paid on every rebuild. A 47 s vs 2.7 s cold compile is the difference between a usable and an unusable interactive workflow.

  2. The network is larger than a toy example. For networks with more than ~60 unknowns the SpMV dispatch cost is amortised over more arithmetic and Mat_Mul’s hot F catches up to or passes the for-loop form. At case118+ scale the 2.9× regression on case30 is already gone.

  3. Your workload includes at least one Jacobian call per F call. J dominates per-J_(y, p) call cost (~55 µs on case30 for both paths), so a 2 µs hot-F regression is invisible relative to one full F+J pair. This matches every power-flow workload (Newton-Raphson nr_method, SICNM) and the implicit-step inner loop of DHS quasi-dynamic energy-flow integrators — anything that needs a Jacobian per step.

  4. You want compact, paper-faithful equations. Mat_Mul(G, e) replaces nb scalar Eqns and the matrix-calculus engine derives the Jacobian automatically.

Consider the for-loop form when all of these hold:

  1. The network is very small (≲ 30 unknowns).

  2. Your hot loop is dominated by pure F evaluations (no Jacobian in the loop). This is rare — Newton methods always call J; the only workloads that don’t are fixed-point iteration or residual-only solvers.

  3. The module is compiled once and then reused for millions of calls, so the 20–40× compile-time savings of Mat_Mul are not recovered.

This combination is narrow: in practice, ~every power-flow or DHS use case sees Mat_Mul as the better choice.

Known performance regression on case30-scale networks

On case30 (58 unknowns) the Mat_Mul hot F is ≈ 2.9× slower than the for-loop form after the 0.8.1 fast path (≈ 3.2 µs vs ≈ 1.1 µs). The remaining gap is the structural cost of 8 SolCF.csc_matvec calls + 3 sub-function dispatches + the inner_F dispatcher, versus the for-loop form’s single inlined @njit body of 53 scalar trig kernels.

The gap will shrink further if any of these optimisations land:

  • SpMV fusion — 8 SpMVs on the same 4 matrices could be combined into 4 SpMMs (G_nr @ [e|f], B_nr @ [e|f], etc.), halving the call overhead.

  • CSR format — marginally faster matvec on small matrices (~3 % in our micro-benchmark); larger impact on networks big enough that cache behaviour starts mattering.

  • Inlining the matvec loops directly into inner_F — avoids the SolCF.csc_matvec helper call entirely and lets Numba LICM the outer loop structure across the 8 matvecs.

None of these are in 0.8.1. Users who are hitting this regression should either (a) rebuild with the for-loop formulation for that specific workload, or (b) wait for the fusion optimisation.

Matrix shapes that fall out of the fast path

Every Mat_Mul(A, v) placeholder is classified at code-gen time. A placeholder goes on the fast path iff A is a plain sparse dim=2 Para. These shapes land on the slower fallback path (scipy SpMV in the wrapper, ~1.5 µs per call) and should be avoided in hot loops:

  • Mat_Mul(-A, v) — negated matrix. Workaround: write -Mat_Mul(A, v) instead; the sign flows through the outer expression and A stays on the fast path.

  • Mat_Mul(A, Mat_Mul(B, v)) — only the inner call is fast-path-compatible (its matrix is B, a plain Para); the outer call’s “matrix” is a placeholder _sz_mm_N, not a Para, so it falls back.

  • Mat_Mul(A + B, v) — matrix expression, not a single Para. Workaround: precompute the combined matrix as a single Param("AB", A + B, dim=2, sparse=True) outside the equation.

  • Mat_Mul(A, v) where A is Param(A, dim=2, sparse=False) — dense 2-D parameter. This also fires a one-shot UserWarning at FormJac time. Workaround: wrap the matrix in csc_array(...) and declare sparse=True.

Note

The cold-compile cost for the for-loop form (~47 s on M4) matches the “hundreds of seconds” figure quoted for the older Ryzen 5800H laptop earlier in this chapter. The absolute number is sensitive to CPU single-thread performance, but the ratio between the two formulations (≈17×) is driven almost entirely by the number of @njit kernels the code generator emits, which is a property of the formulation, not the hardware.

Ill-conditioned Power flow

The Newton method sometimes fails because it is not robust enough. We view this cases as having ill-conditioned initial settings. In this cases, we can use some more robust methods, such as the semi-implicit continuous Newton method (SICNM)[2] provided by Solverz. Shown below is an illustrative example of ill-conditioned power flow. The Newton failed while the SICNM easily converged.

import os

import matplotlib.pyplot as plt
import numpy as np
from scipy.io import loadmat

from powerflow import mdl as pf, y as y0
from Solverz import nr_method, sicnm, Opt

current_dir = os.getcwd()
file_path = os.path.join(current_dir, "test_pf_jac", "ill-Va.mat")
mat = loadmat(file_path)

pv = (mat['pv'] - 1).reshape(-1).tolist()
pq = (mat['pq'] - 1).reshape(-1).tolist()
Va0 = np.deg2rad(mat['Va0'])

for i in range(len(pv + pq)):
    y0['Va'][i:i + 1] = Va0[(pv + pq)[i]]

# Failed, cannot converge within 100 iterations
sol = nr_method(pf, y0, Opt(ite_tol=1e-5))
# Succeeded
sol1 = sicnm(pf, y0, Opt(scheme='rodas3d', rtol=1e-1, atol=1e-1, hinit=0.1, ite_tol=1e-5))


Vm_bench = np.array([0.98313825, 0.98009295, 0.98240617, 0.97318397, 0.9673554,
                     0.96062366, 0.98050609, 0.98440428, 0.98050609, 0.9854683,
                     0.97667682, 0.98022901, 0.97739562, 0.97686538, 0.96844031,
                     0.96528702, 0.96916633, 0.99338328, 0.9885663, 0.99021484,
                     0.97219415, 0.97471485, 0.9795967, 0.96788288])
# %% visualize error
plt.title(r'$\Delta |V_\text{m}|$')
plt.plot(np.abs(sol1.y['Vm']-Vm_bench))
plt.show()

omega

By the way, our implementation of SICNM for MATPOWER can be found here