The Calculation of Electric Power Flow¶
Author: Ruizhi Yu
Power flow is fundamental in electric power system analysis. Given electric load and generation, we want to calculate the bus voltage and the power distribution[1].
In this example, we illustrate how to use Solverz to symbolically model the power flow and then perform the power flow analysis.
Model¶
The power flow models are equations describing the injection power of electric buses, with formulae
where \(p_h\) and \(q_h\) are respectively the active and reactive injection power of bus \(h\); \(v_h\) is the voltage magnitude of bus \(h\); \(\theta_{hk}=\theta_h-\theta_k\) is the voltage angle difference between bus \(h\) and \(k\); \(g_{hk}\) and \(b_{hk}\) are the \((h, k)\)-th entry of the conductance and susceptance matrices; \(\mathbb{B}\) is the set of bus indices with the subscripts being the bus types.
The buses in electric power systems are typically sparsely connected, and hence the Jacobian of power flow models are always sparse. In what follows, we will set the sparse flag to be True.
Implementation in Solverz¶
Power flow modelling¶
We use the case30 from the matpower library. The required data for verification can be found in case file directory of the source repo.
We first perform the symbolic modelling of the case30 power flow.
import matplotlib.pyplot as plt
from scipy.io import loadmat
import os
import numpy as np
from Solverz import Var, Eqn, Model, sin, cos, Param, module_printer
current_dir = os.getcwd()
file_path = os.path.join(current_dir, "test_pf_jac", "pf.mat")
sys = loadmat(file_path)
file_path = os.path.join(current_dir, "test_pf_jac", "pq.mat")
PQ = loadmat(file_path)
# %% model
V = sys["V"].reshape((-1,))
nb = V.shape[0]
Ybus = sys["Ybus"]
G = Ybus.real.toarray()
B = Ybus.imag.toarray()
ref = (sys["ref"] - 1).reshape((-1,)).tolist()
pv = (sys["pv"] - 1).reshape((-1,)).tolist()
pq = (sys["pq"] - 1).reshape((-1,)).tolist()
npv = len(pv)
npq = len(pq)
mbase = 100 # MVA
m = Model()
m.Va = Var("Va", np.angle(V)[pv + pq])
m.Vm = Var("Vm", np.abs(V)[pq])
m.Pg = Param("Pg", PQ["Pg"].reshape(-1, ) / mbase)
m.Qg = Param("Qg", PQ["Qg"].reshape(-1, ) / mbase)
m.Pd = Param("Pd", PQ["Pd"].reshape(-1, ) / mbase)
m.Qd = Param("Qd", PQ["Qd"].reshape(-1, ) / mbase)
def get_Vm(idx):
if idx in ref + pv:
return np.abs(V)[idx]
elif idx in pq:
return m.Vm[pq.index(idx)]
def get_Va(idx):
if idx in ref:
return np.angle(V)[idx]
elif idx in pv + pq:
return m.Va[(pv + pq).index(idx)]
for i in pv + pq:
expr = 0
Vmi = get_Vm(i)
Vai = get_Va(i)
for j in range(nb):
Vmj = get_Vm(j)
Vaj = get_Va(j)
expr += Vmi * Vmj * (G[i, j] * cos(Vai - Vaj) + B[i, j] * sin(Vai - Vaj))
m.__dict__[f"P_eqn_{i}"] = Eqn(f"P_eqn_{i}", expr + m.Pd[i] - m.Pg[i])
for i in pq:
expr = 0
Vmi = get_Vm(i)
Vai = get_Va(i)
for j in range(nb):
Vmj = get_Vm(j)
Vaj = get_Va(j)
expr += Vmi * Vmj * (G[i, j] * sin(Vai - Vaj) - B[i, j] * cos(Vai - Vaj))
m.__dict__[f"Q_eqn_{i}"] = Eqn(f"Q_eqn_{i}", expr + m.Qd[i] - m.Qg[i])
# %% create instance
spf, y0 = m.create_instance()
pyprinter = module_printer(spf, y0, "powerflow", make_hvp=True, jit=True)
pyprinter_njit = module_printer(spf, y0, "powerflow_njit", make_hvp=True)
pyprinter.render()
pyprinter_njit.render()
We use the module_printer to generate two independent python modules powerflow and powerflow_njit with the jit flag being True and False respectively.
After printing the modules, we can just import these modules and call the F and J functions to evaluate the power flow model and its Jacobian.
from powerflow import mdl as pf, y as y0
F0 = pf.F(y0, pf.p)
J0 = pf.J(y0, pf.p)
Jit acceleration¶
The above power models have the \(\sum\) symbols, which bring about burdensome for-loops. As a potent way to eliminate the for-loops and accelerate calculations, we ought to use the llvm-based Numba package to fully take advantage of the SMID in CPUs. It should be noted that Solverz can print numerical codes compatible for Numba integration by setting jit=True.
We show the computation overhead between two jit settings using the following figures.


On a laptop equipped with Ryzen 5800H CPU, it took hundreds of seconds to compile the module powerflow. However, the post-compiled F and J function evaluations were one magnitude faster than those without jit-compilation.
The compiled results are cached locally, so that only one compilation is required for each model. We recommend that one debug one’s models without jit and compile the models in efficiency-demanding cases.
Matrix-form power flow with Mat_Mul¶
Since version 0.8.0, Solverz supports a compact matrix-vector formulation of power flow using Mat_Mul. Instead of element-wise for-loops, we express the power injection equations in rectangular coordinates (\(e\), \(f\)) using matrix-vector products:
where \(\mathbf{G}\) and \(\mathbf{B}\) are the conductance and susceptance matrices (sparse), and \(\odot\) denotes element-wise multiplication.
Solverz automatically computes the symbolic Jacobian via matrix calculus. For example, the Jacobian of the active power equation w.r.t. \(\mathbf{e}\) is:
The implementation for the case30 system:
"""
Power flow modelling using Mat_Mul (rectangular coordinates).
Instead of element-wise for-loops, this formulation uses Mat_Mul to express
the power injection equations in compact matrix-vector form:
P = e * (G@e - B@f) + f * (B@e + G@f)
Q = f * (G@e - B@f) - e * (B@e + G@f)
where G and B are the conductance and susceptance matrices (sparse),
and e, f are the real and imaginary parts of the bus voltage.
Solverz automatically computes the symbolic Jacobian via matrix calculus.
"""
from scipy.io import loadmat
from scipy.sparse import csc_array
import os
import numpy as np
from Solverz import Var, Eqn, Model, Param, Mat_Mul, made_numerical, nr_method
current_dir = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_dir, "test_pf_jac", "pf.mat")
sys_data = loadmat(file_path)
file_path = os.path.join(current_dir, "test_pf_jac", "pq.mat")
PQ = loadmat(file_path)
# %% parse system data
V = sys_data["V"].reshape((-1,))
nb = V.shape[0]
Ybus = sys_data["Ybus"].tocsc()
G_full = Ybus.real
B_full = Ybus.imag
ref = (sys_data["ref"] - 1).reshape((-1,)).tolist()
pv = (sys_data["pv"] - 1).reshape((-1,)).tolist()
pq = (sys_data["pq"] - 1).reshape((-1,)).tolist()
non_ref = pv + pq
mbase = 100
# Initial voltage in rectangular coordinates
e0 = V.real
f0 = V.imag
Pg = PQ["Pg"].reshape(-1) / mbase
Qg = PQ["Qg"].reshape(-1) / mbase
Pd = PQ["Pd"].reshape(-1) / mbase
Qd = PQ["Qd"].reshape(-1) / mbase
Pinj = Pg - Pd
Qinj = Qg - Qd
# %% Submatrices for non-reference buses
#
# P equations: P_i = e_i * sum_j(G_ij*e_j - B_ij*f_j)
# + f_i * sum_j(B_ij*e_j + G_ij*f_j)
#
# In matrix form (only non-ref buses):
# P = e * (G_nr@e - B_nr@f + p_ref) + f * (B_nr@e + G_nr@f + q_ref)
#
# where p_ref, q_ref absorb the reference bus contributions.
n_nr = len(non_ref)
n_pq = len(pq)
n_pv = len(pv)
G_nr = csc_array(G_full[np.ix_(non_ref, non_ref)])
B_nr = csc_array(B_full[np.ix_(non_ref, non_ref)])
# Reference bus contribution (constant)
e_ref = e0[ref[0]]
f_ref = f0[ref[0]]
G_ref_col = G_full[non_ref, ref[0]].toarray().ravel()
B_ref_col = B_full[non_ref, ref[0]].toarray().ravel()
p_ref = G_ref_col * e_ref - B_ref_col * f_ref
q_ref = B_ref_col * e_ref + G_ref_col * f_ref
# %% Q equations use a separate submatrix (PQ rows only)
pq_in_nr = [non_ref.index(i) for i in pq]
G_pq = csc_array(G_full[np.ix_(pq, non_ref)])
B_pq = csc_array(B_full[np.ix_(pq, non_ref)])
G_pq_ref_col = G_full[pq, ref[0]].toarray().ravel()
B_pq_ref_col = B_full[pq, ref[0]].toarray().ravel()
p_ref_pq = G_pq_ref_col * e_ref - B_pq_ref_col * f_ref
q_ref_pq = B_pq_ref_col * e_ref + G_pq_ref_col * f_ref
# %% Build model with Mat_Mul
m = Model()
# Variables: e and f at non-reference buses (flat start)
m.e = Var("e", np.ones(n_nr))
m.f = Var("f", np.zeros(n_nr))
# Parameters
m.G_nr = Param("G_nr", G_nr, dim=2, sparse=True)
m.B_nr = Param("B_nr", B_nr, dim=2, sparse=True)
m.G_pq = Param("G_pq", G_pq, dim=2, sparse=True)
m.B_pq = Param("B_pq", B_pq, dim=2, sparse=True)
m.p_ref = Param("p_ref", p_ref)
m.q_ref = Param("q_ref", q_ref)
m.p_ref_pq = Param("p_ref_pq", p_ref_pq)
m.q_ref_pq = Param("q_ref_pq", q_ref_pq)
m.Pinj = Param("Pinj", Pinj[non_ref])
m.Qinj = Param("Qinj", Qinj[pq])
# Active power balance (all non-ref buses): n_nr equations
m.P_eqn = Eqn("P_balance",
m.e * (Mat_Mul(m.G_nr, m.e) - Mat_Mul(m.B_nr, m.f) + m.p_ref)
+ m.f * (Mat_Mul(m.B_nr, m.e) + Mat_Mul(m.G_nr, m.f) + m.q_ref)
- m.Pinj)
# Reactive power balance (PQ buses only): n_pq equations
# Uses G_pq/B_pq submatrices (rows = pq buses, cols = non-ref buses)
e_pq = m.e[pq_in_nr[0]:pq_in_nr[-1] + 1]
f_pq = m.f[pq_in_nr[0]:pq_in_nr[-1] + 1]
m.Q_eqn = Eqn("Q_balance",
f_pq * (Mat_Mul(m.G_pq, m.e) - Mat_Mul(m.B_pq, m.f) + m.p_ref_pq)
- e_pq * (Mat_Mul(m.B_pq, m.e) + Mat_Mul(m.G_pq, m.f) + m.q_ref_pq)
- m.Qinj)
# Voltage magnitude constraints (PV buses): n_pv equations
pv_in_nr = [non_ref.index(i) for i in pv]
Vm_pv_sq = np.abs(V[pv]) ** 2
m.Vm_sq = Param("Vm_sq", Vm_pv_sq)
e_pv = m.e[pv_in_nr[0]:pv_in_nr[-1] + 1]
f_pv = m.f[pv_in_nr[0]:pv_in_nr[-1] + 1]
m.V_eqn = Eqn("V_pv", e_pv ** 2 + f_pv ** 2 - m.Vm_sq)
# %% Solve
spf, y0 = m.create_instance()
mdl = made_numerical(spf, y0, sparse=True)
sol = nr_method(mdl, y0)
# Reconstruct full voltage
e_sol = np.zeros(nb)
f_sol = np.zeros(nb)
e_sol[ref] = e_ref
f_sol[ref] = f_ref
e_sol[non_ref] = sol.y["e"]
f_sol[non_ref] = sol.y["f"]
V_sol = e_sol + 1j * f_sol
print(f"Converged in {sol.stats.nstep} iterations")
print(f"|V| = {np.abs(V_sol[non_ref])}")
print(f"Va = {np.angle(V_sol[non_ref])}")
Starting from a flat start (\(e=1\), \(f=0\)), the Newton-Raphson method converges in 4 iterations.
Note
The Mat_Mul formulation is much more compact than the for-loop approach — the entire power flow model is defined in just a few lines. The Jacobian is computed automatically by the matrix calculus engine, eliminating the need for manual derivation.
Compact polar power flow with LoopEqn¶
The for-loop polar form emits one scalar Eqn per bus, so a model with nb buses generates nb inner_F kernels and hundreds of per-non-zero Jacobian kernels. Since version 0.9.0, the same polar equations can be written as a small number of LoopEqn blocks over a flat, full-bus Vm / Va state, keeping the readable polar formulation while collapsing the code-generation explosion.
The idea is to iterate the active-power balance over the pv+pq buses and the reactive-power balance over the pq buses with two LoopEqns, and to pin the ref+pv voltage magnitudes and ref voltage angles to their setpoints with two further LoopEqns. The inner \(\sum_k\) over all buses reads the sparse conductance / susceptance rows directly, so each LoopEqn compiles to a single vector kernel:
"""Polar power flow modelled with LoopEqn instead of the per-bus
scalar expansion used in ``pf_mdl.py``.
The LoopEqn variant keeps a flat ``Vm_full`` / ``Va_full`` over every
bus and pins the ref / pv portions with two *additional* LoopEqn
blocks (``Vm_pin`` over ref+pv, ``Va_pin`` over ref). Every
sub-function in the generated module is therefore either a LoopEqn
kernel or the F_/J_ wrapper — there are no per-scalar
``inner_F<N>``'s emitted for the pin equations.
This module is used by ``bench_pf_loopeqn_vs_legacy.py`` to measure the
cold-cache module_printer compile-time delta.
"""
import os
import numpy as np
from scipy.io import loadmat
from scipy.sparse import csc_array
from Solverz import Var, Model, sin, cos, Param, module_printer
from Solverz import Idx, LoopEqn, Sum, Set
def build_loopeqn_pf_model(datadir=None):
"""Build the LoopEqn polar power flow model for case30.
Parameters
----------
datadir : str or None
Directory containing ``pf.mat`` and ``pq.mat``. Defaults to the
``test_pf_jac`` folder next to this file.
Returns
-------
m : Model
Fully assembled Solverz ``Model`` with flat ``Vm_full`` /
``Va_full`` Vars, two ``LoopEqn``s (``P_eqn`` / ``Q_eqn``), and
scalar pin Eqns for the ref / pv buses.
"""
if datadir is None:
datadir = os.path.join(os.path.dirname(os.path.abspath(__file__)),
'test_pf_jac')
param = loadmat(os.path.join(datadir, 'pf.mat'))
PQ = loadmat(os.path.join(datadir, 'pq.mat'))
V = param['V'].reshape(-1)
nb = V.shape[0]
Ybus = param['Ybus']
ref = (param['ref'] - 1).reshape(-1).tolist()
pv = (param['pv'] - 1).reshape(-1).tolist()
pq = (param['pq'] - 1).reshape(-1).tolist()
mbase = 100 # MVA
Pg = PQ['Pg'].reshape(-1) / mbase
Qg = PQ['Qg'].reshape(-1) / mbase
Pd = PQ['Pd'].reshape(-1) / mbase
Qd = PQ['Qd'].reshape(-1) / mbase
Vm_init = np.abs(V)
Va_init = np.angle(V)
pv_pq_arr = np.array(pv + pq, dtype=int) # free buses for P balance
pq_arr = np.array(pq, dtype=int) # free buses for Q balance
ref_pv_arr = np.array(ref + pv, dtype=int)
ref_arr = np.array(ref, dtype=int)
npvpq = len(pv_pq_arr)
npq = len(pq_arr)
m = Model()
m.Vm_full = Var('Vm_full', Vm_init)
m.Va_full = Var('Va_full', Va_init)
# Sparse admittance matrices. The indirect-outer CSR walker
# ``Gbus[pv_pq_idx[i], j]`` reads row ``pv_pq_idx[i]`` from the
# CSR skeleton at each outer iteration — identical effort to the
# direct-outer walker but over the subset of buses the LoopEqn
# iterates. Both F-side emission and J-side sparsity analysis
# honour the row map.
m.Gbus = Param('Gbus', csc_array(Ybus.real), dim=2, sparse=True)
m.Bbus = Param('Bbus', csc_array(Ybus.imag), dim=2, sparse=True)
m.Pg = Param('Pg', Pg)
m.Qg = Param('Qg', Qg)
m.Pd = Param('Pd', Pd)
m.Qd = Param('Qd', Qd)
# Four index sets: the full bus set (``j`` sums), the free-bus
# subsets (``pv_pq`` for P balance, ``pq`` for Q balance), and
# the pinned-bus subsets (``ref_pv`` for Vm pins, ``ref`` for Va
# pins). ``Set`` replaces the ``Param + Idx + m.map[i]`` dance
# with a single named object — ``Set.idx(...)`` produces a
# bounded sympy.Idx whose uses in the body are transparently
# gathered via the set's auxiliary Param.
m.Bus = Set('Bus', nb)
m.PVPQ = Set('PVPQ', pv_pq_arr)
m.PQ = Set('PQ', pq_arr)
m.RefPV = Set('RefPV', ref_pv_arr)
m.Ref = Set('Ref', ref_arr)
m.Vm_pinned = Param('Vm_pinned', Vm_init[ref_pv_arr])
m.Va_pinned = Param('Va_pinned', Va_init[ref_arr])
i_p = m.PVPQ.idx('i_p')
i_q = m.PQ.idx('i_q')
j = m.Bus.idx('j')
# Split into separate Sums, one per sparse walker (Case A only —
# ``_collect_sparse_walkers`` rejects multi-walker Sums).
body_P = (
m.Vm_full[i_p] * Sum(
m.Vm_full[j] * m.Gbus[i_p, j]
* cos(m.Va_full[i_p] - m.Va_full[j]),
j,
)
+ m.Vm_full[i_p] * Sum(
m.Vm_full[j] * m.Bbus[i_p, j]
* sin(m.Va_full[i_p] - m.Va_full[j]),
j,
)
+ m.Pd[i_p] - m.Pg[i_p]
)
m.P_eqn = LoopEqn('P_eqn', outer_index=i_p, body=body_P, model=m)
body_Q = (
m.Vm_full[i_q] * Sum(
m.Vm_full[j] * m.Gbus[i_q, j]
* sin(m.Va_full[i_q] - m.Va_full[j]),
j,
)
- m.Vm_full[i_q] * Sum(
m.Vm_full[j] * m.Bbus[i_q, j]
* cos(m.Va_full[i_q] - m.Va_full[j]),
j,
)
+ m.Qd[i_q] - m.Qg[i_q]
)
m.Q_eqn = LoopEqn('Q_eqn', outer_index=i_q, body=body_Q, model=m)
# Pin ref+pv buses' Vm and ref buses' Va to their known values
# via two tiny LoopEqns over the ``RefPV`` / ``Ref`` sets. The
# body uses the set-tagged ``Idx`` directly; the rewriter
# inserts the gather to the underlying bus index automatically.
i_vp = m.RefPV.idx('i_vp')
i_vr = m.Ref.idx('i_vr')
m.Vm_pin = LoopEqn(
'Vm_pin', outer_index=i_vp,
body=m.Vm_full[i_vp] - m.Vm_pinned[i_vp],
model=m,
)
m.Va_pin = LoopEqn(
'Va_pin', outer_index=i_vr,
body=m.Va_full[i_vr] - m.Va_pinned[i_vr],
model=m,
)
return m
if __name__ == '__main__':
m = build_loopeqn_pf_model()
spf, y0 = m.create_instance()
pyprinter = module_printer(spf, y0, 'powerflow_loopeqn',
jit=True)
pyprinter.render()
The four index sets are declared with the Set primitive: Bus for the inner sum, PVPQ / PQ for the free buses, and RefPV / Ref for the pinned buses. Set.idx(...) produces a bounded index whose uses in the body are gathered to the underlying bus index automatically, so m.Vm_full[i_p] and m.Gbus[i_p, j] resolve to the correct rows without a manual index map.
The LoopEqn form is algebraically identical to the per-bus scalar form, so the two converge to the same solution. The cross-check in src/verify_loopeqn_pf.py confirms the complex bus voltages agree to better than 1e-6 relative error. Its main advantage is compile time: like Mat_Mul, it emits a handful of vector kernels instead of O(nb) scalar ones, so module_printer cold-compile scales with the number of equation blocks rather than the number of buses. The same LoopEqn polar formulation is now the default in SolUtil.PowerFlow.
Performance comparison: Mat_Mul vs. for-loop¶
The two formulations above model the same physical system (case30) but pay very different costs at different phases of the workflow. The benchmark script is in src/bench_pf_matmul_vs_polar.py and can be re-run on any hardware:
cd docs/source/ae/pf/src
python bench_pf_matmul_vs_polar.py
It times seven phases end-to-end in a single run, with the module cold-compile phase executed in a fresh subprocess (and with __pycache__ / Numba .nbi/.nbc caches wiped beforehand) so the “first-time user” compile cost is measured honestly.
Note
Terminology used in this section
SpMV — Sparse Matrix-Vector multiplication, computing \(y = M\,x\) with \(M\) in a sparse format.
SpMM — Sparse Matrix-Matrix multiplication, computing \(C = M\,B\) with \(B\) multi-column.
CSC / CSR — Compressed Sparse Column / Compressed Sparse Row, the two scipy.sparse storage layouts Solverz consumes.
@njit— Numba’s just-in-time compilation decorator.@njit(cache=True)saves the compiled code to disk and reuses it across processes.scatter-add — an assembly pattern where many indexed reads are accumulated (
+=) into a fixed output buffer; the mutable-matrix Jacobian kernels are pure scatter-add loops over precomputed index arrays.fancy indexing — NumPy / scipy.sparse indexing with arrays of indices, e.g.
M[[r0, r1, r2], [c0, c1, c2]].fast path — Solverz’s code-gen path used for
Mat_Mul(A, x)whereAis a plain sparsedim=2Param: the matvec runs inside@njit inner_Fvia theSolCF.csc_matvecNumba helper, with zero scipy.sparse calls per evaluation. Mutable-matrix Jacobian blocks have a parallel fast path of typed scatter-add kernels.fallback path — the slower-but-correct path used when a
Mat_Mulplaceholder or a Jacobian block doesn’t fit the fast-path shape (negated matrices, nestedMat_Mul, densedim=2, matrix expressions). The wrapper falls back toscipy.sparse @and (for Jacobian blocks) fancy indexing into a freshly-built sparse matrix. Both paths co-exist; the runtime picks per placeholder / per block based on what the symbolic classifier recognised at code-gen time.lambdify — SymPy’s symbolic-to-callable converter. Solverz’s “inline” mode uses it to turn each equation’s symbolic expression into a Python function on every call (no JIT).
hot F / hot J — steady-state per-call wall-clock time for
F_(y, p)/J_(y, p)after warm-up. Distinguished from cold compile (the first call into a freshly-rendered module, which pays the Numba@njitcompile cost for every decorated helper).LICM — Loop-Invariant Code Motion, an LLVM optimisation pass that hoists computations out of inner loops when their inputs do not change inside the loop.
The same terms (with longer definitions) appear in the Solverz Matrix-Vector Calculus glossary.
Benchmark environment¶
All numbers below were measured under:
Hardware: 2025 MacBook Air, Apple M4, AC power, no thermal throttling observed during the runs.
OS: macOS 26.4 (build 25E246).
Python: 3.11.13.
Library versions:
numpy==2.3.3,scipy==1.16.0,numba==0.65.0,sympy==1.13.3,Solverz==0.8.1(the post-csc_matvecfast path) or later.Methodology: for each per-call number, 10 warm-up calls (to bake the Numba caches and prime the CPU branch predictor) followed by 5000–20000 timed iterations, median of three repeats. The cold-compile measurement is run in a fresh Python subprocess with
__pycache__and Numba.nbi/.nbccaches wiped beforehand.Reproduce:
cd docs/source/ae/pf/src && python bench_pf_matmul_vs_polar.py— same script the numbers were captured from.
Numbers below are averaged across two consecutive runs after the 0.8.1 SolCF.csc_matvec hot-F fast path (see Mat_Mul hot-F breakdown below):
Phase |
for-loop (polar) |
Mat_Mul (rect.) |
Mat_Mul wins by |
|---|---|---|---|
1. |
≈ 2.0 s |
≈ 0.07 s |
~28× |
2. |
≈ 0.05 s |
≈ 0.006 s |
~9× |
3. Inline compile ( |
≈ 0.27 s |
≈ 0.01 s |
~27× |
4. Inline hot F (per call) |
≈ 165 µs |
≈ 17 µs |
~10× |
4. Inline hot J (per call) |
≈ 820 µs |
≈ 285 µs |
~3× |
5. Module render ( |
≈ 0.53 s |
≈ 0.02 s |
~33× |
6. Module cold compile (import + Numba) |
≈ 47 s |
≈ 2.7 s |
~17× |
7. Module hot F (per call) |
≈ 1.1 µs |
≈ 3.2 µs |
0.34× (loses) |
7. Module hot J (per call) |
≈ 59 µs |
≈ 57 µs |
~1.0× |
Shapes: the polar form has 53 unknowns / 53 scalar Eqns (Va at PV+PQ buses, Vm at PQ buses); the Mat_Mul form has 58 unknowns / 3 vector Eqns (e, f at non-ref buses, with P balance + Q balance + V² at PV). The comparison is not strictly equi-dimensional but close enough that the differences are driven by the formulation, not the unknown count.
Compile-time cost¶
Mat_Mul compiles ~17× faster on a cold import. This is the headline number and scales directly with the number of @njit functions the code generator emits:
for-loop (polar) — 1 dispatcher
inner_F+ 53 per-equationinner_F{i}+ 1 dispatcherinner_J+ 361 per-non-zeroinner_J{k}. That’s ~416 Numba kernels to compile on the first run, each one a scalar trig expression. Each individual kernel is cheap to compile but the fixed overhead per kernel (LLVM instantiation, symbol table, cache write) adds up to ~47 seconds.Mat_Mul(rectangular) — 1 dispatcherinner_F+ 3 per-vector-equationinner_F{i}+ 1 dispatcherinner_J+ a handful of per-blockinner_J{k}+ 4 per-mutable-matrix-block_mut_block_Nscatter-add kernels. Total: ~10 kernels. Cold compile is dominated by import + Numba startup (~2 s) rather than by per-kernel compilation.
Every earlier phase (model construction, FormJac, made_numerical, render) follows the same 20–40× scaling, because they all traverse the same explosion of scalar equations.
Runtime cost¶
Once the modules are compiled, the picture is more nuanced. Module hot F is the one place the for-loop path still wins — by ~2.9× after the 0.8.1 fast path, down from ~10× before. The regression and the fix are worth understanding:
Before 0.8.1 fast path — hot F was dispatch-bound at ~14 µs¶
Drilling into what a pre-fast-path Mat_Mul F_ call was spending its time on:
Step |
Cost |
% of total |
|---|---|---|
8 |
~11.7 µs |
83 % |
|
~1.6 µs |
11 % |
30 dict lookups on |
~0.6 µs |
4 % |
Total |
~14.1 µs |
100 % |
A single G_nr @ e scipy SpMV on case30 (29×29 with ~125 nnz) costs ~1.5 µs, of which virtually 100 % is Python→C dispatch overhead — the actual matvec arithmetic is well under 100 ns. Eight of those SpMVs added up to ~12 µs, and that was what dominated hot F.
0.8.1 fast path — matvec moves into inner_F via SolCF.csc_matvec¶
The 0.8.1 perf commit rewrites the precompute emission:
Fast path — when the
Mat_Mulmatrix operand is a plain sparsedim=2Param(the common case), the matvec is emitted insideinner_Fvia the existingSolCF.csc_matvecNumba helper (Solverz/num_api/custom_function.py). No new helper function, no newsettingentries — the CSC decomposition (<name>_data/_indices/_indptr/_shape0) thatmodel/basic.pyalready emits for every sparsedim=2Paramis plugged straight intoSolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand)insideinner_F. TheF_wrapper no longer touches scipy at all for these placeholders.Fallback path — when the matrix operand is not a plain sparse
Para(negated matrix-G, nestedMat_Mul, densedim=2matrix, or a matrix expression), the wrapper still emits the scipy SpMV. These placeholders pay the full dispatch cost per call. Densedim=2params fire a one-shotUserWarningatFormJactime pointing users at the fallback cost.
Impact on the same case30 breakdown:
Step |
Cost |
% of total |
|---|---|---|
8 |
~2.7 µs |
~85 % |
Sub-function dispatch + 3 vector-equation bodies |
~0.4 µs |
~12 % |
14 dict lookups on |
~0.1 µs |
~3 % |
Total |
~3.2 µs |
100 % |
The 4.4× speedup (14.1 → 3.2 µs) comes entirely from eliminating the scipy dispatch barrier: SolCF.csc_matvec is a @njit(cache=True) Numba helper and inner_F calls it as an intra-Numba function call, which LLVM typically inlines into the caller’s body.
Remaining gap to polar — and the performance regression¶
Even after the fast path, case30 hot F is still ≈ 2.9× slower than the for-loop form (3.2 µs vs 1.1 µs). The gap is structural:
polar — 53 scalar kernels, fully inlined by Numba into a single
inner_Ffunction body. One Python→Numba boundary crossing, zero sub-function calls at runtime.Mat_Mul — 8
SolCF.csc_matveccalls insideinner_F+ dispatch to 3 sub-functions (inner_F0,inner_F1,inner_F2) + theinner_Fdispatcher itself. Numba can inline some but not all of these layers, so you pay ~100 ns per sub-function call on top of the matvec work.
The 2 µs gap is usually invisible next to the J call (~55 µs) and the linear solve on anything non-trivial, but if your workload is millions of pure F evaluations on a small network and you don’t rebuild the module, the for-loop form still wins on hot F.
Module hot J is ~1.0× — a tie after the fast path. (The
Mat_MulJ uses the vectorised scatter-add in the Matrix-Vector Calculus chapter and is independent of the hot F change.)Inline hot F/J — without Numba, the for-loop form is slower across the board (~10× on F, ~3× on J) because lambdify has to walk 53 large scalar expressions + 361 Jacobian sub-expressions on every call.
Mat_Mul’s 3 vector equations + scipy SpMVs finish in a fraction of the time.
Which formulation should I use?¶
Use Mat_Mul when any of the following hold:
You iterate on the model. Compile-time savings are paid on every rebuild. A 47 s vs 2.7 s cold compile is the difference between a usable and an unusable interactive workflow.
The network is larger than a toy example. For networks with more than ~60 unknowns the SpMV dispatch cost is amortised over more arithmetic and Mat_Mul’s hot F catches up to or passes the for-loop form. At case118+ scale the 2.9× regression on case30 is already gone.
Your workload includes at least one Jacobian call per F call.
Jdominates per-J_(y, p)call cost (~55 µs on case30 for both paths), so a 2 µs hot-F regression is invisible relative to one full F+J pair. This matches every power-flow workload (Newton-Raphsonnr_method, SICNM) and the implicit-step inner loop of DHS quasi-dynamic energy-flow integrators — anything that needs a Jacobian per step.You want compact, paper-faithful equations.
Mat_Mul(G, e)replacesnbscalarEqns and the matrix-calculus engine derives the Jacobian automatically.
Consider the for-loop form when all of these hold:
The network is very small (≲ 30 unknowns).
Your hot loop is dominated by pure
Fevaluations (no Jacobian in the loop). This is rare — Newton methods always callJ; the only workloads that don’t are fixed-point iteration or residual-only solvers.The module is compiled once and then reused for millions of calls, so the 20–40× compile-time savings of
Mat_Mulare not recovered.
This combination is narrow: in practice, ~every power-flow or DHS use case sees Mat_Mul as the better choice.
Known performance regression on case30-scale networks¶
On case30 (58 unknowns) the Mat_Mul hot F is ≈ 2.9× slower than the for-loop form after the 0.8.1 fast path (≈ 3.2 µs vs ≈ 1.1 µs). The remaining gap is the structural cost of 8 SolCF.csc_matvec calls + 3 sub-function dispatches + the inner_F dispatcher, versus the for-loop form’s single inlined @njit body of 53 scalar trig kernels.
The gap will shrink further if any of these optimisations land:
SpMV fusion — 8 SpMVs on the same 4 matrices could be combined into 4 SpMMs (
G_nr @ [e|f],B_nr @ [e|f], etc.), halving the call overhead.CSR format — marginally faster matvec on small matrices (~3 % in our micro-benchmark); larger impact on networks big enough that cache behaviour starts mattering.
Inlining the matvec loops directly into
inner_F— avoids theSolCF.csc_matvechelper call entirely and lets Numba LICM the outer loop structure across the 8 matvecs.
None of these are in 0.8.1. Users who are hitting this regression should either (a) rebuild with the for-loop formulation for that specific workload, or (b) wait for the fusion optimisation.
Matrix shapes that fall out of the fast path¶
Every Mat_Mul(A, v) placeholder is classified at code-gen time. A placeholder goes on the fast path iff A is a plain sparse dim=2 Para. These shapes land on the slower fallback path (scipy SpMV in the wrapper, ~1.5 µs per call) and should be avoided in hot loops:
Mat_Mul(-A, v)— negated matrix. Workaround: write-Mat_Mul(A, v)instead; the sign flows through the outer expression andAstays on the fast path.Mat_Mul(A, Mat_Mul(B, v))— only the inner call is fast-path-compatible (its matrix isB, a plain Para); the outer call’s “matrix” is a placeholder_sz_mm_N, not a Para, so it falls back.Mat_Mul(A + B, v)— matrix expression, not a single Para. Workaround: precompute the combined matrix as a singleParam("AB", A + B, dim=2, sparse=True)outside the equation.Mat_Mul(A, v)whereAisParam(A, dim=2, sparse=False)— dense 2-D parameter. This also fires a one-shotUserWarningatFormJactime. Workaround: wrap the matrix incsc_array(...)and declaresparse=True.
Note
The cold-compile cost for the for-loop form (~47 s on M4) matches the “hundreds of seconds” figure quoted for the older Ryzen 5800H laptop earlier in this chapter. The absolute number is sensitive to CPU single-thread performance, but the ratio between the two formulations (≈17×) is driven almost entirely by the number of @njit kernels the code generator emits, which is a property of the formulation, not the hardware.
Ill-conditioned Power flow¶
The Newton method sometimes fails because it is not robust enough. We view this cases as having ill-conditioned initial settings. In this cases, we can use some more robust methods, such as the semi-implicit continuous Newton method (SICNM)[2] provided by Solverz. Shown below is an illustrative example of ill-conditioned power flow. The Newton failed while the SICNM easily converged.
import os
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import loadmat
from powerflow import mdl as pf, y as y0
from Solverz import nr_method, sicnm, Opt
current_dir = os.getcwd()
file_path = os.path.join(current_dir, "test_pf_jac", "ill-Va.mat")
mat = loadmat(file_path)
pv = (mat['pv'] - 1).reshape(-1).tolist()
pq = (mat['pq'] - 1).reshape(-1).tolist()
Va0 = np.deg2rad(mat['Va0'])
for i in range(len(pv + pq)):
y0['Va'][i:i + 1] = Va0[(pv + pq)[i]]
# Failed, cannot converge within 100 iterations
sol = nr_method(pf, y0, Opt(ite_tol=1e-5))
# Succeeded
sol1 = sicnm(pf, y0, Opt(scheme='rodas3d', rtol=1e-1, atol=1e-1, hinit=0.1, ite_tol=1e-5))
Vm_bench = np.array([0.98313825, 0.98009295, 0.98240617, 0.97318397, 0.9673554,
0.96062366, 0.98050609, 0.98440428, 0.98050609, 0.9854683,
0.97667682, 0.98022901, 0.97739562, 0.97686538, 0.96844031,
0.96528702, 0.96916633, 0.99338328, 0.9885663, 0.99021484,
0.97219415, 0.97471485, 0.9795967, 0.96788288])
# %% visualize error
plt.title(r'$\Delta |V_\text{m}|$')
plt.plot(np.abs(sol1.y['Vm']-Vm_bench))
plt.show()

By the way, our implementation of SICNM for MATPOWER can be found here