"Transformers are Bayesian Networks" — every sigmoid transformer implements weighted loopy belief propagation
A March 2026 arXiv paper proves that every sigmoid transformer architecture, with any weights, implements weighted loopy belief propagation on its implicit factor graph. The paper offers a precise answer to the long-standing question of why transformers work: by construction, they perform approximate Bayesian inference.
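For intuition about why a sigmoid is a natural fit for this kind of claim, here is a minimal sketch of a standard textbook identity, not the paper's construction: for a binary variable, combining independent pieces of evidence with Bayes' rule is the same as applying a sigmoid to a sum of log-odds. The function names and the `weights` parameter are hypothetical, chosen for illustration; with all weights equal to 1 the two computations agree exactly.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def posterior_by_bayes(prior: float, likelihood_ratios: list) -> float:
    """P(x=1 | evidence), computed by multiplying the prior odds by each
    likelihood ratio P(e | x=1) / P(e | x=0)."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


def posterior_by_sigmoid(prior: float, likelihood_ratios: list, weights=None) -> float:
    """The same posterior written as a sigmoid of summed log-odds.

    `weights` is an illustrative assumption: weights of 1.0 recover exact
    Bayes, while other values over- or under-count a piece of evidence.
    """
    if weights is None:
        weights = [1.0] * len(likelihood_ratios)
    logit = math.log(prior / (1.0 - prior))
    logit += sum(w * math.log(lr) for w, lr in zip(weights, likelihood_ratios))
    return sigmoid(logit)


if __name__ == "__main__":
    prior = 0.3
    ratios = [2.0, 0.5, 4.0]  # made-up evidence, for illustration only
    print(posterior_by_bayes(prior, ratios))
    print(posterior_by_sigmoid(prior, ratios))
```

The two printed values should match up to floating-point error. Passing non-unit `weights` shows, in miniature, what "weighted" message combination could mean, though the paper's own definition may differ.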
This is the theoretical bridge the field has been missing for eight years. Transformers have been treated as black-box function approximators with empirically excellent results; this paper recasts them as a known class of probabilistic graphical model with well-understood inference dynamics.
The practical implication is that transformer behavior should be analyzable with Bayesian-network tools: message-passing diagnostics, factor graph visualizations, and identifiability theorems. Expect a wave of follow-up work applying this lens to interpretability, where it connects to mechanistic interpretability, and to architectural design, where the question is when belief propagation converges and when it keeps oscillating on a loopy graph.
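To make "message-passing diagnostics" concrete, here is a minimal sketch of generic sum-product loopy belief propagation on a small pairwise model over binary variables, with a per-iteration residual as the convergence diagnostic. It is the textbook algorithm, not the paper's weighted variant or anything extracted from a transformer; the function name `loopy_bp`, its parameters, and the example potentials are all assumptions made for illustration.

```python
import numpy as np


def loopy_bp(unary, pairwise, edges, max_iters=100, tol=1e-8, damping=0.0):
    """Sum-product loopy BP on a pairwise model over binary variables.

    unary:    (n, 2) array of nonnegative node potentials.
    pairwise: dict mapping an edge (i, j) to a (2, 2) table indexed [x_i, x_j].
    edges:    list of undirected edges (i, j).
    Returns (beliefs, residuals, converged).
    """
    unary = np.asarray(unary, dtype=float)
    n = unary.shape[0]
    neighbors = {i: [] for i in range(n)}
    msgs = {}  # msgs[(i, j)] is the message from i to j, over states of j
    for i, j in edges:
        msgs[(i, j)] = np.full(2, 0.5)
        msgs[(j, i)] = np.full(2, 0.5)
        neighbors[i].append(j)
        neighbors[j].append(i)

    def potential(i, j):
        # Return the edge table indexed [x_i, x_j], whichever way it was stored.
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    residuals, converged = [], False
    for _ in range(max_iters):
        new_msgs = {}
        for (i, j), old in msgs.items():
            # Combine node i's potential with all incoming messages except j's.
            h = unary[i].copy()
            for k in neighbors[i]:
                if k != j:
                    h = h * msgs[(k, i)]
            m = potential(i, j).T @ h  # sum over x_i
            m = m / m.sum()
            new_msgs[(i, j)] = damping * old + (1.0 - damping) * m
        # Diagnostic: largest change in any message entry this iteration.
        residual = max(np.abs(new_msgs[e] - msgs[e]).max() for e in msgs)
        residuals.append(residual)
        msgs = new_msgs
        if residual < tol:
            converged = True
            break

    beliefs = unary.copy()
    for i in range(n):
        for k in neighbors[i]:
            beliefs[i] *= msgs[(k, i)]
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    return beliefs, residuals, converged


if __name__ == "__main__":
    # A 3-node cycle with weak attractive couplings (made-up numbers).
    coupling = np.array([[1.2, 0.8], [0.8, 1.2]])
    edges = [(0, 1), (1, 2), (0, 2)]
    unary = np.array([[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]])
    beliefs, residuals, converged = loopy_bp(unary, {e: coupling for e in edges}, edges)
    print("converged:", converged, "after", len(residuals), "iterations")
    print(beliefs)
```

Watching `residuals` shrink, stall, or oscillate is the simplest version of the convergence question raised above. On a tree the beliefs are exact marginals; on graphs with cycles they are approximations, and the `damping` parameter is one common remedy when messages oscillate.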
arXiv — Transformers are Bayesian Networks → · arXiv — Decoupling Knowledge and Reasoning →