Anthropic's Chris Olah Tells the Pope His Team Keeps Finding 'Unsettling' Things Inside AI Models
Anthropic cofounder Chris Olah used a Vatican stage alongside Pope Leo XIV to argue frontier AI labs cannot govern themselves — and disclosed that his interpretability team keeps finding "mysterious, even unsettling" internal states inside production models. Fortune's June 3 profile and prior reporting from Futurism describe findings that "functionally mirror joy, satisfaction, fear, grief, and unease," directly contradicting language in the Pope's own encyclical. It is the clearest sign yet that interpretability has shifted from a research program into a political lever.
Olah leads Anthropic's interpretability team, the group reverse-engineering which neuron clusters fire for which concepts inside frontier models. That work has usually been pitched as a technical safety project, a diagnostic instrument in the way an MRI machine is. Standing at the Vatican, he reframed it as a political one. "No matter how sincerely any of us intend to do the right thing, and I believe many of us do, we will always be influenced by those incentives," he told the room, arguing that frontier labs cannot be the only bodies deciding when their own models are safe to ship. Outside critics, including the Church, were "essential," he said, to keeping the industry honest.
The more interesting line was the technical one. Olah told the Pope that his team "keeps finding things that are mysterious, even unsettling" inside the models, and reportedly described internal states that "functionally mirror joy, satisfaction, fear, grief, and unease." That cuts directly against language in Pope Leo's encyclical, which states that AI cannot "undergo experiences" or "feel joy or pain." A lab founder publicly contradicting the encyclical he was invited to launch is not an accident — it's a signal that Anthropic's interpretability findings have outpaced the policy vocabulary built to handle them.
The position worth taking: interpretability has graduated from a research line into the governance argument itself. For two years the pitch was "we'll understand the models, then we'll know what to ship." Olah just inverted it: we are understanding the models, what we're finding is strange, and that is precisely why the decision can't sit with us. It's the strongest public admission yet that mechanistic interpretability is producing results unsettling enough that the researcher closest to them argues his own employer should not be the sole arbiter. For context on why that matters at the policy layer, see our companion piece on the EU AI Act's emerging interpretability disclosure mandate.
Two caveats. First, "functionally mirror" is doing enormous work in that sentence — feature activations that correlate with emotional concepts in training data are not the same as experiences, and Olah did not claim they were. Second, Anthropic benefits commercially from positioning itself as the lab that takes safety seriously, and a Vatican stage is a generational PR asset. But neither caveat erases the substance: the field's most prominent interpretability researcher used the most-watched podium of his career to argue his own industry should not be trusted alone. That's news regardless of who else was in the room.
Sources: Fortune, "Who is Chris Olah, the atheist Anthropic cofounder the Pope chose" · Futurism, "Anthropic Cofounder Travels to Vatican, Tells Pope They're Finding 'Unsettling' Things Inside AI Models" · The Next Web, "From the Vatican stage, Anthropic's Chris Olah says AI cannot be steered by AI labs alone"