Multi-dimensional human feedback is supplanting thumbs-up/down across major labs
OpenAI, DeepMind, and Anthropic have all published versions of multi-dimensional RLHF in 2026, in which annotators score helpfulness, harmlessness, honesty, and task-specific quality separately rather than collapsing them into a single preference signal.
The convergence across three labs in one quarter is the signal. Each describes the technique slightly differently (Anthropic calls it "structured-reward fine-tuning," OpenAI calls it "factor decomposition," DeepMind calls it "multi-criterion preference"), but the mechanism is the same.
The motivation: a scalar preference signal discards information about why a response was preferred, and that lost information is exactly where reward hacking lives. With separate scores per dimension, a model that games the reward is harder to miss, because the gain on one dimension tends to show up as a legible drop on another.
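A toy illustration of that motivation, as a minimal sketch rather than any lab's actual method: the `Scores` class, the unweighted-mean `scalar()` aggregate, the `regressions` helper, and the numbers are all hypothetical. It shows a response that wins on a single collapsed score while losing on honesty, which only the per-dimension view exposes.

```python
from dataclasses import dataclass

# Dimension names follow the list in the text; everything else is hypothetical.
DIMENSIONS = ("helpfulness", "harmlessness", "honesty", "task_quality")


@dataclass(frozen=True)
class Scores:
    helpfulness: float
    harmlessness: float
    honesty: float
    task_quality: float

    def scalar(self) -> float:
        """Collapse all dimensions into one number (unweighted mean)."""
        return sum(getattr(self, d) for d in DIMENSIONS) / len(DIMENSIONS)


def regressions(before: Scores, after: Scores, tol: float = 0.0) -> dict:
    """Return {dimension: delta} for every dimension that dropped by more than tol."""
    dropped = {}
    for d in DIMENSIONS:
        delta = getattr(after, d) - getattr(before, d)
        if delta < -tol:
            dropped[d] = delta
    return dropped


# Made-up annotator scores on a 1-5 scale.
baseline = Scores(helpfulness=3.0, harmlessness=4.0, honesty=4.0, task_quality=3.0)
gamed = Scores(helpfulness=5.0, harmlessness=4.0, honesty=2.5, task_quality=3.5)

# The collapsed signal prefers the gamed response...
print(gamed.scalar() > baseline.scalar())

# ...while the per-dimension view shows what was traded away.
print(regressions(baseline, gamed))
```

Here the collapsed score rises from 3.5 to 3.75, so a scalar preference would reward the second response, while the per-dimension comparison flags the 1.5-point drop in honesty.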