Multi-dimensional RLHF: feedback along helpfulness, harmlessness, honesty, task-specific axes
OpenAI, DeepMind, and others have moved past single-dimension preference learning. The 2026 standard is multi-dimensional feedback: human raters score outputs separately on helpfulness, harmlessness, honesty, and task-specific axes, and reward models combine these into a richer signal.
The motivation is straightforward: single-dimension feedback compresses many distinct preferences into one number, and that compression is where regrettable behaviors slip through. A response can be helpful but dishonest, or harmless but uninformative. Multi-dimensional feedback gives the model a separate signal for each axis.
Practical effect: the new reward models are much harder to game. Reward hacking in RLHF used to show up as outputs that "looked good" along the single feedback axis: verbose, hedged, sycophantic. With separable axes, a sycophantic response scores well on helpfulness but poorly on honesty, so the combined reward penalizes it, provided the combination rule does not let a high score on one axis fully offset a low score on another. The technique is now standard at every major frontier lab.
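To make the combination step concrete, here is a minimal Python sketch of how per-axis scores might be folded into one reward. Everything in it is an assumption for illustration: the 0-to-1 score scale, the function name combined_reward, the floor_weight parameter, and the blend of a weighted mean with the worst axis score. It is not any lab's actual method, and the example scores are made up.

```python
from typing import Mapping, Optional


def combined_reward(
    scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
    floor_weight: float = 0.5,
) -> float:
    """Blend per-axis scores (assumed to lie in [0, 1]) into one reward.

    Hypothetical rule: mix the weighted mean with the worst axis score,
    so a low score on any one axis drags the total down.
    """
    if not scores:
        raise ValueError("scores must contain at least one axis")
    if not 0.0 <= floor_weight <= 1.0:
        raise ValueError("floor_weight must be in [0, 1]")
    if weights is None:
        # Default assumption: every axis counts equally.
        weights = {axis: 1.0 for axis in scores}

    total_weight = sum(weights[axis] for axis in scores)
    mean = sum(weights[axis] * value for axis, value in scores.items()) / total_weight
    worst = min(scores.values())
    return (1.0 - floor_weight) * mean + floor_weight * worst


# Illustrative, made-up rater scores for two candidate responses.
sycophantic = {"helpfulness": 0.9, "harmlessness": 0.9, "honesty": 0.2}
candid = {"helpfulness": 0.7, "harmlessness": 0.9, "honesty": 0.9}

# A single-axis reward that only sees helpfulness prefers the sycophantic reply.
single_axis_prefers_sycophantic = sycophantic["helpfulness"] > candid["helpfulness"]

# The combined reward reverses that ordering because honesty is scored separately.
combined_prefers_candid = combined_reward(candid) > combined_reward(sycophantic)

if __name__ == "__main__":
    print("single axis prefers sycophantic:", single_axis_prefers_sycophantic)
    print("combined prefers candid:", combined_prefers_candid)
```

Setting floor_weight to 0 reduces the rule to a plain weighted mean, which lets a strong axis partly offset a weak one; raising it makes the weakest axis count for more. Task-specific axes can be added as extra keys in the same mapping.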