Hey everyone,

As many of you probably know, Reinforcement Learning from Human Feedback (RLHF) was the core technique used to produce ChatGPT and the similar AI assistants that followed. Rather than having humans score outputs directly inside the RL loop, RLHF trains a preference model on a dataset of human preferences and uses that model as the reward signal.
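
For anyone who wants something concrete, here is a minimal sketch of the kind of pairwise loss a preference model is often trained with. The Bradley-Terry style objective and the name `preference_loss` are my assumptions for illustration, not something taken from the paper or from any particular RLHF codebase.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: the preference model assigns a scalar score to each response,
    and the human-preferred response should receive the higher score.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response scores higher, large otherwise.
loss_when_right = preference_loss(2.0, 0.0)
loss_when_wrong = preference_loss(0.0, 2.0)
```

Minimising this over many labelled pairs pushes the model to score preferred responses higher, and those scores can then serve as the reward during RL.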

Anthropic has devised an extension of this idea in which an AI model, rather than humans, generates the data that ultimately trains the preference model. This method, called Reinforcement Learning from AI Feedback (RLAIF), uses a "constitution" to guide the feedback model on which outputs are preferable to others.
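
To make the data-generation step a bit more tangible, here is a rough sketch of how AI-labelled preference pairs could be assembled. Everything in it is hypothetical: the `FeedbackModel` signature, `build_ai_preference_data`, and especially `toy_feedback_model`, which is a keyword stand-in for a real language model and not how the actual feedback model works.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (principle, prompt, response_a, response_b) -> P(a is better)
FeedbackModel = Callable[[str, str, str, str], float]


def build_ai_preference_data(
    pairs: List[Tuple[str, str, str]],
    principle: str,
    feedback_model: FeedbackModel,
) -> List[Dict[str, str]]:
    """Label each (prompt, response_a, response_b) triple using the feedback model."""
    dataset = []
    for prompt, a, b in pairs:
        p_a = feedback_model(principle, prompt, a, b)
        # The response the feedback model prefers becomes the "chosen" one.
        chosen, rejected = (a, b) if p_a >= 0.5 else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset


def toy_feedback_model(principle: str, prompt: str, a: str, b: str) -> float:
    """Stand-in only: prefers whichever response avoids the word 'insult'."""
    a_bad, b_bad = "insult" in a, "insult" in b
    if a_bad == b_bad:
        return 0.5
    return 0.0 if a_bad else 1.0


data = build_ai_preference_data(
    [("How do I reply to a rude email?", "Stay calm and be brief.", "Send an insult back.")],
    "Choose the less harmful response.",  # one illustrative constitutional principle
    toy_feedback_model,
)
```

The resulting list has the same shape as a human-labelled preference dataset, which is why it can be fed to the usual preference-model training step.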

I go over the research in How Reinforcement Learning from AI Feedback Works. In short, the authors find that they are able to train a non-evasive, harmless agent using a short constitution. They report that the method constitutes a Pareto improvement over RLHF models, i.e. the resulting models are more harmless without giving up helpfulness.


Let me know what you think. I'm happy to answer any questions!

