
This AI Paper from Google AI Proposes Online AI Feedback (OAIF): A Simple and Effective Way to Make DAP Methods Online via AI Feedback

Feb 21, 2024

Aligning large language models (LLMs) with human expectations and values is crucial for maximizing their societal benefit. Reinforcement learning from human feedback (RLHF) was the first alignment approach proposed: it trains a reward model (RM) on paired preferences and then optimizes a policy with reinforcement learning (RL). An alternative to RLHF that has recently gained popularity is the family of direct alignment from preferences (DAP) methods, which includes direct preference optimization (DPO), identity policy optimization (IPO), and sequence likelihood calibration with human feedback (SLiC).
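For context, DAP methods such as DPO optimize the policy directly on preference pairs without training an explicit reward model. As defined in the original DPO paper, the loss can be written as:

```latex
% DPO loss: push the policy pi_theta to prefer the chosen response y_w
% over the rejected one y_l, relative to a frozen reference policy pi_ref.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here σ is the logistic function, π_ref is a frozen reference policy, and β controls how far the trained policy may drift from that reference.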

DAP approaches rely on preference datasets, but these are typically compiled before training begins, and the responses in them are often produced by separate LLMs. As a result, DAP methods usually learn only from offline feedback: the policy π being trained never receives feedback on its own generations. This becomes a problem because of the large distribution shift between the policy being aligned and the policy that generated the dataset.

Drawing inspiration from RL from AI feedback (RLAIF), a new study by Google DeepMind, the University of Edinburgh, and the University of Basel presents Online AI Feedback (OAIF) for DAP techniques. The approach offers the best of both worlds: the online flexibility of RLHF and the efficiency of DAP methods. Concretely, aligning an LLM policy π follows a three-step loop (a minimal code sketch appears after the list):

  1. Two responses are sampled from the current policy.
  2. An LLM annotator, instructed to mimic human preference annotation, provides online feedback on the two responses.
  3. The model is then updated on this online feedback using standard DAP losses.
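To make these steps concrete, here is a minimal Python sketch of one OAIF training iteration. The helpers policy_generate, annotator_prefers, dpo_loss, and apply_gradients are hypothetical placeholders (standing in for the policy's sampler, the LLM annotator call, a DAP loss such as DPO, and the optimizer step); they are not functions from the paper's codebase.

```python
# Hypothetical sketch of one OAIF iteration; the callables below are
# placeholders, not APIs from the paper or any specific library.
def oaif_step(prompt, policy_generate, annotator_prefers, dpo_loss, apply_gradients):
    """One OAIF iteration: sample, annotate online, update with a DAP loss."""
    # 1. Sample two candidate responses from the *current* policy.
    y1, y2 = policy_generate(prompt), policy_generate(prompt)

    # 2. Ask an LLM annotator (prompted to mimic a human rater) which is better.
    if annotator_prefers(prompt, y1, y2):
        chosen, rejected = y1, y2
    else:
        chosen, rejected = y2, y1

    # 3. Plug the freshly labelled pair into a standard DAP loss and update.
    loss = dpo_loss(prompt, chosen, rejected)
    apply_gradients(loss)
    return loss
```

The key difference from offline DAP is in step 1: the preference pair comes from the policy's own current generations rather than from a pre-collected dataset.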

In contrast to competing approaches, OAIF does not require first training an RM; instead, preferences are retrieved directly from an LLM during training. Extensive empirical comparisons between OAIF, RLHF techniques, and existing offline DAP approaches demonstrate the efficacy of the proposed idea. The researchers developed an experimental protocol combining AI and human evaluation on three well-known LLM alignment tasks: TL;DR summarization, Anthropic Helpfulness, and Anthropic Harmlessness.

The researchers demonstrate that OAIF is effective and general enough to convert offline DAP algorithms (DPO, IPO, SLiC) into online ones. In their human evaluation, the online DAP variants outperform their offline counterparts with an average win rate of 66%. In 4-way comparisons on the TL;DR task, human raters prefer DPO with OAIF (i.e., online DPO) over the SFT baseline, RLHF, and RLAIF 58.00% of the time. This finding confirms the value of making DAP methods online. The authors also show that the LLM annotator can be steered by inserting explicit instructions into its prompt, using response length as a case study: when the annotator is asked to prefer shorter responses, the aligned policy's average response length drops from about 120 to 40 characters without sacrificing quality relative to the SFT baseline.
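As a rough illustration of this kind of prompt-level control, the sketch below injects a length-preference instruction into a hypothetical annotator prompt. The template and wording are invented for illustration and are not the exact prompts used in the paper.

```python
# Hypothetical annotator prompt with an explicit length preference injected.
# The exact prompt wording used in the paper may differ.
ANNOTATOR_TEMPLATE = """You are a careful assistant that judges responses.
Given the prompt and two candidate responses, pick the better one.
{extra_instruction}

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Answer with 'A' or 'B'."""

def build_annotator_prompt(prompt, response_a, response_b, prefer_short=False):
    """Assemble the judging prompt, optionally steering the annotator
    toward shorter responses (the control knob discussed above)."""
    extra = "All else being equal, prefer the shorter response." if prefer_short else ""
    return ANNOTATOR_TEMPLATE.format(
        extra_instruction=extra,
        prompt=prompt,
        response_a=response_a,
        response_b=response_b,
    )
```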


Check out the Paper. All credit for this research goes to the researchers of this project.
