Most of today's LLMs (for example, ChatGPT) are aligned using reinforcement learning from human feedback (RLHF), in which human evaluators reward or penalize the model's outputs to steer its behavior. This process, however, is only effective when the evaluator can reliably judge whether the model's behavior is good or bad.
Superhuman models have the potential to perform very complex behaviors that are beyond human comprehension. For example, a superhuman model could generate millions of lines of complicated code for which a human cannot provide reliable supervision. Aligning such models therefore becomes a fundamental challenge, and researchers at OpenAI have tried to tackle it through an analogy: can a smaller (less capable) model supervise a larger (more capable) one?
The researchers created weak supervisors by finetuning small pretrained models on ground-truth labels. They then took the weak model's predictions on a held-out set of examples as "weak labels" and finetuned a strong model on them. Finally, for comparison, they finetuned the strong model directly on the ground-truth labels to obtain a strong "ceiling." This setup lets the researchers study any pair of weak and strong models on any task of interest.
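To make the setup concrete, here is a minimal sketch of the three-step pipeline using scikit-learn stand-ins: a logistic regression as the "weak" model and an MLP as the "strong" one, on a synthetic task. The model and dataset choices are illustrative assumptions, not OpenAI's actual GPT-series experiments.

```python
# Minimal weak-to-strong sketch with scikit-learn stand-ins (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# 1) Weak supervisor: a small model finetuned on ground-truth labels.
weak = LogisticRegression(max_iter=200).fit(X_weak, y_weak)

# 2) Weak labels: the weak model's predictions on held-out examples.
weak_labels = weak.predict(X_train)

# 3) Weak-to-strong model: the strong model finetuned on the weak labels...
strong_w2s = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300,
                           random_state=0).fit(X_train, weak_labels)

# ...and, for comparison, a strong "ceiling" trained on ground truth.
strong_ceiling = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300,
                               random_state=0).fit(X_train, y_train)

print("weak supervisor:", weak.score(X_test, y_test))
print("weak-to-strong :", strong_w2s.score(X_test, y_test))
print("strong ceiling :", strong_ceiling.score(X_test, y_test))
```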
The researchers considered three settings for evaluation (NLP tasks, chess puzzles, and ChatGPT reward modeling) and assessed how well the strong model generalized when finetuned on weak labels. When GPT-4 was supervised by a GPT-2-level model on NLP tasks, the resulting performance fell between that of GPT-3 and GPT-3.5, recovering much of GPT-4's capability. The results also show promising weak-to-strong generalization on chess puzzles, but the researchers observed that weak-to-strong generalization is poor for ChatGPT reward modeling.
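The paper summarizes these results with a "performance gap recovered" (PGR) metric: the fraction of the gap between the weak supervisor and the strong ceiling that weak-to-strong training closes. A simple sketch of the computation (the numbers below are made up for illustration):

```python
def performance_gap_recovered(weak_acc, w2s_acc, ceiling_acc):
    """Fraction of the weak-to-ceiling gap closed by weak-to-strong training."""
    return (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)

# Hypothetical scores: weak 0.60, weak-to-strong 0.75, strong ceiling 0.80.
print(performance_gap_recovered(0.60, 0.75, 0.80))  # 0.75, i.e. 75% recovered
```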
The researchers also observed that performance could be improved by training the strong model with an auxiliary confidence loss, which encourages it to trust its own predictions when they disagree with the weak labels. In the abovementioned NLP setting, this auxiliary confidence loss let the researchers recover about 80% of the performance gap between the two models. Additionally, bootstrapping with intermediate model sizes (aligning a slightly superhuman model, using that to align an even smarter model, and so on) also improves weak-to-strong generalization on chess puzzles.
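A rough PyTorch sketch of such a confidence loss follows. It mixes the cross-entropy against the weak labels with a term that reinforces the strong model's own hardened predictions; the `alpha` weight and the argmax hardening rule are simplifying assumptions here (the paper tunes these details), so treat this as an illustration rather than the exact method.

```python
import torch
import torch.nn.functional as F

def confidence_loss(strong_logits, weak_labels, alpha=0.5):
    """Sketch of an auxiliary confidence loss (alpha and hardening are assumptions).

    strong_logits: (batch, num_classes) logits from the strong model.
    weak_labels:   (batch,) class indices predicted by the weak supervisor.
    """
    # Standard term: imitate the weak supervisor's labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Confidence term: reinforce the strong model's own hardened predictions.
    hardened = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * ce_weak + alpha * ce_self
```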
The research has a few limitations: the methods are not consistently effective across all settings, and they serve more as a proof of concept than as a practical solution ready for deployment. Despite this, the researchers are encouraged by the results and have shown that the ability of weak models to elicit the capabilities of strong models can be improved significantly with very simple methods. The work is a promising starting point for tackling superalignment, and the researchers have taken steps such as open-sourcing their code and launching grant programs to kickstart more research in this area.
Check out the Paper and OpenAI Blog. All credit for this research goes to the researchers of this project.