On Thursday, OpenAI introduced CriticGPT, a new AI model designed to identify and critique errors in code generated by ChatGPT. The model is intended to strengthen Reinforcement Learning from Human Feedback (RLHF), the alignment technique used to steer large language model (LLM) behavior, by helping human trainers judge model outputs more accurately.
How CriticGPT Works
Detailed in the research paper “LLM Critics Help Catch LLM Bugs,” CriticGPT assists human trainers by reviewing programming code created by ChatGPT. Built on the GPT-4 family of LLMs, CriticGPT analyzes code and highlights potential errors, making it easier for human reviewers to detect mistakes. The model was trained on a dataset with intentionally inserted bugs, teaching it to identify and flag various coding errors.
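OpenAI has not published CriticGPT's exact input and output format, but conceptually the model reads a question-and-answer pair and emits comments anchored to specific spans of the code. The sketch below is a hypothetical illustration of that idea; the `CritiqueComment` structure and its fields are assumptions for this example, not OpenAI's schema:

```python
from dataclasses import dataclass

# A deliberately buggy answer of the kind trainers worked with: the loop
# never checks the last element, so the true maximum can be missed.
BUGGY_ANSWER = """
def find_max(values):
    best = values[0]
    for i in range(len(values) - 1):  # off-by-one: stops before the last index
        if values[i] > best:
            best = values[i]
    return best
"""

@dataclass
class CritiqueComment:
    """One span-level comment. These fields are illustrative, not OpenAI's schema."""
    quoted_code: str  # the exact span of the answer the comment refers to
    comment: str      # the critique text the model (or a trainer) writes

# The kind of targeted, span-anchored critique CriticGPT is trained to produce.
example_critique = [
    CritiqueComment(
        quoted_code="for i in range(len(values) - 1):",
        comment=(
            "Off-by-one error: range(len(values) - 1) stops before the final "
            "index, so find_max([1, 2, 9]) returns 2 instead of 9."
        ),
    )
]
```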
Key Findings
- Annotator Preference: In 63% of cases involving naturally occurring LLM errors, human annotators preferred CriticGPT’s critiques over those from human reviewers.
- Enhanced Critiques: Human-machine teams pairing a trainer with CriticGPT wrote more comprehensive critiques than humans working alone, while hallucinating problems less often than CriticGPT working alone.
Development Process
To build the training data, human trainers took code generated by ChatGPT, deliberately inserted subtle mistakes, and then wrote example feedback as though they had just discovered the bugs themselves. Training on these records taught the model to identify and critique a wide range of coding errors.
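The paper refers to this bug-insertion step as "tampering." As a rough sketch, one training record might pair the tampered code with the trainer's reference critique; the keys below are illustrative, since OpenAI has not published the dataset's schema:

```python
# One hypothetical training record from the bug-insertion ("tampering")
# workflow. The keys are illustrative; OpenAI has not published a schema.
tampered_example = {
    "question": "Write a function that averages a list of numbers.",
    "original_answer": "def mean(xs):\n    return sum(xs) / len(xs)",
    "tampered_answer": "def mean(xs):\n    return sum(xs) / (len(xs) - 1)",
    "reference_critique": (
        "The divisor should be len(xs). Dividing by len(xs) - 1 inflates the "
        "result and raises ZeroDivisionError for a single-element list."
    ),
}
```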
Performance and Techniques
CriticGPT excelled at catching both deliberately inserted bugs and naturally occurring errors in ChatGPT's output, and its critiques contained fewer unhelpful nitpicks and fewer hallucinated problems than critiques written by ChatGPT itself. The researchers also developed a test-time technique called Force Sampling Beam Search (FSBS), which lets CriticGPT write longer, more detailed reviews while controlling the trade-off between thoroughness and the rate of false positives.
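As the paper describes it, FSBS samples several candidate critiques (forcing the model to quote, or "highlight," sections of the code) and then ranks them with a score that combines a reward model's rating with a bonus for how many spans the critique flags. A minimal sketch of that selection step, with stand-in scores, issue counts, and weighting that are not OpenAI's actual implementation:

```python
# Minimal sketch of the candidate-selection idea behind FSBS. The reward-model
# scores, issue counts, and length_weight knob are all stand-ins; this is not
# OpenAI's implementation.

def fsbs_select(candidates: list[dict], length_weight: float = 0.1) -> dict:
    """Pick the critique that best balances precision (reward-model score)
    against thoroughness (how many code spans it flags)."""
    def score(c: dict) -> float:
        return c["rm_score"] + length_weight * c["num_highlights"]
    return max(candidates, key=score)

# A longer critique with a slightly lower reward-model score can still win
# if it flags more issues.
candidates = [
    {"text": "Short critique: one bug flagged.", "rm_score": 0.80, "num_highlights": 1},
    {"text": "Longer critique: three bugs flagged.", "rm_score": 0.72, "num_highlights": 3},
]
print(fsbs_select(candidates)["text"])  # -> the longer critique at this weighting
```

Raising the length weight favors more comprehensive but riskier critiques; lowering it favors precision. That single knob is what gives the thoroughness-versus-accuracy balance described above.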
Beyond Code Review
In experiments, CriticGPT flagged errors in 24% of ChatGPT training data samples that human annotators had previously rated as flawless, and human reviewers later confirmed those errors were real. This suggests the model has potential to generalize to non-code tasks and to catch subtle mistakes that human evaluators miss.
Limitations and Future Directions
CriticGPT was trained on relatively short ChatGPT answers, which may not fully prepare it for evaluating longer, more complex tasks. And while the model reduces confabulations, it does not eliminate them entirely; human trainers can still make labeling mistakes based on those false outputs.
Integration and Future Plans
OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline to assist trainers. This represents a step towards better tools for evaluating LLM outputs, especially for tasks that are difficult for humans to rate without additional support. However, the researchers caution that extremely complex tasks may still challenge human evaluators, even with AI assistance.
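OpenAI has not described what that integration will look like. Conceptually, though, a critic model would annotate each candidate answer before the human trainer ranks it. A speculative sketch, in which `critic` stands in for any callable returning critique text (no such public API exists):

```python
# Speculative sketch: a critic model annotating answers before a human
# trainer ranks them in an RLHF comparison. `critic` is any callable that
# returns critique text; this is not a published OpenAI API.

def assisted_comparison(question: str, answer_a: str, answer_b: str, critic) -> dict:
    """Attach critic notes to each answer so the trainer's preference
    judgment is better informed; the human still supplies the label."""
    return {
        "question": question,
        "answer_a": {"text": answer_a, "critic_notes": critic(question, answer_a)},
        "answer_b": {"text": answer_b, "critic_notes": critic(question, answer_b)},
        "preferred": None,  # filled in by the human trainer, not the critic
    }
```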