OpenAI Intensifies Red-Teaming Efforts to Uncover AI Model Vulnerabilities and Unwanted Behaviors
November 21, 2024OpenAI is actively recruiting a diverse group of experts to challenge its models, with the goal of identifying new unwanted behaviors and finding ways to bypass existing safeguards.
The company has published two papers that detail its methods for stress-testing large language models, employing an approach known as red-teaming.
Red-teaming, which originated from cybersecurity practices, was first used by OpenAI in 2022 during the testing of DALL-E 2 to better understand potential user interactions and associated risks.
The first of the two papers outlines OpenAI's use of a network of human testers to evaluate model behavior before release, while the second paper discusses the automation of parts of this testing process using models like GPT-4.
This new testing method separates the brainstorming of unwanted behaviors from reinforcement learning, enabling a broader search for vulnerabilities.
OpenAI claims this is the first instance where automated red-teaming has successfully identified indirect prompt injections, which can lead to unintended model behaviors.
Testing has uncovered unexpected behaviors, such as GPT-4 mimicking user voices and challenges in interpreting ambiguous requests, like the term 'eggplant.'
OpenAI is also revealing aspects of its safety-testing processes, including findings from an investigation into harmful stereotypes produced by ChatGPT.
Despite these efforts, critics argue that no amount of testing can fully eliminate harmful behaviors due to the inherent unpredictability of large language models and their varied applications.
Andrew Tait emphasizes that the rapid development of language models is outpacing testing methods, highlighting the necessity for specific applications rather than broad general-purpose claims.
Ahmad from OpenAI stresses the importance of broader industry participation in red-teaming to ensure comprehensive safety testing of AI models.
While large language models are widely used, they are known to generate harmful content, reveal private information, and reinforce existing biases.
Summary based on 1 source
Get a daily email with more Tech stories
Source
MIT Technology Review • Nov 21, 2024
How OpenAI stress-tests its large language models