
Hello everyone,

People have had schedule conflicts with our current meeting time, so I've decided to send out a survey to find a time that may work better. Please fill out the survey with your availability: http://whenisgood.net/swyfrcp

Depending on responses, we may or may not be able to meet this week. I'll send out an update once the meeting time is confirmed.

Our next paper will be "Red Teaming Language Models with Language Models" <https://www.deepmind.com/research/publications/2022/Red-Teaming-Language-Models-with-Language-Models>:

Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We evaluate the target LM’s replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

I hope as many people as possible are able to come to the new event.

Join Zoom Meeting
https://oregonstate.zoom.us/j/95843260079?pwd=TzZTN0xPaFZrazRGTElud0J1cnJLUT...
Password: 961594

Phone Dial-In Information
+1 971 247 1195 US (Portland)
+1 253 215 8782 US (Tacoma)
+1 301 715 8592 US (Washington DC)
Meeting ID: 958 4326 0079

All the best,
Quintin