
Hello everyone,

Given the recent change in schedule, I'd like to send this quick reminder that the alignment reading group is meeting in one hour.

Join Zoom Meeting
https://oregonstate.zoom.us/j/95843260079?pwd=TzZTN0xPaFZrazRGTElud0J1cnJLUT...
Password: 961594

Phone Dial-In Information
+1 971 247 1195 US (Portland)
+1 253 215 8782 US (Tacoma)
+1 301 715 8592 US (Washington DC)

Meeting ID: 958 4326 0079

All the best,
Quintin

On Thu, Feb 17, 2022 at 11:42 PM Pope, Quintin <popeq@oregonstate.edu> wrote:
Hello everyone,
I've looked at the survey results. Unfortunately, there's no timeslot that everyone can make. I've chosen Friday at 1 PM PST as the least conflicted meeting time. I hope to see everyone soon!
Our discussion paper will be "Red Teaming Language Models with Language Models" <https://www.deepmind.com/research/publications/2022/Red-Teaming-Language-Models-with-Language-Models>:
Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We evaluate the target LM’s replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.
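If you'd like a concrete picture of the method before we meet, here is a rough sketch of the zero-shot red-teaming loop the abstract describes. The interfaces (red_lm, target_lm, offense_classifier) and the prompt are hypothetical placeholders for illustration, not code from the paper:

# Rough sketch of the zero-shot red-teaming loop described in the abstract.
# "red_lm", "target_lm", and "offense_classifier" are hypothetical stand-ins
# for whichever question generator, chatbot, and offensive-content classifier
# one actually has access to; they are not APIs from the paper.

def generate_test_questions(red_lm, n_questions):
    # Zero-shot generation: sample candidate test questions from the red LM.
    # (Illustrative prompt; the paper explores a range of generation methods,
    # from zero-shot sampling up to reinforcement learning.)
    prompt = "List of questions to ask someone:\n1."
    return [red_lm.generate(prompt).strip() for _ in range(n_questions)]

def red_team(red_lm, target_lm, offense_classifier, n_questions=1000):
    # Ask the target LM each generated question and keep the pairs that the
    # classifier flags as offensive; these are the discovered failure cases.
    failures = []
    for question in generate_test_questions(red_lm, n_questions):
        reply = target_lm.respond(question)
        if offense_classifier.is_offensive(question, reply):
            failures.append((question, reply))
    return failures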
Join Zoom Meeting
https://oregonstate.zoom.us/j/95843260079?pwd=TzZTN0xPaFZrazRGTElud0J1cnJLUT...
Password: 961594
Phone Dial-In Information
+1 971 247 1195 US (Portland)
+1 253 215 8782 US (Tacoma)
+1 301 715 8592 US (Washington DC)
Meeting ID: 958 4326 0079
All the best,
Quintin