
Hello everyone,

The survey results are in. Our first meeting will be at 2 PM (PST) on Friday, November 19th, and will repeat weekly.

For our first discussion, I thought we'd start with something short and optimistic: Alignment by Default <https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default>. It argues that powerful models may have internal "natural abstractions" that straightforwardly represent human values. Some combination of finetuning, interpretability tools, and luck may then be enough to "wire up" the human-values representation to the model's output and thereby get an aligned model.

I'm interested in how plausible alignment by default seems to everyone, in whether there are any architecture or training modifications that would raise its odds, and in how best to manage the "wire up" step, where we get a globally aligned model out of a model with a human-values subcomponent.

Anyone who's at all interested is welcome to join!

Join Zoom Meeting
https://oregonstate.zoom.us/j/95843260079?pwd=TzZTN0xPaFZrazRGTElud0J1cnJLUT...
Password: 961594

Phone Dial-In Information
+1 971 247 1195 US (Portland)
+1 253 215 8782 US (Tacoma)
+1 301 715 8592 US (Washington DC)

Meeting ID: 958 4326 0079