Hello everyone,
The survey results are in. Our first meeting will be at 2 PM (PST) on Friday, November 19th, and will repeat weekly.
For our first discussion, I thought we'd start with something short and optimistic:
Alignment by Default. It argues that powerful models may form internal "natural abstractions" that straightforwardly represent human values. Some combination of finetuning, interpretability tools, and luck may then be enough to "wire up" that human-values representation to the model's output and thereby produce an aligned model.
I'm interested in how plausible alignment by default seems to everyone, in whether there are architecture or training modifications we could make that would raise its odds, and in how best to manage the "wire up" step, where we go from a model with a human-values subcomponent to a globally aligned model.
Anyone who's at all interested is welcome to join!