Hello everyone,
Models of human behavior for prediction and collaboration tend to fall into
two categories: ones that learn from large amounts of data via imitation
learning, and ones that assume human behavior to be noisily-optimal for some
reward function. The former are very useful, but only when it is possible to
gather a lot of human data in the target environment and distribution. The
advantage of the latter type, which includes Boltzmann rationality, is the
ability to make accurate predictions in new environments without extensive data
when humans are actually close to optimal. However, these models fail when
humans exhibit systematic suboptimality, i.e. when their deviations from
optimal behavior are not independent, but instead consistent over time. Our key
insight is that systematic suboptimality can be modeled by predicting policies,
which couple action choices over time, instead of trajectories. We introduce
the Boltzmann policy distribution (BPD), which serves as a prior over human
policies and adapts via Bayesian inference to capture systematic deviations by
observing human actions during a single episode. The BPD is difficult to
compute and represent because policies lie in a high-dimensional continuous
space, but we leverage tools from generative and sequence modeling to enable
efficient sampling and inference. We show that the BPD predicts human behavior
and supports human-AI collaboration as well as imitation learning-based human
models do, while using far less data.
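
To make the setup concrete, here is a small toy sketch of Boltzmann
rationality (my own illustrative Python, not code from the paper): the human
is modeled as choosing actions noisily-optimally, with P(a | s) proportional
to exp(beta * Q(s, a)), independently at every timestep, so consistent
deviations cannot be captured.

import numpy as np

def boltzmann_action_probs(q_values, beta=2.0):
    # Softmax over the optimal Q-values; beta is the rationality coefficient.
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: three actions whose optimal Q-values are 1.0, 0.9, and 0.2.
print(boltzmann_action_probs([1.0, 0.9, 0.2]))
# Because deviations are treated as independent noise at each step, this model
# cannot represent a human who consistently prefers the second action.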
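
And a second sketch of the idea behind the BPD's Bayesian updating, using a
crude discretization I made up purely for illustration (the actual BPD is a
prior over a continuous, high-dimensional policy space represented with
generative and sequence models): each candidate policy gets prior weight
proportional to exp(beta * its expected return), and observing the human's
actions within an episode yields a posterior that can lock onto systematic
deviations.

import numpy as np

def bpd_posterior(policies, returns, observations, beta=2.0):
    # policies: list of (num_states, num_actions) arrays of action probabilities.
    # returns: expected return of each candidate policy under the task reward.
    # observations: (state, action) pairs seen so far in the current episode.
    log_prior = beta * np.asarray(returns, dtype=float)
    log_lik = np.array([
        sum(np.log(pi[s, a]) for s, a in observations) for pi in policies
    ])
    log_post = log_prior + log_lik
    log_post -= log_post.max()  # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Example: two candidate policies over one state and two actions.
near_optimal = np.array([[0.9, 0.1]])   # mostly takes the high-reward action
quirky       = np.array([[0.2, 0.8]])   # systematically prefers action 1
returns      = [1.0, 0.4]               # the quirky policy earns less reward

obs = [(0, 1), (0, 1), (0, 1)]          # the human repeatedly picks action 1
print(bpd_posterior([near_optimal, quirky], returns, obs))
# After a few observations the posterior shifts toward the "quirky" policy,
# so subsequent predictions reflect the human's consistent deviation.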
All the best,
Quintin Pope