Overparameterization is often believed to hurt test accuracy on rare subgroups. However, prior work establishing this focused on settings where subgroup information is known and group DRO is applied to improve worst-subgroup performance. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architectures (ResNet, VGG, or BERT), 2) domains (vision or natural language processing), 3) model size (width or depth), and 4) initialization (with pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may improve, worst-group test performance under ERM across all setups. In particular, larger pre-trained models consistently perform better on Waterbirds and MultiNLI. We therefore advise practitioners to use larger pre-trained models when subgroup labels are unknown.
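As a concrete illustration of the evaluation metric, the sketch below computes worst-group test accuracy: the minimum per-subgroup accuracy over all subgroups (e.g. the four (class, background) groups in Waterbirds). This is a hypothetical helper for clarity, not code from the paper; the function name and inputs are assumptions.

```python
from collections import defaultdict

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy over all subgroups.

    preds, labels, groups: parallel sequences giving, for each test
    example, the model prediction, the true label, and the subgroup id.
    """
    correct = defaultdict(int)  # per-group count of correct predictions
    total = defaultdict(int)    # per-group count of examples
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    # Worst-group accuracy: the accuracy of the weakest subgroup.
    return min(correct[g] / total[g] for g in total)

# Example: group 0 has accuracy 1/2, group 1 has accuracy 2/2,
# so worst-group accuracy is 0.5.
print(worst_group_accuracy([1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]))
```

Note that computing this metric requires subgroup labels at test time only; the paper's ERM training uses no subgroup information.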