The Effect of Model Size on Worst-Group Generalization



Overparameterization is often believed to hurt test accuracy on rare subgroups. However, the prior work establishing this focused on settings where subgroup information is known and group DRO is applied to improve performance on the worst subgroup. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architecture (ResNet, VGG, or BERT), 2) domain (vision or natural language processing), 3) model size (width or depth), and 4) initialization (pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, larger pre-trained models are consistently better on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when subgroup labels are unknown.
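For readers unfamiliar with the metric, the following is a minimal sketch of how worst-group accuracy can be computed from a model's predictions. It is an illustration only, not the paper's evaluation code: the function name `worst_group_accuracy` and the toy data are hypothetical, and it assumes that group annotations are available at evaluation time even though they are not used for training.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst_group, worst_accuracy, per_group_accuracy).

    preds, labels, groups are equal-length sequences; groups holds a
    hashable subgroup identifier for each example (hypothetical format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)

    # Accuracy is computed separately for every subgroup that appears.
    per_group = {g: correct[g] / total[g] for g in total}
    # The worst group is the one with the lowest accuracy.
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst], per_group


# Toy data (made up for illustration): group "b" is the rare subgroup.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b"]

worst, worst_acc, per_group = worst_group_accuracy(preds, labels, groups)
average_acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(worst, worst_acc, average_acc)
```

In this toy example the average accuracy looks high while the rare group lags behind, which is the gap that worst-group metrics are meant to expose.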

NeurIPS DistributionShift Workshop