Modeling Language and Vision at Human Scales