Language Models of Visual Cortex: Where do they work? And why do they work so well where they do?

Colin Conwell, Jacob S. Prince, George A. Alvarez, Talia Konkle

Journal of Vision (2023)

Abstract

It’s often taken for granted that the best models of visual cortex are vision models. Recent research into models that learn from various combinations of vision and language, however, has reinvigorated longstanding debates over just how visual our models of visual cortex really need to be. In this work, we characterize where and to what extent unimodal language models or multimodal vision-language models best predict evoked visual activity in the human ventral stream. We do this with a series of controlled modeling experiments on brain responses in 4 subjects responding to 1000 images from the Natural Scenes Dataset (NSD), with both classical and voxel-reweighted RSA (veRSA). Using a series of models which consist of pure SimCLR-style visual self-supervision, pure CLIP-style language-alignment, or a combination of the two, we first demonstrate that language-aligned models, when controlling for dataset, are in fact no better than unimodal vision models at predicting activity in the ventral stream. We next use captions associated with the NSD images to test the brain predictivity of language embeddings from across the processing hierarchy of unimodal language models (N=24; e.g. SentenceBERT, GPT2), demonstrating that while these kinds of embeddings systematically fail to predict activity in early visual cortex, they perform on par with unimodal vision models (N=19) in occipitotemporal cortex (with classical and veRSA scores of up to 43% and 67%, respectively). Finally, in a series of text manipulation experiments (e.g. word scrambling, nouns only), we show that the predictive power of these models seems predicated almost entirely on simple nouns in no syntactic order (with veRSA scores of up to 61%). These results qualify recent excitement about language-alignment in the ventral stream, and suggest language models are only successful models of high-level vision to the extent they capture information about the objects present in an image.
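
For readers unfamiliar with the classical RSA comparison mentioned above, the following is a minimal sketch of the general technique, not the authors' pipeline. It assumes correlation distance for the dissimilarity matrices and a Pearson correlation between them (the abstract does not specify either choice), and it uses synthetic data; the function names, array sizes, and noise level are all hypothetical.

```python
import numpy as np


def rdm(responses):
    """Representational dissimilarity matrix for an (n_stimuli, n_units) array.

    Assumption: dissimilarity is 1 - Pearson correlation between stimulus rows.
    """
    return 1.0 - np.corrcoef(responses)


def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries as a flat vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def classical_rsa_score(model_features, brain_responses):
    """Correlate the model RDM with the brain RDM (Pearson, by assumption)."""
    model_vec = upper_triangle(rdm(model_features))
    brain_vec = upper_triangle(rdm(brain_responses))
    return np.corrcoef(model_vec, brain_vec)[0, 1]


# Synthetic stand-ins: 50 "images", a 64-dimensional model embedding,
# and 200 "voxels" that share a latent structure with the embedding.
rng = np.random.default_rng(0)
n_images = 50
latent = rng.normal(size=(n_images, 8))
model_features = latent @ rng.normal(size=(8, 64))
brain_responses = (latent @ rng.normal(size=(8, 200))
                   + 0.1 * rng.normal(size=(n_images, 200)))

score = classical_rsa_score(model_features, brain_responses)
self_score = classical_rsa_score(model_features, model_features)
print(f"model-to-brain RSA score: {score:.3f}")
print(f"model-to-itself RSA score: {self_score:.3f}")
```

In this sketch, a language model would enter the comparison by replacing `model_features` with caption embeddings for the same images. The voxel-reweighted variant adds a fitting step before the RDM comparison and is not shown here.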

Keywords

visual cortex, language models
