A Touch, Vision, and Language Dataset for Multimodal Alignment
CoRR (2024)
Abstract
Touch is an important sensing modality for humans, but it has not yet been
incorporated into a multimodal generative language model. This is partially due
to the difficulty of obtaining natural language labels for tactile data and the
complexity of aligning tactile readings with both visual observations and
language descriptions. As a step towards bridging that gap, this work
introduces a new dataset of 44K in-the-wild vision-touch pairs, with English
language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder
for open-vocabulary classification and a touch-vision-language (TVL) model for
text generation using the trained encoder. Results suggest that by
incorporating touch, the TVL model improves (+29% classification accuracy)
touch-vision-language alignment over existing models trained on any pair of
those modalities. Although only a small fraction of the dataset is
human-labeled, the TVL model demonstrates improved visual-tactile understanding
over GPT-4V (+12%) and open-source vision-language models (+32%) on a new
touch-vision understanding benchmark. Code and data:
https://tactile-vlm.github.io.
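
The abstract's two technical ideas are (1) contrastively aligning a tactile encoder to a shared vision-language embedding space, and (2) using the aligned encoder for open-vocabulary classification by cosine similarity against text label embeddings. The following is a minimal, self-contained sketch of those ideas, not the authors' released code: the encoder architecture, embedding width, temperature, and all names here are illustrative assumptions.

```python
# Sketch only: toy stand-ins for the TVL tactile encoder and alignment loss.
import torch
import torch.nn.functional as F

EMBED_DIM = 512  # assumed shared embedding width


class TactileEncoder(torch.nn.Module):
    """Toy stand-in tactile encoder: flatten the reading + linear projection."""

    def __init__(self, in_dim: int = 3 * 224 * 224):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, EMBED_DIM)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products below are cosine similarities.
        return F.normalize(self.proj(x.flatten(1)), dim=-1)


def infonce_loss(tactile_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE: pull each tactile embedding toward its paired
    (frozen) vision or text embedding, push it away from other pairs."""
    logits = tactile_emb @ target_emb.T / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2


def open_vocab_classify(tactile_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the touch."""
    return [labels[int(i)] for i in (tactile_emb @ text_embs.T).argmax(dim=-1)]


if __name__ == "__main__":
    encoder = TactileEncoder()
    touch = torch.randn(4, 3, 224, 224)  # batch of tactile readings
    # Random stand-ins for frozen vision-language embeddings of the pairs.
    vision_emb = F.normalize(torch.randn(4, EMBED_DIM), dim=-1)
    loss = infonce_loss(encoder(touch), vision_emb)
    loss.backward()  # gradient for one alignment step

    labels = ["smooth", "rough", "squishy", "hard"]
    text_embs = F.normalize(torch.randn(len(labels), EMBED_DIM), dim=-1)
    print(open_vocab_classify(encoder(touch), text_embs, labels))
```

In a real pipeline the random vision/text embeddings would come from a frozen pretrained vision-language model, so the tactile encoder is pulled into an embedding space where text queries already live; that is what makes zero-shot, open-vocabulary classification possible without tactile-specific classifiers.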