TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
CoRR (2024)
Abstract
The capability to jointly process multi-modal information is becoming
essential. However, the limited amount of paired multi-modal data and the
large computational requirements of multi-modal learning hinder
development. We propose a novel Tri-Modal Translation (TMT) model that
translates between arbitrary modalities spanning speech, image, and text. We
introduce a novel viewpoint, where we interpret different modalities as
different languages, and treat multi-modal translation as a well-established
machine translation problem. To this end, we tokenize speech and image data
into discrete tokens, which provide a unified interface across modalities and
significantly decrease the computational cost. In the proposed TMT, a
multi-modal encoder-decoder conducts the core translation, whereas
modality-specific processing is conducted only within the tokenization and
detokenization stages. We evaluate the proposed TMT on all six modality
translation tasks. TMT consistently outperforms single-model counterparts,
demonstrating that unifying tasks is beneficial not only for practicality but
also for performance.
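
The pipeline described in the abstract can be illustrated with a minimal sketch: each modality is first mapped to a sequence of discrete tokens, and a single shared encoder-decoder translates between tagged token sequences, exactly as source/target language tags are used in machine translation. All names below are hypothetical illustrations, not the authors' code; the tokenizers (e.g., clustered speech units or vector-quantized image codes) and the trained sequence-to-sequence model are assumed and stubbed out.

```python
# Minimal sketch of the TMT idea: modalities as "languages" over a unified
# discrete-token interface. Hypothetical names; not the authors' implementation.
from typing import Callable, Dict, List

Tokens = List[int]

# Modality tags let one shared encoder-decoder route among all six
# translation directions (speech/image/text, in both directions).
MODALITY_TAGS: Dict[str, int] = {"speech": 0, "image": 1, "text": 2}


def translate(
    source: Tokens,
    src_modality: str,
    tgt_modality: str,
    encoder_decoder: Callable[[Tokens], Tokens],
) -> Tokens:
    """Translate a discrete token sequence between any pair of modalities.

    Modality-specific processing happens only in the (assumed) tokenization
    and detokenization stages; the core translation is modality-agnostic.
    """
    tagged = [MODALITY_TAGS[src_modality], MODALITY_TAGS[tgt_modality]] + source
    return encoder_decoder(tagged)


if __name__ == "__main__":
    # Stub model that echoes the payload; a real system would use a trained
    # sequence-to-sequence transformer over the unified token vocabulary.
    def stub_model(tokens: Tokens) -> Tokens:
        return tokens[2:]  # drop the two modality tags

    speech_tokens = [17, 42, 42, 8]  # e.g., output of a speech tokenizer
    text_tokens = translate(speech_tokens, "speech", "text", stub_model)
    print(text_tokens)
```

The key design choice this sketch reflects is that once every modality is discretized, the translation core needs no modality-specific layers, which is what lets a single model cover all six tasks.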