A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition
arXiv (2024)
Abstract
Silent Speech Interfaces (SSIs) offer a noninvasive alternative to
brain-computer interfaces for soundless verbal communication. We introduce
Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal
alignment through novel loss functions, cross-contrast (crossCon) and
supervised temporal contrast (supTcon), to train a multimodal model with a
shared latent representation (a loss sketch follows the abstract). This
architecture enables the use of audio-only
datasets like LibriSpeech to improve silent speech recognition. Additionally,
our introduction of Large Language Model (LLM) Integrated Scoring Adjustment
(LISA) significantly improves recognition accuracy (see the rescoring sketch
after the abstract). Together, MONA LISA reduces the state-of-the-art word
error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset
for silent speech on an open vocabulary. For vocal EMG recordings, our method
improves the state-of-the-art from 23.3% to 3.7% WER. In the Brain-to-Text
2024 competition, LISA performs best, improving the top WER from 9.8% to
8.9%. To our knowledge, this work represents the first instance where
noninvasive silent speech recognition on an open vocabulary has cleared the
threshold of 15% WER, demonstrating that SSIs can be a viable alternative to
automatic speech recognition (ASR). Our work not only narrows
the performance gap between silent and vocalized speech but also opens new
possibilities in human-computer interaction, demonstrating the potential of
cross-modal approaches in noisy and data-limited regimes.
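
The abstract names crossCon but does not give its formula. Below is a
minimal sketch of a cross-modal contrastive loss in PyTorch, assuming
utterance-level EMG and audio embeddings in a shared latent space; the
function name, the symmetric InfoNCE formulation, and the temperature value
are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of a crossCon-style cross-modal contrastive loss (assumptions:
# paired EMG/audio clips, utterance-level embeddings, InfoNCE form).
import torch
import torch.nn.functional as F

def cross_contrastive_loss(emg_emb: torch.Tensor,
                           audio_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (EMG, audio) embeddings.

    emg_emb, audio_emb: (batch, dim) embeddings. Matching pairs share the
    same batch index; all other pairings serve as negatives.
    """
    emg = F.normalize(emg_emb, dim=-1)
    aud = F.normalize(audio_emb, dim=-1)
    logits = emg @ aud.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(emg.size(0), device=emg.device)
    # Pull matched pairs together and push mismatched pairs apart,
    # in both the EMG-to-audio and audio-to-EMG directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

A loss of this shape only needs embeddings in the shared latent space, which
is consistent with the abstract's claim that audio-only corpora such as
LibriSpeech can be folded into training to improve silent speech recognition.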
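The abstract likewise does not spell out LISA's mechanism. One plausible
minimal sketch, assuming the recognizer's top-k beam-search hypotheses are
handed to an LLM that returns a corrected transcript; `lisa_rescore`,
`ask_llm`, and the prompt wording are hypothetical placeholders rather than
the paper's exact setup.

```python
# Sketch of LLM-based scoring adjustment in the spirit of LISA
# (assumption: the LLM selects/merges among beam-search candidates).
from typing import Callable, List

def lisa_rescore(hypotheses: List[str],
                 ask_llm: Callable[[str], str]) -> str:
    """Correct candidate transcripts with an LLM.

    hypotheses: top-k transcripts from the decoder's beam search.
    ask_llm: any function that sends a prompt to an LLM and returns its reply.
    """
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "The following are candidate transcriptions of the same utterance, "
        "produced by a silent-speech recognizer. Reply with only the single "
        "most likely intended sentence.\n" + numbered
    )
    return ask_llm(prompt).strip()
```

Because this step post-processes text only, the same adjustment can sit on
top of different decoders, which fits the abstract's report of applying LISA
to both EMG recognition and the Brain-to-Text 2024 competition output.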