Digits micro-model for accurate and secure transactions
CoRR (2024)
Abstract
Automatic Speech Recognition (ASR) systems are used in the financial domain
to enhance the caller experience by enabling natural language understanding and
facilitating efficient and intuitive interactions. Increasing use of ASR
systems requires that such systems exhibit very low error rates. The
predominant ASR models used to collect numeric data are large, general-purpose
models, either commercial, such as Google Speech-to-Text (STT) or Amazon
Transcribe, or open source, such as OpenAI's Whisper. Such ASR models are trained on hundreds of
thousands of hours of audio data and require considerable resources to run.
Despite recent progress in large speech recognition models, we highlight the
potential of smaller, specialized "micro" models. Such light models can be
trained to perform well on number-recognition tasks, competing with
general models like Whisper or Google STT while using less than 80 minutes of
training time and occupying at least an order of magnitude less memory. Also,
unlike larger speech recognition models, micro-models are trained on carefully
selected and curated datasets, which makes them highly accurate, agile, and
easy to retrain, while using low compute resources. We present our work on
creating micro models for multi-digit number recognition that handle diverse
speaking styles reflecting real-world pronunciation patterns. Our work
contributes to domain-specific ASR models, improving digit recognition
accuracy and data privacy. As an added advantage, their low resource
consumption allows them to be hosted on-premises, keeping private data local
instead of uploading it to an external cloud. Our results indicate that our
micro-model makes fewer errors than the best-of-breed commercial or open-source
ASRs in recognizing digits (1.8
error rate of Whisper), and has a low memory footprint (0.66 GB VRAM for our
model versus 11 GB VRAM for Whisper).