A Multi-modal Approach to Mining Intent from Code-Mixed Hindi-English Calls in the Hyperlocal-Delivery Domain.

SPECOM (2022)

Abstract
In this work, we outline an approach to mining insights from calls between delivery partners (DPs) and customers involved in hyperlocal food delivery in India. Incorrect addresses or locations, or other impediments, prompt DPs to call customers and lead to suboptimal experiences such as breaches of the promised arrival time, cancellations, fraud, etc. We demonstrate an end-to-end system that takes a multimodal approach, combining data across the speech, text and geospatial domains to extract the intent behind these calls. To transcribe calls to text, we develop an Automatic Speech Recognition (ASR) engine that works in the Indian context, where calls are typically highly code-mixed (in our case, Hindi and English) and vary in dialect and pronunciation. Additionally, in the hyperlocal delivery space, calls are corrupted by high levels of background noise due to the nature of the business. Starting with Wav2Vec2.0 as the base, we carried out a series of data- and model-based experiments that progressively reduced the word error rate (WER) from 85.30% to 31.17%. The transcripts from the ASR engine are encoded into embeddings by adapting an IndicBERT-based model. Features extracted from the geospatial markers of the calls are concatenated with the embeddings and passed through an XGBoost classification head to classify each call into one of three intents. Through ablation studies, we show incremental improvements attributable to signals from the different modalities. The best-performing multimodal model has a macro-average precision of 68.33%, a lift of 29.3 percentage points (pp) over the baseline that does not use all the modalities.
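The two metrics quoted in the abstract, WER and macro-average precision, can be illustrated with a minimal sketch. This is not the authors' evaluation code: it assumes the standard definitions (word-level edit distance divided by reference length, and the unweighted mean of per-class precision). The function names, the sample transcripts and the three intent labels are hypothetical, chosen only for illustration; the abstract does not name the intents.

```python
from typing import Hashable, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def macro_precision(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    labels: Sequence[Hashable],
) -> float:
    """Unweighted mean of per-class precision; a class never predicted scores 0."""
    per_class = []
    for label in labels:
        # True labels of every example that the model assigned to this class.
        truths_for_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        if truths_for_predicted:
            correct = sum(1 for t in truths_for_predicted if t == label)
            per_class.append(correct / len(truths_for_predicted))
        else:
            per_class.append(0.0)
    return sum(per_class) / len(labels)


# Hypothetical code-mixed transcript pair: one substituted word out of six.
wer = word_error_rate(
    "bhaiya gate number do pe aana",
    "bhaiya gate number two pe aana",
)

# Hypothetical intent labels; the paper's actual three intents are not listed here.
intents = ["address_issue", "payment_issue", "other"]
truth = ["address_issue", "address_issue", "payment_issue", "other"]
predicted = ["address_issue", "payment_issue", "payment_issue", "other"]
precision = macro_precision(truth, predicted, intents)
```

Because the macro average weights every intent equally, a rare intent influences the score as much as a common one. Assuming "pp" denotes percentage points, the reported figures also imply a baseline macro-average precision of roughly 68.33 - 29.3 = 39.03%.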
Keywords
Automatic speech recognition, Wav2Vec2.0, Multi-modal models