Video and Audio Data Extraction for Retrieval, Ranking and Recapitulation (VADER)
Artificial Neural Networks in Pattern Recognition (ANNPR 2018)
Abstract
With advances in neural network architectures for computer vision and language processing, multiple modalities of a video can be used for complex content analysis. Here, we propose an architecture that combines visual, audio, and text data for video analytics. The model leverages six different modules: action recognition, voiceover detection, speech transcription, scene captioning, optical character recognition (OCR), and object recognition. The proposed integration mechanism combines the output of all the modules into a text-based data structure. We demonstrate our model's performance in two applications: a clustering module, which groups a corpus of videos into labelled clusters based on their semantic similarity, and a ranking module, which returns a ranked list of videos based on a keyword. Our analysis of the precision-recall graphs shows that the multi-modal approach offers an overall performance boost over any single modality.
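The abstract describes an integration mechanism that merges the text output of all six modules into a single text-based representation per video, which a ranking module then queries by keyword. A minimal sketch of that idea is shown below; the module names, the fusion-by-concatenation step, and the term-frequency scoring are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: each modality module emits text, and the outputs
# are concatenated into one "document" per video.  Ranking then scores
# each document by keyword frequency.  This is a simplification of the
# paper's integration and ranking mechanism, not its real code.

def fuse_modalities(module_outputs):
    """Combine per-module text outputs into a single text document."""
    return " ".join(module_outputs.values())

def rank_videos(corpus, keyword):
    """Return video ids sorted by how often the keyword appears."""
    scores = {vid: doc.lower().split().count(keyword.lower())
              for vid, doc in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)

videos = {
    "v1": fuse_modalities({"ocr": "breaking news sports",
                           "speech": "the game ended in a draw"}),
    "v2": fuse_modalities({"caption": "a cat sleeping on a sofa",
                           "objects": "cat sofa pillow"}),
}
print(rank_videos(videos, "cat"))  # "v2" should rank first
```

A production system would replace the term-frequency score with something like TF-IDF or embedding similarity, but the text-based fusion step keeps all six modalities in one searchable structure, which is the core idea the abstract conveys.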
Key words
Multi-modal video analytics, LSTM, CNN