Surveillance Event Detection (SED) Discriminative Features and Interactive Feedback Utilization

Abstract
We report on the system we used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. The MED system consists of three main steps: feature extraction, detector training, and fusion. In the feature extraction step, we extract a variety of low-level features, high-level features and text features. These features are then represented in three different ways: a spatial bag-of-words with standard tiling, a spatial bag-of-words with feature- and event-specific tiling, and Gaussian Mixture Model supervectors. For detector training and fusion, two classifiers and three fusion methods are employed. Results from both the official evaluation and our internal evaluations show good performance of our system. The MER system takes a subset of the features and detection results from the MED system, from which the recounting is then generated.

1. MED System

1.1 Features

In order to encompass all aspects of a video, we extracted a wide variety of low-level and high-level features. Table 1 summarizes the features used in our system. Most of them are features widely used in the community, for example SIFT, STIP and MFCC; these were extracted with the standard code released by the respective authors, using default parameters.

Table 1: Features used in the MED'12 system
Low-level visual features: SIFT (Sande, Gevers, & Snoek, 2010); Color SIFT (CSIFT) (Sande, Gevers, & Snoek, 2010); Motion SIFT (MoSIFT) (Chen & Hauptmann, 2009); Transformed Color Histogram (TCH) (Sande, Gevers, & Snoek, 2010); STIP (Willems, Tuytelaars, & Gool, 2008); Dense Trajectory (Wang, Klaser, Schmid, & Liu, 2011)
Low-level audio features: MFCC; Acoustic Unit Descriptors (AUDs) (Chaudhuri, Harvilla, & Raj, 2011)
High-level visual features: Semantic Indexing Concepts (SIN) (Over, et al., 2012); Object Bank (Li, Su, Xing, & Fei-Fei, 2010)
High-level audio features: Acoustic Scene Analysis
Text features: Optical Character Recognition (OCR); Automatic Speech Recognition (ASR)

Besides these common features, we use two home-grown features, Motion SIFT (MoSIFT) and Acoustic Unit Descriptors (AUDs), which we introduce in the following subsections.

1.1.1 Motion SIFT (MoSIFT) Feature

The goal of the MoSIFT feature is to combine information from the spatial and the temporal domain. Local spatio-temporal features around interest points provide compact but descriptive representations for video analysis and motion recognition. Current approaches tend to extend spatial descriptors by adding a temporal component to the appearance descriptor, which only implicitly captures motion information. MoSIFT detects interest points and encodes not only their local appearance but also explicitly models the local motion; the idea is to detect distinctive local features through both local appearance and motion. Figure 1 illustrates the MoSIFT algorithm.

Figure 1: System flow chart of the MoSIFT algorithm.

The algorithm takes a pair of video frames and finds spatio-temporal interest points at multiple scales. Two major computations are applied: SIFT point detection and optical flow computation at the scale of the detected SIFT points. For the descriptor, MoSIFT adapts the grid aggregation used in SIFT to describe motion. Optical flow captures the magnitude and direction of movement and thus has the same properties as appearance gradients, so the same aggregation can be applied to optical flow in the neighborhood of interest points to increase robustness to occlusion and deformation. The two aggregated histograms (appearance and optical flow) are concatenated into the MoSIFT descriptor, which has 256 dimensions.
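The released MoSIFT code was used as-is in our system. Purely as an illustration of the idea above, the sketch below (not the authors' implementation) pairs a 128-dimensional SIFT appearance descriptor with a 128-dimensional grid-aggregated optical-flow histogram computed around the same keypoint, yielding a 256-dimensional descriptor. OpenCV's SIFT and Farneback optical flow are stand-ins, and the grid size, patch radius and the crude motion check are illustrative assumptions rather than the parameters of the real detector.

```python
# Illustrative sketch only (not the released MoSIFT code): combine a SIFT
# appearance descriptor with a grid-aggregated optical-flow histogram.
import cv2
import numpy as np

def flow_histogram(flow, kp, grid=4, bins=8, radius=24):
    """Aggregate optical flow around a keypoint into a grid x grid x bins
    histogram (128-D for the default 4x4x8 layout), mirroring how SIFT
    aggregates appearance gradients."""
    x, y = int(kp.pt[0]), int(kp.pt[1])
    h, w = flow.shape[:2]
    patch = flow[max(0, y - radius):min(h, y + radius),
                 max(0, x - radius):min(w, x + radius)]
    if patch.size == 0:
        return np.zeros(grid * grid * bins, dtype=np.float32)
    mag, ang = cv2.cartToPolar(patch[..., 0], patch[..., 1])
    hist = np.zeros((grid, grid, bins), dtype=np.float32)
    ph, pw = mag.shape
    for i in range(ph):
        for j in range(pw):
            gi = min(grid - 1, int(i * grid / ph))
            gj = min(grid - 1, int(j * grid / pw))
            b = int(ang[i, j] / (2 * np.pi) * bins) % bins
            hist[gi, gj, b] += mag[i, j]
    hist = hist.ravel()
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def mosift_like_descriptors(frame_t, frame_t1):
    """Return 256-D descriptors (SIFT appearance + flow histogram) for
    keypoints that show some motion between two consecutive frames."""
    g0 = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_t1, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    kps, desc = sift.detectAndCompute(g0, None)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    out = []
    for kp, d in zip(kps, desc if desc is not None else []):
        fh = flow_histogram(flow, kp)
        if fh.sum() > 0:  # crude stand-in for MoSIFT's motion threshold
            out.append(np.concatenate([d, fh]))
    return np.array(out, dtype=np.float32)
```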
1.1.2 Acoustic Unit Descriptors (AUDs)

We have developed an unsupervised lexicon-learning algorithm that automatically learns units of sound. Each unit spans a set of audio frames, thereby taking local acoustic context into account. Using a maximum-likelihood estimation process, we learn a set of such acoustic units from audio data without supervision. Each of these units can be thought of as a low-level fundamental unit of sound, and each audio frame is generated by these units. We refer to them as Acoustic Unit Descriptors (AUDs), and we expect the distribution of these units to carry information about the semantic content of the audio stream. Each AUD is represented by a 5-state Hidden Markov Model (HMM) with a 4-Gaussian mixture output density function (a rough illustrative sketch of this unit structure is given at the end of this report). Ideally, with a perfect learning process, we would like to learn semantically interpretable lower-level units such as a clap, a thud or a bang. Naturally, it is hard to enforce semantic interpretability on the learning process at that level of detail. Furthermore, because the space of all possible sounds is so large and only a finite set of units can be learned, many different sounds are mapped to the same unit at learning time.

1.2 Feature Representations

The previous section briefly described the raw features used in the system; this section describes the representations built on top of them. Three representations were used in our system: a k-means based spatial bag-of-words model with standard tiling (Lazebnik, Schmid, & Ponce, 2006), a k-means based spatial bag-of-words model with feature- and event-specific tiling (Viitaniemi & Laaksonen, 2009), and the Gaussian Mixture Model supervector (Campbell & Sturim, 2006). Since the spatial bag-of-words model with standard tiling and the GMM supervector are standard technology, we focus here on the spatial bag-of-words model with feature- and event-specific tiling, which for simplicity we call tiling.

The spatial bag-of-words model is a widely used representation for low-level image and video features. Its central idea is to divide the image into small tiles, which is also referred to as tiling. Figure 2 shows a few tiling examples.

Figure 2: Examples of tiling.

In general, the spatial bag-of-words model uses the 1x1, 2x2 and 4x4 tilings. The choice of these tilings is ad hoc, however, and preliminary work has shown that other tilings can yield better performance (Viitaniemi & Laaksonen, 2009). In our system, we systematically tested 80 different tilings to select the best one for each feature and each event. Table 2 compares feature-specific tiling with the standard tiling (for details of the datasets and the evaluation metric, please refer to Section 3). The metric is PMiss@TER=12.5, so lower is better; for all five features, feature-specific tiling consistently outperforms the standard tiling by roughly 1% or more.

Table 2: Performance (PMiss@TER=12.5, lower is better) of feature-specific tiling and standard tiling
Feature                   SIFT     CSIFT    TCH      STIP     MoSIFT
Feature-specific tiling   0.4209   0.4496   0.4914   0.5178   0.4330
Standard tiling           0.4325   0.4618   0.5052   0.5234   0.4456

Figure 3 shows an example of event-specific tiling versus standard tiling on a difficult event identified in our experiments, E025 (Marriage proposal). Event-specific tiling noticeably improves performance over the standard tiling.

Figure 3: Comparison of event-specific tiling and the standard-tiling baseline on event E025 (Marriage proposal); the y-axis is PMiss@TER=12.5 (lower is better) for the CSIFT, SIFT, MoSIFT, STIP and TCH features.
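As an illustration of the tiled bag-of-words representation described above, the sketch below (a simplification, not the production pipeline used for the submission) quantizes local descriptors against a k-means codebook, pools visual-word counts into per-tile histograms for a given rows x cols tiling, and concatenates the tiles. Feature- or event-specific tiling then amounts to encoding the data with several candidate tilings and keeping the one with the best cross-validated score. The codebook size and the candidate tiling list are illustrative values, not the 80 tilings actually evaluated in our experiments.

```python
# Illustrative sketch of tiled spatial bag-of-words pooling (codebook size and
# candidate tilings are example values, not the settings of the submission).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_codebook(descriptors, n_words=1000, seed=0):
    """Learn a k-means visual codebook from a sample of local descriptors."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km

def tiled_bow(points, descriptors, codebook, image_size, tiling=(2, 2)):
    """Encode one image/frame: assign each descriptor to a visual word, pool
    word counts per spatial tile, and concatenate the tile histograms.

    points      : (N, 2) array of (x, y) keypoint locations
    descriptors : (N, D) array of local descriptors (e.g. SIFT)
    image_size  : (width, height) of the image
    tiling      : (rows, cols) spatial grid, e.g. (1, 1), (2, 2), (4, 4)
    """
    rows, cols = tiling
    n_words = codebook.n_clusters
    hist = np.zeros((rows, cols, n_words), dtype=np.float32)
    if len(descriptors) == 0:
        return hist.ravel()
    words = codebook.predict(descriptors)
    w, h = image_size
    for (x, y), word in zip(points, words):
        r = min(rows - 1, int(y / h * rows))
        c = min(cols - 1, int(x / w * cols))
        hist[r, c, word] += 1.0
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)   # L1-normalize the concatenation

# Feature/event-specific tiling: evaluate a set of candidate grids per feature
# and per event, and keep the one with the best cross-validated PMiss.
CANDIDATE_TILINGS = [(1, 1), (2, 2), (4, 4), (3, 1), (1, 3), (2, 3)]
```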
1.3 Training and Fusion

We used the standard MED'12 training dataset both for our internal evaluation and for training the models of our submission. For the internal evaluation, the MED'12 training set was further split into a training set and a test set by randomly assigning half of the positive examples to the training set and the remaining half to the test set. The negative examples consisted only of NULL videos, which carry no event label.

Two classifiers were used in the system: a kernel SVM and kernelized ridge regression (for simplicity, we refer to the latter as kernel regression). For the k-means based feature representations we used the chi-square kernel, and for the GMM-based representation the RBF kernel. Model parameters were tuned by 5-fold cross-validation, with PMiss@TER=12.5 as the evaluation metric.

To combine features from multiple modalities and the outputs of different classifiers, we used fusion and ensemble methods. For a single classifier with different features, we used three fusion methods: early fusion, late fusion and double fusion (Lan, Bao, Yu, Liu, & Hauptmann, 2012). In early fusion, the kernel matrices computed from the different features are first normalized and then combined; in late fusion, the prediction scores of the models trained on the individual features are combined. Double fusion combines early and late fusion. Finally, the results from the different classifiers were ensembled together. Figure 4 shows the diagram of our system.
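As a rough sketch of this fusion step (not the exact code of our system), the example below builds a chi-square kernel per feature, averages the normalized kernels for early fusion, averages per-feature classifier scores for late fusion, and averages the early- and late-fusion scores for double fusion. The equal weights, the mean-based kernel normalization and the SVM probability outputs are assumptions for illustration; the actual combination weights were tuned by cross-validation, and kernel ridge regression scores would be ensembled in the same way.

```python
# Illustrative sketch of early / late / double fusion over chi-square kernels
# (equal weights and classifier settings are assumptions, not the tuned values).
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def fuse(train_feats, test_feats, y_train):
    """train_feats / test_feats: dicts mapping a feature name to non-negative
    bag-of-words matrices of shape (n_videos, n_dims). Returns fused scores
    for the test videos under early, late and double fusion."""
    names = sorted(train_feats)
    K_tr, K_te, late_scores = [], [], []
    for name in names:
        Ktr = chi2_kernel(train_feats[name], train_feats[name])
        Kte = chi2_kernel(test_feats[name], train_feats[name])
        # one model per feature; its scores feed late fusion
        clf = SVC(kernel="precomputed", probability=True).fit(Ktr, y_train)
        late_scores.append(clf.predict_proba(Kte)[:, 1])
        K_tr.append(Ktr / Ktr.mean())   # simple kernel normalization
        K_te.append(Kte / Ktr.mean())
    # early fusion: combine the normalized kernels and train a single model
    early_clf = SVC(kernel="precomputed", probability=True)
    early_clf.fit(np.mean(K_tr, axis=0), y_train)
    early = early_clf.predict_proba(np.mean(K_te, axis=0))[:, 1]
    late = np.mean(late_scores, axis=0)   # late fusion: average the scores
    double = 0.5 * (early + late)         # double fusion: combine both
    return {"early": early, "late": late, "double": double}
```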
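Returning to the Acoustic Unit Descriptors of Section 1.1.2, the sketch below is a deliberately simplified, hypothetical illustration of the unit structure only: it fits a single 5-state HMM with 4-component Gaussian-mixture emissions to MFCC frames by maximum likelihood and Viterbi-decodes a frame-level state sequence. The actual AUD system learns a whole lexicon of such unit HMMs jointly and without supervision, which this sketch does not reproduce; librosa and hmmlearn are assumed to be available, and the frame and unit settings are illustrative.

```python
# Much-simplified illustration of the AUD unit structure (one 5-state HMM with
# 4-Gaussian-mixture emissions over MFCC frames). The real system learns a
# whole lexicon of such units without supervision; this fits only one of them.
import numpy as np
import librosa
from hmmlearn.hmm import GMMHMM

def mfcc_frames(audio_path, sr=16000, n_mfcc=13):
    """Load audio and return MFCC frames as an (n_frames, n_mfcc) matrix."""
    y, sr = librosa.load(audio_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def fit_single_unit(frame_matrices, n_states=5, n_mix=4):
    """Maximum-likelihood fit of one 5-state, 4-mixture unit HMM on a list of
    MFCC sequences (one matrix per audio clip)."""
    X = np.vstack(frame_matrices)
    lengths = [m.shape[0] for m in frame_matrices]
    hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                 covariance_type="diag", n_iter=20)
    hmm.fit(X, lengths)
    return hmm

def frame_states(hmm, frames):
    """Viterbi-decode the per-frame state sequence for one clip; histograms of
    such unit/state occupancies could then summarize a clip for the detectors."""
    return hmm.predict(frames)
```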