We propose a new end-to-end scene recognition framework, the Recurrent Memorized Attention Network (RMAN), which performs object-based scene classification by recurrently locating and memorizing objects in the image. Based on this framework, we introduce a multi-task mechanism that continuously attends to the different essential objects in a scene image and recurrently fuses into memory the features of each object focused on by the attention model, improving scene recognition accuracy. Experimental results show that RMAN achieves better classification performance on our constructed dataset and on two public scene datasets, surpassing state-of-the-art image scene recognition approaches.
This report proposes a solution for Task 2 of the IEEE DCASE 2020 data challenge, which aims to detect anomalous machines from acoustic data. The proposed solution uses a semi variational auto-encoder. The term “semi” indicates that the resulting variational auto-encoder may not successfully reconstruct the input, since the key objective of the task is to distinguish outlier samples according to a specific feature rather than to reconstruct the input precisely. Accordingly, a few minor changes are made to the provided baseline system: a different training stop criterion and a different anomaly scoring system. The proposed method suggests that using different training stop criteria for a variational auto-encoder may serve different objectives.
Advances in Multimedia Information Processing - PCM 2010, 2010
Various techniques have been proposed to deal with the problems created by packet loss, and error concealment is one of them. Most researchers have focused on error concealment techniques in general or specifically for human speech. These techniques are not efficient for music because the structure of a music signal differs from that of speech. This paper proposes a novel error concealment scheme for classical music. The scheme can handle the loss of two consecutive packets containing more than one musical note onset. A sender report containing the positions of note onsets and the cluster information of the signal sections is sent to the receiver. The cluster information is used for error concealment when the lost packet has more than one onset or when two consecutive packets are lost. Listening tests are performed to evaluate the efficiency of the proposed scheme.
Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint
Many audio and multimedia applications would benefit if they could interpret the content of audio rather than relying on descriptions or keywords. These applications include multimedia databases and file systems, digital libraries, automatic segmentation or indexing of video (e.g., news or sports storage), and surveillance. This paper describes a novel content-based audio classification approach based on a neural network and a genetic algorithm. Experiments show that this approach achieves good classification performance.
2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763)
Music genre classification can be of great utility to musical database management. Most current classification methods are supervised and tend to be based on contrived taxonomies. However, due to the ambiguities and inconsistencies in the chosen taxonomies, these methods are not applicable to much larger databases. In this paper, we propose an unsupervised clustering method based on a given measure of similarity, which can be provided by Hidden Markov Models. In addition, to better characterize music content, a novel segmentation scheme is proposed based on analysis of the intrinsic rhythmic structure of music, and features are extracted from these segments. According to experimental results, this segmentation scheme performs better than the traditional fixed-length method. Our preliminary results also suggest that the proposed method is comparable to supervised classification methods.
Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.
In this paper, we present a novel approach for music summarization based on music structure analysis. From the audio signal, we first extract the note onsets, which represent the tempo of the song; music structure analysis is then performed based on this tempo information. After the music content has been structured into different semantic regions such as Introduction (Intro), Verse, Chorus, and Ending (Outro), the final music summary is created from the selected chorus and the music phrases immediately before or after it, so as to reach the desired summary length. In this way, we can guarantee that the summaries begin and end at meaningful music phrase boundaries, which is a difficult problem for existing music summarization methods. Experiments show that our proposed method captures the main theme of the music when compared with ideal summaries selected by music experts, and subjective user evaluation indicates that it performs well.
2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763)
Automatic music summarization is very useful for music indexing, content-based music retrieval, and on-line music distribution, but it is a challenge to automatically extract the most common and salient themes from unstructured raw music data. In this paper, we propose an effective approach to automatically summarize music content. First, a number of features are extracted to characterize the music content. Based on the extracted features, an adaptive clustering algorithm is then applied to structure the music content. Finally, the music summary is created from the clustering results and domain-related music knowledge. A user study is conducted to evaluate the quality of summarization. Experiments on different genres of music show that the summarization results are effective and meet users' actual expectations.
Proceedings of the 12th annual ACM international conference on Multimedia, 2004
In this paper, we present a novel approach for music structure analysis. A new segmentation method, beat space segmentation, is proposed and used for music chord detection and vocal/instrumental boundary detection. The wrongly detected chords in the chord pattern sequence and the misclassified vocal/instrumental frames are corrected using heuristics derived from the domain knowledge of music composition. Melody-based similarity regions are detected by matching sub-chord patterns using dynamic programming. The vocal content of the melody-based similarity regions is further analyzed to detect the content-based similarity regions. Based on the melody-based and content-based similarity regions, the music structure is identified. Experimental results are encouraging and indicate that the performance of the proposed approach is superior to that of existing methods. We believe that music structure analysis can greatly help the understanding of music semantics, which can aid music transcription, summarization, retrieval, and streaming. Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - abstracting methods, indexing methods.
Proceedings / ICIP ... International Conference on Image Processing
In this paper, a new automatic summarization approach for music videos is presented. The proposed method detects and recognizes the lyric captions that commonly appear in karaoke music videos and uses them to analyze the music video structure and identify the most salient part of the music. The summary of the music video is created based on this salient part. Experimental results show that the proposed method is promising.
Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429)
In this paper, we propose a novel approach to automatically summarize musical videos. The proposed summarization scheme is different from the current methods used for video summarization. The musical video is separated into the musical and visual tracks. A music summary is created by analyzing the music content based on music features, an adaptive clustering algorithm, and musical domain knowledge. Then, shots are detected and clustered in the visual track. Finally, the music video summary is created by aligning the music summary and the clustered video shots. Subjective studies by experienced users have been conducted to evaluate the quality of summarization. Experiments on different genres of musical video, and comparisons with summaries based only on the music track or only on the video track, indicate that the summaries produced by the proposed method are effective in meeting users' expectations.
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing
A novel compressed domain automatic music summarization approach is presented in this paper. The proposed method works directly in the compressed domain. Only the encoded subband samples are extracted and processed for characterizing music content and discovering the music structure. The experimental results and the evaluation by a subjective study have shown that the summarization based on MPEG-1 Layer 3 (MP3) music is comparable to the summarization based on uncompressed PCM music samples.
In this paper, we address the personalized travel recommendation problem by exploiting the multi-modal data available from real-world social media. A probabilistic graphical model, the Sentiment-aware Multi-modal Topic Model (SMTM), is proposed to mine the latent semantics of the multi-modal data on online travel websites. Unlike previous approaches, our approach mines topics from the tourist and attraction domains separately, disclosing the semantics of tourist topics and attraction themes. In addition, we analyze tourists' sentiments toward attractions to obtain their attitudes and to recommend attractions with appropriate sentiment on the related attraction themes. Based on SMTM, documents in the tourist domain and in the attraction domain can be compared with each other after they are projected into a mutual topic space, and this latent space projection scheme can be applied to two kinds of personalized travel recommendation: single-platform and inter-platform. Evaluation results on a real-world online travel website show the improved performance of our method.
Proceedings of the 2012 2nd International Conference on Computer and Information Applications (ICCIA 2012), 2012
Improving frequency resolution, especially in the low frequency bands, is an important problem for many applications. A novel approach based on sparse representation is proposed for this problem. We first discuss the design of an over-complete dictionary with high frequency resolution, and then use Matching Pursuit, as implemented in the Matching Pursuit Tool Kit, to perform the sparse decomposition. Next, the frequency information is extracted from the code book produced by Matching Pursuit. Finally, experiments demonstrate the improvement in frequency resolution achieved by the proposed approach.
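As an illustration of the idea in this abstract, the following is a minimal plain-Python sketch of Matching Pursuit over a dictionary of sinusoids placed on a frequency grid finer than the DFT bin spacing. It is not the Matching Pursuit Tool Kit and not the authors' dictionary design: the function names, the 0.25 Hz grid, the 200 Hz sample rate, and the one-second window are all assumptions chosen for the example.

```python
import math


def build_dictionary(n_samples, sample_rate, freqs_hz):
    """Return (frequency, unit-norm atom) pairs: one cosine and one sine per frequency."""
    atoms = []
    for f in freqs_hz:
        for wave in (math.cos, math.sin):
            atom = [wave(2 * math.pi * f * t / sample_rate) for t in range(n_samples)]
            norm = math.sqrt(sum(x * x for x in atom))
            if norm > 1e-12:  # skip degenerate atoms such as a sine at 0 Hz
                atoms.append((f, [x / norm for x in atom]))
    return atoms


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, atoms, n_iter):
    """Greedy MP: repeatedly subtract the atom most correlated with the residual."""
    residual = list(signal)
    picks = []  # (frequency, coefficient) chosen at each iteration
    for _ in range(n_iter):
        freq, atom = max(atoms, key=lambda fa: abs(dot(residual, fa[1])))
        coef = dot(residual, atom)
        residual = [r - coef * x for r, x in zip(residual, atom)]
        picks.append((freq, coef))
    return picks, residual


# Hypothetical setup: 1 s of signal at 200 Hz, so a plain DFT has 1 Hz bins,
# while the dictionary covers 5 to 15 Hz in 0.25 Hz steps.
SAMPLE_RATE = 200
N_SAMPLES = 200
GRID = [5 + 0.25 * k for k in range(41)]
ATOMS = build_dictionary(N_SAMPLES, SAMPLE_RATE, GRID)
```

Under these assumptions, calling `matching_pursuit` on a pure 10.5 Hz sine selects the 10.5 Hz atom first, a frequency that falls between two 1 Hz DFT bins. The frequencies of the selected atoms play the role of the frequency information read from the code book; a real system would use a much larger dictionary and an optimized implementation.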
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '05, 2005
In this paper, we propose a novel approach for automatic music video summarization based on audiovisual text analysis and alignment. The music video is separated into the music and video tracks. For the music track, the chorus is detected based on music structure analysis. For the video track, we first segment the shots and classify them into close-up face shots and non-face shots, then extract the lyrics and detect the most repeated lyrics from the shots. The music video summary is generated by aligning the boundaries of the detected chorus, the shot classes, and the most repeated lyrics from the music video. Experiments on chorus detection, shot classification, and lyrics detection using 20 English music videos are described. Subjective user studies have been conducted to evaluate the quality and effectiveness of the summaries. Comparisons with summaries based on our previous method and on the manual method indicate that the summaries produced by the proposed method are better at meeting users' expectations. Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - abstracting methods, indexing methods.
In this paper, we advocate the use of the uncompressed form of the i-vector and rely on subspace modeling using probabilistic linear discriminant analysis (PLDA) to handle speaker and session (or channel) variability. An i-vector is a low-dimensional vector containing both speaker and channel information acquired from a speech segment. When PLDA is used on an i-vector, dimension reduction is performed twice: first in the i-vector extraction process and second in the PLDA model. Keeping the full dimensionality of the i-vector in the i-supervector space for PLDA modeling and scoring would avoid unnecessary loss of information. We refer to the uncompressed i-vector as the i-supervector. The drawback of using the i-supervector with PLDA is the inversion of large matrices in the estimation of the full posterior distribution, which we show can be solved rather efficiently by partitioning large matrices into smaller blocks. We also introduce the Gaussianized rank-norm, as an alternative to whitening, for feature normalization prior to PLDA modeling. We found that the i-supervector performs better during normalization. Better performance is obtained by combining the i-supervector and i-vector at the score level. Furthermore, we analyze the computational complexity of the i-supervector system, compared with that of the i-vector, at four stages: loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring.
Papers by Xi Shao