2006
When dealing with qualitative analysis of large sets of data, the sheer volume of recorded information can make detailed analysis a time-consuming and labour-intensive task. In the case of a hypothesis-driven experiment, much of the data may not be relevant, so searching through it for periods of interest can waste a great deal of an analyst's time; in the specific case of video data, an analyst may be required to watch thousands of hours of footage in search of evidence.
Image and Vision Computing, 2008
Multimedia databases have gained popularity due to rapidly growing quantities of multimedia data and the need to perform efficient indexing, retrieval and analysis of this data. One downside of multimedia databases is the necessity to process the data for feature extraction and labeling prior to storage and querying. The huge amount of data makes it impossible to complete this task manually. We propose a tool for the automatic detection and tracking of salient objects, and the derivation of spatio-temporal relations between them, in video. Our system aims to significantly reduce the work of manually selecting and labeling objects by detecting and tracking the salient objects, so that the label for each object needs to be entered only once within each shot rather than in every frame in which the object appears. This is also required as a first step in a fully automatic video database management system, in which the labeling should also be done automatically. The proposed framework covers a scalable architecture for video processing and the stages of shot boundary detection, salient object detection and tracking, and knowledge-base construction for effective spatio-temporal object querying.
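As a rough illustration of the labeling economy this enables, the sketch below propagates a single user-supplied label through one shot by tracking the object, so the annotator enters the label once rather than per frame. OpenCV's CSRT tracker is an assumed stand-in; the paper's own detection/tracking pipeline is not reproduced here.

```python
# Minimal sketch: label an object once per shot, then propagate the label by tracking.
import cv2

def label_object_in_shot(video_path, first_frame_bbox, label):
    """Track one salient object through a shot; return (frame_idx, bbox, label) records."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return []
    tracker = cv2.TrackerCSRT_create()       # some builds expose cv2.legacy.TrackerCSRT_create()
    tracker.init(frame, first_frame_bbox)    # bbox = (x, y, w, h), chosen once by the annotator
    records = [(0, first_frame_bbox, label)]
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        ok, bbox = tracker.update(frame)
        if not ok:
            break                            # tracking lost: shot boundary or re-initialization needed
        records.append((idx, tuple(int(v) for v in bbox), label))
    cap.release()
    return records
```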
Imaging and Multimedia Analytics in a Web and Mobile World 2014, 2014
The aim of this work is to detect the events in video sequences that are salient with respect to the audio signal. In particular, we focus on the audio analysis of a video, with the goal of identifying which features are significant for detecting audio-salient events. We extracted the audio tracks from videos of different sport events and, for each video, manually labeled the salient audio events with binary markings. For each frame, features in both the time and frequency domains were considered. These features were used to train different classifiers: Classification and Regression Trees, Support Vector Machines, and k-Nearest Neighbors. The classification performances are reported in terms of confusion matrices.
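A minimal sketch of this setup, with RMS energy, zero-crossing rate and spectral centroid as illustrative stand-ins for the paper's time- and frequency-domain features, and scikit-learn's CART, SVM and k-NN implementations:

```python
# Per-frame audio features (time and frequency domain) feeding three standard classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier    # CART
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def frame_features(frames, sr):
    """frames: (n_frames, frame_len) array of audio samples."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))                        # time domain: energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # time domain: zero crossings
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-9)  # frequency domain
    return np.column_stack([rms, zcr, centroid])

def evaluate(frames, labels, sr=22050):
    X, y = frame_features(frames, sr), labels      # y: binary salient / non-salient marks
    for clf in (DecisionTreeClassifier(), SVC(), KNeighborsClassifier()):
        clf.fit(X, y)                              # confusion matrices would follow via cross-validation
        print(type(clf).__name__, clf.score(X, y))
```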
In this paper we identify the frequently appearing object in a video, called the thematic object. Identifying this object is helpful for object search and for tagging the object. To identify the thematic pattern in the video, an object is given as input, and its corner points are found using the Harris corner detection algorithm. The similarity between the reference image and a test frame is then found by extracting descriptors around the corner points. We mine the video to identify the common patterns that appear in it. The proposed approach helps identify the object even under partial occlusion and variation in viewpoint.
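A hedged sketch of the matching step as described: Harris corners on the reference object, descriptors computed around them, and brute-force matching against a test frame. ORB descriptors are assumed here for concreteness; the paper's descriptor choice may differ.

```python
import cv2
import numpy as np

def match_object(reference_gray, frame_gray, harris_thresh=0.01):
    # 1) Harris corner response on the reference object
    response = cv2.cornerHarris(np.float32(reference_gray), blockSize=2, ksize=3, k=0.04)
    ys, xs = np.where(response > harris_thresh * response.max())
    keypoints = [cv2.KeyPoint(float(x), float(y), 7) for x, y in zip(xs, ys)]

    # 2) Descriptors extracted around the corner points
    orb = cv2.ORB_create()
    keypoints, ref_desc = orb.compute(reference_gray, keypoints)
    frame_kp, frame_desc = orb.detectAndCompute(frame_gray, None)
    if ref_desc is None or frame_desc is None:
        return []

    # 3) Similarity between reference image and test frame via descriptor matching
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(ref_desc, frame_desc)
    return sorted(matches, key=lambda m: m.distance)   # many close matches => object likely present
```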
Signal Processing: Image Communication, 2009
Computer vision applications often need to process only a representative part of the visual input rather than the whole image/sequence. Considerable research has been carried out into salient region detection methods based either on models emulating human visual attention (VA) mechanisms or on computational approximations. Most of the proposed methods are bottom-up and their major goal is to filter out redundant visual information. In this paper, we propose and elaborate on a saliency detection model that treats a video sequence as a spatiotemporal volume and generates a local saliency measure for each visual unit (voxel). This computation involves an optimization process incorporating inter- and intra-feature competition at the voxel level. Perceptual decomposition of the input, spatiotemporal center-surround interactions and the integration of heterogeneous feature conspicuity values are described, and an experimental framework for video classification is set up. This framework consists of a series of experiments that show the effect of saliency on classification performance and let us draw conclusions on how well the detected salient regions represent the visual input.
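The paper's optimization-based inter/intra-feature competition is not reproduced below; as a generic approximation, the sketch computes a per-voxel center-surround difference on a spatiotemporal volume, with the two smoothing scales as assumed parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def voxel_saliency(volume, center_sigma=(1, 2, 2), surround_sigma=(4, 8, 8)):
    """volume: (t, h, w) float array, e.g. stacked grayscale frames."""
    v = volume.astype(np.float64)
    center = gaussian_filter(v, sigma=center_sigma)       # local (center) response
    surround = gaussian_filter(v, sigma=surround_sigma)   # contextual (surround) response
    saliency = np.abs(center - surround)                  # center-surround difference per voxel
    return saliency / (saliency.max() + 1e-9)             # normalized local saliency measure
```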
International Journal on Artificial Intelligence Tools, 2014
This work studies the collective intelligence behavior of Web users who share and watch video content. It is proposed that the aggregated video activity of users exhibits characteristic patterns, which may be used to infer important video scenes, leading to collective intelligence about the video content. To this end, experimentation is based on users' interactions (e.g., pause, seek/scrub) gathered in a controlled user experiment with information-rich videos. Collective information-seeking behavior is then modeled by means of the corresponding probability distribution function. The bell-shaped reference patterns are shown to correlate significantly with predefined scenes of interest for each video, as annotated by the users. In this way, the observed collective intelligence may be used to provide a video-segment detection tool that identifies the importance of video scenes. Accordingly, both a stochastic and a pattern-matching approach are applied to the users' interaction information. The results indicate increased accuracy in identifying the areas selected by users as containing high-importance information. In practice, the proposed techniques might improve both navigation within videos on the web and video search results with personalised video thumbnails.
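As an illustrative reading of the pattern-matching approach: aggregate interaction timestamps into a time histogram and correlate it with a bell-shaped (Gaussian) reference pattern; correlation peaks mark candidate scenes of interest. The bin width and pattern width below are assumptions.

```python
import numpy as np

def interest_curve(event_times, video_len, bin_sec=5, pattern_width_bins=6):
    """event_times: seconds at which users paused/seeked; video_len: seconds."""
    bins = np.arange(0, video_len + bin_sec, bin_sec)
    hist, _ = np.histogram(event_times, bins=bins)        # aggregated user activity per bin
    x = np.arange(-3 * pattern_width_bins, 3 * pattern_width_bins + 1)
    bell = np.exp(-0.5 * (x / pattern_width_bins) ** 2)   # bell-shaped reference pattern
    bell /= bell.sum()
    score = np.correlate(hist.astype(float), bell, mode="same")
    return bins[:-1], score                               # high score => candidate important segment

# e.g. starts, scores = interest_curve([120, 122, 125, 300], video_len=600)
```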
IEEE Transactions on Image Processing, 2007
We propose a new statistical generative model for spatiotemporal video segmentation. The objective is to partition a video sequence into homogeneous segments that can be used as "building blocks" for semantic video segmentation. The baseline framework is a Gaussian mixture model (GMM)-based video modeling approach that involves a six-dimensional spatiotemporal feature space. Specifically, we introduce the concept of frame saliency to quantify the relevancy of a video frame to the GMM-based spatiotemporal video modeling. This helps us use a small set of salient frames to facilitate the model training by reducing data redundancy and irrelevance. A modified expectation maximization algorithm is developed for simultaneous GMM training and frame saliency estimation, and the frames with the highest saliency values are extracted to refine the GMM estimation for video segmentation. Moreover, it is interesting to find that frame saliency can imply some object behaviors. This makes the proposed method also applicable to other frame-related video analysis tasks, such as key-frame extraction, video skimming, etc. Experiments on real videos demonstrate the effectiveness and efficiency of the proposed method.
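A simplified sketch of the modeling idea, using scikit-learn's standard EM rather than the paper's modified EM with joint frame-saliency estimation; the frame score used here (mean log-likelihood of a frame's sampled 6-D features) is only a proxy for the paper's saliency measure.

```python
# Six-dimensional spatiotemporal features (x, y, t, L, a, b) fit with a GMM;
# per-frame mean log-likelihood serves as a rough frame-relevancy score.
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_scores(frames_lab, n_components=8, subsample=500):
    """frames_lab: list of (h, w, 3) Lab-color frames."""
    feats, owner = [], []
    for t, f in enumerate(frames_lab):
        h, w, _ = f.shape
        ys = np.random.randint(0, h, subsample)           # subsample pixels per frame
        xs = np.random.randint(0, w, subsample)
        feats.append(np.column_stack(
            [xs, ys, np.full(subsample, t), f[ys, xs, 0], f[ys, xs, 1], f[ys, xs, 2]]))
        owner.append(np.full(subsample, t))
    X, owner = np.vstack(feats), np.concatenate(owner)
    gmm = GaussianMixture(n_components=n_components).fit(X)
    ll = gmm.score_samples(X)                             # log-likelihood per sample
    return np.array([ll[owner == t].mean() for t in range(len(frames_lab))])
```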
2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007
Recently, more and more research has focused on concept extraction from unstructured video data. To bridge the semantic gap between low-level features and high-level video concepts, a mid-level understanding of the video content, i.e., the salient object, is detected based on image segmentation and machine learning techniques. Specifically, 21 salient object detectors are developed and tested on the TRECVID 2005 development video corpus. In addition, a boosting method is proposed to select the most representative features, achieving higher performance than using a single modality alone and lower complexity than taking all features into account.
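One hedged way to picture the boosting-based feature selection: fit AdaBoost with decision stumps over concatenated multi-modality features and keep the top-k features by importance. The detector ensemble itself is omitted; the names and parameters below are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def select_features(X, y, k=20):
    """X: (n_samples, n_features) concatenated multi-modality features; y: salient-object labels."""
    booster = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps (base_estimator= in scikit-learn < 1.2)
        n_estimators=200,
    ).fit(X, y)
    top = np.argsort(booster.feature_importances_)[::-1][:k]
    return top, X[:, top]                               # indices of the k most representative features
```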
2005
This paper presents a new model of human attention that allows salient areas to be extracted from video frames. As automatic understanding of video semantic content is still far from being achieved, the attention model tends to mimic the focus of the human visual system. Most existing approaches extract the saliency of images for use in multiple applications, but they are not compared to human perception.
IEICE Transactions on Information and Systems, 2005
This paper presents a framework for automatic video region-of-interest determination based on a visual attention model. We view this work as a preliminary step towards the solution of high-level semantic video analysis. To address this challenging issue, we combine video attention features with knowledge of computational media aesthetics. The three types of visual attention features we use are intensity, color, and motion. Following aesthetic principles, these features are combined according to camera motion type on the basis of a newly proposed video analysis unit, the frame-segment. We conduct subjective experiments on several kinds of video data and demonstrate the effectiveness of the proposed framework.
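The sketch below illustrates only the fusion step: per-frame-segment intensity, color and motion maps combined with weights chosen by camera motion type. The weight table is hypothetical; the paper derives its combination rules from media aesthetics.

```python
import numpy as np

WEIGHTS = {            # (intensity, color, motion) - illustrative values only
    "static": (0.4, 0.4, 0.2),
    "pan":    (0.25, 0.25, 0.5),
    "zoom":   (0.3, 0.2, 0.5),
}

def fuse_attention(intensity_map, color_map, motion_map, camera_motion):
    """All maps: same-shape float arrays for one frame-segment."""
    wi, wc, wm = WEIGHTS.get(camera_motion, (1 / 3, 1 / 3, 1 / 3))
    fused = wi * intensity_map + wc * color_map + wm * motion_map
    return fused / (fused.max() + 1e-9)   # combined region-of-interest map
```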
Increasingly popular, smart TVs and set-top boxes open new ways for richer experiences in our living rooms. But to offer richer and novel functionalities, a better understanding of the multimedia content is crucial. While many works try to automatically annotate videos at the object level, or to classify them, we think that investigating emotions through digital analysis and processing techniques will allow great improvements to the TV experience. With our work, we propose a temporal saliency detection approach capable of identifying the most exciting parts of a video, those that will be of the most interest to users. To identify the most interesting events without classifying them (in order to remain independent of the video domain), we compute a time series of arousal (the excitement level of the content) based on audiovisual features. Our goal is to merge this preliminary work with user emotion analysis, in order to create a multi-modal system, allowing to bridge the gap bet...
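A minimal sketch of such an arousal curve, assuming two already-aligned per-frame signals (short-term audio energy and frame-difference motion magnitude) and a moving-average smoother; the equal weighting is an assumption, not the paper's feature set.

```python
import numpy as np

def arousal_curve(audio_energy, motion_magnitude, window=25):
    """Both inputs: 1-D arrays sampled per video frame, already aligned."""
    a = (audio_energy - audio_energy.min()) / (np.ptp(audio_energy) + 1e-9)
    m = (motion_magnitude - motion_magnitude.min()) / (np.ptp(motion_magnitude) + 1e-9)
    raw = 0.5 * a + 0.5 * m                       # equal weights: an assumption
    kernel = np.ones(window) / window
    return np.convolve(raw, kernel, mode="same")  # smoothed excitement level over time
```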
Multimedia Tools and Applications, 2019
The ubiquitous utilization of video applications in recent years has made research on video quality of experience paramount. Lack of sufficient bandwidth deters the effective transmission of raw video content to users. This bandwidth challenge has given rise to encoders that compress digital video content for transmission over an internet protocol infrastructure. However, transmitting compressed video color images still has an intrinsic limitation of high bandwidth consumption. The simple linear iterative clustering (SLIC) algorithm was applied for binary segmentation of video color images to circumvent the challenge of efficiently transmitting video content. Compressed binary segmented images are generally fast to transmit and require lower bandwidth than compressed video color images. However, since color images contain more useful information than their binary counterparts, the binary segmentation results were evaluated using the mean opinion score metric to determine the user quality of experience of the transmitted video content. The practical application of our method will lead to the development of a novel encoder that can deliver binary video content faster, hence easing the bandwidth bottleneck.
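A minimal sketch of the segmentation step, using scikit-image's SLIC implementation and a global-mean brightness threshold as an assumed binarization rule (the paper's criterion may differ):

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2gray

def binary_segment(frame_rgb, n_segments=200):
    """SLIC superpixels on a color frame, then per-superpixel binarization."""
    labels = slic(frame_rgb, n_segments=n_segments, start_label=0)
    gray = rgb2gray(frame_rgb)
    out = np.zeros_like(gray, dtype=np.uint8)
    for lab in np.unique(labels):
        mask = labels == lab
        out[mask] = 255 if gray[mask].mean() > gray.mean() else 0   # global-mean threshold
    return out   # compact binary image, cheaper to compress and transmit
```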
1998
The analysis of video data targeting the identification of relevant objects and the extraction of associated descriptive characteristics will be the enabling factor for a number of multimedia applications. This process has intrinsic difficulties, and since semantic criteria are difficult to express, usually only a part of the desired analysis results can be automatically achieved. For many applications, the automatic tools can be complemented with user guidance to improve performance. This paper proposes an integrated framework for video analysis, addressing the video segmentation and feature extraction problems. The framework includes a set of modules that can be combined following specific application needs. It includes both automatic (more objective) and user-interaction (more semantic) analysis modules. The paper also proposes a specific segmentation solution for one of the most relevant application scenarios considered: off-line applications requiring precise segmentation.
Internet, Multimedia Systems and Applications, 2002
This paper reviews and analyses the problems facing video classification. It investigates how the semantic gap can be bridged. It presents a new taxonomy for video classification based on a literature survey. It concludes that narrowing the domain is the current approach to bridging the semantic gap.
Movie shot classification is a vital but challenging task due to the variety of movie genres, the different movie shooting techniques, and the much wider range of shot types than in other video domains. A variety of shot types is used in movies to attract the audience's attention and enhance their watching experience. In this paper, we introduce context saliency to measure the visual attention distributed in keyframes for movie shot classification. Different from traditional saliency maps, the context saliency map is generated by removing redundancy from contrast saliency and incorporating geometry constraints. Context saliency is then combined with color and texture features to generate feature vectors. A Support Vector Machine (SVM) is used to classify keyframes into pre-defined shot classes. Unlike existing works that either operate within a certain movie genre or classify movie shots into a limited set of directing-semantic classes, the proposed method has three unique features: 1) context saliency significantly improves movie shot classification; 2) our method works for all movie genres; 3) our method deals with the most common types of video shots in movies. The experimental results indicate that the proposed method is effective and efficient for movie shot classification.
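A hedged sketch of the classification stage: a keyframe represented by saliency statistics plus simple color and texture descriptors, fed to an SVM. The descriptors below (gray-level histogram, gradient statistics) are illustrative stand-ins, and the context-saliency computation itself is omitted.

```python
import numpy as np
from sklearn.svm import SVC

def keyframe_vector(saliency_map, frame_gray):
    """frame_gray: (h, w) uint8 grayscale keyframe; saliency_map: same-shape floats."""
    color_hist, _ = np.histogram(frame_gray, bins=16, range=(0, 255), density=True)
    gy, gx = np.gradient(frame_gray.astype(float))
    grad = np.hypot(gx, gy)
    texture = [grad.mean(), grad.std()]                              # crude texture statistics
    sal = [saliency_map.mean(), saliency_map.std(), saliency_map.max()]
    return np.concatenate([sal, texture, color_hist])

def train_shot_classifier(vectors, shot_labels):
    return SVC(kernel="rbf").fit(np.stack(vectors), shot_labels)     # pre-defined shot classes
```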
International Journal of Digital Multimedia Broadcasting, 2010
Journal of Educational Multimedia and Hypermedia, 1993
This paper proposes a new approach for researchers who analyze video data, recommending that data be layered in as many ways as possible as they are selected, coded, and annotated. Although video has become an important source of data over the past decade, the problem facing researchers is that interpreting video is fundamentally different from interpreting text. Given its multi-grained nature, how do we put together and then make sense of the chunked clusters of video? Furthermore, how do we share our views about our video data with our colleagues who may be using part or all of the same set of data? To address these questions, I will: 1) examine some of the theoretical issues underlying the inherent complexity of working with video data, 2) describe the video ethnography of a particular graduate student user who is working with video data, and then, 3) explain the use of a tool called a "Significance Measure" which allows users to layer or weigh the relative importance o...
2017
Video highlights are a selection of the most interesting parts of a video. The problem of highlight detection has been explored for video domains such as egocentric, sports, movie, and surveillance videos. Existing methods are limited to finding visually important parts of the video but do not necessarily learn semantics. Moreover, the available benchmark datasets contain audio-muted, single-activity, short videos, which lack any context apart from a few keyframes that can be used to understand them. In this work, we explore highlight detection in the TV series domain, which features complex interactions with the surroundings. Existing methods would fare poorly at capturing the video semantics in such videos. To incorporate the importance of dialogues/audio, we propose using descriptions of the shots of the video as cues for learning visual importance. Note that while the audio information is used to determine visual importance during training, highlight detection still works using only the visual information from videos. We use publicly available text ranking algorithms to rank the descriptions. The ranking scores are used to train a visual pairwise shot ranking model (VPSR) to find the highlights of the video. Results are reported on TV series videos of the VideoSet dataset and a season of the Buffy the Vampire Slayer TV series.
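As a sketch of the supervision scheme: score shot descriptions with a TextRank-style algorithm (PageRank over a word-overlap sentence-similarity graph, a deliberate simplification of the "publicly available text ranking algorithms"), then turn the scores into ordered shot pairs for training a pairwise ranker. The VPSR model itself is not reproduced.

```python
from itertools import combinations
import numpy as np
import networkx as nx

def textrank_scores(descriptions):
    """PageRank over a word-overlap similarity graph of shot descriptions."""
    words = [set(d.lower().split()) for d in descriptions]
    g = nx.Graph()
    g.add_nodes_from(range(len(descriptions)))
    for i, j in combinations(range(len(descriptions)), 2):
        overlap = len(words[i] & words[j]) / (1 + len(words[i] | words[j]))
        if overlap > 0:
            g.add_edge(i, j, weight=overlap)
    pr = nx.pagerank(g, weight="weight")
    return np.array([pr[i] for i in range(len(descriptions))])

def ranking_pairs(scores, margin=0.0):
    """Yield (higher, lower) shot-index pairs to supervise a visual pairwise ranker."""
    return [(i, j) if scores[i] > scores[j] else (j, i)
            for i, j in combinations(range(len(scores)), 2)
            if abs(scores[i] - scores[j]) > margin]
```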
Multimedia Tools and Applications, 2014
Identification of events from visual cues is in general an arduous task because of complex motion, cluttered backgrounds, occlusions, and geometric and photometric variations of the physical objects. This is even more challenging in the case of detecting a logical chain of events, i.e., a sequence of events called a workflow, and in the presence of multiple workflows of events in the environment, which can interact with one another and affect each other's outcomes. Recent research advances in the computer vision and pattern recognition community have stimulated the development of a series of innovative algorithms, tools and methods for salient object detection and tracking in still images and video streams. These techniques are framed with appropriate descriptors (usually with invariance properties) such as the Scale-Invariant Feature Transform (SIFT), the Speeded Up Robust Features (SURF), or the MPEG-7 visual descriptors. All of these research methods can be considered initial steps towards the ultimate goal of behavior/event understanding. However, automatic comprehension of someone's behavior within a scene, or even automatic supervision of workflows (e.g., industrial processes), is a complex research field attracting great attention but with limited results so far. Most of the current approaches involve machine learning theories, such as supervised or semi-supervised methods, object tracking algorithms, adaptation mechanisms to handle complex, dynamic and abrupt visual conditions, and application-specific analysis topics. On the other hand, during the past few years, more and more people have been coping with the so-called "Information Overload" phenomenon, on the basis of i) the diversity and plenitude of media information currently available on the web and ii) the gradual but quick role shift of users from being solely content consumers to acting both as content consumers
Signal Processing: Image Communication, 2017
Over the last few years, a number of interesting solutions covering different aspects of event recognition have been proposed for event-based multimedia analysis. Existing approaches mostly focus on an efficient representation of the image and advanced classification schemes. However, it would be desirable to focus on the event-specific information available in the image, namely the so-called event saliency. In this paper, we propose a novel approach based on multiple instance learning (MIL) to learn the visual features contained in event-salient regions, extracted through a crowd-sourcing study. In total, we collect the salient regions for 76 different events from 4 large-scale datasets. The experimental results demonstrate the efficacy of using only event-related regions by achieving a significant gain in performance over the state-of-the-art.
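A minimal sketch of the MIL formulation: each image is a bag of region features, and the bag score is the maximum of a linear instance scorer, so at least one (event-salient) region drives the prediction. Max-pooling MIL with logistic loss is a standard simplification, not the paper's exact learner.

```python
import numpy as np

def bag_score(regions, w, b):
    """regions: (n_regions, d) features for one image; returns the MIL bag score."""
    return np.max(regions @ w + b)          # max over instances = most event-salient region

def mil_grad_step(bags, labels, w, b, lr=0.01):
    """One (sub)gradient step of logistic loss on bag scores."""
    for regions, y in zip(bags, labels):    # y in {0, 1}: image shows the event or not
        scores = regions @ w + b
        k = int(np.argmax(scores))          # subgradient flows through the max instance only
        p = 1.0 / (1.0 + np.exp(-scores[k]))
        g = p - y
        w -= lr * g * regions[k]
        b -= lr * g
    return w, b
```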
2014
Cutting the Visual World into Bigger Slices for Improved Video Concept Detection