Papers by Md. Mahmudul Hasan
Lecture Notes in Computer Science, 2014
Activity recognition strategies assume large amounts of labeled training data, which require tedious human labor to obtain. They also use hand-engineered features, which are not optimal for all applications and hence must be designed separately for each one. Several recognition strategies have benefited from deep learning for unsupervised feature selection, which has two important properties: fine tuning and incremental update. This raises the question: can deep learning be leveraged for continuous learning of activity models from streaming videos? Our goals are to automatically learn the best set of features in an unsupervised manner, to reduce the amount of manual labeling of the unlabeled instances, and to retain already learned information without storing all the previously seen data while continuously improving the existing activity models.
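The last goal, updating a model with each new example while discarding the raw data afterward, can be illustrated with a deliberately simple sketch. The nearest-mean classifier and online mean update below are generic illustrations chosen for brevity, not the deep model the abstract describes.

```python
# Hypothetical sketch: incrementally updating per-class activity models
# without storing previously seen examples. Only a running mean feature
# vector and a count are kept per class, so old data never needs to be
# retained. This is a generic illustration, not the paper's exact model.

class IncrementalActivityModel:
    def __init__(self):
        self.means = {}   # class label -> running mean feature vector
        self.counts = {}  # class label -> number of examples seen

    def update(self, label, features):
        """Fold one new example into the class model (online mean update)."""
        if label not in self.means:
            self.means[label] = list(features)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        mean = self.means[label]
        for i, x in enumerate(features):
            mean[i] += (x - mean[i]) / n  # incremental mean, no stored history

    def predict(self, features):
        """Return the class whose mean is closest (squared Euclidean distance)."""
        def dist(label):
            return sum((a - b) ** 2 for a, b in zip(self.means[label], features))
        return min(self.means, key=dist)

model = IncrementalActivityModel()
model.update("walk", [1.0, 0.0])
model.update("run", [0.0, 1.0])
model.update("walk", [0.8, 0.2])
print(model.predict([0.9, 0.1]))  # -> walk
```

The same update-then-discard pattern is what allows a streaming learner to improve its models indefinitely with constant memory.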

2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014
Most of the state-of-the-art approaches to human activity recognition in video need an intensive training stage and assume that all of the training examples are labeled and available beforehand. These assumptions are unrealistic for many applications where we have to deal with streaming videos, in which newly observed activities can be leveraged to improve the current activity recognition models. In this work, we develop an incremental activity learning framework that is able to continuously update the activity models and learn new ones as more videos are seen. Our proposed approach leverages state-of-the-art machine learning tools, most notably active learning systems, and does not require tedious manual labeling of every incoming example of each activity class. We perform rigorous experiments on challenging human activity datasets, which demonstrate that the incremental activity modeling framework can achieve performance very close to the case when all examples are available a priori.
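The active-learning component that avoids labeling every incoming example can be sketched as follows. The margin-based confidence criterion and the toy scorer below are common generic choices assumed for illustration, not necessarily the exact criterion used in the paper.

```python
# Hypothetical sketch of an active-learning step: from a batch of unlabeled
# examples, pick only the one the current classifier is least confident
# about and ask a human to label it; the rest keep their predicted labels.

def margin_confidence(scores):
    """Confidence as the gap between the top two class scores."""
    top_two = sorted(scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

def select_for_labeling(batch, score_fn):
    """Return the index of the least confident example in the batch."""
    confidences = [margin_confidence(score_fn(x)) for x in batch]
    return confidences.index(min(confidences))

# Toy scorer standing in for a trained activity model's class scores.
def toy_scores(x):
    return {"walk": x[0], "run": x[1]}

batch = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
print(select_for_labeling(batch, toy_scores))  # -> 1 (scores nearly tied)
```

Querying only low-margin examples concentrates human labeling effort where the model is genuinely uncertain, which is what keeps the manual annotation burden low as the stream grows.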

We detect the seven activities defined by the TRECVID SED task: CellToEar, Embrace, ObjectPut, PeopleMeet, PeopleSplitUp, PersonRuns, and Pointing. We employ two different strategies to detect these activities based on their characteristics. Activities like CellToEar, Embrace, ObjectPut, and Pointing result from articulated motion of human body parts. Therefore, we employ a bag-of-words strategy based on local spatio-temporal interest point (STIP) features for these activities. Visual vocabularies are constructed from the STIP features, and each activity is described by a histogram of visual words. We also construct an activity probability map for each camera-activity pair that reflects the spatial distribution of an activity in a camera's view. We train a discriminative SVM classifier with a Gaussian kernel for each camera-activity pair. During evaluation we employ a sliding-window technique: we slide spatio-temporal cuboids in both the spatial and temporal directions to find a likely activity. Each cuboid is also described by a histogram of visual words, and the final decision is made using the SVM classifier and the activity probability map. For activities like PeopleMeet, PeopleSplitUp, and PersonRuns, the characteristics of the trajectories of the persons of interest are discriminative. For instance, the trajectories of PeopleMeet converge over time while those of PeopleSplitUp diverge. Therefore, we use a track-based string of feature graphs (SFG) to recognize these activities. The results of our experimental runs on the evaluation videos are comparable with those of other participants, and our performance in all of the activities is among the top five teams.
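The bag-of-words step described above, quantizing each local descriptor to its nearest visual word and describing a clip by the resulting histogram, can be sketched minimally. The 2-D toy vocabulary and descriptors are assumptions for illustration; real STIP descriptors are high-dimensional and the vocabulary is typically learned by k-means clustering.

```python
# Hypothetical sketch of the bag-of-visual-words step: each local STIP
# descriptor is quantized to its nearest vocabulary centroid ("visual
# word"), and the clip is represented by the normalized histogram of
# word counts.

def nearest_word(descriptor, vocabulary):
    """Index of the closest centroid (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
    return min(range(len(vocabulary)), key=lambda i: dist(vocabulary[i]))

def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one video clip."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts]

vocabulary = [[0.0, 0.0], [1.0, 1.0]]          # toy 2-word vocabulary
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [1.0, 0.8]]
print(bow_histogram(descriptors, vocabulary))  # -> [0.5, 0.5]
```

These fixed-length histograms are what the per-camera SVM classifiers consume, both for whole training clips and for each sliding-window cuboid at evaluation time.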

Computer Vision and Image Understanding, 2021
Most of the existing works on human activity analysis focus on recognition or early recognition of activity labels from complete or partial observations. Similarly, almost all of the existing video captioning approaches focus on the observed events in videos. Predicting the labels and the captions of future activities, where no frames of the predicted activities have been observed, is a challenging problem with important applications that require anticipatory response. In this work, we propose a system that can infer the labels and the captions of a sequence of future activities. Our proposed network for label prediction of a future activity sequence has three branches: the first branch takes visual features from the objects present in the scene, the second branch takes observed sequential activity features, and the third branch captures the last observed activity's features. The predicted labels and the observed scene context are then mapped to meaningful captions using a sequence-to-sequence learning-based method. Experiments on four challenging activity analysis datasets and a video description dataset demonstrate that our label prediction approach achieves performance comparable with the state of the art and that our captioning framework outperforms the state of the art.
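The three-branch design can be illustrated with a minimal fusion sketch. The mean pooling and plain concatenation below are generic assumptions for illustration; in the paper each branch is a learned network and the fused representation feeds a trained predictor.

```python
# Hypothetical sketch of fusing three branches: object features from the
# scene, a pooled summary of the observed activity sequence, and the last
# observed activity's features, concatenated into one vector for a
# downstream label predictor to score.

def mean_pool(sequence):
    """Summarize a sequence of feature vectors by element-wise mean."""
    n = len(sequence)
    return [sum(v[i] for v in sequence) / n for i in range(len(sequence[0]))]

def fuse_branches(object_feats, activity_sequence, last_activity_feats):
    """Concatenate the three branch outputs into one fused feature vector."""
    return object_feats + mean_pool(activity_sequence) + last_activity_feats

fused = fuse_branches(
    object_feats=[0.2, 0.8],
    activity_sequence=[[1.0, 0.0], [0.0, 1.0]],
    last_activity_feats=[0.0, 1.0],
)
print(fused)  # -> [0.2, 0.8, 0.5, 0.5, 0.0, 1.0]
```

Keeping the last observed activity as its own branch, rather than letting it be averaged away in the sequence summary, preserves the strongest cue about what happens next.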