2023, IRJET
Video prediction aims to generate future frames from given past frames. It is one of the fundamental tasks in computer vision and machine learning, has attracted many researchers, and various methods have been proposed to address it. However, most of them focus on increasing performance while ignoring memory footprint and computation cost. In this paper, we propose a lightweight yet efficient network for video prediction. Inspired by depthwise and pointwise convolution in the image domain, we introduce a 3D depthwise and pointwise convolutional neural network for video prediction. Experimental results show that our proposed framework outperforms state-of-the-art methods in terms of PSNR, SSIM, and LPIPS on standard datasets such as KTH, KITTI, and BAIR.
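The efficiency gain behind the depthwise-plus-pointwise factorization in this abstract comes down to simple parameter counting. The sketch below compares a standard 3D convolution against its depthwise + pointwise factorization; the channel and kernel sizes are illustrative choices, not values from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    # Standard 3D convolution: every output channel mixes all input
    # channels over the full kt x kh x kw window.
    return c_in * c_out * kt * kh * kw

def depthwise_pointwise3d_params(c_in, c_out, kt, kh, kw):
    # Depthwise: one kt x kh x kw filter per input channel.
    # Pointwise: a 1x1x1 convolution that mixes the channels.
    return c_in * kt * kh * kw + c_in * c_out

# Illustrative layer: 64 -> 64 channels, 3x3x3 kernel.
full = conv3d_params(64, 64, 3, 3, 3)
separable = depthwise_pointwise3d_params(64, 64, 3, 3, 3)
reduction = full / separable
```

For this layer the factorization needs roughly 19x fewer parameters, which is why the same trick shrinks both model size and FLOPs.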
IEEE Access, 2024
Video prediction is an essential vision task due to its wide applications in real-world scenarios. However, it is challenging due to the inherent uncertainty and complex spatiotemporal dynamics of video content. Several state-of-the-art deep learning methods have achieved superior video prediction accuracy at the expense of huge computational cost, so they are not suitable for devices with limited memory and computational resources. In light of Green Artificial Intelligence (AI), more environmentally friendly deep learning solutions are desired to tackle the problem of large models and high computational cost. In this work, we propose a novel video prediction network, 3DTransLSTM, which adopts a hybrid transformer-long short-term memory (LSTM) structure to inherit the merits of both self-attention and recurrence. Three-dimensional (3D) depthwise separable convolutions are used in this hybrid structure to extract spatiotemporal features while enhancing model efficiency. We conducted experimental studies on four popular video prediction datasets. Compared to existing methods, our proposed 3DTransLSTM achieved competitive frame prediction accuracy with significantly reduced model size, trainable parameters, and computational complexity. Moreover, we demonstrate the generalization ability of the proposed model by testing it on a dataset completely unseen during training.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020
The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it has demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review of deep learning methods for prediction in video sequences. We first define the video prediction fundamentals, the necessary background concepts, and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and significance in the field. The summary of the datasets and methods is accompanied by experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper concludes by drawing some general conclusions, identifying open research challenges, and pointing out future research directions.
2020
Predictive coding, currently a highly influential theory in neuroscience, has not yet been widely adopted in machine learning. In this work, we transform the seminal model of Rao and Ballard (1999) into a modern deep learning framework while remaining maximally faithful to the original schema. The resulting network we propose (PreCNet) is tested on a widely used next-frame video prediction benchmark, which consists of images from an urban environment recorded from a car-mounted camera. On this benchmark (training: 41k images from the KITTI dataset; testing: the Caltech Pedestrian dataset), we achieve, to our knowledge, the best performance to date as measured by the Structural Similarity Index (SSIM). On two other common measures, MSE and PSNR, the model ranked third and fourth, respectively. Performance was further improved when a larger training set (2M images from BDD100k) was used, pointing to the limitations of the KITTI training set. This work demonstrates that an architecture carefully base...
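The core idea of the Rao and Ballard (1999) scheme mentioned above is that a latent representation is refined by the error between its prediction and the observed input. The toy loop below is a minimal sketch of that general principle, not of the PreCNet architecture itself; the generative matrix, learning rate, and step count are all illustrative.

```python
def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def predictive_coding(x, U, steps=2000, lr=0.02):
    # Latent r predicts the input as U @ r; the prediction error is
    # propagated back (via U^T) to update r, as in Rao & Ballard's
    # error-driven inference loop.
    r = [0.0] * len(U[0])
    Ut = transpose(U)
    for _ in range(steps):
        pred = matvec(U, r)
        err = [xi - pi for xi, pi in zip(x, pred)]   # prediction error
        grad = matvec(Ut, err)                       # error fed back up
        r = [ri + lr * gi for ri, gi in zip(r, grad)]
    return r

# Recover a known latent cause from the observation it generates.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
true_r = [0.5, -1.5]
x = matvec(U, true_r)
r_hat = predictive_coding(x, U)
```

After enough iterations the inferred latent matches the cause that generated the observation, which is the behaviour deep predictive-coding networks scale up with learned, hierarchical generative mappings.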
Iraqi Journal for Electrical and Electronic Engineering/Al-maǧallaẗ al-ʻirāqiyyaẗ al-handasaẗ al-kahrabāʼiyyaẗ wa-al-ilikttrūniyyaẗ, 2024
Video prediction methods have progressed quickly, especially after the revolution in deep learning. Prediction architectures based on pixel generation produce blurry forecasts, but they are preferred in many applications because the model operates on frames only and does not need supporting information such as segmentation or flow maps, which would make obtaining a suitable dataset very difficult. In this approach, we present a novel end-to-end video forecasting framework that predicts the dynamic relationship between pixels in time and space. A 3D CNN encoder is used to estimate the dynamic motion, while the decoder reconstructs the next frame, aided by a 3D CNN ConvLSTM2D placed in the skip connection. This novel use of the skip connection plays an important role in reducing blur in the predicted frames and preserving spatial and dynamic information, which increases the accuracy of the whole model. KITTI and Cityscapes are used for training and Caltech for inference. The proposed framework achieves better quality with PSNR = 33.14, MSE = 0.00101, SSIM = 0.924, and a small number of parameters (2.3 M).
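Several abstracts on this page report prediction quality as PSNR alongside MSE. For reference, this is the standard relationship between the two (the exact dB value also depends on the pixel-range convention, which these abstracts do not state):

```python
import math

def psnr(mse, max_val=1.0):
    # Peak signal-to-noise ratio (in dB) from mean squared error,
    # for pixel values normalised to [0, max_val].
    return 10.0 * math.log10((max_val ** 2) / mse)

# Example: an MSE of 0.01 on [0, 1] pixels corresponds to 20 dB.
quality_db = psnr(0.01)
```

Higher PSNR means lower reconstruction error; each factor-of-10 reduction in MSE adds 10 dB.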
International Journal of Recent Contributions from Engineering, Science & IT (iJES)
Deep neural networks are becoming central in several areas of computer vision. While there have been many studies on the classification of images and videos, future frame prediction is still a rarely investigated approach, even though some applications could make good use of knowledge about the next frame of an image sequence in pixel space. Examples include video compression and autonomous agents in robotics that have to act in natural environments. Learning how to forecast the future of an image sequence requires the system to understand and efficiently encode the content and dynamics for a certain period. It is viewed as a promising avenue from which even supervised tasks could benefit, since labeled video data is limited and hard to obtain. Therefore, this work gives an overview of scientific advances covering future frame prediction and proposes a recurrent network model which utilizes recent techniques from deep learning research. The presented architecture is b...
2020
From Video Classification to Video Prediction: Deep Learning Approaches to Video Modelling by Hehe Fan Intelligent systems need the ability to understand not only the static scenes around them but also the dynamic changes in the environment. In order to understand the dynamic changes, intelligent systems are expected to model spatiotemporal sequences, i.e., videos. Learning video representations and predicting future movements constitute two fundamental missions of video modelling. This dissertation presents two works on video classification. In the first work, based on video shot representations, which are extracted by convolutional neural networks, a selective multi-instance learning method is proposed to automatically detect whether an event of interest happens in temporally untrimmed videos which usually consist of multiple video shots. This work can be seen as a binary video classification problem. The second work focuses on efficiency, which aims to reduce the computational co...
HAL (Le Centre pour la Communication Scientifique Directe), 2021
The use of recurrent neural networks has achieved impressive results in various applications such as video prediction, which has become a promising direction of scientific research. In this paper, we introduce a novel algorithm for video prediction called the "Robust Spatiotemporal Convolutional Long Short-Term Memory (Robust-ST-ConvLSTM) Algorithm" that outperforms state-of-the-art approaches. Robust-ST-ConvLSTM is a memory-flow algorithm based on a higher-order ConvLSTM. The memory flow holds the spatiotemporal information to optimize and control the prediction abilities of the ConvLSTM cell. Our approach is developed in the specific context of predicting future frames from historical observations, and we experimentally validate the proposed algorithm on two spatiotemporal datasets: a moving variant of the MNIST dataset of handwritten digits, and KTH, a human motion dataset.
Image Analysis and Processing - ICIAP 2017, 2017
There is an inherent need for autonomous cars, drones, and other robots to have a notion of how their environment behaves and to anticipate changes in the near future. In this work, we focus on anticipating future appearance given the current frame of a video. Existing work focuses on either predicting the future appearance as the next frame of a video, or predicting future motion as optical flow or motion trajectories starting from a single video frame. This work stretches the ability of CNNs (Convolutional Neural Networks) to predict an anticipation of appearance at an arbitrarily given future time, not necessarily the next video frame. We condition our predicted future appearance on a continuous time variable that allows us to anticipate future frames at a given temporal distance, directly from the input video frame. We show that CNNs can learn an intrinsic representation of typical appearance changes over time and successfully generate realistic predictions at a deliberate time difference in the near future.
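One common way to condition a CNN on a continuous time variable, as this abstract describes, is to broadcast the scalar t into an extra input channel alongside the frame. This is our assumption for illustration; the paper's exact conditioning mechanism may differ.

```python
def time_conditioned_input(frame, t):
    # frame: list of C channels, each an H x W grid (list of row lists).
    # t: scalar temporal distance at which to anticipate appearance.
    h = len(frame[0])
    w = len(frame[0][0])
    t_channel = [[t] * w for _ in range(h)]  # constant plane filled with t
    return frame + [t_channel]               # shape becomes (C + 1, H, W)

# A dummy 3-channel 4x4 frame, conditioned on t = 0.75.
rgb = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
conditioned = time_conditioned_input(rgb, t=0.75)
```

Because t enters as an ordinary input channel, the same network can be queried at any temporal distance without retraining separate per-step models.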
Lecture notes in networks and systems, 2022
Recently, video prediction algorithms based on neural networks have become a promising research direction. Therefore, a new recurrent video prediction algorithm called "Robust Spatiotemporal Convolutional Long Short-Term Memory" (Robust-ST-ConvLSTM) is proposed in this paper. Robust-ST-ConvLSTM introduces a new internal mechanism that efficiently regulates the flow of spatiotemporal information from video signals based on a higher-order Convolutional LSTM. The spatiotemporal information is carried through the entire network to optimize and control the prediction potential of the ConvLSTM cell. In traditional ConvLSTM units, the cell state, which carries relevant information throughout the processing of the input sequence, is updated using only one previous hidden state, which holds information on the previous data unit already seen by the network. Our Robust-ST-ConvLSTM unit instead relies on N previous hidden states, which provide temporal context for the motion in video scenes, when updating the cell state. Experimental results further suggest that the proposed architecture can significantly improve on state-of-the-art video prediction methods on two challenging datasets: the standard Moving MNIST dataset and the commonly used KTH human motion dataset.
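The difference from a standard LSTM update described above is that the gates see a context built from N previous hidden states rather than just the last one. The scalar sketch below illustrates only that idea; the shared gate weights, the simple averaging of the history, and the absence of convolutions are our simplifications, not the paper's design.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def higher_order_lstm_step(x, hidden_history, c_prev, w_x=1.0, w_h=1.0):
    # hidden_history: the N most recent hidden states h_{t-1} .. h_{t-N},
    # aggregated here by a plain average to form the temporal context.
    h_ctx = sum(hidden_history) / len(hidden_history)
    z = w_x * x + w_h * h_ctx
    i = sigmoid(z)        # input gate
    f = sigmoid(z)        # forget gate
    g = math.tanh(z)      # candidate cell content
    o = sigmoid(z)        # output gate
    c = f * c_prev + i * g          # cell state sees the N-state context
    h = o * math.tanh(c)
    return h, c

# One step with a history of N = 3 hidden states.
h, c = higher_order_lstm_step(x=0.5, hidden_history=[0.1, 0.2, 0.3], c_prev=0.0)
```

A real ConvLSTM replaces the scalars with feature maps and the products with convolutions, and would learn separate weights per gate and per history position.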
2020 IEEE International Conference on Image Processing (ICIP)
Based on a previously published neural network-based video intra prediction approach, this paper proposes and evaluates several extensions of both the training process as well as the network architectures. In particular, the influence of coding artifacts in the training samples as well as the effect of using different approximations of the residual coding costs as loss functions are investigated. In addition, the architecture is optimized and extended by final deconvolutional layers. Combined with the use of network pruning, it was not only possible to increase the achieved compression gain in comparison to the previous work, but also to decrease the needed number of floating point operations per pixel by more than 72% at the same time.
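The abstract credits part of its FLOP reduction to network pruning. A generic sketch of the most common variant, global magnitude pruning, is shown below; the paper's actual pruning criterion is not specified here, so this is illustrative only.

```python
def magnitude_prune(weights, sparsity):
    # Zero out the smallest-magnitude fraction `sparsity` of the weights;
    # the surviving weights keep their original values.
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    if k == 0:
        return list(weights)
    threshold = flat[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Prune half of a small weight list: the two smallest magnitudes vanish.
pruned = magnitude_prune([0.05, -0.8, 0.3, -0.02], sparsity=0.5)
```

Zeroed weights can be skipped at inference time, which is how pruning translates into fewer floating point operations per pixel.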