Deepfake Image Detection
using Vision Transformer Models
Bogdan Ghita Ievgeniia Kuzminykh Abubakar Usama
University of Plymouth King’s College London University of Plymouth
Plymouth, UK London, UK Plymouth, UK
[Link]@[Link] [Link]@[Link] usamacheema0007@[Link]
Taimur Bakhshi Jims Marchang
Leeds Beckett University Sheffield Hallam University
Leeds, UK Sheffield, UK
[Link]@[Link] [Link]@[Link]
Abstract—Deepfake images are having an increasingly negative impact on day-to-day life and pose significant challenges for society. As the technology evolves and becomes more accessible, various categories of deepfake images have emerged. In parallel, deepfake detection methods are also improving, from basic feature analysis to pairwise analysis and deep learning; nevertheless, to date, there is no consistent method able to fully detect such images. This study aims to provide an overview of existing deepfake detection methods in the literature and to investigate the accuracy of models based on the Vision Transformer (ViT) when analysing and detecting deepfake images. We implement a ViT-based deepfake detection technique, which is trained and tested on a mixed real and deepfake image dataset from Kaggle containing 40000 images. The results show that the ViT model scores relatively high, 89.9125%, which demonstrates its potential but also highlights that there is significant room for improvement. Preliminary tests also highlight the importance of a large training dataset and the fast convergence of the model. When compared with other machine learning and deep learning deepfake detection methods, the performance of the ViT model is in line with prior research and warrants further investigation in order to evaluate its full potential.

Index Terms—deepfake images, deepfake detection, Vision Transformer model

I. INTRODUCTION

The recent advancements in artificial neural network (ANN) technologies have had a significant impact on multimedia content manipulation. AI-based software tools, allowing users to modify facial appearance, hairstyle, gender, age, and other personal attributes, have facilitated the creation of realistic fake images, videos, and audio. The widespread availability of these tools and the manipulated content they produce led, in 2017, to the coining of the term "deepfake", derived from "deep learning (DL)" and "fake", which encompasses the application of deep learning methods to generate very realistic-looking fake content. The advent of deepfakes was facilitated by the increased complexity and capabilities of computer vision and deep learning techniques. While the technology can be employed for legitimate, creative use, it has typically been misused to create fake news or fake images [1].

Its misuse was also followed by a significant effort from the research community, both to establish more realistic images and to develop techniques to detect them. Deepfake images can be catalogued, based on the focus and degree of change, into entire face synthesis, identity swap, attribute manipulation, and expression swap. From a technology perspective, deepfakes use unsupervised ANNs named autoencoders, which are in fact used for both image manipulation and facial recognition, being able both to synthesise the facial features into a set of characteristics and to modify images based on their defining features. The encoding process can be further improved through the use of a generative adversarial network (GAN).

From a detection perspective, the early efforts were based on image feature analysis, such as pixel similarity and noise, to identify artifacts of the modification process. Such approaches were effective against early instances but were surpassed once autoencoders and GANs were used to replicate the modification process. In parallel, a number of classification models may also be employed for deepfake detection purposes. One obvious example of such a model is the Vision Transformer (ViT), a classification approach derived from natural language processing which allows an image to be interpreted as an array for analysis. Although initially used for image classification, ViT can also be applied to deepfake detection, particularly given its algorithm-agnostic approach and ability to handle very large inputs. This study expands the work carried out so far on ViT by applying the model to classify an image dataset into real and deepfake images. We evaluate the model against a small image set and discuss its classification performance, as well as identify a number of limitations and potential directions for future work.

II. DEEPFAKE GENERATION AND DETECTION

A. Deepfake generation techniques

Since its inception, the generation of deepfake content has had a negative effect across all levels of society [2], from its prevalence in everyday social media [3] to its ability to disrupt international politics [4] [1].
Although there is a wide variety of approaches for generating deepfakes, they revolve around the concept of image (typically face) feature extraction and reconstruction. The process, implemented through an autoencoder-decoder, uses multiple images for training to extract and recompose these features. As part of the process, the deepfake model is trained to parse a set of images, extract their salient features, then reconstruct each image as accurately as possible using the extracted features. Once the model is trained for extraction and reconstruction, it can be applied to cross-convert images. For example, in order to replace the face of a subject, it can extract the image features of subject X and then reconstruct the image using features from a different subject Y.
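To make the cross-conversion idea concrete, the sketch below outlines a minimal face-swap autoencoder in Python (Keras): a single shared encoder is trained jointly with two subject-specific decoders, and the swap is obtained by routing subject X's encoding through subject Y's decoder. This is an illustrative sketch rather than the architecture of any particular deepfake tool; the 64x64 input resolution, layer sizes, and loss are assumptions.

```python
# Minimal sketch of an autoencoder-based face swap: one shared encoder,
# two subject-specific decoders. Layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_encoder(input_shape=(64, 64, 3), latent_dim=256):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    z = layers.Dense(latent_dim)(x)
    return models.Model(inp, z, name="shared_encoder")

def build_decoder(latent_dim=256, name="decoder"):
    z = layers.Input(shape=(latent_dim,))
    x = layers.Dense(16 * 16 * 128, activation="relu")(z)
    x = layers.Reshape((16, 16, 128))(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return models.Model(z, out, name=name)

encoder = build_encoder()
decoder_x = build_decoder(name="decoder_subject_x")  # trained on subject X faces
decoder_y = build_decoder(name="decoder_subject_y")  # trained on subject Y faces

# Training: each decoder learns to reconstruct its own subject from the shared encoding.
autoencoder_x = models.Model(encoder.input, decoder_x(encoder.output))
autoencoder_y = models.Model(encoder.input, decoder_y(encoder.output))
autoencoder_x.compile(optimizer="adam", loss="mae")
autoencoder_y.compile(optimizer="adam", loss="mae")

def swap_face(images_x):
    # Swap at inference time: encode a batch of subject X frames,
    # then decode the features with subject Y's decoder.
    return decoder_y(encoder(images_x))
```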
While autoencoder-based models produce very good results, they are designed with efficiency in mind rather than performance and will therefore allow either pixelation defects or blurriness, both detectable by users or by automated methods. Generative Adversarial Networks (GANs) [5] have better intrinsic capabilities to identify and remove artifacts and were therefore shown to significantly reduce noise and improve the quality of the resulting deepfake images [6]. More recently, newer deepfake approaches applied existing techniques for more efficient and realistic results. Variational Autoencoders (VAEs), proposed in [7] and [8], allow the inclusion of more complex models built from larger datasets. VAEs include a reconstruction loss monitoring component, which aims to minimise the loss value resulting from the process, and a regulariser component, which ensures diversity in the outcome. Similarly, an Adversarial Autoencoder (AAE) [9] draws in the benefits of GANs but relies on the autoencoder training to extract the distribution of the data rather than impose it on the output layer. As shown in the study that introduced the concept, AAE is clearly superior to VAE and incrementally better than GAN in terms of the error rate on standard evaluation image datasets.
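To make the two VAE components concrete, the sketch below expresses the training objective from [7] as a reconstruction term plus a KL-divergence regulariser over the latent distribution; it is a generic formulation, not code taken from the surveyed papers.

```python
import tensorflow as tf

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    """Standard VAE objective: reconstruction loss + KL regulariser (see [7])."""
    # Reconstruction component: how closely the decoded image matches the input.
    recon = tf.reduce_mean(
        tf.reduce_sum(tf.square(x - x_reconstructed), axis=[1, 2, 3])
    )
    # Regulariser component: KL divergence between the approximate posterior
    # N(z_mean, exp(z_log_var)) and the standard normal prior, which keeps the
    # latent space well behaved and encourages diversity in the outputs.
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    )
    return recon + kl
```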
B. Deepfake detection techniques

As pointed out in the previous subsection, deepfake generation has been through an evolutionary process; detection techniques followed a similar trajectory. The early methods were based on machine learning and aimed to detect artifacts and defects of the generation process. A typical example is [10], where the authors aimed to identify the convolutional traces produced by a deepfake autoencoder and achieved a high accuracy of 93%. Similarly, [11] looked at artifacts introduced in the process, such as global consistency, illumination, and geometry, focusing on particular characteristics of the face, such as the iris and the teeth. The method performs very well, with an AUC-ROC of 0.83 when combining all observed features. The main challenge in both studies is the nature and availability of the images used, as the database must be sufficient to train the detection models and capture all the artifact variances.

As content became more realistic and more effective at eliminating image artifacts, the detection methods aimed to replicate the generation ones and became deep learning-based. A typical example of this approach is AutoGAN [12], which is essentially a GAN that replicates the deepfake process. AutoGAN takes the input image and uses a GAN-based generator to produce an image following the same principles as a deepfake; the image is then compared to the original to detect spectral artifacts. The method relies on the fact that the GAN requires a variant of upsampling, either transposed convolution or nearest neighbour interpolation, both of which produce spectral artifacts that can be "learned". The method delivers a tangible improvement, reaching an accuracy of over 95% on CycleGAN images for both transposed convolution and nearest neighbour interpolation.
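To give a flavour of the frequency-domain cue exploited by [12], the short sketch below computes a log-magnitude 2D spectrum of a greyscale image; upsampling layers tend to leave periodic patterns in such spectra that a downstream classifier can learn. This is an illustrative feature extractor, not the AutoGAN pipeline itself.

```python
import numpy as np

def log_spectrum_feature(image_gray: np.ndarray) -> np.ndarray:
    """Return a normalised log-magnitude 2D spectrum; GAN upsampling (transposed
    convolution or nearest-neighbour interpolation) typically leaves periodic
    artifacts in this representation."""
    spectrum = np.fft.fftshift(np.fft.fft2(image_gray))
    magnitude = np.log1p(np.abs(spectrum))
    # Normalise so the feature is comparable across images of different brightness.
    return (magnitude - magnitude.min()) / (magnitude.max() - magnitude.min() + 1e-8)

# Example: a flattened spectrum can be fed to any off-the-shelf classifier.
# feature_vector = log_spectrum_feature(gray_image).ravel()
```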
C. Conclusion

As highlighted by the results of the studies in the previous subsections, detection techniques mimic the deepfake generation models in order to expose artifacts, defects, or other types of data errors in the input images and successfully segregate real and fake images. The results are very good, with methods achieving accuracies of over 85-90% across the datasets tested. One common point of the detection methods is their need for awareness of the model used for generating the fake images, as they are somewhat geared towards the behaviour of specific deepfake models. This issue is made clearer by some of the papers, which highlight that deepfake images generated through other or unknown methods may perform worse when analysed; implicitly, the proposed methods may become obsolete with future variants of deepfake models. It is therefore worth exploring the wider field of image recognition techniques and investigating their potential as a deepfake method-agnostic detection alternative.

III. VIT-BASED DEEPFAKE DETECTION

A. Introduction

Vision Transformer (ViT) is an image analysis approach proposed in [13], based on the concept of the Transformer introduced by [14]. Transformers consist of alternating self-attention and multilayer perceptron blocks; they were initially aimed at natural language processing and relatively small models, but subsequent research demonstrated that they can be scaled to very large models of 10^11 parameters [15]. An early attempt by [16] saw self-attention applied to resized images on a pixel-by-pixel basis. In their paper, Dosovitskiy et al. [13] apply a variant of the Image Transformer but, instead of smaller images and pixel-by-pixel analysis, they split the image into fixed-size patches which are fed to a Transformer. In an NLP scenario, the information is provided as a 1D sequence; for ViT, the image is converted into a linear projection vector. The Transformer itself follows the [14] design, with a slight adjustment to allow for position embedding.

Both papers acknowledged the ability of the technology to go beyond image recognition to identifying generated images as a possible application, with [16] also providing some preliminary results.

The aim of this research is to determine the efficiency of ViT-based algorithms when classifying real vs fake images.
In order to observe their performance, we used the image classification work described in [13] and evaluated its performance on the [17] image dataset from Kaggle.

B. Text and image classification

The ViT model is inspired by the standard Transformer from [14]. The Transformer model has self-attention at its core: a multilayered stack of feed-forward and multi-head attention blocks. Both the encoder and the decoder use a stack of 6 identical layers, each composed of a multi-head self-attention mechanism connected to a feed-forward network. Of the two types of attention, the model uses Scaled Dot-Product Attention due to its more efficient computation. The authors exploited its ability to be scaled in a multi-head attention architecture, whereby the queries and results are parallelised, producing vectored values. As tested by the authors, Transformer models work very well with text, being able to outperform previous models in English-to-German translation.
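For reference, a minimal NumPy rendering of Scaled Dot-Product Attention as defined in [14] is shown below; multi-head attention simply runs several such operations in parallel over learned projections of the inputs.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in [14].
    Q: (batch, q_len, d_k), K: (batch, k_len, d_k), V: (batch, k_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (batch, q_len, k_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                       # (batch, q_len, d_v)
```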
While effective at processing text-based input, Transformers were not intrinsically able to handle 2D content; [16] extended the model with position embedding information and named the resulting architecture the Image Transformer. The authors proposed a pixel-by-pixel approach, whereby the representation q' of a channel for a pixel q is derived based on self-attention over the memory of the previously generated pixels. Unlike text-based input, scaling the Image Transformer to larger images was unattainable, hence the authors biased the model with a level of locality, whereby pixel values were derived mostly from their vicinity rather than from the entire picture, a property termed Local Self-Attention.

In their paper, Dosovitskiy et al. [13] revisited the concept of the Image Transformer by dividing the image into patches. The overall image x of resolution (H, W) and C channels can be reshaped from x ∈ R^{H×W×C} into an array x_p of N = HW/P^2 patches, with x_p ∈ R^{N×(P^2·C)}, where (P, P) is the resolution of each patch.

It is interesting to note that ViT does not rely heavily on locality. Instead, the two components are adjusted to exploit their characteristics as follows: the multilayer perceptron layers are local and the self-attention layers are global. As a result, locality is shared between the two types of layers.
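The reshaping described above can be illustrated with a few lines of NumPy, assuming H and W are divisible by the patch size P:

```python
import numpy as np

def image_to_patches(x: np.ndarray, P: int) -> np.ndarray:
    """Reshape an image of shape (H, W, C) into N = H*W / P**2 patches,
    each flattened to a vector of length P*P*C, as described in [13]."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by the patch size"
    # Split rows and columns into P-sized blocks, then flatten each block.
    patches = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)   # shape (N, P*P*C)

# Example for the settings used later in this paper (72x72 image, patch size 6):
# image_to_patches(np.zeros((72, 72, 3)), P=6).shape == (144, 108)
```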
C. ViT implementation and dataset

We implemented the ViT model in Python, following the ViT specification from [13]. The implementation takes a dataset as input, including a mix of real and fake images, then applies patching to each image and encodes the position of each patch. The model includes the multi-head attention, normalisation, and MLP layers that take in the inputs.

The ViT model requires a set of parameters for training, testing, and speed optimisation, as listed in Table I. The size and number of inputs are dictated by the image size and patch size parameters, which set the size of the input images and patches, both measured in pixels.

TABLE I
VIT MODEL PARAMETERS

Parameter              Value
learning rate          0.001
weight decay           0.0001
batch size             256
number of epochs       800
image size             72
patch size             6
projection dimension   64
transformer layers     5

At the core of the model performance are the learning rate, for which smaller values typically lead to slower but more accurate convergence, and the weight decay, which is the penalty applied to the loss function; the two parameters also influence the number of epochs needed to train the model. The speed of the model is also influenced by the batch size, which dictates the level of parallelism when processing the model.
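To indicate how the parameters in Table I map onto an implementation, the sketch below assembles a ViT-style classifier in Keras with the listed values (patch size 6, projection dimension 64, 5 transformer layers) and an AdamW optimiser using the listed learning rate and weight decay. The number of attention heads, the MLP width, the sigmoid binary head, and the choice of AdamW (available in TensorFlow 2.11 or later) are not stated in the paper and are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMAGE_SIZE, PATCH_SIZE = 72, 6                   # Table I
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2    # 144 patches per image
PROJECTION_DIM, NUM_LAYERS = 64, 5               # Table I
NUM_HEADS = 4                                    # assumption: not stated in the paper

class PatchEncoder(layers.Layer):
    """Linear projection of flattened patches plus a learned position embedding."""
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(num_patches, projection_dim)
        self.positions = tf.range(start=0, limit=num_patches, delta=1)

    def call(self, patches):
        return self.projection(patches) + self.position_embedding(self.positions)

def build_vit_classifier():
    # Input: 144 flattened 6x6x3 patches per image (see the preprocessing in Section IV).
    inputs = layers.Input(shape=(NUM_PATCHES, PATCH_SIZE * PATCH_SIZE * 3))
    x = PatchEncoder(NUM_PATCHES, PROJECTION_DIM)(inputs)
    for _ in range(NUM_LAYERS):
        # Transformer block: multi-head self-attention followed by an MLP,
        # each preceded by layer normalisation and wrapped in a residual connection.
        attn_in = layers.LayerNormalization()(x)
        attn_out = layers.MultiHeadAttention(num_heads=NUM_HEADS,
                                             key_dim=PROJECTION_DIM)(attn_in, attn_in)
        x = layers.Add()([x, attn_out])
        mlp_in = layers.LayerNormalization()(x)
        mlp_out = layers.Dense(PROJECTION_DIM * 2, activation="gelu")(mlp_in)
        mlp_out = layers.Dense(PROJECTION_DIM)(mlp_out)
        x = layers.Add()([x, mlp_out])
    x = layers.GlobalAveragePooling1D()(layers.LayerNormalization()(x))
    outputs = layers.Dense(1, activation="sigmoid")(x)   # real (0) vs deepfake (1)
    return tf.keras.Model(inputs, outputs)

model = build_vit_classifier()
model.compile(
    optimizer=tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.0001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```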
IV. MODEL PERFORMANCE

The model was trained and tested on subsets of images, including a mix of fake and real samples, from a Kaggle image dataset containing 190,345 images [17]. All input samples are the same size, 256x256 pixels. The preprocessing takes each sample, converts it to 72x72 pixels, then splits it, using the patch size specified in the ViT model parameters, into 144 patches of 6x6 pixels each. An example of a converted full image and its set of patches is shown in Fig. 1. Each patch is fed to the model together with its positional information.

Fig. 1. Processed and resized image: full (left) and split into patches (right).

The model was trained and tested on a Google Colab L4 cloud instance, equipped with 64GB of RAM and a 24GB NVIDIA L4 GPU designed for deep learning applications [18]. Each processed dataset was split 80/20 for training and testing.
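A typical way to reproduce this preprocessing and split is sketched below, assuming the Kaggle images are organised in one folder per class ("real" and "fake"); the folder layout and the use of Keras utilities are assumptions, as the paper does not detail its data-loading code. Images are resized to 72x72 pixels, batched as in Table I, cut into 144 patches, and split 80/20 for training and testing.

```python
import tensorflow as tf

DATA_DIR = "dataset/"   # hypothetical path: one sub-folder per class, e.g. real/ and fake/

train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.2, subset="training", seed=42,
    image_size=(72, 72), batch_size=256, label_mode="binary")
test_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.2, subset="validation", seed=42,
    image_size=(72, 72), batch_size=256, label_mode="binary")

def to_patches(images, labels):
    # Normalise, then cut each 72x72 image into 144 non-overlapping 6x6 patches
    # and flatten them, matching the patch size in Table I.
    images = tf.cast(images, tf.float32) / 255.0
    patches = tf.image.extract_patches(
        images=images, sizes=[1, 6, 6, 1], strides=[1, 6, 6, 1],
        rates=[1, 1, 1, 1], padding="VALID")
    batch = tf.shape(images)[0]
    return tf.reshape(patches, (batch, 144, 6 * 6 * 3)), labels

train_ds = train_ds.map(to_patches)
test_ds = test_ds.map(to_patches)
```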
A. Preliminary test

The training process is rather computationally demanding. The preliminary evaluation aimed to determine the learning speed of the model, in order to avoid excessive training and optimise the use of computational resources. For this, we used a subset of 5850 images (2531 deepfake and 3319 real) drawn from the Kaggle dataset [17]; the reason for the subset is pragmatic - we aimed to evaluate the model on a manageable subset from an accuracy and training perspective. The samples were preprocessed as described above: each 256x256 image was converted to 72x72 pixels and split into 6x6-pixel patches, with each patch fed to the model together with its positional information (Fig. 1).

The training was set to 800 epochs and took just under 1 hour, with just over 4 s per epoch.

The evolution of both accuracy and residual loss throughout training is shown in Fig. 2. As can be seen, the model performs very well on the training dataset, with accuracy close to 1, but for the validation dataset the accuracy remained rather low, at 0.75. Looking at the loss values, the training and validation sets behave similarly for the first 100 epochs, after which their evolution diverges. The fit of the model to the training set improves monotonically, reaching a residual loss close to 0, while for the validation set the residual loss increases; as a result, while the loss stabilises for training, it oscillates increasingly for validation. Coming back to the accuracy diagram, there is no tangible improvement in the validation accuracy beyond 100 epochs, despite the better results for the training dataset.

Fig. 2. Evolution of accuracy (left) and residual loss (right) during training on the preliminary dataset.

The results confirm the expected behaviour of ViT models once the size of the dataset is taken into consideration. ViT models require large datasets for training, followed by fine-tuning on smaller datasets. ILSVRC-2012 ImageNet, the smallest dataset [19] used by [13], included 1.3 million samples and over 1000 classes; despite the decrease in the number of classes from 1000 to just two (real and deepfake), the ViT model still requires a larger training dataset for better accuracy.

B. Model evaluation

A larger subset of 40000 images, 20000 deepfake and 20000 real, was used to fully train the implemented ViT model on an L4 Google Colab instance. We were not able to use the full Kaggle dataset for two reasons: the length of time required for training and, more importantly, the L4 instance failing to process larger datasets. Given the results of the preliminary tests, we set the training to 200 epochs, which required 95 minutes of processing.

Accuracy and residual loss had a similar evolution to the preliminary test, as shown in Fig. 3. The full results are shown in Table II.

Fig. 3. Evolution of accuracy (left) and residual loss (right) during training on the complete dataset.

TABLE II
VIT FULL TESTING RESULTS

class          precision   recall   f1-score   support
real           0.87        0.93     0.90       4015
fake           0.93        0.87     0.90       3985
macro avg      0.90        0.90     0.90       8000
weighted avg   0.90        0.90     0.90       8000
Overall accuracy: 89.9125%

The model performs very well, with an overall accuracy of 89.9125%. There are some minimal variations between the detection of the real and fake classes, but nothing significant. Overall, the model is slightly biased towards deepfakes, with a higher percentage of both accurate detection of deepfakes and identification of real images as deepfakes.
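The per-class figures in Table II are the standard precision, recall, and F1 metrics; they can be reproduced from the model's predictions with scikit-learn, as sketched below (the label and probability arrays are illustrative placeholders, not the actual test outputs).

```python
import numpy as np
from sklearn.metrics import classification_report, accuracy_score

# Placeholders: in practice these come from the 20% test split (8000 images).
y_true = np.array([0, 0, 1, 1])          # 0 = real, 1 = fake
y_prob = np.array([0.1, 0.7, 0.8, 0.4])  # sigmoid outputs of the classifier

y_pred = (y_prob >= 0.5).astype(int)     # threshold the predicted probabilities
print(classification_report(y_true, y_pred, target_names=["real", "fake"]))
print("Accuracy:", accuracy_score(y_true, y_pred))
```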
Looking at other studies in the area, we can compare our model with the summary provided by [20], which reviewed the results of 62 studies aiming to detect deepfake images. According to the overall figures, deep learning models deliver an average accuracy of 89.73% and machine learning models an average accuracy of 86.86%. Our accuracy therefore closely matches the deep learning category, but it is worth noting the dataset size limitations we encountered during training and, therefore, the likely possibility of reaching better results with a larger dataset.

V. CONCLUSION AND FUTURE WORK

Transformer models are a combination of self-attention and multilayer perceptron blocks, notable for their ability to handle natural language processing tasks. Vision Transformer models are an expansion of Transformer models, whereby the input is a series of patches carrying locality information. This paper provides a practical evaluation of a Vision Transformer model applied to the task of detecting deepfake images. We used an implementation that strictly followed the design of the ViT and tested it on a small dataset consisting of a combination of deepfake and real images. The model delivered 89.9125% accuracy, with slightly better results for the deepfake images.

The results showed that the accuracy of the model is significantly affected by the size of the dataset used and, due to implementation limitations, we were able to test only a subset of the available images. For future work, we aim to use larger datasets for evaluating the model, which are likely to deliver significantly better results, as well as to further investigate the performance of the model when using different training parameters.
REFERENCES
[1] R. Chesney and D. K. Citron, “Deep fakes: A looming challenge for privacy, democracy, and national security,” 107 California Law Review 1753 (2019); U of Texas Law, Public Law Research Paper 692, 2018.
[2] J. T. Hancock and J. N. Bailenson, “The social impact of deepfakes,”
Cyberpsychol. Behav. Soc. Netw., vol. 24, pp. 149–152, Mar. 2021.
[3] S. Karnouskos, “Artificial intelligence in digital media: The era of
deepfakes,” IEEE Transactions on Technology and Society, vol. 1, no. 3,
pp. 138–147, 2020.
[4] R. Chesney and D. Citron, “Deepfakes and the new disinformation war:
The coming age of post-truth geopolitics,” 2019.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
Advances in neural information processing systems, vol. 27, 2014.
[6] A. A. Maksutov, V. O. Morozov, A. A. Lavrenov, and A. S. Smirnov,
“Methods of deepfake detection based on machine learning,” in 2020
IEEE Conference of Russian Young Researchers in Electrical and
Electronic Engineering (EIConRus), pp. 408–411, 2020.
[7] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
preprint arXiv:1312.6114, 2013.
[8] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backprop-
agation and approximate inference in deep generative models,” in
Proceedings of the 31st International Conference on Machine Learning
(E. P. Xing and T. Jebara, eds.), vol. 32 of Proceedings of Machine
Learning Research, (Bejing, China), pp. 1278–1286, PMLR, 22–24 Jun
2014.
[9] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adver-
sarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
[10] L. Guarnera, O. Giudice, and S. Battiato, “Fighting deepfake by exposing
the convolutional traces on images,” IEEE Access, vol. 8, pp. 165085–
165098, 2020.
[11] F. Matern, C. Riess, and M. Stamminger, “Exploiting visual artifacts
to expose deepfakes and face manipulations,” in 2019 IEEE Winter
Applications of Computer Vision Workshops (WACVW), pp. 83–92, 2019.
[12] X. Zhang, S. Karaman, and S.-F. Chang, “Detecting and simulating
artifacts in gan fake images,” in 2019 IEEE International Workshop
on Information Forensics and Security (WIFS), pp. 1–6, 2019.
[13] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers
for image recognition at scale.” arXiv preprint, 2020.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in Neural Information Processing Systems (I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
eds.), vol. 30, Curran Associates, Inc., 2017.
[15] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu,
C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess,
J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and
D. Amodei, “Language models are few-shot learners,” in Advances in
Neural Information Processing Systems (H. Larochelle, M. Ranzato,
R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 1877–1901, Curran
Associates, Inc., 2020.
[16] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” in International Conference on Machine Learning (ICML), 2018.
[17] M. Karki, Deepfake and real images. Kaggle, 2022.
[18] “Introducing G2 VMs with NVIDIA L4 GPUs,” Google Cloud Blog, [Link]/introducing-g2-vms-with-nvidia-l4-gpus. [Accessed 10-04-2024].
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[20] M. Rana et al., “Deepfake detection: A systematic literature review,” IEEE Access, vol. 10, pp. 1–1, 2022.