A Qualitative Survey on Deep Learning Based Deepfake Video Creation and Detection Methods

Deep Learning (DL) based applications are growing rapidly in the modern world. Deep learning is used to solve many critical problems such as big data analytics, computer vision, and human-brain interfacing. However, the advancement of deep learning can also pose national and international threats to privacy, democracy, and national security. Deepfake videos are spreading quickly and have an impact on political, social, and personal life. Deepfake videos use artificial intelligence and can appear very convincing, even to a trained eye. Obscene videos are often made using deepfakes, tarnishing people's reputations. Deepfakes are a public concern, so it is important to develop methods to detect them. This paper surveys deepfake creation algorithms and, more crucially, the deepfake detection approaches proposed by researchers to date. We cover the problems, trends, and future directions of deepfake technology in detail. By studying the background of deepfakes and state-of-the-art deepfake detection methods, this paper gives a complete overview of deepfake approaches and supports the development of novel, more reliable methods for coping with highly sophisticated deepfakes.

These models are used to analyze a person's facial expressions and movements and synthesize facial images of someone else with similar expressions and movements (Lyu, 2018). To train a model to generate photo-realistic pictures and videos, deepfake technologies often require huge volumes of image and video data. Politicians and celebrities are the first targets of deepfakes since a massive number of their videos and photographs are available on the internet. Deepfakes have been used to create pornographic photographs and films by superimposing the faces of politicians and celebrities onto other people's bodies. In 2017, the first deepfake video was released, in which a celebrity's face was swapped onto the body of a porn actor. Deepfake videos are now becoming a global security threat because they are used to make fake speech videos of international leaders (Hwang, 2020).
Deepfakes can thus be used to incite political or religious tensions between countries, deceive the public and influence election results, or create havoc in financial markets by spreading false information (Zhou, 2020; Guo, 2020). They can also be used to create fictional satellite photos of the globe that contain objects which do not exist in the real world to deceive military analysts, for example by adding a fictional bridge across a river. This can mislead troops who plan to cross the bridge during combat (Fish, 2019).
Because the democratization of creating convincing virtual humans has beneficial consequences, deepfakes can also be used in positive ways, such as in visual effects, digital avatars, Snapchat filters, creating voices for those who have lost theirs, and updating episodes of movies without reshooting them. The number of illegal applications of deepfakes, however, far outnumbers the beneficial ones. Because of the advancement of deep neural networks and the accessibility of enormous amounts of data, faked photos and videos are nearly indistinguishable to humans and even to powerful computer algorithms. The procedure for making such modified photographs and films is also much easier nowadays, as it only requires an identifying photo or a short video of the target person. Producing astonishingly realistic tampered video requires less and less effort; recent advances can generate a deepfake video from a single picture (Zakharov, 2019).
As a result, deepfakes may pose a danger not only to public figures but to every individual. For example, a voice deepfake was used to scam a CEO out of $243,000 (Damiani, 2019). Recently an application, DeepNude, was released that can turn a person's image into a nonconsensual pornographic video, which is even more troubling (Samuel, 2019). Similarly, the Chinese application Zao has recently created a buzz; it allows even the most non-technical users to swap their faces onto the bodies of famous movie stars and inject themselves into well-known films and TV clips (Guardian, 2019). These types of falsification pose a serious danger to privacy and identity, and they affect many parts of people's lives.
As a result, discovering the truth in the digital world has become particularly crucial. This is considerably more difficult with deepfake videos, because they are frequently created for harmful reasons and virtually anyone can now construct them using current deepfake creation tools. Several approaches for detecting deepfake videos have been proposed so far (Lyu, 2020) (Jafar, 2020). Because the majority of deepfake creation and detection methods are deep learning based, a conflict has erupted between malevolent and beneficial uses of deep learning. To combat deepfakes and face-swapping technologies, the US Defense Advanced Research Projects Agency (DARPA) launched a multimedia forensics research program (called Media Forensics or MediFor) to accelerate the development of fake digital visual media detection methods (Turek, 2020). Facebook Inc., in collaboration with Microsoft Corp. and the Partnership on AI coalition, has established the Deepfake Detection Challenge to encourage research and innovation in identifying and stopping the use of deepfakes to mislead viewers (Schroepfer, 2019).
The volume of deepfake papers has increased rapidly in recent years, according to data acquired by dimensions.ai at the end of 2020 (Dimensions, 2021). Fig. 1 shows the growth of deepfake papers, which has accelerated since 2017. Although the counted number of deepfake papers is likely lower than the actual number, the research trend in this area is clearly expanding. This survey presents the methods for creating and detecting deepfake videos. Several survey papers already exist in this field (Verdoliva, 2020), but we conducted our survey from a different point of view and with a different taxonomy. The fundamentals of deepfake algorithms and deep learning based deepfake creation methods are presented in Section II. In Section III we discuss the various techniques for identifying deepfake videos as well as their benefits and drawbacks. In the last section, we describe the challenges and future directions for deepfake detection and media forensics.

Deepfake Creation
Deepfakes have grown in popularity as a result of the high quality of manipulated videos and the ease with which their implementations can be used by a wide variety of users with varying computing skills, from professional to novice. Deep learning methods are used to create the majority of these applications. The ability of deep learning to represent complicated and high-dimensional data is well known. Deep autoencoders, a type of deep network with that capability, have been widely used for dimensionality reduction and image compression (Punnappurath, 2019) (Cheng, 2019). FakeApp, created by a Reddit user using an autoencoder-decoder pairing structure, was the first attempt at deepfake creation (Reddit, 2015). The autoencoder extracts latent features from facial images, and the decoder reconstructs the faces from those features. To exchange faces between source and target images, two encoder-decoder pairs are required, where each pair is trained on one image set and the encoder's parameters are shared between the two network pairs (Guera, 2018).

Fig. 2:
Two encoder-decoder pairs are used in this deepfake production strategy.
For the training process, the two networks use the same encoder but distinct decoders (top). Deepfakes are made by encoding the image of face A with the common encoder and decoding it with decoder B (bottom) (Guera, 2018). In other words, the encoder networks of the two pairs are identical. This technique allows the common encoder to find and learn the similarity between two sets of face images, which is feasible because faces share comparable features such as the positions of eyes, noses, and mouths. GAN-based variants (Goodfellow, 2014) add an adversarial loss (Ker, 2014) to the encoder-decoder architecture, together with the VGGFace perceptual loss, in order to produce a higher-quality output video, which is made possible by smoothing out artifacts in segmentation masks. Outputs can be generated at pixel resolutions of 64x64, 128x128, and 256x256. Additionally, FaceNet (net, 2015) introduces a multi-task convolutional neural network (CNN) (Albawi, 2017) to enhance facial recognition and alignment accuracy, and CycleGAN (Cycle, 2017) is used to implement the generative networks (Zhao, 2016). An overview of the most popular deepfake tools is shown in Table 1.
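The weight-sharing idea behind this architecture can be illustrated with a minimal sketch. The example below uses plain linear maps in place of real convolutional encoders and decoders, with hypothetical image and latent dimensions; it only demonstrates how one shared encoder combined with identity-specific decoders enables a face swap, not any production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, LATENT = 64 * 64, 128  # hypothetical flattened-image and latent sizes

# One shared encoder, two identity-specific decoders (linear layers for brevity).
W_enc = rng.normal(scale=0.01, size=(LATENT, D))
W_dec_a = rng.normal(scale=0.01, size=(D, LATENT))  # would be trained on faces of A
W_dec_b = rng.normal(scale=0.01, size=(D, LATENT))  # would be trained on faces of B

def encode(x):
    return W_enc @ x          # shared latent representation

def decode(z, W_dec):
    return W_dec @ z          # identity-specific reconstruction

face_a = rng.random(D)        # stand-in for a flattened face image of A

# Training path: encode A, decode with decoder A to reconstruct A.
recon_a = decode(encode(face_a), W_dec_a)

# Face-swap path: encode A, decode with decoder B -> B's identity with A's expression.
swapped = decode(encode(face_a), W_dec_b)

assert recon_a.shape == swapped.shape == (D,)
```

In a real pipeline both pairs are trained jointly, so the shared encoder learns identity-agnostic structure (pose, expression) while each decoder learns to render one specific identity.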

Deepfake Video Detection
The growing number of deepfakes threatens privacy, social security, and democracy (Chesney, 2018). As soon as the threat of deepfakes was identified, methods for identifying them were proposed. Early approaches relied on handcrafted features derived from glitches and flaws in the deepfake synthesis process. Recent approaches use deep learning to automatically extract salient and discriminative features for detecting deepfakes (de Lima, 2020; Amerini, 2020). Deepfake detection is typically framed as a binary classification problem, in which classifiers are employed to distinguish between real and manipulated videos. Training this type of classification model requires a big dataset of real and fake videos. Although the quantity of deepfake videos is growing, there are still limitations in establishing a benchmark for validating detection methods. To address this issue, Korshunov and Marcel (Korshunov, 2019) used the open-source code Faceswap-GAN (Face, 2015) to create a significant deepfake dataset of 620 GAN-generated videos. Low- and high-quality deepfake videos were created using videos from the publicly available VidTIMIT dataset (Sanderson, 2002), which can convincingly simulate facial movements, lip motions, and eye blinking. These videos were then used to test how well numerous deepfake detection methods worked. According to the test results, popular facial recognition algorithms based on VGG (Parkhi, 2015) and Facenet (Schroff, 2015) are unable to detect deepfakes reliably. Other methods, such as lip-syncing approaches (Chung, 2017) (Korshunov, 2018) and image quality measures with a support vector machine (SVM) (Boulkenafet, 2015), show very high error rates when used to detect deepfake videos from this freshly created dataset. This raises concerns about the critical need for more powerful approaches to identifying deepfakes in the future.
Generally, there are two categories of deepfake detection: fake video detection and fake image detection. Fake video detection is further divided into two categories, namely visual artifacts within frames and temporal features across frames. Fig. 3 shows all the categories of deepfake detection. In this paper we survey only deepfake video detection methods and give researchers future directions for this research field.

Fake Video Detection
Due to the significant loss of frame data during video compression, most image detection techniques cannot be applied to videos (Afchar, 2018). Additionally, videos have temporal characteristics that vary between frames, making them difficult to detect with systems built only for still fraudulent images. Fake video detection is divided into two categories, namely temporal features across frames and visual artifacts within frames. These two categories are explained in this subsection.

1) Temporal Features Across Frames
Sabir et al. used the spatio-temporal properties of video streams to detect deepfakes, based on the observation that temporal coherence is not well preserved in the deepfake synthesis process (Sabir, 2019). Because video manipulation is done frame by frame, low-level artifacts caused by face modifications are expected to manifest as temporal artifacts with inconsistencies between frames. To leverage these temporal disparities, a recurrent convolutional network (RCN) was proposed, based on a combination of the convolutional network DenseNet (Huang, 2017) and gated recurrent unit cells (Cho, 2014) (Fig. 4).

Fig. 4:
A two-step process for detecting face manipulation, in which the first step detects, crops, and aligns faces in a sequence of frames, and the second step uses a combination of convolutional neural networks (CNN) and recurrent neural networks (RNN) to distinguish between manipulated and authentic face images (Sabir, 2019).
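The recurrent half of such a model can be sketched with the standard gated recurrent unit equations. The numpy example below is a minimal, randomly initialised GRU cell processing a sequence of hypothetical per-frame features (stand-ins for DenseNet outputs); the feature and hidden sizes are arbitrary, and in practice the weights are learned jointly with the CNN.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT, HID = 32, 16  # hypothetical per-frame feature size and GRU hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialised GRU weights (learned during training in a real model).
Wz, Uz = rng.normal(size=(HID, FEAT)), rng.normal(size=(HID, HID))
Wr, Ur = rng.normal(size=(HID, FEAT)), rng.normal(size=(HID, HID))
Wh, Uh = rng.normal(size=(HID, FEAT)), rng.normal(size=(HID, HID))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)            # update gate
    r = sigmoid(Wr @ x + Ur @ h)            # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return z * h + (1.0 - z) * h_tilde      # blend old state and candidate

# A sequence of per-frame features extracted from 10 consecutive frames.
frames = rng.normal(size=(10, FEAT))
h = np.zeros(HID)
for x in frames:
    h = gru_step(x, h)

# The final hidden state summarises temporal behaviour across the frames
# and would feed a real/fake classification head.
assert h.shape == (HID,)
```

Because the hidden state is carried across frames, inconsistencies that appear only between frames, which single-image detectors miss, can influence the final representation.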
According to Guera and Delp (Guera, 2018), deepfake videos feature intra-frame abnormalities as well as temporal anomalies between frames. They therefore proposed a temporal-aware pipeline for detecting deepfake videos that uses a CNN and long short-term memory (LSTM) (Guera, 2018). Frame-level features are extracted by the CNN and then fed into the LSTM to build temporal sequence descriptors. Finally, based on the sequence descriptor, a fully connected network is used to classify doctored videos from genuine ones, as seen in Fig. 5.

Fig. 5:
A deepfake recognition method that uses a convolutional neural network (CNN) and long short-term memory (LSTM) to extract temporal information from a video sequence and represent it with a sequence descriptor.
The sequence descriptor is used to calculate the probability that the frame sequence belongs to either the authentic or the deepfake class, using a detection network with fully connected layers (Guera, 2018). The use of a physiological signal, such as eye blinking, to detect deepfakes was proposed based on the finding that a person in a deepfake blinks far less frequently than a person in an untampered video (Li, 2018). A normal person blinks between 2 and 10 times per minute, with each blink lasting between 0.1 and 0.4 seconds. Deepfake algorithms, on the other hand, frequently use internet face pictures for training, which typically show people with open eyes (very few pictures online show people with their eyes closed). As a result, without access to photos of individuals blinking, deepfake algorithms cannot build fake faces that blink normally (Li, 2018). The method first breaks videos down into frames, then extracts face regions and subsequently eye areas based on six eye landmarks to distinguish between real and fake videos. After a few pre-processing steps, such as aligning faces and extracting and scaling the bounding boxes of the eye landmark points into fresh sequences of frames, the cropped eye-region sequences are fed into long-term recurrent convolutional networks (LRCN) (Donahue, 2015) for dynamic state prediction. The LRCN consists of a CNN-based feature extractor, LSTM-based sequence learning, and a fully connected layer for state prediction to forecast the probability of eye-open and eye-closed states. Because eye blinking exhibits substantial temporal dependencies, the LSTM helps to capture these temporal patterns efficiently. A blink is defined as a peak above the level of 0.5 with a length of fewer than 7 frames, and the blinking rate is measured from the prediction outcomes.
This method was tested on a web-based dataset of 49 interview and lecture videos, together with fake versions of those videos produced by deepfake algorithms. The experimental results show that the suggested method has promising detection accuracy for fake videos, and it could further take into account the dynamic pattern of blinking, such as excessively rapid blinking, which could be a symptom of video manipulation.
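The final blink-counting step of such a pipeline can be sketched in a few lines. The helper below scans per-frame eye-closure probabilities (as an LRCN might output) for short runs above a threshold; the threshold, run length, and frame rate used here are illustrative parameters, not the published values.

```python
def count_blinks(closure_probs, threshold=0.5, max_len=7):
    """Count blinks in a sequence of per-frame eye-closure probabilities.

    A blink is a run of frames whose predicted closure probability exceeds
    `threshold` and whose length stays under `max_len` frames; longer runs
    are treated as closed eyes rather than blinks.  Parameter values are
    illustrative assumptions.
    """
    blinks, run = 0, 0
    for p in closure_probs:
        if p > threshold:
            run += 1
        else:
            if 0 < run < max_len:
                blinks += 1
            run = 0
    if 0 < run < max_len:       # the sequence may end mid-blink
        blinks += 1
    return blinks

def blink_rate_per_minute(closure_probs, fps=30):
    """Convert a blink count into a per-minute rate for a given frame rate."""
    minutes = len(closure_probs) / (fps * 60.0)
    return count_blinks(closure_probs) / minutes if minutes else 0.0

# Two short blinks (3 frames each) in an otherwise open-eyed sequence.
probs = [0.1] * 20 + [0.9] * 3 + [0.1] * 20 + [0.9] * 3 + [0.1] * 20
assert count_blinks(probs) == 2
```

A video whose computed rate falls far below the normal 2-10 blinks per minute would then be flagged as suspicious.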

2) Visual Artifacts within Frames
As explained in the previous section, techniques for detecting deepfake videos that use temporal patterns across video sequences are generally based on deep recurrent network architectures. In this section, we explore methods that obtain feature maps by decomposing videos into frames and looking at visual artifacts within a single frame. To distinguish between fake and real videos, these features are fed into a deep or shallow classification model. Accordingly, we divide the approaches in this section into two categories: deep and shallow classifiers.

1) Deep classifiers
Deepfake videos are typically made at low resolutions, necessitating an affine face-warping step (i.e., resizing, rotating, and shearing) to match the configuration of the original video. Due to the resolution mismatch between the warped face area and the surrounding context, this step produces artifacts that CNN models such as VGG16 (Simonyan, 2014), ResNet50, ResNet101, and ResNet152 (He, 2016) can identify. On this basis, a deep learning approach for detecting deepfakes was presented that targets artifacts observed during the face-warping phase of the deepfake generation algorithm (Li, 2018). The main advantage of this method is that it does not need deepfake videos to train the detector (Zhou, 2017), which avoids the considerable cost of generating such negative examples.
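The resolution mismatch that this approach exploits is easy to simulate. The numpy sketch below, with an arbitrary random image as a stand-in for a face region, downsamples and re-upsamples a patch (mimicking low-resolution synthesis plus affine upscaling) and measures the loss of high-frequency detail with a crude gradient-energy statistic; a real detector learns such cues with a CNN rather than a hand-written measure.

```python
import numpy as np

rng = np.random.default_rng(2)

def down_up(img, factor=4):
    """Simulate the low-resolution synthesis + upscaling step:
    block-average down by `factor`, then nearest-neighbour upsample back."""
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def high_freq_energy(img):
    """Crude high-frequency measure: mean squared horizontal/vertical gradients."""
    gx = np.diff(img, axis=1)
    gy = np.diff(img, axis=0)
    return (gx ** 2).mean() + (gy ** 2).mean()

frame = rng.random((64, 64))          # stand-in for a pristine face region
warped = down_up(frame)               # stand-in for the pasted deepfake face

# The warped region has measurably less high-frequency detail than the
# pristine one -- the kind of mismatch a CNN detector can learn to spot.
assert high_freq_energy(warped) < high_freq_energy(frame)
```

Because such warped patches can be produced directly from real images, negative training examples come essentially for free, which is the advantage noted above.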

2) Shallow classifiers
The majority of deepfake detection algorithms focus on artifacts or inconsistencies in the intrinsic properties of real versus fake photos or videos.
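Once such handcrafted features are extracted, the classifier itself can be shallow. The sketch below trains a plain logistic-regression model by gradient descent on synthetic feature vectors; the features, cluster means, and hyperparameters are all hypothetical, and any shallow model such as an SVM could stand in its place.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic handcrafted features (imagine sharpness, blink rate, landmark
# jitter): real videos cluster around one mean, fakes around another.
real = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
fake = rng.normal(loc=2.0, scale=1.0, size=(100, 3))
X = np.vstack([real, fake])
y = np.array([0] * 100 + [1] * 100)   # 0 = real, 1 = fake

# Logistic regression trained with plain gradient descent.
w, b = np.zeros(3), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted fake probability
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.5 * (p - y).mean()                # gradient step on bias

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (pred == y).mean()
assert accuracy > 0.8   # well-separated feature clusters are easy to classify
```

The point of the sketch is that when the handcrafted features separate real from fake well, a simple linear decision boundary suffices; the difficulty lies in designing features that stay discriminative as deepfake quality improves.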

DISCUSSION:
People's faith in media content has been eroded by deepfakes, as seeing something no longer equates to believing it. Deepfakes can cause distress and negative consequences for the people targeted, amplify misinformation and hate speech, and even exacerbate political tensions or incite public outrage, violence, or war. This is particularly important today because deepfake technology is becoming more accessible and social media platforms can swiftly propagate false content (Zubiaga, 2018). Owing to the seriousness of the problem, the research community has concentrated on building deepfake detection algorithms, and multiple results have been published. This work has addressed the state-of-the-art methodologies, and Table 2 presents an overview of common approaches. Clearly, a struggle is brewing between those who use powerful machine learning to build deepfakes and those who take up the challenge of recognizing them. As the quality of deepfakes improves, the performance of detection systems has to advance accordingly. The underlying idea is that what AI has broken can also be mended by AI (Floridi, 2018). Detection techniques are still in their infancy; a variety of approaches have been proposed and evaluated, but on scattered datasets. One way to improve detection performance is to create a continuously updated benchmark dataset of deepfakes to validate the ongoing development of detection methods. This would make it simpler to train detectors, especially deep learning models, which require massive training sets (Dolhansky, 2020). Current detection methods, on the other hand, mostly focus on the flaws of deepfake generation pipelines. In adversarial contexts, where attackers often try not to reveal their deepfake creation methods, this kind of knowledge is not always available.
The deepfake detection task has become more complex as a result of recent work on adversarial perturbation attacks that deceive DNN-based detectors (Hussain, 2021) (Yang, 2021). These are real obstacles to the creation of detection systems, and future studies should focus on developing more reliable, adaptable, and generalizable methods. Another line of inquiry is to integrate detection systems into production platforms such as social media to ensure their effectiveness in grappling with the widespread influence of deepfakes. On these platforms, a screening or filtering mechanism with effective detection methods can be deployed to ease deepfake detection (Citron, 2018). Legal obligations could be imposed on the internet corporations that own these platforms, requiring them to promptly delete deepfakes in order to mitigate their effects. Photo editing tools could also be embedded into the devices people use to create digital content, producing immutable metadata that preserves originality details such as the time and place of audiovisual capture, together with an attestation that the content is untampered (Citron, 2018).
Such provenance tracking is difficult to implement, so disruptive blockchain technology could be a viable solution. Blockchains have been actively employed in a variety of fields, but there has been little research tackling deepfake detection with this technology. A blockchain is an excellent tool for digital provenance because it can construct a chain of unique, immutable metadata blocks. Although applying blockchain systems to this challenge has produced some promising findings (Hasan, 2019), this study area is still in its infancy. It is necessary to use detection tools to recognize deepfakes, but it is even more critical to grasp the true intentions of those who publish them. Users must appraise a deepfake in the social context in which it is encountered, such as who circulated it and what they said about it. This is important because deepfakes are becoming increasingly lifelike, and detection software is expected to fall behind. It is thus worthwhile to conduct research on the social aspects of deepfakes in order to support users in making such judgments.
In police investigations and criminal trials, photographs and videos have routinely been used as evidence. Digital media forensics professionals with a degree in computing or law enforcement and skill in collecting, reviewing, and analyzing digital material may present them as evidence in a court of law. Because even experts are unable to discern manipulated content, machine learning and AI technologies may have been used to modify this digital content, and the experts' personal opinions may not be enough to verify such evidence. Given the development of a wide range of digital manipulation tools, this must be recognized in today's courtrooms whenever photographs and videos are used as evidence to convict perpetrators (Maras, 2019). Before digital-content forensics results can be used in court, they must be proved authentic and reliable. This necessitates meticulous documentation of every step of the forensics process as well as of the methodology used to acquire the results. AI and machine learning algorithms can support the determination of the authenticity of digital media and have provided accurate and reliable results, yet most of these algorithms are not explainable. This is a significant obstacle to the use of AI in forensics: not only do forensics experts lack experience with computer algorithms, but computer professionals are also unable to properly explain the results, because most of these algorithms are black-box models (Malolan, 2020).
This is especially significant because the most recent and most accurate models are based on deep learning methods that involve huge numbers of neural network parameters. As a result, explainable AI in computer vision is a research direction that is required to promote the use of AI and machine learning advances in digital media forensics.

CONCLUSION:
Technologies based on deep learning, such as deepfakes, have been advancing at an unprecedented rate in recent years. The global pervasiveness of the internet makes it possible for malicious face-manipulated videos to be distributed rapidly, posing a threat to social order and personal safety. To mitigate the negative effects of deepfake videos, research groups and commercial companies around the world are conducting relevant studies. We first presented deepfake video generation technology, followed by existing detection technology, and finally future research directions. This review places particular emphasis on the problems of current detection algorithms and on promising research, especially generalization and robustness. We hope this article proves useful for researchers interested in deepfake detection and in limiting the negative impact of deepfake videos.

ACKNOWLEDGEMENT:
We would like to thank our parents and all of our teachers for their mental and financial support.

CONFLICTS OF INTEREST:
Researchers may use this work for research purposes only. The authors declare no conflicts of interest.