Face Frontalization is a generative computer vision task in which the model takes as input a photo of a person's head captured at an angle between -90 and 90 degrees, and produces an image of what that person's frontal (i.e. 0-degree) view of the face might look like. The present model was first released in this repository by Scaleway, a European cloud provider originating from France. It was previously discussed in a Scaleway blog post and presented at the DataXDay conference in Paris. The model's GAN architecture was inspired by the work of R. Huang et al.
Model description
The Face Frontalization model is the Generator part of a GAN that was trained in a supervised fashion on profile-frontal image pairs. The Discriminator was based on a fairly standard DCGAN architecture, where the input is a 128x128x3 image that is processed through multiple convolutional layers and classified as either Real or Fake. The Generator had to be modified in order to fit the supervised learning scenario. It consists of convolutional layers (the Encoder), which compress the input image into a 512-dimensional hidden representation, followed by a Decoder made up of deconvolutional layers, which produces the output image. For more details on the model's architecture, see this blog post.
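For illustration, a minimal PyTorch sketch of such an encoder-decoder Generator is shown below; the layer counts, channel widths, and activation choices are assumptions made for readability and do not reproduce the released network package exactly:

import torch.nn as nn

class FrontalizationGenerator(nn.Module):
    # Illustrative encoder-decoder Generator: convolutional layers down to a
    # 512-dimensional bottleneck, then deconvolutional layers back up to 3 x 128 x 128
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                         # input: 3 x 128 x 128
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),     # 64 x 64 x 64
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),   # 128 x 32 x 32
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),  # 256 x 16 x 16
            nn.Conv2d(256, 512, 4, 2, 1), nn.LeakyReLU(0.2),  # 512 x 8 x 8
            nn.Conv2d(512, 512, 4, 2, 1), nn.LeakyReLU(0.2),  # 512 x 4 x 4
            nn.Conv2d(512, 512, 4, 1, 0), nn.ReLU())          # 512 x 1 x 1 hidden representation
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 512, 4, 1, 0), nn.ReLU(),  # 512 x 4 x 4
            nn.ConvTranspose2d(512, 512, 4, 2, 1), nn.ReLU(),  # 512 x 8 x 8
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),  # 256 x 16 x 16
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 128 x 32 x 32
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 64 x 64 x 64
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())     # 3 x 128 x 128 output in [-1, 1]

    def forward(self, x):
        return self.decoder(self.encoder(x))

A Tanh-style output layer like the one sketched above would also explain why the generated pixel values lie between -1 and 1 in the inference example further down.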
Intended uses & limitations
The present Face Frontalization model was not intended to represent the state of the art for this machine learning task. Instead, the goals were:
(a) to demonstrate the benefits of using a GAN for supervised machine learning tasks (whereas the original GAN is an unsupervised generative algorithm; see this conference talk for more details);
(b) to show how a complex generative computer vision project can be accomplished on a Scaleway cloud RENDER-S instance within roughly a day.
How to use
The Face Frontalization model is a saved Pytorch model that can be loaded provided the included network package is present in the directory. It takes in 3-channel color images resized to 128x128 pixels in the form of [N, 3, 128, 128] tensors (where N is the size of the batch). Ideally, the input images should be closely-cropped photos of faces, taken in good lighting conditions. Here is how the model can be used for inference with a gradio image widget, e.g. in a Jupyter notebook:
import gradio as gr
import numpy as np
import torch
from torchvision import transforms
from torch.autograd import Variable
from PIL import Image
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Load the saved Frontalization generator model
saved_model = torch.load("./generator_v0.pt", map_location=torch.device('cpu'))
def frontalize(image):
    # Convert the input image to a [1, 3, 128, 128]-shaped torch tensor
    # (as required by the frontalization model)
    preprocess = transforms.Compose([transforms.ToPILImage(),
                                     transforms.Resize(size = (128, 128)),
                                     transforms.ToTensor()])
    input_tensor = torch.unsqueeze(preprocess(image), 0)
    # Use the saved model to generate an output whose values lie between -1 and 1;
    # these need to be rescaled to [0, 1] before the output is displayed
    generated_image = saved_model(Variable(input_tensor.type('torch.FloatTensor')))
    generated_image = generated_image.detach().squeeze().permute(1, 2, 0).numpy()
    generated_image = (generated_image + 1.0) / 2.0
    return generated_image
iface = gr.Interface(frontalize, gr.inputs.Image(type="numpy"), "image")
iface.launch()
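The same frontalize function can also be used without the gradio widget, for instance to process a single image file. The sketch below reuses the imports and the saved_model defined above; the file names are placeholders:

# Stand-alone inference reusing the frontalize function defined above
input_image = np.array(Image.open("profile_photo.jpg").convert("RGB"))
output_array = frontalize(input_image)
# Rescale the [0, 1] float output to 8-bit values and save the result
Image.fromarray((output_array * 255).astype(np.uint8)).save("frontalized.jpg")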
Limitations and bias
As mentioned in the Intended uses & limitations section, the present model's performance is not intended to compete with the state of the art. Additionally, since the training data contained a disproportionately high number of images of Caucasian and Asian males in their 20s, the model does not perform as well on images of people outside this limited demographic.
Training data
The present model was trained on the CMU Multi-PIE Face Database, which is available commercially. The input images were closely cropped to include the face of a person photographed at an angle between -90 and 90 degrees. The target frontal images were cropped and aligned so that the center of the person's left eye was at the same relative position in all of them. Precise alignment of the target images turned out to play a key role in the training of the model.
Training procedure
The training of the model was performed in a similar manner to that of a regular unsupervised GAN, except that in addition to the binary cross-entropy loss for the Discriminator, a pixelwise loss function was introduced for the Generator (see the blog post for details). The exact weights given to the L1 and L2 pixelwise losses, as well as to the BCE (GAN) loss, were as follows:
L1_factor = 1
L2_factor = 1
GAN_factor = 0.001
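In other words, the Generator's objective combined the L1 and L2 pixelwise distances to the ground-truth frontal image with a small adversarial BCE term. A minimal sketch of how such a combined loss could be assembled with the weights above is shown below; the function and argument names are illustrative, not the ones used in the actual training code:

import torch
import torch.nn as nn

criterion_L1 = nn.L1Loss()
criterion_L2 = nn.MSELoss()
criterion_BCE = nn.BCELoss()

def generator_loss(generated, target, pred_fake):
    # Pixelwise terms compare the generated image with the ground-truth frontal view;
    # the adversarial term rewards the Generator for fooling the Discriminator
    adversarial = criterion_BCE(pred_fake, torch.ones_like(pred_fake))
    return (L1_factor * criterion_L1(generated, target)
            + L2_factor * criterion_L2(generated, target)
            + GAN_factor * adversarial)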
The model was trained for 18 epochs with a training batch size of 30. The following optimizers were used for the Discriminator and the Generator:
optimizerD = optim.Adam(netD.parameters(), lr = 0.0002, betas = (0.5, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr = 0.0002, betas = (0.5, 0.999), eps = 1e-8)
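Putting the pieces together, a single training iteration alternating the Discriminator and Generator updates could be sketched roughly as follows; dataloader, netD, netG, and the generator_loss helper are assumed to follow the definitions above and in the blog post, and the details may differ from the actual training code:

for profile, frontal in dataloader:
    # Discriminator update: real frontal images vs. detached Generator outputs
    optimizerD.zero_grad()
    generated = netG(profile)
    pred_real = netD(frontal)
    pred_fake = netD(generated.detach())
    lossD = (criterion_BCE(pred_real, torch.ones_like(pred_real))
             + criterion_BCE(pred_fake, torch.zeros_like(pred_fake)))
    lossD.backward()
    optimizerD.step()

    # Generator update: pixelwise losses plus the adversarial term
    optimizerG.zero_grad()
    lossG = generator_loss(generated, frontal, netD(generated))
    lossG.backward()
    optimizerG.step()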
Evaluation results
GANs are notoriously difficult to train, with the losses for the Discriminator and the Generator often failing to converge even when the model produces what looks like a highly realistic result to the human eye. The pixelwise loss on the test images is also a poor indicator of the model's performance, because any variation in lighting between the real target photo and the generated image can result in a deceptively high discrepancy between the two. The best remaining evaluation method is manual inspection of the generated results. We have found that the present model performs reasonably well on the test data from the CMU Multi-PIE Face Database (naturally, all photos of the individuals included in the test set were excluded from training):
(Top row: inputs; middle row: model outputs; bottom row: ground truth images)