From Theory to Practice: Real-Time Facial Recognition with PyTorch FaceNet
Cutting-edge technology that enables rapid and accurate detection of human faces in real-world settings
Real-time facial recognition is an advanced technology that has revolutionized the field of computer vision. It enables rapid and accurate detection of human faces in real-world settings, making it an exciting area of research. One of the most powerful tools in this area is FaceNet, a deep learning model that can recognize faces in real time.
Introduction to FaceNet
FaceNet is a deep neural network trained to map faces into a high-dimensional space, where faces of the same person sit close together and faces of different people lie far apart. Its training follows a Siamese-style (more precisely, triplet) setup, in which three identical copies of the network share the same weights and extract features from three input images.
The network uses a triplet loss function, which takes three images as input: an anchor image, a positive image (another image of the same person as the anchor), and a negative image (an image of a different person). The triplet loss pulls the anchor and positive embeddings together in the high-dimensional space while pushing the anchor and negative embeddings apart, requiring the anchor-negative distance to exceed the anchor-positive distance by at least a fixed margin.
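To make the objective concrete, here is a minimal sketch of a triplet loss in PyTorch. The margin of 0.2 follows the original FaceNet paper, but treat this as an illustration rather than the exact training code behind the pretrained model we use later:

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances between embedding batches of shape [N, 512]
    pos_dist = (anchor - positive).pow(2).sum(dim=1)
    neg_dist = (anchor - negative).pow(2).sum(dim=1)
    # Penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive
    return F.relu(pos_dist - neg_dist + margin).mean()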
FaceNet uses the Inception-ResNet architecture, which combines Inception modules with residual connections. This architecture allows the network to represent faces as numeric embeddings: high-dimensional vectors (512 dimensions in the implementation below) that capture the distinguishing features of a person's face. By comparing embeddings with a similarity measure such as cosine similarity, or with Euclidean distance as the code in this post does, a face recognition system can determine whether two faces belong to the same person.
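As a quick illustration of comparing two embeddings, the snippet below uses random tensors as stand-ins for real face embeddings; the 0.7 threshold is an arbitrary placeholder that a real system would tune on validation data:

import torch
import torch.nn.functional as F

# Stand-ins for two 512-dimensional face embeddings (batch size 1)
emb1 = torch.randn(1, 512)
emb2 = torch.randn(1, 512)

similarity = F.cosine_similarity(emb1, emb2)  # values in [-1, 1]
same_person = similarity.item() > 0.7         # illustrative threshold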
Implementation
Now that we have covered the technical details of FaceNet and how embeddings are compared in face recognition, it's time to put this knowledge into code. In the upcoming sections, we will preprocess images, generate facial embeddings with FaceNet, and compare those embeddings with a distance measure. Before we begin, make sure all the necessary libraries are installed:
pip install torch facenet_pytorch opencv-python tqdm
To enable the system to recognize specific individuals, we pre-save images of their faces. In this particular case, I created a folder named "./saved" and populated it with one image of myself and two other random images.
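The loading code later in this post derives each person's label from the filename (everything before the extension), so the folder might look like this, with hypothetical filenames:

saved/
├── me.jpg
├── person1.jpg
└── person2.jpg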
With the necessary libraries installed and the images for recognition in place, we are now ready to build our system. The initial step in this process involves creating a new Python file, named main.py, in which all the required steps will be implemented.
To begin building our facial recognition system, we first import the necessary packages and load two pre-trained models:
- MTCNN (Multi-Task Cascaded Convolutional Neural Networks), which detects faces and facial landmarks in images; it is used to detect and crop the faces appearing on screen before passing the output to the next model.
- InceptionResnetV1, a neural network trained on a large-scale face recognition dataset (VGGFace2) that generates 512-dimensional embeddings encoding facial features.
import os
import cv2
import torch
from facenet_pytorch import InceptionResnetV1, MTCNN
from tqdm import tqdm
from types import MethodType
### helper function
def encode(img):
    # Run the face crop(s) through InceptionResnetV1 to get 512-dim embeddings
    res = resnet(torch.Tensor(img))
    return res
def detect_box(self, img, save_path=None):
    # Detect faces
    batch_boxes, batch_probs, batch_points = self.detect(img, landmarks=True)
    # Select faces
    if not self.keep_all:
        batch_boxes, batch_probs, batch_points = self.select_boxes(
            batch_boxes, batch_probs, batch_points, img, method=self.selection_method
        )
    # Extract faces
    faces = self.extract(img, batch_boxes, save_path)
    return batch_boxes, faces
### load model
resnet = InceptionResnetV1(pretrained='vggface2').eval()
mtcnn = MTCNN(
    image_size=224, keep_all=True, thresholds=[0.4, 0.5, 0.5], min_face_size=60
)
mtcnn.detect_box = MethodType(detect_box, mtcnn)
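The last line deserves a note: MethodType binds our detect_box function to the mtcnn instance so that self refers to that instance. Here is a self-contained toy example of the same trick, with made-up names:

from types import MethodType

class Greeter:
    name = "facenet"

def shout(self):
    return self.name.upper()

g = Greeter()
g.shout = MethodType(shout, g)  # bind `shout` to this specific instance
print(g.shout())                # prints: FACENET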
After loading the required models, we can generate vector representations of our stored images. The code below reads the images from a specified directory, detects the faces in them using MTCNN, and encodes those faces with the InceptionResnet model. The resulting face embeddings are stored in a dictionary, all_people_faces, where the keys are the names of the people in the images and the values are the corresponding face embeddings.
### get encoded features for all saved images
saved_pictures = "./saved/"
all_people_faces = {}

for file in os.listdir(saved_pictures):
    person_face, extension = file.split(".")
    img = cv2.imread(f'{saved_pictures}/{person_face}.jpg')
    cropped = mtcnn(img)
    if cropped is not None:
        all_people_faces[person_face] = encode(cropped)[0, :]
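As an optional sanity check, each stored embedding should be a 512-dimensional vector; a quick loop like this (purely illustrative) confirms it:

for name, emb in all_people_faces.items():
    print(name, emb.shape)  # expected: torch.Size([512])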
Once all the saved images have been processed into vector embeddings, we can run live detection and compare each incoming frame against the saved vectors. In the code below:
- The detect() function is defined, which takes two optional parameters: cam (the camera index to use, defaulting to 0 for the default camera) and thres (the recognition threshold; a face whose embedding distance falls below this value is treated as a match).
- Within detect(), a video capture object vdo is created using cv2.VideoCapture(cam). vdo.grab() and vdo.retrieve() continuously grab and process video frames from the camera.
- The mtcnn.detect_box() function detects faces in the current frame and returns two outputs: batch_boxes (a list of bounding boxes around detected faces) and cropped_images (a list of cropped face images).
- The encode() function is called on each cropped face image to obtain an embedding vector for the face.
- A dictionary detect_dict is created to store the Euclidean distances between the embeddings of the detected faces and the embeddings of known faces stored in all_people_faces.
- The person with the smallest distance is identified by finding the key with the minimum value in detect_dict. If that smallest distance is greater than or equal to thres, the face is labeled "Undetected".
- The bounding box around the detected face is drawn using cv2.rectangle(), and the person's name (or "Undetected") is added as text using cv2.putText().
def detect(cam=0, thres=0.7):
    vdo = cv2.VideoCapture(cam)
    while vdo.grab():
        _, img0 = vdo.retrieve()
        batch_boxes, cropped_images = mtcnn.detect_box(img0)

        if cropped_images is not None:
            for box, cropped in zip(batch_boxes, cropped_images):
                x, y, x2, y2 = [int(x) for x in box]
                img_embedding = encode(cropped.unsqueeze(0))
                detect_dict = {}
                for k, v in all_people_faces.items():
                    detect_dict[k] = (v - img_embedding).norm().item()
                min_key = min(detect_dict, key=detect_dict.get)

                if detect_dict[min_key] >= thres:
                    min_key = 'Undetected'

                cv2.rectangle(img0, (x, y), (x2, y2), (0, 0, 255), 2)
                cv2.putText(
                    img0, min_key, (x + 5, y + 10),
                    cv2.FONT_HERSHEY_DUPLEX, 0.5, (255, 255, 255), 1)

        ### display
        cv2.imshow("output", img0)
        if cv2.waitKey(1) == ord('q'):
            cv2.destroyAllWindows()
            break

if __name__ == "__main__":
    detect(0)
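One convenient detail: cv2.VideoCapture also accepts a file path, so the same loop can be tested against a recorded clip instead of a live webcam (the filename here is hypothetical):

detect(cam="test_clip.mp4")  # run the pipeline over a saved video file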
Finally, we can run the script using
python main.py
The webcam will capture each frame, detect any faces present, and attempt to match them against the faces saved in the database. Each detected face is then labeled with the best-matching name, or with "Undetected" if no saved face is close enough. The image below shows the output of the script, which successfully detects my own face and associates it with my name.
Conclusion
In this blog post, we introduced the FaceNet model, a deep learning architecture that can be used to generate highly accurate facial embeddings. We also provided a code example that utilized the MTCNN face detection algorithm and the FaceNet model to match faces in real-time video streams with pre-existing images. While facial recognition is a powerful tool, it is important to use it responsibly and ensure that privacy concerns are addressed. Nonetheless, the potential benefits of this technology are vast, and as FaceNet and other deep learning models continue to improve, it is exciting to consider the many ways that facial recognition can be applied to improve our lives.