
Machine vision will play a significant role in the next generation of IR 4.0 systems. Recognition and analysis of faces are essential in many vision-based applications. Deep learning provides the thrust for advances in visual recognition, and the Convolutional Neural Network (CNN) is an important tool for visual recognition tasks. However, 2D methods for machine vision suffer from Pose, Illumination, and Expression (PIE) challenges and from occlusions. 3D Face Recognition (3DFR) is very promising for dealing with PIE and a certain degree of occlusion, and it is suitable for unconstrained environments. However, 3D data is highly irregular, which affects the performance of deep networks. Most 3D face recognition models are implemented from a research perspective, and a complete 3DFR application is rarely found. This work attempts to implement a complete, robust, end-to-end 3DFR pipeline. For this purpose, we implemented CuteFace3D. This face recognition model is trained on a highly challenging dataset on which the state-of-the-art model had below 95% accuracy. An accuracy of 98.89% is achieved on the Intellifusion test dataset. Further, for open-world and unseen-domain adaptation, embedding-based classification is performed using KNN. A complete FR pipeline for RGBD face recognition is then implemented using a RealSense D435 depth camera. With the KNN classifier and k-fold validation, we achieved 99.997% accuracy for the open-set RGBD pipeline on registered users. The proposed method with early-fusion four-channel input is found to be more robust and achieves higher accuracy on the benchmark dataset.


INTRODUCTION
The current wave of Industrial Revolution 4.0 (IR 4.0) will rely mainly upon machine vision to meet the need for industrial automation. Robots that work alongside human coworkers, known as collaborative robots or cobots, complete tasks in various environments and play a vital role in IR 4.0. However, many of the robots developed are blind. Hence, high-precision machine vision will be a critical part of IR 4.0, making intelligent machines capable of interacting in collaborative environments and making decisions [1]. Many such applications require face analysis and recognition [2] [3].
Face Recognition (FR) is a method of identifying or validating a person's identity. The human face has highly non-rigid characteristics with very discriminative features, and humans can identify each other with ease. Face identification by computers started as early as the 1960s and became popular with Eigenfaces [4] in the 1990s. With the advancement in technologies, it became well known as a non-invasive biometric. Significant advancements in face recognition techniques can be grouped into four phases, whose milestones can be called i) Holistic learning, ii) Locally handcrafted techniques, iii) Shallow learning, and iv) Deep learning. Phase I uses holistic approaches; it dominated the 1990s and lasted until the early 2000s [5] [6]. Locally handcrafted feature extraction became popular in the early 2000s [7]. In the next decade, shallow learning with local features reached an accuracy of 95% on the LFW dataset [7]. The breakthrough in deep learning technology, driven by improved computer hardware and algorithms and by the availability of large datasets, opened a revolutionary new phase with the advent of AlexNet [8] in 2012. DeepFace [9] and DeepID [10] achieved state-of-the-art performance in 2014, and research has since shifted to deep-learning-based approaches. It took three decades to raise shallow recognition accuracy from 60% to 90%. In comparison, deep learning raised performance on the LFW dataset to 99.8% in just three years using a deeper pipeline.
2D machine vision has many limitations [7], such as parallax, which is an apparent displacement of an object due to a change in perspective and depth of focus. Other issues faced by FR are changing ambient light and variations in contrast. 2D methods are also prone to spoofing and other attacks. 3D data, in contrast, is rich in information, and 3D cameras are becoming affordable and prevalent with the advancement in camera technology. This work proposes a complete, real-time, implementable 3D face recognition pipeline for practical use.
The paper is organized as follows. Section 1 introduces face recognition. Section 2 provides a literature review and discusses advancements in 3DFR. Section 3 describes the methodology and the components of the proposed system in detail. Section 4 presents the results of the proposed multimodal 3D deep face recognition model as a feature extractor for RGBD face recognition applications in an open world. Section 5 concludes the paper.

Literature Review
In 2012, researchers began utilizing deep learning for visual tasks on ImageNet [8]. Deep CNNs have a significant advantage over traditional methods for processing images and videos, whereas Recurrent Neural Networks (RNNs) process sequential data such as voice and text [11] [12]. Zhou et al. [13] proposed a real-time 3DFR system that employs a trained two-level cascade classifier and preprocesses RGB and depth data. Goswami et al. [14] suggested unifying 2D and 3D information to accomplish hybrid face recognition, applying entropy and saliency techniques to construct a descriptor and utilizing geometrical analysis of 3D fiducial points.
Large-scale face datasets used to train deep learning models improve recognition accuracy. With the assistance of large datasets, deep learning models can learn facial features and capture rich internal information from the data. Massive 2D face datasets can be built by scraping data from the internet. Because large-scale 3D face datasets are lacking, it is more challenging to train discriminative deep features for 3D facial models than for 2D ones. To address this problem, Kim et al. [15] proposed using a frontal 3D scan to produce a 2.5D depth map and extracting the depth-map features with the VGG16 network to represent the 3D face. VGG Face gave an excellent result on Bosphorus (99.240%). Except on the Bosphorus dataset, however, their results do not outperform the state-of-the-art conventional methods. Zhang et al. [16] proposed an expression- and pose-invariant 3D face recognition method that directly takes 3D point clouds as input. However, its performance lags considerably behind FR3DNet, with or without finetuning. It also needs an effective mechanism to handle the distribution gap between synthesized data and real 3D faces. Gilani et al. [17] proposed FR3DNet, a specialized deep CNN model trained on a large dataset for 3D face recognition. FR3DNet uses three-channel images generated from 3D point cloud data. However, Zhang et al. [16] and Gilani et al. [17] both use a synthetic 3D face dataset for training. FR3DNet uses two more maps than Kim et al. [15]. Using more channels helps minimize the loss of 3D information but incurs additional memory and computation costs.
It should be remarked that 3D facial recognition is still an open field for improvement, either because it demands high computational power or because it lacks a large dataset to train algorithms or validate results. A complete survey of 3D facial recognition is presented in [18] [19]. The proposed method should be a 3DFR application capable of running on embedded platforms like [20]. Modern 2D face recognition applications use large training datasets of millions of images and challenging test benchmarks. However, face recognition applications are deployed in different scenarios in the unconstrained real world and must deal with unseen data. Generalized face recognition is more challenging and less studied. A generalized face recognition system should deal with unseen domains without updating or finetuning the deep learning model. In 3D, such a face recognition application is rarely attempted in the literature. Without making any assumptions about the target domain, this work aims to investigate how 3D modalities can contribute to the construction of a generalized face recognition system.

Methodology
The proposed methodology is carried out in the following stages. First, the CNN architecture for 3D face recognition is designed, trained, and finetuned on a challenging dataset. Next, the robust 3D face recognition model developed in the previous stage is used to develop a classifier for an open-set application gallery. Finally, the 3D deep FR pipeline for the open world is implemented for effective recognition. Each stage is discussed in the following subsections.

CNN Architecture
The backbone of CuteFace3D is similar to the ResNet-50 architecture. Our CNN architecture, illustrated in Figure 1, uses residual units and takes a four-channel input of 224×224 instead of three-channel input images. The last layer is the widely used SoftMax layer, which classifies the 1200 identities from the Intellifusion dataset. The SoftMax loss is presented as follows:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}}$$
where $x_i \in \mathbb{R}^d$ is the deep extracted feature of the i-th sample, which belongs to class $y_i$; $W_j$ and $b_j$ are the weight vector and bias of the j-th class in the last layer; $N$ is the batch size; and $n$ is the number of classes. The embedding from the previous layer (avgpool), with a vector size of 2048, can be extracted for open-set recognition challenges. These embeddings need to be more discriminative. However, the softmax loss does not explicitly optimize the embedding to enforce higher similarity among intraclass samples and diversity among interclass samples. This gap is evident in the literature when SoftMax is used in deep CNNs for face recognition: intraclass variations [21,22,23,24] are not handled effectively by the SoftMax function.
This situation is well suited to studying the effect of the extra channel in 3D multichannel face recognition. If the embeddings are highly discriminative despite being trained with the SoftMax loss, without a specialized loss function such as ArcFace [25] or the special metric learning used in SphereFace [26] for learning large-margin features, the discriminative power can be attributed to the fusion of the depth channel with RGB. Identifying the significance of the additional input channel in multichannel 3D face recognition is one of the primary motives of this work.
Figure 1 shows that, after the modifications, the number of parameters is not substantially increased compared to ResNet-50. The overall size of the model has increased by only about 2MB. The ReLU activation function is used so that the impact of fusion alone can be assessed. Future work can experiment with CReLU or PReLU, as used in previous studies.
Most 3D face recognition applications use training datasets of a few thousand scans and small test benchmarks. In this work, a training dataset of nearly four hundred thousand RGB-D scans and over forty thousand test images from the Intellifusion dataset are used. The Intellifusion dataset is described in Section 3.4.
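The SoftMax loss discussed above can be illustrated with a minimal sketch in plain Python. It assumes the standard softmax cross-entropy averaged over a batch; the function name `softmax_loss` and the toy logits are hypothetical and are not taken from the CuteFace3D implementation.

```python
import math

def softmax_loss(logits, labels):
    """Mean softmax cross-entropy over a batch.

    logits: list of per-sample score lists (one score per class)
    labels: list of ground-truth class indices, one per sample
    """
    total = 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[y]  # equals -log(softmax probability of class y)
    return total / len(labels)

# Toy batch (hypothetical numbers): two samples, three classes.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 2]
loss = softmax_loss(batch_logits, batch_labels)
```

The loss only rewards a high score for the correct class relative to the others; it contains no term that pulls embeddings of the same identity together, which is the limitation the paragraph above refers to.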

Training and Finetuning
The CNN-based FR model with multimodal images is developed using the Intellifusion dataset. It is a high-quality dataset that covers variations in age, ethnicity, expression, and occlusion. The training phase of CuteFace3D is illustrated in Figure 2. Training faces preprocessed using MTCNN are fed to the CNN with a SoftMax layer. The Adam optimizer is used with a learning rate of 0.001, and the learning rate is reduced by a factor of 0.1 after every seven epochs. The model is trained for 50 epochs, and test accuracy is calculated using SoftMax for the final prediction. This 3DFR model is named CuteFace3D as a reference to the Center for Unmanned Technologies (CUTe).
Further, as shown in Figure 3, the trained CuteFace3D model is used with its fully connected (FC) layers dropped to extract RGBD face embeddings. The embedding extracted for each RGBD face scan is a deep feature vector of size 2048. The embedding of each scan in the gallery is fed as input to the classifier. A similarity or dissimilarity metric or a classifier can then be employed, as shown in Figure 3. This work compares the results of classifiers such as KNN for different values of k. If the embeddings are robust and discriminative, the model is worth using in unseen domains and open-world scenarios.
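The learning-rate schedule described above can be sketched as follows. This is an illustration under two assumptions: that "decreases by 0.1" means the rate is multiplied by 0.1, and that epochs are counted from zero. The function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.001, step_size=7, gamma=0.1):
    """Learning rate at a given 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate for each of the 50 training epochs.
schedule = [step_decay_lr(e) for e in range(50)]
```

Under these assumptions the rate is 0.001 for epochs 0 to 6, 0.0001 for epochs 7 to 13, and so on, with seven reductions over the 50 epochs.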
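A KNN classifier over gallery embeddings can be sketched in a few lines. The source does not state the distance metric, so Euclidean distance is an assumption here, and the function name `knn_predict`, the subject labels, and the three-dimensional toy vectors (standing in for the 2048-dimensional embeddings) are hypothetical.

```python
import math
from collections import Counter

def knn_predict(query, gallery, labels, k=3):
    """Label of `query` by majority vote among its k nearest gallery embeddings."""
    # Pair each gallery embedding's distance to the query with its label.
    distances = sorted(
        (math.dist(query, emb), label) for emb, label in zip(gallery, labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy gallery: two registered users with two embeddings each.
gallery = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [1.0, 1.0, 1.0], [1.1, 1.0, 1.0]]
labels = ["user_a", "user_a", "user_b", "user_b"]
predicted = knn_predict([0.05, 0.0, 0.02], gallery, labels, k=3)
```

Because the classifier works only on stored embeddings, registering a new user amounts to adding that user's embeddings to the gallery, with no retraining of the CNN.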

3D Deep Face Recognition Pipeline
The proposed methodology of a robust 3D face recognition system comprises four modules, namely a) Image acquisition module, b) Feature extraction module, c) Classifier module, and d) Inference module, as depicted in the system diagram illustrated in Figure 4.
The acquisition module captures a video stream from the RealSense camera using the pyrealsense2 Python wrapper. The depth and RGB streams are aligned and co-registered for further processing. When an aligned pair of frames is available, the frames are converted to NumPy arrays, and the depth map is converted to an 8-bit color map. The dlib deep face detector is then applied to locate the face in the frame. If a face is found, it is cropped, preprocessed, and stored in the registration gallery together with the other user details. Fifty frames are extracted for every user registration. The inference module can also use the acquisition module before applying face recognition.
The feature extraction module comprises the novel CuteFace3D deep learning model, which uses multimodal learning as described. The registered user scans are used to extract feature vectors of size 2048. The feature vectors from the gallery are used to train a classifier, and the best-trained classifier is deployed for the face recognition task. The inference module acquires the image for inference as described for the acquisition module and invokes the inference engine. The captured depth map and face scan are fed to CuteFace3D, and the embedding is extracted. The embedding is then fed to the classifier model for the final prediction. Thus, 3D face recognition can be achieved in a real, open-world environment.
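The 8-bit conversion step of the acquisition module can be illustrated with a small sketch. It covers only a linear scaling of raw depth values to the 0 to 255 range; the clipping distance `max_depth` and the function name `depth_to_uint8` are assumptions, and the camera capture, color mapping, and face detection steps are not shown.

```python
def depth_to_uint8(depth_rows, max_depth=2000):
    """Linearly scale raw depth values to 0..255, clipping at max_depth.

    depth_rows: list of rows, each a list of non-negative integer depth values
    max_depth:  hypothetical clipping distance in the same units as the input
    """
    scaled = []
    for row in depth_rows:
        # Values beyond max_depth saturate at 255.
        scaled.append([min(d, max_depth) * 255 // max_depth for d in row])
    return scaled

# One row of raw depth readings (hypothetical values).
depth_8bit = depth_to_uint8([[0, 1000, 2000, 5000]])
```

Clipping at a fixed distance keeps the face region, which sits close to the camera, spread over most of the 8-bit range instead of being compressed by distant background pixels.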
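Early fusion of the aligned RGB face and depth map into a four-channel input can be sketched as below. The channel order (R, G, B, depth) and the function name `fuse_rgbd` are assumptions; nested lists are used here in place of the arrays a real pipeline would use.

```python
def fuse_rgbd(rgb, depth):
    """Stack an H x W x 3 RGB image and an H x W depth map into H x W x 4."""
    if len(rgb) != len(depth) or any(len(r) != len(d) for r, d in zip(rgb, depth)):
        raise ValueError("RGB and depth must be aligned to the same size")
    # Append the depth value to each pixel's colour channels.
    return [
        [list(pixel) + [d] for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]

# A 1 x 2 toy image: two RGB pixels and their depth values.
fused = fuse_rgbd([[(10, 20, 30), (40, 50, 60)]], [[7, 8]])
```

The size check reflects the requirement that the two streams are co-registered: fusing at the input only makes sense when each depth value belongs to the same pixel as its colour values.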

Training Dataset
The Intellifusion RGB-D dataset contains 403,067 pairs of face images of 1,205 people. Each pair is registered and consists of an RGB image and a depth image. The dataset was issued during the international 3D face recognition algorithm challenge 2019 and incorporates large variations. A few PIE challenges are shown in Figure 5. Depth images are not shown here, as they differ noticeably only with expression and extreme pose and are not affected by illumination or background clutter.
The same train-test split mechanism as X. Xiong et al. [26] is adopted, that is, a 90-10 split for training and testing, respectively. Identities with fewer than ten samples were excluded, and after cleaning the dataset, a total of 361,799 face scans from 1200 identities were used for training. For the test dataset, 40,809 registered pairs were separated using the closed-set approach, which means that every identity in the test set is also present in the training set. The test set thus takes 10% of the data and the training and validation sets together take 90%, which keeps the results comparable.
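The cleaning and closed-set split described above can be sketched as follows. This is an illustration only: the exact split list of Xiong et al. is not reproduced, and the per-identity random split, the seed, and the function name `closed_set_split` are assumptions.

```python
import random

def closed_set_split(samples_by_id, min_samples=10, test_fraction=0.1, seed=0):
    """Return (train, test) lists of (identity, sample) pairs.

    Identities with fewer than `min_samples` samples are dropped, and every
    remaining identity contributes samples to both train and test.
    """
    rng = random.Random(seed)
    train, test = [], []
    for identity, samples in samples_by_id.items():
        if len(samples) < min_samples:
            continue  # exclude identities with too few scans
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test += [(identity, s) for s in shuffled[:n_test]]
        train += [(identity, s) for s in shuffled[n_test:]]
    return train, test

# Toy data: identity "b" has fewer than ten samples and is excluded.
data = {"a": list(range(20)), "b": list(range(5)), "c": list(range(10))}
train, test = closed_set_split(data)
```

Splitting within each identity, rather than across identities, is what makes the protocol closed-set: no test identity is unseen during training.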

Application Data
Data from the RealSense camera are collected to build the 3D face recognition system. The RGB and depth streams are recorded in an uncontrolled environment. For each subject, 50 frames are captured for the gallery. After the registration of users is completed, the FR pipeline is completed by training the SVM and KNN classifiers for the registered users. Figure 6(a) and Figure 6(b) show the raw depth and RGB input images. The preprocessed depth map and RGB face of a subject are illustrated in Figure 6(c) and Figure 6(d).

RESULTS
CuteFace3D converges well when trained and finetuned with the hyperparameters described in Section 3.2. The training loss and accuracy can be seen in Figure 7. The orange line indicates training and the blue line indicates validation, for both loss and accuracy. Figure 7(a) shows the loss of CuteFace3D on the training and validation sets, and Figure 7(b) shows the accuracy of CuteFace3D on the training and validation sets for up to 14 epochs. The training is carried out for 50 epochs and took over seven days on a single Titan Xp GPU system. After 50 epochs, the training accuracy was 99.67%, with a loss of 0.0379, and the evaluation accuracy was 99.77%, with a loss of 0.221. In the future, the CNN can be trained for different tasks and challenges and ensembled for robust face analysis.
The results obtained on the Intellifusion RGB-D dataset surpass the performance of the most advanced method found in the research literature, as seen in Table 1. The CuteFace3D model outperforms the model in [28] by approximately 16% and Model (A) [27] by over 10%. Model (B) uses pre-trained weights and has an accuracy of 94.64%. CuteFace3D has an error rate 4.25 percentage points lower than that of Model (B).
The application gallery was collected for about 85 subjects, as described in Section 3.5, and RGBD embeddings were extracted for 3DFR application development using the CuteFace3D model, as shown in Figure 3. KNN with k-fold validation is then applied for classifier training. The accuracy and error rate of KNN trained on the gallery population are shown for 10-fold validation in Figure 8. The 3D face embeddings of the registered user gallery, classified using KNN with 10-fold validation, achieved an accuracy of 99.997%.
The application development setup with the face recognition output of the proposed system is shown in Figure 9(a), and the recognition with the depth map is shown in Figure 9(b). The person in action is very similar to the user represented in Figure 6 in terms of facial hair and appearance. In addition, spectacles, an extreme expression, a pose change, and a skull cap are introduced. The gallery contained no image of this person with such an extreme expression or an open mouth. Despite such extreme variations, the subject was recognized with good accuracy. Similar results were observed for all the registered users of the 3DFR application.
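The comparison with Model (B) can be worked through as simple arithmetic. The sketch assumes that the 98.89% test accuracy reported for CuteFace3D is the figure being compared against the 94.64% of Model (B); the variable names are illustrative.

```python
cuteface_acc = 98.89  # CuteFace3D test accuracy (%), as reported in the text
model_b_acc = 94.64   # Model (B) accuracy (%), as reported in the text

cuteface_err = 100 - cuteface_acc  # error rate of CuteFace3D (%)
model_b_err = 100 - model_b_acc    # error rate of Model (B) (%)

# Absolute difference, in percentage points.
absolute_gain = cuteface_acc - model_b_acc

# Share of Model (B)'s errors that CuteFace3D removes, in percent.
relative_error_reduction = (model_b_err - cuteface_err) / model_b_err * 100
```

Under this assumption the absolute gain is 4.25 percentage points, which corresponds to removing roughly 79% of the errors made by Model (B).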
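The k-fold validation of a nearest-neighbour classifier on gallery embeddings can be sketched as below. The sketch uses a 1-nearest-neighbour rule with Euclidean distance and contiguous folds, all of which are assumptions; the function names and the toy two-dimensional embeddings are hypothetical. It expects at least k samples, ordered so that every fold mixes identities.

```python
import math

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate_1nn(embeddings, labels, k=10):
    """Mean accuracy of a 1-nearest-neighbour classifier over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(embeddings), k):
        correct = 0
        for i in test_idx:
            # Nearest training embedding to the held-out sample i.
            nearest = min(train_idx, key=lambda j: math.dist(embeddings[i], embeddings[j]))
            correct += labels[nearest] == labels[i]
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Toy gallery: two well-separated users, interleaved so each fold holds both.
toy_embeddings = [[0.01 * i, 0.0] if i % 2 == 0 else [5.0 + 0.01 * i, 5.0] for i in range(20)]
toy_labels = ["user_a" if i % 2 == 0 else "user_b" for i in range(20)]
mean_accuracy = cross_validate_1nn(toy_embeddings, toy_labels, k=10)
```

Each sample is held out exactly once, so the mean accuracy reflects how well gallery embeddings of a registered user identify other scans of the same user.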

CONCLUSION
A complete, robust 3DFR pipeline is successfully implemented and tested. The proposed work has achieved an accuracy of 98.89%, an improvement of over 4% on the state-of-the-art. The improvements are achieved by tweaking the CNN architecture for early RGB and depth fusion, and this method is found to be more discriminative. The model is improved by finetuning hyperparameters such as those of the Adam optimizer and by using PReLU. The proposed 3DFR can work effectively with extreme pose, expression, self-occlusion, facial hair, and spectacles. Detailed experiments on the embedding vector size found that a size of 2048 is optimal for 3DFR applications in an open set, and the domain adaptation pipeline therefore uses an embedding of size 2048. The 3D face embeddings of the registered user gallery, classified using KNN with 10-fold validation, achieved an accuracy of 99.997%. This demonstrates that the proposed 3DFR model can be used effectively in a practical, real-time 3DFR application pipeline. An RGBD camera with low resolution and a low-quality depth map, such as the RealSense D435, can work effectively without incurring additional computational cost for reconstruction or quality enhancement.