3.6. Training of the Facial Emotion Recognition Model Based on a Neural Network Algorithm

Dimin Zhu, Yuxi Fu, Xinjie Zhao, Xin Wang, Hanxi Yi

Static and dynamic data sets are often used for facial expression recognition: JAFFE is a common static image database [32], while CK+ (Cohn-Kanade+) and Fer2013 are image data sets composed of dynamic expression sequences [33, 34]. Since facial expression recognition generally processes dynamic expression sequences, the CK+ and JAFFE data sets are chosen for this research. Table 2 presents the image composition of the two data sets.

Table 2. Image composition of the Fer2013 and CK+ data sets.

In the detection and recognition of facial expressions, the JAFFE and CK+ data sets are commonly used. Moreover, the information contained in these two databases is comprehensive, which can effectively improve model performance and the accuracy of facial emotion recognition. Given the dynamic characteristics of facial expressions, images of six typical expressions from the CK+ data set are chosen for facial emotion recognition. Some face images from the JAFFE data set are shown in Figure 3, and some face images from the CK+ data set are presented in Figure 4.

Figure 3. Human face images from the JAFFE data set (data source: https://blog.csdn.net/akadiao/article/details/79956952).

Figure 4. Human face images from the CK+ data set (data source: https://blog.csdn.net/yinghua2016/article/details/77323537).

In addition to selecting data sets, positioning detection is a critical step in recognizing facial expressions, and its accuracy directly impacts recognition accuracy. It is therefore vital to select an applicable positioning detection algorithm, one that balances accuracy and efficiency. Consequently, the Haar-like algorithm is adopted to describe facial features, and the AdaBoost algorithm is adopted for classification. In feature extraction using Haar-like features, the feature value can be expressed as

feature = a · S_A − b · S_B, with S_A = Σ_(x,y)∈black i(x, y) and S_B = Σ_(x,y)∈white i(x, y),

where S_A represents the black composition (the pixel sum over the black region) and S_B denotes the white composition (the pixel sum over the white region). Besides, a refers to the proportion of the black area within the feature region, b stands for the proportion of the white area, and i(x, y) represents the corresponding pixel value in the image feature interval.
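As an illustration, the Haar-like feature computation can be sketched in Python as below. The weighted-difference form a·S_A − b·S_B and the rectangle coordinate convention are assumptions based on the symbol definitions, not the authors' exact implementation.

```python
import numpy as np

def haar_feature(img, black, white, a=1.0, b=1.0):
    """Haar-like feature value: weighted difference between the pixel sums
    of the black and white sub-rectangles. Rectangles are (x0, y0, x1, y1)
    with exclusive upper bounds; a and b are the area-proportion weights
    from the text (illustrative convention)."""
    x0, y0, x1, y1 = black
    u0, v0, u1, v1 = white
    s_a = img[y0:y1, x0:x1].sum()   # S_A: pixel sum over the black region
    s_b = img[v0:v1, u0:u1].sum()   # S_B: pixel sum over the white region
    return a * s_a - b * s_b
```

In practice this sum is computed in constant time from an integral image rather than by summing pixels directly.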

Furthermore, the feature number can be determined according to

N = X · Y · (W + 1 − w(X + 1)/2) · (H + 1 − h(Y + 1)/2), with X = ⌊W/w⌋ and Y = ⌊H/h⌋. (12)

In (12), W represents the width of the image, H signifies the height of the image, w denotes the width of the rectangle, and h refers to the height of the rectangle. Meanwhile, X represents the magnification factor of the rectangular feature in the horizontal direction, and Y stands for the rectangular feature magnification factor in the vertical direction.
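Under the common reading of (12), in which X and Y are the largest integer scale factors ⌊W/w⌋ and ⌊H/h⌋, the count can be sketched as below; summing it over the five classic Viola-Jones base shapes in a 24 × 24 window reproduces the well-known total of 162,336 features.

```python
def num_haar_features(W, H, w, h):
    """Number of positions and scales of a w-by-h base rectangle inside a
    W-by-H detection window, per Eq. (12). X and Y are the maximum
    horizontal and vertical magnification factors."""
    X = W // w
    Y = H // h
    return int(X * Y * (W + 1 - w * (X + 1) / 2) * (H + 1 - h * (Y + 1) / 2))

# The five classic base shapes: two-, three-, and four-rectangle features.
total = sum(num_haar_features(24, 24, w, h)
            for (w, h) in [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)])
```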

For AdaBoost training of the strong classifier, the N training samples are expressed as

{(x_1, Y_1), (x_2, Y_2), …, (x_N, Y_N)},

where Y_i = 0 indicates negative samples of nonface data, and Y_i = 1 refers to positive samples of face image data. Then, weight initialization is performed. In the case of Y_i = 0, the weight can be written as

w_(1,i) = 1/(2m).

In the case of Y_i = 1, the weight can be expressed as

w_(1,i) = 1/(2l),

where m represents the number of negative samples, and l denotes the number of positive samples. The normalization of the weights can be presented as

w_(t,i) ← w_(t,i) / Σ_(j=1…N) w_(t,j).
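A minimal sketch of the weight initialization and normalization steps, assuming labels are encoded as 0 (nonface) and 1 (face):

```python
import numpy as np

def init_weights(labels):
    """AdaBoost weight initialization: 1/(2m) for each of the m negative
    samples (label 0) and 1/(2l) for each of the l positive samples
    (label 1), followed by normalization so the weights sum to 1."""
    labels = np.asarray(labels)
    m = int((labels == 0).sum())   # number of negative (nonface) samples
    l = int((labels == 1).sum())   # number of positive (face) samples
    w = np.where(labels == 0, 1.0 / (2 * m), 1.0 / (2 * l))
    return w / w.sum()             # normalization step
```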

Then, the model is trained on these features using weak classifiers. The final step of the AdaBoost algorithm is to update the weights and combine the weak classifiers. The strong classifier can be expressed as

H(x) = 1 if Σ_(t=1…T) α_t · h_t(x) ≥ (1/2) Σ_(t=1…T) α_t, and H(x) = 0 otherwise,

where h_t denotes the t-th weak classifier and α_t its voting weight.
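The strong-classifier vote can be sketched as follows; weak classifiers are modeled as callables returning 0 or 1, and the threshold of half the total α mass follows the standard Viola-Jones form (an assumption, since the text does not show the equation).

```python
def strong_classify(x, weak_clfs, alphas):
    """AdaBoost strong classifier: predict 1 when the alpha-weighted vote
    of the weak classifiers reaches half of the total alpha mass."""
    score = sum(a * h(x) for h, a in zip(weak_clfs, alphas))
    return 1 if score >= 0.5 * sum(alphas) else 0
```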

Before training the model on images, preprocessing is applied, since background information in the original images may obstruct recognition and detection; it mainly consists of grayscale conversion, cropping, and normalization. Generally, the CNN performs excellently in image processing, and it is often unnecessary to preprocess images or extract features manually, since the fine-grained feature extraction of the CNN can process the images directly. However, face images usually involve complicated information affected by multiple factors, such as the visual angle and background information, so the image information cannot be exactly extracted by a single operation. The initial image information is therefore processed via grayscale conversion, cropping, and normalization. Grayscale processing converts the color image into a single-channel grayscale image. After this operation, both the influence of light intensity and the computational complexity of the training process decrease, improving the model's training speed. Specifically, the gray value conversion can be calculated according to

Y = 0.299R + 0.587G + 0.114B,

where R represents the red channel in the image, G denotes the green channel, B refers to the blue channel, and Y stands for the gray value.
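The conversion can be sketched as below, vectorized over an H × W × 3 array; the weights 0.299/0.587/0.114 are the standard BT.601 luma coefficients commonly used for this conversion (the text itself does not print the equation).

```python
import numpy as np

def to_grayscale(rgb):
    """Gray value Y = 0.299 R + 0.587 G + 0.114 B, applied per pixel
    to an H x W x 3 RGB array (standard BT.601 weighting)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```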

Furthermore, image cropping is indispensable, since the initial face image contains a considerable amount of disturbance information, which may reduce classification accuracy. Therefore, the face image is cropped before expression recognition. Specifically, a crop factor of 0.7 is applied in the horizontal direction and a crop factor of 0.3 in the vertical direction. After cropping, the image information irrelevant to facial expressions is removed and the image size is considerably smaller, which significantly reduces the workload of the subsequent training. The final preprocessing step is the normalization of the image. During normalization, the initial images of the data sets are rotated by -30°, -15°, 15°, and 30°, respectively, considering that there may be nonfrontal facial images; facial emotion recognition is then performed on the normalized images. The image preprocessing mainly aims to reduce the impact of uneven lighting on facial emotion recognition. In the normalization step, histogram equalization is applied to the image, which can be expressed as

s_k = Σ_(l=0…k) n_l / n,

where n denotes the total number of pixels in the face image, k represents the gray level, and n_l refers to the number of pixels with the l-th gray level.
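The cropping and histogram-equalization steps can be sketched as follows. The geometric reading of the crop factors (keep the central 70% of the width, trim 30% of the height from the top) is an assumption, since the text does not specify how the factors are applied; the equalization maps each gray level k to its cumulative frequency s_k.

```python
import numpy as np

def crop_face(img, keep_w=0.7, trim_h=0.3):
    # One plausible reading of the crop factors: keep the central keep_w
    # fraction of the width and trim trim_h of the height from the top.
    h, w = img.shape[:2]
    dx = int(w * (1 - keep_w) / 2)
    dy = int(h * trim_h)
    return img[dy:, dx:w - dx]

def equalize(img, levels=256):
    # Histogram equalization: map gray level k to s_k = sum_{l<=k} n_l / n,
    # then rescale to the original gray-level range.
    n = img.size
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / n          # s_k for every gray level k
    return ((levels - 1) * cdf[img]).astype(np.uint8)
```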

To train the CNN model, it is essential to use a feasible optimization algorithm to find the model's optimal global solution. The Adam algorithm is an optimization algorithm developed from stochastic gradient descent that combines the characteristics of adaptive gradients and root mean square propagation. The adaptive gradient component gives the algorithm excellent performance in computer vision, and root mean square propagation gives it excellent performance on nonstationary problems. The first moment (mean) of the gradient at time t in the Adam algorithm can be obtained according to

M_t = b1 · M_(t−1) + (1 − b1) · g_t.

The noncentral variance corresponding to the gradient at the second moment can be expressed as

V_t = b2 · V_(t−1) + (1 − b2) · g_t²,

where g_t denotes the gradient at time t, the bias-corrected mean is M̂_t = M_t / (1 − b1^t) with b1 = 0.9, and the bias-corrected noncentral variance is V̂_t = V_t / (1 − b2^t) with b2 = 0.999.

The algorithm updates the parameters according to

θ_(t+1) = θ_t − η · M̂_t / (√V̂_t + ε),

where η is the learning rate and ε is a small constant that prevents division by zero.

The Adam algorithm has excellent convergence performance and requires little memory, so it can solve optimization problems involving large amounts of data and numerous parameters [35]. It is therefore selected as the tool to optimize the neural network.
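A single Adam update can be sketched as below; b1 = 0.9 and b2 = 0.999 follow the values given in the text, while the learning rate and ε are the usual defaults (assumptions, as the text does not state them).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient mean and the
    noncentral variance, bias correction, then the parameter step."""
    m = b1 * m + (1 - b1) * grad          # M_t: mean of the gradient
    v = b2 * v + (1 - b2) * grad ** 2     # V_t: noncentral variance
    m_hat = m / (1 - b1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Iterating this step on a simple objective such as f(x) = x² drives the parameter toward the minimum.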

The cross-entropy loss function measures the difference between two probability distributions and is selected here as the loss function. For discrete variables, it can be calculated according to

H(p, q) = −Σ_x p(x) log q(x),

and for continuous variables according to

H(p, q) = −∫ p(x) log q(x) dx,

where p refers to the correct (true) probability distribution and q represents the predicted distribution.
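For the discrete case, the loss can be sketched as follows; the small ε guarding log 0 is an implementation detail, not part of the formula.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Discrete cross-entropy H(p, q) = -sum_x p(x) log q(x); p is the
    true distribution, q the predicted one (eps guards against log 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q + eps))
```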
