3.6. Training of the Facial Emotion Recognition Model Based on a Neural Network Algorithm

Dimin Zhu, Yuxi Fu, Xinjie Zhao, Xin Wang, Hanxi Yi

Static and dynamic data sets are often used for facial expression recognition: JAFFE is a common static image database [32], while CK+ (Cohn-Kanade+) and Fer2013 are image data sets composed of dynamic expression sequences [33, 34]. Since facial expression recognition generally processes dynamic expression sequences, the CK+ and JAFFE data sets are chosen for this research. Table 2 presents the image composition of the two data sets.

Table 2. Image composition of the Fer2013 and CK+ data sets.

In the detection and recognition of facial expressions, the JAFFE and CK+ data sets are commonly used. Moreover, the information contained in these two databases is comprehensive, which can effectively improve model performance and the accuracy of facial emotion recognition. Given the dynamic characteristics of facial expressions, images of six typical expressions from the CK+ data set are chosen for facial emotion recognition. Some face images from the JAFFE data set are shown in Figure 3, and some face images from the CK+ data set are presented in Figure 4.

Figure 3. Human face images from the JAFFE data set (data source: https://blog.csdn.net/akadiao/article/details/79956952).

Figure 4. Human face images from the CK+ data set (data source: https://blog.csdn.net/yinghua2016/article/details/77323537).

In addition to selecting data sets, positioning detection is a critical step in recognizing facial expressions, and its accuracy directly impacts recognition accuracy. It is therefore vital to select an applicable positioning detection algorithm, one that balances accuracy and efficiency. Consequently, the Haar-like algorithm is adopted to describe facial features, and the AdaBoost algorithm is adopted for classification. In feature extraction using Haar-like features, the feature value can be expressed as

feature = a · S_A − b · S_B, with S_A = Σ_(x,y)∈black i(x, y) and S_B = Σ_(x,y)∈white i(x, y),

where S_A represents the black composition (the pixel sum over the black region) and S_B denotes the white composition (the pixel sum over the white region). Besides, a refers to the proportion of the black area within the feature region, b stands for the proportion of the white area, and i(x, y) represents the corresponding pixel value in the image feature interval.
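As an illustration, the Haar-like feature computation can be sketched in Python as below. The weighted-difference form a·S_A − b·S_B and the rectangle coordinate convention are assumptions based on the symbol definitions, not the authors' exact implementation.

```python
import numpy as np

def haar_feature(img, black, white, a=1.0, b=1.0):
    """Haar-like feature value: weighted difference between the pixel sums
    of the black and white sub-rectangles. Rectangles are (x0, y0, x1, y1)
    with exclusive upper bounds; a and b are the area-proportion weights
    from the text (illustrative convention)."""
    x0, y0, x1, y1 = black
    u0, v0, u1, v1 = white
    s_a = img[y0:y1, x0:x1].sum()   # S_A: pixel sum over the black region
    s_b = img[v0:v1, u0:u1].sum()   # S_B: pixel sum over the white region
    return a * s_a - b * s_b
```

In practice this sum is computed in constant time from an integral image rather than by summing pixels directly.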

Furthermore, the feature number can be determined according to

N = X · Y · (W + 1 − w(X + 1)/2) · (H + 1 − h(Y + 1)/2), with X = ⌊W/w⌋ and Y = ⌊H/h⌋. (12)

In (12), W represents the width of the image, H signifies the height of the image, w denotes the width of the rectangle, and h refers to the height of the rectangle. Meanwhile, X represents the magnification factor of the rectangular feature in the horizontal direction, and Y stands for the rectangular feature magnification factor in the vertical direction.
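Under the common reading of (12), in which X and Y are the largest integer scale factors ⌊W/w⌋ and ⌊H/h⌋, the count can be sketched as below; summing it over the five classic Viola-Jones base shapes in a 24 × 24 window reproduces the well-known total of 162,336 features.

```python
def num_haar_features(W, H, w, h):
    """Number of positions and scales of a w-by-h base rectangle inside a
    W-by-H detection window, per Eq. (12). X and Y are the maximum
    horizontal and vertical magnification factors."""
    X = W // w
    Y = H // h
    return int(X * Y * (W + 1 - w * (X + 1) / 2) * (H + 1 - h * (Y + 1) / 2))

# The five classic base shapes: two-, three-, and four-rectangle features.
total = sum(num_haar_features(24, 24, w, h)
            for (w, h) in [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)])
```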

For AdaBoost training of the strong classifier, the N training samples are expressed as

{(x_1, Y_1), (x_2, Y_2), …, (x_N, Y_N)},

where Y_i = 0 indicates negative samples of nonface data, and Y_i = 1 refers to positive samples of face image data. Then, weight initialization is performed. In the case of Y_i = 0, the weight can be written as

w_(1,i) = 1/(2m).

In the case of Y_i = 1, the weight can be expressed as

w_(1,i) = 1/(2l),

where m represents the number of negative samples, and l denotes the number of positive samples. The normalization of the weights can be presented as

w_(t,i) ← w_(t,i) / Σ_(j=1…N) w_(t,j).
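A minimal sketch of the weight initialization and normalization steps, assuming labels are encoded as 0 (nonface) and 1 (face):

```python
import numpy as np

def init_weights(labels):
    """AdaBoost weight initialization: 1/(2m) for each of the m negative
    samples (label 0) and 1/(2l) for each of the l positive samples
    (label 1), followed by normalization so the weights sum to 1."""
    labels = np.asarray(labels)
    m = int((labels == 0).sum())   # number of negative (nonface) samples
    l = int((labels == 1).sum())   # number of positive (face) samples
    w = np.where(labels == 0, 1.0 / (2 * m), 1.0 / (2 * l))
    return w / w.sum()             # normalization step
```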

Then, the model is trained on these features using weak classifiers. The final step of the AdaBoost algorithm is to update the weights and combine the weak classifiers. The strong classifier can be expressed as

H(x) = 1 if Σ_(t=1…T) α_t · h_t(x) ≥ (1/2) Σ_(t=1…T) α_t, and H(x) = 0 otherwise,

where h_t denotes the t-th weak classifier and α_t its voting weight.
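The strong-classifier vote can be sketched as follows; weak classifiers are modeled as callables returning 0 or 1, and the threshold of half the total α mass follows the standard Viola-Jones form (an assumption, since the text does not show the equation).

```python
def strong_classify(x, weak_clfs, alphas):
    """AdaBoost strong classifier: predict 1 when the alpha-weighted vote
    of the weak classifiers reaches half of the total alpha mass."""
    score = sum(a * h(x) for h, a in zip(weak_clfs, alphas))
    return 1 if score >= 0.5 * sum(alphas) else 0
```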

Before training the model on images, preprocessing is applied, since background information in the original images may obstruct recognition and detection; it mainly consists of grayscale conversion, cropping, and normalization. Generally, the CNN performs excellently in image processing, and it is often unnecessary to preprocess images or extract features manually, since the fine-grained feature extraction of the CNN can process the images directly. However, face images usually involve complicated information affected by multiple factors, such as the visual angle and background information, so the image information cannot be exactly extracted by a single operation. The initial image information is therefore processed via grayscale conversion, cropping, and normalization. Grayscale processing converts the color image into a single-channel grayscale image. After this operation, both the influence of light intensity and the computational complexity of the training process decrease, improving the model's training speed. Specifically, the gray value conversion can be calculated according to

Y = 0.299R + 0.587G + 0.114B,

where R represents the red channel in the image, G denotes the green channel, B refers to the blue channel, and Y stands for the gray value.
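The conversion can be sketched as below, vectorized over an H × W × 3 array; the weights 0.299/0.587/0.114 are the standard BT.601 luma coefficients commonly used for this conversion (the text itself does not print the equation).

```python
import numpy as np

def to_grayscale(rgb):
    """Gray value Y = 0.299 R + 0.587 G + 0.114 B, applied per pixel
    to an H x W x 3 RGB array (standard BT.601 weighting)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```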

Furthermore, image cropping is indispensable, since the initial face image contains a considerable amount of disturbance information, which may reduce classification accuracy. Therefore, the face image is cropped before expression recognition. Specifically, a crop factor of 0.7 is applied in the horizontal direction and a crop factor of 0.3 in the vertical direction. After cropping, the image information irrelevant to facial expressions is removed and the image size is considerably smaller, which significantly reduces the workload of the subsequent training. The final preprocessing step is the normalization of the image. During normalization, the initial images of the data sets are rotated by -30°, -15°, 15°, and 30°, respectively, considering that there may be nonfrontal facial images; facial emotion recognition is then performed on the normalized images. The image preprocessing mainly aims to reduce the impact of uneven lighting on facial emotion recognition. In the normalization step, histogram equalization is applied to the image, which can be expressed as

s_k = Σ_(l=0…k) n_l / n,

where n denotes the total number of pixels in the face image, k represents the gray level, and n_l refers to the number of pixels with the l-th gray level.
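The cropping and histogram-equalization steps can be sketched as follows. The geometric reading of the crop factors (keep the central 70% of the width, trim 30% of the height from the top) is an assumption, since the text does not specify how the factors are applied; the equalization maps each gray level k to its cumulative frequency s_k.

```python
import numpy as np

def crop_face(img, keep_w=0.7, trim_h=0.3):
    # One plausible reading of the crop factors: keep the central keep_w
    # fraction of the width and trim trim_h of the height from the top.
    h, w = img.shape[:2]
    dx = int(w * (1 - keep_w) / 2)
    dy = int(h * trim_h)
    return img[dy:, dx:w - dx]

def equalize(img, levels=256):
    # Histogram equalization: map gray level k to s_k = sum_{l<=k} n_l / n,
    # then rescale to the original gray-level range.
    n = img.size
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / n          # s_k for every gray level k
    return ((levels - 1) * cdf[img]).astype(np.uint8)
```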

To train the CNN model, it is essential to use a feasible optimization algorithm to find the model's optimal global solution. The Adam algorithm is an optimization algorithm developed from stochastic gradient descent that combines the characteristics of adaptive gradients and root mean square propagation. The adaptive gradient component gives the algorithm excellent performance in computer vision, and root mean square propagation gives it excellent performance on nonstationary problems. The first moment (mean) of the gradient at time t in the Adam algorithm can be obtained according to

M_t = b1 · M_(t−1) + (1 − b1) · g_t.

The noncentral variance corresponding to the gradient at the second moment can be expressed as

V_t = b2 · V_(t−1) + (1 − b2) · g_t²,

where g_t denotes the gradient at time t, the bias-corrected mean is M̂_t = M_t / (1 − b1^t) with b1 = 0.9, and the bias-corrected noncentral variance is V̂_t = V_t / (1 − b2^t) with b2 = 0.999.

The algorithm updates the parameters according to

θ_(t+1) = θ_t − η · M̂_t / (√V̂_t + ε),

where η is the learning rate and ε is a small constant that prevents division by zero.

The Adam algorithm has excellent convergence performance and requires little memory, so it can solve optimization problems involving large amounts of data and numerous parameters [35]. It is therefore selected as the tool to optimize the neural network.
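A single Adam update can be sketched as below; b1 = 0.9 and b2 = 0.999 follow the values given in the text, while the learning rate and ε are the usual defaults (assumptions, as the text does not state them).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient mean and the
    noncentral variance, bias correction, then the parameter step."""
    m = b1 * m + (1 - b1) * grad          # M_t: mean of the gradient
    v = b2 * v + (1 - b2) * grad ** 2     # V_t: noncentral variance
    m_hat = m / (1 - b1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Iterating this step on a simple objective such as f(x) = x² drives the parameter toward the minimum.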

The cross-entropy loss function measures the difference between two probability distributions and is selected here as the loss function. For discrete variables, it can be calculated according to

H(p, q) = −Σ_x p(x) log q(x),

and for continuous variables according to

H(p, q) = −∫ p(x) log q(x) dx,

where p refers to the correct (true) probability distribution and q represents the predicted distribution.
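For the discrete case, the loss can be sketched as follows; the small ε guarding log 0 is an implementation detail, not part of the formula.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Discrete cross-entropy H(p, q) = -sum_x p(x) log q(x); p is the
    true distribution, q the predicted one (eps guards against log 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q + eps))
```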
