2.3. Model Construction


The construction process of the six prediction models was consistent (Figure 1) and comprised three steps: selecting the environmental parameters to determine the feature input scheme (Section 2.3.1); selecting and importing candidate algorithms from the scikit-learn or Keras libraries in Python (Section 2.3.2); and training the models on the input data while tuning their parameters against the model evaluation metrics until relatively good results were achieved (Section 2.3.3).

Figure 1. Modeling workflow.

A variety of environmental parameters concerning the piggery were collected to build the model: temperature, humidity, CO2, H2O, ventilation, and air pressure inside the pig house, and temperature and rainfall outside the pig house. These eight parameters were treated as potentially correlated variables. On this basis, the random forest algorithm was used to rank the importance of the eight environmental parameters with respect to NH3 concentration in the pig house. Because it relies on bootstrap resampling and random node splitting, random forest yields an importance score for each variable that quantifies its contribution to the prediction, and its ability to analyze complex, interacting features makes it a suitable feature selection tool for high-dimensional data. In this study, parameters with importance scores greater than 0.1 after random forest analysis were taken as priority input parameters and entered in descending order of importance. For the parameters with importance scores below 0.1, Pearson correlation analysis (PsCA) was used to compute their correlations with NH3 concentration, and they were entered in descending order of the absolute value of the correlation coefficient. The input scheme for the model feature parameters was obtained from this combination of importance ranking and correlation analysis (Table 1).
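A minimal sketch of this two-stage feature-selection step is given below. The CSV path and column names are hypothetical placeholders, and the forest size is an assumption; the 0.1 importance threshold and the ordering rules follow the text above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Eight candidate environmental parameters (hypothetical column names).
FEATURES = ["temp_in", "humidity", "co2", "h2o",
            "ventilation", "pressure", "temp_out", "rainfall"]
df = pd.read_csv("piggery_env.csv")  # hypothetical preprocessed data file

# Rank the eight parameters by random-forest importance with respect to NH3.
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(df[FEATURES], df["nh3"])
importance = pd.Series(rf.feature_importances_, index=FEATURES).sort_values(ascending=False)

# Parameters with importance > 0.1 enter first, in descending importance.
priority = importance[importance > 0.1].index.tolist()

# Remaining parameters are ordered by |Pearson correlation| with NH3.
rest = importance[importance <= 0.1].index
pearson = df[rest].corrwith(df["nh3"]).abs().sort_values(ascending=False)

input_order = priority + pearson.index.tolist()
print(input_order)
```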

Table 1. Input scheme of model feature parameters.

The NH3 concentration of the pig house was used as the label data, and the environmental parameters related to NH3 concentration were used as the feature data. The goal was to learn the mapping from feature data such as temperature and humidity to the label data, so a supervised learning approach was required. Because both the input and output variables are time series, predicting NH3 in the pig house is formally a regression problem, and the corresponding model is a non-probabilistic one. Accordingly, several supervised learning algorithms were used to establish discriminative models: the classical algorithms DT and SVM, the ensemble algorithm XGBoost, and the neural network algorithms BPNN, LSTM, and RNN. In Python, the machine learning runs, statistical analysis, and data mining were managed with pandas, matplotlib, and numpy. The traditional machine learning algorithms (DT, SVM, and XGBoost) were imported directly from the scikit-learn library (XGBoost via its own scikit-learn-compatible package) and combined with the input data for subsequent training and hyperparameter optimization, whereas the deep learning algorithms (BPNN, LSTM, and RNN) additionally required the Keras library and manual debugging to determine the number of hidden layers (two hidden layers in this study).
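The sketch below shows how the six models might be instantiated under these choices: scikit-learn-style regressors for the traditional algorithms and a Keras network with two hidden layers, as stated above. All layer sizes, unit counts, and the learning rate are illustrative assumptions, not the values used in the study.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor  # XGBoost's scikit-learn-compatible API
from keras.models import Sequential
from keras.layers import Input, LSTM, Dense
from keras.optimizers import Adam

# Traditional machine learning models, taken directly from their libraries.
traditional_models = {
    "DT": DecisionTreeRegressor(),
    "SVM": SVR(),
    "XGBoost": XGBRegressor(),
}

def build_lstm(input_len=5, n_features=4, units1=64, units2=32, lr=1e-3):
    """Two-hidden-layer LSTM; substituting SimpleRNN (or Dense layers on
    flattened windows) for the LSTM layers gives the RNN and BPNN variants."""
    model = Sequential([
        Input(shape=(input_len, n_features)),
        LSTM(units1, return_sequences=True),  # first hidden layer
        LSTM(units2),                         # second hidden layer
        Dense(1),                             # out_len = 1: one-step NH3 prediction
    ])
    model.compile(optimizer=Adam(learning_rate=lr), loss="mse")
    return model
```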

The NH3 concentration was used as the prediction target. The length of the input time series (input_len) of each model was set to 5, and the length of the output time series (out_len) was set to 1. The first 80% of the preprocessed data was used to train the models, and the last 20% was used to test them. During training, each model's hyperparameters had to be selected, a choice that directly affects the final prediction results. The hyperparameters were first set manually so that the prediction performance was relatively good, and the three deep learning models and three conventional machine learning models were established on this basis. The models with good prediction performance were then screened, and hyperparameter optimization was performed with the corresponding algorithms: for the neural network algorithms (LSTM, RNN, and BPNN), particle swarm optimization (PSO) was used to optimize the numbers of neurons in the first and second hidden layers and the learning rate, while grid search was used to tune the DT, SVM, and XGBoost algorithms. Sketches of both tuning paths are given below.
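First, a sketch of the windowing (input_len = 5, out_len = 1), the chronological 80/20 split, and grid-search tuning for one traditional model. The synthetic arrays and the parameter grid are placeholders, not the data or grid used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

def make_windows(features, nh3, input_len=5, out_len=1):
    """Slide an input_len-step window over the features to predict NH3 out_len steps ahead."""
    X, y = [], []
    for i in range(len(features) - input_len - out_len + 1):
        X.append(features[i : i + input_len].ravel())  # flatten window for tabular models
        y.append(nh3[i + input_len + out_len - 1])
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)  # synthetic stand-in for the preprocessed data
features, nh3 = rng.normal(size=(500, 4)), rng.normal(size=500)

X, y = make_windows(features, nh3)
split = int(0.8 * len(X))  # first 80% for training, last 20% for testing
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# Illustrative grid; the study's actual search space is not reported.
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(XGBRegressor(), param_grid,
                      cv=TimeSeriesSplit(n_splits=3), scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)
```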

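Second, a compact, self-contained global-best PSO loop over the three neural-network hyperparameters named above (hidden units in layers 1 and 2, and the learning rate). The study's PSO settings are not reported, so the swarm size, inertia and acceleration coefficients, search bounds, and the stand-in objective are all assumptions; in practice the objective would train the network and return its validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
LB = np.array([8, 8, 1e-4])      # assumed lower bounds: units1, units2, learning rate
UB = np.array([128, 128, 1e-1])  # assumed upper bounds

def validation_loss(p):
    """Stand-in objective. Replace with: build the network with units1=int(p[0]),
    units2=int(p[1]), lr=p[2], train it, and return the validation MSE."""
    return (p[0] - 64) ** 2 + (p[1] - 32) ** 2 + 1e4 * (p[2] - 0.01) ** 2

n_particles, n_iters, w, c1, c2 = 10, 30, 0.7, 1.5, 1.5  # assumed PSO settings
pos = rng.uniform(LB, UB, size=(n_particles, 3))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([validation_loss(p) for p in pos])
gbest = pbest[pbest_val.argmin()]

for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, 3)), rng.random((n_particles, 3))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, LB, UB)  # keep particles inside the search bounds
    vals = np.array([validation_loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()]

print("best units1/units2/lr:", int(gbest[0]), int(gbest[1]), gbest[2])
```

Note that the unit counts are rounded to integers only when the network is built, a common simplification when applying continuous PSO to discrete hyperparameters.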