Research design

Mikael Jamil
Ashwin Phatak
Saumya Mehta
Marco Beato
Daniel Memmert
Mark Connor

Three machine learning classification algorithms, Logistic Regression (LR), Random Forest Classifier (RFC), and Gradient Boosting Classifier (GBC), were used to classify goalkeepers (GKs) who had played in the UEFA Champions League (UCL) (class 1) against those who had not (class 0). The UCL was purposely selected as the marker separating elite from sub-elite performance because the competition is of the highest prestige [32] and comprises the very best teams and players [33]. For data balancing purposes, data for 53 non-UCL goalkeepers were excluded (the random undersampling referred to above), resulting in a final sample of 300 GKs. Data on UCL appearances were obtained from the increasingly popular Transfermarkt website [34,35]. Figure 1 outlines the machine learning pipeline used to conduct this study. Min–max scaling was performed, and preliminary hyperparameter optimization was conducted for all three algorithms using the 73 filtered features to achieve an AUC (area under the ROC curve) greater than 0.70 for each of the three models. After this optimization, recursive feature elimination was performed for all three classification algorithms using a ‘balanced accuracy’ scoring metric, with the minimum allowable number of features set at 20 [36], to reduce the dimension of the problem space and retain only the features providing the highest information gain. After feature extraction, each model was optimized for ‘balanced accuracy’ (the average of the recall obtained on each class) using grid search cross-validation [36]. The features common to all three algorithms were reported with their coefficients and variable importance. The pseudocode is presented in ESM Appendix A.

Figure 1. Machine learning pipeline for obtaining KPIs.
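The pipeline described above can be illustrated with a minimal sketch. It is not the authors' code (their pseudocode is in ESM Appendix A) and it rests on several assumptions: scikit-learn is used, `RFECV` stands in for the recursive feature elimination step because it accepts both a scoring metric and a minimum feature count, the data are synthetic rather than the Opta data, and the hyperparameter grids, fold count, and elimination step size are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the goalkeeper data: 300 rows, 73 features,
# two roughly balanced classes (1 = UCL, 0 = non-UCL). Not the study data.
X, y = make_classification(n_samples=300, n_features=73, n_informative=15,
                           random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

# Min-max scaling to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# Hypothetical estimators and grids; the study's actual settings are not given here.
models = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "RFC": (RandomForestClassifier(n_estimators=30, random_state=0),
            {"max_depth": [3, None]}),
    "GBC": (GradientBoostingClassifier(n_estimators=30, random_state=0),
            {"learning_rate": [0.05, 0.1]}),
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

selected = {}
best_scores = {}
for name, (estimator, grid) in models.items():
    # Recursive feature elimination scored by balanced accuracy,
    # never dropping below 20 features.
    rfe = RFECV(estimator, step=10, min_features_to_select=20,
                scoring="balanced_accuracy", cv=cv)
    rfe.fit(X_scaled, y)
    selected[name] = set(feature_names[rfe.support_])

    # Grid search on the retained features, again optimizing balanced accuracy.
    search = GridSearchCV(estimator, grid, scoring="balanced_accuracy", cv=cv)
    search.fit(X_scaled[:, rfe.support_], y)
    best_scores[name] = search.best_score_

# Features retained by all three algorithms.
common_features = sorted(set.intersection(*selected.values()))
print(best_scores)
print(common_features)
```

In this sketch the scaler is fitted on the full data set before cross-validation for brevity; wrapping it in a scikit-learn `Pipeline` would keep the scaling inside each fold.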

The coefficients from the LR provided both the magnitude and the direction of each effect, while the GBC and RFC provided feature importance scores. Ethical approval for this study was obtained from the ethics committee of the local institution. This study did not involve any testing on human subjects, as all data utilised were secondary data obtained directly from Opta, and full permissions to utilise these data for research purposes were obtained by all institutions involved in this study.
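The difference between LR coefficients and tree-based importance scores noted above can be shown with a short sketch. It assumes scikit-learn and synthetic data, and it uses the library's default impurity-based `feature_importances_`; the source does not state which importance measure the study used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data for illustration only (not the study data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)

lr = LogisticRegression(max_iter=1000).fit(X, y)
rfc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
gbc = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# LR coefficients are signed: the sign gives the direction of the effect on
# the log-odds of class 1, the absolute value gives its magnitude.
lr_coefs = lr.coef_[0]

# Tree-ensemble importances are non-negative and sum to 1: they rank
# features but carry no direction.
rfc_importance = rfc.feature_importances_
gbc_importance = gbc.feature_importances_

# Show the five features with the largest absolute LR coefficient.
for i in np.argsort(-np.abs(lr_coefs))[:5]:
    direction = "raises" if lr_coefs[i] > 0 else "lowers"
    print(f"feature {i}: coef={lr_coefs[i]:+.2f} ({direction} log-odds of class 1), "
          f"RFC={rfc_importance[i]:.3f}, GBC={gbc_importance[i]:.3f}")
```

Because the inputs are min-max scaled to a common range, the LR coefficient magnitudes are comparable across features in this sketch.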
