2.2.2. Machine Learning for Feature Selection

Catarina Dinis Fernandes
Annekoos Schaap
Joan Kant
Petra van Houdt
Hessel Wijkstra
Elise Bekers
Simon Linder
Andries M. Bergman
Uulke van der Heide
Massimo Mischi
Wilbert Zwart
Federica Eduati
Simona Turco

A regularized logistic regression model was optimized on the transcriptomic training dataset using nested stratified cross-validation with 25 repeats, 10 outer folds, and 5 inner folds. In the nested approach, the inner loop is used to optimize the hyperparameters, while the outer loop is used for performance evaluation and model selection. In both loops, the folds were stratified so that the two classes (insignificant vs. significant PCa) were balanced across folds, and features were standardized in each outer loop. The models were optimized by minimizing the log loss and were penalized with elastic net regularization. To rank features by importance, the weight coefficients obtained for each feature were averaged over all repetitions of the cross-validation procedure and converted to odds ratios (ORs) using OR = e^w, with w being the coefficient of a feature. The OR quantifies the change in the odds of clinically significant PCa per unit increase in the (standardized) feature value: an OR > 1 means that a higher value of the feature is associated with higher odds of clinically significant PCa, and an OR < 1 means the opposite [56]. Features whose OR was within a margin of 0.1 of 1 (|OR - 1| < 0.1) were excluded from further investigation. This procedure was performed separately for the transcription factor features and the pathway activity features, so that ranking and selection were carried out independently for each set. For the PORTOS/Decipher signature scores, all six features were investigated further.
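The conversion from averaged coefficients to ORs and the exclusion rule can be illustrated with a minimal sketch. It assumes that the coefficient vectors from the fitted models have already been collected; the function name, the feature names, and the coefficient values are hypothetical and are not taken from the study. Sorting by the absolute mean coefficient is also an assumption, since the text does not state the exact ranking criterion.

```python
import math
from statistics import mean


def rank_features_by_odds_ratio(coefs_per_fit, feature_names, margin=0.1):
    """Average coefficients across fits, convert to ORs, filter, and rank.

    coefs_per_fit: list of coefficient lists, one list per fitted model
                   (e.g. one per outer fold and repeat).
    Returns a list of (feature_name, mean_coefficient, odds_ratio) tuples.
    """
    ranked = []
    for j, name in enumerate(feature_names):
        # Mean weight of feature j over all fitted models
        w = mean(fit[j] for fit in coefs_per_fit)
        # OR = e^w
        odds_ratio = math.exp(w)
        # Exclude features with |OR - 1| < margin (OR very close to 1)
        if abs(odds_ratio - 1.0) < margin:
            continue
        ranked.append((name, w, odds_ratio))
    # Assumption: rank by |w|, i.e. by how far the OR lies from 1 on the log scale
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked


# Hypothetical coefficients from three fits for three hypothetical features
feature_names = ["TF_A", "TF_B", "TF_C"]
coefs_per_fit = [
    [0.50, -0.30, 0.02],
    [0.70, -0.50, -0.04],
    [0.60, -0.40, 0.05],
]

ranked = rank_features_by_odds_ratio(coefs_per_fit, feature_names)
for name, w, odds_ratio in ranked:
    print(f"{name}: mean w = {w:.2f}, OR = {odds_ratio:.3f}")
```

In this toy input, TF_C has a mean coefficient of about 0.01, which gives an OR within 0.1 of 1, so it is excluded. TF_A (OR > 1) and TF_B (OR < 1) are retained.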
