Data handling


The first application of the above-described classifier was the classification of cases versus controls within individual sites, referred to as site-level analyses. For each site, we fit an SVM and measured its performance using a stratified K-fold cross-validation procedure, where stratification ensures that the proportions of cases and controls are similar across the training and validation folds. The number of folds was selected independently for each site such that the validation set on each fold contained approximately 3 (±1) cases.
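As an illustrative sketch only (not the authors' code, which is available upon request), the following Python fragment shows how such a per-site stratified K-fold procedure could be implemented with scikit-learn; the helper name site_level_cv, the target_cases_per_fold parameter, and the default SVM regularization are our assumptions:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def site_level_cv(X, y, target_cases_per_fold=3, random_state=0):
        # Choose K so that each validation fold receives roughly
        # 3 (+/-1) cases, as described above.
        n_cases = int(np.sum(y == 1))
        n_splits = max(2, round(n_cases / target_cases_per_fold))
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=random_state)
        scores = []
        for train_idx, val_idx in cv.split(X, y):
            clf = SVC(kernel="linear")  # linear SVM; C left at its default
            clf.fit(X[train_idx], y[train_idx])
            scores.append(accuracy_score(y[val_idx],
                                         clf.predict(X[val_idx])))
        return float(np.mean(scores))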

To further study how overall classification performance relates to different methods of data handling, we implemented three approaches. The first was a meta-analysis of diagnostic accuracy from the site-level analyses, referred to as meta-analysis. This models the typical method of analyzing data in a multi-site collaboration [11, 14]. The meta-analyses were done using the hierarchical summary receiver operating characteristic (HSROC) model, implemented in the HSROC package v. 2.1.8 [59] for the R programming language (see Supplementary material).
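The HSROC model takes per-site 2×2 diagnostic tables as input. A minimal sketch of assembling that input in Python follows; the site names, counts, and column layout are illustrative placeholders, not the protocol's actual data:

    import pandas as pd

    # Hypothetical cross-validated confusion counts from the
    # site-level analyses; all values shown are placeholders.
    site_results = [
        {"site": "site_01", "TP": 20, "FP": 5, "FN": 4, "TN": 22},
        {"site": "site_02", "TP": 15, "FP": 7, "FN": 6, "TN": 18},
    ]

    hsroc_input = pd.DataFrame(site_results)
    hsroc_input["sensitivity"] = hsroc_input.TP / (hsroc_input.TP + hsroc_input.FN)
    hsroc_input["specificity"] = hsroc_input.TN / (hsroc_input.TN + hsroc_input.FP)
    # Export one row per site for the HSROC meta-analysis in R.
    hsroc_input.to_csv("hsroc_input.csv", index=False)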

Second, we evaluated the same linear SVM parameterization used in all other analyses using a leave-one-site-out (LOSO) cross-validation procedure, referred to as LOSO analyses. In each fold of cross-validation, one site’s data were excluded entirely from the training partition. The SVM was then trained on the training partition, and predictive performance was evaluated on the data from the held-out site.
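A minimal sketch of the LOSO procedure with scikit-learn's LeaveOneGroupOut follows; the sites array encoding each subject's site label, and the loso_cv helper name, are our assumptions:

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def loso_cv(X, y, sites):
        # Each fold holds out every subject from one site and trains
        # on the data pooled from all remaining sites.
        logo = LeaveOneGroupOut()
        fold_scores = {}
        for train_idx, test_idx in logo.split(X, y, groups=sites):
            clf = SVC(kernel="linear")  # same linear parameterization
            clf.fit(X[train_idx], y[train_idx])
            held_out_site = np.unique(sites[test_idx])[0]
            fold_scores[held_out_site] = accuracy_score(
                y[test_idx], clf.predict(X[test_idx]))
        return fold_scores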

Third, we fit an SVM classifier to the data pooled across all sites, using the same linear SVM parameterization and the same cross-validation procedure as in the site-level analyses. This yielded a total of 284 folds and is further referred to as the aggregate subject-level analysis.
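Assuming the hypothetical site_level_cv helper sketched earlier, the aggregate analysis reduces to pooling subjects across sites before applying the same cross-validation routine; per_site_data is an assumed list of (features, labels) pairs, one per site:

    import numpy as np

    # per_site_data: hypothetical list of (X_site, y_site) array pairs.
    X_pooled = np.concatenate([X_s for X_s, _ in per_site_data])
    y_pooled = np.concatenate([y_s for _, y_s in per_site_data])
    mean_accuracy = site_level_cv(X_pooled, y_pooled)  # same fold-selection rule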

We corrected for the effects of imbalanced data in all analyses, thereby training the SVM classifiers on effectively balanced datasets. To do this, we applied the Synthetic Minority Oversampling Technique with Tomek links (SMOTE-Tomek) [60, 61], implemented in the imblearn package v. 0.3.0.dev0 [62] for Python v. 3.6. The computer code for the above-described analyses will be provided upon reasonable request.
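A minimal sketch of this resampling step with imblearn's SMOTETomek follows; the random seed is an assumption, and older imblearn releases (such as the 0.3.x line cited above) name the resampling method fit_sample rather than fit_resample:

    from imblearn.combine import SMOTETomek

    def balance_training_set(X_train, y_train, random_state=0):
        # SMOTE oversamples the minority class with synthetic examples,
        # then Tomek links are removed to clean up class boundaries.
        # Applied to the training partition only, so that validation
        # data remain untouched.
        resampler = SMOTETomek(random_state=random_state)
        X_bal, y_bal = resampler.fit_resample(X_train, y_train)
        return X_bal, y_bal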
