Data handling


The first application of the above-described classifier was the classification of cases versus controls within individual sites, referred to as site-level analyses. For each site, we fit an SVM and measured its performance using a stratified K-fold cross-validation procedure, where stratification ensures that the proportions of cases and controls are similar across the training and validation folds. The number of folds was selected independently for each site such that the validation set on each fold contained approximately 3 (±1) cases.
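As an illustrative sketch only (not the authors' code, which is available upon request), the following Python fragment shows how such a per-site stratified K-fold procedure could be implemented with scikit-learn; the helper name site_level_cv, the target_cases_per_fold parameter, and the default SVM regularization are our assumptions:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def site_level_cv(X, y, target_cases_per_fold=3, random_state=0):
        # Choose K so that each validation fold receives roughly
        # 3 (+/-1) cases, as described above.
        n_cases = int(np.sum(y == 1))
        n_splits = max(2, round(n_cases / target_cases_per_fold))
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=random_state)
        scores = []
        for train_idx, val_idx in cv.split(X, y):
            clf = SVC(kernel="linear")  # linear SVM; C left at its default
            clf.fit(X[train_idx], y[train_idx])
            scores.append(accuracy_score(y[val_idx],
                                         clf.predict(X[val_idx])))
        return float(np.mean(scores))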

To further study how overall classification performance relates to different methods of data handling, we implemented three approaches. The first was a meta-analysis of diagnostic accuracy from the site-level analyses, referred to as meta-analysis. This models the typical method of analyzing data in a multi-site collaboration [11, 14]. The meta-analyses were done using the hierarchical summary receiver operating characteristic (HSROC) model, implemented in the HSROC package v. 2.1.8 [59] for the R programming language (see Supplementary material).
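The HSROC model takes per-site 2×2 diagnostic tables as input. A minimal sketch of assembling that input in Python follows; the site names, counts, and column layout are illustrative placeholders, not the protocol's actual data:

    import pandas as pd

    # Hypothetical cross-validated confusion counts from the
    # site-level analyses; all values shown are placeholders.
    site_results = [
        {"site": "site_01", "TP": 20, "FP": 5, "FN": 4, "TN": 22},
        {"site": "site_02", "TP": 15, "FP": 7, "FN": 6, "TN": 18},
    ]

    hsroc_input = pd.DataFrame(site_results)
    hsroc_input["sensitivity"] = hsroc_input.TP / (hsroc_input.TP + hsroc_input.FN)
    hsroc_input["specificity"] = hsroc_input.TN / (hsroc_input.TN + hsroc_input.FP)
    # Export one row per site for the HSROC meta-analysis in R.
    hsroc_input.to_csv("hsroc_input.csv", index=False)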

Second, we evaluated the same linear SVM parameterization used in all other analyses using a leave-one-site-out (LOSO) cross-validation procedure, referred to as LOSO analyses. In each fold of cross-validation, one site’s data were excluded entirely from the training partition. The SVM was then trained on the training partition, and predictive performance was evaluated on the data from the held-out site.
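A minimal sketch of the LOSO procedure with scikit-learn's LeaveOneGroupOut follows; the sites array encoding each subject's site label, and the loso_cv helper name, are our assumptions:

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def loso_cv(X, y, sites):
        # Each fold holds out every subject from one site and trains
        # on the data pooled from all remaining sites.
        logo = LeaveOneGroupOut()
        fold_scores = {}
        for train_idx, test_idx in logo.split(X, y, groups=sites):
            clf = SVC(kernel="linear")  # same linear parameterization
            clf.fit(X[train_idx], y[train_idx])
            held_out_site = np.unique(sites[test_idx])[0]
            fold_scores[held_out_site] = accuracy_score(
                y[test_idx], clf.predict(X[test_idx]))
        return fold_scores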

Third, we fit an SVM classifier to the data pooled across all sites, using the same linear SVM parameterization and the same cross-validation procedure as in the site-level analyses. This yielded a total of 284 folds and is further referred to as the aggregate subject-level analysis.
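Assuming the hypothetical site_level_cv helper sketched earlier, the aggregate analysis reduces to pooling subjects across sites before applying the same cross-validation routine; per_site_data is an assumed list of (features, labels) pairs, one per site:

    import numpy as np

    # per_site_data: hypothetical list of (X_site, y_site) array pairs.
    X_pooled = np.concatenate([X_s for X_s, _ in per_site_data])
    y_pooled = np.concatenate([y_s for _, y_s in per_site_data])
    mean_accuracy = site_level_cv(X_pooled, y_pooled)  # same fold-selection rule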

We corrected for the effects of imbalanced data in all analyses, thereby training the SVM classifiers on effectively balanced datasets. To do this, we applied the Synthetic Minority Oversampling Technique with Tomek links (SMOTE-Tomek) [60, 61], implemented in the imblearn package v. 0.3.0.dev0 [62] for Python v. 3.6. The computer code for the above-described analyses will be provided upon reasonable request.
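A minimal sketch of this resampling step with imblearn's SMOTETomek follows; the random seed is an assumption, and older imblearn releases (such as the 0.3.x line cited above) name the resampling method fit_sample rather than fit_resample:

    from imblearn.combine import SMOTETomek

    def balance_training_set(X_train, y_train, random_state=0):
        # SMOTE oversamples the minority class with synthetic examples,
        # then Tomek links are removed to clean up class boundaries.
        # Applied to the training partition only, so that validation
        # data remain untouched.
        resampler = SMOTETomek(random_state=random_state)
        X_bal, y_bal = resampler.fit_resample(X_train, y_train)
        return X_bal, y_bal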
