Submodule 4—Advanced scaling: nf-core/methylseq pipeline using Google Batch

Yujia Qin, Angela Maggio, Dale Hawkins, Laura Beaudry, Allen Kim, Daniel Pan, Ting Gong, Yuanyuan Fu, Hua Yang, Youping Deng

To address the resource-intensive nature of WGBS data analysis, this large-scale submodule demonstrates how cloud computing can support real-world WGBS analysis. Analyzing WGBS data requires determining the methylation status of every cytosine in the genome, which produces substantial data volumes and computational complexity. This submodule (Figure 5) provides an example of using nf-core/methylseq via Google Batch to preprocess a WGBS dataset downloaded directly from the Sequence Read Archive (SRA). Cloud computing resources, such as machine types optimized for specific workloads, make the analysis more efficient, which is particularly advantageous for large-scale studies.

Figure 5. Design and key steps of the large-scale submodule, which uses cloud resources for scalability. The submodule employs the nf-core/methylseq pipeline in conjunction with Google Batch to process large-scale WGBS data.

Google Batch is a managed GCP compute service that simplifies the execution of containerized workloads. It integrates seamlessly with Nextflow, allowing pipelines such as nf-core/methylseq to be deployed easily, with process execution offloaded to Google Cloud’s infrastructure; this is especially useful for large-scale operations. To use Google Batch through Nextflow, users first need to create a Nextflow service account and grant the appropriate permissions to the Vertex AI notebook. The subsequent steps are similar to those in the methylseq submodule, with the addition of a new configuration file that specifies the executor, input/output directories, machine types, and the storage bucket to be used. These settings can be defined individually for each step of the workflow, which allows different cloud computing resources to be used for specific tasks and improves cost-effectiveness.
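As a concrete illustration, the sketch below shows what such a configuration file might look like when written from the notebook terminal. The project ID, region, bucket name, machine types, and process labels are illustrative assumptions, not values from the submodule; substitute your own GCP project settings and the resource labels used by the pipeline version you run.

```bash
# Minimal sketch of a custom Nextflow configuration for Google Batch.
# The project ID, region, bucket, machine types, and labels below are
# placeholders for illustration only.
cat > google_batch.config <<'EOF'
// Offload all process execution to the Google Batch service
process.executor = 'google-batch'

// GCP project and region used to submit Batch jobs (placeholders)
google.project  = 'my-gcp-project'
google.location = 'us-central1'

// Stage intermediate work files in a Cloud Storage bucket
workDir = 'gs://my-wgbs-bucket/work'

// Request different machine types per resource label, so that heavy
// steps such as alignment get larger instances than light ones
process {
    withLabel: 'process_high'   { machineType = 'n2-highmem-16' }
    withLabel: 'process_medium' { machineType = 'n2-standard-8' }
    withLabel: 'process_low'    { machineType = 'n2-standard-4' }
}
EOF
```

Because machine types are attached to resource labels rather than to the whole run, each stage of the workflow can be matched to an appropriately sized (and priced) instance.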

As shown in Figure 5, the selected dataset is downloaded from SRA using SRA-tools and stored in the local notebook environment. The nf-core/methylseq pipeline is then executed on this dataset with the new configuration file, which sends each job to Google Batch for parallel execution on the specified compute resources. Users can also define a local notebook directory with the ‘--tracedir’ parameter to save pipeline execution logs for tracing and runtime estimation. Once the run is complete, the output files are saved in the storage bucket defined in the configuration file.
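The commands below sketch this step, assuming the configuration file from the previous section. The SRA accession, samplesheet, reference genome, and bucket paths are placeholders, and the exact ‘--input’ format (samplesheet CSV or FASTQ path pattern) depends on the nf-core/methylseq version in use.

```bash
# Illustrative only: accession, samplesheet, genome, and bucket paths
# are placeholders; adjust them to your own dataset and project.

# 1. Download the selected run from SRA into local notebook storage
prefetch SRRXXXXXXX
fasterq-dump --split-files SRRXXXXXXX -O fastq/

# 2. Launch nf-core/methylseq; each process is submitted to Google Batch
#    through the custom configuration file, results are written to the
#    Cloud Storage bucket, and execution traces stay in a local directory.
nextflow run nf-core/methylseq \
    -profile docker \
    -c google_batch.config \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir gs://my-wgbs-bucket/results \
    --tracedir ./pipeline_info
```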

As mentioned previously, to demonstrate the potential of cloud computing, we selected a moderately sized SRA dataset for this submodule. While this dataset allows the tutorial notebook to run without incurring excessive costs from long runtimes, it may not fully showcase the advantages of GCP’s computational resources. To further test cloud-based analysis capabilities, we ran thorough tests on a substantially larger dataset (not included in the submodule), which provided valuable insight into optimizing configuration files and execution. By sharing the problems we encountered and the lessons learned in the submodule, we aim to help users make cost-effective decisions when running WGBS pipelines on their own large-scale data.
