Submodule 4—Advanced scaling: nf-core/methylseq pipeline using Google Batch

Yujia Qin, Angela Maggio, Dale Hawkins, Laura Beaudry, Allen Kim, Daniel Pan, Ting Gong, Yuanyuan Fu, Hua Yang, Youping Deng

To address the resource-intensive nature of WGBS data analysis, this large-scale submodule demonstrates how cloud computing can support real-world WGBS analysis. Analyzing WGBS data requires determining the methylation status of every cytosine in the genome, which produces substantial data volumes and computational complexity. This submodule (Figure 5) provides an example of using nf-core/methylseq via Google Batch to preprocess a WGBS dataset downloaded directly from the Sequence Read Archive (SRA). Cloud computing resources, such as machine types optimized for specific workloads, make the analysis more efficient, which is particularly advantageous for large-scale studies.

Figure 5. Design and key steps of the large-scale submodule, which uses cloud resources for scalability. The submodule employs the nf-core/methylseq pipeline in conjunction with Google Batch to process large-scale WGBS data.

Google Batch is a managed GCP compute service that simplifies the execution of containerized workloads. It integrates seamlessly with Nextflow, allowing pipelines such as nf-core/methylseq to be deployed easily, with process execution offloaded to Google Cloud’s infrastructure; this is especially useful for large-scale operations. To use Google Batch through Nextflow, users first need to create a Nextflow service account and grant the appropriate permissions to the Vertex AI notebook. The subsequent steps are similar to those in the methylseq submodule, with the addition of a new configuration file that specifies the executor, input/output directories, machine types, and the storage bucket to be used. These settings can be defined individually for each step of the workflow, which allows different cloud computing resources to be used for specific tasks and improves cost-effectiveness.
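As a concrete illustration, the sketch below shows what such a configuration file might look like when written from the notebook terminal. The project ID, region, bucket name, machine types, and process labels are illustrative assumptions, not values from the submodule; substitute your own GCP project settings and the resource labels used by the pipeline version you run.

```bash
# Minimal sketch of a custom Nextflow configuration for Google Batch.
# The project ID, region, bucket, machine types, and labels below are
# placeholders for illustration only.
cat > google_batch.config <<'EOF'
// Offload all process execution to the Google Batch service
process.executor = 'google-batch'

// GCP project and region used to submit Batch jobs (placeholders)
google.project  = 'my-gcp-project'
google.location = 'us-central1'

// Stage intermediate work files in a Cloud Storage bucket
workDir = 'gs://my-wgbs-bucket/work'

// Request different machine types per resource label, so that heavy
// steps such as alignment get larger instances than light ones
process {
    withLabel: 'process_high'   { machineType = 'n2-highmem-16' }
    withLabel: 'process_medium' { machineType = 'n2-standard-8' }
    withLabel: 'process_low'    { machineType = 'n2-standard-4' }
}
EOF
```

Because machine types are attached to resource labels rather than to the whole run, each stage of the workflow can be matched to an appropriately sized (and priced) instance.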

As shown in Figure 5, the selected dataset is downloaded from SRA using SRA-tools and stored in the local notebook environment. The nf-core/methylseq pipeline is then executed on this dataset with the new configuration file, which sends each job to Google Batch for parallel execution on the specified compute resources. Users can also define a local notebook directory with the ‘--tracedir’ parameter to save pipeline execution logs for tracing and runtime estimation. Once the run is complete, the output files are saved in the storage bucket defined in the configuration file.
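The commands below sketch this step, assuming the configuration file from the previous section. The SRA accession, samplesheet, reference genome, and bucket paths are placeholders, and the exact ‘--input’ format (samplesheet CSV or FASTQ path pattern) depends on the nf-core/methylseq version in use.

```bash
# Illustrative only: accession, samplesheet, genome, and bucket paths
# are placeholders; adjust them to your own dataset and project.

# 1. Download the selected run from SRA into local notebook storage
prefetch SRRXXXXXXX
fasterq-dump --split-files SRRXXXXXXX -O fastq/

# 2. Launch nf-core/methylseq; each process is submitted to Google Batch
#    through the custom configuration file, results are written to the
#    Cloud Storage bucket, and execution traces stay in a local directory.
nextflow run nf-core/methylseq \
    -profile docker \
    -c google_batch.config \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir gs://my-wgbs-bucket/results \
    --tracedir ./pipeline_info
```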

As mentioned previously, to demonstrate the potential of cloud computing, we selected a moderately sized SRA dataset for this submodule. While this dataset allows the tutorial notebook to run without incurring excessive costs from long runtimes, it may not fully showcase the advantages of GCP’s computational resources. To further test cloud-based analysis capabilities, we ran thorough tests on a substantially larger dataset (not included in the submodule), which provided valuable insight into optimizing configuration files and execution. By sharing the problems we encountered and the lessons learned in the submodule, we aim to help users make cost-effective decisions when running WGBS pipelines on their own large-scale data.
