A Flexible and Easy-to-use Pipeline for Next Generation Sequencing Analysis of Cancer Samples
Abstract
Next Generation Sequencing (NGS) technology offers unprecedented opportunities to improve our understanding of tumor formation and development. Large-scale whole genome and whole exome sequencing projects revealed the mutation profiles for various cancers (Wheeler et al. 2013). However, NGS brought many challenges along with it and many algorithms have been developed to solve these problems. Most of these algorithms run through terminal manually or requires specific programming language knowledge; therefore, it is not easy to execute these algorithms for non-technical users. Robust and efficient combination of these algorithms with a simple interface is highly desirable. We have built a flexible pipeline for this purpose. Our pipeline provides multiple algorithms for every step of sequencing analyses and allows user to compare these alternative algorithms by intermediary outputs. It starts from a raw FASTQ file (which is the text file produced by sequencing machines) and ends with a detailed report of discovered mutations. It includes different types of mapping, pre-processing (such as GATK Best Practices), variant calling and variant annotation algorithms. A very brief workflow of our basic pipeline can be seen in Figure 1. It is a complete pipeline, but for users who want to benefit from only certain stages, system would allow that as well. Depending on preferences users might also access intermediary files instead of the final report. This system helps medical doctors and biologists to use open-source algorithms without having any trouble with a simple interface and as an open-source platform, it allows advanced users to optimize performance through basic settings. Our pipeline uses Burrow-Wheeler Algorithm (BWA) and Bowtie2 algorithms for mapping; Picard Tools, SAMtools and Genome Analysis Toolkit (GATK) for pre-processing steps; and VarScan and Mutect2 for variant calling. Offering different options in each step allows users to compare different algorithms and their results, which will be very beneficial for benchmarking analysis. Apart from the benchmarking, it is also known that various algorithms may perform quite differently for different settings (Krøigård et al. 2016), therefore our flexible system can be optimized for different scenarios. Our pipeline focuses on cancer studies where user has tumor and normal samples and tries to identify somatic and germline single nucleotide polymorphisms (SNP) and insertion-deletions (indels). It uses reference genomes (hg19/GRCh38) and can create a panel of normal reference from the non-tumor samples provided by user. We use public databases such as COSMIC and dbSNP for annotations after the variant calling phase. With respect to performance, all steps (mapping, pre-processing, variant calling and variant annotation) for Whole Exome Sequencing (WXS) data can be processed in 10 hours with 16 Core AMD Opteron Processor 6274 and 48GB RAM hardware through our pipeline. Those numbers are also in decreasing trend with our improvements in parallel processing. Our pipeline is implemented in Python, it can be used through our web page by uploading data to our servers or as a stand-alone software for users who prefer to use local resource. We are working on supporting cloud-based platforms as a different alternative. Up until now, we have tested our algorithms with The Cancer Genomics Atlas (TCGA) data and our private data. In order to verify our results, we have used the TCGA Glioblastoma (GBM) data as reference, where TCGA provides both raw files (.fastq) and annotated files (.vcf). We fed the raw files from TCGA into our pipeline, then compare the VCF files that we produced with ones in the TCGA datasets. The results were almost identical, meaning that all the steps are performed as they should be. In summary, our pipeline takes the very raw input from the NGS machines and produces a detailed report with respect to user’s preferences and requires no programming background, therefore it will be very useful for non-technical people who want to use sequencing technology. Additional to this, our pipeline also allows researchers with advanced computing skills to compare algorithms on different datasets and be more confident about their results. All the codes and more detailed documentations are available at here.