Abstract
Background: Next-generation sequencing provides useful information about gene mutations, gene expression, epigenetic modification, microRNA expression, and copy number variations. More and more computing tools have been developed to analyze this large quantity of information. However, to test and find suitable analytical tools and integrate their results is tedious and challenging for users with little bioinformatics training. In the present study, we assembled the computing tools into a convenient toolkit to simplify the analysis and integration of data between bioinformatics tools. Materials and Methods: The toolkit, GeneGazer, comprises of two parts: the first, named Gaze_Profiler, was designed for personalized molecular profiling from next-generation sequencing data of paired samples; the other, named Gaze_BioSigner, was designed for the discovery of disease-associated biosignatures from expressional and mutational profiles of a cohort study. Results: To demonstrate the capabilities of Gaze_Profiler, we analyzed a pair (colon cancer and adjacent normal tissues) of RNA-sequencing data from one patient downloaded from the Sequencing Read Archive database and used them to profile somatic mutations and digital gene expression. In this case, alterations in the RAS/RAF/MEK/ERK signaling pathway (activated by KRAS G13D mutation) and canonical WNT signaling pathway (activated by truncated APC) were identified; no EGFR mutation or overexpression was found. These data suggested a limited efficacy of cetuximab in the patient. To demonstrate the ability of Gazer_BioSigner, we analyzed gene-expression data from 192 cancer tissues downloaded from The Cancer Genome Atlas and found that the activation of cAMP/PKA signaling, OCT-3/4 and SRF were associated with colon cancer progression and could be potential therapeutic targets. Conclusion: GeneGazer is a reliable and robust toolkit for the analysis of data from high-throughput platforms and has potential for clinical application and biomedical research.
- Next-generation sequencing
- personalized molecular profiling
- biosignature
- somatic mutations
- digital gene expression
Abbreviations: cAMP, Cyclic adenosine monophosphate; APC, adenomatous polyposis coli; BIRC5, baculoviral IAP repeat containing 5; BRAF, B-RAF proto-oncogene, serine/threonine kinase; EGFR, epidermal growth factor receptor; ERK, extracellular signal-regulated kinase; FZD7, frizzled class receptor 7; IGF1R, insulin-like growth factor 1 receptor; KRAS, kirsten rat sarcoma viral oncogene homolog; MEK, mitogen-activated protein kinase kinase; MET, Met proto-oncogene, receptor tyrosine kinase; NF-Y, nucleic factor Y; OCT-3/4, octamer-binding transcription factor 3/4; PIK3CA, phosphotidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha; PKA, cAMP-dependent protein kinase; PTEN, phosphatase and tensin homolog; RAF, v-Raf murine sarcoma viral oncogene; SFRP5, secreted frizzled-related protein 5; SPRY2, sprouty homolog 2; SRF, serum response factor; RAS, rat sarcoma viral oncogene; TBP, TATA-Box binding protein; WNT, wingless-type MMTV integration site family.
High-throughput technologies, such as next-generation sequencing (NGS) and high-density microarrays, provide systematic information on molecular biological processes and lead to a comprehensive understanding of cancer genomes (1-3). Knowledge acquired with these sophisticated tools indicates that genotypes or expression profiles in cancer tissues are highly diverse among patients (4, 5). This molecular diversity leads to the activation of different pathways and cancer phenotypes, which subsequently affect therapeutic efficacy and prognosis. For instance, tumors with high EGFR expression are susceptible to cetuximab, the therapeutic antibody commonly used against metastatic colorectal cancer. Once a cancer cell acquires either mutation of KRAS, BRAF or PIK3CA mutation, PTEN loss, or overexpression of MET or IGF1R, it develops resistance to cetuximab by activating a signaling pathway other than EGFR (6). In the case of EGFR-independent tumors, cetuximab has limited therapeutic efficacy, and medications targeting other signaling pathways should be considered (7, 8). Therefore, the molecular profiling of genetic variations and gene expression for each patient would be useful for selecting for the most appropriate therapeutic approach (9-11).
NGS is a powerful tool for the molecular profiling of gene mutation, expression, methylation and microRNA profiles. More and more useful computing tools and pipelines have been developed for the analysis of NGS data (12). For instance, the BWA-GATK pipeline is often used for variant identification (13-15); Tophat-Cufflinks pipeline is widely applied to calculate the digital gene-expression profile (16); and VarScan (17, 18) and Somaticsnipper (19) were developed for identifying somatic variations. Most of these tools have been developed for text-based user interface and for the UNIX system only. Many commercial software kits, such as CLC Genome Workbench, Avadis, and Partek, provide a user-friendly graphic user interface and are available for various types of operating systems. These tools are powerful but often inflexible. In addition, using these programs and selecting for appropriate parameters is still a great challenge, especially for clinical technicians without bioinformatics training.
In addition to molecular profiling, high-throughput techniques also generate large amounts of data. Many archive databases, such as Sequence Read Archive (SRA), Gene Expression Omnibus (GEO), and The Cancer Genome Atlas (TCGA), have been established for storing omics data generated from these high-throughput techniques. These data provide important clues regarding disease-progression mechanisms, drug development, biosignature identification and biomarker discovery. For instance, general biosignatures in glioblastoma multiforme (20), ovarian carcinoma (21), colorectal adenocarcinoma (22), and squamous cell lung cancer (23) have been uncovered using the TCGA database. Many strategies, such as PARADIGM (24) and MuSiC (25), have been developed for identifying general disease-causing signatures. These observations indicate that mining information from such a massive data deposit would be useful in order to gain a comprehensive understanding of malignant diseases.
In the present study, we attempted to establish an easy and integrated toolkit that assembles bioinformatics resources for analyzing sequencing data and generating molecular profiles of the samples. The toolkit, named GeneGazer, comprises of two computing pipelines: the first one is for personalized molecular profiling through analysis of NGS data, and the second one is for biosignature identification (Figure 1). To demonstrate the capability of the toolkit, an RNA-sequencing dataset from SRA and a large dataset from TCGA were analyzed.
Materials and Methods
Resource requirements. The source codes for GeneGazer are available on request. The toolkit was developed and tested on Linux Ubuntu 12.04 operating system with the following software: Burrows-Wheeler Aligner (BWA, 0.6.2-r126), SAMtools (0.1.8), TopHat (2.0.3), Bowtie (0.12.7), Cufflinks (1.3.0), MySQL (5.5.35), and JAVA compiler. For Gaze_Profiler, at least two threads and four gigabytes of RAM are recommended. For Gaze_BioSigner, at least one thread and four gigabytes of RAM are required.
RNA-sequencing dataset. The RNA sequence dataset CC2 (SRP021221) was downloaded from the Sequence Reads Archive (SRA) database of the National Center of Biotechnology and Information (NCBI) database. This dataset consists of 59.2 million 100-bp reads from each tumor section and 35.4 million 75-bp reads from the adjacent normal section. All reads were paired-end sequences. The raw data were stored in the raw_data folder in fastq format for subsequent analysis.
Somatic mutation identification. This analysis was performed using Gaze_Profiler. The workflow is illustrated in Figure 2A. Firstly, five nucleotides from the 5’-end and 10 nucleotides from the 3’-end were trimmed. The trimmed data were stored in the seq_data folder. The data were aligned against human genome DNA version 19 (hg19, UCSC) using BWA (14, 15). The alignment result was stored in the sam format. The sam format data were sorted using Picard-Utility (http://broadinstitute.github.io/picard). The duplicated reads were removed with Picard-Utility. Indel realignment and quality recalculation were performed with Genome analytic toolKit (GATK) (13). Before analysis using VarScan (17, 18), the pileup files are generated with mpileup module in SAMtools (26). For scanning somatic variations, (i) the depth of sequencing of normal tissue should be greater than six; (ii) the depth of sequencing of tumor tissue should be greater than three; and (iii) the total depth of sequencing of normal and tumor tissue should be at least six. The significance of each somatic variant calling was evaluated by Fisher's exact test. The variants with p-value smaller than 0.05 were identified as somatic variants or loss of heterozygocity. The selected variants were subsequently annotated with Annovar (27, 28). The genes with missense, nonsense, or frameshift mutations were collected into an SQL database.
The organization of GeneGazer.
Geneexpression profile. “Exp” subroutine in Gaze_Profiler, based on the TopHat-Cufflinks pipeline (16), was applied to evaluate the gene-expression profile from the RNA-sequencing dataset. The trimmed sequencing data were aligned to GRCh37 (Ensemble) with TopHat (16). The alignment data were subsequently analyzed with Cufflinks (16). The read counts for each gene locus were calculated and normalized in fragments per kilobase of transcript per million mapped reads (FPKM) format. The genes whose expression was two-fold greater or lower in the tumor than in the paired normal tissue were selected as being up-/down-regulated and were collected into the SQL database.
Processing of data obtained from The Cancer Genome Atlas (TCGA) database. The expression profiles of patients with colon adenocarcinoma, including COAD.IlluminaHiSeq_RNASeqV2 (level 3.1.9.0) and COAD_IlluminaGA_RSEQV2 (level 3.1.1.0) were obtained from the TCGA database (https://tcga-data.nci.nih.gov/tcga/). The expression of each gene was centralized by Z-score transformation. The average and standard deviations of each gene in normal colon tissues were calculated. The Z-score of each target gene in tumor tissue was computed according to the equation: Z-scoretar=(Expressiontar−Averagetar)/SDtar, whereas Expressiontar represented the level of expression of the target gene, Averagetar represented the average target gene expression level in normal colon tissue, and SDtar represented the standard deviation of the target gene expression level in normal tissue. All calculations were performed using the subroutine “zscore”. The COAD_IlluminaGA_RSEQV2 (level 3.1.1.0) dataset was used for the following Gene-Set Enrichment Analysis (GSEA) (29).
Gene-set enrichment analysis. For GSEA, the required files were generated from the FPKM file. The phenotype data were recorded in the cls file; gene-annotation data were described in the chip file; and the expression data were in the gct file. All files were analyzed by GSEA program. The gene-set database was based on the curated Kyoto Encyclopedia of Genes and Genomes (KEGG) database (c2.cp.kegg.v5.0.symbols.gmt). The permutation type was set in “Gene Set”. The default settings of the other parameters were used.
Transcription factor analysis. Genes with differential expression in distinct groups of patients were identified by the subroutine “diff_zscore”. The gene list was imported into MetaCore™ (THOMSON REUTERS, New York, NY, USA) for analysis of the transcription factors. The selected pathways were then integrated.
Results
Implementation. GeneGazer utilizes two automatic pipelines, both of which are encoded in two shell scripts and controlled by text-based user-interface in the Linux operating environment.
Gaze_Profiler is designed for identifying somatic genetic alterations from paired NGS data from a tumor and its normal tissue counterpart. The pipeline includes BWA, Picard-tools, GATK, SAMtools, VarScan, Annovar, Tophat, Cufflinks, and GSEA, as shown in Figure 2A. FASTQ-format RNA-sequencing or exome-sequencing data from paired normal and tumor samples from one patient are analyzed in this pipeline. Initially, low-quality base-calling at both the 5’-end and 3’-end of reads was trimmed by the subroutine “prep”. To survey somatic mutations, the subroutine “mut” is applied to map the trimmed reads to the human genome (hg19, UCSC) using BWA. The alignment results are sorted and transformed into bam format with Picard-tools, and the duplicated reads are discarded. The realignment of insertion/deletion and the recalculation of mapping quality are processed with GATK. Thereafter, somatic variations are identified with Varscan, based on Fisher's exact test. These somatic variations are annotated with Annovar. Finally, the missense, nonsense, or frameshift mutations are extracted into a personalized mutational profile, which is subsequently imported into MySQL database. For such a paired sequencing dataset with 10 gigabytes of data output for each of the paired samples, the entire process can be completed within 24 hours using two threads and four gigabytes of RAM. The subroutine “exp” is designed to calculate differential gene expression, which is based on the Tophat-Cufflinks pipeline. The results are presented in FPKM format (16). The gene sets associated with tumor sections are selected using GSEA, and the differentially expressed genes, either overexpressed or down-regulated in the tumor section, are identified and stored in MySQL database for subsequent analysis. For such a sequencing dataset with 6 gigabyte data output for each of the paired samples, the generation of a complete differential gene expression profile still takes under 24 h with two threads and four gigabytes of RAM.
Gaze_BioSigner is a pipeline containing subroutines for data management and biosignature identification. The input data are genetic variations or gene expression in a cohort, obtained either from Gaze_Profiler or public domains. Its workflow is shown in Figure 2B. Molecular profiles from Gaze_Profiler are imported into MySQL database through the subroutine “import”. The expression data from TCGA are imported through the subroutine “zscore”. The clinical data of each patient are stored in the database by the subroutine “idf”. The associations between the genetic alterations and selected clinical features are calculated with the subroutine “diff”, analyzed by Pearson's Chi-square test and odds ratio. Expression profiles can be further entered into subroutine “diff_zscore”, which invokes GSEA and identifies the gene sets associated with selected clinical features.
The workflow of GeneGazer. The workflow of the two pipelines in GeneGazer, Gaze_Profiler (A) and Gaze_BioSigner (B), is shown. Blue squares represent subroutines each consisting of one or more invoked programs (identified in green).
Personalized molecular profiling. To demonstrate the capability of Gaze_Profiler, we analyzed a paired (normal and tumor tissues) RNA-sequencing dataset from a patient with colon cancer, downloaded from the SRA database in NCBI. Mutational and expressional profiles of this dataset were generated with the Gaze_Profiler pipeline.
In this dataset, 70 somatic mutations were identified, including 18 missense substitutions, four nonsense substitutions and 48 frameshifts caused by insertions or deletions. Moreover, there were 108 genes up-regulated and 172 down-regulated in the tumor section. GSEA revealed that eight pathways were up-regulated and 22 pathways down-regulated in the tumor section (Figure 3A-C). Genes associated with these pathways were involved in the cell cycle, DNA repair, cytokine production, and cell adhesion. These results suggested that the tumor tissue in this patient had a phenotype of cell-cycle acceleration, DNA-repair activation, cytokine reduction and cell adhesion dysfunction.
Gaze_Profiler can extend its capability through utilization of other analytical tools not part of our pipeline. For example, to further investigate the pathways in which the genetic alterations occurred, the somatic genetic alterations for each patient were entered into MetaCore™, a comprehensive pathway analysis software. The results show that in the tumor tissue, a heterozygous KRAS G13D mutation and SPRY2 overexpression were identified, suggesting that the tumor tissue harbored the KRAS-driven activation of the RAS/RAF/MEK/ERK pathway. Moreover, the tumor also showed activation of the canonical WNT pathway, most likely due to the dysfunction of APC and SFRP5, activation of FZD7, and overexpression of BIRC5 (Figure 3D).
The personal molecular profile of patient CC2 was analyzed by Gene-Set Enrichment Analysis (GSEA) and pathway analysis. This figure shows the top 20 differentially expressed genes (A), up-regulated gene sets (B), and down-regulated gene sets (C) identified by GSEA in the tumor section. After pathway analysis, activation of the canonical WNT signaling pathway and KRAS/RAF/MEK/ERK signaling pathway was identified in this patient (D). The genes up-regulated in the tumor are labeled in red; those down-regulated in the tumor in blue; non-synonymous somatic mutations in purple.
Biosignature identification. To demonstrate the capability of Gaze_BioSigner, we downloaded the expression profiles of colon adenocarcinoma from the TCGA database (COAD_IlluminaGA_RSEQV2 level 3.1.1.0) and analyzed the molecular profiles associated with certain clinical features. This dataset includes 459 patients; among them, 192 patients with record of the tumor stage (111 at an early stage or stage I and II, and 81 at a late stage or stage III and IV). This dataset was used to identify possible genetic alterations that were associated with tumor stage. The expression profile of each patient was converted into a Z-score profile with the subroutine “zscore”. The Z-score profile was transferred into another subroutine “diff_zscore”, which invokes GSEA to identify stage-associated biosignatures.
The GSEA results are shown in the supplementary information (https://drive.google.com/open?id=0B3Nd1wE1_lrRaTl3NnFJTnRXWTA). Briefly, 98 pathways were up-regulated in patients with late-stage tumor. Only the gene set KEGG_Vasopression_regulated_water_absorption was significantly enriched. The genes in its core enrichment were associated with cAMP/PKA signal transduction. This result suggested that cAMP/PKA activation could be important for tumor progression. However, 119 gene sets were up-regulated in patients with early-stage disease. Three of them were significantly enriched, including KEGG_p53_ signaling_pathway, KEGG_Apoptosis, and KEGG_NOD_ like_receptor_signaling_pathway. The genes in the core enrichment of these gene sets were associated with p53-dependent cell-cycle arrest, apoptosis, and immune response. These results indicated that maintaining the activation of these genes could be crucial to restricting tumor development.
Transcription factors activated in early- and late-stage colon adenocarcinoma. This figure describes the strategy for identifying the biosignatures associated with late-stage colon adenocarcinoma. The enriched score of each gene was calculated using Gene-Set Enrichment Analysis (GSEA) (A). The stage-specific gene expression was identified and could be traced back to transcription factors (B). The transcription factors activated in patients with late-stage or early-stage disease are shown (C). The network of late-stage-specific transcription factors and their downstream genes were plotted by Metacore™ (D). The results demonstrate that SRF, OCT3/4, p73, NF-Y and TBP are activated in late-stage colon adenocarcinomas.
In addition to pathway analysis with GSEA, we further surveyed stage-specific transcription factor activation according to the following strategies not involved in GeneGazer. Firstly, we extracted the enrich score (ES) of each gene, that was calculated in GSEA analysis, and calculated the average (MeanES) and standard deviation (SDES) of the ES. The genes with an ES larger than MeanES + 2 × SDES were defined as being related to late-stage disease. In contrast, the genes with an ES less than MeanES − 2 × SDES were defined as being related to early-stage disease (Figure 4A). We hypothesized that these differentially expressed genes were the result of altered transcription factors crucial for tumor progression (Figure 4B). These transcription factors can be identified using MetaCore™. As shown in Figure 4C, 12 transcription factors were activated in late-stage, 12 were activated in early-stage, and 17 were activated throughout tumor progression. Combining the 12 late-stage-related transcription factors, the transcription factors involved in tumor progression included SRF, p73, NF-Y,TBP, and OCT3/4.
Discussion
In this study, we developed a computing toolkit, GeneGazer, which integrates two pipelines, Gaze_Profiler and Gaze_BioSigner, designed to discover personalized molecular profiles and the biosignatures associated with clinical features, respectively. All analyses were automatically processed after assigning the required parameters via a simple text-based user interface, and the results were output as format-ready for further analysis with other software. We used an RNA-sequencing dataset from a tumor tissue and its paired normal tissue to demonstrate the capability of Gaze_Profiler. The pipeline identified activation of the RAS/RAF/MEK/ERK pathway and canonical WNT pathway in the tumor tissue. In addition, we used mutation and expression datasets of colon cancer in TCGA to demonstrate the capability of Gaze_BioSigner. The pipeline identified that the activation of SRF, OCT-3/4, NF-Y, p73, and TBP was highly associated with the progression of colon adenocarcinoma. These results suggested that GeneGazer is a useful and reliable toolkit for biomedical studies.
The features of GeneGazer. As a user-friendly and reliable toolkit, GeneGazer offers the following features: (i) the integration of computing tools in a simple user interface; (ii) easy interface with other out-sourcing toolkits; and (iii) reliable strategies for identification of somatic variation. GeneGazer integrates many well-established computing tools into a simple text-based user interface. Once the necessary parameters are assigned, the personalized molecular profiling or biosignature discovery will be automatically processed. The results from GeneGazer are ready to be analyzed with other computing tools that are not incorporated in the pipeline. For example, to annotate each variant and estimate its effect at the protein level, the somatic mutations identified by Gazer_Profiler were stored in a vcf4-format text file, which is directly accessible to VarioWatch (http://genepipe.ncgm.sinica.edu.tw/variowatch/main.do) (30). In addition, the personalized molecular profile is directly available to DAVID Bioinformatics Resources (http://david.abcc.ncifcrf.gov) (31, 32) and MetaCore™. With these two bioinformatics sources, the pathways associated with genetic alterations can be identified. This information provides an insight into the mechanisms of these genetic alterations.
To identify somatic variations, many computing tools are available, such as MutTect in GATK (13), mpileup in SAMtools (26), Somaticsniper (19), and VarScan (17, 18). The first three are based on genotype likelihood and the latter, VarScan, is based on Fisher's exact test. We chose to integrate VarScan in our pipeline as it is more conservative and reliable for paired samples with different read counts or insufficient sequencing depth in some alleles, which is a common occurrence in RNA-sequencing data.
Application of GeneGazer in personalized medicine. Personalized molecular profiling generated by GeneGazer can provide useful information for personalized medicine, therapeutic approach, and prognosis. In the example of the CC2 patient, the canonical WNT and RAS/RAF/MEK/ERK signaling pathway was activated; no EGFR overexpression, BRAF or PIK3CA mutation, nor PTEN loss were identified. This profile suggested that the CC2 patient is very likely to be unresponsive to therapy with cetuximab, a commonly used therapy targeting EGFR in colon cancer (33). Instead, tankyRASe inhibitor (for blocking the canonical WNT signaling pathway) (34) or MEK inhibitor (for blocking the RAS/RAF/MEK/ERK pathway) (35) may be an alternative therapeutic choice. More and more information on drug–protein interactions is being gathered (36), and computational modeling has been applied to evaluate the effects of genetic alterations on phenotypic changes (37). By combining this information, personalized molecular profiling can suggest reasonable therapeutic strategies for individual patients.
Application of GeneGazer in cancer research. By using Gaze_BioSigner, we identified late-stage-specific biosignatures from colon adenocarcinoma data in the TCGA database. Activated cAMP/PKA pathway, p53-mediated apoptosis and immune response were associated with late-stage disease. The role of p53 and immune response in colon cancer has been well-documented, but the activation of the cAMP/PKA pathway is less well-understood. The activation of cAMP/PKA pathway has been found in many types of cancer (38-40), including of the stomach (38), prostate (41, 42), thyroid (41), and breast (43-45). Our finding suggests that the cAMP/PKA pathway is also important in late-stage colon cancer.
In addition, five transcription factors, namely SRF, p73, NF-Y, TBP and OCT-3/4, have been linked to late-stage colon cancer based upon the results of the Gazer_BioSigner pipeline and MetaCore™. SRF is an important regulator of tumor progression. Its activation has been identified in prostate (46-48), lung (49), and gastric (50) cancer, hepatocellular carcinoma (51, 52), and papillary thyroid cancer (53). OCT-3/4 has been shown to accelerate prostate cancer progression and aggressiveness (54). NF-Y has been found to support tumor proliferation and progression in many types of cancers, including colon cancer (55). The p53 family protein p73 has been shown to act as a tumor suppressor in many types of cancers (56); however, its activation is highly associated with colon cancer progression (57). TBP is associated with ovarian cancer progression (58). These results suggest that these transcription factors are involved in colon cancer progression and may be potential therapeutic targets.
Conclusion
In the present study, we demonstrated that GeneGazer is a reliable and robust toolkit not only for personalized profiling but also for biosignature discovery. This information could be useful for personalized medication and mechanism study. Thus, GeneGazer could have potential for clinical application and biomedical research.
Acknowledgements
The Authors would like to acknowledge Professor Petrus Tang and Professor Kwan-Wu Chen from Chang Gung University and Professor Yuen-Shin Lin from National Taiwan Chiao-Tung University for kindly giving advice and making suggestions. This work was supported by the Chang Gung Molecular Medicine Research Center, Ministry of Education (grant EMRPD 170291), and Taipei Medical University.
Footnotes
↵* These Authors contributed equally to this work.
- Received October 13, 2015.
- Revision received December 15, 2015.
- Accepted December 16, 2015.
- Copyright© 2016, International Institute of Anticancer Research (Dr. John G. Delinasios), All rights reserved