Abstract
Background: Oncoprotein genes are over-represented in statically defined, low mutation-frequency fractions of The Cancer Genome Atlas (TCGA) datasets, consistent with a higher driver mutation density. Materials and Methods: We developed a “continuously variable fraction” (CVF) approach to defining high and low mutation-frequency groups. Results and Conclusion: Using the CVF approach, an oncoprotein set was shown to be associated with the TCGA low mutation-frequency group in nine distinct cancer types, versus six for statically defined groups; and a tumor-suppressor set was over-represented in the low mutation-frequency group in seven cancer types, notably including BRCA. The CVF approach identified single-mutation driver candidates, such as BRAF V600E in the thyroid cancer dataset. The CVF approach also allowed investigation of cytoskeletal protein-related coding regions (CPCRs), leading to the conclusion that mutation of CPCRs occurs at a statistically significant, higher density in low mutation-frequency groups. Supporting online material for this article can be found at www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php
- TCGA
- mutation density
- oncoproteins
- tumor suppressor proteins
- cytoskeletal proteins
- mutation frequency
The Cancer Genome Atlas (TCGA) (1) (http://cancergenome.nih.gov/) and related cancer DNA sequence databases (2) have provided an opportunity to exploit statistical power to discover commonalities in the genetics, and possibly the advent, of cancer. We recently drew a number of conclusions (3) by segregating five TCGA datasets into high and low mutation-frequency groups: (i) coding region mutagenesis is largely random, with almost no difference in the occurrence of silent versus amino acid changes in the high and low mutation-frequency groups; (ii) the vast majority of coding region mutations occur in very large coding regions, consistent with a significant stochastic aspect to coding region alterations; (iii) a disproportionate representation of tumor suppressor proteins in the low mutation-frequency group could not be established; (iv) a disproportionate representation of oncoproteins in the low mutation-frequency groups could be established for only two of the five TCGA datasets. In those two cases, COAD and LUAD, the disproportionate association of oncoproteins with the low mutation-frequency groups relied on the likelihood that oncoproteins often represent degenerate signaling pathways (4) and could, thus, be grouped for increased statistical power, i.e. an oncoprotein set, rather than individual oncoproteins, was observed to associate with the COAD and LUAD low mutation-frequency groups.
In the present report, the basic algorithm of dividing the TCGA datasets into high and low mutation-frequency groups has been scripted, so that the available computing power can be used to apply a continuously variable definition of the “high” and “low” mutation-frequency groups. With a variable definition, an over-representation of mutations in any low mutation-frequency group (in comparison to the analogous high mutation-frequency group) can be detected, rather than relying on such detection in one arbitrarily defined low mutation-frequency group. This computational approach has provided a substantial increase in the detection of candidate driver mutations.
Materials and Methods
Mutation Annotation Format (MAF) files from the TCGA database containing only somatic mutations detected in tumor samples were downloaded as tab-delimited files from the TCGA data portal. All 3,158,693 mutations detected within 6,482 tumor samples contained in 26 MAF files spanning 24 TCGA datasets were inserted into tables of a single PostgreSQL 9.3.5 relational database. Queries or database views that select the disease, HUGO symbol of the mutated gene, gene type (oncoprotein, tumor suppressor protein, cytoskeletal protein, or none), codon change, start position, and end position of each mutation allowed counting of the total mutations, as well as of the mutations of a certain gene type, gene, and/or gene with a specific codon change, for each dataset. The queries were also designed to select only distinct mutations, keeping in mind that tumor samples within each dataset (cancer type) of the TCGA collection can be compared to one or more matched normal samples, potentially (and erroneously) giving rise to multiple records for identical mutations within a given tumor sample.
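As an illustration only, the selection of distinct mutations and the resulting counts can be sketched in Python; the study itself relied on PostgreSQL queries and Matlab (see SOM). The record keys (`sample`, `gene`, `start`, `end`, `codon_change`) and the function name are hypothetical and do not reflect the actual MAF column names or the database schema.

```python
from collections import Counter

def count_distinct_mutations(records):
    """Count distinct mutations per tumor sample and per mutation.

    `records` is an iterable of dicts with hypothetical keys:
    sample, gene, start, end, codon_change.
    A record repeated for the same sample (for example, because the tumor
    was compared to more than one matched normal) is counted only once.
    """
    seen = set()
    per_sample = Counter()    # total distinct mutations in each sample
    per_mutation = Counter()  # number of samples carrying each mutation
    for r in records:
        key = (r["sample"], r["gene"], r["start"], r["end"], r["codon_change"])
        if key in seen:
            continue  # identical record for the same sample: skip
        seen.add(key)
        per_sample[r["sample"]] += 1
        per_mutation[key[1:]] += 1
    return per_sample, per_mutation

# Toy data (invented for illustration): the first record appears twice.
example = [
    {"sample": "S1", "gene": "BRAF", "start": 100, "end": 100, "codon_change": "GTG>GAG"},
    {"sample": "S1", "gene": "BRAF", "start": 100, "end": 100, "codon_change": "GTG>GAG"},
    {"sample": "S1", "gene": "KRT8", "start": 200, "end": 200, "codon_change": "TCC>GCC"},
    {"sample": "S2", "gene": "BRAF", "start": 100, "end": 100, "codon_change": "GTG>GAG"},
]
per_sample, per_mutation = count_distinct_mutations(example)
```

In this toy input, the duplicated record contributes once to the total for sample S1, so that per-sample totals are not inflated by repeated records.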
After the total mutation counts were determined, the samples of each individual dataset were sorted by their number of total mutations and separated into high and low mutation-frequency groups, each containing N samples (TCGA barcodes), with N ranging from 2 (the minimum required for statistical testing) to half the total number of samples (the maximum without duplicating samples). After the samples were sorted, the ratio of occurrence of particular gene types (oncoprotein, tumor suppressor protein, cytoskeletal protein), as well as of each gene with a specific codon change, relative to the total number of mutations was calculated. However, no mutation was considered further unless it occurred at least 25 times in the pan-cancer database. The p-value from a two-tailed, two-sample t-test, assuming unequal variances, was used to determine the level of significance at which the mean ratio in the high mutation-frequency group differed from the mean ratio in the low mutation-frequency group. Matlab® R2014a was used to query the PostgreSQL database, sort tumor samples, and calculate the reported p-values. All code is available in the supporting online material (SOM) (www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php).
Ethics statement. The corresponding author submitted, and received approval for, a TCGA-use proposal, although all data in this report are publicly available.
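The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Matlab code used in the study (which is in the SOM): it assumes that the ratio is computed per sample, that each sample is summarized as a hypothetical pair (total mutations, mutations in the category of interest), and that every sample has at least one mutation. The function names are invented, and the t-distribution tail is obtained by simple numerical integration rather than by a statistics library.

```python
import math
from statistics import mean, variance

def two_tailed_t_pvalue(t, df, steps=2000):
    """Two-tailed p-value of Student's t for df >= 1.

    Uses the substitution x = sqrt(df) * tan(theta), which turns the
    density integral into norm * integral of cos(theta)**(df - 1),
    evaluated with Simpson's rule (steps must be even).
    """
    theta_max = math.atan(abs(t) / math.sqrt(df))
    norm = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    h = theta_max / steps
    total = 1.0 + math.cos(theta_max) ** (df - 1)  # the two endpoints
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    area = norm * total * h / 3  # P(0 < T < |t|)
    return min(1.0, max(0.0, 1.0 - 2.0 * area))

def welch_t_test(a, b):
    """Two-sample t-test with unequal variances; returns p or None if undefined."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    se2 = va + vb
    if se2 == 0.0:
        return None  # both groups constant: the statistic is undefined
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return two_tailed_t_pvalue(t, df)

def cvf_scan(samples):
    """Scan every group size N from 2 to half the number of samples.

    `samples`: list of (total_mutations, category_mutations) pairs,
    one per tumor sample (hypothetical input layout).
    Returns a list of (N, mean_low - mean_high, p_value).
    """
    ordered = sorted(samples, key=lambda s: s[0])
    ratios = [category / total for total, category in ordered]
    results = []
    for n in range(2, len(ordered) // 2 + 1):
        low, high = ratios[:n], ratios[-n:]  # N fewest vs. N most mutated
        results.append((n, mean(low) - mean(high), welch_t_test(low, high)))
    return results

# Toy data (invented): few-mutation samples carry a larger category fraction.
toy = [(10, 2), (12, 3), (15, 3), (20, 5), (500, 5), (600, 4), (800, 6), (1000, 8)]
for n, diff, p in cvf_scan(toy):
    print(n, round(diff, 4), p)
```

A positive difference in the second column indicates over-representation in the low mutation-frequency group. Plotting the p-values against N (or against N divided by the number of samples) would give the kind of continuous-fraction curve that the CVF approach is built on.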
Results
Initial work was based on single, arbitrary definitions of the high and low mutation-frequency groups (3). That work is reproduced herein by scripting the initial algorithm, and extended to additional TCGA datasets (Table I). Results of the scripted version of the algorithm were identical for the previously studied datasets. Several TCGA datasets not previously studied (LAML, SKCM, THCA, UCEC) indicated a disproportionate level of oncoprotein coding region mutations in the low mutation-frequency groups. And for the first time, this approach indicated a disproportionate level of tumor suppressor mutations in the low mutation-frequency groups, in previously unstudied datasets (KIRC, UCEC).
To search for statistical significance of a disproportionate association of the oncoprotein and tumor suppressor sets with the low mutation-frequency groups, using the same basic paradigm but with the CVF strategy, we plotted p-values against continuous fractions for definitions of the high and low mutation-frequency groups (Methods; SOM: www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php), with results consistent with BRAF as an oncoprotein (Figure 1). We repeated this approach with the previously defined oncoprotein and tumor suppressor sets (3), again keeping in mind extensive signaling pathway degeneracy in cancer (4) (Table II). Results indicated an association of one or the other, or both, cancer-gene sets with TCGA datasets where the approach using fixed mutation group fractions did not lead to such detection, illustrating the increased opportunities offered by the CVF modification. In particular, we detected a significantly increased association of the oncoprotein set with the low mutation-frequency group in three additional TCGA datasets: HNSC, LIHC and UCS (compare Tables I and II). We detected an increased association of the tumor suppressor set with the low mutation-frequency group in five additional datasets: BRCA, COAD, HNSC, LUAD and STAD.
To determine whether the CVF approach had the resolving power to detect a disproportionate association of individual mutations and genes with the low mutation-frequency group, we applied the algorithm to every mutation position in the TCGA database. Results with p<0.01 are indicated in Table III; results with p<0.05 are indicated in Table S3 and in Excel files in the SOM (www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php). As expected, a number of well-studied mutations and genes, such as BRAF with the V600E amino acid alteration, were readily detected as disproportionately associated with the low mutation-frequency groups, in particular for the SKCM (Figure 1B and Table III) and THCA datasets (Table III). In addition, numerous mutations were specifically associated with the high mutation-frequency groups, which could represent numerous possibilities, including lack of a driver status, i.e., an artifact of high-level mutagenesis; or a requirement for cooperation with other mutations for driver status (with such cooperation only occurring when there is a high enough level of mutagenesis for “two hits”).
Other coding regions, less well connected with the study of cancer, were significantly associated with the low mutation-frequency groups, representing several different cancer datasets, at p<0.05 but p>0.01: SKCM, MYO5B, GTC>GCC (nucleotide position 47363917); HNSC, AQP7, AGT>AGA (33385614); ACC, KRT8, TCC>GCC (53298675); LIHC, KRT8, TCC>GCC (53298675) (Table S3, SOM).
Large coding regions are particularly susceptible to the stochastic process that plays a large role in mutation, as reflected by the TCGA datasets (3). The role of gene size in mutational susceptibility is consistent with large gene size being a significant factor in gene-partner inclusion in cancer fusion genes (5, 6). Cytoskeletal protein-related coding regions (CPCRs), many of which are among the largest coding regions in the human genome (3), have long been thought to play a role in cancer development and metastasis, but with contradictory conclusions (7-13). Interestingly, a recent report indicated a CPCR mutation associated with breast cancer metastasis to the lymph node (14).
To determine whether CPCRs as a class (3) (SOM: Excel file: list of CPCRs) (www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php) could represent mutated, candidate driver genes, we determined whether there was disproportionate representation of CPCR genes in the low mutation-frequency groups using the approach of Parry et al., with a static definition of high and low mutation-frequency groups (Table IV), and with a CVF approach (Figure 1 and Table V). Results indicated a highly significant, disproportionate presence of CPCR mutations in the low mutation-frequency groups for numerous TCGA datasets.
Discussion
The above results indicate that varying the fraction of samples included in defining high and low mutation-frequency groups dramatically extends the usefulness of a mutation density-based approach to identifying candidate driver mutations, i.e. it leads to identification of a greater number of candidate driver mutations. All data-mining-based algorithms used for identification of candidate drivers identify just that: candidates. Furthermore, there is the presumption that empirical work is required for verification of the function of the candidate driver mutation in vivo. However, the above CVF approach readily identifies several known oncoproteins, such as IDH1 and BRAF V600E. Thus, other mutations and genes revealed by the CVF approach, hitherto not considered extensively in oncogenesis, are likely to be revealed as functionally relevant with empirical approaches, for example, MYO5B, discussed above (Table S3).
It is likely that, at a low level of statistical significance, background mutagenesis would be identified by the CVF approach. For example, cells having few mutations could have a higher density of mutations due to DNA replication error rates that are more common (repeated) at one position in the genome, in comparison to cells with a very high mutation rate due to, for instance, greater exposure to mutagens. The high concentration of mutagens could conceivably overwhelm a propensity for a high replication error rate (with a bias for a particular genome position), and thus such background mutations would be detected as statistically “significantly” associated with the low mutation-frequency group. Indeed, Table S3 (SOM) (www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php) lists several silent mutations, all of which have a p-value less than 0.05 but greater than 0.01.
It is also likely that a more comprehensive application of statistical tools would enhance the yield from what is fundamentally a mutation density-based approach to identifying cancer driver genes. However, the clear and dramatic identification of positive controls, such as BRAF V600E in the SKCM (melanoma) dataset, indicates that the current statistical analysis is productive.
CPCRs are extensively mutated in many cancers (3), but understanding the role of these mutations is complicated by the large sizes of the CPCR coding regions. Large coding region sizes make routine experimental approaches, for example, DNA transfections, difficult, and thus there is less groundwork available to justify more extensive approaches, such as generating mice with relatively sophisticated genetic engineering features. In addition, past research regarding the cytoskeleton in tumorigenesis has been contradictory (7, 12, 13, 15-17). Thus, the CVF approach offers a third avenue of investigation for the study of CPCRs, one that circumvents the impracticality of conventional experimental approaches. In particular, several of the mutations indicated as significantly associated with the low mutation-frequency group at the p<0.05 level, e.g., MYO5B and KRT8, are related to the cytoskeleton and cell shape functions.
The above CPCR results raise the question of whether cell-shape changes that accompany a disorganized cytoskeleton are due to the relatively common mutation of very large, genetically vulnerable CPCRs. The above results also raise the question of whether the common spheroid shape of cancer-drug resistant cells (18-25) is traceable to the same genetic vulnerability, leading to cells with reduced surface area to volume ratios and thus cells with lower intracellular drug concentrations.
Interestingly, as noted by inspection of Table IV, three of the TCGA datasets (BLCA, COAD, READ) demonstrate an association of CPCR mutations with the low mutation-frequency groups; and two of the TCGA datasets (KIRC, PRAD) indicate that the CPCRs are associated with the high mutation-frequency groups. This latter result raises the question of whether the tumorigenic effects of CPCR mutations in certain cancer types occur only with a high mutation burden, where there is a greater possibility that a CPCR mutation will cooperate with mutations of other groups of proteins, such as conventional oncoproteins or tumor suppressor proteins.
Acknowledgements
The Authors acknowledge the assistance of USF research computing; and the financial support of the taxpayers of the State of Florida.
Footnotes
Supporting Online Material
www.universityseminarassociates.com/Supporting_online_material_for_scholarly_pubs.php
Supporting online material can also be obtained by emailing the Authors.
Conflicts of Interest
None.
- Received July 14, 2015.
- Revision received August 31, 2015.
- Accepted September 7, 2015.
- Copyright© 2015, International Institute of Anticancer Research (Dr. John G. Delinasios), All rights reserved