Abstract
Background: The uncharacterized proteins of the human proteome offer an untapped potential for cancer biomarker discovery. Numerous predicted open reading frames (ORFs) are present in diverse chromosomes. The mRNA and protein expression data, as well as the mutational and variant information for these ORF proteins are available in the cancer-related bioinformatics databases. Materials and Methods: ORF proteins were mined using bioinformatics and proteomic tools to predict motifs and domains, and cancer relevance was established using cancer genome, transcriptome and proteome analysis tools. Results: A novel testis-restricted ORF protein present in chromosome X called CXorf66 was detected in the serum, plasma and neutrophils. This gene is termed secreted glycoprotein in chromosome X (SGPX). The SGPX gene is up-regulated in cancer of the brain, lung and in leukemia, and down-regulated in liver and prostate cancer. Brain cancer in female patients exhibited elevated copy numbers of the SGPX gene. Conclusion: The SGPX gene is a putative novel cancer biomarker. Our results demonstrate the feasibility of mining the ‘dark matter’ of the cancer proteome for rapid cancer biomarker discovery.
- Signal peptide
- ORF
- ‘dark matter’ of the genome
- serum protein
- uncharacterized proteins
- biomarkers
- X-chromosome
- cell trafficking
- vesicular transport
- secreted protein
The human genome project is an attractive starting point for cancer gene discovery (1-3). Numerous drug targets and biomarkers have emerged from the Genome Project (3-7). Expression specificity provides a strong rationale for finding targets that can lead to highly selective and less toxic therapeutics (6-8). An attractive area for mining the human genome resides in the uncharacterized proteome (9). Currently, over half of the predicted proteins in the human genome are of unknown nature (10). These proteins and the non-coding RNAs are called the ‘dark matter’ of the genome (11-13). Whereas in the past most gene discovery has revolved around known genes due to the ease of follow-up studies (14-17), the dark matter of the genome offers an untapped potential (18). Realizing the importance of this area, the US National Cancer Institute has recently announced a major initiative called illuminating the dark matter for druggable targets (http://commonfund.nih.gov/idg/index).
Establishing cancer relevance for novel or uncharacterized proteins is a crucial first step in lead discovery. The cancer genomes of patients from around the world can be readily mined using databases such as the cBioPortal (19), canSAR (20), the Catalogue of Somatic Mutations in Cancer, COSMIC (21), The Cancer Genome Atlas, TCGA (http://cancergenome.nih.gov) and the International Cancer Genome Consortium, ICGC (https://www.icgc.org).
The availability of microarray databases such as Oncomine (22) and ArrayExpress (23), and protein expression analysis tools including the Human Protein Atlas (24), the Human Protein Reference Database, HPRD (25), and the Model Organism Protein Expression database, MOPED (26), can facilitate cancer target verification.
Numerous proteomic analysis tools including Expasy (http://www.expasy.org), PredictProtein (27) and the MESSA meta-analysis server (28) are available for predicting the putative motifs and domains of novel proteins. These tools can be used to mine the dark matter of the proteome to identify motifs with biomarker and druggable potential, such as signal peptide, receptor, transporter and enzyme signatures (29-31).
In the present report, we demonstrate the feasibility of discovering novel biomarkers from a database of cancer-related uncharacterized proteins we have recently developed (18). An X-chromosome specific ORF, CXorf66, was rapidly validated for secreted nature using the protein expression databases. Detailed bioinformatics and proteomics characterization of the CXorf66 gene confirmed the cancer biomarker potential of this gene. Our results support the potential of the uncharacterized proteome to be tapped for cancer biomarker discovery.
Materials and Methods
The bioinformatics tools used in the study are shown in Table I. All the bioinformatics mining was verified by two independent experiments. Only results that were statistically significant according to each tool’s requirements are reported. Prior to using a bioinformatics tool, a series of control query sequences was tested to evaluate the predicted outcome of the results.
Results
We recently established a database of expression-verified uncharacterized ORFs by mining the cancer proteome (18). In that study, using a streamlined approach involving gene expression, protein motif and domain analysis and Genome-Wide Association Studies (GWAS) analysis, we identified a novel cancer biomarker called carcinoma related EF-hand protein (CREF). The CREF gene (C1orf87) encodes a calcium-binding protein specific to breast, lung and liver cancer. These results supported our premise that it is possible to harness the dark matter of the cancer proteome systematically for cancer drug target and biomarker discovery. Reasoning that prediction of signal peptide motifs in these uncharacterized ORFs may lead to novel cancer diagnostic marker discovery, we have undertaken bioinformatics mining of the hits from the database. Preliminary experiments using the SignalP tool (http://www.cbs.dtu.dk/services/SignalP/) identified an ORF, CXorf66, which may harbor a putative signal peptide sequence. Encouraged by this finding, we undertook a comprehensive bioinformatics and proteomics analysis of the CXorf66 gene.
CXorf66 expression in normal tissues. Initially, protein expression analysis tools were used to verify the secreted nature of the CXorf66 gene product. Protein expression for the CXorf66 gene was detected in the serum and testis using the HPRD (http://www.hprd.org) (Figure 1A). The GeneCards (http://www.genecards.org) summary of protein expression databases showed the presence of the secreted glycoprotein in chromosome X (SGPX) protein in hematopoietic tissues, plasma and platelets (Figure 1B).
Additional evidence for expression of the CXorf66 protein in normal tissues was obtained from the Human Protein Atlas (http://www.proteinatlas.org). In tissue microarray sections, the CXorf66 protein was detected at a medium expression level in 7 out of 77 analyzed normal tissue cell types. Major normal tissues included salivary gland, spleen, lymph node, kidney and tonsil. A subset of leukocytes scattered throughout most tissues showed strong cytoplasmic positivity. Most remaining normal tissues were negative. Immunohistochemical (IHC) staining of normal testis with the antibody HPA048517 showed membranous/cytoplasmic staining (Figure 1C). Predominant staining was observed in the cells in seminiferous ducts (65%). CXorf66 protein expression in normal tissues was further correlated with mRNA expression. The Unigene EST expression tool (http://www.ncbi.nlm.nih.gov/unigene) indicated a restricted expression in testis. Developmentally, CXorf66 expression was detected in the fetus but not in adult tissues.
CXorf66 expression in tumors. We next investigated CXorf66 expression in diverse tumors. The NextBio meta-analysis tool (www.nextbio.com) indicated that the CXorf66 gene is up-regulated in brain and lung cancer, and in lymphoid leukemia (Figure 2A). In contrast, down-regulation of CXorf66 gene expression was seen in liver and prostate carcinomas. Uterine cancer association (most highly correlated) with the CXorf66 gene was seen only at the somatic mutation level. Expression of the CXorf66 protein was seen in about 4% of tumors analyzed by tissue microarrays in the Human Protein Atlas tool (see Figure 2B). Elevated expression of the CXorf66 protein was seen in carcinoids (colon), lung, ovarian and urothelial cancers. An expanded view of the carcinoid IHC (Figure 2C) demonstrates strong cytoplasmic and membranous staining. The current protein expression data for CXorf66 are available for a very limited set of patient samples, and additional verification is needed. The CGAP short SAGE tag (sTTTCAAGCAA) analysis of the CGAP tissue libraries (http://cgap.nci.nih.gov) showed down-regulation of the CXorf66 mRNA in liver and prostate carcinomas, as well as up-regulation in brain and lung cancer. These results were consistent with the NextBio meta-analysis (Figure 2). Analysis of the NCI60 cancer cell lines for CXorf66 mRNA expression using the NCI Developmental Therapeutics Program (DTP) molecular target database (http://dtp.nci.nih.gov) indicated CXorf66 mRNA expression in non-small cell lung carcinoma (NCI-H322M, NCI-H460, NCI-H52), breast (MDA-MB-231, HS578T, BT-549 and T47D), CNS cancer (SF-268, SF-295, SF-539, SNB-19 and SNB-75) and ovarian carcinoma-derived (OVCAR 4) cell lines. We next performed an Oncomine microarray analysis (https://www.oncomine.org) for the CXorf66 gene. The CXorf66 copy number was significantly elevated in anaplastic astrocytomas, anaplastic oligodendrogliomas and in primary and secondary glioblastomas (Figure 3A). Significant differences in the CXorf66 gene copy number were seen between males and females in these tumor types (Figure 3B).
Characterization of the SGPX gene. The molecular characterization of the CXorf66 gene is shown in Table II. The CXorf66 gene is present on chromosome Xq27.1 and codes for an ORF of 361 amino acids (39,944 Da). NCBI-AceView (http://www.ncbi.nlm.nih.gov/ieb/research/acembly/) predicted one primary transcript with three exons spread over 9,762 bp. This gene is present in the common ancestor of human and mouse. According to NCBI HomoloGene (http://www.ncbi.nlm.nih.gov/homologene), orthologs are found in Pan troglodytes, Mus musculus, Rattus norvegicus, Bos taurus and Canis familiaris.
Putative transcription factor binding sites for the SGPX gene included Transcriptional repressor CTCF, CCAAT/enhancer-binding protein beta (CEBPB), Myc proto-oncogene protein (MYC), DNA-directed RNA polymerase II subunit RPB1 (POLR2A), Transcription factor E2F1, Proto-oncogene c-Fos and Transcription factor Sp1 (UCSC browser, http://genome.ucsc.edu). In fetal mouse stem cells, overexpression of cMYC caused down-regulation of the SGPX mRNA, suggesting a role of cMYC in the transcriptional regulation of the SGPX gene (NextBio, data not shown). The NextBio meta-analysis revealed three miRs that are implicated in the regulation of the SGPX gene. The most highly correlated miRs by meta-analysis included hsa-miR-130b/a (sarcoidosis, lung, glioblastoma), hsa-130a (lung cancer) and hsa-130b (glioblastoma). An additional miR, hsa-miR-1290, was predicted by GeneCards.
Characterization of the CXorf66 protein. Using a streamlined approach that we recently developed to characterize novel ORFs (18), a detailed motif and domain analysis of the CXorf66 protein was undertaken (Table III). The UniProtKB database (http://www.uniprot.org) analysis showed that CXorf66 (UniProt ID Q5JRM2) is a single-pass type-I transmembrane protein with a signal peptide (amino acids 1-19). Topologically distinct extracellular (amino acids 20-47), transmembrane helical (amino acids 48-68) and cytoplasmic (amino acids 69-361) domains, as well as a serine-rich region (amino acids 91-177), were detected in the CXorf66 protein. The presence of the signal peptide was further verified using the SignalP prediction tool with the eukaryotic networks (http://www.cbs.dtu.dk/services/SignalP/). This tool predicted a signal peptide sequence, MNLVICVLLLSIWKNNCMT, with the most likely cleavage site between positions 19 and 20: CMT-TN (Y-score 0.476 at amino acid 18). SecretomeP (http://www.cbs.dtu.dk/services/SecretomeP/) analysis of CXorf66 showed that it is not secreted by the non-classical secretory pathway (data not shown). These results, together with the protein expression in the plasma and serum (Figure 1), supported the premise that CXorf66 is a secreted transmembrane protein. Hence, CXorf66 was named secreted glycoprotein in chromosome X (SGPX).
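The topology reported above can be sanity-checked with a few lines of code. The following is a minimal sketch, not part of the original analysis: it only verifies that the coordinates quoted in the text tile the 361-residue ORF without gaps or overlaps and that the quoted signal peptide has the stated length. The function and variable names are illustrative assumptions.

```python
# Coordinates (1-based, inclusive) exactly as quoted in the text for CXorf66.
PROTEIN_LENGTH = 361
SIGNAL_PEPTIDE = "MNLVICVLLLSIWKNNCMT"  # predicted signal peptide from the text

SEGMENTS = [
    ("signal peptide", 1, 19),
    ("extracellular", 20, 47),
    ("transmembrane helix", 48, 68),
    ("cytoplasmic", 69, 361),
]
SERINE_RICH = (91, 177)  # serine-rich region quoted in the text


def check_tiling(segments, length):
    """Return True if the segments cover 1..length with no gaps or overlaps."""
    expected_start = 1
    for _name, start, end in sorted(segments, key=lambda s: s[1]):
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start == length + 1


def segment_lengths(segments):
    """Map each segment name to its length in residues."""
    return {name: end - start + 1 for name, start, end in segments}


def containing_segment(position, segments):
    """Return the name of the segment that contains a 1-based position."""
    for name, start, end in segments:
        if start <= position <= end:
            return name
    return None


if __name__ == "__main__":
    print(len(SIGNAL_PEPTIDE))                     # length of the quoted peptide
    print(check_tiling(SEGMENTS, PROTEIN_LENGTH))  # do the segments tile the ORF?
    print(segment_lengths(SEGMENTS))
    print(containing_segment(SERINE_RICH[0], SEGMENTS))
```

Under these coordinates the serine-rich region falls entirely within the cytoplasmic segment, which the last call makes explicit.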
To further characterize the SGPX protein’s nature, we used diverse motif and domain analysis tools (Table III). The NCBI CDD analysis identified a superfamily (DUF 936) present in several hypothetical proteins from Arabidopsis (e value: 5.97e-03). The MESSA analysis tool predicted an additional conserved domain in the SGPX protein (KOG0566, inositol-1,4,5-triphosphatase-synaptojanin, INP51/INP52/INP53 family) with an e value of 7e-04. This family of proteins is involved in intracellular trafficking, secretion and vesicular transport (32).
The PFAM tool (http://pfam.sanger.ac.uk) identified the FAM 163 family signature, present in neuroblastoma-derived secretory protein (NDSP). This signature was further verified by Motif (http://www.genome.jp). The NDSP is highly expressed in neuroblastoma compared to other tissues, suggesting that it may be useful as a marker for metastasis in bone marrow (33). The HMMER tool (http://hmmer.janelia.org) further verified the transmembrane and signal peptide domains. In addition, a ribosomal protein S2 PFAM domain was predicted at amino acids 18-113 in the SGPX protein.
The PRODOM domain analysis (http://prodom.prabi.fr/) identified two distinct domains, PDA8v7u5 (amino acids 1-149, e value: 4e-48) and PDA3C6G2 (amino acids 186-356, e value: 1e-86). These two domains further verified that SGPX is a transmembrane glycoprotein and suggested that the full-length protein may be a precursor to the secreted product.
The secreted nature of the SGPX protein was further verified using the PredictProtein meta-analysis tool. The sub-cellular localization for the eukarya domain was predicted as secreted (GO term ID: GO:0005576, prediction confidence 77%). Three protein binding sites were identified (amino acids 184, 270 and 278) in the SGPX protein using profisis (ISIS), a machine learning-based method (34). The secondary structure of the SGPX protein was classified as mixed.
The nature of the post-translational modification sites was next investigated using diverse proteomic tools from the Swiss Expasy server (http://www.expasy.org). The Prosite tool (http://prosite.expasy.org) identified a serine-rich motif. The SGPX protein is modified by phosphorylation at serine (114, 165) and threonine (169) as predicted by GeneCards. Phosphorylation sites on three residue types (Ser: 41, Thr: 2 and Tyr: 5) were predicted by the NetPhos 2.0 server (http://www.cbs.dtu.dk/services/NetPhos/). Furthermore, a protein kinase C-specific protein phosphorylation site was predicted by the NetPhosK 1.0 server (http://www.cbs.dtu.dk/services/NetPhosK/). The Myristoylator tool (http://web.expasy.org/myristoylator/) indicated that SGPX is not myristoylated. SGPX is, on the other hand, predicted to be glycosylated, both O-linked (amino acids 92 and 94) and N-linked (amino acid 24, NGSS), according to the NetOGlyc and NetNGlyc tools (http://www.cbs.dtu.dk/services/NetNGlyc/).
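The N-linked site quoted above (NGSS at amino acid 24) matches the canonical N-glycosylation sequon, N-X-S/T with X being any residue except proline. The sketch below illustrates only that sequon scan; it is not the neural-network method used by the prediction servers cited in the text, and the toy sequence and function name are hypothetical.

```python
import re

# N-X-[S/T] sequon, where X is any residue except proline.
# A lookahead is used so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_n_glyc_sequons(seq):
    """Return (1-based position of the Asn, matched tripeptide) per sequon."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq.upper())]


if __name__ == "__main__":
    # Hypothetical toy peptide, NOT the SGPX sequence.
    toy = "AANGSSPNPTNNT"
    for position, tripeptide in find_n_glyc_sequons(toy):
        print(position, tripeptide)
```

A sequon match is necessary but not sufficient for glycosylation, which is why dedicated predictors add context-dependent scoring on top of this pattern.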
We next performed 3-D modeling for the SGPX protein. The UCSC genome browser analysis of the CXorf66 gene identified a protein model template in the Modbase comparative 3-D structural database. A 17% identity with tyrosine-protein phosphatase, auxilin (PDB code, 3n0a, e value, 0, reliable model) was detected (Figure 4A). Auxilin, a J-domain containing protein, is involved in the recruitment of the Hsc70 uncoating ATPase to newly-budded clathrin-coated vesicles (35). The top template used by I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/) to generate the 3-D model was properdin, a glycoprotein (Figure 4B).
The Meta Server for Sequence Analysis (MESSA) and the Predict Protein tools were used to further characterize the structure of the SGPX protein. A disordered region lacking a stable tertiary structure was predicted using both of these Meta structural tools. Secondary structures included helical (amino acids 11-16 and 35-67) and strand (amino acids 3-10).
GWAS studies on the SGPX gene. The dbSNP database showed 230 variants with one missense variant (amino acid 233, P to L, haploid frequency: P=67.9 and L=32.1). One deletion variant, esv2672915, is present (36). To further establish a strong correlation of the SGPX gene with the cancer genome from different patients, we next performed GWAS analysis using multiple cancer genome analysis tools (Figure 5). Mining the COSMIC somatic mutations database (http://cancer.sanger.ac.uk) indicated that the SGPX mutations are largely missense and nonsense (Figure 5A). The cBioPortal (http://www.cbioportal.org/) identified SGPX gene amplifications (prostate, lung and gliomas), deletions (sarcomas) and mutations (lung, gliomas and pancreas) in tumors (Figure 5B). One frameshift insertion mutation (p.H273fs*8) was seen in a mucinous colon adenocarcinoma patient (TCGA-AA-A01R-01). Current COSMIC mutations largely include hematopoietic, lung, CNS, ovary, uterus, gastric, melanoma and kidney tumors.
We next performed a comprehensive mutational analysis for the SGPX gene using the ICGC world-wide cancer genome database. The data were compared with canSAR and COSMIC, and common mutations were manually curated. Table IV shows the compilation of the current mutations from these databases. Predominant mutations centered around the lung and uterus.
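For readers unfamiliar with the protein-level notation used for the frameshift above, p.H273fs*8 denotes a histidine at position 273 as the first altered residue, with the shifted reading frame reaching a stop codon at position 8 counted from that residue. The following sketch parses only this simple form of the notation and checks the position against the 361-residue ORF; the function names are illustrative assumptions and the pattern does not cover the full variant nomenclature.

```python
import re

# Simple protein-level frameshift, e.g. "p.H273fs*8":
# one-letter reference residue, 1-based position, and the position of the
# new stop codon counted from the first changed residue.
FRAMESHIFT = re.compile(r"^p\.([A-Z])(\d+)fs\*(\d+)$")


def parse_frameshift(hgvs_p):
    """Parse a simple frameshift string; return None if it does not match."""
    m = FRAMESHIFT.match(hgvs_p)
    if m is None:
        return None
    ref, pos, stop = m.groups()
    return {"ref": ref, "position": int(pos), "stop_offset": int(stop)}


def within_protein(position, length=361):
    """Check that a 1-based position lies inside an ORF of the given length."""
    return 1 <= position <= length


if __name__ == "__main__":
    variant = parse_frameshift("p.H273fs*8")
    print(variant)
    print(within_protein(variant["position"]))
```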
CXorf66/SGPX-interacting proteins. Understanding the nature of the proteins that interact with an uncharacterized protein is critical to deciphering its putative pathways and mechanisms. Hence, the String interactome tool (http://string-db.org) was used to identify putative protein partners for the SGPX gene (Figure 6). Numerous Sperm protein associated with the nucleus on the X chromosome A (SPANX) family members are predicted to interact with the SGPX protein. The SPANX gene family members are located on the X chromosome. This gene family encodes proteins that play a role in spermiogenesis (37). These proteins represent a specific subgroup of cancer/testis-associated antigens (37). The SPANX family members are associated with prostate cancer (38). The involvement of the SPANX family of proteins in the SGPX pathway was also verified by co-expression analysis using the Oncomine Microarray database in breast tumors (data not shown).
Another predicted protein partner for SGPX interaction was FUN14 domain-containing protein-2 (FUNDC2), also known as cervical carcinoma oncogene 3. The FUNDC2 protein is involved in hepatitis C and cervical cancer (39). This gene is also present on chromosome X. Additional X chromosome-specific genes predicted to interact with the SGPX protein include paraneoplastic antigen-like (PNMA6B) and testis expressed 28 (TEX28P1), both of which are pseudogenes. In addition, ncRNAs were also predicted to interact with the SGPX protein.
Regulation of the SGPX gene. Regulation of gene expression occurs at different levels including promoter methylation, transcription factors, cell cycle and the ncRNAs (40-44). Hence, we have attempted to develop an understanding of SGPX gene regulation using the NextBio meta-analysis tool. Two miRs (hsa-130a and hsa-130b) were found to be most highly correlated with the SGPX mRNA expression. The hsa-130a miR was up-regulated (16-fold) in glioblastoma (Bioset: Glioblastoma multiforme WHO grade 4 vs. normal brain tissue, p-value=2.5e-21). The hsa-130b miR was up-regulated (2.36-fold) in blood from lung cancer patients (Bioset: Blood from patients with lung cancer vs. healthy control, p-value=0.0001).
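Fold changes such as those quoted above are often compared on a log2 scale, where each unit corresponds to a doubling. The short sketch below converts the two reported values; it is an illustration of the arithmetic only, and the function name is an assumption.

```python
import math


def log2_fold_change(fold):
    """Convert a linear fold change (> 0) to its log2 value."""
    if fold <= 0:
        raise ValueError("fold change must be positive")
    return math.log2(fold)


if __name__ == "__main__":
    # Fold changes as reported in the text for the two miRs.
    reported = {"hsa-130a (glioblastoma)": 16.0, "hsa-130b (lung cancer blood)": 2.36}
    for label, fold in reported.items():
        print(label, round(log2_fold_change(fold), 2))
```

On this scale the 16-fold change corresponds to four doublings, while the 2.36-fold change is slightly more than one.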
The DNA methylation status of the SGPX gene was next investigated. In T-cell lymphoblastic leukemia, hypermethylation of the SGPX gene was seen as monitored by CpG island methylator phenotype (45). Lapatinib, a dual Epidermal growth factor (EGF) and Receptor tyrosine-protein kinase erbB-2 (HER2) kinase inhibitor, causes G0/G1 cell cycle arrest (46). In a breast cancer cell line, SKBR3, lapatinib treatment caused up-regulation of the SGPX mRNA, suggesting a G1 regulation of SGPX gene. The G1 regulation of the SGPX gene was further corroborated by NextBio Meta analysis. Mutations and overexpression of the G1/S cyclin D1 (CCND1) (47) resulted in down-regulation of SGPX mRNA (data not shown).
Discussion
Discovery of novel secreted proteins offers a biomarker potential for non-invasive diagnosis of cancer. New diagnostic and therapeutic targets are likely to emerge among the numerous uncharacterized proteins with putative ORFs. The results presented in the present study demonstrate the feasibility of mining the human proteome for cancer target discovery.
The CXorf66/SGPX gene is inferred to encode a secreted protein from the detection of the protein in body fluids including plasma and serum. Motif and domain analysis of the SGPX protein showed the presence of the classical signal peptide at the N-terminus. Transmembrane, extracellular and cytoplasmic tail features were identified in the SGPX protein sequence. The SGPX protein harbors a serine-rich motif (amino acids 91-177). This motif is present in numerous oncoproteins and is involved in binding to β-catenin (48).
The SGPX gene is a testis-restricted gene located on chromosome Xq27.1. Various cancer-associated genes are located in this region, including members of the SPANX (sperm protein associated with the nucleus mapped to the X chromosome) family (39). The NCBI map positions the MCF.2 cell line-derived transforming sequence and testicular germ cell tumor susceptibility 1 genes at this locus. A list of all genes present in chromosome X (Atlas of Genetics and Cytogenetics in Oncology and Hematology) indicates the presence of numerous cancer-related genes (49).
The SGPX gene showed a complex pattern of expression in diverse tumors. It is found to be up-regulated in brain and lung tumors and in leukemia. On the other hand, it is down-regulated in liver and prostate carcinomas. Somatic mutations were found in glioma, lung, uterine and pancreatic cancer. Deletions were found in sarcoma and amplifications in prostate, lung and glioma. To date, however, no association has been established in testicular cancer, which is not yet adequately represented in the numerous cancer genome databases.
Microarray datamining showed interesting sex differences in patients with brain cancer. In different subtypes of brain cancer, female patients showed significantly elevated DNA copy numbers of the SGPX gene. In mammals, silencing of one of the two X chromosomes is required for dosage compensation (50). Some X-linked genes, however, escape such silencing. Hypermutation of the inactive X chromosome in females contributes to various cancers (51). It is tempting to speculate that the SGPX gene may play an important role in the development and progress of reproductive and urological cancers. Additional experiments are warranted to clarify the relevance of the SGPX gene in female cancer.
The function of the SGPX gene is unclear. However, our results with the 3-D modeling and the interactome analysis offer a clue. The structural homologues include the Phosphatidylinositol 3,4,5-trisphosphate 3-phosphatase and dual-specificity protein phosphatase (PTEN)-like region of auxilin, which is involved in vesicular budding (35). The SGPX protein shows regions of similarity to inositol-1,4,5-triphosphate 5-phosphatase/synaptojanin. The family members of this protein are involved in intracellular trafficking, secretion and vesicular transport (52). We speculate that the SGPX gene function involves cellular trafficking, transport and budding.
HMMER identified a ribosomal S2 Protein Family (PFAM) domain in the SGPX protein (amino acids 18-113). This domain encompasses DS RNA and protein binding signatures and is involved in mitogenic fibroblast growth factor binding (GO). The RPS2 mRNA was expressed in all cancer cells and non-malignant cell lines tested, but was not expressed in normal tissues except for the testis, muscle, and peripheral mononuclear leukocyte cells (53). The protein expression of RPS2 correlates with the tumor types for the SGPX gene (Human Protein Atlas, data not shown). The S2 domain similarity raises a possibility that the SGPX protein may be involved in mitogenic signaling pathways.
Additional insight into the SGPX protein comes from the NDSP family signature predicted by the PFAM tool. NDSP is a secretory protein highly specific to neuroblastoma (33). The NDSP protein expression pattern correlated with the SGPX in the tumors (Human Protein Atlas, data not shown). The precise function of NDSP, however, is not known. In view of the shared signature and expression profile with the SGPX protein, we postulate that the SGPX protein may share a functional role similar to the NDSP in brain tumors. Alternatively, NDSP and the SGPX could be interacting protein partners.
The interactome analysis predicted SGPX interactions with SPANX family members, which are involved in spermiogenesis (32). We postulate that the SGPX gene is involved in the trafficking and transport of sperm cells.
In conclusion, the discovery of the SGPX gene from the dark matter of the human proteome and the establishment of its relevance to major cancers underscore the power of bioinformatics in mining of the cancer genome. The SGPX gene offers a valuable biomarker potential for cancer.
Conflicts of Interest
None.
Contributions
RN was responsible for the overall execution of the project. Data generation and validation were performed by APD, PB and SH.
Acknowledgments
We thank the cancer genome databases canSAR, cBioPortal, the ICGC, TCGA and COSMIC for the mutation datasets. This work was supported in part by the Genomics of Cancer Fund, Florida Atlantic University Foundation. We thank Jeanine Narayanan for editorial assistance.
- Received March 4, 2014.
- Revision received March 12, 2014.
- Accepted March 13, 2014.
- Copyright© 2014, International Institute of Anticancer Research (Dr. John G. Delinasios), All rights reserved