Abstract
We report on next-generation transcriptome sequencing results of three human hepatocellular carcinoma tumor/tumor-adjacent pairs. This analysis robustly examined ∼12,000 genes for both expression differences and molecular alterations. We observed 4,513 and 1,182 genes demonstrating 2-fold or greater increase or decrease in expression relative to their normal, respectively. Network analysis of expression data identified the Aurora B signaling, FOXM1 transcription factor network and Wnt signaling pathways pairs being altered in HCC. We validated as differential gene expression findings in a large data set containing of 434 liver normal/tumor sample pairs. In addition to known driver mutations in TP53 and CTNNB1, our mutation analysis identified non-synonymous mutations in genes implicated in metabolic diseases, i.e. diabetes and obesity: IRS1, HMGCS1, ATP8B1, PRMT6 and CLU, suggesting a common molecular etiology for HCC of alternative pathogenic origin.
Worldwide, liver cancer is the fifth most common cancer and the third most common cause of cancer-related mortality (1). Over 75% of liver cancers are hepatocellular carcinomas (HCCs), which are adenocarcinomas that occur in the context of cirrhosis in 60% to 85% of cases (2). Although over 80% of HCC cases occur in developing countries in Asia and Africa, incidence is on the rise in developed countries as well (3-4). This increase has been equally attributed to a rise in HCV infection and to the worldwide increase in the number of people with diabetes and obesity (4).
Although the association of HCC with chronic liver disease is well-established, typically due to liver cirrhosis resulting from HBV/HCV infection and/or other liver disease, the exact developmental etiology of HCC remains undefined. In developed countries chronic liver disease is most commonly due to non-alcoholic fatty liver disease (NAFLD), which encompasses a spectrum of liver disorders, ranging histopathologically from the milder hepatic steatosis/isolated fatty liver to the more aggressive non-alcoholic steatohepatitis (NASH). It is NASH that may progress to liver cirrhosis, ultimately leading to further complications such as hepatic failure and HCC (3, 5-7).
Next-generation sequencing (NGS) of the complete RNA transcriptome (RNA-seq) offers a novel approach for systematically characterizing the underlying molecular etiology of HCC. RNA-seq not only permits for the accurate measurement of transcript expression levels, but also reliably identifies and assesses rare RNA species, alternative splicing, base substitutions, insertions and deletions (indels), allele-specific expression and RNA editing (8-10). In the present study, we report on a survey of the complete transcriptome landscape and the mutation profile of hepatocellular carcinoma using RNA-seq in three pairs of HCC tumor and tumor-adjacent tissue samples. Key findings observed in the present survey are examined in larger collections of tumor and normal samples in order to establish robust evidence of prevalence and validity.
Summary of mapping data generated by RNA-seq.
Materials and Methods
Tissue samples. All tissue samples in the present study were collected anonymously at the Asan Medical Center in South Korea during 1998-2001. The study design adhered to all NIH standards for human subject research and was approved by the NIH office of Human Subject Research. RNA was isolated from three HCC tumor (PHC9824, PHC9826 and PHC9828) and adjacent normal samples (PHC9823, PHC9825 and PHC9827). Two of the normal/tumor pairs (PHC9823/24 and PHC9827/28) were HCV-infected; the third pair (PHC9825/26) was HCV/HBV-negative.
RNA Sequencing Method. Total RNA was isolated from three HCC tumor and normal adjacent tissue pairs using the RNeasy kit (Qiagen) according to the manufacturer's instructions. Double-stranded cDNA was generated using random hexamer primers and reverse transcriptase. Sequencing adaptors were ligated using the Illumina Genomic DNA sample prep kit. Fragments (approximately 340-bp long) were isolated by gel electrophoresis and amplified by limited cycles of PCR. Large-scale deep sequencing was carried out on the Illumina Solexa Genome Analyzer, as described in the Illumina mRNA expression analysis protocol (http://www.illumina.com).
Mapping of RNA-seq reads to the genome. Mapping of RNA deep sequencing reads from three pairs of liver tumor and matched normal tissues was carried out using Burrows-Wheeler Aligner (BWA: http://bio-bwa.sourceforge.net/). Reads of 75mer were mapped to four reference databases: hg18 (NCBI build 36), refFlat, alternative-splicing exons and ESTs. Our alternative-splicing database consists of all sequential combinations of non-adjacent exons that are not found in the RefFlat database. EST exon sequences were extracted from the hg18 sequence using exon coordinates in the UCSC intronEst table.
Differentially expressed genes in RNA-seq. We have already reported on the normalized expression values as RPKM (reads per kilobase of exon model per million mapped reads) (8). For a given gene, reads were normalized against exon size and the number of reads sequenced to generate the RPKM value. Log2 ratios of RPKM values were calculated for each gene in paired tumor and adjacent normal samples.
To validate the differentially expressed genes identified from RNA-seq analysis, we downloaded a large HCC gene expression dataset from GEO repository with the accession number of GSE14520 GPL3921 (http://www.ncbi.nlm.nih.gov/geo) which contains 212 liver normal tissues and 222 liver tumor tissues (11). Profiling was performed on an Affymetrix HT HG-U133A platform and the data were normalized using Robust Multi-array Average (RMA) method and global median centering (11). We extracted the data for the genes overlapping with the ones shown on the differential expressed gene list. In the case of genes with multiple probe sets, the maximum gene expression was calculated. The t-statistics were used to compare the gene expression between liver normal and tumor tissues.
Single-nucleotide variant (mutation) calls. Somatic non-silent mutations (missense, non-sense, frameshift) were identified by our in-house software, Bambino (12) and by VarScan (version 2.2, Wash U) (13) from BAM files for each paired tumor and adjacent normal samples. Two criteria were applied: 1) Mutations from Bambino with a minimum of 4× coverage, greater than 10% alternative allele frequency in tumor samples and less than 1% alternative allele frequency in normal samples; and 2) mutations from Bambino which did not meet the previous criteria but showed a p-value <0.05 in VarScan. All somatic non-silent mutations were manually reviewed by examining the alignment in Bambino alignment view.
Mutation validation. Sanger sequencing was applied to PCR products generated using both cDNA and genomic DNA templates. We validated the candidate variants in cDNA and genomic DNA with primers designed by the primer3 program (14). Sanger sequencing was performed on the PCR products. Each chromatogram was base-called with Phred using the option designed to generate a polymorphism file, which details secondary and alternate base call information (15).
Prediction of AA substitution using logE and SIFT. LogE (16) and SIFT (17) are used to predict the AA substitution. A logE score whose absolute value is greater than or equal to 1 indicates that the amino acid alteration is likely to affect protein (16). SIFT value ≤0.05 is predicted to be deleterious (17).
Pathway analysis of differentially expressed and validated mutated genes. We analyzed the differentially expressed genes and mutated genes by identifying over-represented pathways and by constructing gene-gene networks. Differentially expressed genes were mapped to pathways in the Pathway Interaction Database (PID) (18) and the significance of an individual pathway's being affected was computed using R's hyper-geometric distribution function, adjusted for multiple hypothesis testing using the Benjamini-Hochberg false discovery rate (19). Differentially expressed genes were also assembled into novel networks with direct interactions obtained from PID.
Analysis of copy number variation using Affymetrix SNP6.0 assay. The Affymetrix SNP6.0 assay was performed according to the manufacturer's instructions (Affymetrix, Santa Clara, CA, USA). Assay runs were performed in 96 well plates containing three tumor and normal adjacent samples, four Asian HapMap samples (NA18954, NA18971, NA18603 and NA18995), the Affymetrix103 control DNA and a negative control (H2O). Data generated by SNP6.0 assays was analyzed with the Affymetrix Genotyping Console version 3.0 birdseed algorithm. Samples were analyzed for copy number variation using the Affymetrix Genotyping Console program with default parameters and the HapMap270 reference model. The resulting copy number log2ratio data served as input for the R DNAcopy package, which implements the circular binary segmentation (CBS) algorithm (20). We converted fractional CBS copy number values to discrete copy number states using the following thresholds: CBS copy number <=1.8 represents copy number loss; CBS copy number >=2.2 indicated copy number gain.
Results
The complete workflow process for mapping the RNA-seq data from our HCC tumor samples (PHC98024, PHC98026 and PHC98028) is described in Figure 1.
Identification of differentially expressed genes by RNA-Seq. For expression analysis, the number of reads mapped to exon regions of each gene was calculated and normalized using the RPKM method (8). Our data revealed ∼17,500 genes with at least one read and ∼12,000 genes with RPKM ≥1 (Table I). These findings are consistent with previous reports that 10-12,000 genes are expressed in normal/cancerous liver tissue (21). Our initial analysis used RPKM >1 for tumor or normal and fold change between tumor and normal of >2 in at least 2 sample pairs. Applying these thresholds, we identified 4,513 up-regulated and 1,182 down-regulated genes in tumors compared to matched normal samples (Table II).
Mapping Workflow. The above workflow diagram illustrates the sequence of steps involved in the alignment and mapping of reads to the human genome (hg18). Reads are trimmed and mapped to the RefSeq, EST, and hg18 databases. The best mapping for each read is selected to build the final BAM files in hg18 coordinates. UCSC: University of California Santa Cruz Genome Browser (http://genome.ucsc.edu); BWA: Burrows-Wheeler Alignment tool (http://bio-bwa.sourceforge.net/); SAMtools - utilities for manipulating alignments in the SAM/BAM format (http://samtools.sourceforge.net/); Bambino - an in-house single-nucleotide variation (snv) call program (https://cgwb.nci.nih.gov/); VarScan - a variant call program from Washington University (http://varscan.sourceforge.net/).
When the most stringent filter (RPKM ≥10 for tumor with RPKM ≤1 for normal and fold change ≥10) was applied to gene expression levels, a total of 58 genes were identified as up-regulated in at least two tumor-normal pairs (Table II and Figure 2). Out of these 58 genes, 26 have been previously identified as differentially expressed in HCC, while the remaining 32 have not been reported for HCC to date. Additionally, a total of 98 genes were down-regulated when similar strict criteria (RPKM ≤1 for tumor with RPKM ≥10 for normal and fold change ≥10) were applied (Table II and Figure 2). This sub-set of 58 up-regulated and 98 down-regulated genes were then subjected to additional analyses. To more robustly assess the prevalence of the differential expression of these 156 genes, we obtained a large HCC gene expression dataset from the GEO repository (http://www.ncbi.nlm.nih.gov/geo) (11), containing 434 liver tissues with 212 normal and 222 tumor tissues. Only 118 of 156 differentially expressed genes were assayed in this dataset. Over 90% of the 118 genes had the same expression pattern as in the original RNA-seq data and showed statistically significant differential expression between normal liver and tumor tissues (FDR<0.01).
Diagrammatic representation of expression levels, non-synonymous mutations, translocations, and copy number variations in the mRNA sequences of three HCC tissue samples. Chromosome ideograms are displayed along the outer ring (track 1); these are oriented pter-qter in a clockwise direction with centromeres indicated in blue. From outside to inside, the track just internal to the chromosome ideogram represents differential expression of genes at each locus (track 2). Orange bars indicate 58 up-regulated genes and blue bars indicate 98 down-regulated genes. Thirty non-synonymous mutations appear as red lines in the next track (track 3), copy number variation is indicated in the next track in dark green lines (track 4). The innermost track consists of light green lines depicting inter-chromosomal re-arrangements and blue lines depicting intra-chromosomal re-arrangements (track 5).
Identifying somatic mutations associated with HCC. The RNA-seq data revealed a total of 712,451 single-nucleotide variants (SNVs) present in tumor and/or normal tissues (Table III). We focused our subsequent analyses on the 149 non-synonymous somatic mutations. Among these, 124 have not been previously reported in HCC except for CTNNB1 and TP53. Sixty-three of the mutations exhibited ≥30% frequency reads for the alternative allele in one tumor but not in the paired normal tissue. Among these 63 mutations, 30 mutations from 29 genes could be validated by Sanger sequencing as shown in Table III and Figure 2, track 4, red lines. Only TP53 was mutated in two samples. To more robustly assess prevalence of mutations identified from this transcriptome-wide survey, we performed follow-up analysis of an additional 25 sets of paired HCC tumor samples and matched control tissues. This analysis revealed that approximately one-third of HCC samples were mutated in the TP53 gene. Two of these TP53-mutated HCC samples were also mutated in the CTNNB1 gene; one of these samples had 2 TP53 mutations. No other genes were found to have recurrent mutations in the extended sample set.
Network diagram of the interaction of a subset of the 58 up-regulated and 98 down-regulated genes. Up-regulated genes are highlighted in red and the down-regulated genes are highlighted in blue. Twenty-one up-regulated and six down-regulated genes form a network. Nineteen of these 27 genes form a major gene-gene interaction network. Out of these 19 genes, eight genes (CCNB2, CDCA8, DNM1, KIF2C, NKD1, RACGAP1, VCAN, and MST1) have been previously reported in HCC; to date, the other 13 genes have not been documented in HCC. Each gene is connected by a solid line to each of the other genes with which it interacts. Theses interacting genes are displayed as small solid squares.
Gene interaction network of a subset of the 29 mutated genes. Only the four mutated genes that exhibit multiple interactions (IRS1, TP53, CNNB1 and SMG5) are shown in the present figure. Mutated genes are highlighted in red. The genes with which they interact are shown as small solid squares, each of which is connected by a solid line to its interacting mutated gene. The IRS1 interacts with many other genes including BCAR1, IGF1, IGF1R, IL2RG, IL4, IL4R, INS, INSR, IRS2, JAK1, JAK3, PIK3CA, PIK3R1, SH2B2, and SOS1. SMG5 interacts with DKC1, HDAC8, HSP90AA1, PRKACA, TGES3 and TER.
To evaluate whether the 30 validated mutations are likely to alter protein function, we performed two different statistical analyses, SIFT and LogE (16-17). Both algorithms identified three genes in which the observed mutation was predicted to confer deleterious effects on protein function (Table IV). SIFT identified 13 potentially deleterious mutations that were not detected by LogE; LogE identified one potential functionally significant mutation not detected by SIFT (Table IV). The two TP53 mutations were predicted to be deleterious by both algorithms.
Pathway analysis of differentially expressed genes. Next, we performed pathway analysis using the up-regulated and down-regulated genes. For a pathway to be considered statistically significant, two or more of differentially expressed genes must“hit” that pathway, with an FDR adjusted p-value of <0.01. When the set of up- and down-regulated genes was analyzed, the Aurora B signaling, FOXM1 transcription factor network and Wnt signaling pathways were observed to be significantly associated with HCC (Table V).
Next we performed gene-gene network analysis to examine whether the 58 up-regulated genes and 98 down-regulated genes interact with each other and form a network. To perform this, we leveraged the gene-gene interaction data present in the NCI/Nature PID. As shown in Figure 3, 27 of these genes are highly connected hubs in a novel, complex network. Importantly, 19 of the 27 hub genes have not yet been reported in hepatocellular carcinoma.
Pathway analysis of mutated genes. We carried out further pathway analysis using the 29 genes with 30 validated mutations by applying the same analytical approach as the one used for the expression pathway analysis. Only six (CTNNB1, DOCK10, HMGCS1, IRS1, SMG5, TP53) of the 29 mutated genes are present in canonical pathways. These genes are distributed among 38 pathways in a manner such that no more than one gene “hit” any individual pathway.
In order to gain insight into whether these mutated genes interact to form a larger network, we used an approach that was similar to the pathway analysis for gene expression differences. We discovered that four of the mutated genes (TP53, IRS1, CTNNB1 and SMG5) form a complex network (Figure 4).
As the remaining mutated genes are not contained in any of the pathways in the PID, we performed analyses of these genes using Gene Ontology and found that many were associated with liver metabolism (Table IV).
DNA copy number variations and their association with mutation and expression. When DNA copy number variations (CNVs) were determined for the three paired tumor-matched normal sample sets using the Affymetrix SNP6.0 platform, we identified 18,000 non-overlapping autosomal segments for which at least one HCC tissue sample showed CNV relative to its paired normal tissue sample. Sixteen segments demonstrated copy number gain in all 3 tumors. These copy number gain segments were located in six broad chromosomal regions: 1q21-44, 2q31, 3p24, 8q24, 17q11-12 and 20p12. Sixty-nine segments showed copy number loss in all three tumors relative to their matched normal tissues. The largest of these recurrent regions of loss are located in 17p12-13, 19p13 and 10q21-26 (Figure 5).
Genome alterations observed in three HCC samples. Copy number alterations are indicated by shading: red and pink represent copy number gain (red greater than pink) relative to the normal adjacent sample; dark blue to light blue represent copy number loss (dark blue greater than light blue) relative to the matched normal adjacent sample; grey represents no copy number alteration. Yellow diamonds mark the locations of the subset of frequently overexpressed genes that are up-regulated in a particular sample. Black vertical hashes mark the locations of validated non-synonymous somatic mutations identified in a sample. Chromosome numbers are listed on the left side and sample identification numbers are listed at the top of each figure.
Summary statistics of somatic non-synonymous mutation discovery.
We then mapped the 30 validated mutations onto CNV segments. Sixteen mutations lie in regions of copy number loss; 14 mutations lie in regions that do not show copy number alterations relative to adjacent normal tissue. Two genes, NDS1 and ATP8B1 genes are altered in all 3 tumor samples with mutation in sample PHC98028 and copy number reduction in samples PHC98024 and PHC98026.
Mutations and predicted effects of amino acid changes.
Pathway analysis of 58 up-regulated and 98 down-regulated genes.
In similar fashion, we investigated possible correlations of gene expression with copy number alterations. We observed that of the 58 up-regulated genes, only PEG10, RNF43, AXIN2, MAP2K6, BIRC5 and APCDD1 are located at positions exhibiting recurrent patterns of copy number gain. Mapping of the 98 down-regulated genes onto CNV regions revealed that 42 genes (43%) map to regions with frequent copy number loss. In contrast, six down-regulated genes are located in genomic regions showing recurrent copy number gain.
Discussion
Considerable heterogeneity exists in the somatic genetic etiology of HCC. Although the origin of this heterogeneity is still unknown, molecular studies of HCC suggest a connection not only with HBV/HCV infection, but also with abnormalities of metabolic balance, such as diabetes and obesity (22-23). To address this issue, we surveyed three paired HCC and adjacent normal tissue samples using RNA-seq. We identified differentially expressed genes as well as somatic mutations. We observed 58 up-regulated and 98 down-regulated genes in tumor samples when compared to adjacent normal liver tissue (Table II). Aurora B signaling, FOXM1 transcription factor network and Wnt signaling pathways are significantly enriched in the differentially expressed genes of HCC. Only Wnt the signaling pathway has been previously implicated in the development of HCC. Thus, all pathways, which play a key role in cell proliferation, cytokinesis and DNA repair, may be involved in the development and progression of HCC (11, 24-25).
Gene-gene network interaction analysis of the 156 differentially expressed genes revealed that 27 genes form a large complex network (Figure 3). This analysis allowed us to investigate new networks that have not previously been associated with HCC. The following up-regulated genes were identified by this analysis: KIF2C and CDCA8, key components of the Aurora B signaling pathway (24, 26); NKD1, a major component of the Wnt signaling pathway (27); CCNB2, a key member of the FOXM1 pathway (28); RACGAP1, a key member of the cytokinesis pathway (29); and DNM1, Dynamin, a large GTPase involved in clathrin-mediated endocytosis and other vesicular trafficking processes (30). As discussed above, only genes belonging to the Wnt signaling pathway have previously been documented in relation to HCC carcinogenesis. With regard to the down-regulated genes, HMGCS2 (3-hydroxy-3-methylglutaryl-coenzyme A synthase) and PLG (plasminogen) are newly-identified in HCC and have functions that are of interest from the perspective of carcinogenesis. HMGCS2 expression is associated with differentiation and its down-regulation has been demonstrated in moderately- and poorly-differentiated colorectal adenocarcinomas (31). The protein encoded by the PLG gene, plasminogen, is a secreted blood zymogen and is converted to plasmin and angiostatin when activated by proteolysis. Angiostatin, in turn, is a potent inhibitor of angiogenesis, tumor growth and metastasis; consistent with the notion that angiogenesis contributes to carcinogenesis in tumors characterized by down-regulated PLG.
HCC typically develops in the setting of liver cirrhosis, which is associated with chronic liver disease (1). The current dominant view is that non-viral associated HCCs evolve in tissue that has been subjected to physiological insults. For example, disruption of normal metabolic balance results in an increase in the release of free fatty acids (FFA) from adipocytes, release of multiple pro-inflammatory cytokines including tumor necrosis factor-alpha (TNF), interleukin-6 (IL6), leptin, and resistin (7). Together, the elevated levels of such factors contribute to abnormal hepatic conditions, including NAFLD and NASH, which may progress to cirrhosis and ultimately to HCC (7). Among the 30 mutations identified in our tumor samples six are located in genes that are implicated in metabolic processes, five of which are involved in diseases such as diabetes and obesity: IRS1, HMGCS1, ATP8B1, PRMT6 and CLU (Table IV).
IRS1 is known to be involved in diabetes, obesity and activation of cytokine signaling pathways (32-33). In addition, overexpression of IRS1 has been documented in over 90% of HCC cases (34). Yet, mutations in IRS1 have not previously been reported in HCC. The identification of a novel non-synonymous mutation in codon 793 (threonine to alanine) potentially alters the activity of the IRS1 protein. The position of this mutation supports such a functional effect since the mutation is adjacent to Ser794, which has previously been shown to play a critical role in insulin signal transduction (35-36). Changes in the phosphorylation state of Ser794 are associated with a variety of insulin-mediated activities (35, 37). Specifically, phosphorylation of this amino acid alters the efficiency of insulin signal transduction, eventually causing insulin resistance in diabetic animals (37). The multiple roles played by phosphorylated Ser794 are potentially influenced by the mutation that we observed in the neighboring Thre793. Thus, this discovery of an IRS1 mutation in HCC suggests that dysregulated IRS1 may play a role in the etiology of liver cancer.
Interaction analysis of canonical pathways contained in the PID (Figure 4) reinforces the potential importance of the role played by IRS1 in HCC tumorigenesis. Thus, network analysis of all possible interactions among the 29 mutated genes in our study revealed that the network comprises four hub genes: IRS1, TP53, CTNNB1 and SMG5. All four of these network-interacting mutated genes individually affect processes known to be dysregulated in cancer. Apart from TP53, the most commonly mutated gene in cancer, CTNNB1, encodeing beta-catenin, is frequently mutated in specific cancers, including colon cancer and HCC (38-39). The fourth gene in the network hub is SMG5, which is involved in protection of telomere integrity, and thus maintenance of chromosomal stability (40). Furthermore, degradation of SMG5 results in a major reduction of total telomerase activity (41).
TP53 mutation has a well-established association with many cancers, including HCC. HCV-infected cells exhibit a tendency toward increased mutation rate, leading to a 5-10-fold increase in mutation frequency of genes such as the immunoglobulin heavy chain gene, BCL-6, TP53, and the CTNNB1 gene (39). Our own initial observations concur, given that two of our three original liver cancer samples infected with HCV are also the same two that exhibit mutations in the TP53 gene. In addition, 50% of the additional 25 HCC tissues which we analyzed exhibited TP53 mutations; these TP53-mutated samples were also positive for HBV infection, unlike the other 50% HCC samples, which were TP53-mutation negative. Furthermore, out of the three HCC samples (Figure 5), the two samples with HCV infection and TP53 mutation (PHC98024 and PHC98028) exhibited the highest levels of copy number variations. This observation is consistent with the known role of the TP53 protein in maintaining genomic stability via the DNA repair mechanisms (42). In contrast to the common epithelial cancers such as breast, prostate, and colon, HCC is not characterized by a high rate of mutations in conventional oncogenes and tumor suppressor genes, other than TP53 (43). The results of our study are consistent with a model in which HCC development is associated with metabolic changes due to HCV/HBV infection, diabetes, obesity or metabolic syndrome. Insulin resistance and the subsequent inflammatory cascade that characterize NAFLD, NASH, and cirrhosis appear to play an important role in carcinogenesis in the liver. Thus, the entire constellation of these metabolic changes culminates in a substantially increased risk of HCC development.
Acknowledgments
The Authors would like to acknowledge and thank Dr. Katherine McGlynn for insightful discussions and Dr. Gang Wu for technical assistance.
- Received February 6, 2014.
- Revision received February 21, 2014.
- Accepted February 24, 2014.
- Copyright© 2014 International Institute of Anticancer Research (Dr. John G. Delinassios), All rights reserved