Abstract
Background/Aim: Lung cancer remains the leading cause of cancer-related mortality worldwide. Transcript fusions play a critical role in the initiation and progression of multiple cancers. Treatment approaches based on specific targeting of known driver events, such as mutations in EGFR and fusions in the NTRK, ROS1, and ALK genes, have led to profound improvements in clinical outcomes. The formation of chimeric proteins due to genomic rearrangements or at the post-transcriptional level is widespread and plays a critical role in tumor initiation and progression. Yet, the fusion landscape of lung cancer remains underexplored. Materials and Methods: RNA sequencing was used to discover and characterize transcript fusions in 270 early-stage non-small cell lung cancer (NSCLC) samples selected from the Glans-Look specimen repository. The samples were obtained during the early stages of disease prior to the initiation of chemo- or radiotherapy. We used the JAFFA pipeline to discover transcript fusions. The set of detected fusions was further analyzed to identify recurrent events, genes with multiple partners, and fusions with high predicted oncogenic potential. Finally, we used a generalized linear model (GLM) to establish statistical associations between fusion occurrences and clinicopathological variables. Results: We identified a set of 782 fusions, of which 751 were novel and 33 were recurrent. Four of the 33 recurrent fusions were significantly associated with clinicopathological variables. Several of the fusion partners were well-established oncogenes, including ERBB4, BRAF, FGFR2, and MET. Conclusion: The data presented in this study allow researchers to identify, select, and validate promising candidates for targeted clinical interventions.
In the United States (USA), the lifetime risk of developing lung cancer (LC) stands at 6.7% for men and 6.0% for women. Despite the plateauing incidence trend, lung and bronchus cancer remains the leading cause of cancer-associated mortality, with 135,720 estimated deaths in 2020 in the USA alone, exceeding the mortality of prostate, breast, brain, and colorectal cancers combined (1). Non-small cell lung cancer (NSCLC) comprises 85% of all LC cases and is further sub-typed into 3 major classes: adenocarcinoma (AC), squamous cell carcinoma (SCC), and large cell carcinoma (LCC). Currently, AC is the most prevalent type, comprising over 64% of NSCLC cases; it originates from the more distal airways and peripheral lung, whereas SCC, making up 25-30% of cases, arises from cells in the more central airway epithelium. The remaining cases belong to poorly differentiated and aggressive tumors assigned to the LCC type (2).
Clinical outcomes of NSCLC remain poor; for example, the 5-year relative survival of patients with advanced disease stands at 6.1%. Survival is significantly better in cases of localized NSCLC, reaching 61%, which indicates the importance of early detection, relapse prognosis, and recurrence prevention (3). The aggressive nature of NSCLC and the poor outcomes of non-targeted radio- and chemotherapies highlight the need for further research to uncover personalized targeted approaches that rely on knowledge of the molecular alterations behind the initiation and progression of NSCLC. Several driver mutations, such as those in EGFR, and translocations, such as those involving ALK, ROS1, and NTRK, have been identified and successfully targeted using small-molecule inhibitors (4, 5). Although not curative, this approach has led to profound improvements in outcomes for the minority of patients whose tumors harbor these changes. Identifying new translocations that can be successfully targeted clinically remains a pivotal strategy in improving outcomes for lung cancer patients (6).
Genome-wide studies of NSCLC have revealed a complex mutational landscape involving point mutations, insertions, deletions, gene amplifications, and larger structural rearrangements (7-9). The formation of chimeric proteins is a common mechanism of oncogene activation across epithelial cancers (10).
A fusion transcript is a hybrid transcript formed by exons that belong to different genes. Mechanistically, fusions may form both as a result of genomic rearrangements and at the post-transcriptional level via trans-splicing (11).
Fusion-mediated oncogenic activation is an established mechanism behind the initiation and progression of many NSCLC cases, with a handful of tyrosine kinases (TK) identified as culprits. Initially, the transforming EML4:ALK fusion was detected in 6.7% of studied cases (12); later, RET and ROS1 fusions were added to the list of clinically relevant alterations (13, 14). More recent efforts in the analysis of lung cancer genomes and transcriptomes identified targetable fusions in NTRK1/2/3 (15), NRG1 (16), and FGFR1/2/3 (17). Wider application of high-throughput methods revealed that fusion transcripts in epithelial cancer are common and are not limited to tyrosine kinases but involve a wide array of genes including transcription factors, metabolic genes, signal transducers, and chromatin modifiers (10).
For the most part, the detection of fusion transcripts in lung tumors relies on single-gene tests and targeted multiplexed panels (18). Both approaches require pre-existing knowledge of markers that need to be tested. The discovery of new targets requires unbiased genome-wide studies based on whole-genome or transcriptome sequencing. The majority of reports on the large-scale fusion detection that involves lung cancer is based on re-analysis of TCGA samples (19-21). Only a few studies were done using dedicated NSCLC fusion collections (22-24). The majority of fusions detected in these studies were not reported previously, showing the potential for novel findings.
Herein, we used RNA sequencing to conduct genome-wide discovery of transcript fusion events in a panel of 270 archived NSCLC tumor specimens. Our study focused on the characterization of transcript fusions in early-stage NSCLC prior to the introduction of radio- or chemotherapy. We also show that RNA sequencing can be used to detect transcript fusions in formalin-fixed, paraffin-embedded (FFPE) samples, despite the challenges with RNA quality. Furthermore, we analyzed recurrent fusions and promiscuous fusion partners, and predicted the oncogenic potential of fusions using computational methods. Finally, we characterized the associations between recurrent fusions, recurrent partner genes, and key clinicopathological covariates. We believe that this study may serve as a valuable resource for the selection of candidates with oncogenic and precision medicine potential.
Materials and Methods
Patient inclusion criteria and tumor samples. This was a retrospective study aimed at the characterization of transcript fusions in early-stage non-small cell lung cancer (NSCLC). The study was approved by the Health Research Ethics Board of Alberta. All patients were diagnosed with NSCLC at the Tom Baker Cancer Centre in Calgary, Canada, between 2003 and 2010 (25). Clinical data were obtained retrospectively by chart review and abstracted into the Glans-Look Lung Cancer Database (26). The Glans-Look Lung Cancer Research (GLR) database is an institutional database that captures patient-level demographic, clinical, treatment, response, and outcome data via chart reviews of electronic medical records for all patients with a diagnosis of lung cancer who present for diagnosis and treatment within the Canadian province of Alberta. Study data within the GLR database are collected and managed using the Research Electronic Data Capture (REDCap) tools hosted at the University of Calgary (27, 28).
The tumor samples were surgically collected, taking utmost care to avoid non-cancerous tissue. The samples were gathered over a period of 10 years, fixed in formalin, and preserved in paraffin blocks as a part of the Glans-Look tumor repository. The series represents NSCLC at the early (I, II and III) stages of oncogenesis prior to the start of radio- and chemotherapy.
RNA isolation and sequencing library construction. Formalin-fixed paraffin-embedded (FFPE) NSCLC samples were excised from paraffin blocks and subjected to an RNA extraction process. RNA was isolated using the FFPE isolation protocol following the manufacturer’s instructions. Briefly, the RNA samples were treated with DNAse I to remove traces of genomic DNA contamination and quality tested using an Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). The sequencing libraries were constructed from 100 ng of RNA using the NEBNext Ultra II directional RNA library reagents kit with a ribosomal depletion step (New England Biolabs, Ipswich, MA, USA). The library preparation was performed following the manufacturer’s instructions, omitting the fragmentation step as recommended in the manual for samples with a low RNA integrity number (RIN). The samples were multiplexed with NEBNext Multiplex oligonucleotides (New England Biolabs). The fragment libraries were sequenced using a NextSeq500 (Illumina, San Diego, CA, USA) with a single-end, 75-cycle protocol. To minimize potential batch effects, all library preparation and sequencing steps were conducted by the same personnel using the same set of equipment.
Quality control and fusion detection. Base-calling and demultiplexing steps were done with the bcl2fastq v2.17.1.14 script from the Illumina CASAVA 1.9 pipeline (Illumina). Various quality parameters of the sample sequencing libraries were assessed using FastQC v.0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Base qualities were excellent across all samples, and no adapter contamination was noted. The reads were mapped to the human genome (Ensembl GRCh37) using HISAT2 v.2.2.1 (29), and raw read counts were obtained with featureCounts v.2.0.1 software (30). The reference genome was downloaded from the Illumina iGenomes website (31). This step was conducted as a part of quality control to estimate alignment rates and gene expression.
Fusion detection was conducted using the JAFFA v. 1.09 pipeline in ‘hybrid’ mode, which combines transcript assembly with analysis of the reads that do not map to either the reference transcriptome or the assembly. The JAFFA developers recommend the ‘hybrid’ mode for single-end reads in the range of 60-99 bp (32). JAFFA was used with the default GENCODE hg38 (UCSC version) transcript reference. The workings of the JAFFA pipeline are described in detail in the original publication and the software documentation (https://github.com/Oshlack/JAFFA). Briefly, JAFFA pre-filters the reads to remove those mapping to intergenic, intronic, and mitochondrial regions. The remaining reads are assembled into contigs with Oases (33). The resulting contigs and reads are mapped to reference transcripts; those spanning multiple genes are selected as candidate fusions. Next, the pipeline counts the reads spanning breakpoints and aligns the contigs to the genome to determine breakpoint coordinates. The fusion candidates are then characterized and ranked based on confidence.
Fusions with genomic gap below 200 bp and no evidence of genomic re-arrangement are flagged as “potential regular transcripts” and excluded from default reporting. Fusions with breakpoints spanning exact exon-exon boundaries with overlapping reads and read-pairs are classified as “high confidence”. The fusion receives “medium confidence” if only reads and no read pairs are present at the break-point. Since the present sequencing was single-end, the highest achievable rank was “medium confidence”. Fusions with no breakpoints aligning to exon-exon boundaries were classified as “low confidence” and excluded from the analysis.
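The ranking rules summarized above can be expressed as a small decision function. The following Python sketch is illustrative only and is not JAFFA's own implementation: the `Candidate` fields and the function name are hypothetical, and the handling of a candidate with no breakpoint-spanning reads is an assumption, since that case is not covered by the description above.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REGULAR_GAP_BP = 200  # genomic gap threshold as stated in the text


@dataclass
class Candidate:
    """Evidence for one fusion candidate (hypothetical field names)."""
    spanning_reads: int            # reads covering the breakpoint
    spanning_pairs: int            # read pairs straddling the breakpoint (0 for single-end data)
    exon_boundary: bool            # breakpoint falls on exact exon-exon boundaries
    genomic_gap_bp: Optional[int]  # gap between partners; None if on different chromosomes
    rearrangement: bool            # evidence of a genomic rearrangement


def classify(c: Candidate) -> str:
    """Assign a confidence label following the rules described in the text."""
    small_gap = c.genomic_gap_bp is not None and c.genomic_gap_bp < MAX_REGULAR_GAP_BP
    if small_gap and not c.rearrangement:
        return "potential regular transcript"
    if not c.exon_boundary:
        return "low confidence"
    if c.spanning_reads > 0 and c.spanning_pairs > 0:
        return "high confidence"
    if c.spanning_reads > 0:
        return "medium confidence"
    # Assumption: no spanning reads at all is treated as low confidence.
    return "low confidence"
```

With single-end data `spanning_pairs` is always 0, so no candidate can rank above “medium confidence”, which matches the ceiling noted above.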
Filtering of fusion candidates. To reduce the number of potential false-positives, we applied a set of filtering criteria to the dataset. We retained all of the fusions ranked as “medium confidence” with at least 2 reads spanning the breakpoint. Fusions involving adjacent (less than 50 kb gap) and overlapping genes were excluded as potential read-through events. We also removed fusions involving pseudogenes or immunoglobulins, and fusions formed between paralogues, due to potential misalignments. Pseudogene and paralog data were obtained using biomaRt v2.48.1 (34). Finally, we excluded fusions that were observed in non-cancerous cells. The list of non-cancer fusions was obtained from the FusionHub (20) and ChimerDB 4.0 (19) databases.
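The filtering criteria listed above amount to a sequence of exclusion tests. A minimal Python sketch follows; it is not the code used in this study, and the record layout, function name, and gene identifiers in the usage example are hypothetical placeholders. The annotation sets (pseudogenes and immunoglobulins, paralogue pairs, fusions seen in non-cancerous cells) are assumed to have been prepared beforehand from the sources named in the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SPANNING_READS = 2      # at least 2 reads across the breakpoint
MIN_GENE_GAP_BP = 50_000    # partners closer than 50 kb are treated as read-through


@dataclass(frozen=True)
class Fusion:
    """One fusion call (hypothetical record layout)."""
    gene5: str
    gene3: str
    confidence: str
    spanning_reads: int
    gap_bp: Optional[int]   # None if partners are on different chromosomes; 0 if overlapping


def keep(f, excluded_genes, paralog_pairs, normal_fusions):
    """Return True if the fusion passes every filter described in the text."""
    if f.confidence != "medium confidence":
        return False
    if f.spanning_reads < MIN_SPANNING_READS:
        return False
    if f.gap_bp is not None and f.gap_bp < MIN_GENE_GAP_BP:
        return False  # adjacent or overlapping genes
    if f.gene5 in excluded_genes or f.gene3 in excluded_genes:
        return False  # pseudogenes and immunoglobulins
    if frozenset((f.gene5, f.gene3)) in paralog_pairs:
        return False  # both partners are paralogues of each other
    if (f.gene5, f.gene3) in normal_fusions:
        return False  # reported in non-cancerous cells
    return True


# Usage with placeholder gene names:
excluded = {"PSEUDO1"}
paralogs = {frozenset(("PARA1", "PARA2"))}
normal = {("NORM5", "NORM3")}
calls = [
    Fusion("GENE_A", "GENE_B", "medium confidence", 3, None),
    Fusion("GENE_C", "GENE_D", "medium confidence", 3, 10_000),
]
retained = [f for f in calls if keep(f, excluded, paralogs, normal)]
```

Storing paralogue pairs as unordered `frozenset` objects makes the test independent of which gene is the 5′ partner, whereas the non-cancer list is kept ordered here; whether orientation should matter for that list is a design choice this sketch does not settle.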
Prediction of oncodriver potential. The oncogenic potential of discovered fusions was predicted using Oncofuse v.1.1.1 with the epithelial training dataset and the hg38 genome as a reference (21). The rest of the software options were kept at default. Oncofuse is a naive Bayesian classifier that predicts the oncogenic potential of experimental fusions based on genomic hallmarks of known fusions. The oncodriver score generated by the software can be interpreted as a probability and is further referred to as “driver probability”. The bioinformatics workflow involved in the detection, filtering, and prediction of transcript fusions is shown in Figure 1.
Statistical analysis. All of the statistical analyses were conducted in R v4.1.0. The associations between clinicopathologic covariates and recurrent fusions or partner genes were tested using a multivariate logistic regression model implemented in the glm() function with the “family” option set to “binomial”. The model formula was constructed as follows: Fusion/Gene ~ Sex + Smoking + Histology + Stage + Age + Recurrence. Associations with p-values below 0.05 were considered significant. The associations between sample fusion burden and clinicopathologic variables were tested in the same way, except that the model family was “gaussian”.
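For readers less familiar with the R formula notation, the binomial model above can be written out explicitly. The notation below is ours rather than taken from the analysis code: p_i denotes the probability that sample i carries a given fusion (or partner gene), and each x term stands for the corresponding covariate. Covariates with more than two levels, such as stage and histology, would expand into several indicator terms, which are collapsed into a single term here for brevity.

```latex
\[
\log\frac{p_i}{1-p_i}
  = \beta_0
  + \beta_{\text{sex}}\,x_{i,\text{sex}}
  + \beta_{\text{smoking}}\,x_{i,\text{smoking}}
  + \beta_{\text{histology}}\,x_{i,\text{histology}}
  + \beta_{\text{stage}}\,x_{i,\text{stage}}
  + \beta_{\text{age}}\,x_{i,\text{age}}
  + \beta_{\text{recurrence}}\,x_{i,\text{recurrence}}
\]
```

For the fusion-burden analysis with the gaussian family, the left-hand side is replaced by the expected number of fusions in sample i, with the same linear predictor on the right.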
The associations with survival were analyzed with a Cox proportional hazards model implemented as the coxph() function from the survival v3.2-11 R package. The model incorporated the same clinical and demographic variables as the GLM analysis. In addition, Kaplan-Meier curves were drawn for fusions and gene partners with Cox model p-values below 0.05, and the survival of fusion-positive and fusion-negative cases was compared with the log-rank test. This analysis was conducted using the survminer v0.4.9 package.
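The survival curves mentioned above rest on the product-limit estimator, in which the survival probability is multiplied by (1 - d/n) at every time point with d observed deaths among n patients still at risk. The Python sketch below illustrates that calculation on made-up follow-up times; it is a didactic stand-in, not the R survival/survminer code used in the analysis, and the function name and toy data are ours.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  -- follow-up duration for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each time with a death.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)   # deaths and censorings both leave the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy data (months): deaths at 2, 3 and 5; censored at 3 and 8.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

In this toy example the estimate drops to 4/5 at month 2, then to 4/5 x 3/4 at month 3, and to 4/5 x 3/4 x 1/2 at month 5; the patient censored at month 3 is counted as at risk at that time and removed afterwards, which is the usual convention.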
Results
Clinical and demographic characteristics. The lung tumor samples were selected from the Glans-Look archive of clinical specimens. The non-small cell lung cancer (NSCLC) samples used in the study were surgically resected at early stages (I-III) of the disease and prior to the start of chemo- or radiotherapy. The sample selection process was designed to generate a data set with a balanced representation of clinical and demographic covariates. The demographic characteristics considered in the analysis were age at diagnosis, sex, smoking history, and survival from the date of diagnosis until the event of death. The “age” variable was divided into 2 intervals: under and over 65 years. Clinical variables were the stage of tumor development (I, II, III) and the histological type, either squamous or non-squamous. Stage and histological type were determined using histopathological methods by qualified pathologists. The summary statistics for demographic and clinical variables are available in Table I.
Summary of sequencing. We used high-throughput RNA sequencing to gain insight into the transcript fusion landscape of early-stage NSCLC. In total, we sequenced 270 FFPE tumor specimens that represented individual patients. Initial quality control showed that the base qualities were good across all samples, and minor adapter contamination was successfully removed. The sizes of the sequencing libraries ranged from 7 to 123 million reads, with a median of 33 million. The reads were mapped to the GRCh37 (Ensembl) human genome assembly, resulting in alignment rates in the range of 24-95% with a median of 88%. The sequencing detected 25,533 expressed features with at least 10 reads in 5 samples. Protein-coding genes made up the majority (64.66%) of detected features, 17.2% were long non-coding RNAs, 13.62% represented pseudogenes and other RNAs, and the biotype of the remaining 4.52% was not assigned (Supplementary Figure 1).
Overview of transcript fusion landscape in NSCLC. The results of the initial JAFFA analysis were filtered to exclude fusions with fewer than 2 reads mapping to the breakpoint, fusions between mitochondrial and autosomal genes, those involving pseudogenes, fusions originating from adjacent or overlapping genes, and those where both partner genes were paralogues. Following filtering, the fusion detection resulted in 896 transcript fusion events across 270 NSCLC tumor samples. This data set was reduced to 782 unique fusions after removing redundant events with the same fusion partners. The number of fusions per sample ranged from 0 to 21, with mean and median values of 3.32 and 3.0, respectively. The full list of fusions retained after filtering is available in the supplementary material (Supplementary Table I).
A fraction of transcript fusions equal to 4.22%, or 33 out of 782, occurred in more than one sample and were deemed recurrent. The most highly recurrent fusion, LMO7:EXT2, was found in 30 samples (11.1%); other highly recurrent fusions were CD27-AS1:MANBAL (4.4%) and TTLL12:RAB17 (4.4%). The top 17 fusions that occurred in over 3 samples are shown in Figure 2. The full table with fusion occurrence data can be found in Supplementary Table II.
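The recurrence definition used here, a fusion observed in at least 2 distinct samples, can be illustrated with a short Python sketch. The function name and the sample identifiers are hypothetical, and the toy calls below are not taken from the study data; only the totals in the final line (33 recurrent out of 782 unique fusions) come from the text.

```python
from collections import defaultdict


def recurrent_fusions(calls, min_samples=2):
    """Count distinct samples per fusion and keep those seen in >= min_samples.

    calls -- iterable of (sample_id, 5' gene, 3' gene) tuples
    Returns {"GENE5:GENE3": number of distinct samples}.
    """
    samples = defaultdict(set)
    for sample_id, gene5, gene3 in calls:
        samples[f"{gene5}:{gene3}"].add(sample_id)
    return {fusion: len(s) for fusion, s in samples.items() if len(s) >= min_samples}


# Toy calls: the duplicate call in S2 is counted once, because the unit is the sample.
toy_calls = [
    ("S1", "LMO7", "EXT2"),
    ("S2", "LMO7", "EXT2"),
    ("S2", "LMO7", "EXT2"),
    ("S3", "CD74", "ROS1"),
]
recurrent = recurrent_fusions(toy_calls)

# Share of recurrent fusions reported in the text: 33 of 782 unique fusions.
recurrent_pct = round(100 * 33 / 782, 2)
```

Collecting sample identifiers in a set, rather than counting calls, keeps several events of the same fusion within one tumor from being mistaken for recurrence across patients.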
The majority of fusions, 480 (61.4%), originated from inter-chromosomal rearrangements, while in 302 (38.6%) cases both partners were located on the same chromosome. Fewer than half of these fusions (37.6%) were in-frame and expected to produce viable protein, while 62.4% were predicted to generate out-of-frame transcripts (Supplementary Figure 2).
Analysis of the distribution of fusions across chromosomes showed significantly higher fusion frequencies in chromosome 19. This increase was observed both for inter- and intra-chromosomal fusions (Supplementary Figure 3A and B). This data suggests that chromosome 19 serves as a genomic instability hotspot concerning both translocations and deletions resulting in inter- and intra-chromosomal chimeric genes.
BiomaRt Bioconductor package was used to annotate fusion partners with biotype: protein-coding, long non-coding RNA (lncRNA), microRNA, constant and variable T-cell receptor genes (TR C and TR V), and unknown. The absolute majority or 83.1% of all fusions contained protein-coding genes as both partners, a much smaller fraction (10%) had protein-coding partner fused with a gene with “unknown” biotype, and 4% consisted of protein-coding gene joined to lncRNA, other categories were less numerous (Figure 3).
Fusions detected in this study were searched against ChimerDB v.4.0, a large fusion database that incorporates 67,610 fusions sourced from the prediction based on deep sequencing, text mining of PubMed articles, and derived from other databases (19). Only 31 fusions found in our study were also present in ChimerDB 4.0, while 751 or 96% were absent and considered novel. Known fusions were observed across 15 cancer types, with the exception of glioblastoma multiforme and melanoma, all were carcinomas. Four fusions (ATP2A2:IFT81, CD74:ROS1, HIF1A:PRKCH, MIRLET7BHG:ATXN10) were previously observed in lung adenocarcinomas and 2 (ARMC7:HN1, ATP11A:CUL4A) in lung squamous cell carcinomas. Genes classified as oncogenes in the ChimerDB database were observed in 6 out of 31 known fusions and include ROS1, HIF1A, PMS1, MET, MAP3K1, XRCC2, and KMT2C (Table II). The prevalence of novel fusions (751 or 96%) indicates high diversity and widespread presence of these aberrations combined with the relative lack of relevant data mining efforts.
Recurrent transcript fusions in NSCLC. Research suggests that gene fusions observed in multiple cases either within or across multiple cancer types are functional regarding the induction or progression of the malignant process. To identify such candidates, we analyzed fusions that occurred in at least 2 tumor samples in our dataset. In total, we found that 33 of 782 (4.22%) fusions were recurrent. Four recurrent fusions were observed in at least 10 samples (Figure 4). Out of 33 recurrent fusions 19 or 57.6% were out-of-frame, and 14 were in-frame (Supplementary Table III). Certain fusions had unusually high frequency, for example, LMO7:EXT2 had the highest recurrence and was found in 30 (11.11%) samples, CD27-AS1:MANBAL was detected in 12 (4.44%) samples, while GABPB1:LMCD1 and TTLL12:RAB17 each appeared in 10 (3.7%) samples (Supplementary Table III). These fusions were absent from ChimerDB and the existing literature. It is unclear whether these are false-positives originating from misalignments or library preparation artifacts, however, all of the fusions included in the analysis had breakpoints at the exact exon-exon boundaries, which we find unlikely in the case of artifacts. It is possible that highly recurring fusions originate from trans-splicing events observed in normal non-cancerous tissue (35).
Only 2 of 33 recurrent fusions from our data set were also found in ChimerDB: ATP2A2:IFT81 and NOC4L:FBRSL1. Both occurred in 2 patients and were included in ChimerSeq-Plus subset of highly reliable fusions. NOC4L:FBRSL1 was associated with ovarian serous cystadenocarcinoma (OV), and also detected in B-cell acute lymphoblastic leukemia (36). In the ChimerSeq data set, ATP2A2:IFT81 was found in lung adenocarcinoma, uterine corpus endometrial, and liver hepatocellular carcinomas. The presence of this fusion in lung adenocarcinoma was also evidenced in a recent survey of gene fusions across multiple cancer types (37).
The majority of recurrent fusions in our data set were novel. In order to gain further insight into their potential role, we cross-referenced gene-fusion partners with Cancer Gene Census (CGC) and the list of oncodrivers obtained from the literature (38). Also, literature searches were conducted to gather additional evidence and insight into the role of fusion partners in lung cancer and in malignant processes in general (Supplementary Table IV). Two genes, EXT2 in LMO7:EXT2 fusion and ZNF669:ZNF136 were previously characterized as TSGs according to CGC. Twenty-five of the recurrent fusions had gene partners with clear involvement in a malignant process described in the literature. Only 6 fusions had partners with no known involvement in cancer.
Recurrence of fusion partners and gene promiscuity. Genes participating in fusions are frequently promiscuous in terms of partners and form fusions across a range of cancer histologies (39). Such genes play important roles in cancer initiation and progression as either oncogenes or tumor suppressor genes (TSGs). Their disruption by genomic rearrangement results in the formation of a chimeric protein that benefits cancerous cells and survives selective pressure regardless of the second partner. Participation of a certain gene in multiple fusions indicates its importance in the malignant process and offers potential for clinical intervention. To identify promiscuous fusion genes in our data set, we analyzed the recurrence of individual fusion partners and cross-referenced them with the lists of known oncogenes and TSGs, while focusing on fusion genes with multiple partners. In total, we identified 1,363 unique genes as fusion participants, with 229 observed in more than 1 sample (recurrent). Matching the list of fusion partners to the collection of known cancer-related genes revealed 51 oncogenes and 48 TSGs. Of these, 23 had dual roles and could serve as both drivers and suppressors. Among recurrent fusion partners, 13 were oncogenes while 7 played the role of tumor suppressors (Supplementary Table V). The highest degree of recurrence (31/270 samples) was observed for the EXT2 tumor suppressor gene (40), which, with one exception, had LMO7 as its exclusive partner. Other known cancer genes had much lower recurrence; TBL1XR1 and ZBTB20 were observed in 3 patients, each forming chimeric transcripts with different partners. TBL1XR1 is known for both oncogenic and tumor suppressor roles. In NSCLC, TBL1XR1 mediates cell proliferation, survival, and metastasis by regulating the MEK and AKT pathways through c-Met (41). ZBTB20 was shown to promote cell proliferation in NSCLC through transcriptional repression of the FOXO1 tumor suppressor (42).
Other notable oncogenes included serine/threonine kinase BRAF, fibroblast growth factor receptor family member FGFR2, and tyrosine kinase receptor MET. These genes were observed in fusions across 2 samples each.
In total, 179 genes formed transcript fusions with multiple partners, and 24 genes took part in at least 3 different fusion configurations (Supplementary Table IV). PubMed searches revealed cancer involvement for 18 of these genes, with 8 participating in the development of lung cancer. The highest degree of promiscuity was found for the ZBTB7A and RNF213 genes, each of which formed fusions with 4 different partners. Neither of these genes was present in the reference list of oncodrivers or TSGs; however, a literature search revealed their involvement in the development of NSCLC. ZBTB7A is a transcriptional suppressor of glycolysis that is down-regulated in many cancer types including LC. Tumors deficient for this gene progress fast and display heightened sensitivity to glycolysis inhibition (43). RNF213 mutations discovered in circulating DNA discriminate between early-stage lung cancer and benign pulmonary nodules (44).
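Partner promiscuity as used above is simply the number of distinct partners a gene has across all unique fusions, irrespective of whether the gene is the 5′ or the 3′ partner. A minimal Python sketch, with hypothetical function names and placeholder gene symbols rather than study data, could look as follows.

```python
from collections import defaultdict


def partner_counts(fusions):
    """Map each gene to its number of distinct fusion partners.

    fusions -- iterable of (5' gene, 3' gene) pairs, one per unique fusion
    """
    partners = defaultdict(set)
    for gene5, gene3 in fusions:
        partners[gene5].add(gene3)
        partners[gene3].add(gene5)
    return {gene: len(p) for gene, p in partners.items()}


def promiscuous(fusions, min_partners=2):
    """Genes that form fusions with at least min_partners different partners."""
    return {g: n for g, n in partner_counts(fusions).items() if n >= min_partners}


# Placeholder example: GENE_X has three partners, one of them as the 3' gene.
toy_fusions = [
    ("GENE_X", "GENE_A"),
    ("GENE_X", "GENE_B"),
    ("GENE_C", "GENE_X"),
    ("GENE_D", "GENE_E"),
]
multi_partner = promiscuous(toy_fusions)
at_least_three = promiscuous(toy_fusions, min_partners=3)
```

Counting partners in both orientations is an assumption of this sketch; an analysis that distinguishes 5′ from 3′ roles would keep two separate tallies.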
Prediction of oncogenic potential with Oncofuse. The oncogenic potential of detected fusions was predicted using a naive Bayesian classifier implemented in the Oncofuse software (21). In total, 657 fusions were available for the prediction of oncogenic potential, expressed as driver probability. The full list of oncodriver prediction results is available in Supplementary Table VI. There were 88 fusions with driver probability over 0.95; eight of them involved the known oncogenes ERBB4, SMAD2, BRAF, PRKAR1A, FGFR2, KMT2C, ZNF236, and MET.
Six fusions with high oncodriver probability (>0.8) were recurrent (Table III). ZEB2:SEC13 was the recurrent fusion with the highest driver probability of 1.0. This fusion was detected in 2 out of 270 (0.74%) samples, one SCC and one AC. ZEB2 is a zinc finger E-box-binding homeobox transcription factor that regulates epithelial-to-mesenchymal transition (EMT) and is subject to recurrent translocations in T-cell acute lymphoblastic leukemia (45). MicroRNA-mediated inhibition of ZEB2 suppresses migration and invasion in NSCLC (46, 47). Another EMT-inducing factor, PlexinB1, was observed as a part of the HSPG2:PLXNB1 transcript fusion present in 3 SCC samples. Activation of PlexinB1 by CD100 promotes EMT and metastasis in head and neck SCC (48). Other recurrent fusions with high driver probability involved zinc finger proteins with putative DNA binding and gene regulation functions. For example, ZNF587B:ZNF211 had a driver probability of 0.97 and was found in 4 AC samples, 1 LCC, and 1 SCC. ZNF587B is a member of the C2H2-type zinc finger protein (ZFP) family that was shown to inhibit proliferation, colony formation, migration, and invasion in ovarian cancer (49).
The GABPB1:LMCD1 fusion, with a driver probability of 0.84, was observed in 10 (3.70%) of the samples, both squamous and non-squamous. Down-regulation of the long non-coding RNA GABPB1-IT1, expressed from the intronic region of GABPB1, is associated with poor prognosis in NSCLC (50). Somatic mutations in the 3′ partner LMCD1 promoted migration and metastasis in hepatocellular carcinoma (HCC) and systemic lung metastasis in a murine model (51).
Associations of transcript fusions with clinicopathologic characteristics and survival analysis. Generalized linear models (GLM) were used to investigate the association of recurrent fusions and promiscuous fusion partners with known clinicopathologic characteristics: sex (M/F), smoking status (never/ever), age (over/under 65), histology, recurrence, and stage (I, II and III). The survival analysis included the same set of clinical and demographic covariates. Considering that transcript fusions may arise from genomic rearrangements, the number of fusions within a tumor may serve as a surrogate measure of genomic instability. Our GLM analysis failed to detect any significant association between the number of fusions per sample and clinicopathological characteristics; similarly, no such association was found with patient survival (data not shown).
Four of the recurrent fusions were significantly associated with clinicopathologic covariates (GLM p-value <0.05) (Supplementary Table VII). CD27-AS1:MANBAL was associated with female sex (p=0.03); DNPEP:C9ORF3 was predominantly found in cases with no recurrence (p=0.04). The remaining fusions were associated with tumor histological types. GABPB1:LMCD1 was found in 1 out of 5 (20%) cases of the large cell neuroendocrine type, while only 8 out of 253 (3.16%) of the remaining cases were positive (p=0.03). The LMO7:EXT2 fusion was prevalent in adenosquamous carcinoma, with 4 out of 13 (30.77%) cases positive as opposed to 8.62% in the rest of the samples (p=0.03). Applying the same method to potential relationships between recurrent promiscuous genes and clinical characteristics did not reveal any significant results.
Integrative multivariate analysis failed to detect any recurrent transcript fusions associated with survival when the model included relevant clinical and demographic characteristics. Among clinical covariates, the later stage at diagnosis (II) and recurrence strongly predicted poor survival (p<0.001). Improved survival was expected in never smokers (p=0.01), while large cell histological type was associated with poor outcome (p=0.03) (Supplementary Table VIII). Three recurrent genes participating in multiple fusions, MROH1, N4BP2L2, and TRAPPC9, were significantly associated with survival. However, only in the case of MROH1 was the association highly significant in both Cox regression (HR=14.04, p<0.001) and the log-rank test (Figure 5; Supplementary Table VIII). Recent research suggests the involvement of MROH1 in the late actin-dependent stage of exocytosis (52); also, a recent study detected copy number variation of MROH1 in drug-resistant ovarian cancer (OC), while duplication of the MROH1 gene was associated with decreased survival in these patients (53). In addition, MROH1 was identified as one of the top genes that undergo CNV in the TCGA dataset (53).
Discussion
In this study, we demonstrated the utility of RNA-sequencing approaches applied to archived tumor specimens in the identification of transcript fusions with a potential contribution to the development of NSCLC. Our work involved a large collection (270 samples) of early-stage FFPE tumor samples that were surgically resected prior to the start of radio- or chemotherapy. This allowed us to observe tumors in their native state without the selective pressure and added genomic instability induced by therapeutic intervention. We found a diverse fusion landscape in early-stage NSCLC, with chimeric transcripts present in the majority of samples. Fusion candidates were filtered to reduce false-positives and compared to existing fusion datasets. The absolute majority (751 or 96%) of the fusion candidates were novel, showing that there is a plethora of chimeric transcripts yet to be fully described. We provide the research community with extensive supporting data, including the full list of fusion candidates, the sequences of chimeric transcripts, predictions of oncodriver potential, data regarding recurrent fusions, genes with multiple partners, evidence of their roles in cancer, and associations with clinicopathologic variables. These data are available in the supplementary materials and are meant to generate research leads resulting in the identification and validation of clinically relevant fusion events.
Currently, the majority of NGS applications are focused on the detection of fusions in lung cancer specimens using either amplicon-based or hybrid-capture enrichment approaches. These methods rely on custom and commercial panels with known targets (18). NGS is also an excellent method for the discovery of novel fusions with clinical significance. Normally, researchers mine the collection of TCGA datasets to discover and characterize fusions across various cancers (54-56). Considering the limited number of lung cancer cases in TCGA and the generally low frequency of fusions, there is a need to expand fusion detection to other case studies. A recent study of 54 Taiwanese patients with NSCLC was conducted on cryo-preserved specimens and resulted in the discovery of 218 fusions, of which only 24 were previously reported (22). Earlier, Rudin et al. applied NGS to characterize genomic features, including gene fusions, in small cell lung cancer (SCLC) (24). That study used RNA-seq to detect transcript fusions in 55 samples composed of primary tumors, adjacent normal tissue, and cancer cell lines. In total, 44 transcript fusions were detected, of which 4 were recurrent. Another large study applied RNA-seq to detect 45 fusions across 87 AC cases, 8 of them involved in transformation of tyrosine kinases (57). Other similar studies involved a much smaller number of cases (23, 58).
Compared to earlier reports, our project involved a larger panel of 270 cases representing diverse histological NSCLC types. Our samples were selected to represent early stages of malignant development, mostly stages I and II with a lesser representation of stage III. The tumors were resected before chemo- and radiation therapy to capture a snapshot of the cancer genome in its native state. Importantly, we successfully conducted fusion detection using RNA extracted from FFPE samples collected and stored over a long time period, showing the feasibility of large retrospective fusion studies with archived tumor material. In total, we detected 782 unique chimeric transcripts with an average occurrence rate of 3.32 per sample, in line with the fusion frequency described in a recent Taiwanese study (22). Also, similar to Chang et al., the vast majority of the fusions (96%) were not reported previously. Despite the similar nature of the samples, there was no overlap between the fusions detected in Chang et al. (2021) and the current study. Comparing our results with the 48 adenocarcinoma fusions reported by Seo et al. (2012) revealed a single overlap, the CD74:ROS1 fusion.
We observed a significantly higher frequency of transcript fusion events where either one or both partners were located on chromosome 19. This chromosome is known to have some unique properties such as increased gene density and unusual richness in clustered gene families, CpG islands, and repetitive elements (59). The repeats located on chromosome 19 mainly comprise Alu and LINE elements (59). A 32-38 bp long MSR1 minisatellite tandem repeat concentrated in 19q13 can contribute to genomic instability, influence gene expression through changes in copy number, and increase the risk of cancer (60). Interestingly, aberrations in chromosome 19 are associated with asbestos-induced lung cancer (61).
The Chromosome-Centric Human Proteome Project (C-HPP) determined that chromosome 19 aberrations are present in various lung cancer subtypes and lead to loss of tumor suppressor genes, changes in DNA repair genes and TGFB pathway genes, and formation of the fusion oncogene BRD4-NUT (62). At least 2 chromosome 19 genes detected in the C-HPP project (LTBP4 and PPP1R13L) were identified as fusion partners in our study. It appears that chromosome 19 is prone to forming aberrations in lung cancer; however, published evidence is still very limited.
A handful of genes contain mutations with a well-established oncodriver role in NSCLC (63). Of those, ROS1, FGFR2 and BRAF formed fusions in 2 samples, while NRG1 and MET fusions were single occurrences. Only one of these fusions, CD74:ROS1, was previously described (64); all others involved novel fusion partners. Erythroblastic oncogene 2, also known as HER2, is up-regulated in 1-4% of adenocarcinoma cases and is a well-established biomarker in a variety of other cancers (65). Although we did not find any fusions harboring the HER2/ERBB2 oncogene, 2 fusions involving the related genes ERBB2IP (ERBB2IP:FAM156B) and ERBB4 (PSME4:ERBB4) were observed. ERBB2IP is a scaffold protein that binds to ERBB2 with high specificity and stabilizes it at the plasma membrane (66). The IntOGen web platform (intogen.org) identified ERBB2IP as a mutational cancer driver in lung squamous cell carcinoma and in pan-cancer analysis (67). ERBB4 is a receptor tyrosine kinase (RTK) known to carry activating mutations in NSCLC (68) and to form fusions with multiple partners; a novel KIF5B:ERBB4 fusion was recently identified in AC patients (69). Other notable RTK oncogenes observed as fusion partners included MET, IGF1R, FYN, and SRC. MET fusions were found in 2 samples while the rest were non-recurrent. Of these, only 2 fusions were described previously: FYN:REV3L was found in breast cancer (70), and CAV1:MET was detected in an SCLC case with a positive response to crizotinib and osimertinib (71). The rest of the RTK fusions were novel.
We hypothesized that genes that recombine with multiple partners play a role in carcinogenesis. This notion is based on the assumption that disruption of a single oncogene or TSG is required to facilitate the malignant process, while the second fusion partner may vary due to the randomness of genomic rearrangement. To detect such genes, we analyzed promiscuity in fusion formation, defined as the number of different partners that fuse with a gene of interest. Literature searches showed that most of the promiscuous genes were involved in either lung or other types of cancer. The two genes with the highest promiscuity, RNF213 and ZBTB7A, each formed fusions with 4 fusion partners. All fusions that involved RNF213 and ZBTB7A had a very high predicted driver probability (estimated by Oncofuse), with a mean of 0.92 and a range of 0.78-0.99. An RNF213 fusion was recently reported and validated in chronic myeloid leukemia (72); furthermore, knockdown of RNF213 reverses hypoxia-induced death and promotes tumorigenicity (73). The mutational status of RNF213 was shown to differentiate early lung cancer from benign pulmonary nodules (44). ZBTB7A is a POZ/BTB and Kruppel (POK) family transcription factor that acts as an oncogene or a TSG depending on cellular context. Originally described as an oncogene in hematological cancers, the role of ZBTB7A was expanded to solid cancers including NSCLC, where this gene is frequently amplified and over-expressed (74, 75).
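The promiscuity measure defined above can be illustrated with a minimal sketch. The code below is not the pipeline used in this study; it assumes fusion calls are given as "GENE5:GENE3" strings, and the function name and the toy gene names (GENEA to GENED) are hypothetical placeholders rather than detected fusions.

```python
from collections import defaultdict


def partner_counts(fusions):
    """Return, for each gene, the number of distinct partners it fuses with.

    `fusions` is an iterable of "FIVE_PRIME:THREE_PRIME" strings (assumed format).
    Repeated observations of the same fusion are counted once, and the
    orientation of the fusion is ignored when collecting partners.
    """
    partners = defaultdict(set)
    for fusion in fusions:
        five_prime, three_prime = fusion.split(":")
        partners[five_prime].add(three_prime)
        partners[three_prime].add(five_prime)
    return {gene: len(found) for gene, found in partners.items()}


# Hypothetical toy input, not data from this study.
toy_fusions = [
    "GENEA:GENEB",
    "GENEA:GENEC",
    "GENED:GENEA",
    "GENEB:GENEC",
    "GENEA:GENEB",  # duplicate call, does not add a new partner
]
promiscuity = partner_counts(toy_fusions)

# Rank genes from most to least promiscuous.
ranked = sorted(promiscuity.items(), key=lambda item: (-item[1], item[0]))
for gene, n_partners in ranked:
    print(gene, n_partners)
```

Counting partners as a set rather than a list keeps a fusion seen in several samples from inflating promiscuity, which is a separate quantity from recurrence.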
The naive Bayes classifier implemented in the Oncofuse software was used to predict the oncodriver potential of detected fusion candidates. Seven recurrent fusions had high oncogenic potential (Table III). Two of them, ZEB2:SEC13 and HSPG2:PLXNB1, involved key regulators of epithelial-to-mesenchymal transition (EMT). ZEB2:SEC13 had a driver probability of 1 and was observed in 2/270 or 0.74% of samples; HSPG2:PLXNB1 had the same occurrence. ZEB2 also formed a third rearrangement, ZEB2:SRGAP, with low predicted driver potential. Novel ZEB2 fusions were recently identified by RNA sequencing in AML (76) and lipoblastoma (77). MicroRNA-mediated targeting of ZEB2 blocks EMT and the WNT/Beta-catenin pathway (78); furthermore, ZEB2 cooperates with PAX6 to promote EMT, metastasis, and cisplatin resistance in NSCLC (79). Little is known about the role of PLXNB1 in lung cancer; however, it is an established cancer gene in other malignancies. PLXNB1 serves as a receptor of semaphorins in a pathway that regulates cellular shape and motility (80). Mutations of PLXNB1 drive invasion and metastasis in prostate cancer (81); in breast cancer, PLXNB1 directly interacts with ERBB2 to promote metastasis (82). Another recurrent novel fusion candidate, GABPB1:LMCD1, had a driver probability of 0.84 and was found in 2.7% of samples. Recently, LMCD1 was identified as a transcription factor regulating EMT in NSCLC and breast cancer cell lines (83); somatic mutations of this gene promoted migration and metastasis in hepatocellular carcinoma (51). Overall, most of the recurrent fusions with driver potential have known involvement in cancer by either promoting EMT and metastasis (ZEB2, PLXNB1, LMCD1) or acting as tumor suppressors (ZNF587B:ZNF211). To our knowledge, none of these fusions were detected previously.
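As an illustration of how occurrence figures such as 2/270 (0.74%) are derived, the following sketch tabulates recurrent fusions from per-sample calls. It is a simplified example under stated assumptions, not the study's code: calls are assumed to be (sample ID, fusion) pairs, a fusion is treated as recurrent when it is seen in at least 2 distinct samples, and the sample IDs and the GENEX:GENEY fusion are hypothetical.

```python
from collections import defaultdict


def recurrence_table(calls, n_samples, min_samples=2):
    """Summarise fusions seen in at least `min_samples` distinct samples.

    `calls` is an iterable of (sample_id, fusion) pairs (assumed format).
    Returns {fusion: (number of samples, percent of cohort)}.
    """
    samples_by_fusion = defaultdict(set)
    for sample_id, fusion in calls:
        samples_by_fusion[fusion].add(sample_id)

    table = {}
    for fusion, samples in samples_by_fusion.items():
        if len(samples) >= min_samples:
            percent = round(100 * len(samples) / n_samples, 2)
            table[fusion] = (len(samples), percent)
    return table


# Toy calls with hypothetical sample IDs; the cohort size of 270 is from the text.
toy_calls = [
    ("S001", "ZEB2:SEC13"),
    ("S002", "ZEB2:SEC13"),
    ("S002", "ZEB2:SEC13"),   # repeated call in the same sample, counted once
    ("S003", "GENEX:GENEY"),  # hypothetical single-occurrence fusion
]
recurrent = recurrence_table(toy_calls, n_samples=270)
print(recurrent)
```

Counting distinct samples rather than raw calls matters because a fusion may be supported by several breakpoints or transcripts within one sample.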
We used linear models to test associations of fusion candidates and their partner genes with clinicopathological variables and survival. The rate of genomic rearrangements can be used as a proxy for genomic stability. In our analysis, the number of fusion candidates per sample was not significantly associated with any of the clinical covariates or with patient survival. There were significant associations (raw p-value <0.05) between individual fusions and certain demographic and clinical variables. For example, CD27-AS1:MANBAL had a higher occurrence in female patients, DNPEP:C9ORF3 showed an association with recurrence, GABPB1:LMCD1 had a higher frequency in large cell neuroendocrine histology, and LMO7:EXT2 in adenosquamous carcinoma. All of these fusions were novel, with no literature data available. Survival analysis with the Cox proportional hazards model identified detection at stage II and recurrence as strong predictors (p<0.001) of poor outcomes when other variables were accounted for (Supplementary Table VIII). The association of survival with stage III did not reach significance, likely due to the smaller sample size (n=36). Weaker associations (p<0.05) were detected with smoking, where “never” smokers had improved survival, and with LCC histology, which was associated with poorer outcome. No significant associations were found between survival and the occurrence of individual fusion candidates. The presence of MROH1 as a variable fusion partner was significantly associated with decreased survival (HR=14.04; p<0.001); however, these results are based on 2 positive samples only. Despite being a large and highly conserved gene, MROH1 is not well researched; a single functional study suggests its involvement in vesicle transport (52). There is no research on the involvement of this gene in cancer; however, an MROH1:BOP1 fusion was detected and validated in CCLE cell lines (56), and the data available in the COSMIC portal show MROH1 over-expression in multiple cancers, including intestinal, stomach, lung, and ovary.
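For readers who wish to screen their own data for associations between the presence of a fusion and a binary covariate (for example, sex or recurrence), a two-sided Fisher's exact test on a 2x2 table is a simple option. Note that this is a different and simpler technique than the linear and Cox models used in this study, and it does not adjust for other covariates. The sketch below uses only the Python standard library; the function name and the counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: fusion present / fusion absent; columns: covariate yes / no.
    The p-value sums the hypergeometric probabilities of all tables with
    the same margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x fusion-positive cases in the first column.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: fusion seen in 5 of 120 women and 0 of 150 men.
p_value = fisher_exact_two_sided(5, 0, 115, 150)
print(f"two-sided p = {p_value:.4f}")
```

With very few positive samples, as for most fusions here, an exact test avoids the large-sample approximations behind many model-based p-values, but any raw p-value from such a screen still needs correction for multiple testing.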
In conclusion, we applied the JAFFA fusion detection pipeline to identify potential transcript fusion candidates in 270 RNA-seq libraries. All the datasets were obtained from archived NSCLC tumor samples representing early stages of cancer development. In total, we detected 782 transcript fusions; only 33 of them were found in fusion databases, and the rest appear novel. Many fusions included well-known and emerging oncogenes with evidence of lung cancer involvement revealed by literature searches. Considering the association between fusion recurrence and a role in malignancy, we compiled lists of recurrent fusions and promiscuous fusion partners annotated with literature evidence of their role in cancer. In addition to the analysis of published data, fusions with high oncodriver potential were predicted with computational methods. Given the availability of clinicopathological data, we analyzed the associations of recurrent fusions and fusion partners with select clinical covariates, including survival. The major limitation of this study is the lack of molecular validation of the promising fusion candidates. In addition, because the study relied on RNA sequencing reads only, it is impossible to differentiate transcript fusions originating from genomic rearrangements from those arising at the post-transcriptional level (trans-splicing). Identification of a wide range of transcript fusion candidates enables us and other interested researchers to select, validate, and characterize clinically actionable targets to improve patient outcomes.
Acknowledgements
The Authors wish to thank the donors supporting the Glans-Look Database and the generous gift by Mr. Arnie Charbonneau to the POET program at the University of Calgary, which funded the research.
Footnotes
Supplementary Material
Available at: https://github.com/slavailn/fusion_manuscript_suppl_materials
Conflicts of Interest
The Authors disclose no conflicts of interest.
Authors’ Contributions
YI: Bioinformatics data analysis, investigation, writing the original manuscript. LP: Conceptualization, data curation, review. JBM, MK, AE, and AG: Experimental work. OK and IK: Conceptualization, review, editing. GB: Conceptualization, review, editing, supervision.
- Received January 4, 2023.
- Revision received June 23, 2023.
- Accepted July 6, 2023.
- Copyright © 2023, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC-ND) 4.0 international license (https://creativecommons.org/licenses/by-nc-nd/4.0).