Abstract
With the development of high-throughput biological techniques, high-dimensional omics data have emerged. These molecular data provide a solid foundation for precision medicine and prognostic prediction of cancer. Bayesian methods help to construct prognostic models that capture complex relationships in omics data and can improve performance through the choice of prior distributions, which makes them well suited to modelling high-dimensional data. Using different omics, several Bayesian hierarchical approaches have been proposed for variable selection and model construction. In particular, Bayesian methods for multi-omics integration have been proposed steadily in recent years. Compared with single-omics modelling, multi-omics integration can improve predictive performance, provide insights into the underlying mechanisms of tumour occurrence and development, and support the discovery of more reliable biomarkers. In this work, we present a review of currently proposed Bayesian approaches to prognostic prediction modelling in cancer.
There were more than nine million deaths caused by cancer in 2018, which made it the second leading cause of death worldwide (1). Besides traditional radiotherapy, surgery and chemotherapy, targeted therapy and immunotherapy have also been developed recently, such as targeted epidermal growth factor receptor therapies, targeted vascular endothelial growth factor therapies, inhibitor of programmed cell death protein-1 and its ligand, and cytotoxic T-lymphocyte-associated protein 4 inhibitor (2). However, since cancer is a highly heterogeneous disease, these emerging therapies might not be sufficiently effective in all patients. Therefore, it is still necessary to further investigate the biological mechanism of cancer development and explore possible potential biomarkers related to tumour prognosis, diagnosis, etc.
With the development of high-throughput sequencing technology, high-dimensional data based on different omics data is constantly being generated, which provided researchers with data at the molecular level to reveal relevant biomarkers of cancer development or prognosis. At present, there are already some public cancer databases based on molecular data, such as The Cancer Genome Atlas (TCGA), Gene Expression Omnibus, International Cancer Genome Consortium, and the Catalogue of Somatic Mutations in Cancer. Among them, TCGA, known as a comprehensive database, has collected multiple omics data, including those regarding the transcriptome, methylome, somatic mutation and copy number variation, and it is the most common database used by researchers for data mining (http://cancergenome.nih.gov) (3). Different omics data can be utilized as potential biomarkers to predict the prognosis of different cancer types (4-7).
Furthermore, compared to the actual sample size (typically hundreds of samples), the scale of omics data is relatively large, which means that the data need to be processed (e.g., by dimensionality reduction) rather than used to fit the predictive model directly. In addition, there are considerable challenges in performing effective dimensionality reduction and variable selection for identification or prediction with these high-dimensional omics data. Traditional frequentist approaches (i.e., those regarding unknown parameters as fixed values that can be estimated from the sample data) for coping with these challenges include ridge (8), the least absolute shrinkage and selection operator (lasso) method and its variants (9-12), and elastic net (13), amongst others. These methods have been compared in the setting of high-dimensional data (14). Benefiting from improvements in computing capacity, Bayesian methods have been developed in recent years which offer greater flexibility in dealing with the problem of high-dimensionality (p>>n) compared to traditional approaches. In fact, Bayesian approaches have become popular tools for medical researchers involved in clinical experimental design and drug development (15). Although some Bayesian methods have been proposed for prognostic prediction, there is still a lack of reviews that discuss them comprehensively.
In the present work, we review these Bayesian methods involving high-dimensional omics data proposed or applied for prognostic prediction in cancer. The primary outcome in this review focuses on relevant survival endpoints, such as overall survival, progression-free survival and relapse-free survival. We mainly explain Bayesian survival modelling from two aspects of single omics and multi-omics integration and emphasize the superiority of multi-omics integration. This survey mainly searched PubMed and Web of Science databases for articles based on key words “cancer” or “tumour”, “Bayesian”, “survival”, “omics”, “gene”, “prediction”, and “integrative”. After removing 25 reviews or meta-analyses, we found a total of 313 articles based on the above key words. Through reading the titles and abstracts, we initially excluded those involving experiments or clinical trials, which left 38 articles that involved Bayesian prognosis prediction and high-dimensional omics. After reading the full articles, we finally included 32 in this review.
Basic Bayesian Methodology
In this section, we briefly review relevant content about Bayesian inference to help readers understand the basic ideas of Bayes' theorem. In survival analysis, semi-parametric Cox proportional hazards regression is the most common modelling tool for clinical prediction because it places no restriction on the particular survival distribution. An alternative is the parametric accelerated failure time (AFT) model, which assumes a direct relationship between covariates and outcome (generally, the logarithm of survival time) and can therefore yield more interpretable results. Parameter estimation for both approaches is based on maximum likelihood (strictly, partial likelihood for Cox regression), which infers the values of unknown parameters that are most plausible given the existing sample data. In contrast to the frequentist view, Bayesian theory also draws on “experience” knowledge rather than sample data alone. Instead of considering the unknown parameter as a fixed value, Bayesian theory regards it as a random variable that follows a specific distribution. Before the sample data are taken into account, a distribution based on prior experience is specified as a kind of pre-judgement for the unknown parameter, which is known as the prior distribution.
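For readers less familiar with the two survival models named above, their standard textbook forms can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original text: x_i is the covariate vector of patient i, β the regression coefficients, h_0(t) the unspecified baseline hazard, δ_i the event indicator, R(t_i) the set of patients still at risk at time t_i, and ε_i an error term with a specified distribution. The partial likelihood is shown for the case without tied event times.

```latex
% Cox proportional hazards model (semi-parametric: h_0 is left unspecified)
h(t \mid x_i) = h_0(t)\,\exp\!\left(x_i^{\top}\beta\right)

% Cox partial likelihood (no tied event times)
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(x_i^{\top}\beta\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right)}

% Accelerated failure time model (parametric: the distribution of \varepsilon_i is specified)
\log T_i = x_i^{\top}\beta + \sigma\,\varepsilon_i
```

In a Bayesian treatment, either likelihood is combined with a prior on β, which is where the priors discussed later in this review enter.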
Let θ and y represent the unknown parameter of interest and sample data, respectively. Then the Bayesian formula can be expressed in the following form:
$$P(\theta \mid y) = \frac{P(\theta)\,P(y \mid \theta)}{P(y)}$$
where P(θ) represents the prior distribution of the unknown parameter θ, and P(y|θ) can be regarded as the likelihood function from the sample data. P(y) is the so-called marginal probability and can be expanded as ∫ P(θ)P(y|θ)dθ, which does not depend on the parameter θ to be estimated. Therefore, the formula can be re-expressed in kernel form as P(θ|y) ∝ P(θ)P(y|θ). P(θ|y) represents the posterior probability, which is the main aim of our inference. It can be regarded as an update of the prior information P(θ) in light of the sample data, and it is the basis for Bayesian inference.
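As a hedged illustration of the relationship between prior, likelihood and posterior, the following minimal Python sketch evaluates the product of prior and likelihood on a grid and normalises it. The example is a toy one chosen for this illustration, not taken from any reviewed study: a binomial likelihood with a Beta(2, 2) prior and hypothetical data of 7 events in 10 trials. The function name `grid_posterior` is likewise hypothetical.

```python
import math


def grid_posterior(successes, trials, a=2.0, b=2.0, n_grid=1001):
    """Approximate P(theta | y) on a grid: binomial likelihood, Beta(a, b) prior."""
    # Midpoints of equal-width cells in (0, 1); this avoids log(0) at the endpoints.
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    kernel = []
    for t in thetas:
        log_prior = (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)
        log_lik = successes * math.log(t) + (trials - successes) * math.log(1 - t)
        # Unnormalised posterior: prior times likelihood.
        kernel.append(math.exp(log_prior + log_lik))
    total = sum(kernel)  # stands in for the marginal probability P(y)
    weights = [k / total for k in kernel]
    return thetas, weights


thetas, weights = grid_posterior(successes=7, trials=10)
post_mean = sum(t * w for t, w in zip(thetas, weights))
```

Because the Beta prior is conjugate to the binomial likelihood, the exact posterior here is Beta(9, 5) with mean 9/14, so `post_mean` can be checked against that value. Grid evaluation is only practical for one or two parameters, which is why sampling and approximation methods are needed in higher dimensions.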
The choice of prior has an essential impact on the calculation and inference of the posterior distribution. Especially in high dimensions, variable selection is a necessary process and can be achieved by setting suitable priors. Reviews of these variable selection methods have been published (16-20). In particular, conjugate priors can simplify the calculation of the posterior distribution, such as the Gamma-Poisson prior, inverse-Gamma-normal prior, Beta-binomial prior, Wishart-normal prior and so on (21). However, conjugacy is not available in all Bayesian analyses. In addition, the integration becomes more arduous when the parameter to be estimated is multi-dimensional [θ=(θ1, …, θp)]. Therefore, approximation techniques are an alternative choice, including Laplace approximation (22), variational inference (23) and sampling-based Markov chain Monte Carlo (MCMC). Among them, MCMC algorithms are applied widely in survival analysis. By constructing a Markov chain whose stationary distribution is the posterior distribution and using it to simulate samples, the posterior expectation [E(θ|y)] and variance of the corresponding parameters can be obtained. However, the original MCMC algorithm has a relatively low transition acceptance ratio, which means it takes a long time to converge, especially with high-dimensional data. In order to obtain faster convergence, several improved algorithms have been proposed, such as Metropolis–Hastings (24), Gibbs (25), Hamiltonian Monte Carlo (26), and Expectation-Maximization (EM) cyclical coordinate descent (27).
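To make the sampling idea concrete, the sketch below implements a random-walk Metropolis sampler, the simplest special case of Metropolis–Hastings with a symmetric proposal. It is an illustration under stated assumptions rather than the algorithm of any reviewed study: the model is a toy normal mean with known unit variance and a N(0, 10²) prior, the five data values are made up, and all function and parameter names are hypothetical.

```python
import math
import random


def log_posterior_kernel(theta, data, prior_sd=10.0, noise_sd=1.0):
    """Log of prior times likelihood, up to an additive constant."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - theta) / noise_sd) ** 2 for y in data)
    return log_prior + log_lik


def metropolis_hastings(data, n_iter=20000, burn_in=2000, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a scalar parameter."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior_kernel(theta, data)
    draws = []
    for i in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior_kernel(proposal, data)
        # Symmetric proposal, so the acceptance probability is
        # min(1, posterior(proposal) / posterior(current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            draws.append(theta)  # keep the current state, accepted or not
    return draws


data = [1.2, 0.8, 1.5, 0.9, 1.1]  # hypothetical observations
draws = metropolis_hastings(data)
posterior_mean = sum(draws) / len(draws)
```

For this conjugate toy model the exact posterior mean is 5.5/5.01, so the Monte Carlo estimate can be checked directly. The `step` parameter controls the trade-off discussed above: large steps are rejected often, while small steps are accepted but explore slowly.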
Bayesian Applications in Cancer for Prediction of Prognosis
Single omics analysis. In this section, single omics analysis refers to the inclusion of single omics platform data in the study, combined with or without clinical factors for prognostic modelling. Through literature collection, we found that most of the studies were based on expression profiles, followed by proteomics and genomics. The relevant characteristics of the various studies have been summarised in Table I. Bayesian methods in these studies played a significant role in dimensionality reduction/variable selection and modelling.
Variable selection based on priors. Here, the variable selection procedure is performed by priors designed specifically for variable selection. The main priors currently in use include shrinkage priors and spike-and-slab priors. Shrinkage priors constrain the corresponding coefficients towards zero by imposing a continuous prior distribution. For instance, Kaderali et al. introduced a hierarchical Bayesian approach to predict individual survival times based on expression profiles, which imposed the normal-Gamma prior on regression parameters. It combined dimension reduction and regression in one single step to select the most discriminatory genes under study (28). This method has been shown to have better prediction performance in neuroblastoma (29). Nikooienejad et al. introduced a procedure for Bayesian variable selection that used a mixture prior comprising a point mass at zero and an inverse moment prior, which led to lower false-positive rates in the variable selection process. Due to its Cauchy-like tail, this prior introduced comparatively small shrinkage of large coefficients (30). Considering potential pathway/structure among genes, Jiang et al. proposed a Bayesian Robit regression model with hyper-lasso priors for survival-related feature selection. It took possible group structures into account automatically without requiring a pre-specified grouping structure (31). Zhang et al. proposed a two-stage pathway-based Bayesian modelling strategy for survival modelling, which imposed a double-exponential prior (also known as the Laplace prior or Bayesian lasso) on the coefficients (32). In short, the genes mapped to each pathway were fitted in a Bayesian hierarchical Cox model, and then the prognostic scores for each pathway, which were calculated by leave-one-out cross-validation, were regarded as new predictors to build an integrated prognostic model. Based on proteomics, Maity et al. developed a joint Bayesian hierarchical model to model both the survival time and a binary outcome, and applied a shrinkage prior (i.e., a truncated Cauchy distribution) to identify common significant proteins that affected survival and stage (33). Furthermore, the authors also presented a pan-cancer Bayesian hierarchical AFT model for survival (34). It adopted the sparse horseshoe prior for identifying the major proteomic drivers and was able to borrow strength across multiple cancer types by setting a correlation structure among prior distributions.
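The shrinkage effect of a double-exponential (Laplace) prior can be illustrated in the simplest possible setting, which is an assumption made here for clarity and not the hierarchical Cox model of the cited work: a single observed effect z with normal noise of variance `noise_var`, and a Laplace prior with scale `scale` on the true coefficient. In that setting the posterior mode has a closed form, the soft-thresholding rule familiar from the lasso. All names below are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, setting small values exactly to zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def neg_log_posterior(beta, z, noise_var=1.0, scale=0.5):
    """Negative log posterior (up to a constant) for z ~ N(beta, noise_var)
    with a Laplace(0, scale) prior on beta."""
    return (z - beta) ** 2 / (2.0 * noise_var) + abs(beta) / scale


def laplace_map(z, noise_var=1.0, scale=0.5):
    """Posterior mode under the Laplace prior: soft-thresholding at noise_var / scale."""
    return soft_threshold(z, noise_var / scale)
```

With the default values the threshold is 2, so `laplace_map(3.0)` gives 1.0 while `laplace_map(1.0)` gives 0.0: a large effect is shrunk and a small one is removed. Heavier-tailed priors such as the horseshoe are intended to shrink large effects less than this.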
Another main prior is the spike-and-slab prior (a kind of mixture prior), which includes a point mass at 0 (the spike) and a continuous distribution (the slab) (35). It can assign different distributions (the slab or spike) to model different effect sizes. For example, Duan et al. proposed Bayesian survival regression with the spike-and-slab prior applied on coefficients for variable selection (36). It extended EM variable selection (37) to a parametric survival model with Weibull distribution and utilized an EM algorithm to quickly obtain the posterior modes of unknown parameters. Similarly, considering the underlying structure, Stingo et al. proposed a comprehensive Bayesian model that achieved two-level selection of pathways and the genes involved. In addition to the prior on coefficients, a Markov random field prior was applied to obtain a better separation between relevant and irrelevant pathways and lower false-positive rates in a model with fairly small coefficients (38). The efficacy of the above Bayesian approach was further validated in lung adenocarcinoma for prognostic analysis (39). Peterson et al. proposed a joint Bayesian modelling strategy for learning relevant networks and selecting network-structured variables simultaneously. It applied a mixture prior with a normal density and a Dirac function for variable selection, which is also known as the spike-and-slab prior. The authors also constructed the network through a Gaussian graphical model to provide a sparse and interpretable representation of the dependency relationships in the data (40).
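How a spike-and-slab prior separates relevant from irrelevant variables can be seen in a one-coefficient sketch. The setting is assumed for illustration only: an observed effect z with normal noise, a point mass at zero as the spike, a zero-mean normal slab, and a prior inclusion probability of 0.1. Under these assumptions the posterior inclusion probability follows directly from Bayes' theorem by comparing the two marginal densities of z. The function names and default values are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean normal distribution with variance var."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def inclusion_probability(z, prior_incl=0.1, slab_var=10.0, noise_var=1.0):
    """Posterior probability that the coefficient comes from the slab.

    Spike: beta = 0, so z ~ N(0, noise_var).
    Slab:  beta ~ N(0, slab_var), so marginally z ~ N(0, noise_var + slab_var).
    """
    slab = prior_incl * normal_pdf(z, noise_var + slab_var)
    spike = (1.0 - prior_incl) * normal_pdf(z, noise_var)
    return slab / (slab + spike)
```

An observed effect near zero is assigned an inclusion probability below the prior value of 0.1, whereas a large effect such as z = 5 is assigned a probability close to 1. Thresholding these probabilities is one common way of turning the posterior into a selected variable set.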
The above Bayesian approaches mainly imposed either a shrinkage prior or the spike-and-slab prior on regression coefficients. Recently, a hybrid approach (i.e., the spike-and-slab lasso) has been proposed, which combines the advantages of shrinkage and spike-and-slab priors (41). Tang et al. presented Bayesian generalized linear models based on the spike-and-slab lasso (27) and extended this framework to a Cox model for prognostic modelling (sslasso Cox) (42). This work proposed a new EM coordinate descent algorithm that converges faster than the MCMC algorithm. Compared to the traditional lasso, the approach is able to estimate the effects of predictors more accurately. Later, considering possible group structures among genes, the above survival model was further extended into the group sslasso Cox model, which incorporates potential pathway structure information (43).
Variable selection based on non-priors. Apart from the above-mentioned Bayesian variable selection approaches based on priors, variable selection can be carried out in other ways, generally through model-based selection. In this approach, the priors given to predictors (i.e., to the coefficients) tend to be flat priors (e.g., a general normal distribution). Bonato et al. proposed a non-parametric Bayesian ensemble method that extended Bayesian additive regression trees to three survival models (Cox, AFT, and Weibull) using a Bayesian hierarchical framework which incorporated additive and interaction effects between genes. It achieved variable selection by the Bayesian false discovery rate through the Bayesian model averaging (BMA) approach (45). Zhang et al. proposed a semi-parametric Bayesian kernel survival model (46), which considered gene pathway effects related to survival outcome via a kernel function in a Gaussian process (47). It adjusted the traditional Bayes factor (48) to account for multiple comparisons and so perform robust statistical inference. It allowed for the possibility of nonlinear single-gene effects and a complicated structure of interactions among genes within the same pathway. Samorodnitsky et al. proposed a novel pan-cancer Bayesian hierarchical survival model based on the somatic mutation profile. They evaluated the optimal model for the fitted data and the predictive importance of genes by calculating the log out-of-sample posterior predictive likelihood and mean log-posterior likelihood (i.e., model selection and forward selection) (49). Newcombe et al. considered the Weibull regression model embedded in a Bayesian framework based on the Reversible Jump MCMC algorithm, which was called sparse Bayesian Weibull regression. It applied a beta-binomial prior for the indicator variable γ to construct the model selection framework (if γ=1, the covariate was included in the model) (50).
Another type of approach, known as BMA, focuses on coping with the model uncertainty problem (i.e., multiple well-fitting models might exist simultaneously for the same data) (51). To accommodate high-dimensional data, iterative BMA was proposed later (52). Furthermore, Annest et al. extended iterative BMA to survival analysis through several modifications to the iterative BMA algorithm, including using the Cox proportional hazards model for ranking individual genes instead of the between-group to within-group sum of squares technique, applying a user-specified number of top-ranked genes for the iterations of the BMA algorithm, and re-considering the discarded models based on an adaptive threshold after the iterations (53). The risk scores of individuals were calculated as the weighted average of the risk scores calculated for each model in the set of final models. Kaplan–Meier analysis in breast cancer and diffuse large B-cell lymphoma showed a significant difference between high-risk and low-risk groups based on gene sets selected by this iterative-BMA approach (53). Similarly, the same approach was further applied in early-stage lung adenocarcinoma (54), mantle-cell lymphoma (55) and breast cancer (56).
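The weighted averaging of risk scores across models can be sketched as follows. This is a simplified illustration and not the iterative BMA algorithm of the cited studies: it assumes that posterior model probabilities are approximated from each model's Bayesian information criterion (BIC), a common approximation in the BMA literature, and the BIC values and per-model risk scores passed in are hypothetical inputs.

```python
import math


def bma_weights(bics):
    """Approximate posterior model probabilities from BIC values.

    Uses weight_k proportional to exp(-BIC_k / 2); subtracting the minimum
    first keeps the exponentials numerically stable.
    """
    best = min(bics)
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def averaged_risk_score(model_scores, weights):
    """Weighted average of one patient's risk scores across candidate models."""
    return sum(w * s for w, s in zip(weights, model_scores))


# Hypothetical example: three candidate models and one patient's scores.
weights = bma_weights([210.4, 212.1, 215.0])
score = averaged_risk_score([0.8, 1.1, 0.3], weights)
```

The model with the lowest BIC receives the largest weight, and the averaged score always lies between the smallest and largest per-model scores, so no single model dominates the prediction unless the data strongly favour it.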
Multi-omics analysis. Considering that cancer is a highly heterogeneous disease, single omics information can only explain part of the potential mechanisms involved in the development of tumours, which means the above prognostic models based on single omics data might not be accurate enough. Therefore, in recent years, multi-omics integrated modelling analysis has provided the potential direction for us to comprehensively understand the mechanisms and find more convincing biomarkers. Moreover, except for potential structures within omics, there are often complex biological relationships between different omics platforms (Figure 1), which brings larger challenges to omics fusion modelling. Currently, the idea of integration can be divided into two categories, namely multi-stage analysis and meta-dimensional analysis (57). The former focuses on the association between omics and splits into multiple steps for analysis, while the latter considers that all data are combined simultaneously to identify complex multi-variable models (e.g., concatenation-based integration, transformation-based integration, model-based integration). However, due to distinct modelling perspectives, the Bayesian approaches discussed below might not be classified into the above types accurately. In order to describe these more logically, we briefly categorize the following literature into three types based on their approach characteristics. A brief summary of these Bayesian integrative approaches has been listed in Table II.
Integration based on a multi-stage approach. In this approach, the integration model tends to be divided into multiple steps. For instance, Wang et al. proposed the integrative Bayesian analysis of genomics data (iBAG), which modelled biological relationships among different omics by focusing on the “gene-centric” level. It was composed of two modelling stages, a mechanistic model and a clinical model. The former was used to infer the direct effects of different features (e.g., methylation) on gene expression and the latter was used to construct a final prognostic model based on the result of the previous step (58). Jennings et al. extended the iBAG to a generalized version and focused on genes from several key cancer signalling pathways. Similar to iBAG, in the first stage, a mechanistic model was constructed by partitioning the expression of each gene into the factors explained by methylation, copy number and other features using a principal component regression model. In the second stage, the above factors were treated as predictors and used to establish the clinical model. This study set a normal-Gamma prior on regression coefficients, which provided sparsity (i.e., only a few genes significantly affect the outcome while others have no or weak influence) (59). Bernal Rubio et al. developed a multi-layered Bayesian model for integrating clinical information and multi-omics. It assessed the inter-individual variation that could be explained by molecular predictors related to survival. The research process was divided into multiple stages. First, a baseline Cox regression model was used to analyse the association between clinical/demographic covariates and survival. After considering the association between omics and clinical/demographic covariates using principal components analysis, the baseline model was extended by adding omics profiles in a linear model with the logarithm of survival time as the dependent variable (60).
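The two-stage logic of a mechanistic model followed by a clinical model can be sketched for a single gene as below. This is a deliberately simplified, non-Bayesian stand-in and should not be read as the iBAG implementation: ordinary least squares replaces the Bayesian estimation and priors of the cited studies, censoring is ignored, only one upstream platform (methylation) is used, and the data values and function names are hypothetical.

```python
def ols_slope_intercept(x, y):
    """Slope and intercept of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def mechanistic_stage(expression, methylation):
    """Stage 1: split expression into a methylation-explained part and the rest."""
    slope, intercept = ols_slope_intercept(methylation, expression)
    meth_part = [intercept + slope * m for m in methylation]
    other_part = [e - f for e, f in zip(expression, meth_part)]
    return meth_part, other_part


def clinical_stage(meth_part, other_part, log_time):
    """Stage 2: relate each component to log survival time.

    The two components are orthogonal after centring, so the separate slopes
    equal the coefficients of a joint least-squares fit with an intercept.
    """
    b_meth, _ = ols_slope_intercept(meth_part, log_time)
    b_other, _ = ols_slope_intercept(other_part, log_time)
    return b_meth, b_other


# Hypothetical data for one gene in five patients.
methylation = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [2.9, 2.4, 2.2, 1.5, 1.1]
log_time = [3.2, 3.0, 2.7, 2.1, 1.9]

meth_part, other_part = mechanistic_stage(expression, methylation)
b_meth, b_other = clinical_stage(meth_part, other_part, log_time)
```

The two resulting coefficients indicate whether the association between the gene and survival runs through the methylation-driven part of its expression or through the remainder. In the reviewed approaches the second stage would additionally carry a sparsity-inducing prior, which is what provides variable selection across many genes.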
Integration based on the extension of existing methods. Bayesian modelling methods in this part largely focused on dealing with several specific challenges or problems in the multi-omics integration process. Vazquez et al. proposed the Bayesian generalized additive model for integrating high-dimensional multi-layer omics data (61), which was based on the generalized additive model framework (62). This modelling framework allowed the integration of high-dimension inputs from multiple omics layers, interactions between omics and different effect architectures across layers. Srivastava et al. presented the hierarchical relevance vector machines approach for addressing nonlinear patterns in the complex interactions within and between different platforms (gene expression and microRNA). Based on kernel functions, the nonlinear effects of possible interactions were considered, and it was possible to evaluate the respective contributions of different omics simultaneously (63). Ni et al. proposed a Bayesian hierarchical varying-sparsity regression model to integrate the genomic, proteomic and clinical information based on the AFT model, which contributed to identifying patient-specific prognostic biomarkers (64). Following the concept of varying coefficients, this study introduced sparsity into the model, which allowed for flexible interactions and the sparsity of the protein–outcome relationship to vary simultaneously with gene expression. Moreover, it thoroughly considered potential nonlinear relationships between the proteins and genes by spline-based semi-parametric forms. Maity et al. proposed a Bayesian structural equation model for integrating two platforms (namely copy number variation and mRNA) data (65). 
Considering the potential direction among different platforms (biological relationships), a latent variable in the structural equation was introduced to overcome the challenge that a specific omics component may or may not necessarily affect other specific omics components. Furthermore, a basic problem might exist in the integration analysis, that is, an insufficient valid sample size (omics data may not be available for all patients). Chekouo et al. proposed a novel Bayesian integrative multi-regression model to deal with this problem. In their study, multiple regression models were constructed to make full use of the sample data of each omics platform rather than just the complete data across all platforms (66). The key point was to apply a Markov random field prior for borrowing strength across groups, which encouraged the selection of the same predictors across multiple models.
Integration based on network/graph structure. Here, the integrative relationship of multiple omics was portrayed by a network/graph structure to establish prognostic models for screening key cancer biomarkers. Wang et al. developed an integrative network-based Bayesian analysis approach for analyzing multi-platform high-dimensional genomic data. It assumed one triplet as the combination of the expression level for one gene, one microRNA and the associated clinical outcome for modelling the underlying biological relationship, and considered eight possible triplet structures (models). Based on Gaussian graphical models, this study utilized Bayes factors with hyper-inverse Wishart g-prior to perform the model selection procedure (67). Chekouo et al. proposed a network-based Bayesian hierarchical model to identify microRNAs and target genes associated with survival. The microRNA regulatory network was modelled as a Bayesian network and a specific prior was constructed to link the two omics, which gave a higher probability of survival related to those highly connected biomarkers in the regulatory network (68). Kundu et al. proposed a novel multi-scale Bayesian approach that distinguished essential genomic drivers of cancer progression from integrative graphical structure learning. The process of their study was divided into two parts. Firstly, the dependence relationship within and between platform-specific features was assessed. Then identification of important prognosis-related molecular markers was performed by the above dependence relationship. Structured variable selection was implemented by identifying important cliques (i.e., the overlapping functional subgroups within/across platforms that influenced clinical outcome together) through computing corresponding marginal inclusion probabilities and eliminating unimportant covariates in cliques identified through computing Bayesian 95% credible intervals (69).
Discussion
As a summary of existing literature, the current study contributes to developing the prognostic modelling of cancer based on high-dimensional data and expanding new ideas/strategies for researchers. After briefly introducing the basic Bayesian theorem, we mainly discussed current published Bayesian approaches based on single/multiple omics data with/without clinical characteristics. As one essential part of Bayesian inference, prior information plays a critical role in the process. Notably, although variable selection can be accomplished by model-based approaches (such as BMA), Bayesian inference depends largely on the setting of suitable priors under the assumption that only a small subset of predictors significantly influence the survival situation. The correct usage of priors will help improve the performance of the predictive model but it is also a seemingly difficult task to choose appropriate priors. A previous study explored the effects of different types of Bayesian variable selection based on simulated and actual data in a relatively low-dimensional situation (20). However, in the case of high dimensions, different prior choices still need to be fully discussed. Although we perhaps find references from previous similar literature for specific analysis goals, it might be more appropriate to be guided by available data at hand.
Bayesian prognostic modelling with single omics might, to a certain extent, improve predictive accuracy. Nevertheless, some limitations still exist. Tumour development is frequently accompanied by hundreds of distinct molecular changes, from many kinds of genomic and epigenetic alterations to their interactions. This means that, when single omics is utilized for prognosis modelling and screening of potential biomarkers, the clinical outcomes of interest might also be affected by other platforms (for example, microRNA, methylation), which might weaken the reliability of the evidence for the identified markers.
In fact, multi-omics integration might be more advantageous in investigating the biological mechanism of tumour progression or prognosis modelling for finding reliable targets in the era of precision medicine. A large-scale benchmark study from TCGA discovered that methods taking into account the multi-omics structure had better predictive performance, although the utility of multi-omics data was limited (70). The current methods (mainly machine learning) in the multi-omics integration field have been reported in a previous review, although it was not specifically illustrated from the Bayesian perspective (71). In relevant Bayesian articles involving multi-omics fusion, a core issue is also how to integrate different large-scale platforms, in addition to dimension reduction/variable selection. Integration based on biological relationships has better interpretability in the final result than simple merging. Then grasping the complicated relationship between multiple omics in the integration process will improve the effectiveness of the predictive model, such as interactions within/between platforms (61, 63, 64), non-linear relationships (63, 64). Moreover, relying on the complex network or intuitive graph model in the Bayesian framework might contribute to comprehensively describing the relationship of multiple omics for better survival modelling (68, 69).
However, several deficiencies in the above approaches need to be further discussed or explored. i) Model validation. Nowadays, evaluating the performance of a prognostic model increasingly relies on effective internal or external validation. Compared with external validation, internal validation, such as k-fold cross-validation and bootstrapping, is convenient to implement. Although the internal validation used in most of the literature was able to verify the performance of the models, adopting an independent external cohort would further illustrate the reliability of the predictive model. ii) The occurrence and development of cancer is a dynamic process, which also means that the integration of different omics may need to be considered a dynamic process, with complex patterns of feedback loops both within and between platforms. Dynamic modelling could be a major challenge. iii) Computational complexity or burden must also be considered, especially when integrating multiple high-dimensional omics under the Bayesian framework. iv) Practicality needs to be deliberated. Most integrative studies validated their proposed Bayesian approaches on glioblastoma. The actual effect of extending these to other tumour types is still unclear and needs more support from the literature in the future. In addition, model interpretability is also an issue not to be ignored, particularly when multi-omics data are integrated into the Bayesian modelling process. The real improvement in utility of such screened biomarkers needs further evaluation after clinical translation.
As for prognostic modelling in the future, new biostatistical methods should overcome the above limitations. Approaches based on the Bayesian theorem might provide more flexible and accurate results regarding the prognosis of cancer. Interestingly, as well as these molecular omics data, emerging radiomics might also be considered in the modelling framework for acquiring optimal predictive accuracy (72, 73). The integration of these omics will help us understand the development of tumours at multiple levels for better predicting the prognosis of patients.
Conclusion
In summary, this review systematically collated the Bayesian approaches based on high-dimensional molecular data for prognostic analysis in cancer, especially the multi-omics fusion methods. These approaches have been proposed chiefly in the past 10 years. These studies suggest that multi-omics integration methods with different strategies for feature selection have better predictive performance and greater value in finding biomarkers that affect prognosis in cancer. However, multi-omics integration often faces a key challenge, namely how to effectively integrate the relationships between multiple platforms. At the same time, the potentially intractable estimation of a complex posterior distribution is also worthy of attention. Therefore, further research is necessary.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (project number 81973143).
Footnotes
Authors’ Contributions
Jiadong Chu and Na Sun: Collection and writing. Wei Hu and Xuanli Chen: Table and graph. Yueping Shen: Design and revision. Nengjun Yi: Revision.
Conflicts of Interest
The Authors declare that they have no conflicts of interest.
- Received October 18, 2021.
- Revision received November 24, 2021.
- Accepted November 30, 2021.
- Copyright © 2022, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved