Abstract
Rapid advancements in high-throughput biological techniques have facilitated the generation of high-dimensional omics datasets, which have provided a solid foundation for precision medicine and prognosis prediction. Nonetheless, the problem of missing heritability persists. To solve this problem, it is essential to explain the genetic structure of disease incidence risk and prognosis by incorporating interactions. The development of the Bayesian theory has provided new approaches for developing models for interaction identification and estimation. Several Bayesian models have been developed to improve the accuracy of model and identify the main effect, gene-environment (G×E) and gene-gene (G×G) interactions. Studies based on single-nucleotide polymorphisms (SNPs) are significant for the exploration of rare and common variants. Models based on the effect heredity principle and group-based models are relatively flexible and do not require strict constraints when dealing with the hierarchical structure between the main effect and interactions (M-I). These models have a good interpretability of biological mechanisms. Machine learning-based Bayesian approaches are highly competitive in improving prediction accuracy. These models provide insights into the mechanisms underlying the occurrence and progression of complex diseases, identify more reliable biomarkers, and develop higher predictive accuracy. In this paper, we provide a comprehensive review of these Bayesian approaches.
Complex diseases, such as cancer, are critical non-communicable chronic diseases that seriously endanger the health and quality of life of people. They consistently rank among the leading causes of death, contributing significantly to the global burden of disease (1). In 2020, the World Health Organization reported an estimated 19.29 million new cancer cases worldwide and 9.89 million cancer-related deaths (2). In clinical practice, surgical treatment, radiotherapy, and chemotherapy are the most common treatments used for malignant tumors. In recent years, targeted therapies, such as epidermal growth factor receptor (EGFR) inhibitors and immunotherapies (including PD-1 and PD-L1 inhibitors), have opened new frontiers in precision therapy, providing better treatment outcomes for specific patients (3, 4). However, the survival of patients with cancer has not improved significantly. This can be attributed to several critical potential targets that may not have been detected or fully understood, immune escape and drug resistance (5).
Accordingly, there is an urgent need to identify the important factors associated with the progression and prognosis of cancer to guide the prevention and treatment of highly heterogeneous cancer types and facilitate prognostic assessment. However, models based on traditional clinical factors, such as treatment, stage, age, and sex often perform poorly (6). Fortunately, the rapid development of genome sequencing technology has led to the generation of high-dimensional omics data encompassing various types, including single-nucleotide polymorphisms (SNPs), methylation, microRNA, mRNA, lncRNA, and proteomics data (7). Notably, previous research has shown that biomarkers screened from different omics data using appropriate biostatistical methods can reliably predict the occurrence and prognosis of various diseases (8-10).
In recent years, the development of diagnostic and prognostic interaction models for complex diseases has become a popular research topic (11). Studies have shown that diagnostic and prognostic models that only include the main effects cannot significantly improve model performance (12). The interactions are critical for the diagnosis and prognosis of complex diseases beyond the role of the main effects of genetic or environmental factors (13). In incidence risk studies, models that integrate gene-environment (G×E) and gene-gene (G×G) interactions may provide additional genetic evidence for a high incidence of cancer in specific exposure groups (14, 15). These interactions should not be ignored in prognostic studies. Integrating G×G and G×E interactions into prognostic models is important to improve model performance. For example, Marley et al. showed that the interactions between citrus intake and genes may influence melanoma risk (16). Other studies found that two-way and three-way interactions between DNA methylation and smoking significantly affected the prognosis of lung cancer, which is expected to be a methylation-specific drug target (14, 17, 18).
Models developed in the Bayesian framework that integrate interactions can utilize the qualities of Bayesian theory to provide new ideas for studying interactions (19-21). Firstly, this class of models can adaptively incorporate prior information (22, 23). Secondly, Bayesian methods can flexibly incorporate the principle of effect heredity between the main terms and interactions, and the developed models can be used to obtain good interpretability of biological mechanisms (22). Thirdly, Bayesian approaches can be combined with machine learning (ML) methods, such as neural networks and factorization machines (FMs), to develop novel approaches with higher predictive accuracy (24, 25). However, Bayesian approaches that incorporate interactions involve many challenges, such as the high dimensionality of the data and its resulting computational cost, the incorporation of dependence on the main effect and interaction (M-I), and the complexity of the posterior inference (26).
In the past years, the analysis of interactions under the Bayesian framework has been developed to allow for two-order and higher-order models (27, 28), linear and nonlinear (29), binary and count (30), with continuous traits and censored survival outcomes (22, 31). Although several Bayesian methods have been proposed for the estimation and identification of interactions, relevant comprehensive reviews are lacking. In this study, we review Bayesian approaches that involve interactions. The purpose of this review is to provide a survey of approaches that can be used within the Bayesian framework to detect G×G and G×E interactions related to the occurrence or prognosis of complex diseases.
Bayesian Approaches in Interaction Studies
We summarize four classes of approaches (SNP-based, effect heredity-based, group-based, and ML-based) within the Bayesian framework for detecting interactions associated with the occurrence and development of complex diseases. Although we focused on the Bayesian approaches used in human genetics, we also incorporated Bayesian approaches used in animal and plant genetics (32), which are closely related to human genetics. In addition, some ML-based Bayesian methods are not particularly developed for medical research; however, they are generally similar in terms of data dimensions and research strategies to the methods used in human genetics data. Therefore, they are also included in this study.
SNP-based approaches. SNP-based studies are characterized by certain unique challenges. First, SNP-based studies usually incorporate genotype SNPs with strong linkage disequilibrium (LD); hence, strong correlations may not only exist between the main terms and interactions but also among the main terms (33). Secondly, SNP data typically include a higher number of genotypes with lower frequencies, which generate predictors of near-zero variation (34). Over the past few years, a series of SNP-based Bayesian interaction analysis approaches have been developed to explore the key roles of rare haplogroups and G×G and G×E interactions on the occurrence of complex diseases. Their relevant characteristics in this study are summarized in Table I.
Summary of SNP-based Bayesian approaches for G×G and G×E interactions.
Moss et al. extended the Bayes model averaging (BMA) approach and proposed a BMA 2DF approach (35). The method used the multivariate Wald test with 2DF to test the G×E interaction and the main effects of genes. Kerin et al. proposed a Bayesian whole genome regression model to jointly model the main genetic effects and G×E interactions in a large-scale dataset (36). It built a linear environment mixed model analysis for the assumption that the main effects and the G×E interaction effects follow two separate mixtures of Gaussian priors and estimated the environmental scores (ES). The ES can be used to both estimate the proportion of phenotypic variance attributable to G×E effects and to test for G×E effects at genetic variants across the genome (36). In addition, stratification by minor allele frequency and LD can be analyzed to better explore the genetic structure of the disease. Zhang et al. proposed a Bayesian epistasis association mapping (BEAM) algorithm to identify the critical main terms, low-order and high-order interactions in genome-wide case-control studies (37). The BEAM algorithm has two components: a Bayesian epistasis inference tool that is implemented with Markov Chain Monte Carlo (MCMC) and a test B statistic that can be adapted to the correlation structure of candidate markers to assess statistical significance. In addition, they extended the BEAM algorithm in 2011 and developed the block-based BEAM2 algorithm, dividing SNPs into LD-blocks and selecting the main effects and interactions within the blocks associated with the disease (38). Wang et al. developed a Bayesian method for genome-wide association studies (GWAS) that allows for a Bayesian model on both continuous and discrete data. This method is suitable for detecting higher-order interactions associated with phenotypes (28). The above methods are mostly applicable to genome-wide association studies in large-scale cohorts.
Several Bayesian regression models have been proposed to detect variants and interactions. Biswas and the team developed a series of Logistic Bayesian LASSO (LBL) approaches, including LBL-G×E-I, LBL-G×E-D, LBL-G×E, LBLc-G×E, and LBLc-G×E-G×S to detect phenotype-related interactions in case-control studies (39-43). The three methods, LBL-G×E-I, LBL-G×E-D, and LBL-G×E, were differentiated based on whether the assumption of G-E independence was made. LBL-G×E-I assumed G-E independence in the control population (39). LBL-G×E-D allowed for G-E dependence by modeling haplotype frequencies as a function of covariates using a multinomial logistic regression model (41). LBL-G×E allowed for uncertainty in the assumption of G-E independence by allowing Markov chains to move between LBL-G×E-I and LBL-G×E-D via the reversible jump MCMC (41). LBLc-G×E and LBLc-G×E-G×S were developed to accommodate complex sampling designs (43). LBLc-G×E incorporated stratified variables as covariates with their main effects in the model, whereas LBLc-G×E-G×S incorporated the interaction between stratified variables and haplotypes in the model (43).
Effect heredity-based approaches. The effect heredity is widely used in studying interaction identification. It incorporates the main terms and interaction dependencies into the analysis (27). Strong heredity indicates that the interaction can only be active if all its main effects are active, whereas weak heredity indicates that the interaction can be active if one of the two main effects is active (27). Models that follow the hereditary principle are generally easier to interpret (44). These models allow us to reliably estimate the main and interaction effects, which increase the power for detecting real signals (30). In contrast, models that violate the hereditary principle are unreliable and perform poorly (31, 45). To address the problem of variable selection in interaction studies, two priors dominate the Bayesian literature: spike-and-slab (also referred to as stochastic search variable selection, SSVS) and shrinkage priors (46-49). The spike-and-slab prior is a mixture of two discrete distributions that can be assigned to the model parameters using an indicator variable with different distributions (spike or slab prior distributions) to fit different effects (46). The shrinkage prior allows the regression coefficients to be constrained by imposing a continuous prior distribution (47). The relevant characteristics of these studies are summarized in Table II.
Summary of effect heredity-based Bayesian approaches for G×G and G×E interactions.
Spike-and-slab prior. Chipman developed priors for interaction identification based on the spike-and-slab prior and two alternative priors, and followed the effect heredity principle (27, 50, 51). This approach has been extended and applied to case-control studies. Wakefield et al. established a special case of the Bayesian mixture model of G×E (46). Their study included a small number of statistically significant environmental variables. Therefore, it was assumed that the main effect of environmental variables was activated, and only the dependence between the main effect of the genes and the interaction effects needed to be considered (46). Liu et al. proposed a Bayesian hierarchical mixture model that incorporates the effect heredity principle to enhance the study of genetic and environmental main effects as well as G×E and G×G interactions (52).
Researchers have conducted numerous innovative studies to explore the main interaction effects associated with continuous response variables. Im et al. developed a Bayesian approach for a finite mixture regression based on environmental factors, imaging features, and imaging × environment interactions, following the effect heredity principle (53). In addition, another study expanded the method to the least absolute deviation regression to construct a robust Bayesian variable selection approach that can effectively account for heavy-tailed errors and outliers in the response variable while following the heredity of M-I effects (54). Zhao et al. proposed a Bayesian interaction selection model, which worked not only for continuous response but also for qualitative response variables. Furthermore, the model could accurately identify the primary effect features and interactions under hereditary conditions (55). Prior distribution can accommodate correlations between brain topological information and modalities to improve biological plausibility. Qin et al. developed a structured Bayesian approach using a linear model that considered both low- and high-level interactions (31). They integrated the “main effects/interactions-network” information by constructing an adjacency matrix.
They further developed a two-level Bayesian interaction analysis method based on a log-normal accelerated failure time model (AFT) (22). This method allowed the analysis of both low-level gene-gene interactions and high-level pathway-pathway interactions in the model. To improve the interpretability of the results, the effect heredity principle was followed not only between M-I but also between low-level gene-gene interactions and high-level pathway-pathway interactions. Thus, high-level pathway-pathway interactions could only be selected if at least, one low-level gene-gene interaction was selected.
The effect heredity-based approach utilizing spike-and-slab prior had been extended in other fields. For example, a Bayesian model that allowed variable selection and followed the principle of effect heredity was proposed by Lin et al. This model can be used for any split-plot and blocked screening design (56). In addition to the main effects, two-way interaction effects and quadratic effects were included in the model. Particularly, Lin et al. developed strong effect heredity models and two types of weak effect heredity models: the strict weak and relaxed weak effect heredity models.
Shrinkage prior. Griffin et al. developed a simple method to constrain the hierarchical relationship between M-I using strong and weak effect heredity (47). They incorporated the local shrinkage parameters of the two main effects into the local shrinkage parameters of the interaction. In the case of strong heredity, the local shrinkage parameters of the interaction term tended to be large when the local shrinkage parameters of the two main effects were large. In the case of weak heredity, the local shrinkage parameter of the interaction term tends to be large when one of the local shrinkage parameters of the two main effects is large. This study also incorporated these hierarchical priors into a generalized additive model. Yi et al. developed Bayesian models based on logistic and Probit regression, using different shrinkage parameters of Student-t prior distributions for the main and interaction effects, in order to follow the relationship between the main and interaction effects, and implemented a fast and stable expectation-maximization (EM) algorithm for fitting models (30).
Group-based approaches. An alternative analysis strategy commonly used to identify interactions is a group-based approach. Group-based Bayesian approaches employ two types of analysis strategies. The first is that once a group is selected, all variables in that group are included in the final model. Gu et al. proposed a Bayesian two-step Lasso approach to select the main terms and interactions for censored survival outcomes (57). In the first step, the Bayesian group Lasso was used to select the important marker groups, and in the second step, Bayesian adaptive Lasso was used to identify individual biomarkers. Groups were selected when both the main terms and interactions were statistically significant. The Bayesian group Lasso not only serves as an initial screening step to reduce the parameter space but also has a similar role to the strong effect heredity principle for variable selection, as proposed by Chipman (27, 57).
The other analysis strategy is that whether the group is activated or not is solely determined by a few variables within the group. In other words, for an active group, not all variables within that group are active. For example, Chen et al. developed a Bayesian sparse group selection approach that used two nested binary indicator variables to indicate whether the group and variables were active or not (58). When none of the variables within the group were active, the group was not activated. However, when a variable within the group was active, the group was activated. Although the analysis strategy was not specifically developed for detecting interactions, it can incorporate the interaction and its two main effects within each group. Similarly, other group-based Bayesian variable selection approaches can be used as alternatives to the Bayesian group Lasso and Bayesian sparse group selection approach (59). Several studies employing systematic reviews have been conducted for group-based Bayesian variable selection approaches (55).
ML-based approaches. Neural networks, decision tree models, regression tree models, linear kernel functions, nonlinear kernel functions, and FMs are commonly used ML techniques (60, 61). Generally, models based on ML techniques have a higher prediction accuracy. Novel methods constructed by combining the Bayesian framework with these methods are highly competitive in detecting low- and high-order linear or non-linear interactions. The relevant characteristics of these studies are summarized in Table III. A neural network is a powerful predictive method in ML, and has the potential to identify interactions (62). Cui et al. developed a Bayesian neural network based on neural networks and Hessian to estimate global interactions by aggregating local interactions from trained Bayesian neural networks, thereby improving the estimation accuracy of the interaction effects (25).
Summary of ML-based Bayesian approaches for G×G and G×E interactions.
Du et al. developed a Bayesian decision tree method to detect low-order interactions (63). It solved the general problem of detecting spurious interactions by introducing Dirichlet process forests. The Bayesian additive regression tree (BART) is a nonparametric ensemble ML model that can reliably estimate regression relationships and has good prediction accuracy when non-linear relationships and interactions are present (29, 64). Spanbauer et al. developed a mixed BART model to achieve the goal of precision medicine by extending the BART with a generalized random effect matrix (65). Zeldow et al. developed a new Bayesian semi-parametric model based on BART (semi-BART) to obtain an interpretable coefficient (66).
In previous statistical genomics studies, kernel machine regression (KMR) was mainly used to test the overall effect of a genetic pathway or to study the role of a gene in the presence of possible interactions (67, 68). Bobb et al. developed a hierarchical variable selection method in the Bayesian KMR that can efficiently handle highly correlated predictors (69). Agrawal et al. developed a kernel interaction trick that computes the exact posterior of the main effects and the interactions between selected main effects with reduced runtime by orders of magnitude over MCMC applications (70). Furthermore, the interaction could handle not only pairwise interactions but also higher-order multi-way interactions.
FMs are generic-supervised learning methods widely used in ML to efficiently model feature interactions (71). Chen et al. proposed a Bayesian feature interaction selection method based on FMs. This method effectively reduced the number of interactions and extended the model to detect higher-order interactions in higher-order decomposers (24). Significantly, the method incorporates the principle of effect heredity into a second-order interaction study.
Madhukar et al. developed an efficient and accurate platform, BANDIT, based on a Bayesian ML approach to identify drug interaction targets, accelerate drug discovery, and guide clinical application (72). The researchers tested BANDIT and showed that it achieved a high accuracy in identifying shared target interactions and discovering novel potential targets for cancer therapies.
Discussion
As can be observed, several Bayesian approaches are available to detect G×E or/and G×G interactions using omics data. By summarizing the existing literature, the current study is particularly useful for researchers performing interaction studies. SNP data is a special kind of omics data with a strong LD between SNPs, and usually includes a higher number of genotypes with lower frequencies (73, 74). Therefore, we first summarized the Bayesian interaction approaches that specifically handle SNPs. Although statistical approaches, such as LBL, are well developed and demonstrate good performance, two core problems remain that are not solved in the existing Bayesian approaches. First, many exposures change over time (e.g., environmental tobacco smoke and pharmaceutical drugs) (75) and the nature of exposure could be related to the G×E interaction. Thus, the G×E interaction may be dynamically changing, and to the best of our knowledge, present Bayesian approaches do not incorporate them. Second, millions of SNPs exist in humans. When only second-order interactions are considered and higher-order interactions are exempted, it can pose a huge computational challenge owing to the existence of hundreds of millions of G×G SNPs (76).
We further summarize other Bayesian approaches for detecting interactions, including effect heredity-based and group-based approaches. Previous studies improved the spike-and-slab and shrinkage priors, and the strong and weak effect heredity of the main terms and interactions were incorporated into the effect heredity-based approaches. However, certain challenges persist. Reasonable values of the hyperparameters in spike-and-slab priors are crucial for identifying and estimating variables. The “semi-automatic” procedure for hyperparameter selection has been proposed; however, in the case of high-dimensional and highly correlated predictors, it needs to be discussed extensively (19, 50, 77). When using the shrinkage prior with correlated predictors, the multimodality of the posterior may lead to sampling difficulties, particularly slow convergence of the MCMC (78). Furthermore, some existing studies on effect heredity require further studies because they have only incorporated the dependence between the main effects and two-way interactions and did not involve the dependence between terms and three-way interactions. Some group-based approaches perform similarly to strong effect heredity-based approaches (57, 79). While the details of these approaches are different, in many cases, there are close conceptual links between them. However, group-based approaches may involve a more complex structure. To summarize, the group-based and effect heredity-based approaches have improved the interpretability of results and have accurately detected interactions of biological significance that affect complex diseases. However, they may perform poorly in terms of prediction accuracy compared with the ML-based approaches.
Finally, we summarized the Bayesian approaches for interaction detection based on ML. ML-based approaches mainly emphasize improving the predictive accuracy of the models (80); hence, the interpretability of the results is poor. Only the Bayesian feature interaction selection method based on FMs incorporates the heredity of the main effects and two-way interactions (24). In addition, the prediction accuracy of ML methods is influenced by the tuning parameters, which may require professional skills and is time consuming (81). Moreover, the application of methods such as neural networks in the detection of complex disease interactions is limited because they are generally processed as black-boxes (82). In the future, new Bayesian ML-based approaches in which interactions are learned using white-box ML models could be considered.
Conclusion
In conclusion, novel biostatistical methods are expected to eliminate the above limitations for future Bayesian approaches to G×G and G×E interactions. Methods based on a Bayesian framework may provide more accurate results, drive ongoing advances in interaction detection, and explain missed heritability during the development of complex diseases. More importantly, they can provide a novel approach for theoretical studies in areas such as precision medicine. Unfortunately, Bayesian approaches are relatively rare to detect the interactions of censored survival outcomes. This is because the complex survival time distribution and hazard function make it difficult to infer the posterior distribution via the conjugate distribution. Unlike the classical estimation methods, Bayesian approaches based on the Cox proportional hazards model are unable to eliminate the baseline hazard, and the posterior sampling and convergence are more challenging. Therefore, future exploration is required.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Number 81973143.
Footnotes
Conflicts of Interest
The Authors declare no conflicts of interest.
Authors’ Contributions
Na Sun: Collection and writing. Yu Wang: Collection. Jiadong Chu and Qiang Han: Drawing of tables. Yueping Shen: Designed and critically revised the work. All Authors have participated actively in this study and agree to the content of the manuscript and being listed as authors on the paper.
- Received September 19, 2023.
- Revision received October 30, 2023.
- Accepted November 1, 2023.
- Copyright © 2023, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC-ND) 4.0 international license (https://creativecommons.org/licenses/by-nc-nd/4.0).