Abstract
Background/Aim: In the field of cancer research, reconstructing clonal evolution is of major interest. The technique provides new insights for analysis and prediction of tumor development. However, reconstruction based on mutational data is characterized by several challenges. Materials and Methods: By performing extensive literature research, we identified 51 currently available tools for reconstructing clonal evolution. By analyzing two cancer data sets (n=21), we investigated the applicability and performance of each tool. Results: Seventeen out of 51 tools could be applied to our data. Correct clustering of variants can be observed for 4 patients in the presence of ≤3 clusters and ≥5 time points. Correct phylogenetic trees are determined for 10 patients. Accurate visualization is possible when adjustments are applied to the original algorithms. Conclusion: Despite bearing considerable potential, automatic reconstruction of clonal evolution remains challenging. To replace tedious manual reconstruction, further research including systematic error analyses using simulation tools needs to be conducted.
According to the International Agency for Research on Cancer, more than 19 million new cases of cancer were registered in 2020. Furthermore, almost 10 million deaths were recorded (1). For many types of cancer, precise mutational characterization is of major interest: For Burkitt lymphoma (BL), for example, translocations affecting the MYC gene are considered a biological hallmark and essential in terms of diagnosis (2). Furthermore, recent data shows that relapse may be associated with a deficiency of TP53 (3).
Another example is myeloid neoplasia, including acute myeloid leukemia (AML) and myelodysplastic syndromes (MDS). These heterogeneous hematopoietic stem cell disorders are characterized by the acquisition of somatic alterations and clonal evolution (4-6). To optimize treatment selection, patients are stratified, among others, according to their genetic alterations (7). Recent data even shows clusters within lower risk MDS patients based on their mutational profiles and association of these clusters with clinical endpoints (8).
To determine genetic alterations, a variety of laboratory techniques is currently available. Small variants are usually investigated via bulk DNA analysis using next-generation sequencing (NGS) techniques. Over the last 20 years, the costs for sequencing have been reduced significantly (9), allowing not only targeted sequencing but also whole-exome sequencing (WES) and even whole-genome sequencing (WGS) data to be generated for one patient at several time points in the course of the disease. For the detection of large variants, like copy-number variants (CNVs), a variety of techniques is available: karyotyping, fluorescent in situ hybridization (FISH), SNP arrays, array-CGH and – with limitations – also NGS.
Over the last years, the detection of variants has been greatly optimized (10-12). However, for the combined evaluation of mutations and their development over time in terms of clonal evolution, many studies still rely on manual analyses (3, 13).
Currently, a wide set of algorithms is available performing automatic analysis of clonal evolution. These algorithms can be split according to the three main tasks in reconstructing clonal evolution: 1) clustering variants by cell fraction, 2) reconstructing the clonal evolution tree, 3) visualizing clonal evolution.
Analyzing two sets of well-characterized, publicly available real data, we performed a systematic evaluation of currently available tools for automatic reconstruction of clonal evolution, considering all three categories. Data set 1 covers 10 BL patients – 5 with and 5 without relapse (3). Data set 2 covers 11 MDS patients – 6 patients receiving supportive care and 5 patients treated with lenalidomide (13). In both data sets, cases of linear as well as branched evolution can be observed.
Materials and Methods
Data sets analyzed. Two cancer data sets were analyzed. Their main characteristics are summarized in Table I. Data set 1 covers 10 patients with BL – 5 patients with relapse (germline + 2 additional time points) and 5 patients without relapse (germline + 1 time point). For all patients, clonal evolution based on WES, targeted sequencing, Sanger sequencing, SNP array analysis (Infinium OmniExpressExome-8v1.3kit) and FISH has been reconstructed manually. Results on clonal evolution, as well as detailed variant calling information (including coverage and variant allele frequency – VAF), have been published (3); the original FASTQ files of the WES experiments are available at the Sequence Read Archive (PRJNA561490).
Table I. Overview of the two cancer data sets analyzed and their main characteristics.
The data set can be considered a typical, representative use-case for the application of a pipeline performing automatic reconstruction of clonal evolution: a variety of data sources is available. However, data has only been collected at a few time points. The number of clones is relatively high (up to 17). In 1 out of 10 cases, manual reconstruction of clonal evolution did not lead to a unique result on the basis of available data.
Data set 2 covers 11 patients with MDS – 6 patients receiving supportive care and 5 patients treated with lenalidomide. To reconstruct clonal evolution, WES, targeted deep sequencing, karyotyping, CytoScan HD Array (Affymetrix) and FISH data have been integrated (13). Results on clonal evolution, as well as variant calling information (including coverage and VAF), are available along with the publication; FASTQ files of the NGS experiments are available at the Sequence Read Archive (PRJNA355124).
This second data set poses another potential challenge for automatic reconstruction of clonal evolution: A considerably higher number of time points (up to 30) is available compared to data set 1. Fewer clones are detected (up to 9); however, clonal evolution itself is highly complex. Cases with parallel dependent evolution, including several branches, as well as branched independent evolution are present.
Algorithms evaluated. We searched PubMed for algorithms connected to clonal evolution (clonal evolution AND algorithm). Additionally, we considered algorithms referenced in papers identified by PubMed search. We focused on approaches handling bulk DNA-seq data and – ideally – additionally considering information on CNVs. Any approach requiring single-cell data was not taken into account. Furthermore, algorithms just analyzing CNVs detected in e.g. array data were not considered.
Altogether, we identified 51 algorithms (Figure 1). These can be split into three categories according to the three main tasks in reconstructing clonal evolution: 1) clustering of variants, 2) reconstructing the clonal evolution tree, 3) visualizing clonal evolution.
Figure 1. Overview of algorithms identified for clonal evolution analysis. Approaches are classified according to the three main tasks analyzing clonal evolution: clustering of variants, reconstruction of clonal evolution trees and visualization.
As a first step, variants detected in genomic data are clustered. Thereby, information on variants can originate from different sources, e.g., NGS, array, FISH or karyotyping. Variants detected by NGS are characterized by coverage and VAF. Thus, the cancer cell fraction (CCF), i.e., the percentage of cells affected by a mutation, is not directly available and has to be inferred, considering overlapping copy number variants (CNVs) and the most likely genotype. Variants detected by array data are – in case of SNP-arrays – characterized by rough estimates of the CCF based on observed changes in the Log R Ratio (LRR) and the beta allele frequency (BAF). These calls are lacking information on coverage. Variants detected by FISH and/or karyotyping are characterized by estimates of CCF, based on the number of evaluated cells.
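As a rough illustration of why the CCF has to be inferred rather than read off the VAF, the following Python sketch applies a commonly used simplification. It is not taken from any of the evaluated tools. The function name `vaf_to_ccf` and the parameters `purity`, `tumor_cn`, `normal_cn` and `multiplicity` are assumptions introduced here, and the copy-number state is assumed to be present in all tumor cells.

```python
def vaf_to_ccf(vaf, purity=1.0, tumor_cn=2, normal_cn=2, multiplicity=1):
    """Estimate the cancer cell fraction (0..1) from a variant allele frequency.

    Assumptions (illustrative only): the copy-number state `tumor_cn` applies
    to all tumor cells, and each mutated cell carries `multiplicity` mutated
    copies of the locus.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must lie between 0 and 1")
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    if not 1 <= multiplicity <= tumor_cn:
        raise ValueError("multiplicity must lie between 1 and tumor_cn")

    # Average number of copies of the locus per cell in the mixed sample.
    average_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    ccf = vaf * average_copies / (purity * multiplicity)

    # Sampling noise can push the estimate above 1; cap it at 100% of cells.
    return min(ccf, 1.0)
```

Under these assumptions, a heterozygous variant in a diploid region with a VAF of 25% corresponds to a CCF of 50% (`vaf_to_ccf(0.25)` returns `0.5`), whereas the same VAF in a region with a different copy number leads to a different estimate.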
Algorithms performing clustering ideally support variants characterized by VAF as well as variants characterized by CCF as input and automatically perform all necessary transformations. Altogether, we identified 17 algorithms performing clustering of variants. Ten out of 17 were, however, excluded due to reasons of unavailability, unresolvable error messages or having the wrong scope (see Figure 1, see Supplementary Information, section 1.1 for information on the precise reasons and – if applicable – the precise error messages). The remaining tools are CloneHD (14), clonosGP (15), DeCiFer (16), PyClone (17), PyClone-VI (18), QuantumClone (19) and sciClone (20). Additionally, we identified 23 algorithms performing both clustering and tree reconstruction. Nineteen out of 23 were excluded due to reasons of unavailability, unresolvable error messages, having the wrong scope or missing documentation (see Figure 1, see Supplementary Information, section 1.2). The remaining 4 [Canopy (21), Cloe (22), LICHeE (23) and SPRUCE (24)] were, together with the 7 tools only performing clustering, considered for detailed evaluation.
As a second step, subsequent to clustering, actual tree reconstruction is performed: Clustering results are taken as input. Every cluster represents one clone/node in the clonal evolution tree. By analyzing the development of the different clones over time, the most likely underlying clonal evolution tree is determined.
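The idea of deriving a tree from the development of clones over time can be sketched as a brute-force search. The Python example below is a minimal illustration and does not reproduce any of the evaluated tools. It assumes that a parent clone must be at least as frequent as each child at every time point, that the children of a clone together must not exceed the parent's CCF, and that a hypothetical virtual root `"normal"` (all cells, CCF 1.0) allows several independent founding clones. The names `enumerate_trees`, `ROOT` and `tol` are introduced here for illustration only.

```python
from itertools import product

ROOT = "normal"  # hypothetical virtual root: all cells, CCF 1.0 at every time point


def enumerate_trees(ccf, tol=0.02):
    """Return every parent map (clone -> parent) consistent with the CCFs.

    `ccf` maps each clone to its list of CCFs (0..1), one value per time point.
    """
    clones = sorted(ccf)
    n_tp = len(ccf[clones[0]])
    full = dict(ccf)
    full[ROOT] = [1.0] * n_tp

    def can_contain(parent, child):
        # A parent must be at least as frequent as its child at all time points.
        return all(full[parent][t] + tol >= full[child][t] for t in range(n_tp))

    candidates = [
        [p for p in [ROOT] + clones if p != c and can_contain(p, c)]
        for c in clones
    ]

    def reaches_root(parent_of, clone):
        # Follow parent links; needing more steps than clones means a cycle.
        for _ in range(len(clones) + 1):
            if clone == ROOT:
                return True
            clone = parent_of[clone]
        return False

    def sum_rule_holds(parent_of):
        # The children of a clone together must never exceed the clone's CCF.
        for p in full:
            kids = [c for c in clones if parent_of[c] == p]
            for t in range(n_tp):
                if sum(full[k][t] for k in kids) > full[p][t] + tol:
                    return False
        return True

    trees = []
    for choice in product(*candidates):
        parent_of = dict(zip(clones, choice))
        if all(reaches_root(parent_of, c) for c in clones) and sum_rule_holds(parent_of):
            trees.append(parent_of)
    return trees
```

For three hypothetical clones with CCFs `[0.9, 0.8]`, `[0.5, 0.7]` and `[0.1, 0.6]`, only the linear tree remains; if only the first time point is supplied, three trees are compatible with the data. The exhaustive search grows exponentially with the number of clones, so it is only meant to convey the principle.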
Altogether, we identified 7 tools with the main functionality of clonal evolution tree reconstruction. Three out of 7 were excluded due to reasons of unresolvable error messages or having the wrong scope (see Figure 1, see Supplementary information, section 1.3). The remaining tools are ClonalTree (25), ClonEvol (26), SCHISM (27) and TRaP (28). Additionally, tools providing a full analysis pipeline, performing clustering as well as tree reconstruction, were taken into account: Canopy, Cloe, LICHeE and SPRUCE. Thus, a total of 8 tools was considered for detailed evaluation.
As a third and final step, we consider visualization of clonal evolution. Most algorithms performing clonal evolution tree reconstruction provide a basic visualization of the determined tree. However, we concentrate on specialized tools for visualization in the field of clonal evolution. We require an approach to display the development of all clones over time, to show the precise tumor load at every time point, and to provide an option for labeling clones according to their precise mutations. To visualize clonal evolution, we identified 4 tools. However, as they did not perform the required tasks, 2 of the 4 tools were excluded (see Figure 1, see Supplementary Information, section 1.4). For the remaining 2 tools, fishplot (29) and timescape (30), a detailed evaluation was performed.
In Table II an overview of all algorithms considered for detailed evaluation on the basis of two real data sets is provided. For every tool, the precise command executing analysis is provided in Supplementary Information, sections 1.1 to 1.4, exemplarily considering data set 1, patient 1.
Table II. Overview of all tools considered for reconstructing clonal evolution based on two real data sets. Tools are sorted according to their main functionality: clustering, tree reconstruction and visualization. C: Clustering; T: tree reconstruction; V: visualization.
Results
Clustering. As an essential prerequisite to clonal evolution tree reconstruction, clustering of variants is performed. All variants with similar CCFs are expected to be clustered together, assuming that they represent one clone. We focus our analysis on algorithms clustering SNVs and small indels, ideally also taking the underlying copy number into account. We could not identify any algorithm that clusters both small variants originating from NGS data and non-overlapping CNVs originating from array data, FISH and/or karyotyping.
We evaluate 11 tools for variant clustering: CloneHD, clonosGP, DeCiFer, PyClone, PyClone-VI, QuantumClone, sciClone, Canopy, Cloe, LICHeE and SPRUCE. Tools show considerable differences in the approach applied for clustering (see Table II) and, as a result, the required input data.
CloneHD evaluates coverage information (number of reads carrying the alternative allele and sequencing depth) of SNVs/indels, optionally also considering read depth and B-allele count data. However, as read depth profiles are commonly considered in the context of WGS data and our data is mainly characterized by targeted sequencing and partly WES data, we restrict the analysis with CloneHD to SNV/indel information.
Canopy allows for considering coverage information on SNVs/indels and underlying CNVs. However, as the corresponding function leads to errors when applied to our data, we perform analysis on the basis of SNV/indel data only (see Supplementary Information, section 1.2.1).
Cloe considers, to our knowledge, only coverage information on SNVs/indels. Similarly, LICHeE is just taking information on the variants’ VAFs as input.
ClonosGP, PyClone, PyClone-VI, QuantumClone and sciClone evaluate information on coverage of SNVs/indels as well as the underlying copy number in tumor vs normal samples (deletions, amplifications and loss of heterozygosity – LOH). This approach also allows for the correct handling of variants on the X- and Y-chromosome.
DeCiFer and SPRUCE evaluate not only information on SNVs/indels and the underlying copy number, but also the fraction of cells in which this copy number is observed. Thus, DeCiFer and SPRUCE are the only tools that do not model CNVs as binary events (present in all or no cells), but instead model the actual fraction of affected cells.
Clustering results for all tools and samples are summed up in Table III. Detailed results on each variant and its assigned cluster are available in Supplementary Data 1 (data set 1) and Supplementary Data 2 (data set 2).
Table III. Number of clusters for 11 tools performing clustering of variants in comparison to true clusters. Results marked in bold indicate perfect results, not just concerning the number of clusters, but also including assignment of the right variant to the right cluster.
It can be observed that clusters reported by the 11 approaches show major differences compared to true clustering results. For data set 1, no tool succeeds in clustering variants correctly for any sample. For data set 2, three tools – clonosGP, Canopy and LICHeE – report correct clustering for 2 out of 11 samples. Five additional tools – PyClone, PyClone-VI, QuantumClone, sciClone and Cloe – perform correct clustering for 1 out of 11 samples.
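To make the notion of a correct clustering concrete, the following Python sketch checks whether a reported clustering matches the reference exactly, independent of the arbitrary cluster labels used by a tool. This is an illustrative check, not the evaluation script used for this study; the function name `same_partition` and the variant identifiers are hypothetical.

```python
def same_partition(predicted, truth):
    """Return True if both clusterings group the variants identically.

    Both arguments map a variant identifier to a cluster label. Labels may
    differ between the two clusterings; only the grouping is compared.
    """
    if predicted.keys() != truth.keys():
        return False  # a tool dropped or added variants

    def groups(assignment):
        by_label = {}
        for variant, label in assignment.items():
            by_label.setdefault(label, set()).add(variant)
        return {frozenset(members) for members in by_label.values()}

    return groups(predicted) == groups(truth)
```

A result with the correct number of clusters but swapped variants fails this check, which corresponds to the distinction between matching cluster counts and perfect results.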
Data set 1 is characterized by a high number of clusters (7 to 17) and a low number of time points (1 to 2). It is expected that this situation is difficult to handle for all clustering algorithms. In general, the number of clusters is underestimated by the applied tools. SPRUCE fails to report any results for any of the samples. CloneHD reports the number of clusters, but output files do not allow for the assignment of the variants to a cluster.
Detailed evaluation of the clustering results shows that clonosGP, DeCiFer, PyClone, PyClone-VI, Cloe and LICHeE tend to cluster variants similar to biological truth, but at a coarser granularity. However, there are exceptions to this observation. For patient 4, for example, all tools except for clonosGP report completely mixed-up clusters. QuantumClone and Canopy, on the contrary, show considerable differences compared to biological truth for all patients. In some cases, variants from the most distant original clusters are put together, while variants from the same cluster are split. SciClone reports only one cluster for a majority of patients. However, most variants are excluded by the algorithm as they do not fit this clustering. The single cluster reported for patient 8, for example, contains only 1 out of 17 variants.
Detailed evaluation of data set 2 shows similar results. Again, clonosGP, DeCiFer, PyClone, PyClone-VI, Cloe and LICHeE show clusters similar to, but less finely graduated than, the true clusters. However, the performance of DeCiFer appears slightly inferior as several mixed-up clusters can be observed (UPN04, UPN09, UPN11 and to a lesser extent UPN02, UPN03, UPN06, UPN07). As in the case of data set 1, QuantumClone reports mixed-up clusters for several samples. However, for five samples, only a single cluster is reported. Given the sometimes large differences in CCF (e.g., UPN01 time point 3: CCF(TP53)=0%, CCF(PALLD)=75%), this observation is surprising. Of note, improved performance of Canopy can be observed compared to data set 1. Clustering results are close to true clustering and – for UPN01 and UPN06 – perfectly match biological truth.
For data set 2, systematic underestimation of the true number of clusters can no longer be observed except for sciClone. The highest number of clusters is reported by SPRUCE. Detailed evaluation of the results shows that only part of the SNVs/indels present in the data is actually considered by the approach. Furthermore, no clustering is performed. Instead, every variant represents a single cluster. It appears likely that due to the high number of mutations characterizing patients in data set 1, SPRUCE fails to generate any results.
For Canopy and Cloe we provide the true number of clusters as an input parameter. It can be observed that Cloe fails to report the pre-defined number of clusters in all but 4 cases (all in data set 2). For Canopy, the determined number of clusters corresponds to the pre-defined one for 1 out of 10 samples in data set 1 and for 10 out of 11 samples in data set 2. Of note, the actual label of the highest clone, e.g., clone 17 for data set 1, patient 1, is always reported by Canopy. However, several intermediate clones are missing.
Tree reconstruction. Based on clustered variant calls, a clonal evolution tree can be determined. We distinguish between algorithms providing a full pipeline for variant clustering and subsequent tree reconstruction based on the tools’ own clustering results and algorithms only performing tree reconstruction. For each type, we identified four approaches: Canopy, Cloe, LICHeE and SPRUCE provide a full pipeline, while ClonalTree, ClonEvol, SCHISM and TRaP only provide an approach for tree reconstruction.
For Canopy, Cloe, LICHeE and SPRUCE no additional input has to be defined to allow for tree reconstruction. Analyses are performed on the initially provided input characterizing SNVs/indels and information on the underlying copy number in the case of SPRUCE.
ClonalTree and TRaP do not take any information on detected variants as input. Instead, clusters and their CCFs at every time point are directly evaluated. TRaP additionally provides an option to specify the estimated CCFs’ errors.
By contrast, ClonEvol analyzes information on every variant, their assigned cluster and their CCF at every time point. Similarly, SCHISM requires information on every SNV/indel, their coverage, the underlying copy number and the assigned cluster as input.
The mechanisms of clonal evolution, determined by the 8 tools, are – for all samples – summed up in Table IV. Detailed results on each clone and its determined parent are available in Supplementary Data 3 (data set 1) and Supplementary Data 4 (data set 2).
Table IV. Mechanism of clonal evolution determined by 8 tools in comparison to true clonal evolution. Results marked in bold indicate perfect results, not just concerning the general mechanism, but also including correct ordering of the clones. D: Branched dependent evolution; I: branched independent evolution; L: linear evolution; N: no evolution; *two to three versions reported; **more than three versions reported; ***more than 100 versions reported; A: all possible versions reported.
For all tools just providing a tree reconstruction approach – ClonalTree, ClonEvol, SCHISM and TRaP – we provide the correct clustering of variants as input. Still, the correct mechanism of clonal evolution cannot be determined in all cases. ClonalTree succeeds in 5 out of 22 (counting data set 1, patient 8 twice due to two models being possible; no tree can be determined for 7 samples), ClonEvol in 8 (no tree for 12 samples), SCHISM in 5 (no tree for 1 sample) and TRaP in 9 (no tree for 1 sample).
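The mechanism labels used for comparison (N, L, D, I) can be derived from a reconstructed tree. The Python sketch below shows one possible derivation. The definitions are assumptions made for illustration: a single clone counts as no evolution, more than one founding clone as branched independent evolution, a single founding clone with at least one branching point as branched dependent evolution, and anything else as linear evolution. The function name `classify_mechanism` is hypothetical.

```python
from collections import Counter


def classify_mechanism(parent):
    """Classify a tree given as a map clone -> parent (None for a founding clone).

    Returns "N" (no evolution), "L" (linear), "D" (branched dependent)
    or "I" (branched independent), using the assumed definitions above.
    """
    if len(parent) <= 1:
        return "N"

    founders = [clone for clone, p in parent.items() if p is None]
    if len(founders) > 1:
        return "I"

    # Count how many direct children each clone has.
    n_children = Counter(p for p in parent.values() if p is not None)
    return "D" if any(n > 1 for n in n_children.values()) else "L"
```

Two trees may receive the same label although the ordering of their clones differs, so matching labels are necessary but not sufficient for a perfect result.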
It can be observed that all tools except for SCHISM show especially good performance considering only a single time point (data set 1, patients 6 to 10). ClonEvol and TRaP even succeed in identifying both possible models for patient 8 – linear as well as branched dependent evolution. SCHISM, however, mistakes the ordering of clones in 4 out of 5 patients.
For data set 1, patients 1 to 5, the correct clonal evolution tree is not reconstructed by any tool. For data set 2, correct results can partly be observed for UPN01, UPN02, UPN03, UPN04, UPN06 and UPN11. However, no tool succeeds in reporting the correct results for all 6 patients. Of note, branched independent evolution (UPN08, UPN09, UPN10) is never determined correctly.
When reconstructing clonal evolution manually, the presence of a high number of time points eases the process considerably. Usually, it is possible to retrieve unique solutions. When determining clonal evolution trees automatically, however, data indicates that many time points rather lead to inferior results. TRaP tends to report a high number of possible solutions, partly >100. SCHISM similarly reports several possible solutions or – in two cases – even reconstructs a fully connected graph. By contrast, ClonalTree and ClonEvol report no solution as no converging tree can be found.
For tools providing a full pipeline for reconstruction of clonal evolution – Canopy, Cloe, LICHeE and SPRUCE – the true clonal evolution can only be reconstructed for two patients: data set 2, UPN01 and UPN04. Of note, UPN04 actually does not show any clonal evolution as all variants form a single cluster, showing minor changes in frequency over time. UPN01 is characterized by the presence of only 2 clones, 15 time points and linear clonal evolution. Canopy and LICHeE report the correct tree. For UPN06 Canopy reported the correct clustering. The tool correctly determines linear clonal evolution. However, ordering of the clones does not match biological truth. For all other patients and tools, clustering does not lead to the correct result, which is why the correct tree cannot be determined.
Visualization. Numerous approaches exist for visualizing clonal evolution. A basic approach considers common trees, with nodes representing clones and edges indicating their evolutionary relation. In this work, however, we focus on plots displaying detailed information on clonal evolution, including the development of all clones over time and clonal prevalence. We identified and evaluated two tools performing this task: fishplot and timescape. The clonal evolution plots generated with fishplot and timescape for a representative example of linear and branched dependent evolution each, evaluating data set 1, are provided in Figure 2. Results for all samples are available in Supplementary Information, section 2.1.
Figure 2. Visualization of clonal evolution analyzing two representative examples of data set 1 (linear: patient 4; branched dependent: patient 5). (A) Visualization of linear evolution using fishplot. (B) Visualization of linear evolution using timescape. (C) Visualization of branched dependent evolution using fishplot. (D) Visualization of branched dependent evolution using timescape.
Comparing visualization of linear clonal evolution in the presence of few time points and many clones with fishplot (Figure 2A) to timescape (Figure 2B), major differences can be observed. The patient is characterized by 10 clones, one only present at the second time point. In the plot generated with fishplot, however, only 7 clones are visible. The time between two emerging clones continuously decreases. According to the figure, the first gray clone is exclusively present for a long time. Then, within short time, clones 2 to 9 appear. With respect to time point 2, the newly emerging clone 10 is not visible at all.
Different from fishplot’s visualization, clones are well separated in the plot generated by timescape. Instead of decreasing time spans, separating two emerging clones, equidistant development is assumed. As a result, all clones present at time points 1 and 2 can be identified in the plot. Additionally, a basic tree visualizing clonal evolution, using matching colors, is provided next to the main plot.
Similar results can also be observed when considering an exemplary case of branched dependent evolution in the presence of a few time points and many clones. The visualization by fishplot (Figure 2C) does not allow for the identification of all 17 clones. Instead, only 9 clones can be identified. None of the 5 clones characterizing time point 2 are visible. In the plot generated by timescape, however, all 17 clones are distinguishable. Additionally, the tree plot allows for quick identification of the underlying evolutionary model.
Despite this disadvantage of fishplot, it is possible to manually improve visualization using additional auxiliary time points, as previously described by Reutter et al. (3). These time points can be set to force fishplot towards visualization of equidistant development. Exemplary output for all patients in data set 1 is provided in Supplementary Information, Figures S1 to S11.
Considering visualization of clonal prevalence, a disadvantage of timescape can be observed: it is known that the tumor load at time point 1 is at ~85%, while it decreases to ~70% at time point 2, which can be clearly observed in Figure 2A generated by fishplot. By contrast, Figure 2B generated by timescape suggests a tumor load of 100% for both time points.
The relative clonal prevalence as displayed by timescape, however, matches the actual prevalence. As plots generated by timescape are interactive, it is possible to receive information on the exact underlying clonal prevalence of each clone by hovering over it. However, it should be noted that timescape does not take the CCF as input, but the difference in CCF of related clones.
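The difference between a CCF and the difference in CCF of related clones can be illustrated with a short Python sketch. It assumes nested clones, so that the cells of a child clone are also counted in the CCF of its parent. The function name `clonal_prevalence` and the example values are hypothetical, and the code does not call timescape itself.

```python
def clonal_prevalence(ccf, parent):
    """Convert nested CCFs (in percent) into per-clone prevalences.

    `ccf` maps clone -> CCF at one time point; `parent` maps clone -> parent
    clone (None for the founding clone). The prevalence of a clone is its CCF
    minus the CCFs of its direct children.
    """
    prevalence = {}
    for clone, value in ccf.items():
        nested = sum(ccf[child] for child, p in parent.items() if p == clone)
        prevalence[clone] = value - nested
    return prevalence
```

For a hypothetical linear evolution with CCFs of 85%, 60% and 20%, the resulting prevalences are 25%, 40% and 20%, which again add up to the tumor load of 85%.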
Similar to fishplot, it is possible to manually improve visualization with timescape. An additional clone can be defined, representing normal cells. If no relation is given for this clone, it is not displayed in the plot, but empty space is included instead, leading to a correct representation of absolute clonal prevalence. Exemplary output for all patients in data set 1 with CCFs of the outer-most clone <100% is provided in Supplementary Information, Figures S2 to S4 and S8 to S11.
Results for data set 2, considering a representative example each of linear, branched dependent and branched independent evolution, are provided in Figure 3. Results for all samples are available in Supplementary Information, section 2.1.
Figure 3. Visualization of clonal evolution analyzing three representative examples of data set 2 (linear: UPN01; branched dependent: UPN03; branched independent: UPN09). (A) Visualization of linear evolution using fishplot. (B) Visualization of linear evolution using timescape. (C) Visualization of branched dependent evolution using fishplot. (D) Visualization of branched dependent evolution using timescape. (E) Visualization of branched independent evolution using fishplot. Of note, timescape does not support visualization of branched independent evolution.
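The adjustment with an additional clone for normal cells amounts to filling the gap between the tumor load and 100%. The following Python sketch shows this arithmetic for one time point. The function name `with_normal_cells`, the label `"normal"` and the example values are assumptions for illustration; the tree edges passed to the plotting tool are not modeled here.

```python
def with_normal_cells(prevalence, label="normal", total=100):
    """Add a pseudo-clone holding the share of cells not assigned to any clone.

    `prevalence` maps clone -> prevalence in percent at one time point.
    The input is left unchanged; a new dictionary is returned.
    """
    tumor_load = sum(prevalence.values())
    if tumor_load > total:
        raise ValueError("clonal prevalences exceed the total")

    result = dict(prevalence)
    if tumor_load < total:
        # The remaining cells carry none of the clonal variants.
        result[label] = total - tumor_load
    return result
```

With prevalences of 25%, 40% and 20%, the pseudo-clone receives the remaining 15%, so that the plotted clones cover 85% instead of the full plot height.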
Data set 2 is characterized by a relatively high number of time points (5 to 30) and low number of clones (1 to 9). As a result, visualization by fishplot is considerably improved. Manually adding auxiliary time points does not appear necessary. In Figure 3A, showing linear clonal evolution of 2 clones, both clones can be distinguished easily. The same is true for the exemplary branched dependent evolution in Figure 3C (5 clones) and – with minor constraints – also the branched independent evolution in Figure 3E (8 clones). As the orange subclone is dominated by the green and blue/yellow subclone, it is expected to be observed with certain difficulties. Also in case of plots generated with timescape, all clones can be identified easily. Of note, branched independent evolution cannot be visualized using timescape. The tool mandatorily requires trees with a single root.
As regards visualization of clonal prevalence, correct scaling can be observed in the case of fishplot. For timescape the relative clonal prevalence is displayed correctly. However, to visualize the correct absolute scaling, normal cells have to be added manually (see Supplementary Information, Figures S12 to S14, S16 to S18 and S22).
Of note, timescape fails at visualizing the development of UPN04, which is characterized by a single clone. Strictly speaking, no clonal evolution can be observed in this case. However, visualization of clonal prevalence over time might still be interesting. With respect to timescape, this is not possible as no information on tree edges can be provided.
Discussion
In this work we evaluated tools performing the three main tasks in the process of reconstructing clonal evolution: clustering of variants, reconstructing clonal evolution trees and visualizing clonal evolution. We identified 51 approaches and performed a detailed evaluation of 17. We analyzed two publicly available data sets, covering altogether 21 patients.
When clustering variants, we focused on SNVs and indels. Most tools performing clustering consider the underlying copy number. Even in case of CNVs being absent, this approach appears essential to model small variants located on the X- and Y-chromosome correctly. However, only two tools – DeCiFer and SPRUCE – allow for the definition of the precise cellular fraction of a CNV. Unexpectedly, these two tools do not perform best on our data sets. Instead, they never report the correct clustering. Performance of the remaining 9 tools is largely comparable: the correct clustering is – at a maximum – reported for 2 out of 21 patients per tool. For data set 1, the correct clustering is never reported by any tool.
It may be argued that manually determined clusters in data set 1, taken as biological truth, are defined too finely graduated. In the presence of only 1 or 2 time points it is possible that the true number of clusters is manually overestimated. Natural variation in the VAF of variants may be misinterpreted to originate from different clones. However, the data are characterized by high sequencing depth (commonly >100× at every variant position). Furthermore, the clustering results determined by the tools usually do not merely indicate that 2 or 3 clusters have been merged, but instead show a completely different clustering.
Our results show the general deficiencies of currently available algorithms with respect to clustering variants. To our knowledge, not a single algorithm accepts variant calls from different sources as input, automatically integrating NGS, array, FISH and/or karyotyping data. In theory, it is possible to define CNVs detected by FISH or karyotyping just like SNVs/indels. The number of evaluated cells could be defined as coverage, the number of cells with the CNV divided by 2 as the number of reads carrying a heterozygous variant. However, for variants only detected by array analyses, i.e., LOH and smaller CNVs, this is impossible. Furthermore, equating the number of evaluated cells with read counts does not reflect validity of the variant calls correctly. A CNV that is microscopically detected in 1 out of 10 cells indicates a true variant with a CCF of 10%. A mismatch present in 1 out of 10 reads, on the contrary, indicates a sequencing artifact.
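The theoretical workaround for FISH or karyotyping data described above can be written down in a few lines. The Python sketch below only illustrates the arithmetic; the function name `fish_to_pseudo_reads` is hypothetical, and a heterozygous event is assumed, as in the text.

```python
def fish_to_pseudo_reads(cells_evaluated, cells_with_cnv):
    """Express a FISH/karyotyping call as pseudo read counts.

    Coverage is set to the number of evaluated cells, and the number of
    'alternative reads' to half the number of affected cells, mimicking a
    heterozygous variant. Returns (coverage, alt_reads, pseudo_vaf).
    """
    if cells_evaluated <= 0 or not 0 <= cells_with_cnv <= cells_evaluated:
        raise ValueError("invalid cell counts")
    coverage = cells_evaluated
    alt_reads = cells_with_cnv / 2
    return coverage, alt_reads, alt_reads / coverage
```

A CNV seen in 100 of 200 cells becomes 50 alternative reads at a coverage of 200, i.e., a pseudo-VAF of 25% for a CCF of 50%. A CNV seen in 1 of 10 cells becomes 0.5 alternative reads at a coverage of 10, which a read-based model would treat as noise, illustrating why the equivalence does not reflect the validity of the call.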
With respect to reconstructing clonal evolution trees we consider both, tools only performing tree reconstruction based on user-defined clustering as well as tools performing reconstruction based on their own clustering results. Even when providing perfect clustering as input, the correct tree is not reported by any of the tools for 11 out of 21 patients. The best results can be observed for data available at a single time point. It remains unclear why the tools do not show superior performance in the presence of more time points.
For tools providing a full analysis pipeline, results are even worse, as the initial clustering is usually not correct. Neither LICHeE nor SPRUCE provides any option for intervention: clustering results cannot be manipulated manually, as the main analysis is executed by a single command. Cloe works with specific data structures within R, including a “cloe_input” object and “meta-mutations”. Thus, manipulation of the clustering appears equally impossible. Only in the case of Canopy may it be possible to change the clusters determined by the tool to improve tree reconstruction.
With regard to visualizing clonal evolution, both available tools, fishplot and timescape, show certain deficiencies: fishplot does not allow for identification of distinct clones in the context of many clones and few time points, while timescape does not display the correct clonal prevalence in the case of a tumor load <100%. However, it is possible to compensate for both deficiencies. Auxiliary time points can be added to improve visualization with fishplot, and normal cells can be added to improve visualization with timescape.
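The two adjustments can be sketched as simple preprocessing steps applied to the clonal fractions before they are passed to the visualization tools. The following sketch does not call fishplot or timescape. The data layout, the function names and the label "normal" are hypothetical, and placing an auxiliary time point by linear interpolation is only one possible choice, assumed here for illustration.

```python
def add_normal_cells(prevalence, normal_label="normal"):
    """Fill each time point up to 100% with a pseudo-clone of normal cells.

    prevalence: {time_point: {clone: fraction}}, fractions between 0 and 1.
    """
    completed = {}
    for time_point, clones in prevalence.items():
        tumor_load = sum(clones.values())
        if tumor_load > 1.0 + 1e-9:
            raise ValueError(f"clonal fractions at {time_point} exceed 100%")
        # The remainder up to 100% is attributed to normal cells.
        completed[time_point] = {**clones, normal_label: max(0.0, 1.0 - tumor_load)}
    return completed


def insert_auxiliary_time_point(times, fractions, index):
    """Insert a time point halfway between times[index] and times[index + 1].

    times: sorted list of numeric time points.
    fractions: {clone: [fraction at each time point]}.
    Fractions at the new time point are linearly interpolated (assumption).
    """
    if not 0 <= index < len(times) - 1:
        raise IndexError("index must refer to a time point that has a successor")
    midpoint = (times[index] + times[index + 1]) / 2
    new_times = times[: index + 1] + [midpoint] + times[index + 1:]
    new_fractions = {
        clone: values[: index + 1]
        + [(values[index] + values[index + 1]) / 2]
        + values[index + 1:]
        for clone, values in fractions.items()
    }
    return new_times, new_fractions


if __name__ == "__main__":
    # Tumor load of 60% at T1: the remaining 40% is assigned to normal cells.
    print(add_normal_cells({"T1": {"clone_A": 0.4, "clone_B": 0.2}}))
    # One auxiliary time point between day 0 and day 100.
    print(insert_auxiliary_time_point([0, 100], {"clone_A": [0.2, 0.6]}, 0))
```

Both functions return new objects and leave the measured input data unchanged, so that the added values remain distinguishable from the observed ones.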
Although these two acceptable solutions exist, to our knowledge, no tool is currently available that visualizes clonal evolution on an allelic level. In cancer research, the identification of “double-hit events” is particularly important, as they might be related to a higher risk of disease progression or relapse (3). If, for example, a patient has a deletion of chr17p and an additional SNV in the gene TP53 on the remaining p-arm, this double-hit of TP53 can be visualized neither by fishplot nor by timescape.
It may be questioned why we focus our work on variants detected in bulk DNA-seq and not single-cell DNA-seq (scDNA-seq) data. Despite the clear advantage of scDNA-seq of potentially determining the exact mutational profile of a single cell, the technique is, at present, still characterized by major disadvantages. Owing to technical issues such as amplification bias, which leads to allelic imbalance, and allelic dropout, sensitivity with respect to variant calling is still low. Data preparation for single-cell sequencing limits its application to determining either small variants or CNVs in a single cell, but not both. Furthermore, analysis of retrospective samples is not possible because suitable material is lacking (31-34).
In conclusion, the currently available methods for reconstructing clonal evolution evaluated in this paper cannot yet replace manual reconstruction. No valid tool or combination of tools could be identified that performs clustering and tree reconstruction with comparable precision.
In the context of large sets of data collected in clinical routine, it appears highly beneficial to develop a reliable method replacing tedious manual reconstruction of clonal evolution. However, in order to develop such a method, systematic error analysis has to be performed first. While our analysis of two real data sets clearly shows the deficiencies of current approaches, the reasons for their limitations are not entirely clear. Therefore, the next step will be to develop a simulation tool generating realistic, user-definable input data for clonal evolution analysis. On the basis of these data, it can be investigated why certain algorithms generate mixed-up clusters of variants or fail to generate clonal evolution trees in the presence of >1 time point. Only then may it be possible to identify the best approach to developing a pipeline for fully automatic reconstruction of clonal evolution.
Acknowledgements
We would like to thank Dr. Marius Wöste for his technical support.
Footnotes
Authors’ Contributions
SS performed analyses and wrote the manuscript. SR performed supporting analyses. JV and XJ supported the analyses and provided feedback. All Authors read and approved the final version of this manuscript.
Supplementary Material
Available at: https://uni-muenster.sciebo.de/s/YTHAwxTvXcrRzp9
Conflicts of Interest
The Authors declare that they have no conflicts of interest.
- Received October 29, 2021.
- Revision received December 11, 2021.
- Accepted December 17, 2021.
- Copyright © 2022, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved