Abstract
The observed number per base pair (i.e. the frequency) of G3+N1-7G3+N1-7G3+N1-7G3+ motifs has increased rapidly in the eumetazoa for which complete genomic sequences are available. This increase appears to be under positive selective pressure since it exceeds the frequency expected for a random sequence genome in every case. Since the motif is capable of forming several non-B DNA structures including quadruplexes, triplexes and hairpins, the expansion has been enabled by the presence of systems capable of suppressing non-B DNA conformations during normal replication and repair and by the emergence of proteins that promote the formation of unusual structures at these sites. Positive selection for these motifs suggests that they are not merely associated with their negative effects on genome stability, but may be useful in increasing the number of structural states in nucleic acids that are available for the elaboration of epigenetic states.
The evolutionary progression among the eumetazoan animals is a progression that has generated increasing developmental complexity. The serial emergence of eumetazoans, bilaterians, protostomes, deuterostomes, chordates, vertebrates and mammals marks a progressive increase in developmental complexity and therefore a progressive increase in the underlying epigenetic potential of the respective genomes. The well-known expansion of genome size in this progression is consistent with evidence for combinatorial models of epigenetic complexity in which multiple inputs from regulatory proteins and regulatory RNAs in expanded genomes modulate transcription of structural and metabolic proteins to enhance epigenetic complexity (1, 2). While the evidence for gene regulatory networks and the attendant requirement for a general genome expansion and a general increase in transcription factors (3-5) is incontrovertible, additional modulating mechanisms are being recognized. For example, DNA methylation patterning in vertebrate genomes has been proposed to have important epigenetic functions in gene regulatory networks (6-8).
Non-B DNA structure potential has also been proposed as a component of epigenetic systems (9-21). Searches of genomic sequences from bacteria and partial sequences of the yeast and human genomes have suggested that the homopurine mirror repeats characteristic of the H-DNA triplex, cruciform and slipped DNA structures (Figure 1) are overrepresented in eukaryotic genomes (22).
Sequence elements defined as G3+N1-7G3+N1-7G3+N1-7G3+ motifs residing on either DNA strand (14) carry the potential for G-quadruplex formation (Figure 1). In addition, they identify a subset of the imperfect homopurine mirror repeats capable of triplex formation (23) and C-strand i-motif formers (17) (Table I). Moreover, slipped and foldback structures are implicit intermediates in the formation of both quadruplex (24) and triplex (25) structures at these motifs. This report describes the prevalence of this motif in the eight eukaryotic genomes for which complete genomic sequences are currently available.
Materials and Methods
Evolutionary distance. No single bioinformatics approach that is based on nucleic acid or protein sequencing linking eukaryotic organisms in an evolutionary hierarchy has been universally adopted (26). When comparing sequenced eukaryotic genomes one is often tempted to employ a modern version of the Aristotelian concept of graded scale of existence (i.e. the scala naturae or the Great Chain of Being) as refined by Plotinus (27). Certainly, roundworms are less highly evolved than birds or mice, but quantifying the evolutionary distance between them based on genomic parameters is difficult. A variety of molecular clocks and paleonotlogical evidence (1, 28-30) tend to provide measures of evolutionary distance, and the molecular and paleontological time scales now tend to agree on the timing of the major evolutionary divergences (26, 31). In this report, the paleontological view based primarily on the fossil record was adopted as the measure of evolutionary distance (32-34).
Genomic searches, background data gathering. Genomic sequences for the different organisms were obtained through links provided by the NCBI website (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome). Human sequence (Homo sapiens, reference build, release 2.1, build 36.3) and the mouse sequence (Mus musculus, build 36.1) were obtained from the NCBI site. Chicken sequence (Gallus gallus, release 2.1, build 2.1) was obtained from the WUSTL website (http://genome.wustl.edu/genomes/view/gallus_gallus/). Zebrafish (Danio rerio) and medaka (Oryzias latipes) sequences were obtained from the Ensembl FTP site (http://uswest.ensembl.org/info/data/ftp/index.html). The Drosophila sequence (Drosophila melanogaster, release 5.9) was obtained from Flybase (ftp://ftp.flybase.net/genomes/). C. elegans (Caenorhabditis elegans, WS170, build 7.1) was obtained from the Sanger project website (http://www.sanger.ac.uk/Projects/C_elegans/Genomic_Sequence.shtml). To calculate the quantity of each nucleotide, chromosomal sequences were opened using EditPad Pro (Just Great Software, Phuket, Thialand) and a search was made through each of the files. Microsoft Excel (Microsoft Corp., Redmond Washington, USA) was used to create the table with the total genome size and also used to calculate the GC fraction by dividing the number of G+C over G+C+A+T. Unknown bases, labeled N, were not included in this calculation but still counted towards the total genome size.
Data collection
Complete genome searches. All unusual structures counts were placed into Excel, and the frequencies were calculated by dividing the number of found structures by the total genome size, as determined previously. G3+N1-7G3+N1-7G3+N1-7G3+ density was calculated using the Huppert algorithm (35) with the program Quadparser, available from http://www.quadruplex.org/. General quadruplex frequencies were calculated with the Maizels algorithm (36) with the program G4P, available from http://depts.washington.edu/maizels9/G4calc.php. Similar trends were detected with both algorithms. Data from Quadparser were depicted graphically.
Quadparser was placed in the same folder as the FASTA-formatted sequence files, and the program was executed through the command prompt. Output text files were then individually examined and the total number of unique quadruplexes was counted for each file. The G4P calculator used Microsoft.NET (Microsoft) and sequence files were used as input for the algorithm. The output was the density of regions that contained a quadruplex.
Tandem repeats density was calculated using Tandem Repeats Finder (TRF) (37), available from http://tandem.bu.edu/trf/trf.html. Alignment parameters for match, mismatch, and indel were set at 2, 7, and 7 respectively. Alignment scores above 50 were reported, and the maximum period size was 500. Sequences were opened and scanned by the program. Output files were opened and counts were tallied in Excel.
Inverted repeat density was calculated using Inverted Repeats Finder (IRF) (38), available from http://tandem.bu.edu/irf/irf.download.html. Default parameters for this program were used. Alignment parameters for match, mismatch, and indel were set at 2, 3, and 5 respectively. Matching and indel probabilities were 0.8 and 0.1, respectively. Alignment scores above 40 were reported, and the maximum period size was 2000. The program was executed through the command prompt and was placed in the same folder as the sequence files. Mirror repeat frequencies were also calculated using IRF.
Sampled genome searches. Because the algorithms for finding Z-DNA and palindromes take much longer to run on long sequences, random segments of the genome were sampled to determine the density of the entire genome. Fifty sequences totaling 1.5% of the entire genome were sampled. Numbers were generated using a random number generator via the process of converting atmospheric noise to numbers (available from http://www.random.org) and then mapped to the starting locations of the sequences. To verify the accuracy of this approach, results from random samples taken from smaller genomes were compared to those of the entire genome and were found to be within around 3% of the actual results.
Z-Hunt was used to determine the density of Z-DNA in each of the genomes (39), available from http://gac-web.cgrb.oregonstate.edu/zDNA/. Segments of the genome were uploaded to the server, which then processed the data and displayed the output file listing all the potential Z-DNA sequences. Output sequences were copied to a text file and the number of Z-DNA sequences was counted.
Palindromes were counted using the palindrome function in the EMBOSS package mEMBOSS (40), available from http://emboss.sourceforge.net/. Sequences were used as input to mEMBOSS for analysis. The minimum length of the palindrome was set at 15 and the maximum at 100 base pairs, with two possible mismatches allowed in the palindrome. The maximum gap between each of the palindromes was set to be 100. The program generated a text file containing the palindromes, and the total number was tallied in Excel.
Expected density. Expected densities for quadruplex, tandem repeats, inverted repeats, Z-DNA, mirror repeats and palindromes were determined by using a randomly generated sequence (with equal probability for G, C, A, T). Each letter was assigned to a number and the random numbers were generated using the aforementioned random number generator. 10 sequences of 1 million bp were generated, as well as 1 sequence of 10 million bp, for a total of 20 million bp. The sequences were then converted to FASTA format and run through each of the algorithms to determine the density. A normal distribution between the sequences was assumed, and the 99% confidence interval for the density of random sequences was determined from their standard deviation.
Data analysis. All data was input into Mathematica (Wolfram Research Inc., Champaign, IL, U.S.A.) and processed with Mathematica subroutines. Plots were generated with the ListPlot subroutine. The subroutines LinearModelFit or NonlinearModelFit were used to model the data and to obtain fitted parameter confidence intervals and R2 values. Parameter confidence interval shading was obtained with the MeanPredictionBands subroutine.
Results
Available genomic sequence information allowed us to calculate genome size with a high degree of accuracy. In order to detect evolutionary trends, the data for the sequenced eukaryotic genomes were plotted on a geologic time scale that links the genome to its point of divergence from the main eumetazoan lineage based on paleontological dating from the fossil record (32-34). The data are plotted in Figure 2.
The same genomic sequence information allowed calculation of the G+C content with a high degree of accuracy. The data (Figure 3) suggested a very weak tendency toward increased G+C content in the evolutionary progression of the eumetazoans. Since the frequency of the (G3+N1-11G3+N1-11G3+N1-11G3+) motif is expected to increase with increasing G+C content, a baseline expectation was calculated for a random genome of 20Mb with equal frequencies for each of the four nucleotides. This baseline expectation for random sequences was calculated from a direct search of 20Mb of random sequence for the (G3+N1-7 G3+N1-7G3+N1-7G3+) motif. Since the random sequence was 50% G+C, the expectation exceeded that for the highest G+C content in the sequenced genomes studied here (42%).
Corresponding scans of each of the sequenced genomes yielded the data given in Figure 4. The motif was overrepresented relative to the random expectation in each organism and the frequency in number/bp increased steadily as the epigenetic potential of the genome increased.
Interestingly, inverted repeat frequency, tandem repeat frequency and Z-DNA sequence frequency did not show smooth increases as a function of the time of divergence (Table II). With the exception of Z-DNA frequency, each of these motifs was present in every eukaryotic genome tested at a frequency that exceeded the expectation for a random genome as previously suggested from sampling data (22). However, only the G3+N1-7G3+N1-7G3+N1-7G3+ motif approximated a linear increase as a function of time at divergence (Figure 4).
Discussion
Developmental complexity is synonymous with genetic and epigenetic complexity. Since each stage in a developmental program manipulates the same genomic sequence as new cell types and stably differentiated states are elaborated, developmental complexity results from the combined epigenetic and genetic complexity of an organism. While it is clear that the developmental complexity of a mammal is significantly greater than that of a roundworm, it is difficult to assign a measure of developmental complexity to the respective genomes. In general, the Aristotelian Chain of Being has been used to order organisms qualitatively in demonstrating that genome size (C-value paradox) (41) and gene number (G-value paradox) (42, 43) cannot account for evolutionary complexity. Current best evidence suggests that transcription factor numbers (3), or increasingly complex cis regulatory elements and multiprotein transcription complexes scale with the Chain of Being (4, 44).
Scaling developmental and epigenetic complexity. While the Chain of Being places organisms in an intuitive order, the underlying logic of this order can be seen in palenontology. Palenontology, interpreted through Darwinian evolution, offers the best approach to an unbiased scale of developmental complexity against which candidate processes can be measured. Among the animals, the date at which a given species diverged from the main evolutionary lineage can be taken as its position in the genetic and epigenetic hierarchy. This approach places the Aristotelian Chain of Being on a paleontological time scale that not only orders genetic and epigenetic complexity but also permits genomic analysis of the trends in the emergence of that complexity.
Based on paleontolological evidence (34) and molecular clocks (26, 28, 30) that encompass the major laboratory organisms (34), the genomes of the eukaryotic animals can now be placed on a paleontological time scale with a high degree of confidence.
When the eumetazoan genome sizes studied here are placed on this paleontological time scale, it is clear that these particular genomes scale linearly from about 584 million years ago as one might expect for a lineage thought to have originated at about the time of the Cambrian Explosion. Obviously this linear relationship for genome size holds only for the laboratory species studied here. Additional organisms plotted on this scale would obscure this linear relationship (27): nematodes can have genomes with sizes comparable to mammals. Nevertheless, these organisms can be traced to the divergence points shown in Figure 2 and, thus, provide an evolutionary series of organisms with monotonically increasing genome sizes that permit further analysis.
Given the evidence demonstrating that sequence motifs associated with non-B conformations can undermine genomic stability by promoting mutagenesis (45), dynamic mutation (46-48) and gene rearrangement (24, 49), it is reasonable to conclude that the maintenance of the larger genomes is enabled by the evolution of suppressors of non-B structure like the RecQ (50) helicase family represented in modern yeast by Sgs1 (51). By enabling the genomes to incorporate high conformation space sequence motifs (52) like those associated with the formation of quadruplex, triplex and hairpin structures, the emergence of large genomes is allowed to go forward unhindered. Given neutral selection, the aggregate frequency (namely the number per base pair) at which these motifs should occur in a random DNA sequence is expected to be a constant, independent of genome size. The present analysis showed that Z-DNA is maintained below the random expectation in every organism tested (Table II) suggesting that it is under negative selective pressure. In contrast, the observed frequency of every motif studied except that of Z-DNA fell above the associated random expectation for every organism studied (Table II), suggesting that at least some of these motifs have been the subject to positive selection. This is consistent with the results on the maintenance of cruciforms on Y (53) and the abundance of simple repeats (54) if one assumes that the positively selected palindromes, inverted repeats and tandem repeats form a class with low annealing temperatures.
Of the sequences that occured at frequencies above the respective random expectation, only the (G3+N1-7G3+N1-7 G3+N1-7G3+) motif exhibited a monotonic increase in frequency on the palenotological scale (Figure 4). Since it is capable of forming quadruplex, triplex and hairpin structures (Figure 1 and Table I), the expanding frequency of this motif can be taken as a measure of the expansion of non-B-DNA forming potential that is linked to the expansion of developmental potential (Figure 4). Importantly, these rather large changes in sequence motif representation occurred without dramatic changes in the overall G+C content. Although the progression encompassed poikilothermic (cold blooded) eukaryotes and homeothermic (warm blooded) metazoans, none are extremophiles (requiring physically extreme conditions). The organisms studied here develop in temperatures that range from 10°C to 44°C, thus ranging in G+C content from about 35% to 42%. While the range from 35% to 42% suggests a trend toward a higher density of G-rich motifs in more highly evolved genomes, none of the G+C contents present in sequenced organisms approached the 50% G+C content present in a completely random genome. Clearly, the increase in the frequency of linked G-rich motifs (Figure 4) exceeded the expectation for a random sequence and cannot be due to the small increase in G+C content associated with the evolutionary progression.
Positive selection for the G3+N1-7G3+N1-7G3+N1-7G3+ motif. The question of whether positive selection for this form of non-B structure potential is a reflection of the expansion of the developmental potential of the genome or is a reflection of the accumulation of junk DNA (55) sequences is an important one. While human Alu sequences lack the motif, the consensus sequence of the human L1 retrotransposon contains one copy of the motif near the polyA sequence. Thus at least some of the instances of the motif (as much as a third of the human occurrences) can be attributed to retrotransposition within the L1 family in humans. Although retrotransposition may actually be promoted by the formation of quadruplex, hairpin and triplex structures, it does not appear to be required for transposition, since P-elements in Drosophila lack the motif. Thus, the proliferation of repetitive elements does not appear to account for the observed smooth expansion of the motif frequency with developmental complexity. In short, the observed increase in the frequency of the G3+N1-7G3+N1-7G3+N1-7G3+ motif was more consistent with positive selection for functionality (perhaps via retrotransposition) than with neutral selection associated with the accumulation of junk DNA sequences.
Positive selection is also consistent with the appearance of proteins that promote the formation of unusual structures in DNA (e.g. the meiotic pairing gene Hop1 (11) and Nucleolin (56)). Moreover, positive selection is difficult to rationalize in terms of the coding capacity of these sequences for proteins given the redundancy of the genetic code and its capacity to mold codon usage (57). The analysis suggests that sequences capable of quadruplex, hairpin and triplex formation are not merely associated with harmful effects as one may expect from their association with sites of dynamic mutation (46-48, 58) and gene rearrangement (49) but also serve useful developmental functions since they appear to have been under strong positive selective pressure during the emergence of developmental complexity.
Roles for the G3+N1-7G3+N1-7G3+N1-7G3+ motif. In vitro studies of the formation of quadruplex DNAs have generally been performed with short oligodeoxynucleotides representing regions of biologically important sequences. A key example is the nuclease hypersensitive element (NHE III1) located between promoter P0 and promoter P1 in the human MYC gene (59). This element has been shown to form an intramolecular quadruplex (21) in vitro. However, mutagenesis studies and the presence of an imperfect homopurine mirror repeat present in the sequence suggest that it can also form an intramolecular triplex (60). Consistent with these findings, the region has also been shown to carry a canonical motif (G3+N1-7G3+N1-7 G3+N17 G3+) characteristic of both quadruplex sequences (14) and the G-rich subset of triplex sequences formed by imperfect homopurine mirror repeats (23).
Several well-studied examples of quadruplex or triplex formation (Table 1) exhibit this motif and thus carry the capacity for the formation of either or both types of non-B DNA structure. Moreover, other similar motifs are known to form quadruplex sequences. For example, motifs such as (G3+N1-11G3+N1-11G3+N1-11G3+) with extended loop size are expected to form quadruplex DNA (10, 61). However, longer loops reduce stability (62). In addition, it is well known that the (GGC)n motif in triplet repeat sequences from the FMR1 gene has been observed to form quadruplex DNA (47), and is associated with the formation of a spontaneous slippage intermediate on the complementary C-rich strand (46, 48, 58).
Although quadruplex formation in oligodeoxynucleotides is not often studied in the presence of both complementary strands, quadruplexes have been observed to form in vitro in the presence of their complementary strands (58). Moreover, the extended loop size (Lamparska-Kupsik and Smith, unpublished) or the presence of the complementary strand (63) can result in the stable trimolecular triple helices. The full structural complexity of these motifs is depicted schematically in Figure 1.
A significant number of proposals for the function of these sequence motifs are consistent with the expectation that they should expand in frequency in concert with developmental complexity (35, 64). Since the intermolecular quadruplex is a molecular mimic of the synaptonemal alignment of homologues during meiosis, an active role in meiosis has been suggested (10). Support for this proposal has been adduced from the KEM1/SEP1 nuclease system (65) in yeast which can block meiosis at pachytene when mutated (66), and from the presence of quadruplex binding proteins of similar function in human cells (67). These proposals are also consistent with the expansion of the G3+N1-7G3+N1-7 G3+N1-7G3+ motif, because homologous pairing is expected to require a higher density of interstrand links as chromosome size increases.
Gene expression. Proposals for a role in the control of gene expression through the formation of intramolecular triplex (60) or quadruplex (21) at promoter sequences in the human c-MYC (59), KRAS (68) and c-KIT (69) genes are consistent with the presence of these elements in the 5' UTRs of a significant number of human genes (14) and with the capacity of Hop1 (11) and Nucleolin (56) to induce these structures in DNA.
These proposals require the generation of the structure in DNA, however, it is also possible that the RNA transcript would carry the folded structure encoded by a B-DNA sequence as discussed by Huppert et al. (14). For example, transcription from start site P0 at the human MYC gene would produce a 5' UTR capable of intramolecular quadruplex or triplex formation, while transcription from start site P1 would not. Thus, the presence of these motifs in 5' and 3' UTRs is consistent with a role in RNA processing (14) and the mechanism of action of non-coding RNAs.
The combinatorial limit. Based on Ohno's original suggestion (55) that mutation rate limits the total number of genes to between 15,000 and 20,000 genes per genome, it is tempting to speculate that once this combinatorial limit (2) is reached (apparently already at the roundworms), additional epigenetic mechanisms involving phenomena like unusual DNA structure formation and DNA methylation must come into play.
Structural complexity and epigenetic potential. A structurally complex motif like the (G3+N1-7G3+N1-7 G3+N1-7G3+) motif necessarily generates a dramatic increase in the epigenetic potential of the sequence at a given site. A simple B-DNA sequence offers one DNA conformation that can be recognized by a protein or nucleic acid modulator. A (G3+N1-7G3+N1-7G3+N1-7G3+) motif can present six distinct DNA conformations that can be uniquely recognized by protein or nucleic acid modulators (Figure 1). Thus, even the pair-wise interaction potential of these motifs as sites of protein or nucleic acid recognition will be increased by fifteen times (=6!/2!(6-2)!) the number of motifs present in the genome. Thus, the increase in (G3+N1-7G3+N1-7G3+N1-7G3+) density links the positive selection for these non-B conformations with the emergence of epigenetic complexity through a clear enhancement of information content. In other words, the different intermolecular and intramolecular conformations that can be adopted or encoded in RNA by a single DNA sequence form different epigenetic signals that effectively increase the epigenetic information content of the genome (12-14, 52, 59, 70, 71) by increasing the number of protein and nucleic acid binding sites available from a single sequence. The data reported here suggest that eumetazoans have enhanced their developmental and epigenetic potential by selectively incorporating structurally complex motifs in their genomes in spite of the potential that these motifs have for chromosomal damage (50, 63).
In conclusion, similar to transcription factors and cis-regulatory elements, non-B DNA forming motifs like the G3+N1-7G3+N1-7G3+N1-7G3+ motif appeared to scale with organismic complexity when placed on a paleontological scale that quantifies the Aristotelian Chain of Being. The observed steady increase in the frequency of these motifs associated with the emergence of complexity was consistent with the proposed roles for this motif in amplifying the number of structural states in nucleic acids that are available for molecular recognition during the elaboration of epigenetic states.
Acknowledgments
Yu-Loung Chang and Jarrod Clark are thanked for their help in compiling data and illustrating figures.
- Received March 20, 2010.
- Revision received May 12, 2010.
- Accepted May 14, 2010.
- Copyright© 2010 International Institute of Anticancer Research (Dr. John G. Delinassios), All rights reserved