Journal of Molecular Biology
Volume 296, Issue 5, 10 March 2000, Pages 1205-1214

Regular article
Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae¹

https://doi.org/10.1006/jmbi.2000.3519

Abstract

AlignACE is a Gibbs sampling algorithm for identifying motifs that are over-represented in a set of DNA sequences. When used to search upstream of apparently coregulated genes, AlignACE finds motifs that often correspond to the DNA binding preferences of transcription factors. We previously used AlignACE to analyze whole genome mRNA expression data. Here, we present a more detailed study of its effectiveness as applied to a variety of groups of genes in the Saccharomyces cerevisiae genome. Published functional catalogs of genes and sets of genes grouped by common name provided 248 groups, resulting in 3311 motifs. In conjunction with this analysis, we present measures for gauging the tendency of a motif to target a given set of genes relative to all other genes in the genome and for gauging the degree to which a motif is preferentially located in a certain distance range upstream of translational start sites. We demonstrate improved methods for comparing and clustering sequence motifs. Many previously identified cis-regulatory elements were found. We also describe previously unidentified motifs, one of which has been verified by experiments in our laboratory. An extensive set of AlignACE runs on randomly selected sets of genes and on sets of genes whose upstream regions contain known transcription factor binding sites serves as controls.

Introduction

The recent increase in the number of sequenced genomes and the amount of genome-scale experimental data allows the use of computational techniques to investigate cis-acting sequences controlling transcriptional regulation. Some methods seek to find new sites for a given transcription factor based on a set of known sites, often by using online search engines where one may submit sequences to be scanned for known motifs (Heinemeyer et al., 1998; Zhu and Zhang, 1999). Others, such as AlignACE, seek to find unknown DNA binding motifs for unspecified transcription factors by searching the regions upstream of the translational start sites of a set of potentially coregulated genes (Spellman et al., 1998; van Helden et al., 1998; Brazma et al., 1998; Roth et al., 1998).

AlignACE is based on a Gibbs sampling algorithm and returns a series of motifs that are over-represented in the input set. It has previously been used to find transcriptional regulatory DNA motifs in Saccharomyces cerevisiae using groups of genes derived from genome-wide mRNA expression data (Roth et al., 1998; Tavazoie et al., 1999). While many known cis-acting elements were identified, AlignACE returned many more motifs for which we found no information in the literature. A distinguishing feature of most of the known motifs was that their highest scoring genomic sites tended to be strongly selective for the upstream regions of the genes used to find them. One might expect this always to be true, since each motif is itself composed of sites in those regions, but we found that the vast majority of the unknown motifs were not very selective in this way. In addition, a subset of the known motifs seemed to be preferentially positioned relative to the start of translation.

Here, we describe statistics to measure these two motif properties, which we call group specificity and positional bias. Furthermore, we present results from the systematic application of AlignACE to a sample set of functional groups of genes in S. cerevisiae, as well as positive and negative control sets. These data sets allow us to calibrate AlignACE and the associated motif measures so that empirical significance thresholds for these statistics may be determined. Many known cis-regulatory elements, as well as novel motifs, are identified by this method.
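To make the two statistics concrete, the following is a minimal sketch of how such measures could be computed. The formulas are not given in this excerpt, so the choices here are assumptions: group specificity is modeled as a hypergeometric tail probability for the overlap between the input group and the genes carrying the motif's top-ranked sites, and positional bias as the smallest binomial tail probability over sliding windows of the upstream region. All function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items, `successes` of which are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return hits / total


def group_specificity(genome_size, group_genes, top_genes):
    """Assumed form of a group specificity score: the chance of seeing at
    least the observed overlap between the input group and the genes whose
    upstream regions hold the motif's best-scoring sites."""
    group, top = set(group_genes), set(top_genes)
    overlap = len(group & top)
    return hypergeom_tail(genome_size, len(group), len(top), overlap)


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def positional_bias(site_positions, upstream_len, window):
    """Assumed form of a positional bias score: the smallest binomial tail
    over all windows of width `window`, with positions given as 0-based
    distances (bp) upstream of the translational start. The minimum is not
    corrected for the number of windows tested."""
    n = len(site_positions)
    p = window / upstream_len  # chance that a uniformly placed site lands in one window
    best = 1.0
    for start in range(upstream_len - window + 1):
        k = sum(start <= pos < start + window for pos in site_positions)
        best = min(best, binom_tail(n, p, k))
    return best
```

Under both assumed definitions a smaller value indicates a stronger effect, so a motif whose top sites fall almost entirely upstream of the input genes, or cluster in a narrow distance range, would receive a score close to zero.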

Section snippets

The input sets of genes

A total of 248 groups were examined, including 135 groups from the database at the Munich Information Center for Protein Sequences (Heinemeyer et al., 1998), 17 groups from the Yeast Protein Database (Hodges et al., 1999), and 96 groups based on common name root as listed in the table of open reading frames (ORFs) from the Saccharomyces Genome Database (SGD) (ftp://genome-ftp.stanford.edu/pub/yeast/SacchDB; Cherry et al., 1998). We considered only groups of six or more genes. The number of

Discussion

We present a set of analytical tools for the computational discovery and validation of cis-acting regulatory elements in a sequenced and annotated genome.

The group specificity score is a useful statistic for gauging whether a given motif is real in the sense that it describes a sequence feature that is functionally relevant for the genes under consideration. This measure is independent of the method being used to find motifs. It works as long as there is a method of ranking potentially

AlignACE

AlignACE is an algorithm implemented in C++ for finding multiple motifs in any given set of DNA input sequences. We define a motif as the characteristic base-frequency patterns of the most information-rich columns of a set of aligned sites. AlignACE is based on a Gibbs sampling algorithm previously used to find motifs in protein sequences (Neuwald et al., 1995; Lawrence et al., 1993; Liu et al., 1995). It differs from this method in the following ways: (1) the motif model was changed so that the base
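The motif definition given above (base-frequency patterns of the most information-rich columns of aligned sites) can be illustrated with a short Python sketch. This is not AlignACE's C++ implementation. It assumes a uniform background base composition, which is a simplification for the AT-rich yeast genome, and all names are hypothetical.

```python
from math import log2

BASES = "ACGT"


def column_frequencies(sites):
    """Per-column base frequencies for a list of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("aligned sites must all have the same length")
    freqs = []
    for col in range(length):
        counts = {b: 0 for b in BASES}
        for site in sites:
            counts[site[col]] += 1
        freqs.append({b: counts[b] / len(sites) for b in BASES})
    return freqs


def information_content(freq, background=None):
    """Relative entropy (bits) of one column against a background.
    The uniform default background is an assumption of this sketch."""
    bg = background or {b: 0.25 for b in BASES}
    return sum(p * log2(p / bg[b]) for b, p in freq.items() if p > 0)


def most_informative_columns(sites, k):
    """Indices of the k columns with the highest information content."""
    freqs = column_frequencies(sites)
    ranked = sorted(
        range(len(freqs)),
        key=lambda c: information_content(freqs[c]),
        reverse=True,
    )
    return sorted(ranked[:k])
```

For example, with the aligned sites `ACGT`, `ACGA`, `ACTC` and `ACAG`, the first two columns are fully conserved and carry 2 bits each, while the last column is uniform and carries none, so selecting two columns would keep only the first two.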

References (39)

  • F Caspary et al.

    Constitutive and carbon source-responsive promoter elements are involved in the regulated expression of the Saccharomyces cerevisiae malate synthase gene MLS1

    Mol. Gen. Genet.

    (1997)
  • A Delahodde et al.

    Positive autoregulation of the yeast transcription factor Pdr3p, which is involved in control of drug resistance

    Mol. Cell. Biol.

    (1995)
  • F Fisher et al.

    Single amino acid substitutions alter helix-loop-helix protein specificity for bases flanking the core CANNTG motif

    EMBO J.

    (1992)
  • K.B Freeman et al.

    Histone H3 transcription in Saccharomyces cerevisiae is controlled by multiple cell cycle activation sites and a constitutive negative regulatory element

    Mol. Cell. Biol.

    (1992)
  • V Gailus-Durner et al.

    Analysis of a meiosis-specific URS1 site: sequence requirements and involvement of replication protein A

    Mol. Cell. Biol.

    (1997)
  • P.M Goncalves et al.

    Transcription activation of yeast ribosomal protein genes requires additional elements apart from binding sites for Abf1p and Rap1p

    Nucl. Acids Res.

    (1995)
  • J.A Hartigan

    Clustering Algorithms

    (1975)
  • T Heinemeyer et al.

    Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL

    Nucl. Acids Res.

    (1998)
  • P Hodges et al.

    Yeast protein database (YPD): a model for the organization and presentation of genome-wide functional data

    Nucl. Acids Res.

    (1999)

¹ Edited by F. E. Cohen
