We are searching data for your request:

**Forums and discussions:**

**Manuals and reference books:**

**Data from registers:**

**Wait the end of the search in all databases.**

Upon completion, a link will appear to access the found materials.

Upon completion, a link will appear to access the found materials.

I am working with a set of DNA motifs that are predicted as potential regulatory motifs (e.g. transcription factor binding sites). The motifs belong to several species, and I wanted to cluster these motifs via their Position Weight Matrices (PWMs) (also known as PSSMs) to collapse similar motifs together into groups.

There is a tool called MATLIGN (website here) that does what I need, but their required format for the PWMs are different to what I have, they claim:

"Matrices must be in the frequency matrix format (only integer numbers are acceptable)"

The problem is that my PWM matrices do not have integer numbers but decimals instead. e.g.:

`A C G T 1 0.000000 1.000000 0.000000 0.000000 2 1.000000 0.000000 0.000000 0.000000 3 0.000000 0.000000 1.000000 0.000000 4 0.000000 0.421755 0.000000 0.578245 5 0.289407 0.000000 0.282556 0.428038`

In other words, instead of the decimal values I have in my matrix I need to have integer counts. Could anybody suggest what I can do? Would I need to create artificial "pseudo-counts"?

So what you need is basically your data expressed as counts instead of proportions. Even if you do not have the matrix of counts as raw data, these proportions only needs to be multiplied by the total number of binding sites used in the study (e.g. the number of sequences that have been analysed) to get the counts (since proportion = count/total number of binding sites). You should have that information somewhere.

@hello_there_andy: indeed there was this missing piece of information available to me, it came in the form of a variable called **nsites** which equates to the total number of DNA sites that the PWM was generated from.

## Logo2PWM: a tool to convert sequence logo to position weight matrix

position weight matrix (PWM) and sequence logo are the most widely used representations of transcription factor binding site (TFBS) in biological sequences. Sequence logo - a graphical representation of PWM, has been widely used in scientific publications and reports, due to its easiness of human perception, rich information, and simple format. Different from sequence logo, PWM works great as a precise and compact digitalized form, which can be easily used by a variety of motif analysis software. There are a few available tools to generate sequence logos from PWM however, no tool does the reverse. Such tool to convert sequence logo back to PWM is needed to scan a TFBS represented in logo format in a publication where the PWM is not provided or hard to be acquired. A major difficulty in developing such tool to convert sequence logo to PWM is to deal with the diversity of sequence logo images.

### Results

We propose logo2PWM for reconstructing PWM from a large variety of sequence logo images. Evaluation results on over one thousand logos from three sources of different logo format show that the correlation between the reconstructed PWMs and the original PWMs are constantly high, where median correlation is greater than 0.97.

### Conclusion

Because of the high recognition accuracy, the easiness of usage, and, the availability of both web-based service and stand-alone application, we believe that logo2PWM can readily benefit the study of transcription by filling the gap between sequence logo and PWM.

## Background

Discovering and characterizing DNA and protein sequence motifs are fundamental problems in computational biology. Here, we use the term 'motif' to refer to a position-specific probability matrix that describes a short sequence of amino acids or nucleotides that is important to the functioning of the cell. For example, the regulation of transcription requires sequence-specific binding of transcription factors to certain *cis*-acting motifs, which typically are located upstream of transcriptional start sites [1]. On the other hand, protein sequence motifs might correspond to active sites in enzymes or to binding sites in receptors [2].

A wide variety of statistical methods have been developed to identify sequence motifs in an unsupervised manner from collections of functionally related sequences [3]. In addition, databases such as JASPAR [4], TRANSFAC [5], and BLOCKS [6] can be used to scan a sequence of interest for known DNA or protein motifs. In this work we develop a statistical method for comparing two DNA or protein motifs with one another. This type of comparison is valuable within the context of motif discovery. For example, imagine that you are given a collection of promoter regions from genes that share similar mRNA expression profiles, and that a motif discovery algorithm identifies a motif within those promoters. Often, the first question you would ask is whether this new motif resembles some previously identified transcription factor binding site motif. To address this question, you need a computer program that will scan a motif database for matches to your new (query) motif. The program must consider all possible relative offsets between the two motifs, and for DNA motifs it must consider reverse complement matches as well. An example alignment between two similar motifs is shown in Figure 1. An alternate use for a motif comparison program would be to identify and then eliminate or merge highly redundant motifs within an existing motif database.

An aligned pair of similar motifs. The query and target motifs are both derived from JASPAR motif NF-Y, following the simulation protocol described in the text. Tomtom assigns an *E* value of 3.81 × e -10 to this particular match. The figure was created using a version of seqlogo [26], modified to display aligned pairs of Logos.

We are not the first to describe a method for quantifying the similarities between pairs of motifs. Pietrokovski [7] compared protein motifs using a straightforward algorithm based on the Pearson correlation coefficient (PCC). Subsequently, Hughes and coworkers [8] applied a similar method to DNA motifs. Wang and Stormo [9] introduced an alternate motif column comparison function, termed the average log-likelihood ratio (ALLR). More recently, Schones and coworkers [10] introduced two motif similarity functions, one based on the Pearson *χ* 2 test and the other on the Fisher-Irwin exact test (FIET). They showed that these two new functions have better discriminative power than the PCC and ALLR similarity functions. In addition, multiple research groups have used Kullback-Leibler divergence (KLD) to compare motifs [11–13], and Choi and coworkers [14] used euclidean distance (ED) to compare protein profiles. Finally, Sandelin and Wasserman [15] used their own column comparison function (SW) within the context of a dynamic programming alignment approach to compare DNA motifs. This method differs significantly from all other DNA-motif based approaches in the sense that it allows gaps in the motif-motif alignments.

In this report we focus on ungapped alignments of motifs. We describe a general method for accurately modeling the empirical null distribution of scores from an arbitrary, additive column comparison function. We estimate the null distribution of scores for each column in a 'query' motif using the observed scores of aligning it with each motif column in a database of 'target' motifs. Using a dynamic programming algorithm inspired by earlier work on searching a sequence database with a motif [16–18], we estimate the null distribution of the sum of scores for any range of contiguous columns in the query motif. This makes it possible for the user to determine whether the motif comparison score between the query motif and a particular target motif is statistically significant. Previous methods begin by defining a score between two motif columns, and then they combine these scores either by summing (as we do) [7–9, 14] or by taking the mean [11–13] or geometric mean [10] of the column scores. Our scoring method differs in that it computes the *P* values of the match scores for the columns of the query motif aligned with a given target motif in all possible ways (without gaps). These 'offset' *P* values are computed using the cumulative density functions estimated from the target database, as described above. The minimum *P* value among these offset *P* values is used to compute the overall *P* value of the match between the query motif and the target motif, assuming independence of the offset *P* values. This is called the 'motif' *P* value. Finally, we apply a Bonferroni correction to the motif *P* values to derive an *E* value.

This algorithm is implemented in a software tool called Tomtom, which is publicly available as part of the MEME Suite of motif analysis tools [19–21]. Tomtom can compute *E* values based on any one of seven column comparison functions: PCC, ALLR, PCS, FIET, KLD, ED, or SW. In this work, we demonstrate the accuracy of Tomtom's statistical estimates. We also validate Tomtom'smotif retrieval accuracy via a simulation experiment. The results show that, in addition to providing formal semantics for motif similarity scores, Tomtom's *P* value estimation yields improved rankings relative to *ad hoc* normalization schemes.

## Results

Based on published work and our own experience, we have defined three benchmarking protocols for ChIP-seq, HT-SELEX, and PBM data. The three protocols are available in containerized form as docker images. The protocols for ChIP-seq and HT-SELEX data are also publicly accessible via a web interface. PBM data is publicly available via UniPROBE database [12].

### Protocol for ChIP-seq peak lists

The raw data from a ChIP-seq experiment are sequence reads. The first step in the analysis of the data is the mapping of reads to the genome. The next step is peak calling resulting in a peak list containing the coordinates of bound genomic segments at a resolution of about 200 bp. Our protocol starts with peaks and further requires binding strength-related quantitative scores assigned to peaks that can be used for ranking. As it will become clear later, the protocol requires a peak to be represented by a single position. To comply with this requirement, we take either the mid-point of the peak regions or, if available, the so-called summit position from the source peak files. We use only the *N* top-scoring peaks for benchmarking, extract the surrounding genomic sequences (+/− *w* bp), and score these sequences with the PWM under investigation, using the sum occupancy score as defined in [20]. Next, we score a set of negative control sequences of the same length, taken from genomic regions located at a fixed distance *d* upstream or downstream from the positive sequences. An area under the curve for the receiver operating characteristic (AUC ROC) value is then computed from the binding scores of the two sets, supposed to reflect the PWM’s capacity to discriminate between in vivo binding and non-binding sites.

### Protocol for HT-SELEX data

This protocol is applicable to all flavors of SELEX, which enrich a random pool of DNA oligonucleotides for sequences with high affinity to a particular DNA binding protein of interest. The data from such an experiment consist of a library of DNA sequences of a constant length (typically 14–40 bp). Current high-throughput SELEX technologies produce millions of sequences per experiment. As with the ChIP-seq peak-based evaluation method, we need a negative sequence set. It can be obtained by shuffling the positive sequences. Alternatively, sequences from the input library used in the protein-DNA binding reaction may be available for this purpose. From this point on, we proceed in a similar way as with the ChIP-seq peak lists. We first compute sum occupancy scores for all sequences in both libraries. However, because SELEX libraries are often only weakly enriched with true binding sequences (Additional file 1: Fig.S1, see also Fig. 1 in [24]), we take only a top percentile of the positive and negative scores (e.g., the top 10%) for ROC AUC value computation. Importantly, before PWM scoring, we extend the random insert sequences obtained from the sequence repository with the primer and barcode sequences that were present (and thus accessible to proteins) during the SELEX experiments.

### Protocol for PBM data

To assess the performance of PWMs on in vitro PBM data from the UniPROBE database [12], Pearson correlation values between normalized log probe intensities and log sum occupancy scores (see the “Methods” section) were computed per pair of PWM and PBM experiment.

### Overview of benchmarking study

The above-described protocols were used to benchmark 4972 PWMs characterizing binding specificities of human TFs from JASPAR [7], HOCOMOCO [11], and CIS-BP [13] against 2017 ChIP-seq peak lists from ReMap [25], 547 HT-SELEX experiments from [26] and [27], and 597 PBMs from UniPROBE [12]. ReMap ChIP-Seq peak lists included only human TFs data, whereas in vitro data from HT-SELEX contained samples from both human and mouse. The PBM data sets downloaded from UniPROBE were filtered for (i) belonging to human and mouse TFs but (ii) excluding non-wildtype TFs or technical variations of the experiment, see Additional file 7 for a complete list of retained experiments. Mouse data were mapped to the orthologous human TFs for the identification of the best performing motif matrix for a given TF.

Only peak lists containing at least 5000 peaks were considered. The benchmarking with SELEX data was done twice, with top-score cut-offs of 10% and 50%, in order to account for different degrees in true binding site enrichment of individual SELEX libraries. In total, this study produced a database of over 18 million performance values for experiment-PWM combinations, which we offer as a public resource. All results presented further below are based on this resource.

### Benchmarking reveals similar binding specificity across many but not all TF structural families

The ability of a motif to recognize peaks from a ChIP-seq experiment targeted at another TF of the same structural TF family is a long-disputed issue [28]. To quantitatively assess this family-related cross-binding, we grouped the PWMs with regard to the structural class of the respective TF’s DNA binding domain as recorded in the CIS-BP [13] and TFClass databases [29]. CIS-BP provides a more compact view of the TF families, while TFClass allows separating particular subfamilies, which are of interest in huge and diverse families such as zinc finger TFs.

ReMap ChIP-Seq peak lists included only human TFs data, and HT-SELEX and PBM included human and mouse data to increase coverage of TF families. In all cases, there is a clear signal at the diagonal showing that PWMs present in the databases indeed recognize TFBS of the corresponding TFs or at least of the TFs from the same CIS-BP family. Thus, on the global scale, the benchmark results agree with expectations based on DNA binding domain classification. Interestingly, there are families, e.g., “C2H2 zinc finger factors,” the TFs of which show almost negligible average family-wide AUC ROC on the corresponding data sets, both in vivo and in vitro, which agrees with their known diversity of DNA binding specificities. The results are consistent with the use of the 50% threshold in HT-SELEX benchmarks (Additional file 1: Figure S2A). The PBM-based heatmap (Fig. 2c) is sparse due to lower representation of TF families in the available experiments.

The average performance (**а**, **b** AUC ROC **c** Pearson correlation coefficient) achieved by PWMs (rows) on particular data sets (columns). The average is taken over all binding motifs for TFs from the same family of DNA recognition domains according to the CIS-BP TF family classification and for all experiments for all TFs from the family of their DNA recognition domains. The PWMs were benchmarked on ChIP-seq (**a**), HT-SELEX 10% (**b**), and PBM (**c**) data. Only families with no less than 2 PWMs, 2 ChIP-seq, and 2 SELEX data sets are shown

### Identifying the best performing matrix for each experiment and TF

It is well known that many TFs recognize similar binding sites due to the similarity of their DNA binding domains. Yet, direct motif comparison tells little whether two TFs of the same family recognize the same motif because motif differences can relate to experimental errors or other confounding factors such as base composition inhomogeneity of the genomic context of binding sites. To test this, we identified for each experiment the best performing matrix among all available PWMs. We also extracted the globally best performing matrix for each TF, defined as the matrix with the highest aggregate rank score (see the "Methods" section) over all experiments for the same factor (different types of experiments were analyzed separately). We further distinguished between cases where the best performing matrix was for a TF from the same or a different family according to TFСlass [29]. The statistics of the best performing PWMs classified according to their place in the TFClass hierarchy is presented in Fig. 3.

Statistics about best performing matrices. Four thousand nine-hundred seventy-two matrices from JASPAR, HOCOMOCO, and CIS-BP were benchmarked on ChIP-seq, HT-SELEX, and PBM data. The “best” matrix for a TF was chosen based on its aggregate rank score over all experiments attributed to this TF (see the “Methods” section)

We observed that for most of the experiments, the best performing matrix was not attributed to the same factor, but in fact very often to a member of the same family. Quite surprising is the relatively high number of best performers coming from different TF families both for ChIP-seq and HT-SELEX experiments. However, a closer look at the results indicates that these matrices most often either are attributed to a TF family from the same structural class or show low performance on an absolute scale. In summary, the high-quality matrices tend to perform well for several TFs of the same family, sometimes even for other families of the same class.

Different PWMs for TFs from the same structural families tend to perform similarly across experiments. To explore this trend in detail for selected TF families, we applied t-SNE [30] to the complete matrix of AUC ROC values and distributed motifs on a 2D plane according to their tendency to recognize the bound regions in different data sets. The expectation was that motifs performing similarly on the same ChIP-seq experiments will be located close to each other. Indeed, the motif distribution at the t-SNE projections in general agrees with the structural classification of TFs. In Fig. 4 (see also Additional files 2, 3, 4, and 5 for interactive plots in which any family can be highlighted), each point corresponds to a PWM, and the PWMs for TFs from the selected illustrative families are colored with the same color. One can see that most of ETS and Forkhead motifs group together. On the other hand, motifs of the “factors with multiple dispersed zinc fingers <2.3.4>” are dispersed on the t-SNE plane, which agrees with low average family-wise AUC ROC values seen in Fig. 2. In contrast, motifs for three-zinc finger Krüppel-related factors <2.3.1>binding CCCCG-boxes are nicely clustered together.

PWMs grouping according to the similarity of their performance measure values across data sets. PWMs recognizing bound regions in similar selections of experiments are grouped reasonably well according to their TFClass families. Dimensionality reduction with t-SNE is applied to motifs’ performance at ChIP-seq (**a**), HT-SELEX 10% (**b**), and PBM data (**c**). For illustration, several TF families are highlighted with color. Each point corresponds to a PWM. Source coordinates are AUC ROC values (**a**, **b**) or Pearson correlation coefficients (**c**) calculated for different data sets. ‘o’ HOCOMOCO and JASPAR PWMs, ‘x’ CIS-BP PWMs

The clustering patterns for motifs for TFs from the same structural families do not depend on the type of experimental data (e.g., ChIP-seq or HT-SELEX) or on the TF motif collection (e.g., JASPAR or HOCOMOCO). Also, we did not observe any dependence on the TF classification resource used results for CIS-BP families agreed well with those for TFClass, see Additional files 2, 3, 4, and 5.

For most of the TFs, the best performing matrix comes from the same structural family. Yet, there exist some exceptions, which can be readily visualized by plotting motif-experiment best-performance pairs at alluvial plots.

Figure 5 shows many cases of individual TFs with the best average performance obtained for TFs for other structural classes for ChIP-seq (A), HT-SELEX 10% (B), and PBM (C) benchmarks (for HT-SELEX 50%, see Additional file 1: Fig.S3). This is particularly true for C2H2 zinc fingers, many PWMs of which often recognize TF binding peaks for different TF families. Interestingly, in most other cases, ChIP-seq peaks or HT-SELEX oligos were recognized by TFs belonging to the same TF family. Most of the cross-family recognition for C2H2 zinc fingers are likely to be explained by random errors in motif building or, in the case of ChIP-seq, by TF-TF interactions, where the target TF is bound to DNA through another TF acting as a mediator. Another cross-family example is given by a low-complexity (probably incorrect) polyG TBX15 motif of T-box factors that achieves the best performance for several other TFs in the HT-SELEX benchmark only. This might be an artifact arising from the nucleotide composition of the HT-SELEX input control libraries that were used as a negative control set.

Alluvial plots illustrating the performance of PWMs from particular CIS-BP TF families in ChIP-seq (**a**), SELEX 10% (**b**), and PBM (**c**) benchmarks. For each TF, PWMs with the highest average AUC ROC (**a**, **b**) or Pearson correlation coefficient (**c**) across the data sets for this TF are used to construct the links. PWMs grouped by CIS-BP TF family are shown on the left TFs are shown on the right. The link width corresponds to the number of TFs. Only TFs with at least one ChIP-seq data set and one HT-SELEX data set are included. For illustration, selected motif families are highlighted with color

The cross-family recognition patterns become much noisier when all the cases reaching an average 0.75 AUC ROC or average 0.3 Pearson correlation are taken into account rather than the best average performers (Fig. 6 and Additional file 1: Fig.S3B). In this setting, cross-family prediction rates increase significantly, suggesting that partial overlaps in the consensus sequences allow reaching significantly higher-than-random recognition quality both when using in vitro and in vivo data and thus highlighting the need for comparative assessment of multiple motifs.

Alluvial plots illustrating the performance of PWMs from particular CIS-BP TF families in ChIP-seq (**a**), SELEX 10% (**b**), and PBM (**c**) benchmarks. For each TF, PWMs displaying the average AUC ROC of no less than 0.75 (**a**, **b**) or Pearson correlation coefficient of no less than 0.3 (**c**) across the data sets for this TF are selected for link construction. PWMs grouped by CIS-BP TF family are shown on the left TFs are shown on the right. The link width is proportional to the square root of the number of appropriate PWM-TF pairs. Only TFs with at least one ChIP-seq data set and one HT-SELEX data set are included. For illustration, selected motif families are highlighted with color

### Making a small, non-redundant, quality-filtered PWM library

Recently introduced high-throughput TF binding assays have generated a wealth of data, from which DNA binding specificity models can be built. As a result, the number of TF binding matrices has exploded, and fighting redundancy has become an issue of great practical importance. If the binding specificity of several TFs can be explained by the same motif, it will be possible to significantly reduce the number of binding specificity models while still covering the same number of TFs.

One way of making a non-redundant motif library for TFs is by choosing the best performing matrices for each TF (Fig. 7). This approach reduces the number of motifs more than a tenfold from 4972 to 414 (339 if only high-quality matrices with AUC ROC > 0.75 or Pearson correlation coefficient > 0.35 are considered) while increasing the average predictive performance for each TF at the same time.

Statistics on the best performing TF motif matrices. “Best performance per gene” means globally best performance over all corresponding ChIP-seq, HT-SELEX (top 10%), and PBM experiments in terms of aggregate rank scores (see the “Methods” section) over all corresponding experiments. The qualifier “filtered” relates to the numbers obtained when we only considered experiments for which at least one matrix achieved a ROC AUC value > 0.75 (ChIP-seq, HT-SELEX) or a Pearson correlation coefficient > 0.35 (PBM). The first three bar plots show the numbers for individual motif collections analyzed separately, whereas the last plot at the bottom shows the numbers obtained when all three collections were considered simultaneously

As SELEX data are qualitative (binding versus non-binding) and PBM data quantitative (relative binding affinities of different sequences), we were wondering whether matrices performing well in a classification-based test were able to predict normalized signal intensities from a PBM experiment. Examining several cases, where a TF has been assayed by at least two different techniques, and multiple matrices were available for the same TF, led us to conclude that this is indeed the case. An example is presented in (Additional file 1: Fig.S4) where the ROC AUC values of 12 different PWMs for FOXJ3 obtained on HT-SELEX data were compared to corresponding correlation coefficients obtained on PBM data. The two number series are highly concordant with a correlation coefficient of 0.78.

The alternative approach to reduce redundancy is exemplified by the widely used methods for identifying representative motifs by clustering. Motif clustering is a usual procedure in many applications. In motif activity response analysis [31], TFs with similar binding specificities bring about mathematical difficulties, so it is necessary to reduce the set of considered motifs to a subset of dissimilar motifs. As a result of motif clustering, a representative motif is often selected or constructed for a set of similar known motifs. Binding specificities of untested TFs can be predicted by the similarity of DNA binding domains [32].

For motif clustering, we used a benchmark-blind approach similar to that used in [33], with PWMs from HOCOMOCO and JASPAR databases. The systematic comparison of performance metrics allowed us to assess the effectiveness of the two different strategies, selecting the best representative motif for a motif similarity cluster or selecting the best performing motif within a TF structural family. In Fig. 8, we show the applicability of representative motifs for the problem of binding site recognition in a selected TFClass structural family on three examples: Ets-related factors <3.5.2>, Forkhead box (FOX) factors <3.3.1>, and factors with multiple dispersed zinc fingers <2.3.4>.

Violin plots of AUC ROC values obtained on ChIP-seq data sets for TFs of a particular TFClass family by PWMs belonging to TFs of the same family and representative motifs of the family selected using the PWM clustering-by-similarity. **a** Ets-related factors, **b** Forkhead box (FOX) factors, and **c** factors with multiple dispersed zinc fingers. All AUC ROC values are obtained using data sets of TFs from the selected family. The first 4 violins of each plot show AUC ROC values of (1) all PWMs of TFs from the selected family, (2) the best PWM (with the highest average AUC ROC) from the family, (3) the best PWM from all tested, and (4) family representatives obtained by the motif similarity clustering. The next violins of each plot show AUC ROC values achieved by particular representative PWMs belonging to the family and selected from the motif clustering by similarity

The performance of the family-specific matrices selected by different approaches is visualized as violin plots. It turned out that “the best from the structural family” or “the best globally” matrices generally outperform “the representatives from family” matrices derived by clustering. Indeed, Fig. 8 displays that while some representative motifs (e.g., in FOX or ETS families) displayed good TFBS recognition, close to the best available PWMs, the other had mediocre or even “random-guess” (AUC ROC of 0.5) performance. There are many more representative PWMs for dispersed zinc finger proteins, which can be explained from their diversity (see Figs. 2 and 4). Not surprisingly, in this case, none of the representative PWMs is able to properly recognize TFBS across the entire family.

The high level of within-family cross-recognition is exemplified in Table 1 where a single PWM (MA0028.2 from JASPAR) is the best predictor for multiple TFs from the ETS family. In JASPAR, this DNA motif is assigned to ELK1. All TFs, for which this matrix reaches AUC ROC > 0.75 in at least one ChIP-seq or HT-SELEX experiment, are shown. An exclamation sign indicates that this matrix is the overall best performer for the corresponding factor (see table legend). Regarding ChIP-seq in vivo experiments, this matrix is the best performer for as many as 10 different TFs, only 5 of which are ETS family members. Surprisingly, it is also the best performing matrix for 5 unrelated TFs, including BRCA1. Note further that the list of top-ranked TFs includes three histone-modifying proteins (JMJDC1, SUZ12, KDM5A) and two cycline-dependent kinases (CDK9, CDK7). We suspect that these unrelated factors could be recruited to their target binding sites through protein-protein interactions with a DNA-bound ETS factor.

The corresponding list for HT-SELEX experiments looks very different. There, ELK1 appears on the second place of the list, preceded and followed exclusively by other members of the ETS family. JASPAR MA0028.2 is the best performing matrix for ETV4 and ELK4, but surprisingly not for ELK1, where it is outperformed by JASPAR MA0765.1 assigned to ETV5 (AUC ROC 0.9998). The extremely high-performance values for most HT-SELEX data sets suggest that the majority of ETS members have indistinguishable DNA binding specificity. Also with PBM data, this matrix shows good performance only for ETS family members.

### Comprehensive benchmarking helps clarify the molecular meaning of binding motifs

PWMs from public resources are typically assigned to a single TF (single polypeptide chain) identified by a gene symbol. This TF usually corresponds to the target of the antibody used in the experiment. However, with regard to motifs derived from ChIP-seq data, the role of the motif in the recruitment of the TF to target sites remains essentially unknown. It is possible that the motif represents the binding specificity of another TF, which tethers the TF targeted by the experiment to the ChIP-seq peak regions through protein-protein interactions. Several such cases are well documented by experiments. Two cases where performance reports help clarify these respective roles of dissimilar motifs in TF recruiting are presented in Tables 2 and 3.

Table 2 shows a comparison between two alternative matrices for STAT1 from HOCOMOCO. We first note that the motif logos are quite different, the one on the left side being near palindromic, and the one on the right side being composed of tandem repeats. Regarding ChIP-seq experiments, we note that the former reaches high-performance values (AUC ROC > 0.75) only for members of the STAT family, whereas the latter shows good performance for the unrelated proteins IRF1/2, ZNF384, SPIB, and SPI1. Regarding HT-SELEX data, the tandem-repeat motif performs best on an experiment targeted at IRF8. The HT-SELEX results for the palindromic motif are not congruent with the ChIP-seq results. This is however not surprising as the HT-SELEX collection does not include any data set for a STAT family member.

Overall, the results are consistent with models derived from experimental studies: (i) Upon alpha-interferon induction, STAT1 forms a complex with STAT2 and IRF9, which binds to a target DNA motif called ISRE [23] (ii) IRF8 was found to form a complex with SPIB, which then binds to an Ets/IRF composite element [34]. Taking together, these previous findings and our results suggest that the STAT1 matrix, whose logo is displayed on the right side, likely represents the intrinsic DNA specificity of 2 or 3 IRF family proteins arranged in tandem fashion.

Table 3 shows benchmarking results for two alternative NANOG matrices. This is another well-known example of alternative TF-to-DNA recruitment pathways for the same TF. Nanog reportedly can bind indirectly to DNA by associating with a SOX2-POU5F1 (Oct4) dimer bound to a composite motif alternatively, Nanog can also directly bind to DNA via its own homeo domain [35, 36]. The results shown in Table 3 suggest that the matrix on the left side represents the indirect binding mode and the matrix on the right side the direct one. The composite motif matrix shows good performance for all three TFs on corresponding ChIP-seq data. It also performs well on HT-SELEX data for POU5F1 and on PBM data for SOX-related proteins. The matrix on the right side shows good performance on ChIP-seq data for general promoter-binding factors (NFY, SP1/2) and (with somewhat lower AUC ROC values) for Nanog itself. Unfortunately, the HT-SELEX and PBM collections do not include data for Nanog. However, the fact that the top-ranked HT-SELEX experiment, and 4 out of 10 top-ranked PBM experiments are assigned to a TF from the same homeodomain subfamily as Nanog (NK-related), speaks in favor of the hypothesis that this matrix represents the intrinsic DNA binding specificity of Nanog. In contrast, the majority of the top-ranked HT-SELEX experiments for the other motif relate to POU domain factors.

Recently, a modified HT-SELEX protocol has been published [37], enabling the characterization of TF heterodimers. We were wondering whether the hypothesized indirect binding target of Nanog mediated by a SOX2-POU5F1 dimer was supported by new data generated with this technology. This is indeed the case. The authors of the above-cited paper were able to extract a motif for a closely related complex, a SOX2-POU2F1 TF pair (Additional file 1: Fig.S5), which closely resembles the sequence logo shown on the left side of Table 3.

## 3 ALGORITHMS

### 3.1 Calculating the P-value for a word motif

We first give a simple algorithm for the case of *k* = 1 (i.e. the motif hits the region at least once) and then sketch an algorithm for the general *k*. The algorithms and ideas for computing the optimal spaced seeds (Keich *et al.*, 2004 Ma *et al.*, 2002) can be readily borrowed and generalized to our situation.

The basic idea is to calculate a series of conditional probabilities instead of the target probability. The conditional probabilities can be calculated by DP and the number of such probabilities is polynomial to *n*. We will give the formal definition of the conditional probability first and then constrain the domain of the parameters of the conditional probability.

We next determine the domain of *w*. Clearly, if *w* is not a prefix of *m*, Pr (*m* = *R*[*i,i* +| *m*| − 1] |*w* is a prefix of *R*[*i,n*]) = 0, which means that *m* cannot start at *R*[*i*]. We can see that, when *w*[*j*,|*w*|−1] is not a prefix of *m, m* cannot start at *R*[*j*] which has a prefix *w*[*j*,|*w*|−1]. So we have to find the longest suffix of *w* which is also a prefix of *m*. This leads us to the following definition.

*Let P (m) be the set of all prefixes of a word m. For any string w, let S_{m}(w) denote the longest suffix of w which is in P(m).*

For example, if *m* = *ACCAC* and *w* = *CCAC*, then *S _{m}*(

*w*) =

*AC*which is a suffix of

*w*and a prefix of

*m*. It is the longest string that satisfies the two requirements simultaneously.

Algorithm 1 Dynamic programming for k = 1 . |
---|

Input: A motif word m over Ω = <A,C,G,T>, the stationary probability I_{1} × 4 and transition matrix T_{4} × 4 of a first-order Markov region R. |

Output: Probability that m hits R at least once. |

1: Calculate map , ∀ w ∈ P(m) and c ∈ Ω, g(w,c) = S(_{m}wc) using suffix tree |

2: fori = n → 1 do |

3: forw ∈ P(m) from longest to shortest do |

4: if |w| = 0 then |

5: calculate F(i,q), ∀ q ∈ Ω |

6: else |

7: calculate f (i, w) |

8: end if |

9: end for |

10: end for |

11: output ∑_{q ∈ Ω}I(q)F(1,q) |

Algorithm 1 Dynamic programming for k = 1 . |
---|

Input: A motif word m over Ω = <A,C,G,T>, the stationary probability I_{1} × 4 and transition matrix T_{4} × 4 of a first-order Markov region R. |

Output: Probability that m hits R at least once. |

1: Calculate map , ∀ w ∈ P(m) and c ∈ Ω, g(w,c) = S(_{m}wc) using suffix tree |

2: fori = n → 1 do |

3: forw ∈ P(m) from longest to shortest do |

4: if |w| = 0 then |

5: calculate F(i,q), ∀ q ∈ Ω |

6: else |

7: calculate f (i, w) |

8: end if |

9: end for |

10: end for |

11: output ∑_{q ∈ Ω}I(q)F(1,q) |

Algorithm 1 Dynamic programming for k = 1 . |
---|

Input: A motif word m over Ω = <A,C,G,T>, the stationary probability I_{1} × 4 and transition matrix T_{4} × 4 of a first-order Markov region R. |

Output: Probability that m hits R at least once. |

1: Calculate map , ∀ w ∈ P(m) and c ∈ Ω, g(w,c) = S(_{m}wc) using suffix tree |

2: fori = n → 1 do |

3: forw ∈ P(m) from longest to shortest do |

4: if |w| = 0 then |

5: calculate F(i,q), ∀ q ∈ Ω |

6: else |

7: calculate f (i, w) |

8: end if |

9: end for |

10: end for |

11: output ∑_{q ∈ Ω}I(q)F(1,q) |

Algorithm 1 Dynamic programming for k = 1 . |
---|

Input: A motif word m over Ω = <A,C,G,T>, the stationary probability I_{1} × 4 and transition matrix T_{4} × 4 of a first-order Markov region R. |

Output: Probability that m hits R at least once. |

1: Calculate map , ∀ w ∈ P(m) and c ∈ Ω, g(w,c) = S(_{m}wc) using suffix tree |

2: fori = n → 1 do |

3: forw ∈ P(m) from longest to shortest do |

4: if |w| = 0 then |

5: calculate F(i,q), ∀ q ∈ Ω |

6: else |

7: calculate f (i, w) |

8: end if |

9: end for |

10: end for |

11: output ∑_{q ∈ Ω}I(q)F(1,q) |

We show how to calculate *f* (*i,w*) in Algorithm 2. *F*(*i, q*) can be calculted similarly

Algorithm 2 Calculate f (i, w) . |
---|

1: if |m| > n+1−ithen |

2: f(i,w) = 0 |

3: else ifw = mthen |

4: f(i,w) = 1 |

5: else |

6: forc ∈ Ω do |

7: r = _{c}T[w_{−1}][c] which is the transition probability from w_{−1} to c, where w_{−1} denotes the last character of w |

8: i′ = i + |wc| − | S(_{m}wc)| |

9: if |S(_{m}wc)| = 0 then |

10: f = _{c}F(i′,c) |

11: else |

12: f = _{c}f(i′,S(_{m}wc)) |

13: end if |

14: end for |

15: f(i,w) = ∑_{c ∈ Ω}f × _{c}r_{c} |

16: end if |

Algorithm 2 Calculate f (i, w) . |
---|

1: if |m| > n+1−ithen |

2: f(i,w) = 0 |

3: else ifw = mthen |

4: f(i,w) = 1 |

5: else |

6: forc ∈ Ω do |

7: r = _{c}T[w_{−1}][c] which is the transition probability from w_{−1} to c, where w_{−1} denotes the last character of w |

8: i′ = i + |wc| − | S(_{m}wc)| |

9: if |S(_{m}wc)| = 0 then |

10: f = _{c}F(i′,c) |

11: else |

12: f = _{c}f(i′,S(_{m}wc)) |

13: end if |

14: end for |

15: f(i,w) = ∑_{c ∈ Ω}f × _{c}r_{c} |

16: end if |

Algorithm 2 Calculate f (i, w) . |
---|

1: if |m| > n+1−ithen |

2: f(i,w) = 0 |

3: else ifw = mthen |

4: f(i,w) = 1 |

5: else |

6: forc ∈ Ω do |

7: r = _{c}T[w_{−1}][c] which is the transition probability from w_{−1} to c, where w_{−1} denotes the last character of w |

8: i′ = i + |wc| − | S(_{m}wc)| |

9: if |S(_{m}wc)| = 0 then |

10: f = _{c}F(i′,c) |

11: else |

12: f = _{c}f(i′,S(_{m}wc)) |

13: end if |

14: end for |

15: f(i,w) = ∑_{c ∈ Ω}f × _{c}r_{c} |

16: end if |

Algorithm 2 Calculate f (i, w) . |
---|

1: if |m| > n+1−ithen |

2: f(i,w) = 0 |

3: else ifw = mthen |

4: f(i,w) = 1 |

5: else |

6: forc ∈ Ω do |

7: r = _{c}T[w_{−1}][c] which is the transition probability from w_{−1} to c, where w_{−1} denotes the last character of w |

8: i′ = i + |wc| − | S(_{m}wc)| |

9: if |S(_{m}wc)| = 0 then |

10: f = _{c}F(i′,c) |

11: else |

12: f = _{c}f(i′,S(_{m}wc)) |

13: end if |

14: end for |

15: f(i,w) = ∑_{c ∈ Ω}f × _{c}r_{c} |

16: end if |

The recursion formulae here are

In other words, we use the conditional probabilities in a smaller subregion with fewer hits and a prefix constructed from *w* to calculate *f* (*j*) (*i,w*).

Algorithm 3 Computing P-value of a motif word . |
---|

1: Calculate the map g using suffix tree |

2: fori = n → 1 do |

3: forj = 1 → kdo |

4: forw ∈ P(m) from longest to shortest do |

5: Calculate f (j) (i,w) |

6: end for |

7: end for |

8: end for |

Algorithm 3 Computing P-value of a motif word . |
---|

1: Calculate the map g using suffix tree |

2: fori = n → 1 do |

3: forj = 1 → kdo |

4: forw ∈ P(m) from longest to shortest do |

5: Calculate f (j) (i,w) |

6: end for |

7: end for |

8: end for |

Algorithm 3 Computing P-value of a motif word . |
---|

1: Calculate the map g using suffix tree |

2: fori = n → 1 do |

3: forj = 1 → kdo |

4: forw ∈ P(m) from longest to shortest do |

5: Calculate f (j) (i,w) |

6: end for |

7: end for |

8: end for |

Algorithm 3 Computing P-value of a motif word . |
---|

1: Calculate the map g using suffix tree |

2: fori = n → 1 do |

3: forj = 1 → kdo |

4: forw ∈ P(m) from longest to shortest do |

5: Calculate f (j) (i,w) |

6: end for |

7: end for |

8: end for |

The details of calculating *f* (*j*) (*i,w*) are the same as Algorithm 1 and 2 with appropriate modification of the recursion formulae. The complete time complexity analysis will be given in Section 4.2.

### 3.2 Calculating P-value for multiple-word motif

The basic idea is similar to Algorithm 1. We just extend all the definitions in Section 3.1 from a single word to a set of words. Two main definitions are conditional probability and prefix set.

The definition of is similar and we omit it here. The prefix set of *M* is the union of the prefix sets of all the motif words in *M*, that is . The recursion formulae for multiple-word motif are Replacing *P*(*m*), *f* (*j*) (*i,w*) and *F* (*j*) (*i,q*) by *P*(*M*), and in algorithms in Section 3.1, we can get the DP algorithm to calculate *P*-value for multiple-word motif. We will analyze time and space complexity of the algorithm in Section 4.2.

if *w* matches any word in *M*,

is a prefix of *R*[*i, n*]))

is a prefix of *R*[*i,n*]))

### 3.3 Computing one PWM motif hit probability

We will prove that calculating *P*-value for a PWM motif over a Markov region is NP-hard in Section 4.1. This explains the prevalence of approximation and exponential algorithms in the field. We present a DP algorithm by constructing a multiple-words motif from a matrix motif with threshold *c*_{0} and then use the algorithms described in Section 3.2.

Given a threshold *c*_{0} and a motif *m* represented by PWM *P*, we say a string of length |*m*| is *compatible* with *P* if its motif score is at least *c*_{0}. The main idea of the algorithm is to rank the letters in each position of the motif in decreasing order of score.

In position *i* of motif *m*, call the highest score letter *m*_{i,0}, the second most *m*_{i,1}, the third most *m*_{i,2} and the lowest score letter *m*_{i,3}. So each position independently ranks the letters from 0 through 3. We enumerate sequences of length *l* = |*m*| in rank-lexicographic order, skipping incompatible strings whenever possible. In particular, if a string *s* = *m*_{1,s1}*m*_{2,s2} … *m*_{l,sl} is compatible with *m*, then its successor, the next string to be considered, is simply the rank-lexicographically next string *m*_{1},*s*_{1} … *m*_{i−1,si−1}*m*_{i,si + 1}*m*_{i + 1,0} … *m*_{l,0}, where *i* is the largest position with letter rank less than 3. If, on the other hand, *s* is not compatible with *m*, then, letting *j* be the largest position with non-zero letter rank, we skip all 4 *l* − *j* sequences of the form *m*_{1,s1} … *m*_{j,sj}*m*_{j + 1,*} … *m*_{l,*}. These sequences can obviously score no better than *s* and are therefore also incompatible. The successor of *s* is then *m*_{1,s1} … *m*_{i−1,si−1}*m*_{i,si + 1}*m*_{i + 1,0} … *m*_{l,s0}, with *i* the largest position before *j* with letter rank less than 3. Since no *l* = |*m*| consecutive successors can be incompatible, we can enumerate the set of compatible strings in time *O*(*K* × |*m*|), where *K* is the size of the set of all strings compatible with *m*. If we consider the double-stranded structure of DNA, we have to add the sequences reserve-complementary to those found.

Then we can use the algorithm for multiple-word motifs to calculate the *P*-value. Although it is an exponential time algorithm, in real biological applications *c*_{0} is often large enough for MotifRank to be practical.

## DISCUSSION

Direct DNA binding by transcription factors can be inferred from the central enrichment of predicted binding sites in ChIP-seq peak regions. The CentriMo algorithm provides visual and statistical information that facilitate this inference. A combination of central enrichment features appears to be informative: (1) low *P* -value (2) narrow region of optimal enrichment (<100 bp) (3) high maximal site-probability (4) unimodal site-probability curve (e.g. not like ACAAWRS in Figure 2 a). Using these features, we have extracted strong evidence about the *in vivo* DNA-binding affinity of NFIC, Nanog, E2f1 and Smad1 from ChIP-seq data that is not apparent without considering central enrichment. CentriMo is an important new tool for the biologist studying ChIP-seq data.

Central motif enrichment analysis can be applied to ChIP-seq data in several contexts. Firstly, it can be used to evaluate motifs reported by motif discovery algorithms to determine the most likely candidates for direct DNA binding in a ChIP-seq experiment. Secondly, it can be used as a motif enrichment tool in conjunction with databases of known motifs, especially in cases where motif discovery algorithms fail. (In this application, CentriMo has the advantage relative to other MEA approaches of not requiring selection of a set of control sequences, since the flanking regions provide the control.) Finally, it can be used to investigate the *in vivo* validity of *in vitro* motifs from the literature, as well as variations of those motifs, as we do here for Nanog. All of this suggests, of course, that central enrichment can be applied directly in a motif discovery algorithm, and we are currently developing such algorithms.

The CentriMo algorithm resembles the earlier ‘position-analysis’ algorithm ( 29 , 30 ). That algorithm detects words that show a heterogeneous distribution of occurrences within a set of input sequences. CentriMo generalizes this approach to predicting TF binding sites represented by a PWM (rather than individual words), and specializes it by confining the search to enrichment of a central region. ‘Position-analysis’ counts all occurrences of a given word in each evenly-spaced bin of a preselected size (typically 50 bp), and detects deviation from a uniform background using a Chi-squared test. This contrasts with the approach of CentriMo, which allows each sequence to contribute at most one predicted binding site, making the statistical model significantly simpler. Unlike ‘position-analysis’, CentriMo does not require a bin size because it considers central regions of all size and computes the enrichment *P* -value for each using the binomial test.

CentriMo also bears some similarity to our recent SpaMo algorithm ( 2 ). The SpaMo algorithm detects enriched ‘spacings’ in ChIP-seq regions between predicted binding sites using *two* PWMs. In a typical application of SpaMo, one of the PWMs represents the DNA-binding affinity of the ChIP-ed TF, and the other a suspected co-factor. In each ChIP-seq region, SpaMo first predicts a binding site for the first PWM. It then aligns the regions on this binding site, and then predicts a binding site using the second PWM. SpaMo counts the number of times the predicted sites have each possible order, strand orientation and distance, and applies a binomial test to each of the combinations to detect significantly enriched spacings. CentriMo functions analogously, but the situation is simpler because the strand matched by the (single) PWM does not matter, nor does order (which side of the center a predicted site is located).

The distribution of distances between binding sites predicted by motifs and ChIP-seq peaks has been used previously to compare the quality of ChIP-seq peak-calling algorithms ( 31 ). In the current work, we use the distribution of distances for a completely distinct task—for comparing the quality of motifs as candidate descriptions of the direct DNA binding of the ChIP-ed TF. It is possible that Gerstein *et al.* ( 32 ) had something similar in mind with the ‘localization tests’ mentioned in the Supplement to that paper. However, they do not give details of their approach for computing the distance distributions or for computing their statistical significance, which is an essential facet of the analysis method we describe. To the best of our knowledge, the algorithm we describe here, and, especially, the associated methodology for inferring direct DNA binding, are both novel.

## References

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.

Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z: GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009, 10: 48-10.1186/1471-2105-10-48.

Ragle-Aure M, Steinfeld I, Baumbusch LO, Liestøl K, Lipson D, Nyberg S, Naume B, Sahlberg KK, Kristensen VN, Børresen-Dale A-L, Lingjærde OC, Yakhini Z: Identifying in-trans process associated genes in breast cancer by integrated analysis of copy number and expression data. PLoS ONE. 2013, 8: e53014-10.1371/journal.pone.0053014.

Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe’er D: An integrated approach to uncover drivers of cancer. Cell. 2010, 143: 1005-1017. 10.1016/j.cell.2010.11.013.

Dehan E, Ben-Dor A, Liao W, Lipson D, Frimer H, Rienstein S, Simansky D, Krupsky M, Yaron P, Friedman E, Rechavi G, Perlman M, Aviram-Goldring A, Izraeli S, Bittner M, Yakhini Z, Kaminski N: Chromosomal aberrations and gene expression profiles in non-small cell lung cancer. Lung Cancer. 2007, 56: 175-184. 10.1016/j.lungcan.2006.12.010.

Al-Shahrour F, Díaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics. 2004, 20: 578-580. 10.1093/bioinformatics/btg455.

Leibovich L, Yakhini Z: Efficient motif search in ranked lists and applications to variable gap motifs. Nucleic Acids Res. 2012, 40: 5832-5847. 10.1093/nar/gks206.

Leibovich L, Paz I, Yakhini Z, Mandel-Gutfreund Y: DRIMust: a web server for discovering rank imbalanced motifs using suffix trees. Nucleic Acids Res. 2013, 41: W174-W179. 10.1093/nar/gkt407.

Steinfeld I, Navon R, Ach R, Yakhini Z: miRNA target enrichment analysis reveals directly active miRNAs in health and disease. Nucleic Acids Res. 2013, 41: e45-e45. 10.1093/nar/gks1142.

Enerly E, Steinfeld I, Kleivi K, Leivonen S-K, Ragle-Aure M, Russnes HG, Rønneberg JA, Johnsen H, Navon R, Rødland E, Mäkelä R, Naume B, Perälä M, Kallioniemi O, Kristensen VN, Yakhini Z, Børresen-Dale A-L: miRNA-mRNA integrated analysis reveals roles for miRNAs in primary breast tumors. PLoS ONE. 2011, 6: e16915-10.1371/journal.pone.0016915.

Plis SM, Weisend MP, Damaraju E, Eichele T, Mayer A, Clark VP, Lane T, Calhoun VD: Effective connectivity analysis of fMRI and MEG data collected under identical paradigms. Comput Biol Med. 2011, 41: 1156-1165. 10.1016/j.compbiomed.2011.04.011.

Eden E, Lipson D, Yogev S, Yakhini Z: Discovering motifs in ranked lists of DNA sequences. PLoS Comput Biol. 2007, 3: e39-10.1371/journal.pcbi.0030039.

Steinfeld I, Navon R, Ardigò D, Zavaroni I, Yakhini Z: Clinically driven semi-supervised class discovery in gene expression data. Bioinformatics. 2008, 24: i90-i97. 10.1093/bioinformatics/btn279.

Straussman R, Nejman D, Roberts D, Steinfeld I, Blum B, Benvenisty N, Simon I, Yakhini Z, Cedar H: Developmental programming of CpG island methylation profiles in the human genome. Nat Struct Mol Biol. 2009, 16: 564-571. 10.1038/nsmb.1594.

Lee B-K, Bhinge AA, Iyer VR: Wide-ranging functions of E2F4 in transcriptional activation and repression revealed by genome-wide analysis. Nucleic Acids Res. 2011, 39: 3558-3573. 10.1093/nar/gkq1313.

Rhee Ho S, Pugh BF: Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011, 147: 1408-1419. 10.1016/j.cell.2011.11.013.

Lebedeva S, Jens M, Theil K, Schwanhäusser B, Selbach M, Landthaler M, Rajewsky N: Transcriptome-wide analysis of regulatory interactions of the RNA-binding protein HuR. Molecular Cell. 2011, 43: 340-352. 10.1016/j.molcel.2011.06.008.

Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M, Jungkamp A-C, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptome-wide identification of RNA-binding protein and MicroRNA target sites by PAR-CLIP. Cell. 2010, 141: 129-141. 10.1016/j.cell.2010.03.009.

Staden R: Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984, 12: 505-519. 10.1093/nar/12.1Part2.505.

Stormo GD, Schneider TD, Gold L: Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res. 1986, 14: 6661-6679. 10.1093/nar/14.16.6661.

Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.

Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotech. 2005, 23: 137-144. 10.1038/nbt1053.

Sinha S: On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics. 2006, 22: e454-e463. 10.1093/bioinformatics/btl227.

Abramowitz M, Stegun IA: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. 1964, New York: Dover Publications, Inc.

Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn. 1995, 21: 51-80.

Bailey TL: DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011, 27: 1653-1659. 10.1093/bioinformatics/btr261.

Luehr S, Hartmann H, Söding J: The XXmotif web server for eXhaustive, weight matriX-based motif discovery in nucleotide sequences. Nucleic Acids Res. 2012, 40: W104-W109. 10.1093/nar/gks602.

Smeenk L, van Heeringen SJ, Koeppel M, van Driel MA, Bartels SJJ, Akkers RC, Denissov S, Stunnenberg HG, Lohrum M: Characterization of genome-wide p53-binding sites upon stress response. Nucleic Acids Res. 2008, 36: 3639-3654. 10.1093/nar/gkn232.

Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne J-B, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431: 99-104. 10.1038/nature02800.

Hogan DJ, Riordan DP, Gerber AP, Herschlag D, Brown PO: Diverse RNA-binding proteins interact with functionally related sets of RNAs. Suggesting an extensive regulatory system. PLoS Biol. 2008, 6: e255-10.1371/journal.pbio.0060255.

Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL: Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011, 25: 1915-1927. 10.1101/gad.17446611.

Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen C-y, Chou A, Ienasescu H, Lim J, Shyr C, Tan G, Zhou M, Lenhard B, Sandelin A, Wasserman WW: JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2013, 42: D142-D147.

Yang J-H, Li J-H, Jiang S, Zhou H, Qu L-H: ChIPBase: a database for decoding the transcriptional regulation of long non-coding RNA and microRNA genes from ChIP-Seq data. Nucleic Acids Res. 2013, 41: D177-D187. 10.1093/nar/gks1060.

Gupta S, Stamatoyannopoulos J, Bailey T, Noble W: Quantifying similarity between motifs. Genome Biol. 2007, 8: R24-10.1186/gb-2007-8-2-r24.

Brandeis M, Frank D, Keshet I, Siegfried Z, Mendelsohn M, Names A, Temper V, Razin A, Cedar H: Sp1 elements protect a CpG island from de novo methylation. Nature. 1994, 371: 435-438. 10.1038/371435a0.

Bert SA, Robinson MD, Strbenac D, Statham AL, Song JZ, Hulf T, Sutherland RL, Coolen MW, Stirzaker C, Clark SJ: Regional activation of the cancer genome by long-range epigenetic remodeling. Cancer Cell. 2013, 23: 9-22. 10.1016/j.ccr.2012.11.006.

Nejman D, Straussman R, Steinfeld I, Ruvolo M, Roberts D, Yakhini Z, Cedar H: Molecular rules governing de novo methylation in cancer. Cancer Res. 2014, 74: 1475-1483. 10.1158/0008-5472.CAN-13-3042.

Kubosaki A, Tomaru Y, Tagami M, Arner E, Miura H, Suzuki T, Suzuki M, Suzuki H, Hayashizaki Y: Genome-wide investigation of in vivo EGR-1 binding sites in monocytic differentiation. Genome Biol. 2009, 10: R41-10.1186/gb-2009-10-4-r41.

McLeay R, Bailey T: Motif enrichment analysis: a unified framework and an evaluation on ChIP data. BMC Bioinformatics. 2010, 11: 165-10.1186/1471-2105-11-165.

Frank DE, Saecker RM, Bond JP, Capp MW, Tsodikov OV, Melcher SE, Levandoski MM, Record MT: Thermodynamics of the interactions of lac repressor with variants of the symmetric lac operator: effects of converting a consensus site to a non-specific site. J Mol Biol. 1997, 267: 1186-1206. 10.1006/jmbi.1997.0920.

Benos PV, Lapedes AS, Stormo GD: Is there a code for protein-DNA recognition? Probab(ilistical)ly. Bioessays. 2002, 24: 466-475. 10.1002/bies.10073.

D’heaseleer, P.: What are DNA sequence motifs. Natl. Biotechnol. **24**(4), 423–425 (2006)

Latchman, D.S.: Transcription Factors: A Practical Approach. Oxford University Press, Oxford (1993)

Wu, B., et al.: Identify target genes involved in transcription factor GCF2 that promotes cell migration in tumor cell BEL-7404. Genomics Appl. Biol. **34**(1), 35–40 (2015)

Haruka, O., Wataru, I.: MOCCS: clarifying DNA-binding motif ambiguity using ChIP-Seq data. Comput. Biol. Chem. **63**, 62–72 (2016)

Bussemaker, H.J., Li, H., Siggia, E.D.: Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA **97**(18), 10096–10100 (2000)

Sinha, S., Tompa, M.: Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. **30**(24), 5549–5560 (2002)

Sinha, S., Tompa, M.: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. **31**(13), 3586–3588 (2003)

Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D.: Approaches to the automatic discovery of patterns in biosequences. J. Comput. Biol. **5**, 279–305 (1998)

Du, Y.H., Wang, Z.Z.: Review on computational prediction of transcription factor blinding sites. Life Sci. Res. **10**(2), 24–31 (2006)

Li, T.T., Jiang, B., Wang, X.W.: Tutorial for computational analysis of transcription factor binding sites. Acta Biophys. Sin. **24**(5), 334–347 (2008)

Hertz, G., Stormo, G.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics **15**(7–8), 563–577 (1999)

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M., Kumar, S.: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. **28**, 2731–2739 (2011)

Lawrence, C., Altschul, S.H.: Combinatorial approaches to finding subtle signals in DNA sequence. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB-2000), pp. 269–278. AAAI Press, San Diego (2000)

Neuwald, A.F., Liu, J.S., Lawrence, C.E.: Gibbs motif sampling: detection of bacterial outer membrane repeats. Protein Sci. **4**(8), 1618–1632 (1995)

Surujon, D., Ratner, D.I.: Use of a probabilistic motif search to identify histidine phosphotransfer domain-containing proteins. PLoS ONE **11**, 1–18 (2016)

Stine, M.: Motif discovery in upstream sequences of coordinately expressed genes. In: Proceedings of the CEC’03, pp. 1596–1603. [s. n.], Memphis (2003)

Liu, F.F.M.: FMGA: finding motifs by genetic algorithm. In: Proceedings of the BIBE’04, pp. 459–466. IEEE Press, Taichung (2004)

Che, D.S.: MDGA: motif discovery using a genetic algorithm. In: Proceedings of the Conference on Genetic and Evolutionary Computation, pp. 447–452. [s. n.], Washington D.C. (2005)

Congdon, C.B.: Preliminary results for GAMI: a genetic algorithms approach to motif inference. In: Proceedings of the Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 1–8. IEEE Press, [S. l.] (2005)

Paul, T.K., Iba, H.: Identification of weak motifs in multiple biological sequences using genetic algorithm. In: Proceedings of the GECCO’06, pp. 271–278. [s. n.], Seattle (2006)

Zhang, F., Tan, J., Xie, J.B.: Comparison, analysis and optimization of motif finding based on different algorithms. Comput. Eng. **35**(22), 94–96 (2009)

Watson, J.D., Crick, F.H.C.: A structure for DNA. Nature **171**, 737–738 (1953)

Vaidyanathan, P.P.: Genomics and proteomics: a signal processor’s tour. Circuits Syst. **4**(4), 6–29 (2004)

Lenhard, B., Wasserman, W.W.: TFBS: computational framework for transcription factor binding sites analysis. Bioinform. Appl. Note **18**(8), 1135–1136 (2002)

Hou, L., Qian, M.P., Zhu, Y.P.: Advances on bioinformatic research in transcription factor binding sites. HEREDITAS **31**(4), 365–373 (2009)

Waterman, M.S., Arratia, R., Galas, D.J.: Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol. **46**, 515–527 (1984)

Hertz, G.Z., Stormo, G.D.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics **15**, 563–577 (1999)

Crooks, G.E., Hon, G., Chandonia, J.M., et al.: Web Logo: a sequence logo generator. Genome Res. **14**, 1188–1190 (2004)

Schuster, B., Schultz, J., Rahmann, S.: HMM logos for visualization of protein families. BMC Bioinform. **5**, 7 (2004)

Kok, W.Y., Oon, Y.B., Lee, N.K.: Perception enhancement using visual attributes in sequence motif visualization. BioRxiv **31**, 1–8 (2016). doi:10.1101/066928

Tang, Z.G., Yang, B.R., Yang, J.: New outlier detection algorithm based on Markov chain. Syst. Eng. Electron. **32**(12), 2721–2724 (2010)

Hughes, J., Estep, P., Tavazoie, S., Church, G.: Computational identification of Cis-regulatory elements associated with groups of functionally related genes in *Saccharomyces cerevisiae*. J. Mol. Biol. **296**(5), 1205–1214 (2000)

Martin, T., Nan, L., et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. **23**, 137–144 (2005)

Zhou, Qingyuan: Research on heterogeneous data integration model of group enterprise based on cluster computing. Clust. Comput. **19**(3), 1275–1282 (2016)

Zhou, Q., Luo, J.: Artificial neural network based grid computing of E-government scheduling for emergency management. Comput. Syst. Sci. Eng. **30**(5), 327–335 (2015)

Xu, Z., Zhang, H., Hu, C., Mei, L., Xuan, J., Choo, K.R., Sugumaran, V., Zhu, Y.: Building knowledge base of urban emergency events based on crowdsourcing of social media. Concurr. Comput.: Pract. Exp. **28**(15), 4038–4052 (2016)

Xu, Z., Zhang, H., Sugumaran, V., Choo, K.R., Mei, L., Zhu, Y.: Participatory sensing-based semantic and spatial analysis of urban emergency events using mobile social media. EURASIP J. Wireless Commun. Netw. **2016**, 44 (2016)

Xu, Z., Hu, C., Mei, L.: Video structured description technology based intelligence analysis of surveillance videos for public security applications. Multimedia Tools Appl. **75**(19), 12155–12172 (2016)

Xu, Z., Wei, X., Liu, Y., Mei, L., Hu, C., Choo, K.R., Zhu, Y., Sugumaran, V.: Building the search pattern of web users using conceptual semantic space model. IJWGS **12**(3), 328–347 (2016)

Xu, Z., Mei, L., Hu, C., Liu, Y.: The big data analytics and applications of the surveillance system using video structured description technology. Clust. Comput. **19**(3), 1283–1292 (2016)

## Related work

Many simple motif extraction algorithms have been proposed primarily for extracting the transcription factor binding sites, where each motif consists of a unique binding site [4–10] or two binding sites separated by a fixed number of gaps [11–13]. A pattern with a single component is also called a *monad pattern*. Structured motif extraction problems, in which variable number of gaps are allowed, have attracted much attention recently, where the structured motifs can be extracted either from multiple sequences [14–21] or from a single sequence [22, 23]. In many cases, more than one transcription factor may cooperatively regulate a gene. Such patterns are called *composite regulatory patterns*. To detect the composite regulatory patterns, one may apply single binding site identification algorithms to detect each component separately. However, this solution may fail when some components are not very strong (significant). Thus it is necessary to detect the whole composite regulatory patterns (even with weak components) directly, whose gaps and other possibly strong components can increase its significance.

Several algorithms have been used to address the composite pattern discovery with two components, which are called *dyad patterns*. Helden et al. [11] propose a method for dyad analysis, which exhaustively counts the number of occurrences of each possible pair of patterns in the sequences and then assesses their statistical significance. This method can only deal with fixed number of gaps between the two components. MITRA [12] first casts the composite pattern discovery problem as a larger monad discovery problem and then applies an exhaustive monad discovery algorithm. It can handle several mismatches but can only handle sequences less than 60 kilo-bases long. Co-Bind [24] models composite transcription factors with Position Weight Matrices (PWMs) and finds PWMs that maximize the joint likelihood of occurrences of the two binding site components. Co-Bind uses Gibbs sampling to select binding sites and then refines the PWMs for a fixed number of times. Co-Bind may miss some binding sites since not all patterns in the sequences are considered. Moreover, using a fixed number of iterations for improvement may not converge to the global optimal dyad PWM.

SMILE [14] describes four variants of increasing generality for common structured motif extraction, and proposes two solutions for them. The two approaches for the first problem, in which the structured motif template consists of two components with a gap range between them, both start by building a generalized suffix tree for the input sequences and extracting the first component. Then in the first approach, the second component is extracted by simply jumping in the sequences from the end of the first one to the second within the gap range. In the second approach, the suffix tree is temporarily modified so as to extract the second component from the modified suffix tree directly. The drawback of SMILE is that its time and space complexity are exponential in the number of gaps between the two components. In order to reduce the time during the extraction of the structured motifs, [18] presents a parallel algorithm, PSmile, based on SMILE, where the search space is well-partitioned among the available processors.

RISO [15–17] improves SMILE in two aspects. First, instead of building the whole suffix tree for the input sequences, RISO builds a suffix tree only up to a certain level *l*, called a *factor tree*, which leads to a large space saving. Second, a new data structure called *box-link* is proposed to store the information about how to jump within the DNA sequences from one simple component (box) to the subsequent one in the structured motif. This accelerates the extraction process and avoids exponential time and space consumption (in the gaps) as in SMILE. In RISO, after the generalized factor tree is built, the box-links are constructed by exhaustively enumerating all the possible structured motifs in the sequences and are added to the leaves of the factor tree. Then the extraction process begins during which the factor tree may be temporarily and partially modified so as to extract the subsequent simple motifs. Since during the box-link construction, the structured motif occurrences are exhaustively enumerated and the frequency threshold is never used to prune the candidate structured motifs, RISO needs a lot of computation during this step.

For repeated structured motif identification problem, the frequency closure property that "all the subsequences of a frequent sequence must be frequent", doesn't hold any more since the frequency of a pattern can exceed the frequency of its sub-patterns. [22] introduces an closure-like property which can help prune the patterns without missing the frequent patterns. The two algorithms proposed in [22] can extract within one sequence all frequent patterns of length no greater than a length threshold, which can be either manually specified or automatically determined. However, this method requires that all the gap ranges [*l* _{i}, *u* _{i}], between adjacent *symbols* in the structured motif be the same, i.e., [*l* _{i}, *u* _{i}] = [*l*, *u*] for all *i* ∈ [1, *k* - 1]. Moreover, approximate matches are not allowed for the structured motif.

## Abstract

Cellular development, morphology and function are governed by precise patterns of gene expression. These are established by the coordinated action of genomic regulatory elements known as enhancers or *cis*-regulatory modules. More than 30 years after the initial discovery of enhancers, many of their properties have been elucidated however, despite major efforts, we only have an incomplete picture of enhancers in animal genomes. In this Review, we discuss how properties of enhancer sequences and chromatin are used to predict enhancers in genome-wide studies. We also cover recently developed high-throughput methods that allow the direct testing and identification of enhancers on the basis of their activity. Finally, we discuss recent technological advances and current challenges in the field of regulatory genomics.