Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison

Majid Kazemian; Qiyun Zhu; Marc S. Halfon; Saurabh Sinha

doi:10.1093/nar/gkr621

Research output: Contribution to journal › Article › peer-review

27 Scopus citations

Abstract

Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.
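For orientation, the standard hypergeometric gene set enrichment test that the abstract names as its starting point can be sketched as follows. This is a minimal illustration of the textbook test only. It does not implement the authors' extension for variable intergenic lengths, and the function and parameter names are hypothetical, chosen for this example.

```python
from math import comb


def hypergeom_enrichment_pvalue(total_genes, annotated_genes, selected_genes, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    Hypothetical parameter names (not taken from the paper):
      total_genes     -- size of the gene universe
      annotated_genes -- genes in the universe with the expected expression pattern
      selected_genes  -- genes neighboring the predicted CRMs
      overlap         -- selected genes that also carry the annotation
    """
    if not 0 <= annotated_genes <= total_genes:
        raise ValueError("annotated_genes must lie between 0 and total_genes")
    if not 0 <= selected_genes <= total_genes:
        raise ValueError("selected_genes must lie between 0 and total_genes")

    # The overlap can never exceed either set size.
    max_overlap = min(annotated_genes, selected_genes)

    # Number of ways to draw the selected set from the universe.
    denominator = comb(total_genes, selected_genes)

    # Sum the probability mass for every overlap at least as large as observed.
    # math.comb returns 0 for impossible draws, so no extra bounds check is needed.
    tail = sum(
        comb(annotated_genes, k) * comb(total_genes - annotated_genes, selected_genes - k)
        for k in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Toy numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
    # The tail sums to (36 + 4) / 120.
    print(hypergeom_enrichment_pvalue(10, 4, 3, 2))
```

A small p-value means the genes next to the predicted CRMs carry the expected expression pattern more often than a random draw of the same size would. Because this plain version treats every gene as equally likely to be selected, it ignores the intergenic length variability that the abstract's extended test is designed to address.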

Original language	English (US)
Pages (from-to)	9463-9472
Number of pages	10
Journal	Nucleic Acids Research
Volume	39
Issue number	22
DOIs	https://doi.org/10.1093/nar/gkr621
State	Published - Dec 2011
Externally published	Yes

ASJC Scopus subject areas

Genetics

Cite this

@article{4b7cf91c6092410e83527e94b8c4e862,

title = "Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison",

abstract = "Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.",

author = "Majid Kazemian and Qiyun Zhu and Halfon, {Marc S.} and Saurabh Sinha",

note = "Funding Information: Funding for open access charge: National Institutes of Health (1R01GM085233-01 to S.S. and M.S.H.).",

year = "2011",

month = dec,

doi = "10.1093/nar/gkr621",

language = "English (US)",

volume = "39",

pages = "9463--9472",

journal = "Nucleic Acids Research",

issn = "0305-1048",

publisher = "Oxford University Press",

number = "22",

}

TY - JOUR

T1 - Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison

AU - Kazemian, Majid

AU - Zhu, Qiyun

AU - Halfon, Marc S.

AU - Sinha, Saurabh

N1 - Funding Information: Funding for open access charge: National Institutes of Health (1R01GM085233-01 to S.S. and M.S.H.).

PY - 2011/12

Y1 - 2011/12

N2 - Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.

AB - Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.

UR - http://www.scopus.com/inward/record.url?scp=80053321076&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053321076&partnerID=8YFLogxK

U2 - 10.1093/nar/gkr621

DO - 10.1093/nar/gkr621

M3 - Article

C2 - 21821659

AN - SCOPUS:80053321076

SN - 0305-1048

VL - 39

SP - 9463

EP - 9472

JO - Nucleic Acids Research

JF - Nucleic Acids Research

IS - 22

ER -
