MotifHyades

In higher eukaryotes, protein-DNA binding interactions are central to gene regulation, and DNA motifs such as transcription factor binding sites are key elements of those interactions. Taking advantage of recent high-throughput chromatin interaction data, a preliminary study was conducted to reveal the coupling DNA motif pairs on chromatin interactions in human K562 cells. However, that study relied on an ad hoc computational pipeline that was not originally designed for DNA motif pair discovery.

Therefore, we have developed a probabilistic model, MotifHyades, for DNA motif pair discovery on paired sequences. Under comprehensive simulation scenarios, MotifHyades is demonstrated to be more accurate than the previous ad hoc computational pipeline for DNA motif pair discovery. Most importantly, we have run MotifHyades on the previous human K562 dataset and discovered thousands of novel DNA motif pairs, which have a higher TomTom matching ratio, higher DNase accessibility, and higher evolutionary conservation than the previously discovered DNA motif pairs.

Downloads

Windows 64-bit

Download MotifHyades Executables and Demo Dataset

Download MCR Runtime (2014b)

Linux 64-bit

Download MotifHyades Executables and Demo Dataset

Download MCR Runtime (2013a)

For source code and potential collaboration, please email here

If you encounter the Ghostscript error on Linux, another Linux version (compiled with MCR R2016a) that produces no image output is available via this link. You may also need to install MCR R2016a from the official MATLAB website.

The DNA motif pairs discovered on the long-range promoter-enhancer pairs on chromatin interactions in six human cell types (i.e. K562, GM12878, HeLa-S3, HUVEC, IMR90, and NHEK) [1] using MotifHyades can be downloaded via this link.

[1] Whalen, Sean, Rebecca M. Truty, and Katherine S. Pollard. "Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin." Nature genetics 48.5 (2016): 488-496.

Command Line Usage

MotifHyades inFastaFilePath1 inFastaFilePath2 [maxMotifWidth1] [maxMotifWidth2] [numOfMotifPairs] [threshold] [maxIterations]

Input Arguments:
inFastaFilePath1	The first input sequence file path (example: SequenceSet1.fasta)
inFastaFilePath2	The second input sequence file path (example: SequenceSet2.fasta)
maxMotifWidth1	Maximal first motif length (default: 15)
maxMotifWidth2	Maximal second motif length (default: 15)
numOfMotifPairs	Number of motif pairs (default: 2)
threshold	The tolerance used for testing convergence (default: 0.05)
maxIterations	Maximal number of EM iterations (default: 100)
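
The optional arguments are positional, and the examples on this page always supply them in order. A script that launches many runs can therefore assemble the argument list as in the Python sketch below. It assumes that an optional argument can only be given when all optional arguments before it are given too. The names build_arguments and OPTIONAL_ORDER are illustrative and not part of MotifHyades.

```python
# Illustrative helper (not part of MotifHyades): assemble the positional
# argument list in the order given in the usage line above.
OPTIONAL_ORDER = [
    "maxMotifWidth1",
    "maxMotifWidth2",
    "numOfMotifPairs",
    "threshold",
    "maxIterations",
]


def build_arguments(fasta1, fasta2, **options):
    """Return the argument list for one MotifHyades run.

    Assumption: optional arguments are positional, so an argument can only
    be given when all optional arguments before it are given too.
    """
    unknown = set(options) - set(OPTIONAL_ORDER)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")

    args = [fasta1, fasta2]
    seen_gap = False
    for name in OPTIONAL_ORDER:
        if name in options:
            if seen_gap:
                raise ValueError(f"{name} requires all earlier optional arguments")
            args.append(str(options[name]))
        else:
            seen_gap = True
    return args
```

For example, build_arguments("SequenceSet1.fasta", "SequenceSet2.fasta", maxMotifWidth1=20, maxMotifWidth2=20, numOfMotifPairs=3) yields the five arguments used in the second example of each platform, while passing only threshold raises an error instead of silently shifting the value into the maxMotifWidth1 position.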

Output Files:
All motif pair information is written to the MotifHyades_results folder.
All motif pair images are written to the MotifHyades_images folder.

Microsoft Windows 64-bit examples:
C:\> MotifHyades <argument_list>
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta 20 20 3
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta 20 20 3 0.05 100

Linux 64-bit examples:
>./run_MotifHyades.sh <mcr_directory> <argument_list>
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta 20 20 3
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta 20 20 3 0.05 100
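
The Windows and Linux invocations above differ only in the launcher and, on Linux, the leading MCR directory. The Python sketch below builds the full command for either platform. The function build_command is hypothetical, and it assumes that the MotifHyades executable (Windows) or run_MotifHyades.sh (Linux) is in the current working directory, as in the examples.

```python
import subprocess
import sys


def build_command(arguments, mcr_directory=None, platform=sys.platform):
    """Return the full command line for the given platform.

    Assumption: the MotifHyades executable (Windows) or run_MotifHyades.sh
    (Linux) sits in the current working directory, as in the examples above.
    """
    if platform.startswith("win"):
        # Windows: the executable takes the argument list directly.
        return ["MotifHyades"] + list(arguments)
    if mcr_directory is None:
        raise ValueError("the Linux launcher needs the MCR directory")
    # Linux: the MCR directory comes first, then the argument list.
    return ["./run_MotifHyades.sh", mcr_directory] + list(arguments)


def run(arguments, mcr_directory=None):
    """Launch one run and raise an error if it exits with a non-zero status."""
    subprocess.run(build_command(arguments, mcr_directory), check=True)
```

Calling run(["SequenceSet1.fasta", "SequenceSet2.fasta", "20", "20", "3"], mcr_directory="/mathworks/home/application/v80") from the download directory would correspond to the third Linux example; the MCR path is the placeholder used above and must be replaced with the local installation path.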

FAQ

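What is MotifHyades?

MotifHyades is a probabilistic model, fitted by expectation maximization, for de novo DNA motif pair discovery on paired sequences.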

What is MCR?

MCR stands for MATLAB Compiler Runtime. If your machine does not have MATLAB installed, you need to install MCR to execute MotifHyades. MCR can be downloaded from the official MATLAB website. We advise installing the same version indicated in the "Downloads" section.

Is there any demo?

A small testing dataset (SequenceSet1.fasta and SequenceSet2.fasta) is zipped with the MotifHyades executables in the "Downloads" section. Once it is downloaded and unzipped, change your current directory to it and type "MotifHyades SequenceSet1.fasta SequenceSet2.fasta" to run a demo on the testing dataset, which contains 2 DNA motif pairs to be discovered from 100 sequence pairs. After the run, you will see the 2 DNA motif pairs discovered by MotifHyades; details can be found in the generated result folders "MotifHyades_images" and "MotifHyades_results". In the comparison figure, the 2 DNA motif pairs discovered by MotifHyades (right picture) match the actual 2 DNA motif pairs (left picture) well, except for the order of discovery. A different order is normal in de novo motif pair discovery because the pair order is not known in advance.
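
To make concrete what a paired dataset with planted motif pairs looks like, the Python sketch below generates 100 sequence pairs, each carrying one of 2 motif pairs. It is not the generator used for the bundled testing dataset. The motif strings, the sequence length, the exact (non-degenerate) planting, and the convention that the i-th record of the first file pairs with the i-th record of the second file are all assumptions made for illustration.

```python
import random

# Hypothetical motif pairs, chosen only for illustration.
MOTIF_PAIRS = [("TGACTCA", "GGGCGGG"), ("CACGTG", "TATAAA")]


def plant(motif, length, rng):
    """Return a random DNA sequence of the given length containing motif."""
    background = [rng.choice("ACGT") for _ in range(length)]
    start = rng.randrange(length - len(motif) + 1)
    background[start:start + len(motif)] = motif
    return "".join(background)


def make_pairs(num_pairs=100, length=50, seed=0):
    """Return two equally long lists; entry i of each list forms one pair."""
    rng = random.Random(seed)
    first, second = [], []
    for i in range(num_pairs):
        # Alternate between the motif pairs so each appears equally often.
        motif1, motif2 = MOTIF_PAIRS[i % len(MOTIF_PAIRS)]
        first.append(plant(motif1, length, rng))
        second.append(plant(motif2, length, rng))
    return first, second


def write_fasta(path, sequences):
    """Write the sequences as FASTA records named seq1, seq2, ..."""
    with open(path, "w") as handle:
        for index, sequence in enumerate(sequences, start=1):
            handle.write(f">seq{index}\n{sequence}\n")
```

Writing the two lists returned by make_pairs() to two separate files with write_fasta gives a pair of inputs in the same spirit as the testing dataset, which can be useful for checking that a local installation recovers known motif pairs.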

More data?

Public genome annotation data can be accessed through the ENCODE consortium and the Gene Expression Omnibus (GEO).

Contact

For any enquiry, please search online for "Ka-Chun Wong" (for instance, with Google Search or Baidu Search) to find contact details.

Reference

Ka-Chun Wong: MotifHyades: Expectation Maximization for de novo DNA Motif Pair Discovery on Paired Sequences. Bioinformatics (2017)

Funding Support

We would like to thank Amazon Web Services (AWS) for providing cloud credits for the software development.