MotifHyades


In higher eukaryotes, protein-DNA binding interactions are central to gene regulation, and DNA motifs such as transcription factor binding sites are key elements of those interactions. Taking advantage of recent high-throughput chromatin interaction data, a preliminary study was conducted to reveal coupled DNA motif pairs on chromatin interactions in human K562 cells. However, that study relied on an ad hoc computational pipeline that was not originally designed for DNA motif pair discovery.

Therefore, we have developed a probabilistic model, MotifHyades, for DNA motif pair discovery on paired sequences. Under comprehensive simulation scenarios, MotifHyades is demonstrated to be more accurate than the previous ad hoc computational pipeline for DNA motif pair discovery. Most importantly, we have run MotifHyades on the previous human K562 dataset and discovered thousands of novel DNA motif pairs, which have a higher TomTom matching ratio, higher DNase accessibility, and higher evolutionary conservation than the previously discovered DNA motif pairs.

Downloads

For source code and potential collaboration, please email us (see the "Contact" section).

The DNA motif pairs discovered on the long-range promoter-enhancer pairs on chromatin interactions in six human cell types (i.e. K562, GM12878, HeLa-S3, HUVEC, IMR90, and NHEK) [1] using MotifHyades can be downloaded via this link.

[1] Whalen, Sean, Rebecca M. Truty, and Katherine S. Pollard. "Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin." Nature Genetics 48.5 (2016): 488-496.

Command Line Usage

MotifHyades inFastaFilePath1 inFastaFilePath2 [maxMotifWidth1] [maxMotifWidth2] [numOfMotifPairs] [threshold] [maxIterations]

Input Arguments:
inFastaFilePath1 Path of the first input sequence file (example: SequenceSet1.fasta)
inFastaFilePath2 Path of the second input sequence file (example: SequenceSet2.fasta)
maxMotifWidth1 Maximum length of the first motif (default: 15)
maxMotifWidth2 Maximum length of the second motif (default: 15)
numOfMotifPairs Number of motif pairs (default: 2)
threshold Tolerance used for testing convergence (default: 0.05)
maxIterations Maximum number of EM iterations (default: 100)
   
Output Files:
All motif pair information is written to the MotifHyades_results folder.
All motif pair images are written to the MotifHyades_images folder.
 
Microsoft Windows 64-bit examples:
C:\> MotifHyades <argument_list>
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta 20 20 3
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta 20 20 3 0.05 100
 
Linux 64-bit examples:
>./run_MotifHyades.sh <mcr_directory> <argument_list>
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta 20 20 3
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta 20 20 3 0.05 100
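
When MotifHyades is called from a script, the command lines above can be assembled programmatically. The following is a minimal Python sketch, not part of the MotifHyades distribution: the function name build_motifhyades_command and its parameters are hypothetical, and it assumes that the optional arguments are purely positional, as the usage line and examples suggest (so setting a later argument requires supplying all earlier ones).

```python
from typing import List, Optional, Sequence


def build_motifhyades_command(
    fasta1: str,
    fasta2: str,
    optional_args: Sequence[object] = (),
    mcr_directory: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one MotifHyades run (hypothetical helper).

    optional_args holds up to five positional values, in the documented order:
    maxMotifWidth1, maxMotifWidth2, numOfMotifPairs, threshold, maxIterations.
    If mcr_directory is given, the Linux launcher form is produced;
    otherwise the Windows form is produced.
    """
    if len(optional_args) > 5:
        raise ValueError("MotifHyades documents at most five optional arguments")

    if mcr_directory is None:
        # Windows form: MotifHyades <argument_list>
        prefix = ["MotifHyades"]
    else:
        # Linux form: ./run_MotifHyades.sh <mcr_directory> <argument_list>
        prefix = ["./run_MotifHyades.sh", mcr_directory]

    return prefix + [fasta1, fasta2] + [str(value) for value in optional_args]


if __name__ == "__main__":
    # Print the Linux command line for 20/20 motif widths and 3 motif pairs.
    command = build_motifhyades_command(
        "SequenceSet1.fasta",
        "SequenceSet2.fasta",
        optional_args=(20, 20, 3),
        mcr_directory="/mathworks/home/application/v80",
    )
    print(" ".join(command))
```

The returned list can be passed to subprocess.run(command, check=True) from the directory that contains the MotifHyades executables.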

FAQ

What is MotifHyades?

MotifHyades is a probabilistic model for DNA motif pair discovery on paired sequences.

What is MCR?

MCR stands for MATLAB Compiler Runtime. If your machine does not have MATLAB installed, you need to install MCR to run MotifHyades. MCR can be downloaded from the MathWorks website. We advise you to download the same version indicated in the "Downloads" section.

Is there a demo?

By default, a small testing dataset (SequenceSet1.fasta and SequenceSet2.fasta) is zipped with the MotifHyades executables in the "Downloads" section. Once it is downloaded, change your current directory to the unzipped folder and type "MotifHyades SequenceSet1.fasta SequenceSet2.fasta" to run a MotifHyades demo on the testing dataset, which has 2 DNA motif pairs to be discovered from 100 sequence pairs. After the run, you will see the 2 DNA motif pairs discovered by MotifHyades. Details can be found in the generated result folders "MotifHyades_images" and "MotifHyades_results"; for instance, the following is a comparison between the actual 2 DNA motif pairs (left picture) and those discovered by MotifHyades (right picture). The 2 DNA motif pairs discovered by MotifHyades match the actual answer well, except for the discovery order. A different order is normal in de novo motif pair discovery, because the pair order is not known in advance.
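
Before running MotifHyades on your own data, it can help to confirm that the two FASTA files describe the same number of sequences, as the demo dataset does with its 100 sequence pairs. The Python sketch below is only an illustration and is not part of MotifHyades: the function names are hypothetical, and it assumes that the i-th record of the first file is paired with the i-th record of the second file, which this page does not state explicitly.

```python
def count_fasta_records(path: str) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


def check_paired_fasta(path1: str, path2: str) -> int:
    """Return the number of sequence pairs, or raise if the counts differ.

    Assumption (not confirmed by the MotifHyades documentation): records
    are paired by their order of appearance in the two files.
    """
    n1 = count_fasta_records(path1)
    n2 = count_fasta_records(path2)
    if n1 != n2:
        raise ValueError(
            f"{path1} has {n1} records but {path2} has {n2}; "
            "paired input needs equal counts"
        )
    return n1


if __name__ == "__main__":
    # Check the demo dataset from the current directory.
    pairs = check_paired_fasta("SequenceSet1.fasta", "SequenceSet2.fasta")
    print(f"{pairs} sequence pairs found")
```

The check only counts header lines; it does not validate the sequence content or the alphabet.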

More data?

Public genome annotation data can be accessed through the ENCODE consortium and the Gene Expression Omnibus (GEO).

Contact

For any enquiry, please search for "Ka-Chun Wong" with a web search engine (for instance, Google Search or Baidu Search) to find the contact details.

Funding Support

We would like to thank Amazon Web Services (AWS) for providing cloud credits for the software development.