In higher eukaryotes, protein-DNA binding interactions are central to gene regulation, and DNA motifs such as transcription factor binding sites are key elements of those interactions. Taking advantage of recent high-throughput chromatin interaction data, a preliminary study revealed coupling DNA motif pairs on chromatin interactions in human K562 cells. However, that study relied on an ad hoc computational pipeline that was not originally designed for DNA motif pair discovery.
Therefore, we have developed a probabilistic model, MotifHyades, for DNA motif pair discovery on paired sequences. Across comprehensive simulation scenarios, MotifHyades is shown to be more accurate than the previous ad hoc pipeline for DNA motif pair discovery. Most importantly, we ran MotifHyades on the same human K562 dataset and discovered thousands of novel DNA motif pairs that have a higher TomTom matching ratio, higher DNase accessibility, and higher evolutionary conservation than the previously discovered DNA motif pairs.
For source code and potential collaboration, please email here.
If you encounter a Ghostscript error on Linux, an alternative Linux build (compiled against MCR R2016a) that produces no image output is available via this link. You may also need to install MCR R2016a from the official MATLAB website.
The DNA motif pairs discovered on the long-range promoter-enhancer pairs on chromatin interactions in six human cell types (i.e. K562, GM12878, HeLa-S3, HUVEC, IMR90, and NHEK) [1] using MotifHyades can be downloaded via this link.
[1] Whalen, Sean, Rebecca M. Truty, and Katherine S. Pollard. "Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin." Nature Genetics 48.5 (2016): 488-496.
MotifHyades inFastaFilePath1 inFastaFilePath2 [maxMotifWidth1] [maxMotifWidth2] [numOfMotifPairs] [threshold] [maxIterations]
Input Arguments:
inFastaFilePath1 | The first input sequence file path (example: SequenceSet1.fasta)
inFastaFilePath2 | The second input sequence file path (example: SequenceSet2.fasta)
maxMotifWidth1 | Maximum length of the first motif (default: 15)
maxMotifWidth2 | Maximum length of the second motif (default: 15)
numOfMotifPairs | Number of motif pairs (default: 2)
threshold | The tolerance used for testing convergence (default: 0.05)
maxIterations | Maximum number of EM iterations (default: 100)
Output Files:
All motif pair information is written to the MotifHyades_results folder.
All motif pair images are written to the MotifHyades_images folder.
Microsoft Windows 64-bit examples:
C:\> MotifHyades <argument_list>
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta 20 20 3
C:\> MotifHyades SequenceSet1.fasta SequenceSet2.fasta 20 20 3 0.05 100
Linux 64-bit examples:
>./run_MotifHyades.sh <mcr_directory> <argument_list>
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta 20 20 3
>./run_MotifHyades.sh /mathworks/home/application/v80 SequenceSet1.fasta SequenceSet2.fasta 20 20 3 0.05 100
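The optional arguments appear to be positional, since the usage line lists them in a fixed order and the examples always supply them left to right. If that is the case, overriding a later argument such as threshold means also supplying every optional argument before it. The Python sketch below builds such an argument list and fills the earlier slots with the documented defaults. The helper name `build_motifhyades_args` is hypothetical and not part of MotifHyades; only the argument order and default values are taken from the argument table.

```python
# Hypothetical helper, not shipped with MotifHyades.
# Assumption: optional arguments are positional, in the order of the usage line.
OPTIONAL_ARGS = [
    ("maxMotifWidth1", 15),
    ("maxMotifWidth2", 15),
    ("numOfMotifPairs", 2),
    ("threshold", 0.05),
    ("maxIterations", 100),
]


def build_motifhyades_args(fasta1, fasta2, **overrides):
    """Return the argument list that follows the executable name."""
    known = {name for name, _ in OPTIONAL_ARGS}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown argument(s): {sorted(unknown)}")

    # Index of the right-most optional argument the caller overrides.
    last = -1
    for i, (name, _) in enumerate(OPTIONAL_ARGS):
        if name in overrides:
            last = i

    args = [fasta1, fasta2]
    # Emit every slot up to that index, using defaults where not overridden.
    for name, default in OPTIONAL_ARGS[: last + 1]:
        args.append(str(overrides.get(name, default)))
    return args
```

For example, `build_motifhyades_args("SequenceSet1.fasta", "SequenceSet2.fasta", numOfMotifPairs=3)` yields the two file names followed by `"15"`, `"15"`, `"3"`. The resulting list can be appended to `["MotifHyades"]` on Windows or to `["./run_MotifHyades.sh", mcr_directory]` on Linux and passed to `subprocess.run`.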
MCR stands for MATLAB Compiler Runtime. If your machine does not have MATLAB installed, you need to install MCR to run MotifHyades. MCR can be downloaded from the official MATLAB website; we advise you to download the same version indicated in the "Downloads" section.
A small testing dataset (SequenceSet1.fasta and SequenceSet2.fasta) is zipped with the MotifHyades executables in the "Downloads" section. Once it is downloaded, change your current directory to the unzipped folder and type "MotifHyades SequenceSet1.fasta SequenceSet2.fasta" to run a MotifHyades demo on the testing dataset, which contains 2 DNA motif pairs to be discovered from 100 sequence pairs. After the run, you will see the 2 DNA motif pairs discovered by MotifHyades. Details can be found in the generated result folders "MotifHyades_images" and "MotifHyades_results"; for instance, the following is a comparison between the actual 2 DNA motif pairs (left picture) and those discovered by MotifHyades (right picture). As you can see, the 2 DNA motif pairs discovered by MotifHyades (right picture) match the actual answer (left picture) well, except for the discovery order. A difference in order is normal in de novo motif pair discovery because the pair order is not known in advance.
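MotifHyades works on paired sequences, so the two input FASTA files presumably need the same number of records, with the i-th record of the first file paired with the i-th record of the second. That pairing-by-order convention is an assumption, as it is not stated explicitly here. Under that assumption, a minimal Python pre-flight check could look like the sketch below; the function names `read_fasta` and `check_paired` are hypothetical and not part of MotifHyades.

```python
# Hypothetical pre-flight check, not shipped with MotifHyades.
# Assumption: record i in file 1 is paired with record i in file 2.

def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_paired(path1, path2):
    """Return the number of sequence pairs, or raise if the counts differ."""
    n1, n2 = len(read_fasta(path1)), len(read_fasta(path2))
    if n1 != n2:
        raise ValueError(f"{path1} has {n1} sequences but {path2} has {n2}")
    return n1
```

If the assumption holds, calling `check_paired("SequenceSet1.fasta", "SequenceSet2.fasta")` on the bundled demo files should report the 100 sequence pairs mentioned above.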
Public genome annotation data can be accessed through the ENCODE consortium and the Gene Expression Omnibus (GEO).
For any enquiry, please search for and contact "Ka-Chun Wong" using an internet search engine such as Google Search or Baidu Search.
We would like to thank Amazon Web Services (AWS) for providing cloud credits for the software development.