Ensemble Framework for Motif Discovery Based on Data Partitioning

Computational DNA motif prediction is a challenging problem because motifs are short, degenerate, and associated with ill-defined features. With advances in genome-wide ChIP analysis technology, computational motif discovery tools are needed to effectively tackle the large-scale datasets...

Bibliographic Details
Main Author: Choong, Allen Chieng Hoon
Format: Thesis
Language: English
Published: 2020
Subjects: Q Science (General)
Online Access: http://ir.unimas.my/id/eprint/31761/1/Ensemble%20Framework%20for%20Motif%20Discovery%20Based%20on%20Data%20Partitioning%20-%2024%20pgs.pdf
http://ir.unimas.my/id/eprint/31761/4/Allen%20Choong%20Chieng%20Hoon%20ft.pdf
id my-unimas-ir.31761
record_format uketd_dc
spelling my-unimas-ir.31761 2023-04-18T02:30:16Z Ensemble Framework for Motif Discovery Based on Data Partitioning 2020-09-11 Choong, Allen Chieng Hoon Q Science (General) Universiti Malaysia Sarawak (UNIMAS) 2020-09 Thesis http://ir.unimas.my/id/eprint/31761/ http://ir.unimas.my/id/eprint/31761/1/Ensemble%20Framework%20for%20Motif%20Discovery%20Based%20on%20Data%20Partitioning%20-%2024%20pgs.pdf text en public http://ir.unimas.my/id/eprint/31761/4/Allen%20Choong%20Chieng%20Hoon%20ft.pdf text en validuser phd doctoral Universiti Malaysia Sarawak (UNIMAS) Faculty of Cognitive Sciences and Human Development
institution Universiti Malaysia Sarawak
collection UNIMAS Institutional Repository
language English
topic Q Science (General)
description Computational DNA motif prediction is a challenging problem because motifs are short, degenerate, and associated with ill-defined features. With advances in genome-wide ChIP analysis technology, computational motif discovery tools are needed to tackle large-scale datasets effectively for motif search. Ensembles of DNA motif discovery methods are among the most successful approaches for motif discovery. Nevertheless, most existing ensemble methods cannot perform motif searches on ChIP datasets because the classical tools they employ accept only limited input sizes. An ensemble approach not only uses the results of classical motif discovery tools but also combines them to produce better results, and the merging algorithm contributes to the prediction accuracy of the discovered motifs.
The primary contribution of this thesis is the development of an ensemble method called ENSPART, whose novelty lies in applying a data partitioning technique to ChIP datasets for DNA motif prediction. The idea is to reduce the search space by partitioning the input dataset into subsets and processing each subset separately with an ensemble of classical motif discovery tools. Then, using a proposed merging algorithm, the candidate motifs are merged regardless of their differing lengths.
Three experiments are conducted. In the first, ChIP datasets are downloaded to evaluate the performance of ENSPART using Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) as performance metrics. ENSPART is compared with the genome-wide motif discovery tools MEME-ChIP, ChIPMunk, and RSAT peak-motifs using the partitioning technique. The results demonstrate that ENSPART performs significantly better than MEME-ChIP and RSAT peak-motifs on both metrics. In the second, another set of datasets is gathered and sampled without partitioning. ENSPART is compared with its employed classifiers: AMD, BioProspector, MDscan, MEME-ChIP, MotifSampler, and Weeder 2. ENSPART is also compared with MEME-ChIP, ChIPMunk, and RSAT peak-motifs without partitioning. The results show that ENSPART produces significantly better results than its individual classifiers and than MEME-ChIP, ChIPMunk, and RSAT peak-motifs. Finally, an experiment on simulated datasets is conducted. ENSPART is compared with GimmeMotifs and MotifVoter, both of which are also ensemble-based tools. The results show that ENSPART produces significantly higher precision and recall rates than GimmeMotifs and MotifVoter.
In conclusion, the ensemble technique is effective for DNA motif prediction, and ChIP datasets can be tackled effectively using data partitioning techniques. The merging technique developed in ENSPART allows effective merging of the same motifs from different data partitions. Such methods are generally applicable to any ensemble technique that uses classical motif discovery tools or, more recently, ChIP analysis tools.
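The description states that the input dataset is partitioned into subsets but does not say how the partitions are formed. As a minimal sketch only, assuming random subsets of a fixed maximum size, the partitioning step might look like the following. The names `partition_sequences`, `subset_size`, and `seed`, and the toy sequences, are hypothetical and are not taken from ENSPART.

```python
import random
from typing import List, Sequence


def partition_sequences(
    sequences: Sequence[str], subset_size: int, seed: int = 0
) -> List[List[str]]:
    """Shuffle the sequences and split them into subsets of at most subset_size.

    Assumption: random, fixed-size partitions. The thesis abstract does not
    specify the actual partitioning rule used by ENSPART.
    """
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    shuffled = list(sequences)
    # A seeded generator keeps the partitioning reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return [
        shuffled[i:i + subset_size]
        for i in range(0, len(shuffled), subset_size)
    ]


# Toy stand-ins for ChIP peak sequences (hypothetical data).
peaks = ["ACGTACGT", "TTGACGTA", "GGCATGCA", "ACGTTTGA", "CATGACGT"]
subsets = partition_sequences(peaks, subset_size=2)
```

Each subset could then be passed to the classical motif discovery tools separately, which keeps every individual run within the input size those tools can handle.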
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Choong, Allen Chieng Hoon
title Ensemble Framework for Motif Discovery Based on Data Partitioning
granting_institution Universiti Malaysia Sarawak (UNIMAS)
granting_department Faculty of Cognitive Sciences and Human Development
publishDate 2020
url http://ir.unimas.my/id/eprint/31761/1/Ensemble%20Framework%20for%20Motif%20Discovery%20Based%20on%20Data%20Partitioning%20-%2024%20pgs.pdf
http://ir.unimas.my/id/eprint/31761/4/Allen%20Choong%20Chieng%20Hoon%20ft.pdf