Ranking-based pruning and weighted support model for gene association in frequent itemsets / Sofianita Mutalib
Main Author: | Sofianita Mutalib |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2019 |
Subjects: | |
Online Access: | https://ir.uitm.edu.my/id/eprint/40261/1/40261.pdf |
Summary: | The biological domain is a critical area that continually seeks useful knowledge and patterns through available methods, including data mining. One benchmark source of genomic data is Genome-Wide Association Studies (GWAS), which use a set of genetic variants, namely Single Nucleotide Polymorphisms (SNPs), in different individuals to observe the association of the variants with a particular trait. Usually, the association test in GWAS is done by computing a risk measure for each SNP separately. However, many of the variants and their effects remain a mystery, which leaves high potential for knowledge discovery, especially for complex diseases. The aim of the research is to develop an improved method for processing information and to find the relationship between genetic variants and disease with in-depth interpretation. Therefore, this research investigates the association between genetic variants and diseases, and proposes a method that uses Frequent Itemset Mining (FIM) to identify combinations of multiple SNPs that form an association. The methodology comprises five main stages: data understanding, data representation and pruning items, FIM and analysis and validation of knowledge. This thesis elaborates a set of crucial tasks in FIM for GWAS datasets. It proposes a strategy of Ranking-based Pruning of Items (RPI) for SNPs. Next, a Weighted Support Model (WSM) was developed to search for interesting itemsets. The measures used are Information Gain, for ranking and pruning items, and Weighted Support, for the interestingness of an itemset. The high dimensionality of the SNP dataset justified applying a row-enumeration algorithm to mine frequent closed itemsets. SNPs with known risk for Type 2 Diabetes Mellitus (T2DM) were found to occur at low support values, which causes the search for frequent itemsets to be repeated many times until the low support values are reached. The implementation of WSM with Odds Ratio (OR) values makes these itemsets visible by giving them higher weighted support values. Finally, the interestingness of the produced itemsets is validated by integrating available and relevant biological information with scrutiny by an expert, as presented in the Descriptive Gene set Analysis (DGA). The information found in the itemsets led to the conclusion that the identified SNPs interact with other variants in the chain of T2DM. The scope of the work is limited to the two chromosomes most commonly studied for T2DM, Chromosomes 11 and 16. The results show that the itemsets with the T2DM risk variants were found at support values of 40 to 48, and that after RPI and WSM are applied, the weighted support value increases to 50 and 97 within a significant number of SNPs. These results show that RPI-WSM is able to solve the huge dataset problem and the low support value problem in FIM. In addition, to improve interpretation, each itemset is presented in DGA as a combination of genes with gene annotation information, which supplies scientists with further valuable patterns. RPI, WSM and DGA are the contributions of the research; they are significant in discovering potential new knowledge and in complementing research by scientists who perform further validation. The study could also contribute to the advancement of healthcare and the digital genome market, which focuses on developing a healthier society through monitoring and early protection against threats, especially chronic diseases such as T2DM, through personalized treatment or medicine. |
---|
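The summary names two measures, Information Gain for ranking and pruning items and Weighted Support for itemset interestingness, without giving their formulas. The sketch below is an illustration only, not the thesis's implementation. It assumes the standard entropy-based definition of information gain, and it assumes one common formulation of weighted support from the weighted association rule literature (plain support multiplied by the mean weight of the items), with odds ratios used as the weights. All function names, SNP names, genotypes and odds-ratio values are hypothetical toy data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_and_prune(columns, labels, keep):
    """Rank item columns by information gain and keep the top `keep` names."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:keep]


def weighted_support(itemset, transactions, weights):
    """ASSUMED formula: support of the itemset times the mean weight of its items."""
    items = set(itemset)
    support = sum(items <= set(t) for t in transactions) / len(transactions)
    mean_weight = sum(weights[i] for i in items) / len(items)
    return support * mean_weight


# Hypothetical toy data: four individuals, three made-up SNPs.
labels = ["case", "case", "control", "control"]
columns = {
    "snp_a": ["AA", "AA", "GG", "GG"],  # separates cases from controls
    "snp_b": ["CT", "CC", "CT", "CC"],  # carries no information
    "snp_c": ["TT", "TT", "TT", "GG"],  # partially informative
}
kept = rank_and_prune(columns, labels, keep=2)
print(kept)  # ['snp_a', 'snp_c']

# Each transaction is one individual's set of items; weights are made-up odds ratios.
transactions = [
    {"snp_a=AA", "snp_c=TT"},
    {"snp_a=AA", "snp_c=TT"},
    {"snp_c=TT"},
    {"snp_a=GG"},
]
weights = {"snp_a=AA": 1.5, "snp_c=TT": 1.25}
print(weighted_support(["snp_a=AA", "snp_c=TT"], transactions, weights))  # 0.6875
```

In this toy run the itemset has a plain support of 0.5, and the assumed weighting raises it to 0.6875. This mirrors the effect described in the summary, where itemsets containing risk variants become more visible once OR values are applied. The actual weighting scheme used by WSM may differ.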