Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine
Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have focused on traditional clustering algorithms such as hierarchical clustering. Since most of these methods depend on high dimensional data, the drawback is to divide th...
Main Author: | Machap, Logenthiran |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf |
id | my-utm-ep.96282 |
---|---|
record_format | uketd_dc |
spelling | my-utm-ep.96282 2022-07-05T08:07:14Z Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine 2021 Machap, Logenthiran QA75 Electronic computers. Computer science Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have focused on traditional clustering algorithms such as hierarchical clustering. Because most of these methods operate on high-dimensional data, their drawback is that they divide the genes into disjoint clusters, so that a gene or a condition belongs to only one cluster. However, a gene may contribute to more than one biological process and may therefore belong to multiple clusters. In addition, the centroid in the objective function of network-assisted co-clustering for the identification of cancer subtypes (NCIS) is dragged by outliers, so these outliers get their own cluster instead of being ignored. Hence, this research focuses on improving the NCIS method. Enhanced NCIS (iNCIS) assigns weights to genes based on a gene interaction network and iteratively optimizes the sum-squared residue to obtain co-clusters. Next, supervised infinite feature selection with multiple support vector machine (SinfFS-mSVM) is proposed to select significant genes from high-dimensional data using the classes obtained from iNCIS, and to improve classification accuracy. The effectiveness of iNCIS and SinfFS-mSVM is evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM) datasets from The Cancer Genome Atlas (TCGA) project. Five breast cancer gene subtypes and four glioblastoma multiforme gene subtypes were successfully identified. The weighted co-clustering approach in iNCIS provides a unique solution for integrating gene network interactions into the clustering process. The co-clustering Rand Index and F1-measure improved by 54.5% and 33.9% for BRCA and by 34.2% and 31.5% for GBM, respectively. Meanwhile, a significant gene subset with higher classification accuracy was selected by SinfFS-mSVM. The classification accuracy for the selected gene subset improved by 3.00% and 2.99% for BRCA and GBM, respectively. Furthermore, biological validation was conducted on the selected genes from each subtype to support the validity of the results. In conclusion, the empirical study on large-scale cancer datasets shows that iNCIS and SinfFS-mSVM identify cancer gene subtypes and significant genes with higher clustering and classification accuracy. Future work is needed to integrate more comprehensive gene network information and to select optimal parameters. 2021 Thesis http://eprints.utm.my/id/eprint/96282/ http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143093 phd doctoral Universiti Teknologi Malaysia Faculty of Engineering - School of Computing |
institution | Universiti Teknologi Malaysia |
collection | UTM Institutional Repository |
language | English |
topic | QA75 Electronic computers Computer science |
spellingShingle | QA75 Electronic computers Computer science Machap, Logenthiran Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
description | Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have focused on traditional clustering algorithms such as hierarchical clustering. Because most of these methods operate on high-dimensional data, their drawback is that they divide the genes into disjoint clusters, so that a gene or a condition belongs to only one cluster. However, a gene may contribute to more than one biological process and may therefore belong to multiple clusters. In addition, the centroid in the objective function of network-assisted co-clustering for the identification of cancer subtypes (NCIS) is dragged by outliers, so these outliers get their own cluster instead of being ignored. Hence, this research focuses on improving the NCIS method. Enhanced NCIS (iNCIS) assigns weights to genes based on a gene interaction network and iteratively optimizes the sum-squared residue to obtain co-clusters. Next, supervised infinite feature selection with multiple support vector machine (SinfFS-mSVM) is proposed to select significant genes from high-dimensional data using the classes obtained from iNCIS, and to improve classification accuracy. The effectiveness of iNCIS and SinfFS-mSVM is evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM) datasets from The Cancer Genome Atlas (TCGA) project. Five breast cancer gene subtypes and four glioblastoma multiforme gene subtypes were successfully identified. The weighted co-clustering approach in iNCIS provides a unique solution for integrating gene network interactions into the clustering process. The co-clustering Rand Index and F1-measure improved by 54.5% and 33.9% for BRCA and by 34.2% and 31.5% for GBM, respectively. Meanwhile, a significant gene subset with higher classification accuracy was selected by SinfFS-mSVM. The classification accuracy for the selected gene subset improved by 3.00% and 2.99% for BRCA and GBM, respectively. Furthermore, biological validation was conducted on the selected genes from each subtype to support the validity of the results. In conclusion, the empirical study on large-scale cancer datasets shows that iNCIS and SinfFS-mSVM identify cancer gene subtypes and significant genes with higher clustering and classification accuracy. Future work is needed to integrate more comprehensive gene network information and to select optimal parameters. |
format | Thesis |
qualification_name | Doctor of Philosophy (PhD.) |
qualification_level | Doctorate |
author | Machap, Logenthiran |
author_facet | Machap, Logenthiran |
author_sort | Machap, Logenthiran |
title | Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
title_short | Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
title_full | Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
title_fullStr | Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
title_full_unstemmed | Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
title_sort | identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine |
granting_institution | Universiti Teknologi Malaysia |
granting_department | Faculty of Engineering - School of Computing |
publishDate | 2021 |
url | http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf |
_version_ | 1747818655365726208 |
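
The sum-squared residue that the abstract names as the co-clustering objective can be illustrated with a minimal Python sketch. It assumes the commonly used residue form, in which each entry of a co-cluster has its row mean and column mean subtracted and the co-cluster mean added back. The function name `sum_squared_residue`, the optional `row_weights` argument (a stand-in for network-derived gene weights), and the toy matrix are all hypothetical and are not taken from the thesis, whose exact iNCIS objective and weighting scheme may differ.

```python
import numpy as np


def sum_squared_residue(X, row_labels, col_labels, row_weights=None):
    """Sum-squared residue of a co-clustering of X (rows = genes, columns = samples).

    row_weights is a hypothetical per-gene weight; it defaults to 1 for every gene.
    """
    X = np.asarray(X, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    if row_weights is None:
        row_weights = np.ones(X.shape[0])
    row_weights = np.asarray(row_weights, dtype=float)

    total = 0.0
    for r in np.unique(row_labels):
        rows = row_labels == r
        for c in np.unique(col_labels):
            cols = col_labels == c
            block = X[rows][:, cols]  # sub-matrix of one co-cluster
            row_mean = block.mean(axis=1, keepdims=True)
            col_mean = block.mean(axis=0, keepdims=True)
            # Residue: what is left after removing row and column effects.
            residue = block - row_mean - col_mean + block.mean()
            total += float((row_weights[rows][:, None] * residue ** 2).sum())
    return total


# Toy matrix with four additive 2x2 blocks (illustrative data only).
X = [[1, 2, 10, 11],
     [2, 3, 11, 12],
     [5, 6, 0, 1],
     [6, 7, 1, 2]]

matching = sum_squared_residue(X, [0, 0, 1, 1], [0, 0, 1, 1])
single = sum_squared_residue(X, [0, 0, 0, 0], [0, 0, 0, 0])
print(matching, single)
```

In this toy case the labelling that matches the block structure gives a lower residue than treating the whole matrix as one co-cluster, which is the quantity a co-clustering procedure of this kind would try to reduce.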