AK-means geometric smote with data complexity analysis for imbalanced dataset

Many binary-class datasets in real-life applications are affected by the class imbalance problem. Data complexities such as noisy examples, class overlap, and small disjuncts are observed to play a key role in poor classification performance. These complexities tend to exist in tandem wit...

Bibliographic Details
Main Author:	Nur Athirah, Azhar
Format:	Thesis
Language:	eng
Published:	2023
Subjects:	QA299.6-433 Analysis
Online Access:	https://etd.uum.edu.my/10933/1/Depositpermission_s827670.pdf https://etd.uum.edu.my/10933/2/s827670_01.pdf

id	my-uum-etd.10933
record_format	uketd_dc
spelling	my-uum-etd.10933 2024-01-28T01:26:03Z AK-means geometric smote with data complexity analysis for imbalanced dataset 2023 Nur Athirah, Azhar; Mohd Pozi, Muhammad Syafiq; Mohamed Din, Aniza; Jatowt, Adam; Awang Had Salleh Graduate School of Arts & Sciences; QA299.6-433 Analysis; Thesis https://etd.uum.edu.my/10933/ https://etd.uum.edu.my/10933/1/Depositpermission_s827670.pdf text eng staffonly https://etd.uum.edu.my/10933/2/s827670_01.pdf text eng public other masters Universiti Utara Malaysia
institution	Universiti Utara Malaysia
collection	UUM ETD
language	eng
advisor	Mohd Pozi, Muhammad Syafiq; Mohamed Din, Aniza; Jatowt, Adam
topic	QA299.6-433 Analysis
spellingShingle	QA299.6-433 Analysis Nur Athirah, Azhar AK-means geometric smote with data complexity analysis for imbalanced dataset
description	Many binary-class datasets in real-life applications are affected by the class imbalance problem. Data complexities such as noisy examples, class overlap, and small disjuncts are observed to play a key role in poor classification performance, and they tend to occur in tandem with class imbalance. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known method for rebalancing the number of examples in imbalanced datasets. However, SMOTE cannot effectively tackle data complexities and can even magnify them. Various SMOTE variants have therefore been proposed to overcome its downsides. Furthermore, no existing study has yet identified the correlation between the N1 complexity measure and classification measures such as geometric mean (G-Mean) and F1-Score. This study aims (i) to identify suitable complexity measures that correlate with performance measures, (ii) to propose a new SMOTE variant, K-Means Geometric SMOTE (KM-GSMOTE), that incorporates complexity measures during synthetic data generation, and (iii) to evaluate KM-GSMOTE in terms of classification performance. A series of experiments was conducted to evaluate G-Mean and F1-Score, together with the N1 complexity, of benchmark SMOTE variants and KM-GSMOTE. KM-GSMOTE was evaluated on 6 benchmark binary datasets from the UCI repository. It records the highest percentage of average differences in G-Mean (22.76%) and F1-Score (15.13%) for the SVM classifier. A correlation between classification measures and the N1 complexity measure was observed in the experimental results. The contributions of this study are (i) the introduction of KM-GSMOTE, which combines complexity measurement with model selection to pick models with the best classification performance and a lower complexity value, and (ii) the observation of a connection between classification performance and the complexity measure: as the N1 complexity measure decreases, the likelihood of obtaining substantial classification performance increases.
format	Thesis
qualification_name	other
qualification_level	Master's degree
author	Nur Athirah, Azhar
author_facet	Nur Athirah, Azhar
author_sort	Nur Athirah, Azhar
title	AK-means geometric smote with data complexity analysis for imbalanced dataset
title_short	AK-means geometric smote with data complexity analysis for imbalanced dataset
title_full	AK-means geometric smote with data complexity analysis for imbalanced dataset
title_fullStr	AK-means geometric smote with data complexity analysis for imbalanced dataset
title_full_unstemmed	AK-means geometric smote with data complexity analysis for imbalanced dataset
title_sort	ak-means geometric smote with data complexity analysis for imbalanced dataset
granting_institution	Universiti Utara Malaysia
granting_department	Awang Had Salleh Graduate School of Arts & Sciences
publishDate	2023
url	https://etd.uum.edu.my/10933/1/Depositpermission_s827670.pdf https://etd.uum.edu.my/10933/2/s827670_01.pdf
_version_	1794023791986212864
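
The description in this record refers to G-Mean, F1-Score, and the N1 complexity measure without defining them. The following is a minimal Python sketch of the usual textbook definitions, not the thesis's own implementation. It assumes G-Mean is the square root of sensitivity times specificity, and that N1 is the fraction of points incident to a minimum-spanning-tree edge that joins two different classes. The function names `g_mean_and_f1` and `n1_complexity` are hypothetical and chosen only for illustration.

```python
import math


def g_mean_and_f1(y_true, y_pred, positive=1):
    """Return (G-Mean, F1) for binary labels; `positive` marks the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    g_mean = math.sqrt(recall * specificity)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return g_mean, f1


def n1_complexity(points, labels):
    """Assumed N1 definition: share of points touching an MST edge
    whose two endpoints carry different class labels."""
    n = len(points)
    if n < 2:
        return 0.0

    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known edge into the tree
    parent = [-1] * n            # tree endpoint of that cheapest edge
    best_dist[0] = 0.0
    borderline = set()

    # Prim's algorithm on the complete Euclidean graph, O(n^2).
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] != -1 and labels[u] != labels[parent[u]]:
            borderline.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u

    return len(borderline) / n


if __name__ == "__main__":
    g, f1 = g_mean_and_f1([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
    print(f"G-Mean={g:.4f}, F1={f1:.4f}")

    # Two well-separated 1-D clusters: only one MST edge crosses classes.
    pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
    print(f"N1={n1_complexity(pts, [0, 0, 0, 1, 1, 1]):.4f}")
```

Under these assumed definitions, a lower N1 value means fewer points sit on the class boundary, which is the direction the description associates with better G-Mean and F1-Score.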


Similar Items