The new efficient and accurate attribute-oriented clustering algorithms for categorical data

Categorical data clustering has attracted much attention recently due to the fact that much of the data contained in today’s databases is categorical in nature. Many algorithms for clustering categorical data have been proposed, in which attribute-oriented hierarchical divisive clustering algorithm...

Full description

Saved in:
Bibliographic Details
Main Author: Qin, Hongwu
Format: Thesis
Language:English
Published: 2012
Online Access:http://umpir.ump.edu.my/id/eprint/37063/1/The%20new%20efficient%20and%20accurate%20attribute-oriented%20clustering%20algorithms%20for%20categorical%20data.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
id my-ump-ir.37063
record_format uketd_dc
spelling my-ump-ir.370632023-02-20T07:41:14Z The new efficient and accurate attribute-oriented clustering algorithms for categorical data 2012 Qin, Hongwu Categorical data clustering has attracted much attention recently due to the fact that much of the data contained in today’s databases is categorical in nature. Many algorithms for clustering categorical data have been proposed, in which attribute-oriented hierarchical divisive clustering algorithm Min-Min Roughness (MMR) has the highest efficiency among these algorithms with low clustering accuracy, conversely, genetic clustering algorithm Genetic-Average Normalized Mutual Information (G-ANMI) has the highest clustering accuracy among these algorithms with low clustering efficiency. This work firstly reveals the significance of attributes in categorical data clustering, and then investigates the limitations of algorithms MMR and G-ANMI respectively, and correspondingly proposes a new attribute-oriented hierarchical divisive clustering algorithm termed Mean Gain Ratio (MGR) and an improved genetic clustering algorithm termed Improved G-ANMI (IG-ANMI) for categorical data. MGR includes two steps: selecting clustering attribute and selecting equivalence class on the clustering attribute. Information theory based concepts of mean gain ratio and entropy of clusters are used to implement these two steps, respectively. MGR can be run with or without specifying the number of clusters while few existing clustering algorithms for categorical data can be run without specifying the number of clusters. IG-ANMI algorithm improves G-ANMI by developing a new attribute-oriented initialization method in which part of initial chromosomes is generated by using the attributes partitions. Four real-life data sets obtained from University of California Irvine (UCI) machine learning repository and ten synthetically generated data sets are used to evaluate MGR and IG-ANMI algorithms, and other four algorithms are used to compare with these two algorithms. The experimental results show that MGR overcomes the limitations of MMR and the average clustering accuracy is improved by 19% (from 0.696 to 0.83), at the same time maintains the highest efficiency. IG-ANMI greatly improves the efficiency of G-ANMI (improved by 31% on the Zoo data set, 74% on the Votes data set, 59% on the Breast Cancer data set, and 3428% on the Mushroom data set) as well as the clustering accuracy of G-ANMI (the average clustering accuracy on four UCI data sets is improved by 10.6%, from 0.815 to 0.901), at the same time maintains the highest clustering accuracy. IG-ANMI has obvious advantage against G-ANMI on large data sets in terms of clustering efficiency as well as clustering accuracy. In addition, both of MGR and IG-ANMI have good scalability. The running time of MGR and IG-ANMI algorithms tend to vary linearly with the increase of the number of objects as well as the number of clusters. 2012 Thesis http://umpir.ump.edu.my/id/eprint/37063/ http://umpir.ump.edu.my/id/eprint/37063/1/The%20new%20efficient%20and%20accurate%20attribute-oriented%20clustering%20algorithms%20for%20categorical%20data.pdf pdf en public phd doctoral Universiti Malaysia Pahang Faculty of Computer System and Software Engineering Jazni, Mohamad Zain
institution Universiti Malaysia Pahang Al-Sultan Abdullah
collection UMPSA Institutional Repository
language English
advisor Jazni, Mohamad Zain
description Categorical data clustering has attracted much attention recently due to the fact that much of the data contained in today’s databases is categorical in nature. Many algorithms for clustering categorical data have been proposed, in which attribute-oriented hierarchical divisive clustering algorithm Min-Min Roughness (MMR) has the highest efficiency among these algorithms with low clustering accuracy, conversely, genetic clustering algorithm Genetic-Average Normalized Mutual Information (G-ANMI) has the highest clustering accuracy among these algorithms with low clustering efficiency. This work firstly reveals the significance of attributes in categorical data clustering, and then investigates the limitations of algorithms MMR and G-ANMI respectively, and correspondingly proposes a new attribute-oriented hierarchical divisive clustering algorithm termed Mean Gain Ratio (MGR) and an improved genetic clustering algorithm termed Improved G-ANMI (IG-ANMI) for categorical data. MGR includes two steps: selecting clustering attribute and selecting equivalence class on the clustering attribute. Information theory based concepts of mean gain ratio and entropy of clusters are used to implement these two steps, respectively. MGR can be run with or without specifying the number of clusters while few existing clustering algorithms for categorical data can be run without specifying the number of clusters. IG-ANMI algorithm improves G-ANMI by developing a new attribute-oriented initialization method in which part of initial chromosomes is generated by using the attributes partitions. Four real-life data sets obtained from University of California Irvine (UCI) machine learning repository and ten synthetically generated data sets are used to evaluate MGR and IG-ANMI algorithms, and other four algorithms are used to compare with these two algorithms. The experimental results show that MGR overcomes the limitations of MMR and the average clustering accuracy is improved by 19% (from 0.696 to 0.83), at the same time maintains the highest efficiency. IG-ANMI greatly improves the efficiency of G-ANMI (improved by 31% on the Zoo data set, 74% on the Votes data set, 59% on the Breast Cancer data set, and 3428% on the Mushroom data set) as well as the clustering accuracy of G-ANMI (the average clustering accuracy on four UCI data sets is improved by 10.6%, from 0.815 to 0.901), at the same time maintains the highest clustering accuracy. IG-ANMI has obvious advantage against G-ANMI on large data sets in terms of clustering efficiency as well as clustering accuracy. In addition, both of MGR and IG-ANMI have good scalability. The running time of MGR and IG-ANMI algorithms tend to vary linearly with the increase of the number of objects as well as the number of clusters.
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Qin, Hongwu
spellingShingle Qin, Hongwu
The new efficient and accurate attribute-oriented clustering algorithms for categorical data
author_facet Qin, Hongwu
author_sort Qin, Hongwu
title The new efficient and accurate attribute-oriented clustering algorithms for categorical data
title_short The new efficient and accurate attribute-oriented clustering algorithms for categorical data
title_full The new efficient and accurate attribute-oriented clustering algorithms for categorical data
title_fullStr The new efficient and accurate attribute-oriented clustering algorithms for categorical data
title_full_unstemmed The new efficient and accurate attribute-oriented clustering algorithms for categorical data
title_sort new efficient and accurate attribute-oriented clustering algorithms for categorical data
granting_institution Universiti Malaysia Pahang
granting_department Faculty of Computer System and Software Engineering
publishDate 2012
url http://umpir.ump.edu.my/id/eprint/37063/1/The%20new%20efficient%20and%20accurate%20attribute-oriented%20clustering%20algorithms%20for%20categorical%20data.pdf
_version_ 1783732260767268864