Hierarchical multi-stage dimensional reduction based on feature hashing and bi-filtering strategy for large-scale text classification
Advances in technology have produced very large volumes of data, which pose challenges for labelling and classification tasks with high-dimensional features. In text labelling specifically, existing classification models must cope with a huge number of instance...
Main Author: | Ado, Abubakar |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2023 |
Subjects: | T Technology (General) |
Online Access: | http://eprints.uthm.edu.my/11027/1/24p%20ABUBAKAR%20ADO.pdf http://eprints.uthm.edu.my/11027/2/ABUBAKAR%20ADO%20COPYRIGTH%20DECLARATION.pdf http://eprints.uthm.edu.my/11027/3/ABUBAKAR%20ADO%20WATERMARK.pdf |

id | my-uthm-ep.11027 |
---|---|
record_format | uketd_dc |
spelling | my-uthm-ep.11027; 2024-05-29T02:25:08Z; 2023-07; Ado, Abubakar; T Technology (General); Thesis; http://eprints.uthm.edu.my/11027/; http://eprints.uthm.edu.my/11027/1/24p%20ABUBAKAR%20ADO.pdf (text, en, public); http://eprints.uthm.edu.my/11027/2/ABUBAKAR%20ADO%20COPYRIGTH%20DECLARATION.pdf (text, en, staffonly); http://eprints.uthm.edu.my/11027/3/ABUBAKAR%20ADO%20WATERMARK.pdf (text, en, validuser); phd doctoral; Universiti Tun Hussein Onn Malaysia; Fakulti Sains Komputer dan Teknologi Maklumat |
institution | Universiti Tun Hussein Onn Malaysia |
collection | UTHM Institutional Repository |
language | English |
topic | T Technology (General) |
description | Advances in technology have produced very large volumes of data, which pose challenges for labelling and classification tasks with high-dimensional features. In text labelling specifically, existing classification models must cope with a huge number of instances, millions of features, and a large number of categories. This calls for a well-defined hierarchy and automated classification models that label instances within it, a task referred to as Large-Scale Hierarchical Text Classification (LSHTC). Even with a well-defined hierarchy, LSHTC still faces a scalability issue, which calls for improvements in the dimensional reduction phase of the LSHTC framework, whose aim is to construct a subset of informative features. However, applying existing dimensionality reduction methods to LSHTC introduces bad collisions or discrepancies in results. This thesis therefore proposes a Multi-stage Dimensional Reduction Method (MDRM) for LSHTC, based on feature hashing and a bi-strategy filter method. To address these problems, a Modified Feature Hashing (MFH) method based on term weight is presented to minimize the rate of bad collisions, and a new Bi-strategy Filtering Approach (BFA) is presented to deal with results discrepancy. Experimental results show that the proposed MFH outperformed conventional feature hashing by approximately 3%. BFA achieved the highest average micro-F1 scores of 53.38% and 55.58% and the highest average macro-F1 scores of 45.83% and 49.23% compared with single-strategy filtering methods. It also achieved the highest hierarchical-F1 scores of 79.99%, 67.83%, and 67.95% compared with existing multi-strategy filtering approaches. Lastly, MDRM achieved the best performance in terms of average micro-F1 (58.47% and 54.77%) and average macro-F1 (51.14% and 48.70%). In terms of running time, MDRM was 11% faster than the single-stage reduction method and about 37% faster than the baseline method. |
format | Thesis |
qualification_name | Doctor of Philosophy (PhD.) |
qualification_level | Doctorate |
author | Ado, Abubakar |
title | Hierarchical multi-stage dimensional reduction based on feature hashing and bi-filtering strategy for large-scale text classification |
granting_institution | Universiti Tun Hussein Onn Malaysia |
granting_department | Fakulti Sains Komputer dan Teknologi Maklumat |
publishDate | 2023 |
url | http://eprints.uthm.edu.my/11027/1/24p%20ABUBAKAR%20ADO.pdf http://eprints.uthm.edu.my/11027/2/ABUBAKAR%20ADO%20COPYRIGTH%20DECLARATION.pdf http://eprints.uthm.edu.my/11027/3/ABUBAKAR%20ADO%20WATERMARK.pdf |
_version_ | 1804890137186795520 |
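
The description in this record refers to conventional feature hashing and to "bad collisions" without showing what either looks like. The following is a minimal sketch of the conventional hashing trick only, not the thesis's Modified Feature Hashing (MFH), whose term-weight scheme is not specified in this record. The function names (`hash_index`, `hash_features`, `colliding_tokens`), the use of MD5 as a stable hash, and the bucket counts are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def hash_index(token: str, n_buckets: int) -> tuple[int, int]:
    """Map a token to (bucket, sign) with a stable hash.

    MD5 is an illustrative choice made only so results are reproducible
    across runs; it is not taken from the thesis.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    sign = 1 if digest[8] % 2 == 0 else -1  # signed hashing reduces collision bias
    return bucket, sign


def hash_features(tokens, n_buckets=16):
    """Project a bag of tokens into a fixed-length vector of size n_buckets."""
    vec = [0.0] * n_buckets
    for token, count in Counter(tokens).items():
        bucket, sign = hash_index(token, n_buckets)
        vec[bucket] += sign * count
    return vec


def colliding_tokens(vocabulary, n_buckets):
    """Return {bucket: [tokens]} for every bucket shared by two or more distinct tokens."""
    buckets = {}
    for token in sorted(set(vocabulary)):
        bucket, _ = hash_index(token, n_buckets)
        buckets.setdefault(bucket, []).append(token)
    return {b: toks for b, toks in buckets.items() if len(toks) > 1}


# Hypothetical example document; the dimensionality drops from the
# vocabulary size to n_buckets regardless of how many distinct tokens occur.
doc = "large scale hierarchical text classification of large text".split()
print(hash_features(doc, n_buckets=8))
print(colliding_tokens(doc, n_buckets=8))
```

With fewer buckets than distinct tokens, some tokens necessarily share a bucket. A collision is presumably "bad" when the tokens that share a bucket carry different information for the classifier, which is the situation the description says MFH uses term weights to reduce.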