Enhanced text stemmer for standard and non-standard word patterns in Malay texts
Text stemming is a useful language preprocessing tool in the field of information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from the derived words. Over the past few years,...
Main Author: | Kassim, Mohamad Nizam |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2020 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://eprints.utm.my/id/eprint/98431/1/MohamadNizamKassimPSC2020.pdf |
id |
my-utm-ep.98431 |
---|---|
record_format |
uketd_dc |
spelling |
my-utm-ep.98431 2023-01-08T02:12:33Z Enhanced text stemmer for standard and non-standard word patterns in Malay texts 2020 Kassim, Mohamad Nizam QA75 Electronic computers. Computer science 2020 Thesis http://eprints.utm.my/id/eprint/98431/ http://eprints.utm.my/id/eprint/98431/1/MohamadNizamKassimPSC2020.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143711 phd doctoral Universiti Teknologi Malaysia, Faculty of Engineering - School of Computing Faculty of Engineering - School of Computing |
institution |
Universiti Teknologi Malaysia |
collection |
UTM Institutional Repository |
language |
English |
topic |
QA75 Electronic computers. Computer science |
spellingShingle |
QA75 Electronic computers. Computer science Kassim, Mohamad Nizam Enhanced text stemmer for standard and non-standard word patterns in Malay texts |
description |
Text stemming is a useful language preprocessing tool in information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from derived words. Over the past few years, a few text stemmers have been developed for the Malay language, but these text stemmers suffer from various stemming errors. This is due to the difficulty of dealing with the complexity of Malay morphological rules. These text stemmers were developed to stem affixation words only, whereas the Malay language also contains other affixation, reduplication and compounding words. Furthermore, none of these text stemmers has been developed for social media texts, which comprise non-standard derived words. Therefore, this research study aims to improve the existing text stemmers' capability of stemming affixation, reduplication and compounding words while minimising the possible stemming errors. Moreover, this research study also aims to address the text stemming process for non-standard derived words on social media platforms by removing non-standard affixes, clitics and particles. This research study adopts a multiple text stemming approach that uses the affix removal method and dictionary lookup in a specific order to correctly stem standard and non-standard affixation, reduplication and compounding words in standard texts and social media texts. The proposed text stemmer is evaluated against various text documents using the direct evaluation method, and text classification is used as the indirect evaluation method to validate the effectiveness of the proposed enhanced text stemmer. In general, the proposed enhanced text stemmer outperforms the baseline text stemmer.
The proposed enhanced text stemmer achieves an average stemming accuracy of 98.7% against the standard texts and 73.7% against the social media texts. With the proposed enhanced text stemmer, the sports news classification application achieves an average accuracy of 85% and the illicit content classification application achieves an average accuracy of 75%. In contrast, the baseline text stemmer achieves an average stemming accuracy of 63.5% against the standard texts and is unable to stem non-standard derived words in the social media texts. The baseline text stemmer also performs comparatively poorly in sports news classification and illicit content classification, with average accuracies of 78% and 63%, respectively. In short, the experimental results suggest that the proposed enhanced text stemmer has promising stemming accuracy against both standard texts and social media texts. The choice of text stemmer also influences the performance of the text classification applications. |
format |
Thesis |
qualification_name |
Doctor of Philosophy (PhD.) |
qualification_level |
Doctorate |
author |
Kassim, Mohamad Nizam |
title |
Enhanced text stemmer for standard and non-standard word patterns in Malay texts |
granting_institution |
Universiti Teknologi Malaysia, Faculty of Engineering - School of Computing |
granting_department |
Faculty of Engineering - School of Computing |
publishDate |
2020 |
url |
http://eprints.utm.my/id/eprint/98431/1/MohamadNizamKassimPSC2020.pdf |
_version_ |
1776100586983260160 |
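
The abstract describes a stemmer that combines dictionary lookup with affix removal in a fixed order and is scored by direct evaluation. The record does not give the thesis's actual rules, so the following is only a minimal Python sketch of that general idea. The root-word dictionary, the affix, clitic and particle lists, the ordering of the steps, and the function names `stem` and `stemming_accuracy` are all assumptions chosen for illustration, and accuracy is assumed here to mean the percentage of words reduced to the expected root.

```python
# Hypothetical, tiny root-word dictionary; a real stemmer would load a full lexicon.
ROOT_WORDS = {"makan", "main", "baca", "buku", "ajar"}

# Illustrative subsets of Malay affixes, clitics and particles
# (assumptions for this sketch, not the rule set used in the thesis).
PREFIXES = ("meng", "mem", "men", "me", "ber", "ter", "di", "ke", "se")
SUFFIXES = ("kan", "an", "i")
ENCLITICS = ("nya", "lah", "kah", "ku", "mu")


def strip_any(word, affixes, at_start):
    """Yield the word unchanged, then each variant with one matching affix removed."""
    yield word
    for affix in affixes:
        if at_start and word.startswith(affix):
            yield word[len(affix):]
        elif not at_start and word.endswith(affix):
            yield word[:-len(affix)]


def stem(word):
    """Return a dictionary root for the word, or the word itself if none is found."""
    word = word.lower()

    # Step 1: dictionary lookup, so that root words are never over-stemmed.
    if word in ROOT_WORDS:
        return word

    # Step 2: full reduplication such as "buku-buku" reduces to one copy.
    first, hyphen, second = word.partition("-")
    if hyphen and first == second:
        return stem(first)

    # Step 3: affix removal; a candidate is accepted only if the dictionary confirms it.
    for no_clitic in strip_any(word, ENCLITICS, at_start=False):
        for no_suffix in strip_any(no_clitic, SUFFIXES, at_start=False):
            for candidate in strip_any(no_suffix, PREFIXES, at_start=True):
                if candidate in ROOT_WORDS:
                    return candidate

    # No rule produced a known root; leave the word untouched.
    return word


def stemming_accuracy(pairs):
    """Percentage of (word, expected_root) pairs that stem() reduces correctly."""
    correct = sum(stem(word) == root for word, root in pairs)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    samples = [("makanan", "makan"), ("bermain", "main"),
               ("buku-buku", "buku"), ("dibacanya", "baca")]
    for word, _ in samples:
        print(word, "->", stem(word))
    print("accuracy:", stemming_accuracy(samples))
```

Checking every candidate against the dictionary before accepting it is one simple way to limit over-stemming. Words whose affixes fall outside the illustrative lists, such as those with the prefix "pe-", are returned unchanged.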