Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman

Information Retrieval (IR) deals with the representation, storage, organization of and access to information items. The main function of an information retrieval system is to provide users with tools to search effectively and efficiently. For the past thirty years since research on IR ha...

Bibliographic Details
Main Author: Abd Rahman, Nurazzah
Format: Thesis
Language: English
Published: 2011
Subjects:
Online Access: https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf
id my-uitm-ir.65319
record_format uketd_dc
institution Universiti Teknologi MARA
collection UiTM Institutional Repository
language English
advisor Abu Bakar, Zainab
topic Information storage and retrieval systems
description Information Retrieval (IR) deals with the representation, storage, organization of and access to information items. The main function of an information retrieval system is to provide users with tools to search effectively and efficiently. Although IR research has been established for the past thirty years, IR research on the Malay language emerged only in the mid-1990s. Cluster analysis is a multivariate analysis technique that assigns items to automatically created groups based on a calculated degree of association between items or groups. In cluster-based information retrieval, clustering can be applied to the terms in documents, to all documents in the corpus, to the user queries, or to the retrieval results themselves. Each type of clustering can improve retrieval effectiveness. This thesis focuses on document clustering. The Malay document corpus consists of digitized Malay-translated hadith texts from the collections of well-known Islamic scholars, namely Sahih Muslim, Sahih Bukhari, Sunan Ibnu Majjah, Sunan At-Tirmidzi, Sunan Abu Daud and Sunan An-Nasaie. The corpus was developed by scanning, editing and proofreading the Malay text into digital form. The Malay-translated hadith texts had to be pre-processed because most of them are in Indonesian. Differences in the meaning of many terms had to be clarified, and the terms converted to Malay, using a dictionary and human experts in both languages. Experts in the hadith domain were consulted to ensure the reliability of the Malay-translated hadith documents. A digitized, updated Malay thesaurus is used in the first experiment to improve the effectiveness of Malay document retrieval.
For cluster analysis, the Malay-translated hadith test collection consists of 2028 documents from Sahih Bukhari, where each hadith document ranges from 13 to 2561 words in length. Inter-document similarity depends on two choices: the document representation, that is, the weights assigned to the indexing terms that characterize each document, and the similarity coefficient. This thesis presents the results of applying five hierarchical agglomerative clustering techniques, namely Single Linkage, Complete Linkage, Group Average Linkage, Weighted Median Linkage and Ward's Method, with the Dice, Jaccard and Cosine similarity coefficients on the Malay corpus. The experiments are evaluated with redefined versions of two well-known IR metrics: Recall (R), the proportion of relevant documents that are clustered, and Precision (P), the proportion of clustered documents that are relevant. The results of the first experiment show that, with the Dice similarity coefficient, Complete Linkage is the most effective and Average Linkage has the highest precision in clustering the Malay-translated hadith documents. With the Jaccard similarity coefficient, Single Linkage is the most effective, while Ward's Method has the highest precision. Lastly, with the Cosine coefficient, Complete Linkage gives the highest precision. Therefore, Complete Linkage combined with the Cosine coefficient is run on a larger Malay hadith corpus in the second experiment, namely Sahih Bukhari, which consists of 2028 text documents. Tests with different numbers of clusters showed that precision increases from 18% to 55% when the corpus is clustered into 100 clusters, compared with 50 and 20 clusters. This led to the conclusion that a larger number of clusters yields higher precision than a smaller number, since each cluster then contains fewer documents. Hence, recall decreases and precision increases.
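The clustering pipeline summarized in the description can be sketched in a few lines of Python. This is an illustrative sketch only, not the thesis's implementation: the weighted forms of the Dice, Jaccard and Cosine coefficients are the standard textbook ones, only three of the five linkage methods are shown (Weighted Median Linkage and Ward's Method are omitted), and the four toy documents, their term weights and the relevance set are all made up for the example.

```python
from itertools import combinations
from math import sqrt

def _dot(a, b):
    # Inner product of two sparse term-weight vectors stored as dicts.
    return sum(w * b[t] for t, w in a.items() if t in b)

def _sq(a):
    # Sum of squared weights of one vector.
    return sum(w * w for w in a.values())

def dice(a, b):
    denom = _sq(a) + _sq(b)
    return 2 * _dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    dot = _dot(a, b)
    denom = _sq(a) + _sq(b) - dot
    return dot / denom if denom else 0.0

def cosine(a, b):
    denom = sqrt(_sq(a)) * sqrt(_sq(b))
    return _dot(a, b) / denom if denom else 0.0

# Cluster-to-cluster similarity derived from the pairwise document similarities.
LINKAGE = {
    "single": max,                            # most similar pair
    "complete": min,                          # least similar pair
    "average": lambda s: sum(s) / len(s),     # mean over all pairs
}

def agglomerate(docs, sim, linkage="complete", n_clusters=2):
    """Naive hierarchical agglomerative clustering; returns lists of doc indices."""
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [sim(docs[p], docs[q]) for p in clusters[i] for q in clusters[j]]
            score = combine(sims)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

def precision_recall(cluster, relevant):
    """Precision: share of clustered docs that are relevant.
    Recall: share of relevant docs that are in the cluster."""
    cluster, relevant = set(cluster), set(relevant)
    hits = len(cluster & relevant)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy documents (term -> weight); not taken from the hadith corpus.
docs = [
    {"solat": 2, "wuduk": 1},
    {"solat": 1, "wuduk": 2},
    {"zakat": 3, "harta": 1},
    {"zakat": 1, "harta": 2},
]
clusters = agglomerate(docs, cosine, linkage="complete", n_clusters=2)
print(clusters)                                       # [[0, 1], [2, 3]]
print(precision_recall(clusters[0], {0, 1, 2}))       # relevance set is assumed
```

Asking for more clusters in `agglomerate` stops the merging earlier and leaves smaller clusters, which is the mechanism behind the reported trade-off: fewer documents per cluster tends to raise precision and lower recall.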
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Abd Rahman, Nurazzah
title Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman
granting_institution Universiti Teknologi MARA (UiTM)
granting_department Faculty of Computer and Mathematical Sciences
publishDate 2011
url https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf
_version_ 1783735548035203072
spelling my-uitm-ir.65319 2023-01-04T08:19:41Z Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman 2011 Abd Rahman, Nurazzah Information storage and retrieval systems 2011 Thesis https://ir.uitm.edu.my/id/eprint/65319/ https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf text en public phd doctoral Universiti Teknologi MARA (UiTM) Faculty of Computer and Mathematical Sciences Abu Bakar, Zainab