Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman

Bibliographic Details
Main Author:	Abd Rahman, Nurazzah
Format:	Thesis
Language:	English
Published:	2011
Subjects:	Information storage and retrieval systems
Online Access:	https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf

id	my-uitm-ir.65319
record_format	uketd_dc
institution	Universiti Teknologi MARA
collection	UiTM Institutional Repository
language	English
advisor	Abu Bakar, Zainab
topic	Information storage and retrieval systems
description	Information Retrieval (IR) deals with the representation, storage and organization of, and access to, information items. The main function of an information retrieval system is to provide users with tools to search effectively and efficiently. Although IR research has been established for some thirty years, research on IR for the Malay language emerged only in the mid-1990s. Cluster analysis is a multivariate analysis technique that assigns items to automatically created groups based on the calculated degree of association between items or groups. In cluster-based information retrieval, clustering can be applied to the terms in documents, to all documents in the corpus, to the user queries, or to the retrieval results themselves. Each type of clustering can improve retrieval effectiveness. This thesis focuses on document clustering. The Malay document corpus consists of digitized Malay translations of hadith texts from the collections of well-known Islamic scholars: Sahih Muslim, Sahih Bukhari, Sunan Ibnu Majjah, Sunan At-Tirmidzi, Sunan Abu Daud and Sunan An-Nasaie. The corpus was developed by scanning, editing and proofreading the Malay text into digital form. The Malay translated hadith text needs pre-processing because most of the texts are in Indonesian. Differences in the meaning of many terms need to be clarified and the terms converted to Malay, using a dictionary and human experts in both languages. Experts in the hadith domain were sought to ensure the reliability of the Malay translated hadith documents. A digitized, updated Malay thesaurus is used in the first experiment to improve the effectiveness of Malay document retrieval. For the cluster analysis, the Malay translated hadith test collection consists of 2028 documents from Sahih Bukhari, each between 13 and 2561 words long. Interdocument similarity depends on both the document representation, that is, the weights assigned to the indexing terms characterizing each document, and the similarity coefficient chosen. This thesis presents the results of applying five hierarchical agglomerative clustering techniques (Single Linkage, Complete Linkage, Group Average Linkage, Weighted Median Linkage and Ward's Method) with the Dice, Jaccard and Cosine similarity coefficients to the Malay corpus. The experiments are evaluated with redefined versions of the well-known IR metrics Recall (R), the proportion of relevant documents that are clustered, and Precision (P), the proportion of clustered documents that are relevant. The results of the first experiment show that, with the Dice coefficient, Complete Linkage is the most effective and Average Linkage gives the highest precision in clustering Malay translated hadith documents. With the Jaccard coefficient, Single Linkage is the most effective, while Ward's Method gives the highest precision. Lastly, with the Cosine coefficient, Complete Linkage gives the highest precision. Therefore, Complete Linkage combined with the Cosine coefficient is used in the second experiment on a larger Malay hadith corpus, Sahih Bukhari, which consists of 2028 text documents. Further tests showed that precision increases from 18% to 55% when the corpus is clustered into 100 clusters rather than 50 or 20. This leads to the conclusion that a larger number of clusters gives higher precision than a smaller number, since each cluster then contains fewer documents. Hence, recall decreases and precision increases.
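The three similarity coefficients and the cluster-level Precision and Recall named in the description can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the representation of documents as term sets or sparse term-weight dictionaries, and the toy data are all assumptions made for illustration, and the set-based Dice and Jaccard forms are the standard binary definitions rather than whatever weighting the thesis used.

```python
import math


def dice(a, b):
    """Dice coefficient of two term sets: 2|A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def jaccard(a, b):
    """Jaccard coefficient of two term sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine of two sparse term-weight vectors given as dicts (term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_precision_recall(cluster, relevant):
    """Precision: share of clustered documents that are relevant.
    Recall: share of relevant documents that were clustered.
    Both arguments are sets of (hypothetical) document ids."""
    hits = len(cluster & relevant)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (invented for illustration only).
doc_a = {"solat", "puasa"}
doc_b = {"puasa", "zakat"}
print(dice(doc_a, doc_b), jaccard(doc_a, doc_b))
print(cosine({"solat": 1.0, "puasa": 1.0}, {"puasa": 1.0, "zakat": 1.0}))
print(cluster_precision_recall({1, 2, 3, 4}, {2, 4, 6}))
```

A smaller cluster that still contains its relevant documents raises the precision ratio while leaving fewer relevant documents per cluster, which is consistent with the precision and recall trade-off the description reports for larger numbers of clusters.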
format	Thesis
qualification_name	Doctor of Philosophy (PhD.)
qualification_level	Doctorate
author	Abd Rahman, Nurazzah
title	Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman
granting_institution	Universiti Teknologi MARA (UiTM)
granting_department	Faculty of Computer and Mathematical Sciences
publishDate	2011
url	https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf
_version_	1783735548035203072
spelling	my-uitm-ir.65319 2023-01-04T08:19:41Z Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman 2011 Abd Rahman, Nurazzah Information storage and retrieval systems 2011 Thesis https://ir.uitm.edu.my/id/eprint/65319/ https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf text en public phd doctoral Universiti Teknologi MARA (UiTM) Faculty of Computer and Mathematical Sciences Abu Bakar, Zainab
