Enhanced normalization approach to address stop-word complexity in compound-word schema labels

Bibliographic Details
Main Author: Hossain, Jafreen
Format: Thesis
Language: English
Published: 2014
Subjects: Data integration (Computer science)
Online Access: http://psasir.upm.edu.my/id/eprint/60506/1/FSKTM%202014%2026IR.pdf
id my-upm-ir.60506
record_format uketd_dc
spelling my-upm-ir.60506 2018-05-08T03:23:07Z 2014-06 Thesis http://psasir.upm.edu.my/id/eprint/60506/ text en public masters Universiti Putra Malaysia
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
topic Data integration (Computer science)


description An extensive review of existing research in schema matching shows how significant semantics is to the subject. Both the structural and semantic aspects of schema matching have been research topics for many years, and strong references are available for both. However, an in-depth analysis of the available approaches suggests there is further scope for improvement in semantic schema matching. Normalization and lexical annotation methods using WordNet have been proposed in several studies. However, their results show comparatively poor accuracy because of stop-words in schema labels. Most earlier studies ignored stop-words, which led to false negatives. This research proposes NORMSTOP (NORMalizer of schemata having STOP-words), an improved schema normalization approach that addresses the complexity of stop-words (e.g. 'by', 'at', 'and', 'or') in Compound Word (CW) schema labels. NORMSTOP isolates these labels during the preprocessing stage and resets the base form to a relevant WordNet term or an annotatable compound noun, using a combined set of WordNet features such as Attributes, Derivationally Related Forms, and LexNames. When tested on the same real dataset used for the earlier approach, NORMS (NORMalizer of Schemata), NORMSTOP shows up to a 13% improvement in annotation recall. This improvement brings the overall schema matching process a step closer to perfect accuracy; without it, results fall short of expectations, especially in today's databases, where stop-words are abundant.
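The isolation step described in the abstract can be illustrated with a minimal sketch. This is not the NORMSTOP implementation: the tokenization rules, the stop-word list beyond the four examples named in the abstract, and the function names `split_label` and `isolate_stop_word_labels` are all assumptions made for illustration, and the WordNet-based base-form reset is not shown.

```python
import re

# Assumed stop-word list. The abstract names only 'by', 'at', 'and', 'or';
# the remaining entries are illustrative additions.
STOP_WORDS = {"by", "at", "and", "or", "of", "in", "to", "for"}


def split_label(label):
    """Split a schema label into lowercase tokens.

    Splits on underscores, hyphens, whitespace, and camelCase boundaries.
    """
    # Insert a space at each lower-to-upper boundary: "orderedBy" -> "ordered By".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return [tok.lower() for tok in re.split(r"[\s_\-]+", spaced) if tok]


def isolate_stop_word_labels(labels):
    """Return the compound-word labels that contain at least one stop-word.

    The result maps each flagged label to (tokens, stop_words_found).
    """
    flagged = {}
    for label in labels:
        tokens = split_label(label)
        found = [tok for tok in tokens if tok in STOP_WORDS]
        # Only multi-token (compound-word) labels are of interest here.
        if found and len(tokens) > 1:
            flagged[label] = (tokens, found)
    return flagged


if __name__ == "__main__":
    sample = ["orderedBy", "date_of_birth", "CustomerName", "price"]
    for label, (tokens, found) in isolate_stop_word_labels(sample).items():
        print(label, tokens, found)
```

In a fuller pipeline, the flagged labels would then go to a separate normalization path instead of having their stop-words silently dropped, which is the behavior the abstract associates with false negatives.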
format Thesis
qualification_level Master's degree
author Hossain, Jafreen
title Enhanced normalization approach to address stop-word complexity in compound-word schema labels
granting_institution Universiti Putra Malaysia
publishDate 2014
url http://psasir.upm.edu.my/id/eprint/60506/1/FSKTM%202014%2026IR.pdf