Enhanced normalization approach to address stop-word complexity in compound-word schema labels
An extensive review of existing research in schema matching reveals the significance of semantics in this field. Both the structural and the semantic aspects of schema matching have been research topics for many years, and there are strong references ava...
Main Author: | Hossain, Jafreen |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2014 |
Subjects: | Data integration (Computer science) |
Online Access: | http://psasir.upm.edu.my/id/eprint/60506/1/FSKTM%202014%2026IR.pdf |
id |
my-upm-ir.60506 |
---|---|
record_format |
uketd_dc |
spelling |
my-upm-ir.60506 2018-05-08T03:23:07Z Enhanced normalization approach to address stop-word complexity in compound-word schema labels 2014-06 Hossain, Jafreen
Data integration (Computer science) 2014-06 Thesis http://psasir.upm.edu.my/id/eprint/60506/ http://psasir.upm.edu.my/id/eprint/60506/1/FSKTM%202014%2026IR.pdf text en public masters Universiti Putra Malaysia Data integration (Computer science) |
institution |
Universiti Putra Malaysia |
collection |
PSAS Institutional Repository |
language |
English |
topic |
Data integration (Computer science) |
description |
An extensive review of existing research in schema matching reveals the significance of semantics in this field. Both the structural and the semantic aspects of schema matching have been research topics for many years, and strong references are available for both. However, an in-depth analysis of the available approaches suggests that there is further scope for improvement in semantic schema matching. Normalization and lexical annotation methods using WordNet have been proposed in several studies, but the results show comparatively poor accuracy due to the presence of stop-words in schema labels. Most previous studies ignored stop-words, which led to false-negative conclusions. This research proposes NORMSTOP (NORMalizer of schemata having STOP-words), an improved schema normalization approach that addresses the complexity of stop-words (e.g. 'by', 'at', 'and', 'or') in Compound Word (CW) schema labels. NORMSTOP isolates these labels during the preprocessing stage and resets the base form to a relevant WordNet term or an annotatable compound noun, using a combined set of WordNet features such as Attributes, Derivationally Related Forms, and LexNames. When tested on the same real dataset used for the earlier approach, NORMS (NORMalizer of Schemata), NORMSTOP shows up to a 13% improvement in annotation recall. This improvement brings the overall schema matching process a step closer to perfect accuracy, while its absence exposes a gap in expectation, especially in today's databases, where stop-words are abundant. |
format |
Thesis |
qualification_level |
Master's degree |
author |
Hossain, Jafreen |
title |
Enhanced normalization approach to address stop-word complexity in compound-word schema labels |
granting_institution |
Universiti Putra Malaysia |
publishDate |
2014 |
url |
http://psasir.upm.edu.my/id/eprint/60506/1/FSKTM%202014%2026IR.pdf |
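
The abstract describes a preprocessing stage that isolates compound-word schema labels containing stop-words. The sketch below illustrates only that isolation idea and is not the NORMSTOP implementation: the function names, the splitting rules (underscores, hyphens, spaces, camelCase), and the stop-word list (limited to the four examples the abstract gives) are all assumptions made for illustration. The later step of resetting the base form with WordNet features is not shown.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical stop-word list: only the four examples named in the abstract.
STOP_WORDS = {"by", "at", "and", "or"}


def split_label(label: str) -> List[str]:
    """Split a compound schema label into lowercase tokens.

    Assumed conventions: tokens are separated by underscores, hyphens,
    whitespace, or a lower-to-upper camelCase boundary.
    """
    # Insert a space at camelCase boundaries, e.g. "orderedBy" -> "ordered By".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return [tok.lower() for tok in re.split(r"[\s_\-]+", spaced) if tok]


def has_stop_word(label: str) -> bool:
    """Return True if any token of the label is in the stop-word list."""
    return any(tok in STOP_WORDS for tok in split_label(label))


def isolate_stop_word_labels(labels: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition labels into (labels with stop-words, labels without)."""
    with_stop: List[str] = []
    without_stop: List[str] = []
    for label in labels:
        (with_stop if has_stop_word(label) else without_stop).append(label)
    return with_stop, without_stop


if __name__ == "__main__":
    sample = ["orderedBy", "price_and_tax", "customerName", "ORDER_BY_DATE"]
    flagged, rest = isolate_stop_word_labels(sample)
    print("with stop-words:", flagged)
    print("without stop-words:", rest)
```

Labels in the first group would be the candidates for special normalization; the rest could follow an ordinary tokenize-and-annotate path.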