Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition
Optical character recognition (OCR) is a technology that aims to improve human-machine interaction and is widely used in many areas. Recognition of Arabic characters is difficult due to the cursive nature of Arabic script. An Arabic OCR system consists of five components: image acquisition, pre-processing,...
Main Author: | Arwa Mahmoud Yousef Al-Khatatneh |
---|---|
Format: | Thesis |
Language: | en_US |
Subjects: | Optical character recognition (OCR); Arabic character sets (Data processing); Arabic scripts |
id |
my-usim-ddms-13152 |
---|---|
record_format |
uketd_dc |
spelling |
my-usim-ddms-13152 2024-05-29T05:43:31Z Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition Arwa Mahmoud Yousef Al-Khatatneh Optical character recognition (OCR) is a technology that aims to improve human-machine interaction and is widely used in many areas. Recognition of Arabic characters is difficult due to the cursive nature of Arabic script. An Arabic OCR system consists of five components: image acquisition, pre-processing, segmentation, feature extraction and classification. Binarization is the main pre-processing step and is conventionally performed with local or global thresholding methods. However, those methods fail on many binarization problems, especially for degraded document images. Baseline estimation is another pre-processing step; it aims to extract the virtual horizontal line on which all characters lie and join, at a specific part of each character. Existing baseline estimation methods are inaccurate because of irregular sub-word alignment and the wide variety of free writing styles. The third component of an OCR system is the segmentation of text into characters. Cursive writing, ligatures and overlapping characters distinguish Arabic script from other scripts, so an Arabic OCR system requires a highly sophisticated segmentation method. This work proposes three methods and a framework for OCR. First, a compound binarization method that combines the advantages of local and global thresholding was tested on the DIBCO 2009, 2011 and 2013 benchmarks. In the experiments, its F-measure is 88% for printed document images and 78% for handwritten ones, while its PSNR is 15.99 for printed document images and 16.34 for handwritten ones. Second, a baseline estimation method for binary images based on feature point detection was tested on the IFN/ENIT dataset; when an estimate is counted as correct if its pixel error is less than 15 pixels, the accuracy is 87.3%. Finally, a segmentation method based on baseline estimation and structural rules was tested on the IFN/ENIT dataset, with an accuracy of 87.09%. The developed methods achieved better accuracy than state-of-the-art methods under quantitative measurements. The framework is able to recover document images of Arabic text. 2016-11 Thesis en_US https://oarep.usim.edu.my/handle/123456789/13152 https://oarep.usim.edu.my/bitstreams/de079f30-e2e6-400b-ac96-42fe29b8cd35/download 8a4605be74aa9ea9d79846c1fba20a33 Optical character recognition (OCR); Arabic character sets (Data processing); Arabic scripts |
institution |
Universiti Sains Islam Malaysia |
collection |
USIM Institutional Repository |
language |
en_US |
topic |
Optical character recognition (OCR); Arabic character sets (Data processing); Arabic scripts |
description |
Optical character recognition (OCR) is a technology that aims to improve human-machine
interaction and is widely used in many areas. Recognition of Arabic characters is difficult
due to the cursive nature of Arabic script. An Arabic OCR system consists of five
components: image acquisition, pre-processing, segmentation, feature extraction and
classification. Binarization is the main pre-processing step and is conventionally
performed with local or global thresholding methods. However, those methods fail on
many binarization problems, especially for degraded document images.
Baseline estimation is another pre-processing step; it aims to extract the virtual
horizontal line on which all characters lie and join, at a specific part of each character.
Existing baseline estimation methods are inaccurate because of irregular sub-word
alignment and the wide variety of free writing styles. The third component of an OCR
system is the segmentation of text into characters. Cursive writing, ligatures and
overlapping characters distinguish Arabic script from other scripts, so an Arabic OCR
system requires a highly sophisticated segmentation method. This work proposes three
methods and a framework for OCR. First, a compound binarization method that
combines the advantages of local and global thresholding was tested on the DIBCO
2009, 2011 and 2013 benchmarks. In the experiments, its F-measure is 88% for
printed document images and 78% for handwritten ones, while its PSNR is 15.99 for
printed document images and 16.34 for handwritten ones. Second, a baseline estimation
method for binary images based on feature point detection was tested on the IFN/ENIT
dataset; when an estimate is counted as correct if its pixel error is less than 15 pixels,
the accuracy is 87.3%. Finally, a segmentation method based on baseline estimation and
structural rules was tested on the IFN/ENIT dataset, with an accuracy of 87.09%.
The developed methods achieved better accuracy than state-of-the-art methods under
quantitative measurements. The framework is able to recover document images of
Arabic text. |
format |
Thesis |
author |
Arwa Mahmoud Yousef Al-Khatatneh |
title |
Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition |
_version_ |
1812444671521062912 |
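
The description reports binarization quality as F-measure and PSNR. For readers unfamiliar with these figures, the following is a minimal sketch of how the two metrics are commonly computed for a binarized image against its ground truth. It is not taken from the thesis: the function names `f_measure` and `psnr`, the 0/1 pixel convention (1 = text pixel) and the nested-list image format are assumptions made for illustration only.

```python
import math


def f_measure(pred, truth):
    """Harmonic mean of precision and recall over text pixels.

    pred, truth: equally sized 2D lists of 0/1 values, where 1 marks a
    text (foreground) pixel. This pixel convention is an assumption.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # text pixel correctly detected
            elif p and not t:
                fp += 1      # background wrongly marked as text
            elif t and not p:
                fn += 1      # text pixel missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def psnr(pred, truth, peak=1.0):
    """Peak signal-to-noise ratio in dB.

    peak is the difference between foreground and background values
    (1.0 for 0/1 images). Returns infinity for identical images.
    """
    squared_error = 0
    count = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # Tiny hypothetical 2x4 images, not data from the thesis.
    predicted = [[1, 1, 0, 0],
                 [0, 1, 0, 0]]
    ground_truth = [[1, 0, 0, 0],
                    [0, 1, 1, 0]]
    print(f"F-measure: {f_measure(predicted, ground_truth):.4f}")
    print(f"PSNR: {psnr(predicted, ground_truth):.4f} dB")
```

Higher values are better for both metrics. Benchmark evaluations typically average them over all images in a test set.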