Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition

Optical character recognition (OCR) is a system that aims to improve human-machine interaction and is widely used in many areas. Recognition of Arabic characters is difficult because of the cursive nature of Arabic script. An Arabic OCR system consists of five components: image acquisition, pre-processing,...

Bibliographic Details
Main Author:	Arwa Mahmoud Yousef Al-Khatatneh
Format:	Thesis
Language:	en_US
Subjects:	Optical character recognition (OCR); Arabic character sets (Data processing); Arabic scripts

id	my-usim-ddms-13152
record_format	uketd_dc
spelling	my-usim-ddms-13152 2024-05-29T05:43:31Z 2016-11 Thesis en_US https://oarep.usim.edu.my/handle/123456789/13152 https://oarep.usim.edu.my/bitstreams/de079f30-e2e6-400b-ac96-42fe29b8cd35/download 8a4605be74aa9ea9d79846c1fba20a33
institution	Universiti Sains Islam Malaysia
collection	USIM Institutional Repository
language	en_US
topic	Optical character recognition (OCR); Arabic character sets (Data processing); Arabic scripts
description	Optical character recognition (OCR) is a system that aims to improve human-machine interaction and is widely used in many areas. Recognition of Arabic characters is difficult because of the cursive nature of Arabic script. An Arabic OCR system consists of five components: image acquisition, pre-processing, segmentation, feature extraction and classification. Binarization is the main pre-processing step and is usually performed with existing local or global thresholding methods. However, those methods fail on many binarization problems, especially for degraded document images. Baseline estimation is another pre-processing step; it aims to extract the virtual horizontal line on which all characters lie and join at a specific part of each character. Existing baseline estimation methods are inaccurate because of irregular sub-word alignment and the wide variety of free writing styles. The third component of an OCR system is the segmentation of text into characters. Cursive writing, ligatures and overlapping characters distinguish Arabic script from other scripts, so an Arabic OCR system requires a highly sophisticated segmentation method. This work proposes three methods and a framework for OCR. First, a compound binarization method that combines the advantages of local and global thresholding was tested on the DIBCO 2009, 2011 and 2013 benchmarks. In the experiments, its F-measure is 88% for printed document images and 78% for handwritten ones, while its PSNR is 15.99 for printed document images and 16.34 for handwritten ones. Second, a baseline estimation method for binary images, based on feature point detection, was tested on the IFN/ENIT dataset; counting an estimate as correct when its pixel error is less than 15 pixels, its accuracy is 87.3%. Finally, a segmentation method based on baseline estimation and structural rules was tested on the IFN/ENIT dataset, with an accuracy of 87.09%. The developed methods achieved better accuracy than state-of-the-art methods under quantitative measurements and are able to recover document images of Arabic text.
format	Thesis
author	Arwa Mahmoud Yousef Al-Khatatneh
title	Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition
_version_	1812444671521062912
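
The evaluation measures quoted in the abstract (F-measure, PSNR, and baseline accuracy within a 15-pixel tolerance) can be illustrated with a minimal sketch. It assumes the commonly used DIBCO-style definitions, with text pixels encoded as 1 and background as 0; the function names and the toy data are hypothetical and are not taken from the thesis.

```python
import math


def f_measure(pred, truth):
    """F-measure in percent, treating text pixels (value 1) as the positive class."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


def psnr(pred, truth):
    """PSNR in dB for 0/1 images (assumption: peak difference C = 1)."""
    n = sum(len(row) for row in truth)
    squared_error = sum(
        (p - t) ** 2
        for p_row, t_row in zip(pred, truth)
        for p, t in zip(p_row, t_row)
    )
    mse = squared_error / n
    return math.inf if mse == 0 else 10.0 * math.log10(1.0 / mse)


def baseline_accuracy(estimated, reference, tolerance=15):
    """Percent of baselines whose vertical error is below `tolerance` pixels."""
    hits = sum(1 for e, r in zip(estimated, reference) if abs(e - r) < tolerance)
    return 100.0 * hits / len(reference)


# Toy data (hypothetical): a 2x4 ground-truth image and a binarization result.
truth = [[1, 1, 0, 0],
         [0, 1, 0, 0]]
pred = [[1, 0, 0, 0],
        [0, 1, 1, 0]]

print(round(f_measure(pred, truth), 2))  # 66.67
print(round(psnr(pred, truth), 2))       # 6.02

# Hypothetical baseline row positions: errors of 5, 20 and 0 pixels.
print(round(baseline_accuracy([100, 120, 90], [105, 140, 90]), 2))  # 66.67
```

In the toy image, two text pixels are recovered, one is missed and one background pixel is wrongly marked as text, so precision and recall are both 2/3. Two of the eight pixels differ, giving a mean squared error of 0.25.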
