Methods of handling missing data with reference to rainfall in Peninsular Malaysia

Missing data is one of the issues often discussed amongst hydrologists in Malaysia. Various imputation methods were introduced to help minimize the bias and improve the accuracy of the statistical analysis. However, the performances of the imputation methods will be affected if the reason for data b...

Bibliographic Details
Main Author: Ho, Ming Kang
Format: Thesis
Language: English
Published: 2014
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/id/eprint/78077/1/HoMingKangPFS2014.pdf
id my-utm-ep.78077
record_format uketd_dc
spelling my-utm-ep.78077 2018-07-23T06:06:01Z Methods of handling missing data with reference to rainfall in Peninsular Malaysia 2014-09 Ho, Ming Kang QA Mathematics 2014-09 Thesis http://eprints.utm.my/id/eprint/78077/ http://eprints.utm.my/id/eprint/78077/1/HoMingKangPFS2014.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98266 phd doctoral Universiti Teknologi Malaysia, Faculty of Science Faculty of Science
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA Mathematics
description Missing data is an issue often discussed among hydrologists in Malaysia. Various imputation methods have been introduced to help minimize bias and improve the accuracy of statistical analysis. However, the performance of an imputation method suffers if the reason the data are missing is not identified. This study therefore investigates the reasons why data are missing, known as the missingness mechanism, and selects the best model to impute the missing rainfall data. A model combining expectation maximization and logit (EM-Logit) is proposed and applied to simulated data with missing values characterised as missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In addition, homogeneous rainfall data coupled with temperature and humidity data from Damansara and Kelantan are also used in validating the proposed model. The results indicate that the model is able to identify the type of missingness mechanism that leads to data being missing. The model also identifies MNAR as the missingness mechanism that best describes the missing rainfall data in both study areas. For imputation, a two-step approach is therefore proposed. The first step classifies each rainfall event as a wet or dry day using a weighted-average algorithm; the second step estimates the missing values on wet-classified days using a self-organizing map (SOM). The two-step approach, known as the Probability Density Function Preserving Approach with SOM (PDSOM), is then compared with the SOM model alone and with a Multilayer Perceptron (MLP). Based on the mean absolute error (MAE) and root mean square error (RMSE) criteria, and on a comparison of the statistical properties of the imputed data with those of the rainfall data, PDSOM is found to perform better than SOM and MLP. Rainfall data with missing values from 1996 to 2004 at the two stations (Damansara and Kelantan) are also used to validate the performance of PDSOM by comparing the estimated mean and variance of the rainfall data after imputation by PDSOM. The imputations fall within the confidence intervals constructed from the observed rainfall data. PDSOM preserves the mean and variance of the missing rainfall data well, as well as the number of rainfall events in Damansara and Kelantan. Thus, PDSOM can serve as an alternative imputation model for rainfall data with missing values.
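The three missingness mechanisms and the two error criteria named in the description can be illustrated with a short sketch. This is not the thesis's EM-Logit or PDSOM method: the function names, the deterministic "top-k" masking rule, and the mean-imputation baseline are all assumptions made here for illustration only.

```python
import math
import random


def make_masks(rain, humidity, rate=0.2, seed=0):
    """Return (mcar, mar, mnar) boolean masks; True marks a missing value.

    Hypothetical helper: the MAR and MNAR masks use a simple top-k rule,
    which is a simplification chosen for clarity.
    """
    rng = random.Random(seed)
    n = len(rain)
    k = int(rate * n)
    # MCAR: missingness is unrelated to any data value.
    mcar_idx = set(rng.sample(range(n), k))
    # MAR: missingness depends only on an observed covariate (humidity).
    mar_idx = set(sorted(range(n), key=lambda i: humidity[i], reverse=True)[:k])
    # MNAR: missingness depends on the unobserved rainfall value itself.
    mnar_idx = set(sorted(range(n), key=lambda i: rain[i], reverse=True)[:k])
    return tuple(
        [i in idx for i in range(n)] for idx in (mcar_idx, mar_idx, mnar_idx)
    )


def mae(truth, estimate):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def rmse(truth, estimate):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)
    )


def mean_impute_errors(rain, mask):
    """Impute masked values with the observed mean (a stand-in baseline,
    not PDSOM) and score the imputations with MAE and RMSE."""
    observed = [r for r, m in zip(rain, mask) if not m]
    fill = sum(observed) / len(observed)
    truth = [r for r, m in zip(rain, mask) if m]
    estimate = [fill] * len(truth)
    return mae(truth, estimate), rmse(truth, estimate)
```

Under the MNAR mask in this sketch the largest rainfall values are the ones removed, so a mean-based fill tends to underestimate them, which is one way to see why the mechanism matters when choosing an imputation model.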
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Ho, Ming Kang
title Methods of handling missing data with reference to rainfall in Peninsular Malaysia
granting_institution Universiti Teknologi Malaysia, Faculty of Science
granting_department Faculty of Science
publishDate 2014
url http://eprints.utm.my/id/eprint/78077/1/HoMingKangPFS2014.pdf
_version_ 1747817901357793280