Methods of handling missing data with reference to rainfall in Peninsular Malaysia
Missing data is one of the issues often discussed amongst hydrologists in Malaysia. Various imputation methods were introduced to help minimize the bias and improve the accuracy of the statistical analysis. However, the performances of the imputation methods will be affected if the reason for data b...
Main Author: | Ho, Ming Kang |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2014 |
Subjects: | QA Mathematics |
Online Access: | http://eprints.utm.my/id/eprint/78077/1/HoMingKangPFS2014.pdf |
id | my-utm-ep.78077 |
---|---|
record_format | uketd_dc |
spelling | my-utm-ep.78077 2018-07-23T06:06:01Z Methods of handling missing data with reference to rainfall in Peninsular Malaysia 2014-09 Ho, Ming Kang QA Mathematics 2014-09 Thesis http://eprints.utm.my/id/eprint/78077/ http://eprints.utm.my/id/eprint/78077/1/HoMingKangPFS2014.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98266 phd doctoral Universiti Teknologi Malaysia, Faculty of Science Faculty of Science |
institution | Universiti Teknologi Malaysia |
collection | UTM Institutional Repository |
language | English |
topic | QA Mathematics |
description | Missing data is an issue often discussed among hydrologists in Malaysia. Various imputation methods have been introduced to help minimize bias and improve the accuracy of statistical analysis. However, the performance of an imputation method is affected if the reason the data are missing is not identified. This study therefore investigates the reasons why data are missing, known as the missingness mechanism, and selects the best model to impute the missing rainfall data. A model combining expectation maximization and logit (EM-Logit) is proposed and applied to simulated data with missing values characterised as missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). The model is also applied to homogeneous rainfall data, coupled with temperature and humidity, from Damansara and Kelantan before it is validated. The results indicate that the model is able to identify the type of missingness mechanism that leads to data being missing. The model also identifies MNAR as the missingness mechanism that best describes the missing rainfall data in both study areas. A two-step approach is therefore proposed for imputation. The first step classifies each day as a wet or dry rainfall event using a weighted-average algorithm; in the second step, missing values on days classified as wet are estimated by a self-organizing map (SOM). The two-step approach, known as the Probability Density Function Preserving Approach with SOM (PDSOM), is then compared with the SOM model alone and with a multilayer perceptron (MLP). Using the mean absolute error (MAE) and root mean square error (RMSE) criteria, and comparing the statistical properties of the imputed data with those of the rainfall data, PDSOM is found to perform better than SOM and MLP. The missing rainfall data from 1996 to 2004 at the two stations (Damansara and Kelantan) are also used to validate the performance of PDSOM by comparing the estimated mean and variance of the rainfall data with missing values imputed by PDSOM. The imputations are found to lie within the confidence intervals constructed from the observed rainfall data. PDSOM is shown to preserve well the mean and variance of the missing rainfall data, as well as the number of rainfall events in Damansara and Kelantan. Thus, PDSOM can serve as an alternative imputation model for rainfall data with missing values. |
format | Thesis |
qualification_name | Doctor of Philosophy (PhD.) |
qualification_level | Doctorate |
author | Ho, Ming Kang |
title | Methods of handling missing data with reference to rainfall in Peninsular Malaysia |
granting_institution | Universiti Teknologi Malaysia, Faculty of Science |
granting_department | Faculty of Science |
publishDate | 2014 |
url | http://eprints.utm.my/id/eprint/78077/1/HoMingKangPFS2014.pdf |
_version_ | 1747817901357793280 |
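
The description above names the mean absolute error (MAE) and root mean square error (RMSE) as the criteria used to compare imputation models. As a minimal sketch of how such a comparison can be computed, assuming paired lists of observed and imputed rainfall values, the two criteria are shown below. The function names and the sample values are hypothetical illustrations and are not taken from the thesis.

```python
import math


def _check(observed, imputed):
    # Both series must be non-empty and paired one-to-one.
    if len(observed) == 0 or len(observed) != len(imputed):
        raise ValueError("observed and imputed must be non-empty and equal in length")


def mae(observed, imputed):
    """Mean absolute error between observed and imputed values."""
    _check(observed, imputed)
    return sum(abs(o - i) for o, i in zip(observed, imputed)) / len(observed)


def rmse(observed, imputed):
    """Root mean square error between observed and imputed values."""
    _check(observed, imputed)
    return math.sqrt(
        sum((o - i) ** 2 for o, i in zip(observed, imputed)) / len(observed)
    )


if __name__ == "__main__":
    # Hypothetical daily rainfall (mm) withheld from a record, and the
    # values an imputation model produced for the same days.
    observed = [0.0, 4.0, 10.0, 2.0]
    imputed = [1.0, 4.0, 7.0, 2.0]
    print("MAE :", mae(observed, imputed))
    print("RMSE:", rmse(observed, imputed))
```

Lower values of both criteria indicate imputations closer to the withheld observations. RMSE penalizes large individual errors more heavily than MAE, which is one reason the two are often reported together.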