Robust techniques for linear regression with multicollinearity and outliers

Bibliographic Details
Main Author: Mohammed, Mohammed Abdulhussein
Format: Thesis
Language:English
Published: 2016
Subjects:
Online Access:http://psasir.upm.edu.my/id/eprint/58669/1/IPM%202016%201IR%20D.pdf
id my-upm-ir.58669
record_format uketd_dc
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
advisor Midi, Habshah
topic Regression analysis
Multicollinearity

spellingShingle Regression analysis
Multicollinearity

Mohammed, Mohammed Abdulhussein
Robust techniques for linear regression with multicollinearity and outliers
description The ordinary least squares (OLS) method is the most commonly used method in multiple linear regression models due to its optimal properties and ease of computation. Unfortunately, in the presence of multicollinearity and outlying observations in a data set, the OLS estimate is inefficient, with inflated standard errors. Outlying observations can be classified into different types, such as vertical outliers, high leverage points (HLPs), and influential observations (IOs). It is crucial to identify HLPs and IOs because they have a large effect on various estimators and cause masking and swamping of outliers in multiple linear regression. The commonly used diagnostic measures fail to correctly identify such observations. Hence, a new improvised diagnostic robust generalized potential (IDRGP) is proposed. The proposed IDRGP is very successful in detecting multiple HLPs with smaller masking and swamping rates. This thesis is also concerned with diagnostic measures for the identification of bad influential observations (BIOs). The detection of BIOs is very important because they are responsible for inaccurate predictions and invalid inferential statements, having a large impact on the computed values of various estimates. The generalized version of DFFITS (GDFF) was developed only to identify IOs, without considering whether they are good or bad influential observations. In addition, although GDFF can detect multiple IOs, it tends to detect fewer IOs than it should because of swamping and masking effects. A new method, called the modified generalized DFFITS (MGDFF), is developed in this regard, in which the suspected HLPs in the initial subset are identified using the proposed IDRGP diagnostic method. To the best of our knowledge, no research has been done on the classification of observations into regular, good, and bad IOs. Thus, the IDRGP-MGDFF plot is formulated to close this gap in the literature.
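For context, the sketch below computes the classical (non-robust) leverage values and DFFITS that diagnostics such as IDRGP and MGDFF are designed to improve upon. It is an illustration of the textbook quantities only, not the thesis's own algorithms, and assumes a model with an intercept.

```python
import numpy as np

def leverage_and_dffits(X, y):
    """Classical leverage values h_ii and DFFITS for OLS.

    Textbook (non-robust) diagnostics: leverage comes from the hat
    matrix H = X(X'X)^{-1}X', and DFFITS_i = t_i * sqrt(h_i/(1-h_i)),
    where t_i is the externally studentized residual.
    """
    n = X.shape[0]
    Xa = np.column_stack([np.ones(n), X])         # add intercept column
    H = Xa @ np.linalg.inv(Xa.T @ Xa) @ Xa.T      # hat matrix
    h = np.diag(H)                                # leverage values h_ii
    beta = np.linalg.lstsq(Xa, y, rcond=None)[0]
    resid = y - Xa @ beta
    k = Xa.shape[1]
    s2 = resid @ resid / (n - k)                  # residual variance
    # deletion formula for the leave-one-out variance s_(i)^2
    s2_i = ((n - k) * s2 - resid**2 / (1 - h)) / (n - k - 1)
    t = resid / np.sqrt(s2_i * (1 - h))           # studentized residuals
    dffits = t * np.sqrt(h / (1 - h))
    return h, dffits
```

Observations with large h_ii flag potential HLPs; large |DFFITS| flags potential IOs. These classical cut-offs are exactly the measures that suffer from the masking and swamping effects described above.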
This thesis also addresses the problem of multicollinearity in multiple linear regression models with regard to two sources. The first source is HLPs; the second is the data collection method employed, constraints on the model or in the population, model specification, or an over-defined model. However, no research has focused on parameter estimation methods that remedy multicollinearity caused by multiple HLPs. Hence, we propose a new estimation method, namely the modified GM-estimator (MGM) based on MGDFF. The results of the study indicate that the MGM estimator is the most efficient method for rectifying multicollinearity caused by HLPs. When multicollinearity is due to other sources (not HLPs), several classical methods are available; among them, Ridge Regression (RR), Jackknife Ridge Regression (JRR), and Latent Root Regression (LRR) have been put forward to remedy the problem. Nevertheless, it is now evident that these classical estimation methods perform poorly when outliers exist in the data. In this regard, we propose two types of robust estimation methods. The first type is an improved version of the LRR that rectifies the simultaneous problems of multicollinearity and outliers. The proposed methods are formulated by incorporating the robust MM-estimator and the modified generalized M-estimator (MGM) in the LRR algorithm; we call them the Latent Root MM-based (LRMMB) and Latent Root MGM-based (LRMGMB) methods. Similar to the first type, the second type of robust multicollinearity estimation method aims to improve the performance of robust jackknife ridge regression. The MM-estimator and the MGM-estimator are integrated into the JRR algorithm to establish improved versions of JRR. The suggested methods are called the jackknife ridge MM-based (JRMMB) and jackknife ridge MGM-based (JRMGMB) estimators.
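A minimal sketch of the classical ridge estimator that RR, JRR, and their robust variants build on may help fix ideas. This is the plain, non-robust version on standardized predictors; the thesis's LRMMB, LRMGMB, JRMMB, and JRMGMB methods replace the OLS-based ingredients with MM- or MGM-weighted ones, which is not reproduced here.

```python
import numpy as np

def ridge(X, y, k):
    """Classical ridge regression on centred and scaled predictors.

    Shrinks coefficients by adding the penalty k to the diagonal of
    X'X before inverting: beta = (X'X + kI)^{-1} X'y. This is the
    plain RR estimator, with no robust down-weighting of outliers.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    yc = y - y.mean()                                  # centre the response
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ yc)
```

With k = 0 this reduces to OLS on the standardized data; increasing k shrinks the coefficients toward zero, which stabilizes the estimate when X'X is nearly singular due to multicollinearity.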
All the proposed methods outperform the commonly used methods when multicollinearity comes together with multiple HLPs. The classical multicollinearity diagnostic measure cannot correctly diagnose multicollinearity in the presence of multiple HLPs: when the classical VIF is employed, HLPs can artificially inflate or deflate the apparent multicollinearity pattern, leading to misleading conclusions and an incorrect indication of how to solve the multicollinearity problem. In this respect, we propose a robust VIF, denoted RVIF(JACK-MGM), which serves as a good indicator to help statistics practitioners choose an appropriate estimator for solving the multicollinearity problem.
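The classical VIF that RVIF(JACK-MGM) is designed to replace can be sketched as follows: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination from regressing predictor j on the remaining predictors. The sketch below is the plain, non-robust computation; it is exactly this quantity that HLPs can distort.

```python
import numpy as np

def vif(X):
    """Classical variance inflation factors, one per predictor column.

    Each VIF_j = 1/(1 - R_j^2), where R_j^2 comes from an OLS
    regression of column j on the other columns (with intercept).
    No robust weighting is applied, so HLPs can distort the result.
    """
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ beta
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)  # total sum of squares
        r2 = 1.0 - resid @ resid / tss
        out[j] = 1.0 / (1.0 - r2)
    return out
```

Independent predictors give VIFs near 1, while nearly collinear predictors give very large VIFs; the usual rule of thumb treats VIF above about 10 as a sign of serious multicollinearity.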
format Thesis
qualification_level Doctorate
author Mohammed, Mohammed Abdulhussein
author_facet Mohammed, Mohammed Abdulhussein
author_sort Mohammed, Mohammed Abdulhussein
title Robust techniques for linear regression with multicollinearity and outliers
title_short Robust techniques for linear regression with multicollinearity and outliers
title_full Robust techniques for linear regression with multicollinearity and outliers
title_fullStr Robust techniques for linear regression with multicollinearity and outliers
title_full_unstemmed Robust techniques for linear regression with multicollinearity and outliers
title_sort robust techniques for linear regression with multicollinearity and outliers
granting_institution Universiti Putra Malaysia
publishDate 2016
url http://psasir.upm.edu.my/id/eprint/58669/1/IPM%202016%201IR%20D.pdf
_version_ 1747812222389714944