An improved K-nearest neighbor with grasshopper optimization algorithm for missing data imputation
Concurrent with advances in the data cleaning process, missing data have become known as one of the most common issues encountered in many research areas. Real collected datasets, such as those from medicine, business, transportation and education, are prone to be incomplete or missing especially whe...
Main Author: | Nadzurah Zainal Abidin |
---|---|
Format: | Thesis |
Language: | English |
Published: | Kuala Lumpur : Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 2020 |
Subjects: | Computer algorithms; Heuristic algorithms; Metaheuristics; Missing observations (Statistics) |
Online Access: | http://studentrepo.iium.edu.my/handle/123456789/9838 |
LEADER | 049140000a22004450004500 | ||
---|---|---|---|
008 | 200922s2020 my a f m 000 0 eng d | ||
040 | |a UIAM |b eng |e rda | ||
041 | |a eng | ||
043 | |a a-my--- | ||
050 | 0 | 0 | |a QA76.9.A43 |
100 | 0 | |a Nadzurah Zainal Abidin, |e author | |
245 | 1 | 3 | |a An improved K-nearest neighbor with grasshopper optimization algorithm for missing data imputation / |c by Nadzurah Zainal Abidin |
264 | 1 | |a Kuala Lumpur : |b Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, |c 2020 | |
300 | |a xv, 110 leaves : |b illustrations ; |c 30cm. | ||
336 | |2 rdacontent |a text | ||
337 | |2 rdamedia |a unmediated | ||
337 | |2 rdamedia |a computer | ||
338 | |2 rdacarrier |a volume | ||
338 | |2 rdacarrier |a computer disc | ||
338 | |2 rdacarrier |a online resource | ||
347 | |2 rdaft |a text file |b PDF | ||
500 | |a Abstracts in English and Arabic. | ||
500 | |a "A thesis submitted in fulfilment of the requirement for the degree of Master in Computer Science." --On title page. | ||
502 | |a Thesis (MCS)--International Islamic University Malaysia, 2020. | ||
504 | |a Includes bibliographical references (leaves 101-108). | ||
520 | |a Concurrent with advances in the data cleaning process, missing data have become known as one of the most common issues encountered in many research areas. Real collected datasets, such as those from medicine, business, transportation and education, are prone to be incomplete or missing, especially when respondents do not respond due to stress, fatigue or inadequate knowledge, when some of the questions are sensitive, or when the answer options presented are insufficient. One mechanism for handling missing data is imputation, which is the activity of substituting missing values with plausible values that are reasonably accurate relative to the actual values. A large number of imputation algorithms have been proposed to estimate missing values. Unfortunately, most of the imputation methods employed provide less reliable estimates for missing data. Therefore, to deal accurately with missing data, an optimization of one of the state-of-the-art imputation algorithms, K-nearest neighbors (KNN), is proposed to impute those missing values. The KNN algorithm has been widely adopted for missing data imputation because of its robustness and simplicity, and it is a promising method that can outperform other machine learning methods. However, in many cases KNN suffers from high computational cost, greater storage requirements, sensitivity to noise and high time complexity, and it is difficult to choose the right centroid position and the function used to measure distance. As a result, conventional KNN imputation still produces undesirable results. Accordingly, this thesis proposes an optimized KNN imputation method that uses the Grasshopper optimization algorithm (GOA) to deliver better imputation results. The Grasshopper optimization algorithm is a recent population-based metaheuristic that has shown improved results and efficiency in tackling issues with missing data. 
The GOA, which is inspired by the natural behavior of grasshoppers, is incorporated into the algorithm structure to maximize the imputation performance of KNN. The proposed algorithm is applied to nine different datasets and compared with other optimization algorithms: Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Dragonfly Optimization (DA), Firefly Algorithm (FFA), Ant Lion Optimization (ALO), and Moth Flame Optimization (MFO), in terms of statistical correlation, error accuracy, and running time. The results show that KNNGOA has the most promising performance and outperforms the other optimization algorithms in imputation accuracy and computing time for datasets that are large and have higher missing rates (20 percent and above). A statistical test analysis is also conducted, which supports the conclusion of the experiment. | ||
596 | |a 1 | ||
650 | 0 | |a Computer algorithms | |
650 | 0 | |a Heuristic algorithms | |
650 | 0 | |a Metaheuristics | |
650 | 0 | |a Missing observations (Statistics) | |
655 | 7 | |a Theses, IIUM local | |
690 | |a Dissertations, Academic |x Department of Computer Science |z IIUM | ||
700 | 0 | |a Amelia Ritahani Ismail, |e degree supervisor | |
710 | 2 | |a International Islamic University Malaysia. |b Department of Computer Science | |
856 | 4 | |u http://studentrepo.iium.edu.my/handle/123456789/9838 | |
900 | |a sz-ash-sar | ||
999 | |c 439310 |d 470824 | ||
952 | |0 0 |6 T QA 000076.9 A43 N126I 2020 |7 0 |8 THESES |9 761751 |a IIUM |b IIUM |c MULTIMEDIA |g 0.00 |o t QA 76.9 A43 N126I 2020 |p 11100418043 |r 2021-04-21 |t 1 |v 0.00 |y THESIS | ||
952 | |0 0 |6 TS CDF QA 76.9 A43 N126I 2020 |7 0 |8 THESES |9 859264 |a IIUM |b IIUM |c MULTIMEDIA |g 0.00 |o ts cdf QA 76.9 A43 N126I 2020 |p 11100418044 |r 2021-04-21 |t 1 |v 0.00 |y THESISDIG |
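
The summary above describes imputation as substituting missing values with plausible ones estimated from the K nearest neighbors. As a rough illustration only, the following is a minimal sketch of plain KNN imputation in Python. It is not the thesis's KNNGOA method and includes no GOA tuning; the function name `knn_impute`, the use of `None` for missing entries, the root-mean-square distance over shared features, and the toy data are all assumptions made for this example.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows with None entries filled by the mean of the
    same feature over the k nearest rows (hypothetical helper, plain KNN)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for r, other in enumerate(rows):
                # A neighbor must be a different row with feature j observed.
                if r == i or other[j] is None:
                    continue
                # Compare only on features observed in both rows.
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no usable neighbor; leave the gap unfilled
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy data (invented for illustration): row 1 is missing its third feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [10.0, 20.0, 30.0],
]
result = knn_impute(data, k=2)
print(result[1])
```

In this sketch the choice of `k` and of the distance function is fixed by hand; these are the kinds of settings the summary names as hard to choose for KNN.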