Data Mining Classification Techniques and Performances on Medical Data

This study evaluates the performance of classification techniques using several software packages, including Rosetta, Tanagra, Weka and Orange. The techniques were tested on six medical datasets from the UCI Machine Learning Repository. The study will help researcher...

Bibliographic Details
Main Author: Benyehmad, Yahyia Mohammed M. Ali
Format: Thesis
Language: eng
Published: 2006
Subjects: QA76 Computer software
Online Access: https://etd.uum.edu.my/1864/1/Yahyia_Mohammed_M._Ali_Benyehmad_-_Data_mining_classification_techniques_and_performances_on_medical_data.pdf
https://etd.uum.edu.my/1864/2/Yahyia_Mohammed_M._Ali_Benyehmad_-_Data_mining_classification_techniques_and_performances_on_medical_data.pdf
id my-uum-etd.1864
record_format uketd_dc
institution Universiti Utara Malaysia
collection UUM ETD
language eng
topic QA76 Computer software
description This study evaluates the performance of classification techniques using several software packages, including Rosetta, Tanagra, Weka and Orange. The techniques were tested on six medical datasets from the UCI Machine Learning Repository. The study will help researchers select the most suitable classification technique for medical datasets in terms of classification accuracy. In this thesis, sixteen classification techniques are evaluated and compared: Radial Basis Function (RBF) and Multilayer Perceptron (MLP) neural networks, Multi Linear Regression (MLR), Logistic Regression (LR), classification trees (ID3, C4.5, J48, CART), Naive Bayes (NB), Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), a rule-based classifier, standard voting, voting with object tracking, and standard tuned voting (RSES). The experiments were validated using 10-fold cross-validation. The results show that the most suitable classification technique is NB, with an average classification accuracy of 90.13% and an average error rate of 9.87%. The worst classification technique is SLR, with an average classification accuracy of 50.16% and an average error rate of 49.84%. The techniques are ranked from best to worst by average classification accuracy and average error rate: NB, LDA, LR, SVM, C4.5, MLP, RBF, kNN, RuleB, ID3, CART, J48, SV, RSES, V, and SLR.
format Thesis
qualification_name masters
qualification_level Master's degree
author Benyehmad, Yahyia Mohammed M. Ali
title Data Mining Classification Techniques and Performances on Medical Data
granting_institution Universiti Utara Malaysia
granting_department Faculty of Information Technology
publishDate 2006
url https://etd.uum.edu.my/1864/1/Yahyia_Mohammed_M._Ali_Benyehmad_-_Data_mining_classification_techniques_and_performances_on_medical_data.pdf
https://etd.uum.edu.my/1864/2/Yahyia_Mohammed_M._Ali_Benyehmad_-_Data_mining_classification_techniques_and_performances_on_medical_data.pdf
_version_ 1747827220896808960
spelling my-uum-etd.1864 2013-07-24T12:13:28Z https://etd.uum.edu.my/1864/
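The evaluation procedure summarized in the description (10-fold cross-validation, then ranking techniques by average classification accuracy and average error rate across datasets) can be illustrated with a minimal sketch. This is not the thesis's implementation, which used Rosetta, Tanagra, Weka and Orange; the function names, the majority-class baseline, and the assumption that error rate is 100 minus accuracy (consistent with the reported 90.13% / 9.87% pair) are all illustrative choices.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(fit_predict, X, y, k=10, seed=0):
    """Return the mean accuracy (in percent) over k held-out folds.

    fit_predict(train_X, train_y, test_X) is a hypothetical callback that
    trains on the training split and returns predicted labels for test_X.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        preds = fit_predict(
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in held_out],
        )
        correct = sum(p == y[i] for p, i in zip(preds, held_out))
        scores.append(100.0 * correct / len(held_out))
    return mean(scores)


def majority_class(train_X, train_y, test_X):
    """Stand-in classifier (an assumption): always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_X)


def rank_by_average_accuracy(per_dataset_accuracy):
    """Rank techniques best-first by accuracy averaged over datasets.

    per_dataset_accuracy maps a technique name to its accuracy (percent)
    on each dataset. Each result row is (name, avg_accuracy, avg_error),
    with avg_error taken as 100 - avg_accuracy.
    """
    rows = []
    for name, accuracies in per_dataset_accuracy.items():
        avg = mean(accuracies)
        rows.append((name, avg, 100.0 - avg))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

In use, `cross_val_accuracy` would be called once per technique and dataset, and the resulting per-dataset accuracies passed to `rank_by_average_accuracy` to obtain an ordering like the one reported in the description.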