Hybrid features for detection of malicious user in YouTube

Social media is any site that gives a network of people a place to make connections. One example is YouTube, which connects people through video sharing. Unfortunately, because of the explosive growth in users and shared content, there are malicious users who aim to self-pro...

Bibliographic Details
Main Author: Sadoon, Omar Hadeb
Format: Thesis
Language: eng
Published: 2017
Subjects:
Online Access: https://etd.uum.edu.my/6562/1/816170_01.pdf
id my-uum-etd.6562
record_format uketd_dc
institution Universiti Utara Malaysia
collection UUM ETD
language eng
advisor Yusof, Yuhanis
topic T58.5-58.64 Information technology
description Social media is any site that gives a network of people a place to make connections. One example is YouTube, which connects people through video sharing. Unfortunately, because of the explosive growth in users and shared content, there are malicious users who aim to self-promote their videos or to spread viruses and malware. Although malicious users have been detected using various features, such as content, user social activity, social network analysis, or hybrid features, the detection rate is still considered low (i.e., 46%). This study proposes a new feature set that combines features of the user, features of user behaviour, and features derived from the EdgeRank concept. The work was carried out by analysing a set of YouTube users and their shared videos. The users were then classified with 22 classifiers using the proposed feature set. The evaluation compared the classification results obtained with the proposed hybrid features against those obtained with the non-hybrid ones. The experiments showed that most of the classifiers obtained better results with the hybrid features than with the non-hybrid set. The average classification accuracy for the hybrid feature set is 95.6%. The result indicates that the proposed work would benefit YouTube users, because malicious users who share non-relevant content can be detected. The results also contribute to the optimization of system resources and to trust among users.
format Thesis
qualification_name masters
qualification_level Master's degree
author Sadoon, Omar Hadeb
author_facet Sadoon, Omar Hadeb
author_sort Sadoon, Omar Hadeb
title Hybrid features for detection of malicious user in YouTube
granting_institution Universiti Utara Malaysia
granting_department Awang Had Salleh Graduate School of Arts & Sciences
publishDate 2017
url https://etd.uum.edu.my/6562/1/816170_01.pdf
_version_ 1747828092486811648
spelling my-uum-etd.6562 2021-08-18T06:40:02Z Hybrid features for detection of malicious user in YouTube 2017 Sadoon, Omar Hadeb Yusof, Yuhanis Awang Had Salleh Graduate School of Arts & Sciences T58.5-58.64 Information technology 2017 Thesis https://etd.uum.edu.my/6562/ https://etd.uum.edu.my/6562/1/816170_01.pdf text eng public masters Universiti Utara Malaysia
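The description above mentions features built on the EdgeRank concept but does not say how they are computed. As a rough illustration only, EdgeRank is commonly described as a sum over interactions ("edges") of affinity times interaction weight times time decay. The sketch below follows that general form; the `Edge` fields, the decay function, and the sample values are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Edge:
    """One interaction with a shared video (all fields are illustrative)."""
    affinity: float   # assumed closeness between the two users
    weight: float     # assumed weight for the interaction type (e.g. comment vs. view)
    age_days: float   # time elapsed since the interaction, in days


def time_decay(age_days: float) -> float:
    # Hypothetical decay: newer interactions count more than older ones.
    return 1.0 / (1.0 + age_days)


def edgerank_score(edges: Iterable[Edge]) -> float:
    # EdgeRank-style score: sum of affinity * weight * decay over all edges.
    return sum(e.affinity * e.weight * time_decay(e.age_days) for e in edges)


# Two sample interactions: one from today, one a day old.
edges = [Edge(affinity=0.5, weight=2.0, age_days=0.0),
         Edge(affinity=1.0, weight=1.0, age_days=1.0)]
score = edgerank_score(edges)
```

A score of this kind could be added as one more column alongside user and behaviour features before classification, although the record does not confirm that the thesis combines them in this way.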