International Journal of Applied Science and Engineering
Published by Chaoyang University of Technology

Spam classification problems using support vector machine and grid search

Christine Dewi 1, Fransiskus Andika Indriawan 1, Henoch Juli Christanto 2*

1 Department of Information Technology, Satya Wacana Christian University, Salatiga City, 50711, Indonesia

2 Department of Information System, Atma Jaya Catholic University of Indonesia, Jakarta 12930, Indonesia


 



ABSTRACT


Spam classification is an important task for identifying unwanted and potentially harmful emails, and its importance grows with the number of internet users. In this paper, we propose an approach to spam classification using Support Vector Machines (SVM) with grid search hyperparameter optimization. Our research differs from existing studies by focusing specifically on integrating SVM with grid search to tune hyperparameters. Additionally, we provide a unique dataset comprising diverse samples of spam emails for evaluation purposes. We also apply pre-processing techniques: removing unnecessary tokens such as stop words and punctuation marks, and stemming words to their base forms. To optimize the SVM model, we use grid search to determine the best values of the hyperparameters C, gamma, and the kernel. In the first experiment, on the first dataset, grid search yields the optimal parameters {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'} and raises accuracy from 98.02% to 98.47%. In the second experiment, on the second dataset, accuracy rises from 98.8% with non-optimized parameters to 99.1%. These gains of 0.45 and 0.3 percentage points indicate that grid search improves spam classification accuracy. The experimental results demonstrate that our approach outperforms existing methods in terms of accuracy, precision, and recall. The findings have significant implications for improving spam detection systems and enhancing the overall effectiveness of email communication.


Keywords: SVM, Spam classification, Grid search, Machine learning.
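
The workflow summarized in the abstract (pre-processing, then an SVM tuned by grid search over C, gamma, and the kernel) can be illustrated with a minimal sketch. It is not the authors' implementation. The use of scikit-learn, the TF-IDF feature step, the toy messages, the small stop-word list, the `crude_stem` helper (a stand-in for a real stemmer such as Porter's), and the candidate grid values are all assumptions made for illustration.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in",
              "for", "you", "your", "at", "on"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(crude_stem(w) for w in text.split() if w not in STOP_WORDS)


# Tiny hypothetical corpus; 1 = spam, 0 = legitimate message.
texts = [
    "Win a free prize now, click here!",
    "Free entry: claim your cash reward today",
    "Urgent! You have won a free holiday",
    "Claim your free bonus cash now",
    "Cheap loans approved, click to claim",
    "Winner! Free tickets waiting, claim now",
    "Are we still meeting for lunch today?",
    "Please review the attached project report",
    "Can you send me the meeting notes?",
    "See you at the team meeting tomorrow",
    "The report is due on Friday",
    "Thanks for the notes, lunch was great",
]
labels = [1] * 6 + [0] * 6

# TF-IDF features feeding an SVM classifier (feature choice is an assumption).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
    ("svm", SVC()),
])

# Candidate values are illustrative; every combination is cross-validated.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1, 0.1, 0.01, 0.001],
    "svm__kernel": ["rbf", "linear"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
```

Grid search here fits the pipeline once per parameter combination and fold, then keeps the combination with the highest mean cross-validated accuracy. On a corpus this small the selected values carry no meaning; the sketch only shows the mechanics.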





ARTICLE INFORMATION


Received: 2023-06-18
Revised: 2023-07-11
Accepted: 2023-07-21
Available Online: 2023-12-01


Cite this article:

Dewi, C., Indriawan, F.A., Christanto, H.J. 2023. Spam classification problems using support vector machine and grid search. International Journal of Applied Science and Engineering, 20, 2023214. https://doi.org/10.6703/IJASE.202312_20(4).006

  Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.