Text Categorization Model Based on Linear Support Vector Machine

Augustus Benson

Text Categorization Model Based on Linear Support Vector Machine

2022

Abstract

Spam mails constitute a lot of nuisances in our electronic mail boxes, as they occupy huge spaces which could rather be used for storing relevant data. They also slow down network connection speed and make communication over a network slow. Attackers have often employed spam mails as a means of sending phishing mails to their targets in order to perpetrate data breach attacks and other forms of cybercrimes. Researchers have developed models using machine learning algorithms and other techniques to filter spam mails from relevant mails, however, some algorithms and classifiers are weak, not robust, and lack visualization models which would make the results interpretable by even non-tech savvy people. In this work, Linear Support Vector Machine (LSVM) was used to develop a text categorization model for email texts based on two categories: Ham and Spam. The processes involved were dataset import, preprocessing (removal of stop words, vectorization), feature selection (weighing and sele...

Electronic Electronic messages, i.e. e-mails, are a communication tool frequently used by individuals or organizations. While e-mail is extremely practical to use, it is necessary to consider its vulnerabilities. Spam e-mails are unsolicited messages created to promote a product or service, often sent frequently. It is very important to classify incoming e-mails in order to protect against malware that can be transmitted via e-mail and to reduce possible unwanted consequences. Spam email classification is the process of identifying and distinguishing spam emails from legitimate emails. This classification can be done through various methods such as keyword filtering, machine learning algorithms and image recognition. The goal of spam email classification is to prevent unwanted and potentially harmful emails from reaching the user's inbox. In this study, Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms are used to classify spam emails and the results are compared. Algorithms with different approaches were used to determine the best solution for the problem. 5558 spam and non-spam e-mails were analyzed and the performance of the algorithms was reported in terms of accuracy, precision, sensitivity and F1-Score metrics. The most successful result was obtained with the RF algorithm with an accuracy of 98.83%. In this study, high success was achieved by classifying spam emails with machine learning algorithms. In addition, it has been proved by experimental studies that better results are obtained than similar studies in the literature. 1. Introduction With the widespread use of the Internet, electronic communication has become more preferred. One of the most important tools of electronic communication is electronic messages, which we call e-mail. Today, individuals or organizations have one or more email accounts. Instant delivery of messages, no cost and ease of use increase the importance and prevalence of e-mail [1]. According to Statista Research Department data, the number of actively used e-mail accounts in 2020 is more than 4 billion. This number is estimated to increase to 4.6 billion in 2025. In 2020, 306 billion e-mails are sent and received every day, and this number is expected to exceed 376 billion in 2025 [2]. The use of e-mail is not only practical but also has various vulnerabilities. The e-mail account to be hijacked in various ways, for e-mails containing advertisements etc. to hijack your computer by installing a software on your computer when you click on the advertisement, and for the installed software to disrupt communication by sometimes filling the

Log In

Text Categorization Model Based on Linear Support Vector Machine

Sign up for access to the world's latest research

Abstract

Related papers

Related topics