2022
Spam mails are a considerable nuisance in our electronic mailboxes, as they occupy large amounts of space that could instead be used for storing relevant data. They also slow down network connection speed and make communication over a network slow. Attackers have often employed spam mails as a means of sending phishing mails to their targets in order to perpetrate data breach attacks and other forms of cybercrime. Researchers have developed models using machine learning algorithms and other techniques to filter spam mails from relevant mails; however, some algorithms and classifiers are weak, not robust, and lack visualization models that would make the results interpretable even by non-tech-savvy people. In this work, a Linear Support Vector Machine (LSVM) was used to develop a text categorization model for email texts based on two categories: Ham and Spam. The processes involved were dataset import, preprocessing (removal of stop words, vectorization), feature selection (weighing and sele...
IEEE Transactions on Neural Networks, 1999
We study the use of support vector machines (SVM's) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM's performed best when using binary features. For both data sets, boosting trees and SVM's had acceptable test performance in terms of accuracy and speed. However, SVM's had significantly less training time.
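To make the notion of "binary features" concrete, the following minimal sketch contrasts a binary bag-of-words vector (word present or absent) with a count vector. The whitespace tokenizer and the helper names `tokenize`, `build_vocabulary`, `binary_vector`, and `count_vector` are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def binary_vector(text, vocabulary):
    """Binary feature: 1 if the word occurs in the message, else 0."""
    present = set(tokenize(text))
    return [1 if tok in present else 0 for tok in vocabulary]


def count_vector(text, vocabulary):
    """Count feature: how many times each vocabulary word occurs."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


docs = ["free money free offer", "meeting at noon"]
vocab = build_vocabulary(docs)
print(binary_vector(docs[0], vocab))
print(count_vector(docs[0], vocab))
```

The two vectors differ only in the column for the repeated word "free", which is the information a binary representation discards.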
… Intelligence for Modelling, …, 2005
Spam is commonly defined as unsolicited email messages and the goal of spam categorization is to distinguish between spam and legitimate email messages. Many researchers have been trying to separate spam from legitimate emails using machine learning algorithms based ...
Nowadays, the increasing volume of spam is an annoyance for internet users. Spam is commonly defined as unsolicited email messages, and the goal of spam detection is to distinguish between spam and legitimate email messages. Much spam can contain viruses, Trojan horses, or other harmful software that may lead to failures in computers and networks; it also consumes network bandwidth and storage space and slows down email servers. In addition, it provides a medium for distributing harmful code and/or offensive content. Because there is no complete solution to this problem, the need for effective spam filters is increasing. In recent years, machine learning techniques have proved usable for automatic spam filtering. The Support Vector Machine (SVM) is a powerful, state-of-the-art machine learning algorithm and a good option for separating spam from legitimate email. In this article, we consider the evaluation criteria of SVM for spam detection and filtering.
Internet e-mail infrastructure has become one of the most significant and popular means of communication among end users and for e-commerce and academic research, because it is rapid, inexpensive, and very active. This e-mail infrastructure is used for daily work. Sometimes we receive many undesirable e-mails from unknown sources. These unwanted e-mails are identified as spam e-mails. Distinguishing ham from spam e-mail is a main target, and a variety of classification algorithms have been implemented for it. The complexity of a classifier is substantially reduced if the number of features in the spam e-mail data set is reduced. This paper presents some of the most common data mining algorithms, J48, Support Vector Machine (SVM), and Naive Bayes, for the spam e-mail classification problem. The standard Spambase dataset is used. The objective of our study is to enhance spam e-mail classification. An experimental study is carried out to build a classifier on the standard spam e-mail dataset, which includes ham and spam e-mail messages. Rough Set Theory (RST) and Symmetric Uncertainty (SU) methods are used to reduce the dimensionality of the spam e-mail data. The feature subsets obtained by RST and SU are used to train and test the different classifiers. The results obtained with the reduced feature set and with the original data set are compared. The results show that the classifiers trained on the reduced features outperform the existing systems.
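As a rough illustration of how a symmetric uncertainty score can rank a feature against the class label, the sketch below uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) for discrete values. The function names and the toy data are assumptions made for illustration; the abstract does not describe the authors' implementation, and Rough Set Theory is not covered here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(feature, labels):
    """SU = 2 * I(X;Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so nothing to measure
    h_xy = entropy(list(zip(feature, labels)))  # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


# Toy data (made up): one feature that tracks the label, one that does not.
labels = ["spam", "spam", "ham", "ham"]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]
print(symmetric_uncertainty(informative, labels))
print(symmetric_uncertainty(uninformative, labels))
```

A feature selection step along these lines would keep the features with the highest scores and drop the rest before training a classifier.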
2004
The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some relevant ideas and run a set of systematic experiments on e-mail categorization with four machine learning algorithms applied to different parts of an e-mail. Experimental results reveal that the e-mail header provides very useful information for all the machine learning algorithms considered to detect spam e-mail.
2020
Here we present an inclusive review of recent and successful content-based e-mail spam filtering techniques. Our focus is mainly on machine learning-based spam filters and the variants inspired by them. We report on relevant ideas, techniques, major efforts, and the state of the art in the field. The initial discussion of prior work covers the basics of e-mail spam filtering and feature engineering. We conclude by studying techniques, methods, and evaluation benchmarks, exploring promising offshoots of the latest developments, and suggesting lines of future investigation. Keywords: SVM Classifier, Spam Email Classification, Data Mining, Data Science, Machine Learning.
International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022
Email is a communication application used worldwide, because it is easier to use and faster than other communication applications. However, its inability to detect whether mail content is spam or ham degrades its performance. Nowadays, many cases of personal information theft and phishing via email have been reported. This project discusses how machine learning helps in spam detection. Machine learning is an application of artificial intelligence that provides the ability to learn and improve automatically from data without being explicitly programmed. A binary classifier is used to classify text into two categories: spam and ham. The algorithm will predict the score more accurately. The objective of developing this model is to detect and score words faster and more accurately.
The majority of previous data mining studies have concentrated on structured data, such as relational, transactional, and data warehouse data. In reality, however, a substantial portion of the available information is stored in text databases, which consist of large collections of documents from various sources, such as news articles, research papers, e-books, digital libraries, e-mails, and Web pages. Moreover, this information is growing and is terabytes in size. Among the many services the internet provides, e-mail is very useful and broadly used. Spam is an issue closely tied to e-mail. Among the various approaches developed to stop spam e-mails, filtering is an important and popular one. In this paper, to categorize the spam and non-spam e-mail that arrives at our e-mail address, we apply the KNNC classification method, which can achieve better accuracy using the Vector Space Model in an adaptive manner. To measure spam classification accuracy, we used two datasets, a personal dataset and the Ling Spam Corpus (Lemm dataset), and applied KNNC classification to them. Using the adaptive approach, we obtained nearly 95% precision on spam and 86.6% precision on nonspam, with 83% accuracy on the personal dataset and 80% on the Lemm dataset. After reviewing the results and related work, we propose our own solution: the adaptive approach using the vector space model in the KNNC classification method efficiently provides better accuracy for filtering spam mail on both smaller and larger datasets.
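The general idea of nearest-neighbour classification over a vector space model can be sketched as follows. This is a plain k-nearest-neighbours vote with cosine similarity over raw term counts, offered only as an assumption-laden illustration: the abstract does not specify how KNNC or its adaptive variant works, and the training messages below are made up.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a message as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(text, training, k=3):
    """Label a message by majority vote among its k most similar neighbours."""
    query = to_vector(text)
    ranked = sorted(
        training,
        key=lambda pair: cosine(query, to_vector(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training messages, for illustration only.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("project meeting notes", "ham"),
    ("free money prize", "spam"),
]
print(knn_classify("claim free money", training))
print(knn_classify("meeting notes tomorrow", training))
```

The choice of k and of the term weighting (raw counts here) are both tunable; an adaptive scheme would presumably adjust such settings as new mail arrives, which this sketch does not attempt.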
2015
Emails are used by many users for educational or professional purposes. However, spam mails cause serious problems for email users, such as wasting their energy and their search time. This paper is a survey of some popular classification techniques for identifying whether an email is spam or non-spam. To represent spam mails, we use the vector space model (VSM). Since emails contain so many different words and not every classifier can handle such a high dimensionality, only a few powerful classification terms should be used. Another reason is that some terms may not have any standard meaning, which may confuse the classifier.
Lecture Notes in Computer Science, 2004
Many solutions have been deployed to prevent the harmful effects of spam mail. Typical methods are either keyword-based pattern matching or probabilistic methods such as the naive Bayesian method. In this paper, we propose a method for classifying spam mail and normal mail using a support vector machine, which has excellent performance in binary pattern classification problems. In particular, the proposed method efficiently carries out the learning procedure with a word dictionary built from n-grams. In conclusion, we show that the proposed method is superior to the others in a performance comparison.
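One way to picture a dictionary built from n-grams is the sketch below, which collects word-level n-grams from a set of messages and assigns each one a feature index. Whether the paper uses word or character n-grams is not stated in the abstract, so word n-grams, the `min_count` threshold, and all function names here are assumptions.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams in a message, in order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_dictionary(documents, n=2, min_count=1):
    """Map each n-gram seen at least min_count times to a feature index."""
    counts = Counter()
    for doc in documents:
        counts.update(word_ngrams(doc, n))
    kept = sorted(gram for gram, c in counts.items() if c >= min_count)
    return {gram: i for i, gram in enumerate(kept)}


docs = ["win a free prize", "a free lunch"]
print(word_ngrams(docs[0], 2))
print(build_ngram_dictionary(docs, n=2, min_count=2))
```

Each message could then be encoded as a vector over this dictionary and passed to a classifier such as an SVM.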
International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2022
Email is one of the most popular modes of communication we have today. Billions of emails are sent every day in our world but not every one of them is relevant or of importance. The irrelevant and unwanted emails are termed email spam. These spam emails are sent with many different targets that range from advertisement to data theft. Filtering these spam emails is very essential in order to keep the email space fluent in its functioning. Machine Learning algorithms are being extensively used in the classification of spam emails. This paper showcases the performance evaluation of some selected supervised Machine Learning algorithms namely Naive Bayes Classifier, Support Vector Machine, Random Forest, & XG-Boost for spam email classification on a combination of three different datasets. For feature extraction, both Bag of Words & TF-IDF models were used separately and performance with both of these approaches was also compared. The results showed that SVM performed better than all the other algorithms when trained with TF-IDF feature vectors. The performance metrics used were accuracy, precision, recall, and f1-score, along with the ROC curve.
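The difference between the Bag of Words and TF-IDF representations mentioned above can be sketched in a few lines. The version below uses the textbook weighting tf × log(N / df); libraries commonly apply smoothed or normalized variants, so the exact numbers are an assumption of this sketch rather than what the paper computed.

```python
import math
from collections import Counter


def bag_of_words(documents):
    """Raw term counts per document (Bag of Words)."""
    return [Counter(doc.lower().split()) for doc in documents]


def tf_idf(documents):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    counts = bag_of_words(documents)
    n_docs = len(documents)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts a term once
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (k / total) * math.log(n_docs / doc_freq[t]) for t, k in c.items()}
        )
    return weights


docs = ["free offer free", "meeting offer"]
print(bag_of_words(docs)[0])
print(tf_idf(docs)[0])
```

In this toy corpus the word "offer" appears in every document, so its TF-IDF weight is zero even though its raw count is not; that down-weighting of ubiquitous terms is the main practical difference between the two feature sets.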
2015
With the technological revolution in the 21st century, electronic mail (e-mail) has reduced the time and distance of communication. Furthermore, the growing use of e-mail has led to the emergence and further growth of problems caused by unsolicited bulk e-mails, commonly referred to as spam e-mail. Many existing supervised algorithms, such as the Support Vector Machine (SVM), were developed to stop spam e-mail. However, dealing with large data and a high-dimensional feature space can lead to high execution time and low accuracy in spam e-mail classification. Nowadays, removing irrelevant and redundant features, besides finding the optimal (or near-optimal) subset of features, significantly influences the performance of spam e-mail classification; this has become one of the important challenges. Therefore, in order to optimize spam e-mail classification accuracy, dimensionality reduction issues need to be solved. Feature selection schemes become very import...
2014
The internet plays a major role in communication nowadays, but in e-mail, spam is the major problem. Email spam, also known as junk email, is unwanted, inappropriate, or no longer wanted mail. To eliminate these spam mails, spam filtering methods are implemented using classification algorithms. Among the various algorithms, the Support Vector Machine (SVM) has been used by various researchers as an effective classifier for spam classification. However, its accuracy is not yet at a notable level. To improve the accuracy further, Latent Semantic Indexing (LSI) is used as a feature extraction method to select a suitable feature space. The hybrid model of spam mail classification can provide effective results. The Ling spam email corpus is used as the dataset for the experiments. The performance of the system is evaluated using measures such as recall, precision, and overall accuracy.
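For reference, the evaluation measures named above (recall, precision, and overall accuracy) can be computed from a confusion matrix as in this minimal sketch. Treating "spam" as the positive class and the label strings themselves are assumptions made for illustration; the predictions are made up and do not reproduce any result from the paper.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive="spam"):
    """Return precision, recall, and accuracy as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
    }


# Made-up labels and predictions, for illustration only.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(evaluate(y_true, y_pred))
```

Precision penalizes legitimate mail flagged as spam, while recall penalizes spam that slips through, which is why spam filtering studies usually report both alongside accuracy.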
Computer Engineering and Intelligent Systems, 2020
Emails are essential to present-century communication; however, spam emails have contributed negatively to the success of such communication. Studies have been conducted to classify messages in an effort to distinguish between ham and spam email by building an efficient and sensitive classification model with high accuracy and a low false positive rate. Regular rule-based classifiers have been overwhelmed and made less effective by the geometric growth in spam messages, hence the need to develop a more reliable and robust model. Classification methods employed include SVM (support vector machine), Bayesian, Naïve Bayes, Bayesian with Adaboost, and Naïve Bayes with Adaboost. For this project, however, the Bayesian method was employed, using the Python programming language to develop a classification model.
Sakarya Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 2023
Electronic messages, i.e. e-mails, are a communication tool frequently used by individuals or organizations. While e-mail is extremely practical to use, it is necessary to consider its vulnerabilities. Spam e-mails are unsolicited messages created to promote a product or service, often sent frequently. It is very important to classify incoming e-mails in order to protect against malware that can be transmitted via e-mail and to reduce possible unwanted consequences. Spam email classification is the process of identifying and distinguishing spam emails from legitimate emails. This classification can be done through various methods such as keyword filtering, machine learning algorithms and image recognition. The goal of spam email classification is to prevent unwanted and potentially harmful emails from reaching the user's inbox. In this study, Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms are used to classify spam emails and the results are compared. Algorithms with different approaches were used to determine the best solution for the problem. 5558 spam and non-spam e-mails were analyzed and the performance of the algorithms was reported in terms of accuracy, precision, sensitivity and F1-Score metrics. The most successful result was obtained with the RF algorithm with an accuracy of 98.83%. In this study, high success was achieved by classifying spam emails with machine learning algorithms. In addition, it has been proved by experimental studies that better results are obtained than similar studies in the literature.
1. Introduction
With the widespread use of the Internet, electronic communication has become more preferred. One of the most important tools of electronic communication is electronic messages, which we call e-mail. Today, individuals or organizations have one or more email accounts.
Instant delivery of messages, no cost and ease of use increase the importance and prevalence of e-mail [1]. According to Statista Research Department data, the number of actively used e-mail accounts in 2020 was more than 4 billion. This number is estimated to increase to 4.6 billion in 2025. In 2020, 306 billion e-mails were sent and received every day, and this number is expected to exceed 376 billion in 2025 [2]. The use of e-mail is not only practical but also has various vulnerabilities. E-mail accounts can be hijacked in various ways; e-mails containing advertisements and the like can hijack your computer by installing software on it when you click on the advertisement; and the installed software can disrupt communication by sometimes filling the
International Journal of Management, IT and Engineering, 2012
The majority of previous data mining studies have concentrated on structured data, such as relational, transactional, and data warehouse data. In reality, however, a substantial portion of the available information is stored in text databases, which consist of large collections of documents from various sources, such as news articles, research papers, e-books, digital libraries, e-mails, and Web pages. Moreover, this information is growing and is terabytes in size. Among the many services the internet provides, e-mail is very useful and broadly used. Spam is an issue closely tied to e-mail. Among the various approaches developed to stop spam e-mails, filtering is an important and popular one. In this paper, to categorize the spam and non-spam e-mail that arrives at our e-mail address, we apply the KNNC classification method, which can achieve better accuracy using the Vector Space Model in an adaptive manner. For getting accuracy in spam classification we have us...
Background: As the number of people using social media increases, data generation also increases, and the data generated may be safe or unsafe. Consider applications such as Twitter and e-mail: we receive many emails and tweets that contain both dangerous and useful things. To stay safe from these threats and dangers, we need a filter that separates useful messages from spam and keeps us from falling into a trap. One approach to doing this is explained in this paper. The algorithm followed is the Naïve Bayes classifier. The paper also compares Naïve Bayes, KNN, and Logistic Regression on the same spam filtering problem, using term frequency-inverse document frequency (TFIDF).
Today, a large volume of spam disrupts email communication. In this article we present an approach to spam identification based on Support Vector Machines (SVMs). With this method, three steps should occur on email delivery: first a reoperation, then flowing data. In the operation step, the user sends an email and preprocessing is done by the data miner system. The amount of training data is chosen with a window-based scheme; with the default, W=100, the first 100 data items are used as the training set. Each delivered email is input to the SVM to be sorted into two predetermined categories: Non-spam and Spam. An algorithm is written in which four different time-window sizes are selected for SVM training (100, 200, 500, and all the preset data, i.e. an open window). The assessment criteria include accuracy, recall, and precision. The results show that some specialists have criticisms of the technique.
International Journal of Applied Science and Engineering, 2022
Spam classification is an important task in identifying unwanted and potentially harmful emails for internet users. The increasing number of internet users highlights the growing importance of handling spam effectively. In this paper, we propose an approach for spam classification using Support Vector Machines (SVM) with grid search hyperparameter optimization. Our research differs from existing studies by specifically focusing on the integration of SVM with grid search to achieve optimal hyperparameter tuning. Additionally, we provide a unique dataset comprising diverse samples of spam emails for evaluation purposes. We also employ pre-processing techniques, including the removal of unnecessary words such as stop words and punctuation marks, as well as word stemming to convert words into their base forms. To optimize the performance of the SVM model, we use Grid Search to determine the optimal values for hyperparameters, including C, gamma, and the kernel. The results of the first experiment using SVM with the first dataset show that grid search yields the optimal parameters {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}, resulting in an accuracy improvement from 98.02% to 98.47%. In the second experiment using the second dataset, the accuracy obtained is 99.1%, compared to the previous non-optimized parameters which achieved 98.8%. These results indicate a significant improvement in spam classification accuracy. The experimental results demonstrate that our approach outperforms existing methods in terms of accuracy, precision, and recall. The findings of our research have significant implications for improving spam detection systems and enhancing the overall effectiveness of email communication.
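The grid search procedure described above amounts to an exhaustive loop over every combination of candidate hyperparameter values. The sketch below shows that loop in isolation; `toy_score` is a made-up stand-in for the step that would train an SVM and return its validation accuracy, and the candidate values in `param_grid` are assumptions, not the grid used in the paper.

```python
import math
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical scoring function, NOT a trained model.

    In practice this would fit an SVM with the given C, gamma, and kernel
    and return cross-validated accuracy on held-out e-mails.
    """
    bonus = 0.1 if params["kernel"] == "rbf" else 0.0
    return (
        -abs(math.log10(params["C"]) - 1)
        - abs(math.log10(params["gamma"]) + 2)
        + bonus
    )


# Assumed candidate values, chosen only to exercise the loop.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
    "kernel": ["linear", "rbf"],
}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

Because the cost grows with the product of the list lengths (4 × 3 × 2 = 24 evaluations here), grids are usually kept coarse, often on a logarithmic scale for C and gamma.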