Composite Email Features for Spam Identification
Princy George and P. Vinod
Abstract An approach is proposed in this work to search for composite email features by applying language-specific Natural Language Processing (NLP) techniques in the email spam domain. Different style markers are employed on the Enron-spam dataset to capture the nature of emails written by spam and ham email authors. Features from five categories are extracted: character-based features, word-based features, tag-based features, structural features, and Bag-of-Words. Dimensionality reduction is subsequently applied using the TF–IDF–CF (Term Frequency–Inverse Document Frequency–Class Frequency) feature selection method to choose the prominent features from the huge feature space. Experiments are carried out on individual-feature as well as composite-feature models. The composite model produces a promising performance, with an F-measure of 0.9935 and a minimum FPR of 0.0004.
Keywords Email · Ham · Spam · Style markers · Dimensionality reduction · Composite model
1 Introduction
Email is one of the most popular means of communication in the era of the Internet. Spam, referred to as unsolicited commercial email (UCE) or unsolicited bulk email (UBE) [1], consumes most of the bandwidth. Moreover, it can also quickly consume server storage space. It is observed that the nature and the characteristics of spam change over time [2], and this demands efficient approaches for filtering unwanted emails. There are many techniques designed and developed to categorize
P. George (✉) · P. Vinod
Department of Computer Science & Engineering,
SCMS School of Engineering & Technology, Ernakulam, Kerala, India
e-mail: princyscms@[Link]
P. Vinod
e-mail: pvinod21@[Link]
© Springer Nature Singapore Pte Ltd. 2018
M. U. Bokhari et al. (eds.), Cyber Security, Advances in Intelligent Systems
and Computing 729, [Link]
emails. All these methods look only for known patterns or features (words) that usually appear in spam or ham messages in order to classify the emails; they do not consider the syntactic and the semantic peculiarities of the messages. This was the primary motivation to discover the varied features existing in spam emails. The studies on author gender identification [3] using NLP [4, 5] were another inspiration behind this work, prompting us to investigate the contribution of tag-based features and other linguistic attributes in developing an email spam classification model with minimum FPR. The contributions of our approach are: (a) an efficient model for the classification of spam and ham mails, (b) higher performance obtained by applying a feature selection method, (c) an evaluation of the efficiency of each category of attribute set in spam categorization, and (d) an overall performance, i.e., an F-measure of 0.9935 with a small FPR of 0.0004, justifying the applicability of our proposed approach in a real-time spam filtering system.
The remainder of the paper is organized as follows. Section 2 reviews the related works. The proposed mechanism is described in Sect. 3. Section 4 discusses the details of the experiments and the results of the study. Finally, inferences are presented in Sect. 5, and Sect. 6 presents the concluding remarks of the study.
2 Related Works
In [3], an author gender identification technique was proposed that achieved an accuracy of 85.1%. A new one-class ensemble scheme, which uses meta-learning to combine one-class classifiers, is put forward in [6]. Blanzieri and Bryl [7] discussed various machine learning applications for email spam filtering. Menahem et al. [8] implemented a new sender reputation mechanism based on an aggregated historical dataset. In [9], the authors designed a fusion algorithm based on online learners and experimented on TREC (Text REtrieval Conference) and other datasets. A comprehensive review of machine learning approaches to spam filtering is given in [10]. Drucker et al. [11] investigated the applicability of Support Vector Machines (SVMs) in classifying email as spam or legitimate mail. A three-layer Backpropagation Neural Network (BPNN) technique implemented on the PU1 and Ling datasets resulted in 97 and 99% classification accuracy with low execution time [12]. A three-way decision approach (accept, reject, or further examine) is discussed in [13], where experiments on the SpamBase dataset resulted in a reduced misclassification rate. Wu [14] utilized spamming behaviors with a backpropagation neural network, employed on datasets from Hopkins, Reeber, etc., to achieve improved performance (FPR = 0.0063).
3 Proposed Methodology
The email spam detection process is carried out in different steps (refer Fig. 1) and evaluated on the Enron-spam dataset [15, 16]. The following subsections introduce the proposed approach.
3.1 Preprocess Dataset
The email body is extracted from each email in the dataset. The resulting collection of extracted email bodies is partitioned into train and test sets (60:40 ratio). Style markers are treated as features in our approach. There are 31 character, 38 word, 35 tag, and 3 structural features, and 10,280 Bag-of-Words features extracted from the mail body.
• Character-based features [3] include the total number of characters (C), the ratio of the total number of lowercase letters (a–z) to C, the ratio of the total number of uppercase characters to C, the fraction of digit characters to C, the fraction of white-space characters to C, the ratio of tab-space characters to C, and the fraction of special characters to C (25 special symbol features).
• Word-based features [3] consist of the total number of words (N), the average length per word, the ratio of total different words to N, the fraction of words longer than 6 characters to N, the ratio of short words (1–3 characters) to N, Guiraud's R, Herdan's C, Rubet's K, Maas's A, Dugast's U, the Janenkov and Neistoj measure, Sichel's S, Yule's K measure, Simpson's D measure, Hapax dislegomena, Hapax legomena, Honore's R measure, Entropy, and the ratio of the word-length frequency distribution to N.
Fig. 1 Framework for email spam filtering
• Function words [3] (or grammatical words, or tag-based features) are words that express grammatical relationships with other words within a sentence. Tags are extracted from the email text using NLTK (Natural Language Tool Kit) [17] in Python, and Part-of-Speech (POS) [18] tagging is done using the Penn Treebank [5] tag set.
• Structural features [3] represent the way an author organizes the layout of a
message. The main features are total number of lines, total number of sentences
(S), and average number of words per sentence.
• In Bag-of-Words, all sentences in each email body are tokenized into a set of words, and the frequency of every term is counted within each file (called the term frequency).
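As a concrete illustration, a few of the style markers above can be computed with plain Python. This is a simplified sketch: the function name, the simple regex tokenizer, and the sample text are ours, only a small subset of the attributes is shown, and the POS-tag features would additionally require NLTK.

```python
import re
from collections import Counter

def style_features(body):
    """Compute an illustrative subset of the character-based, word-based,
    structural, and Bag-of-Words style markers for one email body."""
    C = len(body)                                    # total number of characters
    words = re.findall(r"[A-Za-z']+", body)          # crude word tokenizer
    N = len(words)                                   # total number of words
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    feats = {
        "lower_ratio": sum(c.islower() for c in body) / C if C else 0.0,
        "upper_ratio": sum(c.isupper() for c in body) / C if C else 0.0,
        "digit_ratio": sum(c.isdigit() for c in body) / C if C else 0.0,
        "avg_word_len": sum(map(len, words)) / N if N else 0.0,
        "short_word_ratio": sum(1 for w in words if len(w) <= 3) / N if N else 0.0,
        "total_lines": body.count("\n") + 1,         # structural
        "total_sentences": len(sentences),           # structural
    }
    # Bag-of-Words: per-file term frequency
    feats["tf"] = Counter(w.lower() for w in words)
    return feats

f = style_features("Buy CHEAP meds now!\nLimited offer.")
```

In the full approach each email would yield one such feature vector, and the per-category vectors are then handled separately before being combined into the composite model.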
3.2 Application of Feature Selection Method
Feature selection determines optimal attributes from a huge attribute space without changing the physical meaning of the attributes. The main benefits of dimensionality reduction (or feature selection) are (a) elimination of redundant features, (b) reduction in noise, thereby increasing the accuracy of classifiers, (c) reduction of the time complexity of classification, and (d) minimization of over-fitting of the training data.
A weighting method called TF–IDF–CF [19] is applied in our proposed approach. This method builds on TF–IDF (Term Frequency–Inverse Document Frequency), under which a term that appears in more documents is considered less important and receives a lower weight. A new attribute, called class frequency, is introduced to assess the frequency of each term in every document within a specific class. The general form of TF–IDF–CF is shown in Eq. (1).
a_ij = log(tf_ij + 1.0) × log((N + 1.0) / n_j) × (nc_ij / Nc_i)    (1)
In Eq. (1), tf_ij denotes the term frequency of term j in document i, N is the total number of instances in the dataset, and n_j is the number of documents in which term j occurs. The term nc_ij represents the number of files within the same class c to which document i belongs and in which term j appears, and Nc_i gives the total count of documents within the same class c to which document i belongs. The algorithm for extracting significant words is given below.
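A minimal sketch of the TF–IDF–CF weighting of Eq. (1), assuming the input is a list of tokenized documents with a class label per document; the function name and toy corpus are ours, not the paper's.

```python
import math
from collections import Counter

def tf_idf_cf(docs, labels):
    """Weight each (document, term) pair by Eq. (1):
    a_ij = log(tf_ij + 1) * log((N + 1) / n_j) * (nc_ij / Nc_i)."""
    N = len(docs)
    df = Counter()                       # n_j: documents containing term j
    for d in docs:
        df.update(set(d))
    class_size = Counter(labels)         # Nc_i: documents in each class
    class_df = Counter()                 # nc_ij: docs of class c containing term j
    for d, c in zip(docs, labels):
        for t in set(d):
            class_df[(c, t)] += 1
    weights = []
    for d, c in zip(docs, labels):
        tf = Counter(d)                  # tf_ij: raw term frequency in document i
        weights.append({
            t: math.log(tf[t] + 1.0)
               * math.log((N + 1.0) / df[t])
               * (class_df[(c, t)] / class_size[c])
            for t in tf
        })
    return weights

docs = [["spam", "offer"], ["spam", "meeting"], ["meeting", "agenda"]]
labels = ["s", "s", "h"]
weights = tf_idf_cf(docs, labels)
```

Note how the class-frequency factor separates the two occurrences of "meeting": it is down-weighted in the spam-class document (only half of that class contains it) relative to terms shared by every document of a class.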
3.3 Generation of Classication Models and Prediction
Feature selection produces a reduced feature vector table (FVT), which is taken as the input for training the classifiers. Multinomial Naïve Bayes (MNB) and support vector machines are used as classifiers in this investigation. Individual training models are created for each category of features during the training phase. The model with the highest F-measure is chosen for prediction. Finally, the optimal models obtained from each category of features are aggregated to develop a composite feature space used for building the spam and ham model, which is subsequently used for prediction.
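The paper itself trains its MNB and SVM models through WEKA and LibSVM; purely as a self-contained illustration of the Multinomial Naïve Bayes side of this step, the following is a minimal pure-Python MNB with Laplace smoothing (the class name and the toy token lists are ours).

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace smoothing
    over token-count features (illustrative, not the WEKA implementation)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for d, c in zip(docs, labels):
            self.counts[c].update(d)
        self.vocab = set().union(*(self.counts[c] for c in self.classes))
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def score(c):
            # log prior + sum of Laplace-smoothed log likelihoods
            return self.log_prior[c] + sum(
                math.log((self.counts[c][t] + 1) / (self.total[c] + V))
                for t in doc)
        return max(self.classes, key=score)

clf = MultinomialNB().fit(
    [["cheap", "meds", "offer"], ["cheap", "pill"],
     ["meeting", "agenda"], ["project", "meeting"]],
    ["spam", "spam", "ham", "ham"])
```

In the proposed pipeline, the rows of the reduced FVT would play the role of the token counts here, one model per feature category.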
4 Experimental Setup and Results
The experiment was performed on the Ubuntu 14.04 platform with the support of an Intel Core i7 and 8 GB RAM. In this work, 12,045 ham and 4496 spam emails have been chosen from the Enron-spam dataset. The classification models are generated by LibSVM (kernels k0 (Linear), k1 (Polynomial), k2 (Radial), and k3 (Sigmoid)) and Multinomial Naïve Bayes (MNB) in WEKA [20]. When a ham is misclassified as spam, a false positive (FP) occurs. If ham data is predicted as ham, it is known as a true negative (TN), whereas if spam is correctly classified as spam, it is a true positive (TP). When a spam is wrongly taken as ham, it is considered a false negative (FN) [21, 22]. In this analysis, the F-measure (also called the F1-score) and
FPR are used as the significant evaluation parameters. The F1-score can be interpreted as a weighted average of the precision and recall, and it ranges from 0 to 1. Precision (P) is a measure of the accuracy given that a specific class has been predicted. Recall (R) measures the proportion of actual positives that are correctly identified as such.
F-measure = 2PR / (P + R)    (2)
P = TP / (TP + FP)    (3)
R = TP / (TP + FN)    (4)
FPR = FP / (FP + TN)    (5)
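Equations (2)–(5) can be computed directly from the four confusion counts; the counts in the example call below are illustrative values of ours, not the paper's actual confusion matrix.

```python
def evaluate(tp, fp, tn, fn):
    """Precision, recall, F-measure, and FPR from confusion counts (Eqs. 2-5)."""
    p = tp / (tp + fp)                      # Eq. (3)
    r = tp / (tp + fn)                      # Eq. (4)
    return {
        "precision": p,
        "recall": r,
        "f_measure": 2 * p * r / (p + r),   # Eq. (2)
        "fpr": fp / (fp + tn),              # Eq. (5)
    }

metrics = evaluate(tp=4480, fp=5, tn=12040, fn=16)
```

With spam as the positive class, FPR directly measures the fraction of ham wrongly flagged as spam, which is why it is reported alongside the F-measure throughout this study.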
Figures 2, 3, and 4 depict the F-measure, FPR, and time parameters, respectively, obtained for six feature sets, represented as C, W, T, S, B, and E (refer Fig. 1), with five different classifiers (LibSVM-k0, k1, k2, k3, and MNB). The highest F-measure, of value 0.9983, is produced by Bag-of-Words (B) (refer Fig. 2) with a small FPR of 0.0013 by linear SVM for a feature length of 10,153 (refer Fig. 3), whereas the composite model achieves a very close F-measure of 0.9935 (refer Fig. 2) with the lowest FPR of 0.0004 (see Fig. 3) in comparison with all other models under linear SVM classification. From Fig. 4, it is clear that MNB takes less execution time but produces insignificant performance for various style markers. The top ten features from the character-based, word-based, tag-based, and bag-of-words sets, and all three variables from the structural attribute set, are given in Table 1. These attributes are found to have higher variance across the target classes, thereby standing as representative features of the writing styles adopted by email spammers [23, 24].
Fig. 2 F-measure versus classifiers
Fig. 3 FPR versus classifiers
Fig. 4 Time versus classifiers
5 Inferences
It has been observed that the SVM classifier performs well with a large number of features, but it is computationally expensive. Performance is higher when stop words are removed from the text before model construction. Independent style markers with a small number of features produced insignificant results in terms of F-measure, which is clearly visible for the character-based, word-based, and structural features. Hence, these attributes are not sufficient to prepare an efficient spam filtering model independently. This is due to the absence of attributes having high correlation with the target class. As the features in the feature space increase, the performance also improves, since relevant attributes contributing to effective classification appear as candidates in the optimal feature space. This is why bag-of-words produced the highest F-measure with a larger feature space of size 10,153. Tag-based attributes and bag-of-words played an important role in the generation of the composite model, as they could produce a low FPR value and an appreciable F-measure in their independent model generation. Therefore, an ensemble of different sets of style markers resulted in better performance. Finally, it has been found from the investigation that the composite feature space can greatly reduce the misclassification of ham messages as spam. This is proved by the lowest FPR value (0.0004) produced by the composite model along with a high F-measure (0.9935), compared to all independent models generated in our study. This makes our proposed meta-feature model a good detector system for real-time applications.

Table 1 Top features from each attribute category

Character  Word                  Tag                     Structural                            BoW
|          Words with length 20  Foreign word            Total number of lines                 Php
%          Words with length 18  -None-                  Total number of sentences             Sex
_          Words with length 17  Adjective, superlative  Average number of words per sentence  Meds
}          Words with length 19  Noun, proper singular                                         Medications
=          Words with length 16  Adjective, comparative                                        Pill
+          Words with length 15  Adverb, comparative                                           Macromedia
$          Words with length 13  Adverb, superlative                                           Dose
{          Words with length 14  Pronoun, possessive                                          Mai
`          Words with length 12  Predeterminer                                                 Tongue
*          Honore's R            Adverb                                                        Wi
6 Conclusion
In this investigation of email spam classification, TF–IDF–CF is the dimensionality reduction method applied on various style markers, such as character-based, word-based, tag-based, structural, and bag-of-words features, to choose relevant features from the large feature space. A vector space model for the relevant features was constructed and given to the classifiers through the WEKA tool. The composite feature space generated an effective model for email spam classification, producing a very high F-measure (0.9935) and a very small FPR (0.0004) with the linear support vector machine, outperforming the independent models. The study can be extended in the future to find out whether male or female authors are more involved in email spam generation.
References
1. Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497
2. Wang D, Irani D, Pu C (2013) A study on evolution of email spam over fifteen years. In: Bertino E, Georgakopoulos D, Srivatsa M, Nepal S, Vinciarelli A (eds) CollaborateCom, pp 1–10. ICST/IEEE
3. Cheng N, Chandramouli R, Subbalakshmi KP (2011) Author gender identification from text. Digital Invest 8(1):78–88
4. Manning CD, Schutze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge
5. Bird S, Klein E, Loper E (2009) Natural language processing with Python. O'Reilly Media, Inc.
6. Menahem E, Rokach L, Elovici Y (2013) Combining one-class classifiers via meta learning. In: Proceedings of the 22nd ACM international conference on information and knowledge management. ACM
7. Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92
8. Menahem E, Pusiz R, Elovici Y (2012) Detecting spammers via aggregated historical data set. In: Network and system security. Springer, Berlin, pp 248–262
9. Xu C, Su B, Cheng Y, Pan W, Chen L (2014) An adaptive fusion algorithm for spam detection. IEEE Intell Syst 29(4):2–8
10. Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222
11. Drucker H, Wu S, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054
12. Ruan G, Tan Y (2010) A three-layer back-propagation neural network for spam detection using artificial immune concentration. Soft Comput 14(2):139–150
13. Zhou B, Yao Y, Luo J (2010) A three-way decision approach to email spam filtering. In: Farzindar A, Keselj V (eds) Canadian conference on AI. LNCS, vol 6085. Springer, pp 28–39
14. Wu C-H (2009) Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks. Expert Syst Appl 36(3):4321–4330
15. Bekkerman R (2004) Automatic categorization of email into folders: benchmark experiments on Enron and SRI corpora
16. The Enron-Spam Datasets. [Link]
17. Natural Language Tool Kit (NLTK). [Link]
18. POS tagging. [Link] pos-tagger
19. Liu M, Yang J (2012) An improvement of TFIDF weighting in text categorization. In: International proceedings of computer science and information technology, pp 44–47
20. WEKA-Data Mining Software in Java. [Link]
21. Han J, Kamber M (2005) Data mining: concepts and techniques. Kaufmann, San Francisco
22. Kibriya AM, Frank E, Pfahringer B, Holmes G (2004) Multinomial Naïve Bayes for text categorization revisited. In: Webb GI, Yu X (eds) Australian conference on artificial intelligence. LNCS, vol 3339. Springer, Berlin, pp 488–499
23. Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with Naïve Bayes - which Naïve Bayes? In: CEAS, pp 27–28
24. Bird S (2006) NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on interactive presentation sessions. Association for Computational Linguistics