Composite Email Features for Spam Identification
Princy George and P. Vinod
Abstract An approach is proposed in this work to search for composite email features by applying language-specific Natural Language Processing (NLP) techniques in the email spam domain. Different style markers are employed on the Enron-spam dataset to capture the nature of emails written by spam and ham email authors. Features from five categories are extracted: character-based features, word-based features, tag-based features, structural features, and Bag-of-Words. Dimensionality reduction is subsequently applied using the TF–IDF–CF (Term Frequency–Inverse Document Frequency–Class Frequency) feature selection method to choose the prominent features from the huge feature space. Experiments are carried out on individual-feature as well as composite-feature models. The composite model produces a promising performance, with an F-measure of 0.9935 and a minimum FPR of 0.0004.
Keywords Email · Ham · Spam · Style markers · Dimensionality reduction · Composite model
1 Introduction
Email is one of the most popular means of communication in the era of the Internet. Spam, referred to as unsolicited commercial email (UCE) or unsolicited bulk email (UBE) [1], consumes most of the bandwidth. Moreover, it can also quickly consume server storage space. It is observed that the nature and the characteristics of spam change over time [2], and this demands efficient approaches for filtering unwanted emails. There are many techniques designed and developed to categorize
P. George (✉) · P. Vinod
Department of Computer Science & Engineering,
SCMS School of Engineering & Technology, Ernakulam, Kerala, India
e-mail: princyscms@[Link]
P. Vinod
e-mail: pvinod21@[Link]
© Springer Nature Singapore Pte Ltd. 2018
M. U. Bokhari et al. (eds.), Cyber Security, Advances in Intelligent Systems
and Computing 729, [Link]
emails. All these methods look only for known patterns or features (words) that usually appear in spam or ham messages in order to classify the emails; they do not consider the syntactic and the semantic peculiarities of the messages. This was the primary motivation to discover the varied features existing in spam emails. The studies on author gender identification [3] using NLP [4, 5] were another inspiration behind this work, prompting us to investigate the contribution of tag-based features and other linguistic attributes in developing an email spam classification model with minimum FPR. The contributions of our approach are: (a) an efficient model for the classification of spam and ham mails, (b) higher performance obtained by applying a feature selection method, (c) an evaluation of the efficiency of each category of attribute set in spam categorization, and (d) an overall performance, i.e., an F-measure of 0.9935 with a small FPR of 0.0004, justifying the applicability of our proposed approach in a real-time spam filtering system.
The remainder of the paper is organized as follows. Section 2 reviews the related works. The proposed mechanism is described in Sect. 3. Section 4 discusses the details of the experiments and the results of the study. Finally, inferences are presented in Sect. 5, and Sect. 6 presents the concluding remarks of the study.
2 Related Works
In [3], an author gender identification technique was proposed that achieved an accuracy of 85.1%. A new one-class ensemble scheme, which uses meta-learning to combine one-class classifiers, is put forward in [6]. Blanzieri and Bryl [7] discussed various machine learning applications for email spam filtering. Menahem et al. [8] implemented a new sender reputation mechanism based on an aggregated historical dataset. In [9], the authors designed a fusion algorithm based on online learners and experimented on TREC (Text REtrieval Conference) and other datasets. A comprehensive review of machine learning approaches to spam filtering is given in [10]. Drucker et al. [11] investigated the applicability of Support Vector Machines (SVMs) in classifying email as spam or legitimate mail. A three-layer Backpropagation Neural Network (BPNN) technique implemented on the PU1 and Ling datasets resulted in 97 and 99% classification accuracy with low execution time [12]. A three-way decision approach (accept, reject, or further examine) is discussed in [13], where experiments on the SpamBase dataset resulted in a reduced misclassification rate. Wu [14] utilized spamming behaviors with a backpropagation neural network, employed on datasets from Hopkins, Reeber, etc., to achieve improved performance (FPR = 0.0063).
3 Proposed Methodology
The email spam detection process is carried out in different steps (refer Fig. 1) and evaluated on the Enron-spam dataset [15, 16]. The following subsections introduce the proposed approach.
3.1 Preprocess Dataset
The email body is extracted from each email in the dataset. The resulting collection of extracted email bodies is partitioned into train and test sets (60:40 ratio). Style markers are treated as features in our approach. There are 31 character, 38 word, 35 tag, and 3 structural features, and 10,280 Bag-of-Words features extracted from the mail body.
• Character-based features [3] include the total number of characters (C), the ratio of the total number of lowercase letters (a–z) to C, the ratio of the total number of uppercase characters to C, the fraction of digit characters to C, the fraction of white-space characters to C, the ratio of tab-space characters to C, and the fraction of special characters to C (25 special symbol features).
• Word-based features [3] consist of the total number of words (N), the average length per word, the ratio of total different words to N, the fraction of words longer than 6 characters to N, the ratio of short words (1–3 characters) to N, Guiraud's R, Herdan's C, Rubet's K, Maas's A, Dugast's U, the Janenkov and Neistoj measure, Sichel's S, Yule's K measure, Simpson's D measure, Hapax dislegomena, Hapax legomena, Honore's R measure, Entropy, and the ratio of the word-length frequency distribution to N.
Fig. 1 Framework for email spam filtering
• Function words [3] (or grammatical words, or tag-based features) are words that express grammatical relationships with other words within a sentence. Tags are extracted from the email text using NLTK (Natural Language Tool Kit) [17] in Python, and Part-of-Speech (POS) [18] tagging is done using the Penn Treebank [5] tag set.
• Structural features [3] represent the way an author organizes the layout of a
message. The main features are total number of lines, total number of sentences
(S), and average number of words per sentence.
• In Bag-of-Words, all sentences in each email body are tokenized into a set of words, and the frequency of every term is counted within each file (called the term frequency).
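As a concrete illustration, a few of the style markers above can be computed with plain Python. This is a simplified sketch: the function name, the simple regex tokenizer, and the sample text are ours, only a small subset of the attributes is shown, and the POS-tag features would additionally require NLTK.

```python
import re
from collections import Counter

def style_features(body):
    """Compute an illustrative subset of the character-based, word-based,
    structural, and Bag-of-Words style markers for one email body."""
    C = len(body)                                    # total number of characters
    words = re.findall(r"[A-Za-z']+", body)          # crude word tokenizer
    N = len(words)                                   # total number of words
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    feats = {
        "lower_ratio": sum(c.islower() for c in body) / C if C else 0.0,
        "upper_ratio": sum(c.isupper() for c in body) / C if C else 0.0,
        "digit_ratio": sum(c.isdigit() for c in body) / C if C else 0.0,
        "avg_word_len": sum(map(len, words)) / N if N else 0.0,
        "short_word_ratio": sum(1 for w in words if len(w) <= 3) / N if N else 0.0,
        "total_lines": body.count("\n") + 1,         # structural
        "total_sentences": len(sentences),           # structural
    }
    # Bag-of-Words: per-file term frequency
    feats["tf"] = Counter(w.lower() for w in words)
    return feats

f = style_features("Buy CHEAP meds now!\nLimited offer.")
```

In the full approach each email would yield one such feature vector, and the per-category vectors are then handled separately before being combined into the composite model.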
3.2 Application of Feature Selection Method
Feature selection determines optimal attributes from a huge attribute space without changing the physical meaning of the attributes. The main benefits of dimensionality reduction (or feature selection) are (a) elimination of redundant features, (b) reduction in noise, thereby increasing the accuracy of classifiers, (c) reduction of the time complexity of classification, and (d) minimization of over-fitting of the training data.
A weighting method called TF–IDF–CF [19] is applied in our proposed approach. This method builds on TF–IDF (Term Frequency–Inverse Document Frequency), under which a term that appears in more documents is considered less important and receives a lower weight. A new attribute, called class frequency, is introduced to assess the frequency of each term in every document within a specific class. The general form of TF–IDF–CF is shown in Eq. (1).
a_ij = log(tf_ij + 1.0) × log((N + 1.0) / n_j) × (nc_ij / Nc_i)    (1)
In Eq. (1), tf_ij denotes the term frequency of term j in document i, N is the total number of instances in the dataset, and n_j is the number of documents in which term j occurs. The term nc_ij represents the number of files within the same class c to which document i belongs and in which term j appears, and Nc_i gives the total count of documents within the same class c to which document i belongs. The algorithm for extracting significant words is given below.
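A minimal sketch of the TF–IDF–CF weighting of Eq. (1), assuming the input is a list of tokenized documents with a class label per document; the function name and toy corpus are ours, not the paper's.

```python
import math
from collections import Counter

def tf_idf_cf(docs, labels):
    """Weight each (document, term) pair by Eq. (1):
    a_ij = log(tf_ij + 1) * log((N + 1) / n_j) * (nc_ij / Nc_i)."""
    N = len(docs)
    df = Counter()                       # n_j: documents containing term j
    for d in docs:
        df.update(set(d))
    class_size = Counter(labels)         # Nc_i: documents in each class
    class_df = Counter()                 # nc_ij: docs of class c containing term j
    for d, c in zip(docs, labels):
        for t in set(d):
            class_df[(c, t)] += 1
    weights = []
    for d, c in zip(docs, labels):
        tf = Counter(d)                  # tf_ij: raw term frequency in document i
        weights.append({
            t: math.log(tf[t] + 1.0)
               * math.log((N + 1.0) / df[t])
               * (class_df[(c, t)] / class_size[c])
            for t in tf
        })
    return weights

docs = [["spam", "offer"], ["spam", "meeting"], ["meeting", "agenda"]]
labels = ["s", "s", "h"]
weights = tf_idf_cf(docs, labels)
```

Note how the class-frequency factor separates the two occurrences of "meeting": it is down-weighted in the spam-class document (only half of that class contains it) relative to terms shared by every document of a class.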
3.3 Generation of Classication Models and Prediction
Feature selection produces a reduced feature vector table (FVT), which is taken as the input for training the classifiers. Multinomial Naïve Bayes (MNB) and support vector machines are used as classifiers in this investigation. Individual training models are created for each category of features during the training phase. The model with the highest F-measure is chosen for prediction. Finally, the optimal models obtained from each category of features are aggregated to develop a composite feature space used for building the spam and ham model, which is subsequently used for prediction.
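The paper itself trains its MNB and SVM models through WEKA and LibSVM; purely as a self-contained illustration of the Multinomial Naïve Bayes side of this step, the following is a minimal pure-Python MNB with Laplace smoothing (the class name and the toy token lists are ours).

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace smoothing
    over token-count features (illustrative, not the WEKA implementation)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for d, c in zip(docs, labels):
            self.counts[c].update(d)
        self.vocab = set().union(*(self.counts[c] for c in self.classes))
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def score(c):
            # log prior + sum of Laplace-smoothed log likelihoods
            return self.log_prior[c] + sum(
                math.log((self.counts[c][t] + 1) / (self.total[c] + V))
                for t in doc)
        return max(self.classes, key=score)

clf = MultinomialNB().fit(
    [["cheap", "meds", "offer"], ["cheap", "pill"],
     ["meeting", "agenda"], ["project", "meeting"]],
    ["spam", "spam", "ham", "ham"])
```

In the proposed pipeline, the rows of the reduced FVT would play the role of the token counts here, one model per feature category.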
4 Experimental Setup and Results
The experiment was performed on the Ubuntu 14.04 platform with the support of an Intel Core i7 and 8 GB RAM. In this work, 12,045 ham and 4496 spam emails have been chosen from the Enron-spam dataset. The classification models are generated by LibSVM (kernels k0 (Linear), k1 (Polynomial), k2 (Radial), and k3 (Sigmoid)) and Multinomial Naïve Bayes (MNB) in WEKA [20]. When a ham is misclassified as spam, a false positive (FP) occurs. If ham data is predicted as ham, it is known as a true negative (TN), whereas if spam is correctly classified as spam, it is a true positive (TP). When a spam is wrongly taken as ham, it is considered a false negative (FN) [21, 22]. In this analysis, the F-measure (also called the F1-score) and
FPR are used as the significant evaluation parameters. The F1-score can be interpreted as a weighted average of the precision and recall, and it ranges from 0 to 1. Precision (P) is a measure of the accuracy given that a specific class has been predicted. Recall (R) measures the proportion of actual positives that are correctly identified as such.
F-measure = 2PR / (P + R)    (2)
P = TP / (TP + FP)    (3)
R = TP / (TP + FN)    (4)
FPR = FP / (FP + TN)    (5)
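Equations (2)–(5) can be computed directly from the four confusion counts; the counts in the example call below are illustrative values of ours, not the paper's actual confusion matrix.

```python
def evaluate(tp, fp, tn, fn):
    """Precision, recall, F-measure, and FPR from confusion counts (Eqs. 2-5)."""
    p = tp / (tp + fp)                      # Eq. (3)
    r = tp / (tp + fn)                      # Eq. (4)
    return {
        "precision": p,
        "recall": r,
        "f_measure": 2 * p * r / (p + r),   # Eq. (2)
        "fpr": fp / (fp + tn),              # Eq. (5)
    }

metrics = evaluate(tp=4480, fp=5, tn=12040, fn=16)
```

With spam as the positive class, FPR directly measures the fraction of ham wrongly flagged as spam, which is why it is reported alongside the F-measure throughout this study.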
Figures 2, 3, and 4 depict the F-measure, FPR, and time parameters, respectively, obtained for six feature sets, represented as C, W, T, S, B, and E (refer Fig. 1), with five different classifiers (LibSVM-k0, k1, k2, k3, and MNB). The highest F-measure, of value 0.9983, is produced by Bag-of-Words (B) (refer Fig. 2) with a small FPR of 0.0013 by linear SVM for a feature length of 10,153 (refer Fig. 3), whereas the composite model achieves a very close F-measure of 0.9935 (refer Fig. 2) with the lowest FPR of 0.0004 (see Fig. 3) in comparison with all other models under linear SVM classification. From Fig. 4, it is clear that MNB takes less execution time but produces insignificant performance for various style markers. The top ten features from the character-based, word-based, tag-based, and bag-of-words sets, and all three variables from the structural attribute set, are given in Table 1. These attributes are found to have higher variance across the target classes, thereby standing as representative features of the writing styles adopted by email spammers [23, 24].
Fig. 2 F-measure versus classifiers
Fig. 3 FPR versus classifiers
Fig. 4 Time versus classifiers
5 Inferences
It has been observed that the SVM classifier performs well with a large number of features, but it is computationally expensive. Performance is higher when stop words are removed from the text before model construction. Independent style markers with a small number of features produced insignificant results in terms of F-measure, which is clearly visible for the character-based, word-based, and structural features. Hence, these attributes are not sufficient to prepare an efficient spam filtering model independently. This is due to the absence of attributes having high correlation with the target class. As the features in the feature space increase, the performance also improves, since relevant attributes contributing to effective classification appear as candidates in the optimal feature space. This is why bag-of-words produced the highest F-measure with a larger feature space of size 10,153. Tag-based attributes and bag-of-words played an important role in the generation of the composite model, as they could produce a low FPR value and an appreciable F-measure in their independent model generation. Therefore, an ensemble of different sets of style markers resulted in better performance. Finally, it has been found from the investigation that the composite feature space can greatly reduce the misclassification of ham messages as spam. This is proved by the lowest FPR value (0.0004) produced by the composite model along with a high F-measure (0.9935), compared to all independent models generated in our study. This makes our proposed meta-feature model a good detector system for real-time applications.

Table 1 Top features from each attribute category

Character  Word                  Tag                     Structural                            BoW
|          Words with length 20  Foreign word            Total number of lines                 Php
%          Words with length 18  -None-                  Total number of sentences             Sex
_          Words with length 17  Adjective, superlative  Average number of words per sentence  Meds
}          Words with length 19  Noun, proper singular                                         Medications
=          Words with length 16  Adjective, comparative                                        Pill
+          Words with length 15  Adverb, comparative                                           Macromedia
$          Words with length 13  Adverb, superlative                                           Dose
{          Words with length 14  Pronoun, possessive                                          Mai
`          Words with length 12  Predeterminer                                                 Tongue
*          Honore's R            Adverb                                                        Wi
6 Conclusion
In this investigation of email spam classification, TF–IDF–CF is the dimensionality reduction method applied on various style markers, such as character-based, word-based, tag-based, structural, and bag-of-words features, to choose relevant features from the large feature space. A vector space model for the relevant features was constructed and given to the classifiers through the WEKA tool. The composite feature space generated an effective model for email spam classification, producing a very high F-measure (0.9935) and a very small FPR (0.0004) with the linear support vector machine, outperforming the independent models. The study can be extended in the future to find out whether male or female authors are more involved in email spam generation.
References
1. Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497
2. Wang D, Irani D, Pu C (2013) A study on evolution of email spam over fifteen years. In: Bertino E, Georgakopoulos D, Srivatsa M, Nepal S, Vinciarelli A (eds) CollaborateCom, pp 1–10. ICST/IEEE
3. Cheng N, Chandramouli R, Subbalakshmi KP (2011) Author gender identification from text. Digital Invest 8(1):78–88
4. Manning CD, Schutze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge
5. Bird S, Klein E, Loper E (2009) Natural language processing with Python. O'Reilly Media, Inc.
6. Menahem E, Rokach L, Elovici Y (2013) Combining one-class classifiers via meta learning. In: Proceedings of the 22nd ACM international conference on information and knowledge management. ACM
7. Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92
8. Menahem E, Pusiz R, Elovici Y (2012) Detecting spammers via aggregated historical data set. In: Network and system security. Springer, Berlin, pp 248–262
9. Xu C, Su B, Cheng Y, Pan W, Chen L (2014) An adaptive fusion algorithm for spam detection. IEEE Intell Syst 29(4):2–8
10. Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222
11. Drucker H, Wu S, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054
12. Ruan G, Tan Y (2010) A three-layer back-propagation neural network for spam detection using artificial immune concentration. Soft Comput 14(2):139–150
13. Zhou B, Yao Y, Luo J (2010) A three-way decision approach to email spam filtering. In: Farzindar A, Keselj V (eds) Canadian conference on AI. LNCS, vol 6085. Springer, pp 28–39
14. Wu C-H (2009) Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks. Expert Syst Appl 36(3):4321–4330
15. Bekkerman R (2004) Automatic categorization of email into folders: benchmark experiments on Enron and SRI corpora
16. The Enron-Spam Datasets. [Link]
17. Natural Language Tool Kit (NLTK). [Link]
18. POS tagging. [Link] pos-tagger
19. Liu M, Yang J (2012) An improvement of TFIDF weighting in text categorization. In: International proceedings of computer science and information technology, pp 44–47
20. WEKA-Data Mining Software in Java. [Link]
21. Han J, Kamber M (2005) Data mining: concepts and techniques. Kaufmann, San Francisco
22. Kibriya AM, Frank E, Pfahringer B, Holmes G (2004) Multinomial Naïve Bayes for text categorization revisited. In: Webb GI, Yu X (eds) Australian conference on artificial intelligence. LNCS, vol 3339. Springer, Berlin, pp 488–499
23. Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with Naïve Bayes - which Naïve Bayes? In: CEAS, pp 27–28
24. Bird S (2006) NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on interactive presentation sessions. Association for Computational Linguistics