Sec v2 n23 2009 8 PDF

This document discusses using business intelligence methods to analyze malware log data for security information and event management (SIEM). Enterprises generate large amounts of security data from different systems that must be analyzed to monitor threats and improve security measures. The paper proposes applying business intelligence techniques like data mining and online analytical processing to malware logs to help security managers answer questions about current threats and how to optimize security measures. Examples from a project with a large multinational company are provided.

International Journal on Advances in Security, vol 2 no 2&3, year 2009, http://www.iariajournals.org/security/

Business Intelligence Based Malware Log Data Analysis as an Instrument for Security Information and Event Management

Tobias Hoppe, Chair of Business Informatics, Ruhr-University of Bochum, Bochum, Germany, [email protected]
Alexander Pastwa, Steria Mummert Consulting, Dusseldorf, Germany, alexander.pastwa@steria-mummert.de
Sebastian Sowa, Institute for E-Business Security, Ruhr-University of Bochum, Bochum, Germany, [email protected]

Abstract: Enterprises face various risks when trying to achieve their primary goals. In regard to the information infrastructure of an enterprise, this leads to the necessity to implement an integrated set of measures which should protect the information and information technological assets effectively and efficiently. Furthermore, tools are needed for assessing risks and the performance of measures in order to guarantee a continuous effort to protect the enterprise's assets. These tools have to support the handling and the analysis of the vast amount of security relevant data generated within the enterprise information infrastructure. Both tasks are typical for security information and event management. In this context, the current paper introduces an approach for malware log data analysis by using business intelligence methods. Examples are given which are derived from the results of a project conducted with a world-wide operating enterprise.

Business Intelligence; Data Mining; Malware; Online Analytical Processing; Security Information and Event Management

I. INTRODUCTION

In general business management research as well as in the field of business informatics, it is a well-known fact that the effective and efficient processing of information constitutes one of the most important drivers for the success of an enterprise [DBKDA 2009, 1; 2]. For this purpose, adequate information systems are used. The organization's functions and processes highly depend on information and on those information systems which semi- or fully automatically support information processing [3].

Considering that even a temporary unavailability of essential information systems may lead to existential dangers, special attention must be paid to measures which ensure that all devices and applications of the information infrastructure that are necessary for the information processing activities can be used. Furthermore, breaches of the confidentiality, integrity, and non-repudiability of information assets or information processing technologies may cause perceptible impairments or even existential crises [4].

The protection of these security objectives therefore is one of the central goals of information management, which generally aims to support the executives with an optimally designed and run information infrastructure. Tasks and responsibilities focusing on the achievement of the aforementioned security objectives are attributed to the subdivision or sub-function of information security management.

An integrated bundle of measures (containing organizational, technical, logical as well as physical measures) is needed for the realization of the defined security objectives [5; 6]. Here, information security management includes the steering and controlling of measures as well as their initial planning. This process must be seen as a continuous operation to guarantee a sustainable realization of the desired level of protection [7; 8]. In this context, information again plays a very important role: it forms the basis for any possible modification of the measures aiming to hold or improve the level of protection, which is defined by the executives on the basis of an analysis of threats and economic impacts.

As a subdivision or sub-function of the information security management of an enterprise, the security information and event management (SIEM) discussed in this paper typically uses a wide range of information from various elements of the information security architecture. The information security architecture is defined as the part of the information infrastructure which contains all components to enforce the defined information security objectives. Furthermore, these components can be used for the management and re-engineering of the relevant security concepts. Against this background, the architectural elements comprise all access controls, operating system cores, firewalls and further measures to guarantee safe communication, for instance [9].

The amount of data generated by the elements of the information security architecture is as comprehensive as the number of these elements. As a consequence, the task of data evaluation is complex and time consuming. Therefore, a critical success factor for executives of SIEM has to be seen in the quality and not
the quantity of data relevant for the decisions about the conceivable modifications of security measures.

Due to the amount and complexity of data that have to be analyzed, questions about adequate tools, methods and models to support the analysis process arise. Here, one of the most successfully applied approaches in the business management context is business intelligence (BI). This paper shows how BI can be used to answer two questions which are relevant for SIEM: 1. How do malware-causing attributes relate to each other? 2. How does malware spread in the IT landscape and how long does it reside in the system? For these purposes, known malware which occurred within a certain timeframe will be analyzed.

After dealing with the theoretical background of SIEM in Chapter II, Chapter III introduces the concept of business intelligence. Chapter IV shows how Online Analytical Processing (OLAP) can be applied for SIEM. Chapter V then addresses the research objectives of this paper from the perspective of data mining, whereas Chapter VI presents its results. Chapter VII gives a brief conclusion and, finally, Chapter VIII outlines future work.

II. THEORETICAL BACKGROUND – SIEM

Before presenting how BI, in particular OLAP and data mining, may support the goals of SIEM, the following paragraphs characterize specific problems of data analysis as well as the requirements for designing a BI system. In the first step, terms and definitions which are relevant for the overall conceptual context are introduced.

A. Relevant Terms and Definitions

Information, as the first relevant term used in the discussion of information security management topics, can linguistically be derived from the Latin informatio. In turn, informatio stands for the explanation or interpretation of ideas, and it can also be used in the meaning of education, training or instruction. This gives a first indication of an accurate and precise definition: Information in this paper is defined as an explanatory, significant assertion that is part of the overall knowledge; it is seen as specific, technically or non-technically processed data interpreted by human beings [10; 11].

The definition of information just given is in line with the ISO/IEC standards, which explain that information “can exist in many forms. It can be printed or written on paper, stored electronically, transmitted by post or by using electronic means, shown on films, or spoken in conversation” [7; 8]. This mostly trivial way to use the term information unfortunately does not reflect the common understanding in the information security community. There, it is quite often assumed to only affect electronic data, and thereby information security management mostly has to deal with IT. Although this paper focuses on data gathered from technological elements, it is stressed that this only covers one aspect of the entire tasks of information security management executives.

As a consequence of this understanding of information, information security also has to cover technical as well as non-technical challenges. In this context, the ISO explains that whatever “form the information takes, or means by which it is shared or stored, it should always be appropriately protected. Information security is the protection of information from a wide range of threats in order to ensure business continuity, minimize business risk, and maximize return on investments and business opportunities” [7; 8].

The term SIEM combines security information management and security event management. In both areas, the focus lies on the collection and analysis of security relevant data in information infrastructures or security infrastructures, respectively. Security event management emphasizes the aggregation of data into a manageable amount of information in order to deal with events and incidents immediately (for example, in a timely fashion).

In contrast to security event management, security information management primarily focuses on the analysis of historical data, aiming to improve the long-term effectiveness and efficiency of the information security infrastructure [12].

Figure 1. Conceptual Architecture of SIEM (Security Information Management and Security Event Management together form SIEM; shared functions: collect, store, correlate and analyze security log data)

As shown in Figure 1, SIEM then stands for the amalgamation of security information management and security event management into an integrated process of planning, steering, and controlling security relevant information on the basis of the data collected from the information security architecture. Carr states: “Security information and event management (SIEM) systems help to gather, store, correlate and analyze security log data from many different information systems” [13].

B. Selected Challenges SIEM is facing

The analysis of security relevant data collected from the information security architecture is a challenging task because of the following reasons:
• Amount of data
• Heterogeneity of data formats
• Heterogeneity of the data contents
• Limited personnel and budget

As a consequence of the various information security architecture elements and the number of protocols, the amount of data gathered is massive. Thus, considerable manual effort is needed to gather relevant information about security threats. Furthermore, the data collected exist in various formats, making evaluation difficult and time-consuming. The heterogeneity of the data contents also impedes a simple and flexible analysis. Depending on the system and the action performed, the data may contain information about incidents or threats due to email or internet use, for example. In addition, data may be recorded because specific ports are used by gateways and firewalls, for instance. Therefore, the possibility of manually analyzing data derived from the information security architecture elements is severely limited due to the sheer volume of data as well as the heterogeneity of data formats and contents.

Two further aspects must be considered. Typically, information security management divisions have only a small number of personnel, and the budget is also limited. As in other entities of an enterprise, the resources spent for SIEM have to be managed economically. Thus, SIEM faces the same requirements as the other organizational units of the enterprise. The executives have to allocate resources in such a way that the specific entity contributes to the enterprise's goals as much as possible [14]. To sum up, the following aspects are identified as the primary requirements for SIEM:

• Extraction of information and knowledge
• Establishment of an integrated and continuous management process
• Effective and efficient data evaluation
• Support for network management
• Support for compliance management

By identifying relevant information and deducing knowledge from the existing volume of data, SIEM strives to guarantee the protection of information and information system values. To achieve this goal, it is necessary to conduct SIEM as an integrated, continuous management process. In turn, this process is dependent on the information relevant to the decision makers. This information again is extracted from the data pool. Against the background of the limitations of data evaluation described above, it is crucial to establish appropriate (that is, highly effective and efficient) practices and mechanisms to support the data processing for the needs of the SIEM executives.

As a consequence of the numerous elements installed in the enterprise information security architectures, the number of protocols as well as the amount of data generated is enormous. Depending on the system and the action performed, log data may contain information about incidents or threats (due to email or internet use, for example). In addition, the data relevant for security information and event management (SIEM) may be recorded because specific ports were used by gateways and firewalls, for instance [15].

III. THE CONCEPT OF BUSINESS INTELLIGENCE FOR SUPPORTING SIEM

After describing the challenges of SIEM, the current chapter introduces the concept of business intelligence (BI).

Business intelligence stands for a conceptual framework which bundles numerous approaches, tools and applications used for the analysis of business relevant data [16]. The general aim of BI is to support effective and efficient business decision making, for which purpose a data warehouse is built. Usually a data warehouse serves as the central storage system of a BI system. For implementing a BI application serving the goals of SIEM, a reference architecture has to be defined initially. Figure 2 shows the layers and elements of an architecture that serves as a basic guiding topology in this context.

Figure 2. BI Reference Architecture [17] (layers from top to bottom: Presentation and Analysis Layer for data usage; Storage and Processing Layer for data provision, with storage and ETL; Sources, including external data, OLTP systems and log data servers)

A. Data Sources

At the lowest level of the BI reference architecture, the enterprise's various operational systems as well as useful external data sources are located. They serve as data suppliers for the data warehouse, which is the integral part of the middle layer. The data primarily relevant for SIEM is gathered from the information security architecture elements which log information security relevant processes and incidents. This includes data about installed operating systems, versions of patches, installed anti-malware programs or information about the frequency of user password changes, for instance.
Potential threats can be identified by logging policy violations, malware reports, login/logout events and account lockouts of users. This data is transported to log servers providing the input data for the data warehouse.

B. Storage and Processing Layer

One goal of a BI application is the consolidation of different data contents and formats towards a uniform perspective. For this task, an ETL (extraction, transformation, and load) component is combined with solutions for storing and preparing the data for later presentation and analysis [17]. This component constitutes a further module of the BI architecture and serves as the interface between the operational systems and the data warehouse [18]. It transfers the heterogeneous data into a consistent and multidimensional data perspective and loads the data into the data warehouse; in detail:

Extraction: Extraction deals with the selection and provision of source data. Since relevant data typically exist in a very heterogeneous form, the ETL tool needs to access all data from the operational systems containing the security relevant log data.

Transformation: Transforming the source data into the target formats of the data warehouse is the central task of the ETL process. It can be further divided into the steps of filtering, harmonization, aggregation and enrichment. Filtering ensures that only the data necessary for the multidimensional analysis is loaded into the data warehouse. Log files usually contain much information not needed for analysis. For example, Windows event logs record a multitude of application and system information, but for the purposes of SIEM, only information security events are needed. Next, harmonization corrects syntactic and semantic defects in the data. Codes, synonyms and homonyms are also adjusted, and different definitions of terms are unified. For example, the same person could have been assigned different user names in a Windows environment and in a UNIX or Linux environment. In the multidimensional database, however, this user must be clearly identifiable. In a further step of transformation, the consistent data, which still exists at the lowest level of granularity, is aggregated to improve analysis performance. Here, the aggregation of hosts to organizational units or geographical locations is a possibility. Enhancing the data by adding contextual information represents the last and very important step of the transformation process, because the knowledge generated as a result makes it possible to systematically substantiate decision making processes on a broader base.

Load: Finally, the extracted and transformed data is loaded into the data warehouse, where it is permanently stored. For this purpose, batches are used. In order to ensure the adequate supply of information in regard to timeliness and quality, the question has to be answered how long the interval between the single batches should be. Thus, depending on the amount of data as well as on the information and communication technologies in use and the information needed by the decision makers, the data is transferred flexibly from the source systems into the data warehouse.

C. Presentation and Analysis Layer

The top layer of the BI reference architecture comprises all methods and tools which are capable of analyzing the multidimensional data and of presenting analytical reports. Among the different possibilities in this context, OLAP and data mining methods play an especially prominent role:

Online Analytical Processing: OLAP is a software technology. It allows decision makers to run fast, interactive and flexible queries on the relevant data stored in a multidimensional structure [19].

Data Mining: While OLAP focuses mainly on historical analysis, data mining is concerned with a prospective analysis. By applying various statistical and mathematical methods, data miners aim to identify previously unknown data patterns [20].

OLAP and data mining increase the prospect of analyzing security relevant data efficiently for the short-term treatment (e.g., of malware threats) as well as for the long-term improvement of the overall information security architecture. Especially in regard to the SIEM challenges, BI offers the chance to handle the accrued amount of data and to transfer the heterogeneous data into a consistent format that can be used for analyses and reports on SIEM relevant topics.

IV. APPLYING ONLINE ANALYTICAL PROCESSING FOR SIEM

Up to now, challenges of SIEM and characteristics of BI have been described. The following chapters focus on the combination of these fields, presenting an OLAP application for SIEM.

A. Multidimensional Data Model

Modeling an adequate multidimensional data structure is one of the crucial factors of success when designing a BI application. It forms the basis for the execution of the ETL process with which relevant data is loaded from the operational systems into the data warehouse. The resulting data construct can then be analyzed by typical OLAP operations: slice, dice, drill down, and roll up. By using these operations, diverse occurrences from different perspectives can be determined and evaluated, like the frequency of malware infections within a certain period on a certain operating system, for instance. Figure 3 visualizes the arrangement of the dimensions mentioned above in the structure of a so-called data cube [21].

Multidimensional data models consist of a fact table and further tables which serve to depict the so-called dimensions. Dimensions stand for the relevant entities
with which the metrics of the fact table can be analyzed [22]. Hence, dimensions are used to provide additional perspectives on a given fact [23]. In order to ensure the data quality, it is of vital importance to follow a systematic and holistic approach when defining the dimensions and selecting the facts.

Figure 3. Example of a data cube for OLAP analyses (dimensions: Viruses, Operating Systems (OS 1 to OS 3) and Months; cell values: occurrences; visible face: Virus 1: 11, 0, 15; Virus 2: 0, 20, 14; Virus 3: 9, 5, 0)

The content presented in this paper refers to the key findings resulting from a cooperative project between a university and a leading industrial institution. The goal was to develop a solution for a more sophisticated analysis of information security relevant data. The industrial institution uses a combination of several security systems. The generated log data is stored in a centralized relational database. Among others, the main sources of the log data of interest are anti-malware solutions.

Figure 4 illustrates the business objectives of the business intelligence project and the way log data contributes to them.

Figure 4. Aggregation of Log Events (log events are aggregated into incidents, which serve the two analysis goals: Relationship Analysis of Attributes Affecting Malware, and Malware Permanence and Propagation Analysis)

A log event is a specific, single event created by some log source and stored in the database. An example of a log event is the finding of a malware program. Log events are very numerous and hard to analyze, so they are aggregated into incidents. An incident thus covers one or more log events which belong together.

In order to aggregate malware logs, two cases must be considered:

(1) A malware which is detected at t1 reappears on the same computer at t2 and thus generates a new log file. In this case, the reappearance of the malware at t2 is treated as a new incident if the malware has been deleted successfully at t1 and the subsequent scan has not revealed a persistence of the malware. In addition, the malware events must have occurred on the same computer and must be caused by the same user.

(2) Furthermore, each log event indicating that a new malware has been detected on a computer becomes part of a new malware incident.

Figure 5 gives an overview of the input log data made available for the case study. Only known malware was in the focus of the analysis.

Figure 5. Overview of Input Data (context data: hosts, users; log event data: malware events, Unix events, Windows events)

The data set can be separated into actual log data and context data. The actual log data is divided into three types. First, there is log data originating from the Windows operating systems. Second, similar data was made available for UNIX hosts. The most interesting log data with respect to this paper is the malware log data. The malware event records contain information about the time, location, and type of malware found on a system.

The context data consists of records representing the computers (hosts) and the users of the enterprise's IT systems. These records offer data in several dimensions such as geographic and demographic information. The user records include fields containing information like the user's age and gender as well as his or her organizational status within the company. The host records include fields containing the computer's current status and the operating system running on it as well as information about the patch status of the operating system.

The resulting multidimensional data model, presented in Figure 6, illustrates the relations between the relevant dimensions containing different levels of hierarchy and the measures (facts).
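The two aggregation cases for malware log events described in this section can be illustrated with a minimal Python sketch. It is not the project's implementation: the record fields (`timestamp`, `host`, `user`, `malware`, `removed`) and the sample events are hypothetical, and the condition that a subsequent scan shows no persistence is simplified to a single `removed` flag on the event that reports a successful deletion.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogEvent:
    timestamp: datetime
    host: str
    user: str
    malware: str
    removed: bool  # assumption: True if the scanner reported a successful deletion


@dataclass
class Incident:
    host: str
    user: str
    malware: str
    events: list = field(default_factory=list)


def aggregate_incidents(events):
    """Group malware log events into incidents (simplified reading of cases 1 and 2)."""
    incidents = []
    open_incidents = {}  # (host, user, malware) -> incident that is still ongoing
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.host, event.user, event.malware)
        incident = open_incidents.get(key)
        if incident is None:
            # Case 2: a malware not currently tracked on this host opens a new incident.
            incident = Incident(event.host, event.user, event.malware)
            incidents.append(incident)
            open_incidents[key] = incident
        incident.events.append(event)
        if event.removed:
            # Case 1: after a successful deletion, a reappearance counts as a new incident.
            del open_incidents[key]
    return incidents


# Hypothetical sample data, for illustration only.
sample_events = [
    LogEvent(datetime(2008, 1, 1, 9), "host-1", "user-a", "Malware A", False),
    LogEvent(datetime(2008, 1, 1, 10), "host-1", "user-a", "Malware A", True),
    LogEvent(datetime(2008, 1, 2, 8), "host-1", "user-a", "Malware B", True),
    LogEvent(datetime(2008, 1, 5, 9), "host-1", "user-a", "Malware A", False),
]
incidents = aggregate_incidents(sample_events)
for inc in incidents:
    print(inc.malware, inc.host, len(inc.events))
```

In this sketch the first two events for "Malware A" form one incident, and its reappearance after the successful deletion opens a second one. A metric such as Malware Incident Count would then count the resulting incidents, while Malware Event Count would count the underlying events.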
Figure 6. Multidimensional Data Model for Malware Analysis

The metrics Malware Event Count and Malware Incident Count can be analyzed according to these dimensions in any combination.

The User Dimensions include demographic information about users (e.g., gender, age category) who caused malware events as well as their admin status on the host where the malware was found. Additionally, their geographic location is tracked by the Location dimension, which consists of the hierarchy levels Country, City and Organizational Unit.

Information about the host computers on which malware events were found is provided by the Host Dimensions. The host Model is a description of the hardware. The host Status provides information about the host's current status in regard to anti-malware logging as well as information about the patch status of the operating systems.

The Malware Dimensions provide information about the Type of malware found and its Threat Level, which is either “high” or “low” by previous definition. For example, cookies, adware, and joke programs are classified as low risks, while malware such as viruses, trojans, and key loggers represents high risks. The Malware Source indicates the location of the malware, e.g., “local hard drive” or “internet browser files”. Since anti-malware programs scan on a regular basis as well as on file access, the corresponding scan types are the elements of the dimension Scan Type. The countermeasures which are taken by the anti-malware software constitute another dimension (Counter Measure).

This multidimensionally processed data also serves as the data basis for the upcoming data mining process.

B. Prototyping an OLAP System for SIEM

Dashboards are usually used to visualize different, distributed information in a concentrated and integrated form. Relevant information is qualified in order to present large quantities of information to the decision makers more clearly. As a consequence, dashboards enable organizations to measure, monitor, and manage business objectives more effectively [24]. In the case of SIEM, security dashboards are deployed in order to visualize security relevant data.

The dashboard illustrated in Figure 7 is currently set up to enable analysis of malware permanence and propagation. Here, the four reports merely provide descriptions of the data, indicating irregularities. Thus, they provide the starting point for a more accurate analysis, which is only possible within the individual organizational context. Since the original results of the data analysis are not allowed to be published due to confidentiality requirements, it has to be stressed that the following findings are based on generated random data. Nevertheless, the results convey an impression of the possible outcomes of such an analysis.

Report no. 1 depicts the top five malware programs measured by the number of affected hosts, the number of affected users, and the duration of the malware in the institution's IT systems. The malware “JS/Downloader-AUD” stands out, infecting 664 hosts and 431 users. It was present on at least one host on 322 days, that is, on most days (about 88%) of the given time frame of one year. This result implies that this particular malware either remains on the system or returns frequently.

A specific top five list of malware infections is helpful to identify particularly pertinent malware and thus is a valuable tool for risk management. The types of malware visualized in the diagram can be filtered, while the time period of the collected data can be adapted to one's needs. The variability of such dimensions is a main feature of multidimensional OLAP analysis.

Report no. 2 illustrates the long-term development of the number of hosts and users infected with malware. Once countermeasures have been applied, this diagram can be used to monitor their effects. Scaling from quarters to months or even days, the diagram can also serve for medium- to short-term controlling tasks and is thus another useful tool for risk management.

Reports no. 3 and 4 give details about the most frequent malware, in this case “JS/Downloader-AUD”. The left diagram represents the success of malware elimination over time; the right one shows the presence of the malware in the IT systems over time. In this chart, strong fluctuations can be recognized. Even after the malware has been deleted successfully, it seems to re-emerge quickly. Further investigations concerning this malware should be carried out.

During the project, several more dashboards were developed to enable users to analyze malware findings in regard to geographical aspects, for instance. The associated reports are represented as color-coded maps in which significant occurrences of malware infection can be recognized rapidly. Furthermore, occurrences can be examined in detail by drilling down. With this opportunity, enterprises are able to identify locations which particularly contribute to the spreading of malware. Thus, it can be derived in which organizational units security measures have to be improved immediately. Another dashboard visualizes user groups which cause various malware, by demographic characteristics. In this way, various age groups and/or gender-specific classes can be identified that correlate with increased malware infection. This information could be utilized to design specifically targeted awareness measures aiming to significantly reduce malware infections amongst the users, and for other purposes.
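A report of the kind described as Report no. 1 (top malware by affected hosts, affected users and days of presence) can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout (`malware`, `host`, `user`, `day`, `country`), the function name `top_malware_report` and the sample records are assumptions and do not come from the project; the optional keyword filters stand in for an OLAP slice on one or more dimensions.

```python
from collections import defaultdict
from datetime import date


def top_malware_report(records, n=5, **filters):
    """Rank malware by distinct hosts, users and days; filters slice by dimension value."""
    hosts, users, days = defaultdict(set), defaultdict(set), defaultdict(set)
    for rec in records:
        # Slice: skip records that do not match every requested dimension value.
        if any(rec.get(dim) != value for dim, value in filters.items()):
            continue
        hosts[rec["malware"]].add(rec["host"])
        users[rec["malware"]].add(rec["user"])
        days[rec["malware"]].add(rec["day"])
    rows = [
        {
            "malware": name,
            "hosts": len(hosts[name]),
            "users": len(users[name]),
            "days_present": len(days[name]),
        }
        for name in hosts
    ]
    # Order by affected hosts, then users; the name breaks remaining ties.
    rows.sort(key=lambda r: (-r["hosts"], -r["users"], r["malware"]))
    return rows[:n]


# Hypothetical incident records, for illustration only.
sample_records = [
    {"malware": "Malware A", "host": "h1", "user": "u1", "day": date(2008, 1, 1), "country": "DE"},
    {"malware": "Malware A", "host": "h2", "user": "u2", "day": date(2008, 1, 2), "country": "DE"},
    {"malware": "Malware A", "host": "h1", "user": "u1", "day": date(2008, 1, 2), "country": "DE"},
    {"malware": "Malware B", "host": "h3", "user": "u3", "day": date(2008, 1, 1), "country": "FR"},
]
report = top_malware_report(sample_records, n=5)
report_fr = top_malware_report(sample_records, n=5, country="FR")
for row in report:
    print(row)
```

Restricting the report with `country="FR"` corresponds to slicing the cube on the Location dimension before ranking; in a real data warehouse, the same question would typically be expressed as a query against the fact and dimension tables rather than computed in application code.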
[Figure 7: Malware Security Dashboard with four panels: Top 5 Malware Programs; Malware Permanence and Propagation; Success of Malware Elimination – Detail View; Malware Permanence and Propagation – Detail View. The detail view is shown for JS/Downloader-AUD. The visible chart plots values from 0 to 120 per month from October 2007 to March 2008 (Quarter 4 2007 and Quarter 1 2008) for the series failed, successful, and undefined.]

Figure 7. Example of a Malware Security Dashboard

To sum up, all OLAP functions specified above can be used for detailed analysis in dashboards. Dashboards primarily give a general overview of the relevant measures, but they can also be designed to present important details. Additionally, using a reporting tool, many other OLAP reports can easily be generated by accessing the data warehouse. Here, dimensions can be combined flexibly in order to analyze measures from the perspectives of individual interest. As a consequence, OLAP enables a powerful descriptive analysis and effectively supports SIEM.

V. DATA MINING RESEARCH OBJECTIVES

The mere storage of security-relevant data does not by itself allow sensible conclusions to be drawn from the data in order to support SIEM. Data by itself is of little direct value since potential insights are buried within and are often very hard to uncover. As described above, OLAP and dashboards are one way to analyze and visualize data which is modeled multidimensionally and stored in a data warehouse. Data mining is another option. The concept of data mining provides specific algorithms for data analysis, like association analysis, clustering, or classification [25]. These algorithms originate from diverse research fields, such as statistics, pattern recognition, database engineering, and data visualization.

It has to be stressed that the application of data mining algorithms must be accompanied by preparatory as well as post-processing steps [25]. As Fayyad et al. point out, “blind application of data mining methods can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns” [20]. In order to conduct the necessary steps and to analyze the data efficiently and effectively, the Cross Industry Standard Process for Data Mining (CRISP-DM) was used [26]. CRISP-DM is an industry- and tool-neutral process model for data mining analysis which has been and still is applied successfully in several industry sectors.

In principle, every single log event is potentially interesting for further investigative analysis. Since most organizational IT networks are in some way connected to the Internet and are thus subject to attacks from outside, the most popular application of data mining on log data is concerned with intrusion detection [27; 28]. In addition, questions to be answered by analyzing the

log data could be why, where, when, and how long a malware incident happened and who was involved and responsible. In order to attain new and useful insights from the log data of interest, the following research objectives were identified.

A. Objective 1: Relationship Analysis of Attributes Affecting Malware Infection

One goal of applying data mining techniques is to identify interesting, unknown and relevant patterns in the data. Rules help to verbalize and quantify the patterns. The resulting set of rules can then be further analyzed by a human expert who decides how these rules will be used in the process of SIEM. Among the different methodologies which are used to extract rules from a given data set, the authors of this paper focused on association analysis. This method aims to discover interesting relationships between the attributes of a data set [29]. For this purpose, the two measures support and confidence are used. They indicate the interestingness of a relationship. Support quantifies how frequently a rule is applicable to a given data set, while confidence indicates how often items in B appear in transactions that contain A [29]. As depicted in Figure 8, the support of 2% means that on 2% of the whole set of hosts, Windows XP and a malware incident occurred together. The confidence of 10% conveys that malware incidents occurred on 10% of all Windows XP hosts.

[Figure 8 shows two diagrams. Support: relative to all hosts (100%), the overlap of Windows XP hosts and hosts with virus incidents amounts to 2%. Confidence: relative to all Windows XP hosts (100%), the hosts with virus incidents amount to 10%.]

Figure 8. Support and Confidence

Mathematically, support and confidence can be represented as in the following equations, where A is the antecedent and B the consequent of the rule:
 $\mathrm{Support}(A \Rightarrow B) = P(A \cup B)$;
 $\mathrm{Confidence}(A \Rightarrow B) = P(B \mid A)$.
Here, $P(A \cup B)$ denotes the probability that a record contains all items of both A and B.

Since many relationships may exist between the attributes causing malware incidents, the following research objective has been stated: “Given malware incidents with certain attributes, find associations between those attributes, and state them as rules satisfying a minimum confidence and support.”

B. Objective 2: Malware Permanence and Propagation Analysis

Another interesting question is how malware spreads in the IT landscape and how long it resides in the system. Such a profile may contain data about the number of computers and users affected by malware incidents and the duration the malware resides within the IT infrastructure. Thus, the second objective of mining the security-relevant data is to analyze malware permanence and propagation.

Here, the k-means algorithm was applicable in order to cluster malware incident records according to their similarity. Describing similarity is the main task of clustering algorithms. Similar records are put into the same cluster, whereas dissimilar records are allocated to different clusters. Thus, the second research objective was stated as “given a set of n malware incidents, group them by similarity into k clusters”.

VI. DATA MINING RESULTS

Since the results of the data mining analysis may not be published for confidentiality reasons, the following findings are also based on randomized data. Nevertheless, the results convey an impression of the possible outcomes of using data mining techniques for supporting SIEM.

A. Findings of the Relationship Analysis

Since the Apriori algorithm is appropriate for analyzing small or mid-size data sets, the authors decided to apply this algorithm to provide an answer for research objective 1 [29]. Table I depicts an extract of random data which served as input in this context.

TABLE I. OVERVIEW OF DATABASE EXTRACT

No.   User Age   User is Admin   Malware Risk
1.    IV         true            low
2.    V          false           low
3.    III        false           low
4.    II         true            high
5.    II         false           high
n.    …          …               …

Each row represents a virus incident with three attributes. The user ages are grouped into one of five classes, from “I” for the youngest employees to “V” for the eldest ones. In order to find out which attributes are associated with high malware risks (or low malware risks, respectively), the different types of malware had to be assessed prior to the analysis. This was done by adding a new attribute for malware risk to the data table. Thus, it was possible to assign each user a “low” or “high” malware risk. As was done for OLAP, cookies, adware, and joke programs were classified as low risk, while malware such as viruses, trojans, and key loggers was classified as high risk.

Since the data mining analysis focused on indicators affecting malware, only those item sets which contain the risk attribute were considered. In order to obtain significant rules, support and confidence factors, as shown in Table II, were calculated.
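How support and confidence are obtained for a single candidate rule can be sketched as follows. This is a minimal illustration, not the Apriori implementation used in the project: the function support_confidence, the attribute names age, admin, and risk, and the five records (modeled on the randomized extract in Table I) are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def matches(record: Record, condition: Record) -> bool:
    """True if the record has every attribute value listed in the condition."""
    return all(record.get(key) == value for key, value in condition.items())


def support_confidence(
    records: List[Record], antecedent: Record, consequent: Record
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_a = sum(1 for r in records if matches(r, antecedent))
    n_ab = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_ab / len(records) if records else 0.0  # share of all records with A and B
    confidence = n_ab / n_a if n_a else 0.0            # share of A-records that also have B
    return support, confidence


# Illustrative records only; they mirror the layout of Table I.
incidents = [
    {"age": "IV", "admin": "true", "risk": "low"},
    {"age": "V", "admin": "false", "risk": "low"},
    {"age": "III", "admin": "false", "risk": "low"},
    {"age": "II", "admin": "true", "risk": "high"},
    {"age": "II", "admin": "false", "risk": "high"},
]

support, confidence = support_confidence(
    incidents, {"age": "II"}, {"risk": "high"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule miner such as Apriori evaluates many candidate rules in this way and keeps only those that reach the chosen minimum support and minimum confidence.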

TABLE II. ASSOCIATION RULES

No.   High malware affection, if                                               Support %   Confidence %
1.    user age category = E and user gender = male                             9.5         82.7
2.    user age category = D and user gender = male and user is admin = false   5.3         75.6
…     …                                                                        …           …

No.   Low malware affection, if                                                Support %   Confidence %
      user is admin = true and user gender = female                            1.5         50.7
n.    user age category = E and user gender = female                           8.7         60.9

The Apriori algorithm made it possible to separate the rule set. Rules with a confidence of less than 70% and a support of below 5% were not taken into account. The upper part of the table displays the rules which lead to high malware affection. The lower part displays those rules with low malware affection, respectively. The support of rule 1, as shown in the table, allows the conclusion that in 9.5% of malware incidents the user’s age category is IV, the user’s gender is male, and the malware affection was high. The confidence of rule 1 indicates that in 82.7% of those malware incidents where the age category is IV and the user’s gender is male, the malware affection is high.

It was tempting to interpret the rules indicating low malware affection similarly. However, the analysis only included records which already represented at least one incident. The “low malware affection” incidents merely occurred on hosts with fewer malware incidents. Thus, the last two rules have to be interpreted with particular care, since they merely indicate a lower affection than rules 1 and 2, for instance, but not a complete absence of it.

B. Findings of the Malware Permanence and Propagation Analysis

The data mining that aimed to describe the permanence and propagation of malware incidents throughout the hosts of the enterprise was not as straightforward as the association analysis. The efforts put into this task are described now.

In order to narrow the analysis focus, measures for malware permanence and propagation were defined. The propagation of malware is described by the number of hosts and the number of users a specific malware has affected. The duration of a malware infection can serve as a measure for malware permanence. Against the background of these measures, concrete data sources were defined. Here, the malware event data served as the basis, which is why no further data preparation was necessary.

The most difficult measure to extract from the data was the duration of a malware infection. A malware infection in this context is defined as the period during which the same malware was present on different hosts within the entire enterprise. So, if a specific malware was identified on at least one host at the beginning of April and again in the middle of April, one is dealing with two separate infections. The malware incident data was thus aggregated once more to provide information about such infections. This time, the aggregation had to be performed along the date attribute of the malware incidents. Incidents with the same malware and similar dates were aggregated to the same malware infection group.

In order to identify similar dates, a grouping algorithm was applied. The algorithm devised for the present use case groups data objects by date and malware ID. The results were a number of classes, each containing a number of data objects with the same malware ID and a similar date. The algorithm performs the following steps for each identified malware ID:
(1) Sort all data objects by date.
(2) Create an initial empty group.
(3) Go through the data objects systematically and compare each date to the date of the previous one. If dates are similar, put the current data object into the just opened group. Otherwise, close the open group and create a new one containing the current object.
Similarities between dates may be parameterized. In the case above, dates were considered dissimilar if they were more than 7 days apart.

Finally, the attribute “group” was added to each record. This attribute has the value “0” if the record belongs to no group and a different number if it is part of a malware infection group. The result was a number of groups, each containing data objects with the same malware ID and a similar date. The grouping algorithm was parameterized during test runs in such a way that most groups contain either mostly malware incidents with high malware affection or mostly those with low malware affection.

After pre-processing the data, a cluster analysis was performed. Some findings are depicted in Figure 9. For the confidentiality reasons already mentioned, real values must not be shown; hence, the results of the analysis cannot be discussed in detail. Since the k-means algorithm has been proven to be effective in producing good clustering results for many practical applications, this method was applied for clustering the malware incidents [30]. The attribute distributions indicate whether the administrative privileges, the age, and the gender result in uncommon malware affection.

In total, eight clusters were identified. Figure 9 shows clusters 1 and 3, which were the most extensive ones. Cluster 1 includes 22%, whereas cluster 3 contains 19% of all malware incidents. Cluster 1 reveals that male users (cell 2) are likely to be affected by low-risk malware (cell 1), while in cluster 3 female users (cell 6) are in danger of being affected by high-risk malware (cell 5). Furthermore, cluster 1 indicates that middle-aged employees tend to be infected by malware

(cell 3). Surprisingly, the admin status does not seem to have an influence on malware infection in this cluster (cell 4). This is also the case in cluster 3 (cell 8). In contrast to cluster 1, cluster 3 reveals that younger and elder employees tend to have malware on their computers (cell 7).

[Figure 9 shows the attribute distributions of cluster 1 (cells 1 to 4) and cluster 3 (cells 5 to 8): malware risk (cells 1 and 5), user gender (cells 2 and 6), user age (cells 3 and 7), and admin status (cells 4 and 8).]

Figure 9. Results of the Cluster Analysis

VII. CONCLUSION

While BI systems are used in many enterprises to support classical business functions such as controlling or production, these enterprises usually have little to no experience with BI systems in the context of SIEM. Taking the benefits of a classic BI system into account, this paper focused on the option of using OLAP and data mining techniques for the purposes of SIEM. Based on the results of a project with an international enterprise, it can be derived that OLAP and data mining strongly support information security management teams. The gathered data can be analyzed more efficiently and patterns can be found which were previously hidden. Although the methods do not increase the detection ratio of malware directly, they help to find internal (and external) factors which influence malware infestation. As a result, measures (like awareness campaigns) could be set up to increase the performance of traditional measures already in operation, like anti-virus and intrusion detection systems.

It has to be stressed that the quality of the data is crucial for success and that interpretation questions with regard to false positives and false negatives were not in the focus of this paper. Thus, the implementation of an adequate ETL process to transfer data from the source into the data warehouse correctly and consistently is as important as the validation of the accuracy of the data. In order to narrow the entire set of data to a manageable subset and to ensure that this subset matches the needs of the decision makers, the data relevance must be judged. In addition, an appropriate multidimensional data model which serves as the basis for flexible data analyses has to be designed.

While many research papers have focused on the analysis of log data, e.g., for web marketing purposes, the analysis of security-relevant log data has barely been explored. As a result of the named project, it was exemplified that the so-called native data mining methods are applicable for the analysis of security-relevant log data.

Although the results presented in this paper are based on random data, rules were identified throughout the data mining project indicating that the age of a user has an impact on malware affection on the one hand and that the user’s gender influences malware occurrences on the other hand. At the same time, it had to be stated that the admin status of a user does not seem to have an influence on malware affection. However, the findings should not be generalized as they may relate to specific circumstances of the project conducted.

Due to the amount of data processed during the timeframe of the project, major efforts had to be made to ensure the quality of the log data with regard to its readiness for analysis. Though not in the focus of this paper, it has to be stated that the application of a data mining process, like CRISP-DM for instance, is a crucial success factor in this context.

VIII. FUTURE WORK

Naturally, the results of the association analysis should provide information about relationships between the different attributes which influence the number of malware occurrences on the enterprise’s hosts. Rules are easily understandable representations of such information. A rule might say that “if a user has administrative privileges on a host, this host does not have an abnormally high number of malware incidents”.

As for research objective 1 discussed in this paper, it seems sensible to create another model based upon a different technique in order to support or disprove the rules generated by Apriori. This can be achieved by training a clustering model with the k-means algorithm. An association rule might be supported by the cluster analysis if at least one cluster can be associated with it. A cluster representing the rule stated above might contain only those records in which the user possessed administrative privileges and the host was subject to a relatively low number of malware occurrences.

In order to serve the goals of SIEM, future research has to focus on further fields of log data analysis. For example, policy violations could be monitored by the use of data mining methods. Since enterprises usually have a bulk of policies (like password and access rules or the enforcement of regular updates of anti-malware and operating system software) to which the users and

hosts have to comply, the corresponding security data cannot be handled manually. By applying the described data mining techniques here, factors for violations of policy compliance could be identified efficiently, and countermeasures could consequently be set up in a timely fashion. Identified policy violation issues should be categorized, rated, and visualized automatically in a clearly arranged manner. Thus, the information security management executives can be provided with high-quality information. Data mining is therefore a promising option to identify patterns inside the data sets which were previously hidden.

Another way to perform data analyses and visualize the results is OLAP. This technology leads to efficient identification of policy compliance violations for which corresponding countermeasures could be set up rapidly. The presented OLAP approach need not be limited to a single enterprise. The standard reporting modules of anti-malware software can also be substantially improved by integrating a function which enables the use of dashboards as presented in the paper.

To sum up, the possibilities of BI in the context of SIEM are manifold. On the one hand, data mining techniques offer the promising chance to extract new knowledge out of the seemingly unstructured set of continuously logged data. On the other hand, OLAP enables various powerful descriptive analyses of measures according to different perspectives of interest. This knowledge in turn makes it possible to design new measures or adjust current ones, resulting in an enhancement of the quality of the entire information security infrastructure of the enterprise using BI for SIEM.

REFERENCES

[1] R. Gabriel, T. Hoppe, A. Pastwa, and S. Sowa, “Analyzing Malware Log Data to Support Security Information and Event Management: Some Research Results”, Proc. First International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2009), IEEE Press, Mar. 2009, pp. 108-113, doi: 10.1109/DBKDA.2009.26.
[2] K.C. Laudon and J.P. Laudon, Management Information Systems, Managing the Digital Firm, Prentice Hall International, Upper Saddle River, 2005.
[3] J.-C. Laprie, “Dependability of Computer Systems: from Concepts to Limits”, Proceedings of the 6th International Symposium on Software Reliability Engineering, 1995, pp. 2-11.
[4] S.C. Shih and H.J. Wen, “Building E-Enterprise Security: A Business View”, Information Systems Security, Vol. 12, No. 4, 2003, pp. 41-49.
[5] R. Anderson, Security Engineering, A Guide to Building Dependable Distributed Systems, Wiley & Sons, New York et al., 2008.
[6] B. Schneier, Secrets and Lies, Wiley & Sons, New York et al., 2004.
[7] ISO/IEC 17799:2005, Information technology – Code of practice for information security management, 2005.
[8] ISO/IEC 27001:2005, Information technology – Security techniques – Information security management systems – Requirements, 2005.
[9] M. Nyanchama and P. Sop, “Enterprise Security Management: Managing Complexity”, Information Systems Security, Vol. 9, No. 6, 2001, pp. 37-44.
[10] J. Biethahn, H. Mucksch, and W. Ruf, Ganzheitliches Informationsmanagement, Band I, 5th Edition, Oldenbourg, München et al., 2000.
[11] R. Gabriel and D. Beier, Informationsmanagement in Organisationen, Kohlhammer, Stuttgart, 2003.
[12] A. Williams, “Security Information and Event Management Technologies”, Siliconindia, Vol. 10, No. 1, 2006, pp. 34-35.
[13] D.F. Carr, “Security Information and Event Management”, Baseline, No. 47, 2005, p. 83.
[14] D. Hellriegel, S.E. Jackson, and J.W. Slocum, Management, South-Western College Publishing, Ohio, 1999.
[15] B. Gilmer, “Firewalls and security”, Broadcast Engineering, Vol. 43, No. 8, 2001, pp. 36-37.
[16] M. Anandrarajan, A. Anandrarajan, and C.R. Srinivasan, Business Intelligence Techniques, Springer, Berlin et al., 2004.
[17] P. Gluchowski and H.G. Kemper, “Quo Vadis Business Intelligence? Aktuelle Konzepte und Entwicklungstrends”, BI Spektrum, Vol. 1, No. 1, 2006, pp. 12-19.
[18] W.H. Inmon, Building the Data Warehouse, Wiley, New York et al., 1996.
[19] E.F. Codd, S.B. Codd, and C.T. Salley, Providing OLAP to User Analysts, An IT Mandate, White Paper, s.l., 1993.
[20] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From Data Mining to Knowledge Discovery in Databases”, AI Magazine, Vol. 17, No. 3, 1996, pp. 37-54.
[21] M. Jarke, M. Lenzerini, Y. Vassiliou, and P. Vassiliadis, Fundamentals of Data Warehouses, Springer, Berlin et al., 2000.
[22] W.H. Inmon, J.A. Zachman, and J.G. Geiger, Data Stores, Data Warehousing and the Zachman Framework, McGraw-Hill, New York, 1997.
[23] P. Rob and C. Coronel, Database Systems: Design, Implementation, and Management, Boston, 2007.
[24] W.W. Eckerson, Performance Dashboards: Measuring, Monitoring, and Managing Your Business, Wiley & Sons, New York et al., 2006.
[25] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2006.
[26] P. Chapman, J. Clinton, R. Kerber, T. Khazaba, T. Reinartz, C. Shearer, and R. Wirth, “CRISP-DM 1.0 Step-by-Step Data Mining Guide”, 2000, URL: http://www.crisp-dm.org/CRISPWP-0800.pdf, 22.09.2009.
[27] D.G. Conorich, “Monitoring Intrusion Detection Systems: From Data to Knowledge”, Information Systems Security, Vol. 13, No. 2, 2004, pp. 19-30.
[28] K. Yamanshi, J.-I. Takechu, and Y. Maruyama, “Data Mining for Security”, NEC journal of advanced technology, Vol. 2, No. 1, 2004, pp. 13-18.
[29] V. Kumar, M. Steinbach, and P.-N. Tan, Introduction to Data Mining, Addison Wesley, Upper Saddle River, 2005.
[30] K. Alsabti, S. Ranka, and V. Singh, “An Efficient K-Means Clustering Algorithm”, 1998, URL: http://www.cs.utexas.edu/~kuipers/readings/Alsabti-hpdm-98.pdf, 22.09.2009.
