SIT Internal
ICT 3204 Security Analytics
Lecture 4
Data Analysis Techniques
Lecture 3 Review
• Events of security interest
Lecture 3 Review
Raw logs -> Filtering & Normalization -> Correlation
• Data filtering (Extraction)
  • Irrelevant data fields
  • Duplicated data entries, possibly from different sources
  • Redundant data that depends heavily on, and can be derived from, other data, e.g., collinearity between DoB and Age
• Data normalization and reformatting (Transformation)
  • Break down known log messages into a normalized format, e.g., to reconcile inconsistent representations between data sources
  • Reformatting, e.g., .pcap (for Wireshark) to CSV (for Splunk)
• Handling data discrepancy (Feature engineering)
  • Noise, outliers
  • Missing values
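The three clean-up steps above can be sketched in Python. This is a minimal illustration only: the records, the field names (`time`, `src`, `bytes`, `dob`, `age`) and the two timestamp formats are hypothetical, and filling missing values with the median is just one possible choice.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records collected from two sources (all values are made up).
raw = [
    {"time": "2024-01-05 10:00:01", "src": "192.0.2.10", "bytes": 58, "dob": "1994-03-01", "age": 30},
    {"time": "2024-01-05 10:00:01", "src": "192.0.2.10", "bytes": 58, "dob": "1994-03-01", "age": 30},
    {"time": "05/01/2024 10:00:07", "src": "192.0.2.11", "bytes": None, "dob": "1983-07-19", "age": 41},
    {"time": "2024-01-05 10:00:09", "src": "192.0.2.12", "bytes": 4814, "dob": "1999-11-30", "age": 25},
]

KEEP = ("time", "src", "bytes", "dob")  # "age" is redundant: it can be derived from "dob"
FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M:%S")  # assumed source formats (day first)


def normalize_time(text):
    """Normalization: rewrite any known timestamp format as YYYY-MM-DD HH:MM:SS."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat(sep=" ")
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")


def clean(records):
    seen, out = set(), []
    for rec in records:
        row = {k: rec[k] for k in KEEP}           # filtering: drop redundant fields
        row["time"] = normalize_time(row["time"])
        key = tuple(row.items())
        if key in seen:                           # filtering: drop duplicated entries
            continue
        seen.add(key)
        out.append(row)
    known = [r["bytes"] for r in out if r["bytes"] is not None]
    fill = median(known)                          # missing values: fill with the median
    for r in out:
        if r["bytes"] is None:
            r["bytes"] = fill
    return out


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Normalizing before de-duplicating matters here: two entries that differ only in timestamp format would otherwise be kept as distinct records.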
Lecture 3 Review
Correlation Patterns
• Micro-Level: source IP correlation, destination IP correlation, time correlation, interleaving address correlation, port correlation
• Macro-Level: anti-port correlation, geographic location correlation, vulnerability correlation, watch list correlation
Lecture 4 Contents
Data analysis techniques
• Linux commands for log analysis
• Regular expressions
• Statistical analysis for data exploration
Linux commands for log analysis
• grep, awk
  • Data filtering
• sed
  • Parsing utility like awk
  • Good at search and text replacement to format the log output
• sort
  • Data summary
• head, tail
• …
Operations on Data
• Target operations
• Data reformatting: Modifying the way we see it
• .pcap -> Splunk
• Data filtering: We want to only see specific stuff
• Data summarization: Seeing a condensed view
• E.g., count, uniq
Linux Command - grep
• Linux/Unix utility, also ported to Cygwin on Windows
• Searches input files for a pattern or regular expression
• Works on human-readable text files
• Users need to know the search term or what they are looking for
• [Link]
Using grep for Log Analysis
• See all messages except those containing ssh or telnet (-E is needed so that | is treated as alternation)
# grep -Ev 'ssh|telnet' /var/log/messages
• See all messages matching the patterns from a file "patterns"
# grep -f patterns /var/log/messages
• Look for records with the string "Failed" or "failed"
# tail -n 1000 /var/log/messages | grep ailed
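For readers who want to try the same filters without a Linux shell, the two grep patterns above can be approximated in Python. This is a rough sketch: the log lines are invented for illustration, and the `grep` helper is a hypothetical stand-in, not the real utility.

```python
import re

# Hypothetical syslog-style lines (not taken from a real system).
log_lines = [
    "Jan  5 10:00:01 host1 sshd[101]: Failed password for root",
    "Jan  5 10:00:02 host1 telnetd[102]: connection from peer",
    "Jan  5 10:00:03 host2 kernel: eth0 link up",
    "Jan  5 10:00:04 host2 su: authentication failed for user bob",
]


def grep(pattern, lines, invert=False):
    """Return lines that match the regex, or that do not match when invert=True."""
    rx = re.compile(pattern)
    return [ln for ln in lines if bool(rx.search(ln)) != invert]


# Roughly: grep -Ev 'ssh|telnet'
not_remote = grep(r"ssh|telnet", log_lines, invert=True)

# Roughly: tail -n 1000 ... | grep ailed  (matches both "Failed" and "failed")
failures = grep(r"ailed", log_lines[-1000:])

print(not_remote)
print(failures)
```

Searching for `ailed` is a cheap way to be case-insensitive on the first letter only; `grep -i failed` would also match FAILED.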
Using grep for Log Analysis
Someone at address [Link] is doing something malicious by incrementing a customer account number by one and trying to guess the valid accounts we have in the system!
Linux Command - awk
• Linux/Unix tool
• [Link]
• We focus on the "print" command of awk to help us piece together what the malicious attacker has done
• View which devices and systems have logged to our file
# cat messages | awk '{print $4}' | sort -u
Using awk for Log Analysis
Use $n to reference a specific field
• e.g., awk '{print $1}' gives the client IP addresses
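awk's `$n` field reference can be mimicked with a whitespace split. The sketch below assumes a classic syslog layout (field 4 is the logging host) and a web access log whose first field is the client IP; both sample logs and the `field` helper are hypothetical.

```python
def field(line, n):
    """Return the n-th whitespace-separated field (1-based, like awk's $n), or ""."""
    parts = line.split()  # like awk's default splitting, runs of blanks count once
    return parts[n - 1] if 0 < n <= len(parts) else ""


# Hypothetical sample data for illustration only.
syslog_lines = [
    "Jan  5 10:00:01 host1 sshd[101]: Failed password for root",
    "Jan  5 10:00:02 fw01 kernel: DROP IN=eth0",
    "Jan  5 10:00:03 host1 cron[77]: job started",
]
access_lines = [
    '192.0.2.10 - - [05/Jan/2024:10:00:01 +0800] "GET /index.html HTTP/1.1" 200 512',
    '198.51.100.7 - - [05/Jan/2024:10:00:02 +0800] "GET /login HTTP/1.1" 200 734',
]

# Roughly: awk '{print $4}' | sort -u
hosts = sorted({field(ln, 4) for ln in syslog_lines})

# Roughly: awk '{print $1}'
clients = [field(ln, 1) for ln in access_lines]

print(hosts)
print(clients)
```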
Combined Usage of grep and awk
• Show the URLs that were accessed by the attacker at [Link], which pages returned an error (status code 403), and which pages were accessed successfully (status code 200)
The attacker at [Link] was able to brute force guess the account number 111111114 and changed the password on this client account.
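The grep-then-awk combination described above amounts to "keep the attacker's lines, then print the URL and status fields". A minimal Python sketch follows. The attacker address `192.0.2.10`, the URL paths and the log lines are all hypothetical placeholders (the real address is not given here), and the field positions assume the common Apache access-log layout.

```python
ATTACKER = "192.0.2.10"  # hypothetical placeholder address

# Hypothetical access-log lines; only account number 111111114 comes from the lecture.
access_log = [
    '192.0.2.10 - - [05/Jan/2024:10:00:01 +0800] "GET /account?id=111111112 HTTP/1.1" 403 199',
    '198.51.100.7 - - [05/Jan/2024:10:00:02 +0800] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.10 - - [05/Jan/2024:10:00:03 +0800] "GET /account?id=111111113 HTTP/1.1" 403 199',
    '192.0.2.10 - - [05/Jan/2024:10:00:04 +0800] "GET /account?id=111111114 HTTP/1.1" 200 2048',
]


def requests_by(ip, lines):
    """Return (url, status) pairs for requests from ip; like grep ip | awk '{print $7, $9}'."""
    hits = []
    for ln in lines:
        f = ln.split()
        if len(f) >= 9 and f[0] == ip:
            hits.append((f[6], f[8]))
    return hits


hits = requests_by(ATTACKER, access_log)
denied = [url for url, status in hits if status == "403"]
allowed = [url for url, status in hits if status == "200"]

print("403:", denied)
print("200:", allowed)
```

Comparing the first field exactly, rather than searching the whole line for the address, avoids false hits when the same digits appear inside a URL or a longer address.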
Regular Expressions
• Information on regular expression
• [Link]
• Grep can be used with regex
• [Link]
expressions-to-search-for-text-patterns-in-linux
• Splunk can be used with regex
Regex - Characters [Link]
Character  Legend                                                                                    Example               Sample Match
\d         Most engines: one digit from 0 to 9                                                       file_\d\d             file_25
\d         .NET, Python 3: one Unicode digit in any script                                           file_\d\d             file_9੩
\w         Most engines: "word character": ASCII letter, digit or underscore                         \w-\w\w\w             A-b_1
\w         Python 3: "word character": Unicode letter, ideogram, digit, or underscore                \w-\w\w\w             字-ま_۳
\w         .NET: "word character": Unicode letter, ideogram, digit, or connector                     \w-\w\w\w             字-ま‿۳
\s         Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab  a\sb\sc               a b(newline)c
\s         .NET, Python 3, JavaScript: "whitespace character": any Unicode separator                 a\sb\sc               a b(newline)c
\D         One character that is not a digit as defined by your engine's \d                          \D\D\D                ABC
\W         One character that is not a word character as defined by your engine's \w                 \W\W\W\W\W            *-+=)
\S         One character that is not a whitespace character as defined by your engine's \s           \S\S\S\S              Yoyo
.          Any character except line break                                                           a.c                   abc
.          Any character except line break                                                           .*                    whatever, man.
\.         A period (special character: needs to be escaped by a \)                                  a\.c                  a.c
\          Escapes a special character                                                               \.\*\+\? \$\^\/\\     .*+? $^/\
\          Escapes a special character                                                               \[\{\(\)\}\]          [{()}]
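The rows of the table above can be checked directly with Python's `re` module. A short sketch, with the caveat that Python 3 is one of the Unicode-aware engines in the table, so `re.ASCII` is used to show the "most engines" behaviour:

```python
import re

# Examples taken from the table; fullmatch requires the whole string to match.
digit = re.fullmatch(r"file_\d\d", "file_25")
word = re.fullmatch(r"\w-\w\w\w", "A-b_1")
space = re.fullmatch(r"a\sb\sc", "a b\nc")       # \s matches the space and the newline
non_digit = re.fullmatch(r"\D\D\D", "ABC")
dot_any = re.fullmatch(r"a.c", "abc")            # unescaped dot: any character
dot_literal = re.fullmatch(r"a\.c", "abc")       # escaped dot: only a literal "." -> None

# Python 3 str patterns are Unicode-aware: \d also matches digits from other scripts.
gurmukhi = "file_9\u0a69"                        # the "file_9੩" sample from the table
unicode_digit = re.fullmatch(r"file_\d\d", gurmukhi)
ascii_only = re.fullmatch(r"file_\d\d", gurmukhi, flags=re.ASCII)  # restrict \d to 0-9

for name, m in [("digit", digit), ("word", word), ("space", space),
                ("non_digit", non_digit), ("dot_any", dot_any),
                ("dot_literal", dot_literal), ("unicode_digit", unicode_digit),
                ("ascii_only", ascii_only)]:
    print(f"{name:14} {'match' if m else 'no match'}")
```

When parsing logs, the ASCII-only form is usually the safer default, since a field such as a port number should not accept digits from other scripts.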
Regex - Quantifiers [Link]
Quantifier  Legend                            Example           Sample Match
+           One or more                       Version \w-\w+    Version A-b1_1
{3}         Exactly three times               \D{3}             ABC
{2,4}       Two to four times                 \d{2,4}           156
{3,}        Three or more times               \w{3,}            regex_tutorial
*           Zero or more times                A*B*C*            AAACC
?           Once or none                      plurals?          plural
+           The + (one or more) is "greedy"   \d+               12345
?           Makes quantifiers "lazy"          \d+?              1 in 12345
*           The * (zero or more) is "greedy"  A*                AAA
?           Makes quantifiers "lazy"          A*?               empty in AAA
{2,4}       Two to four times, "greedy"       \w{2,4}           abcd
?           Makes quantifiers "lazy"          \w{2,4}?          ab in abcd
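The difference between greedy and lazy quantifiers is easiest to see side by side. The sketch below replays the table's examples in Python and adds one log-style line (a made-up key="value" format) where the distinction changes what gets extracted.

```python
import re

# Table examples: greedy takes as much as possible, lazy as little as possible.
greedy = re.search(r"\d+", "12345").group()
lazy = re.search(r"\d+?", "12345").group()
star_greedy = re.search(r"A*", "AAA").group()
star_lazy = re.search(r"A*?", "AAA").group()             # matches the empty string
bounded = re.search(r"\w{2,4}", "abcd").group()
bounded_lazy = re.search(r"\w{2,4}?", "abcd").group()

# Hypothetical log fragment: extracting quoted values.
line = 'user="alice" action="login"'
greedy_values = re.findall(r'"(.*)"', line)              # runs to the last quote
lazy_values = re.findall(r'"(.*?)"', line)               # stops at the nearest quote

print(repr(greedy), repr(lazy))
print(repr(star_greedy), repr(star_lazy))
print(repr(bounded), repr(bounded_lazy))
print(greedy_values)
print(lazy_values)
```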
Regex - Character class [Link]
Character  Legend                                                                       Example         Sample Match
[…]        One of the characters in the brackets                                        [AEIOU]         One uppercase vowel
[…]        One of the characters in the brackets                                        T[ao]p          Tap or Top
-          Range indicator                                                              [a-z]           One lowercase letter
[x-y]      One of the characters in the range from x to y                               [A-Z]+          GREAT
[…]        One of the characters in the brackets                                        [AB1-5w-z]      One of either: A,B,1,2,3,4,5,w,x,y,z
[x-y]      One of the characters in the range from x to y                               [ -~]+          Characters in the printable section of the ASCII table.
[^x]       One character that is not x                                                  [^a-z]{3}       A1!
[^x-y]     One of the characters not in the range from x to y                           [^ -~]+         Characters that are not in the printable section of the ASCII table.
[\d\D]     One character that is a digit or a non-digit                                 [\d\D]+         Any characters, including new lines, which the regular dot doesn't match
[\x41]     Matches the character at hexadecimal position 41 in the ASCII table, i.e. A  [\x41-\x45]{3}  ABE
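A few of the character classes above, tried out in Python. The request string containing control bytes is a made-up example of the kind of input where `[^ -~]+` is handy for spotting non-printable characters in a log.

```python
import re

tap = re.fullmatch(r"T[ao]p", "Tap")
tip = re.fullmatch(r"T[ao]p", "Tip")                    # "i" is not in the class -> None
upper = re.fullmatch(r"[A-Z]+", "GREAT")
negated = re.fullmatch(r"[^a-z]{3}", "A1!")
hex_range = re.fullmatch(r"[\x41-\x45]{3}", "ABE")      # \x41-\x45 is A-E

# The dot does not match a newline by default, but [\d\D] matches any character.
dot_multiline = re.fullmatch(r".+", "a\nb")
any_multiline = re.fullmatch(r"[\d\D]+", "a\nb")

# Hypothetical request with embedded control bytes.
request = "GET /index.html\x00\x01 HTTP/1.1\r\n"
non_printable = re.findall(r"[^ -~]+", request)

print(bool(tap), bool(tip), bool(upper), bool(negated), bool(hex_range))
print(bool(dot_multiline), bool(any_multiline))
print(non_printable)
```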
Regex - logic [Link]
Logic    Legend                    Example                Sample Match
|        Alternation / OR operand  22|33                  33
( … )    Capturing group           A(nt|pple)             Apple (captures "pple")
\1       Contents of Group 1       r(\w)g\1x              regex
\2       Contents of Group 2       (\d\d)\+(\d\d)=\2\+\1  12+65=65+12
(?: … )  Non-capturing group       A(?:nt|pple)           Apple
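The grouping and back-reference rows can be verified with a few lines of Python, using the table's own examples:

```python
import re

alternation = re.fullmatch(r"22|33", "33")

capture = re.fullmatch(r"A(nt|pple)", "Apple")
captured_text = capture.group(1)                       # text captured by group 1

# \1 must repeat exactly what group 1 captured: r, e, g, e, x.
backref = re.fullmatch(r"r(\w)g\1x", "regex")

# Two groups, referenced in swapped order on the right-hand side.
swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=65+12")
not_swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=12+65")

# A non-capturing group still groups the alternation but stores nothing.
non_capturing = re.fullmatch(r"A(?:nt|pple)", "Apple")

print(captured_text)
print(swapped.groups())
print(non_capturing.groups())
```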
Regex - Anchors and Boundaries [Link]
Anchor  Legend                                                                                                            Example           Sample Match
^       Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not")       ^abc .*           abc (line start)
$       End of string or end of line depending on multiline mode. Many engine-dependent subtleties.                       .*? the end$      this is the end
\A      Beginning of string (all major engines except JS)                                                                 \Aabc[\d\D]*      abc (string... ...start)
\z      Very end of the string. Not available in Python and JS                                                            the end\z         this is...\n...the end
\Z      End of string or (except Python) before final line break. Not available in JS                                     the end\Z         this is...\n...the end\n
\G      Beginning of string or end of previous match. .NET, Java, PCRE (C, PHP, R…), Perl, Ruby
\b      Word boundary. Most engines: position where one side only is an ASCII letter, digit or underscore                 Bob.*\bcat\b      Bob ate the cat
\b      Word boundary. .NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore  Bob.*\b\кошка\b   Bob ate the кошка
\B      Not a word boundary                                                                                               c.*\Bcat\B.*      copycats
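Anchors are where regex engines differ most, so it helps to test them in the engine you actually use. The sketch below shows the behaviour in Python's `re` module specifically; other engines, including the PCRE engine used by Splunk, may treat `\Z` and `$` differently, as the table notes.

```python
import re

text = "abc first line\nabc second line"

# ^ matches only at the start of the string unless multiline mode is on.
single = re.findall(r"^abc", text)
multi = re.findall(r"^abc", text, flags=re.MULTILINE)

# \A always means start of string, regardless of multiline mode.
string_start = re.findall(r"\Aabc", text, flags=re.MULTILINE)

# In Python, $ also matches just before a final newline, while \Z is the very end.
dollar = re.search(r"the end$", "this is the end\n")
very_end = re.search(r"the end\Z", "this is the end\n")

# \b needs a word character on exactly one side; \B needs the same kind on both sides.
whole_word = re.search(r"Bob.*\bcat\b", "Bob ate the cat")
part_of_word = re.search(r"Bob.*\bcat\b", "Bob ate the catfish")
inside_word = re.search(r"c.*\Bcat\B.*", "copycats")

print(single, multi, string_start)
print(bool(dollar), bool(very_end))
print(bool(whole_word), bool(part_of_word), bool(inside_word))
```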
Regular Expression Online Engine
Splunk uses Perl Compatible Regular Expressions (PCRE)
[Link]
Regular Expression Exercise
• Copy and paste the sample log messages below into [Link], then try to develop a regular expression that matches IPv4 addresses.
Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2007-02-24 [Link].917 42.682 UDP [Link]:57024 -> [Link]:19522 2 58 1
2007-02-24 [Link].552 15.202 UDP [Link]:57024 -> [Link]:18278 2 58 1
2007-02-24 [Link].806 13.998 UDP [Link]:57024 -> [Link]:31991 2 58 1
2007-02-24 [Link].434 96.322 UDP [Link]:54606 -> [Link]:38662 166 4814 1
2007-02-24 [Link].714 72.352 UDP [Link]:57024 -> [Link]:34016 2 58 1
2007-02-24 [Link].830 91.019 UDP [Link]:3656 -> [Link]:4027 160 4640 1
2007-02-24 [Link].941 80.638 UDP [Link]:57024 -> [Link]:34197 2 58 1
IP Address Validation
\d+\.\d+\.\d+\.\d+ or /([0-9]{1,3}\.){3}[0-9]{1,3}/g
• While these will catch IP addresses like [Link], they will also catch an invalid IP address like [Link]
• A regular expression for matching IP addresses should make sure each octet is in the proper range (0 to 255):
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
• This expression will still match an IP address of [Link], which is invalid in some network types
• Some security systems report spoofed IP addresses as [Link]
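The loose and range-checked expressions can be compared in Python. This sketch reuses the octet pattern from the slide, rewritten with a non-capturing group and a `{3}` repetition, which should be equivalent. The test addresses and the flow record are hypothetical (documentation-range addresses and a made-up time), since the real sample values are not reproduced here.

```python
import re

OCTET = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"      # 0-255, as in the slide's expression

LOOSE = re.compile(r"\d+\.\d+\.\d+\.\d+", re.ASCII)
STRICT = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}", re.ASCII)
FIND = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b", re.ASCII)   # for searching inside a line


def is_valid_ipv4(text):
    """True if the whole string is four dot-separated octets in the range 0-255."""
    return STRICT.fullmatch(text) is not None


# Hypothetical flow record in the same layout as the exercise data.
flow = "2007-02-24 10:15:30.917 42.682 UDP 192.0.2.10:57024 -> 198.51.100.7:19522 2 58 1"
addresses = FIND.findall(flow)

print(is_valid_ipv4("192.0.2.10"))
print(is_valid_ipv4("999.999.999.999"), bool(LOOSE.fullmatch("999.999.999.999")))
print(is_valid_ipv4("0.0.0.0"))                 # syntactically valid, though often suspicious
print(addresses)
```

Using `fullmatch` instead of `^...$` sidesteps the Python quirk that `$` also matches before a trailing newline; the word boundaries in `FIND` keep the duration field `42.682` and the port numbers from being mistaken for address parts.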
• On Lecture 3 contents
• 3 MCQs
• Open [Link] in your web browser
• Key in the class code that appears in the top right-hand corner of the presentation
• Type in your student ID and join
DATA EXPLORATION
Exploratory Data Analysis
○ uncovering interesting trends, outliers, and patterns in the data
○ identifying areas of interest, understanding the context of data
Exploratory Data Analysis (EDA)
• Process to understand data
• Learn about variables
• Significance of variables
• Entities involved
• Relationships between variables
• Relation with other datasets
• Without understanding the data
• Cannot assess usefulness
• Cannot refine it over time
• Cannot visualize suitably
• Cannot think algorithmically
• Cannot comprehend the capabilities
Statistical Techniques
• Techniques such as mean, median, standard deviation, interquartile range, and distance formulas
Analysis on a Single Variable
• Univariate Analysis
• Analyzing a single variable/attribute
• Purpose is to describe the quantitative data
• Does not deal with relationships
• Describing patterns using
• Central Tendency – Concentration of the data
• Mean, Mode and Median
• Dispersion – Spread of the data
• Range
• Variance
• Quartiles
Central Tendency
• Central Tendency – Concentration of the data
• Mean, Mode and Median
• Mean - sum of all values divided by the number of values
• Mode – value that occurs most frequently
• Median - the value at the middle of the sorted data set (the average of the two middle values when the count is even)
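The three measures can be computed with Python's standard `statistics` module. The data set below is hypothetical (failed logins per hour, with one burst) and is chosen to show how a single extreme value pulls the mean but not the median or mode.

```python
from statistics import mean, median, mode

# Hypothetical counts of failed logins per hour; 50 is an attack burst.
failed_logins = [2, 3, 3, 4, 50]

avg = mean(failed_logins)       # (2 + 3 + 3 + 4 + 50) / 5
mid = median(failed_logins)     # middle value of the sorted data
most = mode(failed_logins)      # most frequent value

print(f"mean={avg} median={mid} mode={most}")
```

Here the mean (12.4) describes no actual hour well, while the median (3) reflects the typical hour; a large gap between the two is itself a hint that outliers are present.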
Dispersion
• Dispersion – Spread of the data
  • Range
    • range = max – min
    • difference between the maximum and minimum values
  • Variance (σ²)
    • measures how far each value in the dataset is from the mean
    • defined as the sum of the squared distances of each term in the distribution from the mean (μ), divided by the number of terms in the distribution (N), i.e., σ² = Σ(x – μ)² / N
    • the standard deviation (σ) is the square root of the variance
  • Quartiles
    • split the population into quarters according to some attribute
    • First and third quartiles (the 25th and 75th percentiles, or the median value of the first and last halves of the data)
Exercise: Calculate the standard deviation (σ) of the dataset containing 3, 4, 4, 5, 6, 8.
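One way to check a hand calculation for the exercise is Python's `statistics` module. The sketch uses the population formulas (division by N), matching the definition above; the sample versions (`variance`, `stdev`) divide by N - 1 and give slightly larger values. Quartiles are taken as the medians of the lower and upper halves, which is only one of several common conventions.

```python
from statistics import mean, median, pstdev, pvariance

data = [3, 4, 4, 5, 6, 8]

data_range = max(data) - min(data)
mu = mean(data)                          # 30 / 6
variance = pvariance(data, mu)           # sum of squared distances from mu, divided by N
sigma = pstdev(data, mu)                 # square root of the population variance

ordered = sorted(data)
half = len(ordered) // 2
q1 = median(ordered[:half])              # median of the lower half
q3 = median(ordered[-half:])             # median of the upper half

print(f"range={data_range} mean={mu}")
print(f"variance={variance:.4f} sigma={sigma:.4f}")
print(f"Q1={q1} Q3={q3}")
```

By hand: the squared distances from the mean of 5 are 4, 1, 1, 0, 1 and 9, which sum to 16, so σ² = 16/6 ≈ 2.667 and σ ≈ 1.633.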
Analysis on Multiple Variables
• Multivariate analysis (MVA)
• Analysing two or more attributes together
• Quantitative measures
• Relationship between two attributes
• How attribute 1 affects attribute 2
• Interesting patterns
• Relationship between three attributes
• How attributes affect each other
• Interesting trends
Analysis on Multiple Variables
• Regression Analysis
• Predicting the outcome of an attribute from another attribute
• Predictive Modelling
• Principal Component Analysis
• Identify dominant patterns in data
• Detect Outliers using a box-plot
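A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and marks anything further out as an outlier. The same rule can be applied numerically, as sketched below. The factor 1.5 and the median-of-halves quartile method are common conventions assumed here rather than anything prescribed by the lecture, and the data is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical requests per minute from one client.
requests_per_minute = [2, 3, 3, 4, 5, 5, 6, 50]
outliers = iqr_outliers(requests_per_minute)

print(quartiles(requests_per_minute))
print(outliers)
```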
Five Number Summary for quantitative variables
• Use five numbers to summarize the range and distribution of a quantitative variable
  • Minimum and maximum values
    • taking the difference of these gives the range (range = max - min)
  • Median
    • the value at the middle of the sorted data set
  • First and third quartiles
    • 25th and 75th percentiles
• Mean
  • also called the average; not one of the five numbers, but usually reported alongside them (as R's summary() does)
• Provides an exploratory step to look at descriptive statistics of quantitative variables
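A five-number summary is simple to compute by hand or in code. The sketch below applies it to the Bytes column of the seven flow records from the regular expression exercise; the `five_number_summary` helper is a hypothetical name, and it uses median-of-halves quartiles, so results can differ slightly from R's summary(), which interpolates.

```python
from statistics import mean, median


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max, with quartiles as medians of the two halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return {
        "min": ordered[0],
        "q1": median(ordered[:half]),
        "median": median(ordered),
        "q3": median(ordered[-half:]),
        "max": ordered[-1],
    }


# Bytes column of the seven sample flow records.
flow_bytes = [58, 58, 58, 4814, 58, 4640, 58]

summary = five_number_summary(flow_bytes)
average = mean(flow_bytes)

print(summary)
print("mean:", average)
```

The median (58) sits at the minimum while the mean (1392) is far higher, which shows at a glance that a couple of large flows dominate the traffic volume.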
• Even though we represent Reliability and Risk as numbers, they are ordinal variables
  • meaning each entry is assigned an integer, and a value of 4 is not necessarily twice the Reliability or Risk of 2.
  • It only means that a Reliability or Risk scored 4 is higher than one scored 2.
Frequency for qualitative variables
• Display the count for each category of a qualitative variable
table(av$Reliability)
##      1      2      3      4      5      6      7      8      9     10
##   5612 149117  10892  87040      7   4758    297     21    686    196
table(av$Risk)
##      1      2      3      4      5      6      7
##     39 213852  33719   9588   1328     90     10
# summary sorts by the counts by default
# maxsum sets how many factors to display
summary(av$Type, maxsum=10)
##                Scanning Host               Malware Domain
##                       234180                         9274
##                   Malware IP               Malicious Host
##                         6470                         3770
##                     Spamming                          C&C
##                         3487                          610
## Scanning Host;Malicious Host    Malware Domain;Malware IP
##                          215                          173
## Malicious Host;Scanning Host                      (Other)
##                          163                          284
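The same kind of frequency table can be built in Python with `collections.Counter`. The handful of records below is a hypothetical sample that only borrows category names from the output above; it is not the real dataset.

```python
from collections import Counter

# Hypothetical sample of the Type column (category names borrowed from the output above).
types = [
    "Scanning Host", "Malware Domain", "Scanning Host",
    "Malware IP", "Scanning Host", "Malware Domain",
]

freq = Counter(types)               # like table(): count per category
top = freq.most_common(2)           # like summary(..., maxsum=...): largest counts first

for category, count in freq.most_common():
    print(f"{category:15} {count}")
```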
Graph to visualize data distribution
Bar charts giving a visual overview of the Country, Risk and Reliability factors respectively
Two-way Tables
Table of counts for reliability and Risk
Two-way Table Graphical Representation
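A two-way table simply counts how often each pair of categories occurs together. A minimal Python sketch, using a few hypothetical (Reliability, Risk) pairs rather than the real data:

```python
from collections import Counter

# Hypothetical (reliability, risk) pairs, one per record.
pairs = [(2, 2), (2, 2), (4, 2), (4, 3), (2, 3), (4, 2)]

counts = Counter(pairs)
rows = sorted({reliability for reliability, _ in pairs})
cols = sorted({risk for _, risk in pairs})

# table[reliability][risk] -> number of records with that combination
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print("Reliability \\ Risk", *cols)
for r in rows:
    print(f"{r:>18}", *(table[r][c] for c in cols))
```

Cells that are unexpectedly large or empty point to combinations worth a closer look, which is what a graphical version of the table makes visible at a glance.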
Lecture 4 Summary
• Linux commands for log analysis
• grep, awk ‘{print $n}’
• sed
• sort, uniq, count
• Regular expressions
• Data exploration
• Quantitative variables: five number summary
• Qualitative variables: frequency table
• Charts and graphs