On the database lookup problem of approximate matching

Frank Breitinger a,b, Harald Baier a, Douglas White b

a da/sec Biometrics and Internet Security Research Group, Hochschule Darmstadt, Haardtring 100, 64295 Darmstadt, Germany
b National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD 20899, United States
Keywords: Digital forensics, Approximate matching, Bloom filter, sdhash, mrsh-v2, Indexing, Blacklisting
Abstract
Investigating seized devices within digital forensics gets more and more difficult due to the increasing amount of data. Hence, a common procedure uses automated file identification, which reduces the amount of data an investigator has to look at by hand. Besides identifying exact duplicates, which is mostly solved using cryptographic hash functions, it is also helpful to detect similar data by applying approximate matching.

Let x denote the number of digests in a database; then the lookup for a single similarity digest has a complexity of O(x). In other words, the digest has to be compared against all digests in the database. In contrast, cryptographic hash values are stored within binary trees or hash tables, and hence the lookup complexity of a single digest is O(log2(x)) or O(1), respectively.

In this paper we present and evaluate a concept to extend existing approximate matching algorithms which reduces the lookup complexity from O(x) to O(1). Instead of using multiple small Bloom filters (which is the common procedure), we demonstrate that a single, huge Bloom filter has a far better performance. Our evaluation demonstrates that current approximate matching algorithms are too slow (e.g., over 21 min to compare 4457 digests of a common file corpus against each other) while the improved version solves this challenge within seconds. Studying the precision and recall rates shows that our approach works as reliably as the original implementations. We obtain this benefit at the cost of accuracy: the comparison is now a file-against-set comparison and thus it is not possible to see which file in the database is matched.

© 2014 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license.
Introduction
Handling hundreds of thousands of files is a major challenge in today's digital forensics. In order to cope with this information overload, investigators often apply hash functions for automated input identification. A common processing is known file filtering, which is quite simple: compute the hashes for all files on a target device and compare them to a reference database. Depending on the underlying database, files are either filtered out (e.g., files of the operating system) or filtered in (e.g., known offensive content). A very common database for filtering out data is the National Software Reference Library (NSRL) (NIST Information Technology Laboratory, 2013) maintained by the National Institute of Standards and Technology (NIST).

Besides identifying exact duplicates, which is mostly solved by running cryptographic hash functions, it is also necessary to cope with similar inputs (e.g., different versions of files), embedded objects (e.g., a JPG within a Word document), and fragments (e.g., network packets), which is commonly solved by approximate matching. The essential idea is to complement the use of cryptographic hash functions, which detect data objects with a bytewise identical representation, with the capability to find objects with bytewise similar representations.
However, the lookup complexity of similarity digests hampers their usage in the field. Let x denote the number of digests in a database; then the naive lookup for a single similarity digest has a complexity of O(x). In contrast, cryptographic hash values can utilize binary trees or hash tables and hence the lookup complexity is O(log2(x)) or O(1), respectively. Assuming a set instead of a single digest, the lookup of similarity digests has a quadratic runtime as it is solved by an all-against-all comparison (brute force).
Recently, at least one approximate matching algorithm (ssdeep) was extended and now offers indexing (Winter et al., in press). The authors showed an improvement by a factor of almost 2000, which is practical speed. However, analysis showed there are more powerful algorithms like sdhash (Roussev, 2011) and mrsh-v2 (Breitinger et al., 2013b) which output a different type of similarity digest. While ssdeep produces a Base64-encoded fingerprint, both other algorithms output Bloom filter based hashes. For these Bloom filter based algorithms, the lookup problem remains.
In this paper we present and evaluate a concept for approximate matching that allows a file-against-set comparison with a lookup complexity of O(1) for a single digest. In contrast to general approximate matching, our approach can only answer the question "does this set contain a file similar to file A?" by

- yes, there is a similar file (but it cannot say which one), or
- no, there is no similar file,

which is sufficient in the case of blacklisting. We obtain this benefit either at the cost of more hashing operations or by requiring a lot of main memory. Our evaluation demonstrates that the current procedures are too slow (e.g., over 21 min to compare 4457 digests of the t5-corpus against each other) while our improved version solves this challenge within seconds. Analyzing the precision and recall rates shows that our approach works as reliably as the original implementations.
The rest of the paper is organized as follows. Sec. 2 introduces the necessary background and related work. The problem description and solution overview are explained in Sec. 3. All details about our concept are presented in Sec. 4. The experimental results are given in Sec. 5. Sec. 6 concludes the paper.
Background & related work
This section explains the foundations and presents related literature. First, we briefly present the usage of hash functions and approximate matching in digital forensics, which is followed by an introduction to Bloom filters. Sec. 2.3 starts with an overview of approximate matching and then introduces three concepts in more detail.
Hash functions and approximate matching in digital forensics
Currently a popular use case is to employ hashing methods for known file filtering, which is quite simple: an investigator computes the hashes for all files on a target device and compares them to a reference database. Depending on the underlying database, files are either filtered out (e.g., files of the operating system) or filtered in (e.g., known offensive content). Files not found in the database remain unclassified.

In the case of filtering out (a.k.a. whitelisting) the database contains benign files, e.g., operating system files. We claim that here an investigator is only interested in exact matches and thus crypto hashes are the only choice. However, in the case of filtering in (a.k.a. blacklisting) the database contains illegal or suspicious inputs, e.g., child abuse material or leaked company secrets, and an investigator is also interested in similar files. Note that approximate matching may operate on the byte level or the semantic level (Breitinger et al., 2014).
Bloom filter
Bloom filters (Bloom, 1970) have a wide field of applications, e.g., database applications (Mullin, 1990) or network applications (Broder and Mitzenmacher, 2005), and are commonly used to represent the elements of a finite set S. A Bloom filter is an array of m bits, initially all set to zero. In order to insert an element s ∈ S into the filter, k independent hash functions are needed, where each hash function h outputs a value between 0 and m - 1. Next, s is hashed by all hash functions h. To insert, the bits at the positions h_0(s), h_1(s), ..., h_{k-1}(s) of the Bloom filter are set to one.

To answer the question whether s' is in S, we compute h_0(s'), h_1(s'), ..., h_{k-1}(s') and check whether the bits at the corresponding positions in the Bloom filter are set to one. If this holds, s' is assumed to be in S; however, we may be wrong, as the bits may have been set to one by different elements from S. Hence, Bloom filters suffer from a non-trivial false positive rate. Otherwise, if at least one bit is set to zero, we know with certainty that s' ∉ S. It is obvious that the false negative rate is equal to zero.
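For illustration, the following is a minimal, self-contained C sketch of the insert and lookup operations just described. The filter size M_BITS, the constant K and the helper mix64 are illustrative choices of ours and not part of sdhash or mrsh-v2, which derive the k bit positions from a single feature hash as described in Sec. 2.3.2.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define M_BITS (1u << 14)   /* filter size m in bits (illustrative value) */
#define K 5                 /* number of hash functions k */

typedef struct { uint8_t bits[M_BITS / 8]; } bloom_t;

/* Toy hash family: k different hashes derived from one FNV-style mix. */
static uint64_t mix64(const void *buf, size_t len, uint64_t seed) {
    const uint8_t *p = buf;
    uint64_t h = seed ^ 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
}

static void bloom_insert(bloom_t *bf, const void *s, size_t len) {
    for (unsigned i = 0; i < K; i++) {
        uint64_t pos = mix64(s, len, i) % M_BITS;      /* h_i(s) in [0, m-1] */
        bf->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

static int bloom_query(const bloom_t *bf, const void *s, size_t len) {
    for (unsigned i = 0; i < K; i++) {
        uint64_t pos = mix64(s, len, i) % M_BITS;
        if (!(bf->bits[pos / 8] & (1u << (pos % 8))))
            return 0;   /* definitely not in S (no false negatives) */
    }
    return 1;           /* probably in S (false positives possible) */
}

int main(void) {
    bloom_t bf;
    memset(&bf, 0, sizeof(bf));
    bloom_insert(&bf, "some feature", 12);
    printf("lookup: %d\n", bloom_query(&bf, "some feature", 12));
    return 0;
}
```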
In the case of uniformly distributed data, the probability that a certain bit is set to one by a single hash function during the insertion of an element is 1/m, i.e., the probability that a bit is still zero is 1 - 1/m. After inserting n elements into the Bloom filter, the probability of a given bit position being one is 1 - (1 - 1/m)^{k·n}. In order to have a false positive, all k array positions need to be set to one. Hence, the probability p for a false positive is

p = (1 - (1 - 1/m)^{k·n})^k ≈ (1 - e^{-kn/m})^k.   (1)
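As a quick numeric check of Eq. (1), the snippet below evaluates the approximation for sample parameters; the values m = 2048 bits, k = 5 and n = 160 correspond to a full per-file Bloom filter of the tools discussed later in Sec. 2.3.2 and are used here purely as an example.

```c
#include <math.h>
#include <stdio.h>

/* Eq. (1): p = (1 - (1 - 1/m)^(k*n))^k  ~  (1 - e^(-k*n/m))^k */
static double bloom_fp(double m, double k, double n) {
    return pow(1.0 - exp(-k * n / m), k);
}

int main(void) {
    /* Illustrative values: one 2048-bit filter, k = 5, n = 160 features. */
    printf("p = %.6f\n", bloom_fp(2048.0, 5.0, 160.0));
    return 0;
}
```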
Bytewise approximate matching
Approximate matching is a rather new area and probably had its breakthrough in digital forensics in 2006 with an algorithm called context triggered piecewise hashing (CTPH) (see Sec. 2.3.1). Since then, a couple of algorithms have been presented. As this work focuses on Bloom filter based approaches, we discuss those in Sec. 2.3.2. A complete overview of different algorithms is given by Breitinger et al. (2013a).
Basically, approximate matching consists of two separate functions. First, tools run a feature extraction function that extracts features or attributes from the input that allow a compressed representation of the original object (the exact proceeding depends on the implementation itself). Second, to compare two similarity digests, a similarity function is used that normally outputs a score s which is scaled to 0 ≤ s ≤ 100. Despite its range, this value is not necessarily an estimate of percentage commonality between the compared objects but a level of confidence. It is meant to serve as a means to sort and filter the results.
ssdeep and the F2S2 software
The program ssdeep, also known as context triggered piecewise hashing (CTPH, Kornblum (2006)), may be the origin of approximate matching and is based on the spam detection algorithm from Tridgell (2002-2009). The implementation is open source and available online.

The basic idea is very simple: split an input into chunks, hash each chunk independently and concatenate the chunk hashes to a final similarity digest. In order to split an input into chunks, the algorithm identifies trigger points using a rolling hash (a variation of the Adler-32 function) which considers the current context of seven bytes. Each chunk is then given to the non-cryptographic hash function FNV (Noll, 1994-2012). Instead of using the complete FNV hash, CTPH only takes the least significant 6 bits, which is equal to one Base64 character. Thus, two files are similar if they have common chunks.
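The following sketch illustrates the principle of content-defined chunking with a rolling window of seven bytes. It deliberately replaces ssdeep's Adler-32-style rolling hash and its adaptive block size with a trivial window sum and a fixed trigger mask, so the constants WINDOW and TRIGGER_MASK are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 7            /* context of seven bytes, as in CTPH */
#define TRIGGER_MASK 0x3F   /* hash % 64 == 0 acts as trigger point */

int main(void) {
    const uint8_t *data =
        (const uint8_t *)"example input that would normally be a whole file ...";
    size_t len = strlen((const char *)data);

    uint8_t win[WINDOW] = {0};
    uint32_t rh = 0;            /* simplified rolling hash: sum of window bytes */
    size_t chunk_start = 0;

    for (size_t i = 0; i < len; i++) {
        rh = rh - win[i % WINDOW] + data[i];   /* slide the window in O(1) */
        win[i % WINDOW] = data[i];

        if ((rh & TRIGGER_MASK) == 0) {        /* trigger point reached */
            printf("chunk: bytes %zu..%zu\n", chunk_start, i);
            /* here ssdeep would hash the chunk with FNV and keep 6 bits */
            chunk_start = i + 1;
        }
    }
    if (chunk_start < len)
        printf("chunk: bytes %zu..%zu\n", chunk_start, len - 1);
    return 0;
}
```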
F2S2 was presented by Winter et al. (in press) and is an extension for ssdeep that allows a faster similarity digest comparison. F2S2 initializes a hash table that allows inserting n-grams of the Base64 similarity digest (n-grams are fragments of a longer sequence, e.g., the 2-grams of ABCD are AB, BC and CD). Each similarity digest is split into its n-grams and the ID of the corresponding file is put into the matching hash table bucket. In order to look up a similarity digest, the queried digest is split into its n-grams. Next, the contents of all buckets are correlated in order to receive a set of possibly similar files. The final decision is then made by using the ssdeep comparison function.

The authors showed an improvement by a factor of almost 2000, which is practical speed. For instance, they decrease the time for verifying 195,186 files against a database with 8,334,077 entries from 364 h to 13 min.
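An n-gram index of this kind can be sketched as follows. The table layout (fixed bucket capacity), the n-gram length and the sample digest string are our own simplifications and do not reflect the actual F2S2 data structures; the point is merely that a lookup touches only the buckets of the queried n-grams instead of all database entries.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NGRAM 7        /* n-gram length over the Base64 digest (illustrative) */
#define BUCKETS 4096
#define BUCKET_CAP 8   /* fixed capacity to keep the sketch short */

typedef struct { int ids[BUCKET_CAP]; int count; } bucket_t;
static bucket_t table[BUCKETS];

static uint32_t ngram_hash(const char *s) {        /* FNV-1a over one n-gram */
    uint32_t h = 2166136261u;
    for (int i = 0; i < NGRAM; i++) { h ^= (uint8_t)s[i]; h *= 16777619u; }
    return h % BUCKETS;
}

static void index_digest(const char *digest, int file_id) {
    size_t len = strlen(digest);
    for (size_t i = 0; i + NGRAM <= len; i++) {
        bucket_t *b = &table[ngram_hash(digest + i)];
        if (b->count < BUCKET_CAP) b->ids[b->count++] = file_id;
    }
}

static int query_digest(const char *digest) {
    size_t len = strlen(digest);
    int hits = 0;
    for (size_t i = 0; i + NGRAM <= len; i++) {
        /* in F2S2 the candidate IDs of all buckets are correlated here and
           finally verified with the ssdeep comparison function */
        hits += table[ngram_hash(digest + i)].count;
    }
    return hits;
}

int main(void) {
    const char *digest = "3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C"; /* sample digest string */
    index_digest(digest, 1);
    printf("candidate hits: %d\n", query_digest(digest));
    return 0;
}
```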
Bloom filter based approaches
This section presents two further prominent approaches that outperform ssdeep with respect to precision and recall (Roussev, 2011; Breitinger et al., 2013b, 2013c). In the following, we provide a brief sketch of the feature extraction functions of sdhash and mrsh-v2, respectively; a detailed description is beyond the scope of this paper. Details about the similarity function and the similarity digests are given at the end of this section.
sdhash. This algorithm was proposed by Roussev (2010) and attempts to pick characteristic features for each object that are unlikely to appear by chance in other objects, which is the result of an empirical study. In the baseline implementation, each feature is hashed with SHA-1 (Gallagher and Director, 1995) and inserted into a Bloom filter (Bloom, 1970), where a feature is a sequence of 64 bytes. The similarity digest of the data object is a sequence of 256-byte Bloom filters, each representing approximately 10 KiB of the original data, on average.

Subsequently, a block-aligned version was developed (Roussev, 2012), in which fixed-size blocks (16 KiB by default) are mapped to each 256-byte filter. Although the two versions are compatible (the two versions of the digests can be meaningfully compared), we do not consider the block-aligned version in our study as it requires additional parameters.
mrsh-v2. Breitinger and Baier (2013) propose a new algorithm that is based on ssdeep and multi-resolution similarity hashing (Roussev et al., 2007). Similar to ssdeep, the algorithm divides an input into chunks using a rolling hash, where the estimated block size is 160 bytes. Each chunk is then hashed by the 64-bit non-cryptographic hash function FNV-1a (Noll, 1994-2012) and inserted into a Bloom filter, where a filter can store up to 160 chunks. Once a Bloom filter reaches its capacity, a new one is created.

Note that in the following we use the term feature as a synonym for chunk.
Similarity digest. The similarity digest is very similar in both cases. To insert a feature hash into an m = 2048 bit Bloom filter (the default size for both algorithms), the algorithms take 55 bits of the hash, split them into k = 5 sub-hashes of 11 bits each and set the corresponding bits. For instance, the sub-hash 00010001100 (binary) = 0x8C = 140 (decimal) sets bit 140 in the Bloom filter. Both implementations have a maximum number of features per Bloom filter. If this limit is reached, a new Bloom filter is created. Hence, the final similarity digest is a sequence of Bloom filters which is supposed to be approximately 1.0% (mrsh-v2) or 1.6%-2.6% (sdhash) of the input length (compression ratio). To identify the similarity between two digests, all Bloom filters of fingerprint A are compared against all Bloom filters of similarity digest B using the Hamming distance as metric (the original comparison is only sketched here, as we replace it in our new concept).
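A minimal sketch of this insertion scheme, assuming a 64-bit feature hash of which 55 bits are consumed as five 11-bit sub-hashes; the function and constant names are ours, not those of sdhash or mrsh-v2.

```c
#include <stdint.h>
#include <stdio.h>

#define BF_BITS 2048      /* m = 2048 bits, default for sdhash/mrsh-v2 */
#define SUBHASHES 5       /* k = 5 sub-hashes of 11 bits each */
#define SUBHASH_BITS 11   /* 2^11 = 2048 addressable bit positions */

/* Insert one feature hash: use 55 of its bits as five 11-bit sub-hashes
   and set the corresponding bits of the filter. */
static void insert_feature(uint8_t bf[BF_BITS / 8], uint64_t feature_hash) {
    for (int i = 0; i < SUBHASHES; i++) {
        uint32_t pos = (uint32_t)((feature_hash >> (i * SUBHASH_BITS)) & 0x7FF);
        bf[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

int main(void) {
    uint8_t bf[BF_BITS / 8] = {0};
    /* With the lowest 11 bits equal to 00010001100b = 0x8C = 140,
       the first sub-hash sets bit 140 (the example from the text). */
    insert_feature(bf, 0x8C);
    printf("bit 140 set: %d\n", (bf[140 / 8] >> (140 % 8)) & 1);
    return 0;
}
```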
Problem & solution
Currently, the problem is that it is not possible to order/index Bloom filter digests. Thus, if a database contains x digests, a comparison of a given similarity digest against the database requires an against-all comparison. Extending this scenario means that comparing y similarity digests against the same database corresponds to an all-against-all comparison (brute force), which equals a quadratic runtime complexity: O(x·y).
Sec. 3.1 gives an overview of the overall idea. After that, we briefly repeat the terminology and definitions, as it is important to have the abbreviations in mind.
Proceeding overview
The basic idea is to insert all features into a single Bloom filter instead of having multiple filters, as the lookup complexity per filter is O(1). Thus, we overcome the drawback of existing approaches and avoid the all-against-all comparison. To speed up the comparison decision we additionally replace the classical comparison function by a decision based on a sufficiently large number of common substrings (later called the longest run), as explained in Sec. 4.

More precisely, let S_B and S_D be two sets of digests. Traditionally (using cryptographic hash functions) an investigator possesses a database containing the elements of S_B (e.g., the blacklist). When he receives D (e.g., a seized device), he hashes all files to S_D and compares them against S_B. Note that the database S_B can be precomputed and hence its generation time is irrelevant.
Regarding our concept, there are two alternatives depending on the underlying hardware:

1. Alternative one is identical to the traditional procedure. That is, the Bloom filter is filled with the features of S_B in advance, so we can neglect its generation time. Note that using more than one Bloom filter would slow down the process as they always have to be loaded into memory.
2. The second possibility assumes that S_B does not fit into the Bloom filter, but S_D does. In that case we turn the workflow upside down by filling the Bloom filter with S_D and comparing S_B against it.
The difference between these two procedures is the overall time. While in the traditional procedure (alternative (1)) only S_D needs to be processed, the second possibility also has to hash S_B as a precomputation step. In the following, (1) is denoted by best case and (2) by worst case.

The reason why alternative (2) might be necessary is that it may not be possible to load the Bloom filter for S_B into main memory. Hashing all files of a set into a single Bloom filter requires a large Bloom filter which has to fit into main memory for efficiency reasons. Thus, the limiting resource is the physically available RAM.

For instance, let us assume that |S_B| = 1500 GiB and |S_D| = 200 GiB. As shown later, an everyday workstation with 8 GiB RAM cannot handle a Bloom filter of S_B but can handle one of S_D. Therefore, we suggest creating a Bloom filter out of S_D and comparing all files of S_B against it in a second step. It is obvious that both sets have to be hashed; it is not possible to create the database in advance.

To optimize (2), one may store a list of hash values of S_B instead of the files. Thus, the files are already hashed and the overall proceeding is almost as fast as (1). In addition, the compression is better as we only store a 256-bit (32 byte) hash for each 64-byte chunk.

Another downside of this approach is that we can only say "yes, there is a similar file" but not which file is matched. In contrast, the traditional procedure allows a statement such as "file A of the seized device matches file B in the database". However, we argue that with respect to filtering in (blacklisting) this is sufficient, as an investigator has to analyze matches anyway. In short, we only want to know if the investigated file is similar to any file in our set, which is perfectly suited for blacklisting. In case a yes or no decision is insufficient, this procedure can be used as a pre-processing step: if a file is not found in the Bloom filter, it is definitely not a blacklisted file and can be ignored.
Terminology & definition
This section repeats the notations from the previous section which are necessary to understand all improvements and design decisions. Let m, k, n ∈ N.

- feature describes a byte sequence which is hashed and inserted into the Bloom filter. In the case of mrsh-v2 this equals a chunk of approximately 160 bytes, and regarding sdhash this is a sequence of exactly 64 bytes.
- m denotes the Bloom filter size in bits.
- k is the number of sub-hashes, where each one sets a bit in the Bloom filter.
- n is the number of features inserted into a Bloom filter.
- s denotes the file set size in MiB.
Design decisions and implementation
This section describes the design and code changes of our approach. Sec. 4.1 shows the correlation between the input file size and the number of features which are inserted into the Bloom filter. The relevance of the feature hash function is discussed in Sec. 4.2. Based on all these findings, Sec. 4.3 explains the procedure to calculate the best Bloom filter size. Sec. 4.4 introduces our match decision approach and the false positive rate, which is the final parameter of our configuration. After that, we describe the result presentation in Sec. 4.5 and motivate the default configuration in Sec. 4.6. The final part shows some details about our reference implementation.
Correlation between elements (n) and file set size
In Eq. (1), n denotes the number of elements that are inserted into a Bloom filter. Note that the number of elements is different from the number of files in the set but equal to the number of features. Hence, this section analyzes the relation between n and the file set size. Let s denote the file set size in MiB.

sdhash inserts 192 features into a Bloom filter for approximately every 10 KiB of the input file. Thus, n is calculated by n = s·2^20·192/(10·2^10) ≈ s·2^14, where 2^20 is needed to convert from MiB to bytes.

In the case of mrsh-v2, the implementation splits the input into 160-byte features. Thus, n is calculated by n = s·2^20/160 ≈ s·2^13, where 2^20 converts MiB into bytes.

Since we adapted the feature size of mrsh-v2 in our prototype to 64 bytes, we set n = s·2^14 in the following.
Feature hash function
To insert a feature into a Bloom filter of m bits length, k bits are set, which requires a hash value of the feature hash function of at least k·log2(m) bits. More formally, having a feature hash function of b bits, b ≥ k·log2(m), which is equivalent to m ≤ 2^(b/k).

The default implementations of sdhash and mrsh-v2 run the 160-bit SHA-1 and the 64-bit FNV hash, respectively, and set k = 5. This results in a maximum Bloom filter size of 2^(160/5) = 2^32 bits = 2^29 bytes = 512 MiB (sdhash) and 2^(64/5) = 2^12.8 bits ≈ 891 bytes (mrsh-v2).

Depending on the file set, this is insufficient, as we might need a Bloom filter of several gigabytes (for instance when mapping several terabytes into a Bloom filter). As a consequence, both tools should implement 256-bit versions of the hashing algorithms. For instance, keeping k = 5 and using a 256-bit hash function allows us to handle Bloom filter sizes of up to 2^(256/5) = 2^51.2 bits ≈ 2^18 GiB. Alternatively, if we have an upper limit on the Bloom filter size, we can increase k, which reduces the false positive rate (see Sec. 2.2). For instance, assuming a Bloom filter of m = 8 GiB = 2^36 bits, k can be at most 256/36 ≈ 7.11. For the remainder of this paper, we limit k to 5 ≤ k ≤ 7, where 5 is the lower limit to minimize the chance of a false positive. The upper limit is necessary to handle huge amounts of data, e.g., 1 TiB. If 7 is too small, one might change the hash function to 512 bits or more.
Defining the Bloom filter size
Traditionally, when dealing with Bloom filters, one tries to optimize k for a given n, m, p setting. However, we have limited 5 ≤ k ≤ 7 and discussed n. Thus, this section identifies a reasonable Bloom filter size m by transposing Eq. (1) and substituting n in the last step by s·2^14 (see Sec. 4.1):

p ≈ (1 - e^{-kn/m})^k  ⟹  m = -k·n / ln(1 - p^{1/k}) = -k·s·2^14 / ln(1 - p^{1/k}).   (2)

Note that in our reference implementation, the Bloom filter size has to be a power of two, i.e., m = 2^c, c ∈ N.
Match decision and false positives
In contrast to the classical approximate matching comparison, we are only interested in a binary decision: the currently processed file, or a fragment of it, is on the blacklist or not. Therefore we adapt the classical comparison function of Bloom filters as follows: a fragment of a given file is assumed to be part of the Bloom filter if a sufficiently large number of subsequent features is found in the filter. In the following we discuss our approach to identify a reasonable interpretation of what "sufficiently" means.

Eq. (1) and Eq. (2) relate the false positive probability p for a single feature to the different parameters k, n, s, m. In fact, we are less interested in the false positive rate for a single feature but more in the one for a fragment of a whole file (which may consist of hundreds of thousands of features). Thus, we have to extend Eq. (2) to a false positive probability of a fragment.

Let p_f denote the false positive probability for a fragment. If we require r_min ∈ N consecutive false positive features to declare a false positive fragment, the false positive probability for a fragment is p_f = p^{r_min}.

Regarding Eq. (2), we can substitute p by p_f^{1/r_min} and obtain

m = -k·s·2^14 / ln(1 - p_f^{1/(k·r_min)}).   (3)
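The longest-run decision itself can be sketched as follows; the toy feature hashes, the simplified k-sub-hash lookup and the miniature database in main() are illustrative and not taken from mrsh-net.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BF_BITS 2048
#define SUBHASHES 5
#define MIN_RUN 6   /* r_min: required number of consecutive feature hits */

static int get_bit(const uint8_t *bf, uint32_t pos) {
    return (bf[pos / 8] >> (pos % 8)) & 1;
}
static void set_bit(uint8_t *bf, uint32_t pos) {
    bf[pos / 8] |= (uint8_t)(1u << (pos % 8));
}

/* k-sub-hash membership test over 11-bit sub-hashes (cf. Sec. 2.3.2). */
static int bloom_contains(const uint8_t *bf, uint64_t feature_hash) {
    for (int i = 0; i < SUBHASHES; i++)
        if (!get_bit(bf, (uint32_t)((feature_hash >> (i * 11)) & 0x7FF)))
            return 0;
    return 1;
}

/* Longest-run decision: a file is reported as a match iff at least
   MIN_RUN consecutive features are found in the filter. */
static int file_matches(const uint8_t *bf, const uint64_t *features, size_t count) {
    size_t run = 0, longest = 0;
    for (size_t i = 0; i < count; i++) {
        if (bloom_contains(bf, features[i])) { if (++run > longest) longest = run; }
        else run = 0;
    }
    printf("longest run: %zu\n", longest);
    return longest >= MIN_RUN;
}

int main(void) {
    uint8_t bf[BF_BITS / 8] = {0};
    uint64_t features[8] = {11, 22, 33, 44, 55, 66, 77, 88};  /* toy feature hashes */
    for (int i = 0; i < 8; i++)                               /* build the "database" */
        for (int j = 0; j < SUBHASHES; j++)
            set_bit(bf, (uint32_t)((features[i] >> (j * 11)) & 0x7FF));
    printf("match: %d\n", file_matches(bf, features, 8));     /* run of 8 >= 6 */
    return 0;
}
```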
Result presentation
Instead of printing a similarity score between two files, our algorithm outputs

[Link]: 163 of 2518 (longest run: 111)

which means that the file consists of 2518 features in total, of which 163 match the underlying Bloom filter. The longest run is 111 features, which means that the algorithm identified 111 matching features in a row (which is larger than r_min and therefore a match).
Sample setting
In the following we briefly discuss the default values. n cannot be influenced as it is defined by the set size. In the case of the false positive rate it is obvious that the smaller the better. Since we do not expect more than 1 million files on a device, we set p_f = 10^-6. k is fixed between 5 and 7 and we decided on 5, as the smaller k the better the runtime (the feature hash length can be reduced; see Sec. 4.7.1). r_min is 6 by default.

For instance, let us assume s = 200 GiB = 25·2^13 MiB of data, a false positive rate per file of p_f = 10^-6, k = 5 and r_min = 6:

m = -k·s·2^14 / ln(1 - p_f^{1/(k·r_min)})
  = -5·(25·2^13)·2^14 / ln(1 - 10^{-6/30})
  = 125·2^27 / 0.9968
  ≈ 125.40·2^27
  ≈ 2^34 bits.

Thus, our procedure requires a Bloom filter size of 2 GiB. Note that using the default setting, the Bloom filter size in megabytes m_mb can be estimated by m_mb ≈ s/100.
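As a sanity check of this calculation, the following snippet evaluates Eq. (3) for the default setting and rounds the result up to a power of two, as required by the implementation; the function name is ours.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Eq. (3): m = -k * s * 2^14 / ln(1 - p_f^(1/(k*r_min))), with s in MiB. */
static uint64_t bf_size_bits(double s_mib, double k, double r_min, double p_f) {
    double m = -k * s_mib * 16384.0 / log(1.0 - pow(p_f, 1.0 / (k * r_min)));
    uint64_t bits = 1;
    while ((double)bits < m) bits <<= 1;   /* round up to the next power of two */
    return bits;
}

int main(void) {
    /* Default setting from the text: s = 200 GiB, k = 5, r_min = 6, p_f = 1e-6. */
    uint64_t m = bf_size_bits(200.0 * 1024.0, 5.0, 6.0, 1e-6);
    printf("m = %llu bits = %llu GiB\n",
           (unsigned long long)m,
           (unsigned long long)(m / 8 / (1024ULL * 1024 * 1024)));
    /* Expected: 2^34 bits, i.e., a 2 GiB Bloom filter, matching m_mb ~ s/100. */
    return 0;
}
```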
Implementation details
To verify our findings, we released a tool called mrsh-net, which is basically a modification of the latest mrsh-v2 version. It can be downloaded from our website for tests.

The implementation is very simple and only has two options:

- -g generates the database and prints it to stdout. Usage: ./mrsh-net -g t5-corpus > dbFile
- -i reads the database file dbFile and compares DIR/FILE against it. Usage: ./mrsh-net -i dbFile t5-corpus

The final step is to compile it by running make mrsh-net.
Feature hash function
The main change was the implementation of the 256-bit FNV-1a function, which only consists of an XOR and the multiplication with the prime 2^168 + 2^8 + 0x63. As the runtime efficiency is very important, the implementation of the multiplication is hardcoded, i.e., it is not trivial to change the prime number or to extend it to 512 bits.

In order to speed up the implementation, one may manipulate the FNV implementation in src/fnv.c. The function mulWithPrime2 is responsible for the multiplication with the 256-bit prime. However, in the case of a small Bloom filter, we do not need the most significant bits and can remove them. For instance, setting the Bloom filter to 32 MiB and k = 5, we only need log2(32·2^20·8)·5 = 140 bits. Thus, we can comment out lines 108-112, which then ignores the bits 160-255.
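For reference, the well-known 64-bit FNV-1a variant looks as follows; the prototype's 256-bit version is structurally identical (XOR, then multiply) but operates on a 256-bit state with the prime given above via the hardcoded mulWithPrime2 routine.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standard 64-bit FNV-1a: XOR the next byte into the state,
   then multiply by the FNV prime. */
static uint64_t fnv1a_64(const void *buf, size_t len) {
    const uint8_t *p = buf;
    uint64_t h = 14695981039346656037ULL;   /* 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                           /* XOR in the next byte */
        h *= 1099511628211ULL;               /* 64-bit FNV prime 2^40 + 2^8 + 0xb3 */
    }
    return h;
}

int main(void) {
    const char *feature = "a 64-byte feature would be hashed here";
    printf("fnv1a = %016llx\n",
           (unsigned long long)fnv1a_64(feature, strlen(feature)));
    return 0;
}
```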
Settings
To adapt our prototype for a specific use case, the user can change the following configuration in header/config.h:

- SUBHASHES: number of sub-hashes, parameter k (default: 5).
- MIN_RUN: minimal longest run, parameter r_min (default: 6).
- BF_SIZE_IN_BYTES: Bloom filter size in bytes (default: 33,554,432 = 2^25 bytes = 32 MiB). It has to be a power of 2.

There are more settings available. However, this is ongoing research and we therefore do not recommend changing them at this time.
Experimental results & assessment
This chapter mainly consists of three parts. First, we analyze the general efficiency of the different approaches. The second part relates mrsh-net to the longest common substring. The final part compares mrsh-net and mrsh-v2 with each other.

All the presented results are based on the t5-corpus (Roussev, 2011), which contains 4457 files with a total size of 1.78 GiB. The average file is approximately 400 KiB and the file type distribution is given in Table 2.

For our testing, we used the default configuration of mrsh-net, with k = 5, r_min = 6 and a Bloom filter size of 32 MiB. The block size, i.e., the approximate length of a feature, is set to 64 bytes.
Efficiency in general
Let S_D denote the hashes of files from a device and let S_B denote the database set (i.e., the blacklist). Traditionally, the proceeding requires hashing all files in S_D and comparing the hashes against an existing database of the S_B files. Thus, this section focuses on the general properties of the different approaches with respect to runtime efficiency and database size (compression).

The results are given in Table 1, whereby the details are discussed in the upcoming subsections. First, the compression (row 1) is analyzed in Sec. 5.1.1. Next, Sec. 5.1.2 explains the runtime of the algorithms (rows 2-4). The last subsection is an estimation for a large scale scenario to clarify the impact of non-indexing.

Columns 1 and 2 present the results for the original implementations of sdhash and mrsh-v2, respectively. In column 3 we show the results for the worst case, which means that we do not have an underlying database (see Sec. 3.1). The following column also presents the worst case, but we modified mrsh-net based on the defaults in Sec. 4.7.2 (fewer bits of the FNV hash are considered). For completeness we included the results for F2S2 and SHA-1 in the last two columns.
Database size
Let S_B be the t5-corpus. Then, this section shows the size of the corresponding database. In the case of sdhash, mrsh-v2, F2S2 and SHA-1 the database is trivial, as the database is equal to the hashes.

Regarding mrsh-net there are two possibilities: worst vs. best case. The worst case describes the scenario where the database does not fit in RAM and hence a hash database as such does not exist; the investigator needs to have the whole data set available. In contrast, for the best case, where sufficient RAM is available, the database is simply the Bloom filter.

To conclude, the sizes of the databases of sdhash, mrsh-v2 and mrsh-net (best) are in the same order of magnitude and therefore only a weak assessment criterion.
Experimental runtime efficiency
This section focuses on the time of hashing S_D and comparing it against the database of S_B (generating the database is neglected as it can be done in advance). As we are interested in the runtime only, we use the t5-corpus as both S_B and S_D. Note that this results in 4457 × 4457 comparisons (we run the tools by ./tool -c D B, which compares both lists; thus there are also duplicate comparisons, i.e., A against B and B against A).

The times are given in Table 1. Row 2 states that all algorithms perform well in hashing but are still slow compared to SHA-1. The problem is shown in row 3, where both Bloom filter approaches need an extremely long time for the all-against-all comparison. The last row simply sums rows 2 and 3. Note that the worst-case columns constitute an exception: admittedly the comparison takes less than a second; however, there is no underlying database and thus S_B has to be processed.
Therefore, row 4 contains twice the hashing time (as D = B).

One should keep in mind that the comparison has quadratic complexity and thus the time increases enormously when the number of files increases. In contrast, our new concept has a linear runtime as it only needs to hash the files.
Impact on runtime in large scale forensics
Based on the findings from before, this section estimates the efficiency for a real life scenario. More precisely, we used the numbers from Table 1 and calculated the upcoming ones for a larger use case, where we pick up the example from Sec. 4.6 assuming 200 GiB of seized data and a database of 1500 GiB (the current NSRL of NIST contains about 2 TiB of unique data, hence this is a realistic size).

The results are given in Table 3. The estimated database size is given in row 1, where in the best case mrsh-net needs the least space; however, it has to be kept in RAM. Row 2 calculates the approximate hashing time by multiplying by 200/1.78 (we have to process 200 GiB instead of 1.78 GiB). The last row assesses the comparison time. For instance, sdhash needed 1281 s for comparing 1.78 × 1.78 ≈ 3.17 GiB of data. As this sample requires comparing 200 × 1500 = 300,000 GiB of data, we estimate the overall time by 300,000/3.17·1281 s.

Again, there are two possible scenarios with mrsh-net. In the worst case, we hash the 200 GiB into the Bloom filter and then process the 1500 GiB. As the comparison costs nothing, mrsh-net has to hash 1700 GiB, which takes (1700/1.78)·123 s ≈ 117,471 s (approx. 32 h). In the best case we have a powerful station that can hold the Bloom filter for the 1500 GiB of data in RAM (approximately 16 GiB of RAM are needed). Thus, we only need to process the 200 GiB, which takes 227 min.
Precision & recall on the basis of the longest common substring
The current version of mrsh-net decides between match and non-match based on the longest run. Hence, this section focuses on the relation between mrsh-net and the longest common substring. Due to the complexity, we build our ground truth on the approximate longest common substring, which is briefly described in the upcoming subsection. Based on this assumed ground truth, we evaluate mrsh-net in Sec. 5.2.2.
Approximate longest common substring
The basic idea of the approximate longest common substring metric (aLCS) is not to compare files byte by byte but rather block by block. To identify the blocks, we utilize the rolling hash from ssdeep and aim at having a block size bs ≈ 80 bytes. If we set the block size smaller than 80, the runtime efficiency decreases enormously (note that a block size of 1 equals the traditional longest common substring). Instead of comparing blocks bytewise, each one is hashed and compared using the 64-bit FNV-1a hash (Noll, 1994-2012). Besides the hash value, the entropy and length of each block are stored in the final linear list called the aLCS digest. (This is ongoing research and currently in review; in the future we will publish more details about the evaluation and show that this is a valid approximation.)

The output of the aLCS tool is a list which we denote as ground truth. It contains both file names, the length of the longest common substring and the entropy of this sequence, e.g.,

file1 | file2 | 993 | 5.56
file2 | file3 | 11945 | 0.5

For instance, the first line says that file1 and file2 have an aLCS score of 993 bytes with an entropy of 5.56. The second line shows a special case with a very low entropy, which could be an indicator that both files share mostly zeros.
Precision & recall rates
To calculate the rates, we perform an all-against-all comparison but neglect self-comparisons (comparing a file against itself would result in a perfect match). Thus, we use the following simplified notation:

- mrsn(f, BF) compares file f against the Bloom filter BF and returns the longest run.
- aLCS(f, GT) returns the longest aLCS score for f in the ground truth GT.
Table 1
Database size and runtime efficiency of different algorithms.

                 sdhash     mrsh-v2    mrsh-net worst  mrsh-net worst (modified FNV)  mrsh-net best  F2S2      SHA-1
Database size    61.18 MiB  27.33 MiB  1.78 GiB        1.78 GiB                       32 MiB         3.69 MiB  0.24 MiB
Hashing          178 s      53 s       123 s           77 s                           123 s          221 s     24 s
Comparing        1281 s     1259 s     <1 s (a)        <1 s (a)                       <1 s           <1 s      <1 s
Total            1459 s     1312 s     246 s           154 s                          123 s          221 s     24 s

(a) The comparison itself needs less than a second; however, in the worst case S_B has to be hashed as well.

Table 2
Number of files per file type: t5-corpus.

jpg   gif   doc   xls   ppt   html   pdf    txt
362   67    533   250   368   1093   1073   711

Table 3
Estimated runtime for a sample use case.

                 sdhash      mrsh-v2     mrsh-net worst  mrsh-net best
Database size    49.79 GiB   22.22 GiB   1500 GiB        16 GiB
Hashing          329 min     98 min      227 min         227 min
Comparing        3.84 years  3.77 years  32.63 h         <1 min
According to this, we define the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) as follows:

- TP: mrsn(f, BF) ≥ r_min and aLCS(f, GT) ≥ r_min·bs.
- FP: mrsn(f, BF) ≥ r_min and aLCS(f, GT) < r_min·bs.
- TN: mrsn(f, BF) < r_min and aLCS(f, GT) < r_min·bs.
- FN: mrsn(f, BF) < r_min and aLCS(f, GT) ≥ r_min·bs.

where r_min·bs = 6·64 = 384 bytes.
Positives. Our comparison returned 2555 positive matches, with a true positive rate of 99.3% and a false positive rate of 0.7%. Reviewing the false positives, all but one of the longest runs lr do not exceed 9, which means that they are very close to our threshold.

In addition, we studied the distribution of the aLCS scores d_aLCS relative to r_min·bs for the false positives, where

d_aLCS = ⌊100·(1 - aLCS(f, GT)/384)⌋,  d_aLCS ∈ N.

This shows how close the false positives are to the threshold of 384. The results are given in Table 4. For instance, over 60% deviate by at most 30% from the threshold, i.e., their aLCS score is above 269 bytes. To sum it up, although these are false positives, they are close to the threshold.
Next, we consider the relation between the longest run and the aLCS score. In other words, we expect that the longest run lr multiplied by the block size bs is greater than or equal to the aLCS score, i.e., lr·bs ≥ aLCS. According to that, we adapt the configuration from the beginning of this section and change r_min·bs to lr·bs. Thus, the new true positive setting is:

- TP: mrsn(f, BF) ≥ r_min and aLCS(f, GT) ≤ lr·bs.
- ...

In this case, the detection rates worsen and fall to a true positive rate of 92.3% and a false positive rate of 7.7%. Again, we consider the distribution of the aLCS scores in Table 5. As we can see, over 75% vary by less than or equal to 30% and we rate these results as still acceptable.
Negatives. Obviously the negatives are 4457 - 2555 = 1902, which can be broken down into 77.1% true negatives and 22.9% false negatives. Having a closer look at these rather frequent false negatives, we observe that most aLCS matches are based on low entropy sequences. In other words, the high aLCS scores between some files are based on long runs of zeros only, i.e., the entropy of the substring is e = 0, or on runs with a lot of zeros, e.g., the entropy of the substring is 0 < e < 3. Thus, Table 6 shows the impact of considering only aLCS sequences with a higher entropy.

Nevertheless, false negatives are not so relevant here. For instance, with respect to blacklisting, these files remain unclassified and an investigator has to analyze them manually. Hence, false negatives are considered during a further investigation.
Precision & recall rates compared to mrsh-v2
This section compares mrsh-v2 and mrsh-net. As both are based on the same procedure, we expect both implementations to yield similar results. In other words, when comparing a file f against database BF, both algorithms should either output a match or a non-match. Thus, we define the following rates:

- TP: mrsn(f, BF) ≥ r_min and mrsh(f, BF) ≥ 1.
- FP: mrsn(f, BF) ≥ r_min and mrsh(f, BF) = 0.
- TN: mrsn(f, BF) < r_min and mrsh(f, BF) = 0.
- FN: mrsn(f, BF) < r_min and mrsh(f, BF) ≥ 1.
Positives
Regarding the 2555 positive matches from mrsh-net, 92.1% are true positives and are also identified by mrsh-v2. The false positive rate is therefore 7.9%. Comparing the false positives against the aLCS showed that in fact only 3.6% (out of the 7.9%) are really false positives. On the other side, Table 7 shows the distribution of the longest run for the false positives. Most of them are close to the longest run threshold of 8.

To conclude, the results are slightly different; however, mrsh-net shows a finer granularity as in fact these are not false positives but true positives.
Negatives
The negatives yield a 61.8% true negative rate and a 38.2% false negative rate. Recall that false negative means that mrsh-net does not identify a match while mrsh-v2 outputs a score greater than 0; in other words, mrsh-v2 identifies a positive.

Thus, we first compared the mrsh-v2 results against the aLCS. In fact, almost 70% of these matches are based on files that share less than 384 bytes and hence are no false negatives for mrsh-net (our setting aims at having more than 384 bytes).
Table 4
Empirical probability distribution function (pdf) and cumulative distribution function (cdf) for d_aLCS.

X      10      20      30      50      70
pdf    0.1111  0.2778  0.2222  0.1111  0.0556
cdf    0.1111  0.3889  0.6111  0.8889  0.9444

Table 5
Empirical pdf & cdf for d_aLCS for the relation between longest run and aLCS score.

X      10      30      50      70      100
pdf    0.3214  0.2296  0.0714  0.0051  0.0051
cdf    0.3214  0.7551  0.9235  0.9796  1.0000

Table 6
Distribution of false negatives with respect to entropy.

Entropy   >0      >1      >2      >3
TN        78.5%   82.3%   86.4%   91.2%
FN        21.5%   17.7%   13.6%   8.8%

Table 7
Empirical pdf & cdf for the longest run lr.

X      10      15      20      30      50
pdf    0.1095  0.0200  0.0200  0.0200  0.0050
cdf    0.4925  0.7363  0.7960  0.9665  0.9950
Regarding the remaining 30%, most of the matches are again based on a low entropy, e.g., over 75% have e < 3.

To conclude, the algorithms do not coincide very much with respect to negatives.
Conclusion
We have presented and evaluated a new approach to efficiently decide about the similar membership of a file in a given data set and hence solve an important issue in the context of approximate matching. Our approach decreases the lookup complexity from O(x) to O(1), where x is the number of files in the reference data set. We released a sample implementation for a practical evaluation. For the well-known t5-corpus as evaluation file data set, we were able to solve the similar membership problem for all files in the order of seconds (rather than minutes). The drawback is that we are only able to decide about membership, but not about the similarity to a certain file, which is nevertheless sufficient for the important use case of blacklisting.
References
Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM 1970;13:422-6.

Breitinger F, Baier H. Similarity preserving hashing: eligible properties and a new algorithm mrsh-v2. In: Rogers M, Seigfried-Spellar K, editors. Digital forensics and cyber crime. Lecture notes of the institute for computer sciences, social informatics and telecommunications engineering, vol. 114. Berlin Heidelberg: Springer; 2013. pp. 167-82.

Breitinger F, Guttman B, McCarrin M, Roussev V. Approximate matching: definition and terminology. NIST Publication; 2014.

Breitinger F, Liu H, Winter C, Baier H, Rybalchenko A, Steinebach M. Towards a process model for hash functions in digital forensics. In: 5th International Conference on Digital Forensics & Cyber Crime; 2013a.

Breitinger F, Stivaktakis G, Baier H. FRASH: a framework to test algorithms of similarity hashing. In: 13th Digital Forensics Research Conference (DFRWS'13); 2013b. Monterey.

Breitinger F, Stivaktakis G, Roussev V. Evaluating detection error trade-offs for bytewise approximate matching algorithms. In: 5th ICST Conference on Digital Forensics & Cyber Crime (ICDF2C); 2013c.

Broder A, Mitzenmacher M. Network applications of Bloom filters: a survey. Internet Math 2005;1:485-509.

Gallagher P, Director A. Secure Hash Standard (SHS). Technical Report, National Institute of Standards and Technology, Federal Information Processing Standards Publication 180-1; 1995.

Kornblum J. Identifying almost identical files using context triggered piecewise hashing. Digit Investig 2006;3:91-7.

Mullin J. Optimal semijoins for distributed database systems. IEEE Trans Softw Eng 1990;16:558-60.

NIST Information Technology Laboratory. National software reference library. [Link]; 2013.

Noll LC. FNV hash. [Link]; 1994-2012.

Roussev V. Data fingerprinting with similarity digests. In: Chow K-P, Shenoi S, editors. Advances in digital forensics VI. IFIP advances in information and communication technology, vol. 337. Berlin Heidelberg: Springer; 2010. pp. 207-26.

Roussev V. An evaluation of forensic similarity hashes. Digit Investig 2011;8:34-41.

Roussev V. Managing terabyte-scale investigations with similarity digests. In: Peterson G, Shenoi S, editors. Advances in digital forensics VIII. IFIP advances in information and communication technology, vol. 383. Berlin Heidelberg: Springer; 2012. pp. 19-34.

Roussev V, Richard III GG, Marziale L. Multi-resolution similarity hashing. Digit Investig 2007;4:105-13.

Tridgell A. spamsum. [Link]; 2002-2009. Accessed 29.11.13.

Winter C, Schneider M, Yannikos Y. F2S2: fast forensic similarity search through indexing piecewise hash signatures. Digit Investig 2013;10(4):361-71.