Asymmetric Distances For Binary Embeddings∗

∗ Albert Gordo is partially supported by the Spanish projects TIN2008-04998, TIN2009-14633-C03-03, and CONSOLIDER-INGENIO 2010 (CSD2007-00018). This work was done while Albert Gordo was a visitor at XRCE.
Abstract

In large-scale query-by-example retrieval, embedding image signatures in a binary space offers two benefits: data compression and search efficiency. While most embedding algorithms binarize both query and database signatures, it has been noted that this is not strictly a requirement. Indeed, asymmetric schemes which binarize the database signatures but not the query still enjoy the same two benefits but may provide superior accuracy. In this work, we propose two general asymmetric distances which are applicable to a wide variety of embedding techniques including Locality Sensitive Hashing (LSH), Locality Sensitive Binary Codes (LSBC), Spectral Hashing (SH) and Semi-Supervised Hashing (SSH). We experiment on four public benchmarks containing up to 1M images and show that the proposed asymmetric distances consistently lead to large improvements over the symmetric Hamming distance for all binary embedding techniques. We also propose a novel, simple binary embedding technique – PCA Embedding (PCAE) – which is shown to yield competitive results with respect to more complex algorithms such as SH and SSH.

1. Introduction

We are interested in computing efficiently the distance between two image signatures. The application we focus on is query-by-example image retrieval – i.e. given a query image, rank a set of database images from closest to farthest – but our work finds applications in numerous other problems such as nearest neighbor classification, density estimation or clustering.

When dealing with a large number of database images, there are two considerations of paramount importance. The first one is the computational cost: the computation of the distance between two image signatures should rely on efficient operations. The second one is the memory cost: the memory footprint of the objects should be small enough that all database image signatures fit in RAM. If this is not the case, i.e. if a significant portion of the database signatures has to be stored on disk, then the response time of a query collapses because disk access is much slower than RAM access.

To address these two interrelated issues, many binary embedding techniques have been proposed, including among others Locality Sensitive Hashing (LSH) [4, 2], Spectral Hashing (SH) [17], Kernel LSH (KLSH) [8], Locality Sensitive Binary Codes (LSBC) [13] and Semi-Supervised Hashing (SSH) [16]. They consist in transforming the initial signatures into a binary space where the Hamming distance Ha can be employed. Binary embeddings address both the computational and the memory issues: the Hamming distance between two binary signatures can be computed extremely efficiently and the memory footprint is drastically reduced. In all the previous techniques, both the database signatures and the query signature are binarized. For this reason, we will refer to such distance computation algorithms as symmetric.

However, it has been noted that compressing the query signature is not mandatory [3, 6, 1]. Indeed, the additional cost of storing in memory a single non-binarized signature is negligible. Also, the distance between an original signature and a compressed signature can still be computed efficiently through look-up table operations. As the distance is computed in two different spaces, these algorithms are referred to as asymmetric. A major benefit of asymmetric algorithms is that they can achieve higher accuracy for a fixed compression rate because they take advantage of the more precise position information of the query. We note however that the asymmetric algorithms presented in [3, 6, 1] are tied to specific compression schemes: Dong et al. [3] presented an asymmetric algorithm for compression schemes based on random projections, Jégou et al. [6] proposed an asymmetric algorithm for compression schemes based on vector quantization, and Brandt [1] subsequently used a similar idea.
In this work, we generalize these asymmetric schemes to the broad family of binary embeddings. More precisely, our contributions are the following ones:

• We first provide a theoretical analysis of several binary embedding algorithms (including LSH, LSBC, SH and SSH) showing that they can be decomposed into two steps: i) the signatures are first embedded in an intermediate real-valued space and ii) thresholding is performed in this space to obtain binary outputs. A key insight is that the Euclidean distance is a natural metric in the intermediate real-valued space.

• Building on the previous analysis, we propose two asymmetric distances which can be broadly applied to binary embedding algorithms. The first one is an expectation-based technique inspired by [6]. The second one is a lower-bound-based technique which generalizes [3].

• We show experimentally on four datasets of different nature that the proposed asymmetric distances consistently and significantly improve the retrieval accuracy of LSH, LSBC, SH and SSH over the symmetric Hamming distance. Somewhat surprisingly, although the lower-bound and expectation-based techniques are very different in nature, they are shown to yield very similar improvements.

• We provide an experimental comparison of LSH, LSBC, SH and SSH and draw conclusions about the respective merits of these algorithms. Especially, we show that the key ingredient in the success of SH and SSH is the dimensionality reduction. We therefore propose a much simpler PCA-based binary embedding technique, called PCAE (for PCA Embedding), which is shown to perform comparably.

2. Binary Embeddings

Let h : Ω → {0, 1}^K denote a binary embedding of the input space Ω, with h(x) = [h_1(x), ..., h_K(x)]′. We show that for LSH, LSBC, SH, SSH and PCAE the functions h_k can be decomposed as follows:

    h_k(x) = q_k[g_k(x)],    (1)

where g_k : Ω → R and q_k : R → {0, 1}. We denote g : Ω → R^K with g(x) = [g_1(x), ..., g_K(x)]′. If we have two image signatures x and y, we also show that the Euclidean distance is the natural metric between g(x) and g(y).

2.1. Locality Sensitive Hashing (LSH)

In LSH, the functions h_k are called hash functions and are selected to approximate a similarity function sim in the original space Ω = R^D. Valid hash functions h_k must satisfy the LSH property:

    Pr[h_k(x) = h_k(y)] = sim(x, y).    (2)

Here we focus on the case where sim is the cosine similarity, for which a suitable hash function is [2]:

    h_k(x) = σ(r_k′ x),    (3)

with

    σ(u) = 0 if u < 0,  σ(u) = 1 if u ≥ 0.    (4)

The vectors r_k ∈ R^D are drawn from a multi-dimensional Gaussian distribution p with zero mean and identity covariance matrix I_D. We therefore have q_k(u) = σ(u) and g_k(x) = r_k′ x. In such a case, the natural distance between g(x) and g(y) in the intermediate space is the Euclidean distance, as random Gaussian projections preserve the Euclidean distance in expectation. This property stems from the following equality:

    E_{r_k ∼ p}[(g_k(x) − g_k(y))²] = ||x − y||².    (5)

We also note that, since the r_k have zero mean, mean-centering the data can positively impact LSH in practice (see section 4.2).
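To make the decomposition concrete, the following NumPy sketch implements the LSH pipeline above: g(x) = [r_1′x, ..., r_K′x] followed by the step function σ. This is an illustrative sketch (function names are ours), not the code used in our experiments.

    import numpy as np

    def lsh_embed(X, K, seed=0):
        """LSH sketch: rows of X are signatures in R^D.
        Returns the intermediate embedding G = g(X) and the codes H = h(X)."""
        rng = np.random.default_rng(seed)
        D = X.shape[1]
        R = rng.standard_normal((K, D))  # the r_k ~ N(0, I_D), one per row
        G = X @ R.T                      # g_k(x) = r_k' x
        H = (G >= 0).astype(np.uint8)    # q_k(u) = sigma(u)
        return G, H

A symmetric scheme keeps only H for both query and database; the asymmetric distances of section 3 additionally keep G on the query side.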
We note that LSH was extended to the kernel case (KLSH) [8], where the similarity is a dot product in another space using a non-linear mapping φ(x) of the input vectors. This enables extending LSH, as well as the asymmetric distance computations, beyond vectorial representations.

2.2. Locality-Sensitive Binary Codes (LSBC)

Consider a Mercer kernel K : R^D × R^D → R that satisfies the following properties for all points x and y:

1. It is translation invariant: K(x, y) = K(x − y).

2. It is normalized: i.e., K(x − y) ≤ 1 and K(0) = 1.

3. ∀α ∈ R : α ≥ 1, K(αx − αy) ≤ K(x, y).

Two well-known examples are the Gaussian kernel and the Laplacian kernel. Raginsky and Lazebnik [13] showed that, for such kernels, K(x, y) can be approximated by 1 − Ha(h(x), h(y)) with h_k(x) = q_k(g_k(x)) where:

    g_k(x) = cos(w_k′ x + b_k),    (6)
    q_k(u) = σ(u − t_k).    (7)

b_k and t_k are random values drawn respectively from unif[0, 2π] and unif[−1, +1]. The vectors w_k are drawn from a distribution p_K which depends on the particular choice of the kernel K. For instance, if K is the Gaussian kernel with bandwidth γ, then p_K is a Gaussian distribution with mean zero and covariance matrix γI_D. As the size of the binary embedding space increases, 1 − Ha(h(x), h(y)) is guaranteed to converge to K(x, y).

We know from Rahimi and Recht [14] that g(x)′g(y) is guaranteed to converge to K(x, y), since E_{w_k ∼ p_K}[g_k(x)g_k(y)] = K(x, y), and therefore that the embedding g preserves the dot-product in expectation. Since K(x, x) = K(x − x) = K(0) = 1, the norm ||g(x)||² is also guaranteed to converge to 1, ∀x. In such a case, the dot-product is equivalent to the Euclidean distance, and the Euclidean distance is preserved in the intermediate space.
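A minimal sketch of the LSBC embedding (6)-(7) for the Gaussian kernel case follows; the scaling of the w_k by √γ matches the zero-mean, γI_D-covariance distribution described above (function names are ours).

    import numpy as np

    def lsbc_embed(X, K, gamma, seed=0):
        """LSBC sketch (Gaussian kernel): g_k(x) = cos(w_k' x + b_k),
        h_k(x) = sigma(g_k(x) - t_k)."""
        rng = np.random.default_rng(seed)
        D = X.shape[1]
        W = rng.standard_normal((K, D)) * np.sqrt(gamma)  # w_k ~ N(0, gamma I_D)
        b = rng.uniform(0.0, 2.0 * np.pi, size=K)         # b_k ~ unif[0, 2pi]
        t = rng.uniform(-1.0, 1.0, size=K)                # t_k ~ unif[-1, +1]
        G = np.cos(X @ W.T + b)                           # intermediate embedding
        H = (G >= t).astype(np.uint8)                     # thresholded codes
        return G, H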
2.3. Spectral Hashing (SH)

Given a similarity sim between objects in Ω = R^D, and assuming that the distribution of objects in Ω may be described by a probability density function p, SH [17] attempts to minimize the following objective function with respect to h:

    ∫_{x,y} ||h(x) − h(y)||² sim(x, y) p(x) p(y) dx dy,    (8)

subject to the following constraints:

    h(x) ∈ {−1, 1}^K,    (9)
    ∫_x h(x) p(x) dx = 0,    (10)
    ∫_x h(x) h(x)′ p(x) dx = I_K.    (11)

As optimizing (8) under constraint (9) is an NP-hard problem (even in the case where K = 1, i.e. of a single bit), Weiss et al. propose to optimize a relaxed version of the problem, i.e. to remove constraint (9), and then to binarize the real-valued output at 0. This is equivalent to minimizing:

    ∫_{x,y} ||g(x) − g(y)||² sim(x, y) p(x) p(y) dx dy,    (12)

with respect to g and then writing h(x) = 2σ(g(x)) − 1, i.e. q(u) = 2σ(u) − 1. The solutions to the relaxed problem are eigenfunctions of the weighted Laplace-Beltrami operators, for which there exists a closed-form formula in certain cases, e.g. when sim is the Gaussian kernel and p is separable and uniform. To satisfy, at least approximately, the separability condition, a PCA is first performed on the input vectors.

Minimizing the objective function (12) enforces the Euclidean distance between g(x) and g(y) to be (inversely) correlated with the Gaussian kernel, and therefore with the Euclidean distance, in the original space. This shows that, in the case of SH, the natural measure in the intermediate space is the Euclidean distance.
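The following sketch gives a simplified reading of the closed-form recipe of Weiss et al. [17] for the uniform separable case: the one-dimensional eigenfunctions sin(π/2 + jπx/(b − a)) on [a, b] and eigenvalues 1 − exp(−ε²(jπ/(b − a))²/2) follow their paper, but normalization details may differ from the authors' released code, and function names and the eps parameter are our assumptions.

    import numpy as np

    def sh_train(X, K, eps=1.0):
        """Simplified SH training sketch: PCA, per-direction box [a_k, b_k],
        then keep the K one-dimensional modes with smallest eigenvalue."""
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        P = Vt[:K].T                         # (D, K) PCA directions
        Z = (X - mu) @ P
        a, b = Z.min(axis=0), Z.max(axis=0)  # box fitted to each direction
        modes = []
        for k in range(K):
            for j in range(1, K + 1):        # mode j on PCA direction k
                om = j * np.pi / (b[k] - a[k])
                lam = 1.0 - np.exp(-0.5 * (eps * om) ** 2)
                modes.append((lam, k, om))
        modes.sort()                         # K smallest analytical eigenvalues
        return mu, P, a, modes[:K]

    def sh_embed(X, mu, P, a, modes):
        """g: eigenfunction values; h: binarization at 0."""
        Z = (X - mu) @ P
        G = np.stack([np.sin(np.pi / 2 + om * (Z[:, k] - a[k]))
                      for (_, k, om) in modes], axis=1)
        return G, (G >= 0).astype(np.uint8)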
2.4. Semi-Supervised Hashing (SSH)

Let {x_i, i = 1 ... N} be a set of points in Ω = R^D. SSH [16] assumes the availability of supervisory information in the form of labeled matching and non-matching pairs. Using the notations of [16], we have two sets M and C. A pair (x_i, x_j) ∈ M if x_i and x_j are considered similar (neighbor pair) and (x_i, x_j) ∈ C if x_i and x_j are considered dissimilar (non-neighbor pair). The functions h_k are of the form:

    h_k(x_i) = sign(w_k′ x_i),    (13)

i.e. q(u) = 2σ(u) − 1 and g_k(x) = w_k′ x. The goal is to learn the set W = {w_k, k = 1 ... K} which maximizes the following objective function:

    J(W) = Σ_{(x_i,x_j) ∈ M} h(x_i)′ h(x_j) − Σ_{(x_i,x_j) ∈ C} h(x_i)′ h(x_j),    (14)

subject to the constraints:

    Σ_{i=1}^{N} h_k(x_i) = 0, ∀k,    (15)
    (1/N) Σ_{i=1}^{N} h(x_i) h(x_i)′ = I_K.    (16)

This is an NP-hard problem, and Wang et al. propose to relax the binary assumption both in the objective function and in the constraints. This leads to the following objective function:

    J(W) = Σ_{(x_i,x_j) ∈ M} g(x_i)′ g(x_j) − Σ_{(x_i,x_j) ∈ C} g(x_i)′ g(x_j) + ηR(W),    (17)
where R(W) is a regularization function which arises from the relaxation of constraint (15), and η strikes a balance between the two terms. We observed experimentally on several datasets that the value of η had little impact on the accuracy. We note that in the extreme case where η → ∞, SSH simply performs a PCA (see [16]). This shows that, for large η values, SSH approximately preserves the Euclidean distance in the intermediate space.
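Under our reading of [16], the relaxed, orthogonality-constrained solution takes W as the top-K eigenvectors of an "adjusted covariance" matrix combining the supervised term and the unsupervised (PCA-like) term; the sketch below makes that assumption explicit (the sequential variant of Wang et al. is not reproduced here, and function names are ours).

    import numpy as np

    def ssh_train(X, pairs_M, pairs_C, K, eta=5.0):
        """Relaxed SSH sketch. X: (N, D) zero-mean data;
        pairs_M / pairs_C: lists of (i, j) neighbor / non-neighbor pairs.
        Assumption: the W maximizing the relaxed (17) is given by the top-K
        eigenvectors of X' S X + eta X' X, with S_ij = +1 on M, -1 on C."""
        N = X.shape[0]
        S = np.zeros((N, N))
        for i, j in pairs_M:
            S[i, j] = S[j, i] = 1.0
        for i, j in pairs_C:
            S[i, j] = S[j, i] = -1.0
        M_adj = X.T @ S @ X + eta * (X.T @ X)      # adjusted covariance
        evals, evecs = np.linalg.eigh(M_adj)
        W = evecs[:, np.argsort(evals)[::-1][:K]]  # top-K eigenvectors
        return W                                   # g(x) = W' x, h = sign(g)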
2.5. PCA Embedding (PCAE)

We also propose the following very simple embedding:

    g_k(x) = w_k′ (x − µ),    (18)
    q_k(u) = σ(u),    (19)

where µ is the data mean and w_k is the eigenvector associated with the k-th largest eigenvalue of the data correlation matrix. The previous equations can be trivially extended to kernel PCA (kPCA), and therefore this embedding can be easily extended to non-vectorial signatures. This embedding function is interesting as it can be related to SH and SSH. In the case of SH, the PCA is the first step of the embedding (to ensure the approximate separability of the data). In the case of SSH, this corresponds to the limit case η → ∞. We will show experimentally that, despite its simplicity, this embedding technique yields competitive results. In the case of PCAE, the Euclidean distance is also the natural distance in the intermediate space, as the PCA projections approximately preserve the Euclidean distance.
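The whole of PCAE fits in a few lines; this sketch (function names ours) obtains the leading eigenvectors from an SVD of the mean-centered data.

    import numpy as np

    def pcae_train(X, K):
        """PCAE training: the data mean mu and the K leading PCA directions."""
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:K].T                  # (D, K), columns are the w_k

    def pcae_embed(X, mu, W):
        """g_k(x) = w_k'(x - mu), h_k(x) = sigma(g_k(x))."""
        G = (X - mu) @ W                     # intermediate embedding
        H = (G >= 0).astype(np.uint8)        # binarization at 0
        return G, H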
3. Proposed Asymmetric Distances

In the previous section we decomposed several binary embedding algorithms as h_k(x) = q_k[g_k(x)] and argued that the Euclidean distance is a natural metric between the intermediate representations g(x) and g(y). Since the database signatures are only available in binarized form h(y), we seek to approximate the distance²

    d(g(x), g(y)) = Σ_k d(g_k(x), g_k(y)),    (20)

using g(x) on the query side but only h(y) on the database side.

3.1. Expectation-Based Asymmetric Distance

In [6], Jégou et al. proposed an asymmetric algorithm for compression schemes based on vector quantization. A codebook is learned through k-means clustering and a database vector is encoded by the index of its closest centroid in the codebook. The distance between the uncompressed query and a quantized database signature is simply computed as the Euclidean distance between the query and the corresponding centroid.

We now adapt this idea to binary embeddings. We note that in the case of LSH, LSBC, SH, SSH or PCAE we have no notion of centroid. However, we note that the centroid of a given cell in k-means clustering can be interpreted as the expected value of the vectors assigned to this particular cell. Similarly, we propose an asymmetric expectation-based approximation d_E for binary embeddings. Assuming that the samples in the original space Ω are drawn from a distribution p, we define:

    d_E(x, y) = Σ_k d(g_k(x), E_{u∼p}[g_k(u) | h_k(u) = h_k(y)]).    (21)

d(g_k(x), E_{u∼p}[g_k(u) | h_k(u) = h_k(y)]) is the distance in the intermediate space between g_k(x) and the expected value of the samples g_k(u) such that h_k(u) = h_k(y).

Since we generally do not have access to the distribution p, the expectation operator is approximated by a sample average. In practice, we randomly draw a set of signatures S = {x_i, i = 1 ... N} from Ω. For each dimension k of the embedding, we partition S into two subsets: S_k^0 contains the signatures x_i such that h_k(x_i) = 0, and S_k^1 those signatures x_i that satisfy h_k(x_i) = 1. We compute offline the following query-independent values:

    α_k^0 = (1 / |S_k^0|) Σ_{u ∈ S_k^0} g_k(u),    (22)
    α_k^1 = (1 / |S_k^1|) Σ_{u ∈ S_k^1} g_k(u).    (23)

Online, for a given query signature x, we pre-compute the values:

    β_k^0 = d(g_k(x), α_k^0),    (24)
    β_k^1 = d(g_k(x), α_k^1),    (25)

so that the asymmetric distance reduces to a sum of look-ups:

    d_E(x, y) = Σ_k β_k^{h_k(y)}.    (26)

The cost of pre-computing the β values is negligible with respect to the cost of computing many d_E(x, y)'s for a large number of database signatures y. Note that the sum (26) can be efficiently computed, e.g. by grouping the dimensions in blocks of 8 bits and having one 256-dimensional look-up table per block (rather than one 2-dimensional look-up table per dimension). This reduces the number of summations as well as the number of accesses to memory.

² We use the following abuse of notation for simplicity: d denotes both the distance between the vectors g(x) and g(y), i.e. d : R^K × R^K → R, and the distance between the individual dimensions, i.e. d : R × R → R.
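The following sketch implements (22)-(26) with d taken to be the squared difference (an assumption consistent with the squared Euclidean distance in the intermediate space); the 8-bit block packing is omitted for clarity, and function names are ours.

    import numpy as np

    def de_precompute_alpha(G_train, H_train):
        """Offline: per-dimension means alpha_k^0 / alpha_k^1 of g_k(u)
        over the training samples whose bit h_k(u) is 0 / 1."""
        B = H_train.astype(float)
        n1 = B.sum(axis=0)
        alpha1 = (G_train * B).sum(axis=0) / n1
        alpha0 = (G_train * (1.0 - B)).sum(axis=0) / (len(B) - n1)
        return alpha0, alpha1

    def de_distances(g_query, H_db, alpha0, alpha1):
        """Online: beta tables for one query, then
        d_E(x, y) = sum_k beta_k^{h_k(y)} over all database codes."""
        beta0 = (g_query - alpha0) ** 2     # d(g_k(x), alpha_k^0)
        beta1 = (g_query - alpha1) ** 2     # d(g_k(x), alpha_k^1)
        return np.where(H_db.astype(bool), beta1, beta0).sum(axis=1)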
3.2. Lower-Bound-Based Asymmetric Distance

In [3], Dong et al. proposed an asymmetric algorithm for binary embeddings based on random projections. We now generalize this idea and show that a similar approach can be applied to a much wider range of binary embedding techniques. For simplicity of presentation, we assume that q_k has the form q_k(u) = σ(u − t_k), where t_k is a threshold, but this can be trivially generalized to other quantization functions.

The idea is to lower-bound the quantity (20) by bounding each of its terms. We note that t_k splits R into two half-lines and consider two cases:

• If h_k(x) ≠ h_k(y), i.e. g_k(x) and g_k(y) are on different sides of t_k, then a lower-bound on the distance between g_k(x) and g_k(y) is the distance between g_k(x) and the threshold t_k, i.e. d(g_k(x), g_k(y)) ≥ d(g_k(x), t_k).

• If h_k(x) = h_k(y), i.e. g_k(x) and g_k(y) are on the same half-line, then we have the following obvious lower-bound: d(g_k(x), g_k(y)) ≥ 0 (actually, this bound is always true).

Merging the two cases in a single equation, we have the following lower-bound on d(g(x), g(y)):

    d_LB(x, y) = Σ_k δ̄_{h_k(x), h_k(y)} d(g_k(x), t_k),    (27)

where δ̄_{i,j} is the negation of the Kronecker delta, i.e. δ̄_{i,j} = 0 if i = j and 1 otherwise. We note that, in the case of LSH, equation (27) is equivalent to the asymmetric LSH distance proposed in [3].

In practice, for a given query signature x, we can pre-compute online the values:

    γ_k^0 = δ̄_{h_k(x), 0} d(g_k(x), t_k),    (28)
    γ_k^1 = δ̄_{h_k(x), 1} d(g_k(x), t_k).    (29)

We note that one of these two values is guaranteed to be 0 for each dimension k. We can subsequently pack these values in blocks of 8 dimensions for faster computation, as was the case for the expectation-based approximation. By definition we have:

    d_LB(x, y) = Σ_k γ_k^{h_k(y)}.    (30)

A major difference between d_E and d_LB is that the former makes use of the data distribution while the latter does not. Despite its very crude nature, we will see that d_LB leads to excellent results on a variety of binary embedding algorithms.
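A sketch of (27)-(30) along the same lines, again with d the squared difference; the thresholds t_k default to 0, which covers LSH, SSH and PCAE (function names are ours).

    import numpy as np

    def dlb_distances(g_query, H_db, t=None):
        """Lower-bound asymmetric distance d_LB.
        g_query: (K,) intermediate query embedding; H_db: (n, K) binary codes."""
        t = np.zeros_like(g_query) if t is None else t
        h_query = g_query >= t                  # query bits h_k(x)
        cost = (g_query - t) ** 2               # d(g_k(x), t_k)
        gamma0 = np.where(h_query, cost, 0.0)   # paid when h_k(y) = 0 differs
        gamma1 = np.where(~h_query, cost, 0.0)  # paid when h_k(y) = 1 differs
        return np.where(H_db.astype(bool), gamma1, gamma0).sum(axis=1)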
4. Experimental Validation

We now show the benefits of the asymmetric distances proposed in section 3 on the binary embeddings we reviewed in section 2. We first describe the four public benchmarks we experiment on. We then provide implementation details for the different embedding algorithms. We finally report and discuss results.

4.1. Datasets

We run experiments on four public benchmarks: LabelMe, MNIST, the University of Kentucky Benchmark (UKB) and INRIA Holidays. We describe these datasets as well as the associated standard experimental protocols.

LabelMe. The LabelMe dataset [15] contains more than 180,000 scene and object images. We follow the protocol of [13] and use the same subset of 14,871 images; 1,000 are used as queries and the remaining 13,781 as database images. Images are described using a 320-dimensional GIST descriptor [10]. A nominal threshold of the average distance to the 50th nearest neighbor is used to determine whether a database image returned for a given query is considered a true positive [13]. We note that the number of true positives varies widely from one query to another (from 0 to 2,353). We compute the Average Precision (AP) for each query (with a non-zero number of true positives) and report the mean over the 1,000 queries (Mean AP or MAP).

MNIST. The MNIST dataset³ consists of 70,000 handwritten digits of size 28 × 28 pixels. The dataset is partitioned into a training set of 60,000 samples and a test set of 10,000 samples. Following the protocol of [16], we use the pixel values directly as features, resulting in 784-dimensional image signatures. We randomly select 1,000 test samples as queries, and all 60,000 training samples are used as database images. Accuracy is reported in terms of the proportion of relevant images in the top 500 retrieved images, i.e. precision@500.

³ http://yann.lecun.com/exdb/mnist/

UKB. The University of Kentucky Benchmark (UKB) [9] contains 10,200 images of 2,550 objects (4 images per object). Each image is used in turn as a query to search through the 10,200 images. The accuracy is measured in terms of the number of relevant images retrieved in the top 4, i.e. 4 × recall@4. We use the same low-level feature detection and description procedure as in [5] (Hessian-Affine extractor and SIFT descriptor) since the code is available online. We reduce the dimensionality of these features from 128 to 64 dimensions with PCA. To aggregate these local feature vectors into a single image-level signature, we use the Fisher Vector (FV) framework [11], which was recently shown to yield excellent results for object and scene retrieval [12, 7]. We use probabilistic codebooks (i.e. GMMs) with 64 Gaussians, so that an image is described by a 64 × 64 = 4,096-dimensional FV.
For learning purposes (e.g. to learn the PCA on the SIFT descriptors and the GMM for the FV), we use an additional set of 60,000 images (Flickr60K) made available by the authors of [5].

Holidays. The INRIA Holidays dataset [5] contains 1,491 images of 500 scenes and objects. The first image of each scene is used as a query to search through the remaining 1,490 images. We measure the accuracy for each query using AP and report the MAP over the 500 queries. As was the case for UKB, images are described using 4,096-dimensional FVs.

4.2. Implementation Details

To learn the parameters of the embedding functions and to compute offline the α values for d_E, we need training data. For LabelMe and MNIST we use respectively the 13,781 and 60,000 database images. For UKB and Holidays we use the Flickr60K dataset [5].
For LSH and LSBC, which perform binarization through random projections, experiments are repeated 5 times with 5 different projection matrices and we report the average results.

For LSBC, the γ bandwidth parameter is set to the inverse of the mean distance between the descriptors in the training set. This heuristic provided good results in practice (i.e. hand-tuning this parameter did not lead to major improvements). We are not aware of any principled approach to set this parameter automatically.

For SSH, the η parameter, which controls the relative weight of the unsupervised vs the supervised information during learning, is set to 5. We observed that this parameter has, in general, a very limited impact on the final accuracy. Again, we are not aware of any principled approach to set this parameter automatically.

SSH also requires neighboring relationships between training samples. This is straightforward to derive for MNIST, where digit labels are available. For the other datasets, we follow a procedure similar to that of [16]: we sort the distances between all pairs of images in the training set and use the distances at the 5th and 95th percentiles as thresholds to decide whether two images are neighbors (distance in the smallest 5%), not neighbors (distance in the largest 5%) or unrelated (distance between the smallest and largest 5%), as sketched below.
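A minimal sketch of this pseudo-labeling rule (using Euclidean distances on a training subsample; only the 5%/95% cut-offs come from the text above, and function names are ours):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def pseudo_pairs(X_train, low=5.0, high=95.0):
        """Build neighbor / non-neighbor pairs from distance percentiles."""
        D = squareform(pdist(X_train))              # all pairwise distances
        iu = np.triu_indices_from(D, k=1)           # each pair counted once
        t_low, t_high = np.percentile(D[iu], [low, high])
        pairs_M = [(i, j) for i, j in zip(*iu) if D[i, j] <= t_low]
        pairs_C = [(i, j) for i, j in zip(*iu) if D[i, j] >= t_high]
        return pairs_M, pairs_C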
SSH requires the data to have zero mean [16] and, as discussed in section 2.1, mean-centering the data can very positively impact LSH. Therefore, the LabelMe and MNIST signatures (respectively GIST and pixel values) are mean-centered. This is not necessary for FVs (i.e. for UKB and Holidays) since, by definition, they are already (approximately) zero-mean. We note that centering the data on the origin does not impact SH and PCAE (which perform a PCA of the signatures) or LSBC (which is shift-invariant).

Table 1. mAP results (in %) as a function of the number of bits on the LabelMe dataset. The uncompressed baseline is 100%. For SSH and PCAE, the number of embedding dimensions is limited by the dimensionality of the GIST vectors (320).

  dimensions      16    32    64   128   256   512
  PCA no bin    66.8  87.4  97.3  99.6   100     -
  LSH   Ha      11.5  16.9  24.1  32.2  40.3  46.5
        d_E     13.9  20.7  29.1  37.2  44.4  49.2
        d_LB    13.6  20.3  28.5  36.9  44.3  49.4
  LSBC  Ha       7.5  11.9  19.4  29.7  44.2  59.8
        d_E      9.4  16.4  26.8  39.8  55.1  68.5
        d_LB     8.4  13.9  22.6  34.2  49.5  65.1
  SH    Ha      10.4  12.0  16.8  23.1  24.6  18.3
        d_E     16.7  20.7  29.7  39.6  37.5  22.9
        d_LB    14.4  17.6  28.8  42.1  41.0  24.2
  SSH   Ha       9.6   8.9   8.7   7.7   6.4     -
        d_E     16.8  19.6  21.4  21.8  21.9     -
        d_LB    16.6  19.3  21.1  21.6  21.9     -
  PCAE  Ha       9.8   9.1   8.6   7.7   6.2     -
        d_E     17.5  20.2  21.9  22.4  22.5     -
        d_LB    17.2  19.4  21.1  21.8  21.9     -

Table 2. Precision@500 (in %) as a function of the number of bits on the MNIST dataset. The uncompressed baseline is 72.0%.

  dimensions       8    16    24    32    40    48
  PCA no bin    69.0  75.2  76.2  75.9  75.3  75.0
  LSH   Ha      25.9  37.3  43.6  48.8  52.8  55.7
        d_E     27.7  42.9  49.8  55.2  59.0  61.7
        d_LB    27.7  42.8  49.4  54.6  58.4  60.9
  LSBC  Ha      17.7  24.8  28.1  31.5  34.7  37.3
        d_E     19.6  29.4  34.2  39.1  43.6  46.4
        d_LB    17.9  26.0  29.9  33.6  37.1  39.7
  SH    Ha      44.9  52.5  55.9  58.7  59.8  59.8
        d_E     48.9  62.3  66.6  69.1  70.4  70.0
        d_LB    47.9  60.2  64.3  66.8  67.6  67.5
  SSH   Ha      54.2  55.8  57.2  57.0  56.4  55.5
        d_E     58.1  65.3  68.1  69.0  69.5  69.6
        d_LB    58.1  65.0  67.5  68.1  68.4  68.4
  PCAE  Ha      47.4  53.6  56.7  57.4  57.0  56.2
        d_E     52.3  64.3  68.0  69.4  69.8  70.0
        d_LB    51.7  64.3  67.7  68.9  69.2  69.4

4.3. Results and Analysis

We report results on the four datasets in Tables 1 to 4. We provide two baselines: i) the uncompressed baseline, which computes the Euclidean distance between the original vectors, and ii) the PCA baseline, which computes the Euclidean distance between the vectors after PCA but with no binarization (referred to as "PCA no bin" in the tables). The PCA baseline is important for two reasons: PCA is an integral component of several embedding techniques (SH, SSH and PCAE), and this operation often improves the retrieval accuracy. For the five binary embedding algorithms discussed in section 2 and for each of the four datasets, we provide results for the symmetric Hamming distance Ha and the proposed asymmetric distances d_E and d_LB. We can draw the following conclusions on the proposed asymmetric distances.

Asymmetric vs symmetric. The use of asymmetric distances consistently improves the results over the symmetric Hamming distance, independently of the dataset and of the binary embedding technique. The gain in accuracy is often impressive, both in terms of absolute and relative improvement. For instance, at 128 bits, the improvement of LSBC on LabelMe is approximately 10% absolute and 30% relative, and the improvement of PCAE on Holidays is approximately 10% absolute and 25% relative.
Table 3. 4 × recall@4 as a function of the number of bits on the UKB dataset. The uncompressed baseline is 3.35.

  dimensions      16    32    64   128   256   512  1024
  PCA no bin    2.71  3.02  3.19  3.31  3.39  3.44  3.35
  LSH   Ha      1.04  1.16  1.44  1.90  2.39  2.77  3.01
        d_E     1.06  1.28  1.70  2.21  2.64  2.94  3.12
        d_LB    1.06  1.26  1.65  2.14  2.58  2.90  3.10
  LSBC  Ha      0.40  1.04  1.13  1.33  1.66  2.10  2.51
        d_E     0.27  1.03  1.27  1.63  2.06  2.49  2.82
        d_LB    0.41  1.05  1.17  1.42  1.80  2.25  2.63
  SH    Ha      1.15  1.72  2.15  2.54  2.82  3.00  3.15
        d_E     1.14  2.06  2.53  2.90  3.13  3.25  3.33
        d_LB    1.25  2.00  2.46  2.85  3.09  3.23  3.31
  SSH   Ha      1.32  1.83  2.25  2.56  2.79  2.95  3.09
        d_E     1.40  2.18  2.63  2.91  3.08  3.20  3.29
        d_LB    1.47  2.19  2.63  2.90  3.07  3.19  3.29
  PCAE  Ha      1.31  1.82  2.24  2.56  2.78  2.94  3.09
        d_E     1.38  2.17  2.62  2.90  3.08  3.20  3.29
        d_LB    1.45  2.18  2.62  2.89  3.06  3.18  3.28

Table 4. mAP results (in %) as a function of the number of bits on the Holidays dataset. The uncompressed baseline is 59.7%.

  dimensions      16    32    64   128   256   512  1024
  PCA no bin    40.8  47.2  52.0  55.8  57.9  60.0  59.7
  LSH   Ha       2.8   5.6  12.9  22.3  33.2  42.2  49.8
        d_E      3.7   9.4  18.5  29.1  39.0  47.0  52.3
        d_LB     3.7   8.8  17.1  27.2  37.5  46.0  51.8
  LSBC  Ha       0.9   2.2   4.5   8.7  17.0  25.3  35.7
        d_E      1.2   3.5   8.0  15.2  25.6  34.9  43.8
        d_LB     1.0   2.7   5.7  11.2  19.3  28.6  38.5
  SH    Ha      12.6  23.4  33.0  40.4  45.3  48.3  52.4
        d_E     18.2  31.3  39.1  47.9  51.6  53.6  56.8
        d_LB    17.8  29.8  38.5  46.0  50.6  53.5  56.3
  SSH   Ha      12.9  23.7  35.4  39.6  44.9  49.3  53.4
        d_E     19.2  33.4  43.5  49.9  54.4  58.3  60.8
        d_LB    18.8  33.4  43.7  49.7  53.0  57.7  60.7
  PCAE  Ha      13.8  24.1  34.2  38.0  44.7  47.3  49.8
        d_E     18.7  34.0  42.5  48.7  53.0  57.8  59.7
        d_LB    18.4  35.7  42.9  48.3  52.7  57.4  59.7
Expectation vs lower-bound. For almost all embedding techniques, d_E and d_LB yield very similar results, with a slight advantage for the expectation-based algorithm. This is somewhat surprising given that the two approximations are very different in nature. The advantage of d_E over d_LB comes from the fact that the former approach uses information about the data distribution (through the pre-computed values α) while the latter does not. The only exception is LSBC, where d_E performs significantly better than d_LB. We are still investigating this difference, but we observed that the distributions of the values g_k(x) in the intermediate real-valued space for LSBC are significantly different from those observed for the other embedding methods (typically U-shaped as opposed to Gaussian-shaped). Note that preliminary experiments on fusing d_E and d_LB yielded only marginal improvements.

We also draw the following conclusions with respect to the different binary embedding techniques:

PCA-based vs random-based. PCA-based approaches (SH, SSH and PCAE) clearly outperform random-based approaches (LSH and LSBC), especially for a small number of bits, except on LabelMe. This shows that one can expect better results by exploiting the data structure than by relying on "chance". We underline that LSH and LSBC still enjoy an important property: the guarantee that with more bits the Hamming distance converges to the true distance.

LSH vs LSBC. LSBC is the best performing method on the LabelMe dataset but is significantly outperformed by LSH (as well as all other embedding techniques) on MNIST, UKB and Holidays. This shows that LSBC is appropriate when the goal is to retrieve database points which are contained in a ball centered on a given query point. This is not surprising given that LSBC approximates the Gaussian kernel and that the iso-density lines for this kernel are hyper-spheres. On the other hand, LSBC seems less appropriate for the general k-nearest neighbors problem, since the distance to the k-th neighbor may vary significantly from one query to another.

SH vs SSH vs PCAE. The three PCA-based embedding techniques perform fairly similarly on all four datasets. However, we observe two noticeable exceptions:

• SH performs significantly better on LabelMe. This might be explained by the fact that SH approximates the Gaussian kernel and is therefore very similar in spirit and in practice to LSBC (as noted in section 3 of [13]). Therefore, SH might be more appropriate for ball search problems (as is the case for LSBC) than for k-nearest neighbor problems.

• SSH performs significantly better on MNIST for small operating points (8 and 16 bits). This shows that SSH has an advantage when "real" labels are provided, but that "creating" neighboring relationships from unlabeled data brings little additional information.

These results show that a crucial factor in the success of SH and SSH is the PCA dimensionality reduction (done explicitly in the case of SH and implicitly through the regularization in the case of SSH) and that a conceptually simpler technique such as PCAE can often perform comparably.

4.4. Large-Scale Experiments

We now show that our good results scale to large datasets. For these experiments, we merge Holidays with a set of 1M distractor Flickr images made available by the authors of [5]. We refer to the combined dataset as Holidays+1M. The same 500 queries are used as for the standard Holidays. We experiment with LSH, since it performs significantly better than LSBC on Holidays, and with PCAE, since it performs comparably to SH and SSH while being much simpler.

We first report in Figure 1 the recall@K for an operating point of 128 bits. For both PCAE and LSH, the asymmetric distances consistently improve the recall@K over the symmetric Hamming distance.
[Figure 1. Large-scale experiments on Holidays+1M using 128-bit signatures. Recall@k as a function of k for PCAE and LSH, with the symmetric (binary) distance and the asymmetric distances d_E and d_LB.]

[Figure 2. Comparison of the proposed asymmetric distances to the best methods of Jégou et al. [7] on Holidays+1M. MAP as a function of the number of bits.]