GRAPH-BASED SEMI-SUPERVISED LEARNING WITH MULTI-LABEL

Zheng-Jun Zha†, Tao Mei‡, Jingdong Wang‡, Zengfu Wang†, Xian-Sheng Hua‡

†University of Science and Technology of China, Hefei, 230027, P. R. China
‡Microsoft Research Asia, Beijing, 100080, P. R. China

ABSTRACT

Conventional graph-based semi-supervised learning methods predominantly focus on the single-label problem. However, in real-world applications it is more common that an example is associated with multiple labels simultaneously. In this paper, we propose a novel graph-based learning framework in the setting of semi-supervised learning with multi-label. The proposed approach is characterized by simultaneously exploiting the inherent correlations among multiple labels and the label consistency over the graph. We apply the proposed framework to video annotation and report superior performance compared to key existing approaches on the TRECVID 2006 corpus.

Index Terms— graph-based learning, multi-label, semi-supervised learning

[Fig. 1. Sample video clips associated with multiple labels: "car, road"; "sky, mountain, water"; "face, person, walking_running"; "people march".]

1. INTRODUCTION

In real-world applications of machine learning, one often faces a lack of sufficient labeled data, since human labeling is a labor-intensive and time-consuming process. In many cases, however, unlabeled data are plentiful. Consequently, semi-supervised learning, which attempts to leverage unlabeled data in addition to labeled data, has attracted much attention. Many different semi-supervised learning techniques have been proposed; an extensive review can be found in [1].

As a major family of semi-supervised learning, graph-based methods have recently attracted significant interest due to their effectiveness and efficiency [2][3][4]. Almost all graph-based methods essentially estimate a function on the graph such that it has two properties: 1) it should be close to the given labels on the labeled examples, and 2) it should be smooth on the whole graph. This can be expressed in a regularization framework where the first term is a loss function that penalizes deviation from the given labels, and the second term is a regularizer that prefers label smoothness. The typical graph-based methods are similar to each other and differ slightly in the loss function and the regularizer. For example, Zhu et al. [3] proposed the Gaussian random field and harmonic function (GRF) method: they defined a quadratic loss function with infinite weight to clamp the labeled examples and formulated the regularizer based on the graph combinatorial Laplacian. In [4], Belkin et al. noted that the p-Laplacian can also be used as a regularizer. Zhou et al. [2] presented the local and global consistency (LGC) method, with a quadratic loss function and the normalized combinatorial Laplacian as the regularizer.

The existing graph-based approaches are mainly limited to single-label problems. However, many real-world tasks are naturally posed as multi-label problems, where an example can be associated with multiple labels simultaneously. For example, in video annotation a video clip can be annotated with multiple labels at the same time, such as "sky," "mountain," and "water" (see Fig. 1). In text categorization, a document can be classified into multiple categories, e.g., a document can belong to "novel," "Jules Verne's writing," and "books on travelling." A direct way to tackle graph-based learning in the multi-label setting is to translate it into a set of independent single-label problems [5]. The drawback of this approach is that it does not take into account the inherent correlations among multiple labels, even though researchers have proven the value of label correlations in various application fields [6][7][8].

To address this issue, we propose a novel graph-based semi-supervised learning framework which can simultaneously explore the correlations among multiple labels and the label consistency over the graph. Specifically, the framework employs two types of regularizers: one prefers label smoothness on the graph, while the other enforces that the multi-label assignment for each example should be consistent with the inherent label correlations. We applied the proposed framework to video annotation and conducted experiments on the TRECVID 2006 development set [9]. We report superior performance compared to key existing graph-based learning approaches.

(This work was performed when Zheng-Jun Zha was visiting Microsoft Research Asia as a research intern.)

978-1-4244-2571-6/08/$25.00 ©2008 IEEE 1321 ICME 2008
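For concreteness, the GRF-style propagation reviewed above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the Gaussian-kernel bandwidth `sigma`, the toy data, and the function name are assumptions.

```python
import numpy as np

def grf_propagate(X, y_l, L, sigma=1.0):
    """Harmonic-function label propagation (GRF-style sketch).

    X: (N, d) data with the first L rows labeled; y_l: (L,) labels in {0, 1}.
    Returns predicted scores for the N - L unlabeled points.
    """
    # Gaussian-kernel affinity matrix W with zero diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    Delta = D - W                      # combinatorial graph Laplacian
    # Harmonic solution: solve Delta_uu f_u = W_ul y_l for the unlabeled block.
    f_u = np.linalg.solve(Delta[L:, L:], W[L:, :L] @ y_l)
    return f_u

# Toy example: two tight clusters, one labeled point in each.
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.0, 5.1]])
y_l = np.array([0.0, 1.0])            # labels for the first L = 2 rows
print(grf_propagate(X, y_l, L=2))     # scores near 0 and 1, respectively
```

The clamped labeled points act as boundary conditions, and the harmonic solution spreads their labels along high-affinity edges.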


The rest of this paper is organized as follows. Section 2 gives a short introduction to the existing graph-based semi-supervised learning framework. Section 3 provides a detailed description of the proposed framework. Experimental results on the TRECVID 2006 data set are reported in Section 4, followed by concluding remarks in Section 5.

2. GRAPH-BASED SEMI-SUPERVISED LEARNING WITH SINGLE LABEL

Graph-based semi-supervised methods define a weighted graph in which the nodes correspond to labeled and unlabeled data points and the edges reflect the similarities between data points. The existing algorithms are all based on the common assumption that the labels are smooth on the graph. They essentially estimate a function over the graph that satisfies two conditions: 1) it should be close to the given labels, and 2) it should be smooth on the whole graph. Generally, these two conditions are expressed in a regularization form.

Let X = {x_1, x_2, ..., x_N} be a set of N data points in R^d. The first L points are labeled as y_l = [y_1, y_2, ..., y_L]^T with y_i ∈ {0, 1} (1 ≤ i ≤ L), and the task is to label the remaining points {x_{L+1}, ..., x_N}. Denote the graph by G = (V, E), where the node set V = L ∪ U, with L corresponding to {x_1, ..., x_L} and U corresponding to {x_{L+1}, ..., x_N}. The edges E are weighted by the N × N affinity matrix W, with W_ij indicating the similarity between x_i and x_j, and W_ii set to 0. Let f = [f_1, ..., f_L, f_{L+1}, ..., f_N]^T denote the predicted labels of X.

Mathematically, the graph-based methods find an optimal f* by minimizing the objective function

    E(f) = E_l(f) + E_s(f),    (1)

where E_l(f) is a loss function penalizing deviation from the given labels and E_s(f) is a regularizer preferring label smoothness. For example, the Gaussian random field method [3] formulates E_l(f) and E_s(f) as

    E_l(f) = ∞ · Σ_{i∈L} (f_i − y_i)^2 = (f − y)^T Λ (f − y),
    E_s(f) = (1/2) Σ_{i,j∈L∪U} W_ij (f_i − f_j)^2 = f^T Δ f,

where Λ is a diagonal matrix with Λ_ii = ∞ for i ≤ L and Λ_ii = 0 for i > L, Δ = D − W is the combinatorial graph Laplacian, and D is a diagonal matrix whose (i, i)-element equals the sum of the ith row of W. In the local and global consistency method [2], E_l(f) is defined as (f − y)^T (f − y) and E_s(f) is formulated as f^T D^{−1/2} Δ D^{−1/2} f, where D^{−1/2} Δ D^{−1/2} is the normalized combinatorial Laplacian.

As aforementioned, the existing graph-based methods mainly address the semi-supervised problem in the single-label scenario; they are sub-optimal for the multi-label scenario, which is more challenging but much closer to real-world applications.

3. GRAPH-BASED SEMI-SUPERVISED LEARNING WITH MULTI-LABEL

In this section, we address the semi-supervised K-label problem. Define an N × K label matrix Y, where Y_ik is 1 if x_i is a labeled sample with label k, and 0 otherwise. Define an N × K matrix F = [f_1, f_2, ..., f_N]^T, where F_ik is the confidence that x_i is associated with label k.

3.1. Motivation

In the multi-label scenario, it is observed that the labels assigned to a sample (e.g., a video clip in the video annotation task) are usually consistent with the inherent label correlations. For example, "car" and "road" are usually co-assigned to a video clip since they often appear simultaneously, while "explosion fire" and "waterscape" are generally not assigned to a sample at the same time since they usually do not co-occur. Motivated by this observation, we propose a novel graph-based learning framework in which the vector-valued function over the graph has three properties: 1) it should be close to the given labels, 2) it should be smooth on the whole graph, and 3) it should be consistent with the label correlations. The first two properties have been addressed in existing graph-based methods. The third one is novel and is the key contribution of this paper.

3.2. Formulation

This unified regularization framework consists of three components: a loss function E_l(F) and two types of regularizers, E_s(F) and E_c(F). Specifically, E_l(F) corresponds to the first property and penalizes deviation from the given multi-label assignments, E_s(F) is a regularizer addressing multi-label smoothness, and E_c(F) is a regularizer preferring the third property. The proposed framework is then formulated as minimizing

    E(F) = E_l(F) + E_s(F) + E_c(F),    (2)

where E_l(F) and E_s(F) can be specified in a way similar to that adopted in existing graph-based methods. Here we define them the same as those in GRF [3], i.e.,

    E_l(F) = tr((F − Y)^T Λ (F − Y)),
    E_s(F) = tr(F^T Δ F),

where tr(M) is the trace of matrix M.

We make use of the correlations among multiple labels to define E_c(F). To capture the label correlations, we introduce a K × K symmetric matrix C, where C_ij represents the correlation between label i and label j.
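For intuition, a symmetric correlation matrix of this kind can be computed from the label columns of a training matrix. The sketch below uses the Gaussian form C_ij = exp(−||f'_i − f'_j||^2 / (2σ_c^2)) described in Section 4.1; the function name and the toy label matrix are illustrative assumptions.

```python
import numpy as np

def label_correlation(Fc, eps=1e-12):
    """K x K label-correlation matrix from an n x K label matrix Fc.

    C[i, j] = exp(-||f_i - f_j||^2 / (2 * sigma_c^2)), where f_i is the
    i-th column of Fc and sigma_c is the average pairwise column distance.
    """
    cols = Fc.T                                    # (K, n): one row per label
    d = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=-1)
    K = cols.shape[0]
    off = d[~np.eye(K, dtype=bool)]                # exclude the zero diagonal
    sigma_c = off.mean() + eps                     # average pairwise distance
    return np.exp(-d ** 2 / (2 * sigma_c ** 2))

# Toy multi-label matrix: labels 0 and 1 always co-occur, label 2 never does.
Fc = np.array([[1, 1, 0],
               [1, 1, 0],
               [0, 0, 1]], dtype=float)
C = label_correlation(Fc)
print(C[0, 1] > C[0, 2])   # the co-occurring pair scores higher
```

Labels with similar assignment columns get C_ij near 1, so the regularizer −tr(FCF^T) rewards predictions that co-activate correlated labels.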
We then specify E_c(F) = −tr(F C F^T) to make the predicted multiple labels for each sample satisfy the correlations. Eqn. (2) is then specifically formulated as

    E(F) = tr((F − Y)^T Λ (F − Y)) + (1 − α) tr(F^T Δ F) − α tr(F C F^T),    (3)

where α is the trade-off parameter. For simplicity, we name this algorithm instance ML-GRF.

3.3. Solution

Differentiating E(F) with respect to F, we have

    Λ(F − Y) + (1 − α) Δ F − α F C = 0.    (4)

Let F = [F_l^T  F_u^T]^T, and represent the matrix W (and similarly D) in block form:

    W = [ W_ll  W_lu ]
        [ W_ul  W_uu ].    (5)

Then Eqn. (4) can be written as

    P_uu F_u + F_u C = S,    (6)

where P_uu = (1 − 1/α)(D_uu − W_uu) and S = (1 − 1/α) W_ul Y_l, with Y_l the labeled rows of Y. Eqn. (6) is essentially a Sylvester equation [10], which is widely used in control theory.

Vectorizing the unknown matrix F_u, Eqn. (6) can be transformed into a linear equation:

    (I ⊗ P_uu + C^T ⊗ I) vec(F_u) = vec(S),    (7)

where ⊗ is the Kronecker product, I is the identity matrix, and vec(M) is the vectorization of the matrix M. We can then obtain F_u from vec(F_u), which equals (I ⊗ P_uu + C^T ⊗ I)^+ vec(S).

In addition, we can also define E_l(F) and E_s(F) as those adopted in the LGC method [2], i.e.,

    E_l(F) = tr((F − Y)^T (F − Y)),
    E_s(F) = tr((F − Y)^T D^{−1/2} Δ D^{−1/2} (F − Y)).

Then Eqn. (2) is specifically formulated as

    E(F) = tr((F − Y)^T (F − Y)) + (1 − α) tr((F − Y)^T D^{−1/2} Δ D^{−1/2} (F − Y)) − α tr(F C F^T).

We name this approach ML-LGC; the corresponding solution is similar to that of ML-GRF and is omitted due to lack of space.

4. EXPERIMENTS

In this section, we evaluate the proposed framework on a widely used benchmark video data set and compare it against two state-of-the-art graph-based methods.

4.1. Implementation Issues

In many real-world applications, the labeled data are usually insufficient, so the correlation matrix C obtained from such limited data is unreliable. To tackle this difficulty, we propose an iterative solution. In each iteration t, the labels are predicted by solving Eqn. (6) with C_{t−1} estimated in the last iteration (t − 1), and the data points labeled with high certainty are incorporated to re-estimate C_t.

Suppose there are M data points with predicted label matrix F_p = [f_1, f_2, ..., f_M]^T. For the label vector f_i associated with x_i, we calculate the certainty as

    g(f_i) = (1/K) Σ_{j=1}^{K} exp{−f_ij (2μ_j − f_ij)},  where  μ_j = (1/M) Σ_{i=1}^{M} f_ij.

Then, given a label matrix F_c = [f_1 f_2 ... f_n]^T on n data points with certain labeling, C is calculated as C_ij = exp(−||f'_i − f'_j||^2 / (2σ_c^2)), where f'_i is the ith column of F_c and σ_c = E(||f'_i − f'_j||) is the average pairwise distance.

4.2. Evaluations

We conduct experiments to compare the proposed approaches (i.e., ML-GRF and ML-LGC) against two other representative graph-based methods: GRF [3] and LGC [2]. The comparison is conducted on the TRECVID 2006 development set [9], which contains 137 broadcast news videos. These videos are segmented into 61,901 sub-shots, and 39 concepts are labeled on each sub-shot according to the LSCOM-Lite annotations [11]. The development set is separated into four partitions, with 90 videos for training, 16 for validation, 16 for fusion, and 15 for test [12].

The low-level feature we use is a 225-dimensional block-wise color moment, extracted over 5 × 5 fixed grids, with each block described by 9-dimensional features. We use the Gaussian kernel function to calculate the weight matrix W. The kernel bandwidth σ and the trade-off parameter α are selected via the validation process.

For performance evaluation, we use the TRECVID performance metric, Average Precision (AP) [9], to evaluate and compare the approaches on each concept. By averaging AP over all 39 concepts, we obtain the mean average precision (MAP) as the overall evaluation result. Table 1 shows the MAP of our two approaches, LGC, and GRF, and Fig. 2 illustrates the AP of the four approaches. The following observations can be made:

• By exploiting the label correlations, the proposed methods outperform the approaches that treat the semantic labels separately and neglect their interaction. In detail, ML-LGC achieves around a 7.2% improvement in MAP compared to LGC, while ML-GRF obtains about a 6.5% improvement compared to GRF.

• ML-LGC outperforms LGC on 35 of the 39 concepts, and some of the improvements are significant.
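As an aside on implementation, Eqn. (6) of Section 3.3 is a Sylvester equation, so for moderate sizes it can be solved directly via the Kronecker-vectorized system of Eqn. (7). The sketch below is a minimal illustration with random toy matrices standing in for the real P_uu, C, and S (their sizes and conditioning here are assumptions).

```python
import numpy as np

def solve_sylvester_kron(Puu, C, S):
    """Solve Puu @ Fu + Fu @ C = S via the Kronecker-vectorized form
    (I kron Puu + C^T kron I) vec(Fu) = vec(S), with column-major vec."""
    n_u, K = S.shape
    A = np.kron(np.eye(K), Puu) + np.kron(C.T, np.eye(n_u))
    vec_Fu = np.linalg.solve(A, S.flatten(order="F"))
    return vec_Fu.reshape((n_u, K), order="F")

rng = np.random.default_rng(0)
n_u, K = 6, 3                                   # toy sizes (assumptions)
Puu = rng.standard_normal((n_u, n_u)) + n_u * np.eye(n_u)  # well-conditioned
C = rng.standard_normal((K, K))
C = (C + C.T) / 2                               # symmetric correlation matrix
S = rng.standard_normal((n_u, K))

Fu = solve_sylvester_kron(Puu, C, S)
print(np.allclose(Puu @ Fu + Fu @ C, S))        # residual check
```

The Kronecker system grows as (n_u·K) × (n_u·K); for larger problems a dedicated solver such as SciPy's `scipy.linalg.solve_sylvester` (which solves AX + XB = Q) avoids forming it explicitly.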
[Figure omitted: per-concept AP bar chart.] Fig. 2. The AP comparison of LGC [2], GRF [3], and the proposed approaches (i.e., ML-LGC and ML-GRF).

Table 1. Comparison of MAP for the four approaches.

    Approach   MAP    Improvement
    LGC [2]    0.307  -
    ML-LGC     0.329  +7.2%
    GRF [3]    0.325  -
    ML-GRF     0.346  +6.5%

For example, ML-LGC achieves 52%, 49%, and 31% improvements on "office," "airplane," and "waterscape," respectively. Compared to GRF, ML-GRF obtains improvements on 32 concepts, such as "waterscape" (105%), "office" (55%), and "airplane" (40%).

• The proposed methods degrade slightly on a few concepts. The main reason is that each of these concepts has weak interactions with the other concepts; as a result, the presence or absence of these concepts cannot benefit from that of the others.

In summary, compared to the existing graph-based methods, the proposed approaches consistently improve performance across the diverse set of 39 concepts.

5. CONCLUSIONS

We have proposed a novel graph-based semi-supervised learning framework to address multi-label problems, which simultaneously takes into account both the correlations among multiple labels and the label consistency over the graph. In addition to label smoothness, the proposed framework also enforces that the multi-label assignment to each sample should be consistent with the inherent label correlations. Experiments on the benchmark TRECVID data set demonstrated that the framework is superior to key existing graph-based methods, in both overall performance and the consistency of performance across diverse concepts.

6. REFERENCES

[1] X. Zhu, "Semi-supervised learning literature survey," Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[2] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems (NIPS), vol. 16, Cambridge, MA: MIT Press, 2004, pp. 321–328.
[3] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the Twentieth International Conference on Machine Learning (ICML), Washington, DC, 2003, pp. 912–919.
[4] M. Belkin, I. Matveeva, and P. Niyogi, "Regularization and semi-supervised learning on large graphs," in Annual Conference on Computational Learning Theory (COLT), vol. 3120 of Lecture Notes in Computer Science, Springer, 2004, pp. 624–638.
[5] T. Mei, X.-S. Hua, W. Lai, L. Yang, Z.-J. Zha, et al., "MSRA-USTC-SJTU at TRECVID 2007: High-level feature extraction and search," in TREC Video Retrieval Evaluation Online Proceedings, 2007.
[6] N. Ghamrawi and A. McCallum, "Collective multi-label classification," in Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), New York, NY: ACM, 2005, pp. 195–200.
[7] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang, "Correlative multi-label video annotation," in Proceedings of the ACM International Conference on Multimedia (MM), ACM, 2007, pp. 17–26.
[8] Y. Liu, R. Jin, and L. Yang, "Semi-supervised multi-label learning by constrained non-negative matrix factorization," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Boston, 2006, pp. 666–671.
[9] TRECVID, "[Link]".
[10] P. Lancaster and M. Tismenetsky, "The theory of matrices," Mathematical Social Sciences, vol. 13, no. 1, February 1987.
[11] M. Naphade, L. Kennedy, J. R. Kender, S. F. Chang, J. R. Smith, P. Over, and A. Hauptmann, "LSCOM-Lite: A light scale concept ontology for multimedia understanding for TRECVID 2005," Technical Report RC23612, IBM T. J. Watson Research Center, 2005.
[12] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu, "Columbia University's baseline detectors for 374 LSCOM semantic visual concepts," Columbia University ADVENT Technical Report #222-2006-8, 2007.
