CS282 Machine Learning
Course Project
Student 1 Student 2
ID: xxxxxxxx ID: xxxxxxxx
email1@[Link] email2@[Link]
1 Related Work
Variational Autoencoder
Variational autoencoders (VAEs) are widely used generative models known for encoding input data
into a compact representation using an encoder-decoder structure. Unlike traditional autoencoders,
VAEs aim to model the latent space distribution of input data, enabling the generation of new data
samples that align with this distribution. The pioneering work of [1] introduced the model, its
variational objective, and the reparameterization trick for low-variance gradient estimates, while [2]
extended the framework to semi-supervised learning with deep generative models. Since then, VAEs
have been applied in a variety of settings, including image
generation, data compression, and anomaly detection. For instance, VAEs have been combined with
generative adversarial networks (GANs) [3] to produce high-quality images, and [4] used a
self-adversarial VAE for time-series anomaly detection. Additionally, VAEs have
been integrated with other deep learning models, such as convolutional neural networks (CNNs) and
recurrent neural networks (RNNs), to enhance task performance. A related example is [5], where
convolutional autoencoders were stacked to extract hierarchical features from images. In summary, VAEs represent
a significant advancement in generative modeling, contributing notably to deep learning. They are
particularly useful in deep clustering within the probabilistic graphical model framework, as they
effectively merge variational inference with deep autoencoders [6].
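For concreteness, the objective optimized by a VAE [1] is the evidence lower bound (ELBO), which balances a reconstruction term against a KL term that keeps the approximate posterior close to the prior:
\[
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big),
\]
where the encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$ are neural networks and $p(z)$ is typically a standard Gaussian prior; the reparameterization trick makes the expectation differentiable with respect to $\phi$.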
Information Bottleneck
The information bottleneck (IB) principle is a crucial concept in information theory and machine
learning, providing insight into balancing compression and prediction. Introduced by [7], it emphasizes
the importance of learning algorithms that effectively retain relevant information while
discarding irrelevant data. Since its inception, the IB principle has found extensive application in
various machine learning tasks, including unsupervised learning [8], classification [9], and clustering
[10, 11]. More recently, it has been used to study how deep neural networks process and compress information [12, 13].
By constraining the representations learned by deep neural networks, the IB principle can enhance
these models’ generalization abilities and reduce computational demands [10]. It has also inspired
new methods for training deep neural networks, such as the deep variational information bottleneck
[9]. Its foundation in information theory concepts such as Markov chains, entropy, and conditional
entropy [14] supports its diverse applications across fields like data mining, image processing, natural
language processing, computer vision [15–17], and even control theory [18]. Overall, it has become
a valuable tool for designing advanced machine learning algorithms.
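In its standard formulation [7], the IB principle seeks a compressed representation T of the input X that remains informative about a relevance variable Y, by solving
\[
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta\, I(T; Y),
\]
where $I(\cdot\,;\cdot)$ denotes mutual information, T depends on the data only through X, and the trade-off parameter $\beta > 0$ controls how much predictive information is retained relative to the degree of compression.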
Multi-View Clustering
In practical clustering tasks, data often come from multiple views, which has led to the development
of multi-view clustering (MVC) techniques. These methods leverage complementary information from
different views to address the limitations of traditional clustering methods. Co-training, introduced
by [19], is a prevalent MVC approach in which multiple models, each trained on a distinct view,
label additional unlabeled data for one another (see the sketch after this paragraph); it has been
applied successfully to text classification [20], image recognition [21], and community detection
[22]. Low-rank matrix factorization is another strategy aimed at
discovering a shared latent representation across views, with variants such as structured low-rank
matrix factorization [23], tensor-based factorization [24], and deep matrix factorization [25]. In recent
years, subspace methods have also been applied to MVC [26, 27, 6, 28], for example by integrating
multiple affinity graphs into a consensus graph that accounts for topological relevance [27].
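To make the co-training idea concrete, the following is a minimal Python sketch of the two-view pseudo-labeling loop; the choice of logistic regression, the confidence threshold, and the rule for merging the two views' predictions are illustrative assumptions rather than the exact procedure of [19].

import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=5, conf=0.9):
    # Two view-specific classifiers iteratively pseudo-label the unlabeled
    # pool for each other and retrain on the enlarged labeled set.
    clf1 = LogisticRegression(max_iter=1000)
    clf2 = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf1.fit(X1_lab, y_lab)
        clf2.fit(X2_lab, y_lab)
        if len(X1_unlab) == 0:
            break
        p1 = clf1.predict_proba(X1_unlab)  # view-1 confidence on unlabeled points
        p2 = clf2.predict_proba(X2_unlab)  # view-2 confidence on the same points
        keep = (p1.max(axis=1) > conf) | (p2.max(axis=1) > conf)
        if not keep.any():
            break
        # For each confident point, take the label from whichever view is more certain.
        labels1 = clf1.classes_[p1.argmax(axis=1)]
        labels2 = clf2.classes_[p2.argmax(axis=1)]
        pseudo = np.where(p1.max(axis=1) >= p2.max(axis=1), labels1, labels2)
        # Move the confident points from the unlabeled pool to the labeled set.
        X1_lab = np.vstack([X1_lab, X1_unlab[keep]])
        X2_lab = np.vstack([X2_lab, X2_unlab[keep]])
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        X1_unlab, X2_unlab = X1_unlab[~keep], X2_unlab[~keep]
    return clf1, clf2

In practice, each classifier typically adds only a few of its most confident examples per class per round, so that a single noisy view does not dominate the labeled pool.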
Traditional MVC approaches largely rely on linear, shallow embeddings, which cannot fully exploit
the data nonlinearities that are crucial for complex clustering structures. As deep learning has
emerged as a potent tool for MVC, various neural network architectures have been proposed. For
instance, [29] introduced a deep adversarial MVC method that uses adversarial training to learn a
joint representation across multiple views, while [30] proposed deep embedded clustering (DEC),
which maps the high-dimensional original feature space to an optimized lower-dimensional space and
refines cluster assignments there (see the objective below). DEC's introduction has led to numerous
extensions and improvements. Autoencoder networks are frequently employed in unsupervised data
representation learning, efficiently learning complex nonlinear mapping functions. Utilizing deep
autoencoders (DAE) [31] is a common strategy in developing deep clustering techniques. Lastly, [32]
proposed Multi-VAE, which learns disentangled view-common and view-peculiar visual representations
for multi-view clustering.
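As a reference point for the deep clustering methods above, DEC [30] alternates between computing a soft assignment of each embedded point $z_i$ to cluster centers $\mu_j$ (a Student's t kernel with one degree of freedom) and sharpening it into a target distribution that the network is trained to match:
\[
q_{ij} = \frac{\big(1 + \lVert z_i - \mu_j \rVert^2\big)^{-1}}{\sum_{j'} \big(1 + \lVert z_i - \mu_{j'} \rVert^2\big)^{-1}},
\qquad
p_{ij} = \frac{q_{ij}^2 \big/ \sum_i q_{ij}}{\sum_{j'} \big( q_{ij'}^2 \big/ \sum_i q_{ij'} \big)},
\qquad
\min_{\theta, \mu} \; \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\]
Minimizing this KL divergence pushes each point toward the cluster it is already most confidently assigned to, the self-training mechanism that many later deep MVC methods build on.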
2 Contribution Percent
References
[1] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint
arXiv:1312.6114, 2013.
[2] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with
deep generative models,” in NeurIPS, 2014, pp. 3581–3589.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, pp.
139–144, 2020.
[4] Y. Liu, Y. Lin, Q. Xiao, G. Hu, and J. Wang, “Self-adversarial variational autoencoder with
spectral residual for time series anomaly detection,” Neurocomputing, vol. 458, pp. 349–363,
2021.
[5] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolutional auto-encoders for
hierarchical feature extraction,” in ICANN, 2011, pp. 52–59.
[6] C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan, “Robust and efficient subspace
segmentation via least squares regression,” in ECCV. Springer, 2012, pp. 347–360.
[7] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint
physics/0004057, 2000.
[8] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015
IEEE Information Theory Workshop, 2015, pp. 1–5.
[9] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,”
arXiv preprint arXiv:1612.00410, 2016.
[10] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through noisy
computation,” TPAMI, vol. 40, pp. 2897–2905, 2018.
[11] W. Yan, J. Zhu, Y. Zhou, Y. Wang, and Q. Zheng, “Multi-view semantic consistency based
information bottleneck for clustering,” arXiv preprint arXiv:2303.00002, 2023.
[12] A. M. Saxe, J. L. McClelland, and S. Ganguli, “A mathematical theory of semantic development
in deep neural networks,” Proceedings of the National Academy of Sciences, vol. 116, pp.
11537–11546, 2019.
[13] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,”
Information Flow in Deep Neural Networks, 2022.
[14] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. John Wiley & Sons: Hoboken,
NJ, USA, 2006.
[15] J. Goldberger, H. Greenspan, and S. Gordon, “Unsupervised image clustering using the infor-
mation bottleneck method,” in DAGM-Symposium, 2002, pp. 158–165.
[16] S. M. D. A. C. Jayatilake and G. U. Ganegoda, “Involvement of machine learning tools in
healthcare decision making,” Journal of Healthcare Engineering, 2021.
[17] Q. Sun, J. Li, H. Peng, J. Wu, X. Fu, C. Ji, and P. S. Yu, “Graph structure learning with
variational information bottleneck,” in AAAI, 2022, pp. 4165–4174.
[18] B. Paranjape, M. Joshi, J. Thickstun, H. Hajishirzi, and L. Zettlemoyer, “An information
bottleneck approach for controlling conciseness in rationale extraction,” in EMNLP, 2020, pp.
1938–1952.
[19] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Pro-
ceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp.
92–100.
[20] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, “Learning with local and global
consistency,” in NeurIPS, 2003, pp. 321–328.
[21] C.-K. Lee and T.-L. Liu, “Guided co-training for multi-view spectral clustering,” in ICIP, 2016,
pp. 4042–4046.
[22] J. Liu, C. Wang, J. Gao, and J. Han, “Multi-view clustering via joint nonnegative matrix
factorization,” in SDM, 2013, pp. 252–260.
[23] S. Zheng, X. Cai, C. Ding, F. Nie, and H. Huang, “A closed form solution to multi-view low-rank
regression,” in AAAI, 2015, pp. 1973–1979.
[24] T. V. de Cruys, T. Poibeau, and A. Korhonen, “A tensor-based factorization model of seman-
tic compositionality,” in Conference of the North American Chapter of the Association for
Computational Linguistics (HLT-NAACL), 2013, pp. 1142–1151.
[25] H. Zhao, Z. Ding, and Y. Fu, “Multi-view clustering via deep matrix factorization,” in AAAI,
2017, pp. 2921–2927.
[26] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,”
TPAMI, vol. 35(11), pp. 2765–2781, 2013.
[27] S. Huang, H. Wu, Y. Ren, I. Tsang, Z. Xu, W. Feng, and J. Lv, “Multi-view subspace clustering
on topological manifold,” in NeurIPS, 2022, pp. 25883–25894.
[28] F. Nie, H. Wang, H. Huang, and C. Ding, “Unsupervised and semi-supervised learning via
ℓ1-norm graph,” in ICCV. IEEE, 2011, pp. 2268–2273.
[29] Z. Li, Q. Wang, Z. Tao, Q. Gao, Z. Yang et al., “Deep adversarial multi-view clustering network,”
in IJCAI, 2019, pp. 2952–2958.
[30] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in
ICML, 2016, pp. 478–487.
[31] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks,” Science, vol. 313, pp. 504–507, 2006.
[32] J. Xu, Y. Ren, H. Tang, X. Pu, X. Zhu, M. Zeng, and L. He, “Multi-VAE: Learning disentangled
view-common and view-peculiar visual representations for multi-view clustering,” in ICCV, 2021,
pp. 9234–9243.