2IIG0 Cheat Sheet

Norms
∥v + w∥ ≤ ∥v∥ + ∥w∥ (Triangle inequality)
Norm ⇐⇒ ∥αv∥ = |α|∥v∥ (Homogeneity)
∥v∥ = 0 ⇐⇒ v = 0
v^T w = v · w = cos(v, w)∥v∥∥w∥ (Property of the Euclidean (L2) norm)
v · w/∥w∥ (Projection of v onto w)
∥v∥_p = (Σ_{i=1}^d |v_i|^p)^{1/p} (Lp norm for vectors)
∥A∥_p = (Σ_{i=1}^n Σ_{j=1}^m |A_{ji}|^p)^{1/p} (Lp norm for matrices)
Lp norms for p < 1 violate the triangle inequality (they are not norms).
∥A∥_op = max_{∥v∥=1} ∥Av∥ (Operator norm)
tr(ABCD) = tr(BCDA) = tr(CDAB) = ... (Cycling property)
L2 (= Frobenius for matrices) norm properties:
• ∥v∥² = v^T v and ∥A∥² = tr(A^T A)
• d/dX ∥AX + B∥² = 2A^T (AX + B)

SVD: For every matrix X ∈ R^{n×p} there exist orthogonal matrices U ∈ R^{n×n}, V ∈ R^{p×p} and a matrix Σ ∈ R^{n×p} such that
• X = U Σ V^T
• U^T U = U U^T = I_n, V^T V = V V^T = I_p
• Σ is diagonal, containing the singular values
• If all singular values are > 0 then X^{-1} = V Σ^{-1} U^T

Optimization
Analytical Optimisation
FONC & SONC (unconstrained minimum, twice differentiable): ∇f(x) = 0 and ∇²f(x) is positive semidefinite (x^T A x ≥ 0 ∀x ∈ R^d).
Lagrangian: Adding constraints c_i(x) = 0, g_k(x) ≥ 0.
C = {x ∈ R^d | constraints hold on x} (Feasible set)
L = f(x) − Σ_{i=1}^m λ_i c_i(x) − Σ_{k=1}^l µ_k g_k(x) (Lagrangian)
µ_i ≥ 0 for a ≥ constraint, otherwise µ_i ≤ 0. Sometimes (e.g. for convex problems) the minimizer of L also minimizes f. To minimize L, solve the system of d + m + l equations:
∇f(x) = Σ_{i=1}^m λ_i ∇c_i(x) + Σ_{k=1}^l µ_k ∇g_k(x)
c_i(x) = 0 ∀i : 1 ≤ i ≤ m
g_k(x) ≥ 0 ∀k : 1 ≤ k ≤ l
Numerical Optimisation
Coordinate Descent: Minimise over each input variable in turn:
x_i^{(t+1)} = arg min_{x_i} f(x_1^{(t)}, . . . , x_i, . . . , x_d^{(t)}), 1 ≤ i ≤ d
Gradient Descent: x_{t+1} = x_t − η∇f(x_t), where η is the step size (see the sketch after this section).
Convex Optimisation
Numerical optimisation gives a local minimizer. For a convex function, every local minimizer is a global minimizer.
Convex Set: X is convex ⇐⇒ ∀x, y ∈ X, α ∈ [0, 1] : αx + (1 − α)y ∈ X (line segment entirely in X).
Convex Function: f : R^d → R is convex ⇐⇒ f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ∀x, y ∈ R^d, α ∈ [0, 1] (function below the line).
Convex Properties:
• Nonnegative weighted sums of convex functions are convex.
• g : R^d → R^k, g(x) = Ax + b and f : R^k → R convex =⇒ f ◦ g is convex.
• All norms are convex.
• Linear functions (and maps) are convex and concave.
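
A minimal sketch of the gradient-descent and coordinate-descent updates above, in Python/NumPy, using a simple convex quadratic as a stand-in objective; the matrix A, the step size η and the iteration counts are illustrative choices, not prescribed by the course.

import numpy as np

# Minimal sketch: minimise f(x) = 0.5 * x^T A x - b^T x, whose gradient is
# A x - b (A positive definite, so f is convex and the minimiser is unique).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_f(x):
    return A @ x - b

def gradient_descent(x0, eta=0.1, steps=200):
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * grad_f(x)          # x_{t+1} = x_t - eta * grad f(x_t)
    return x

def coordinate_descent(x0, sweeps=50):
    # Minimise over one coordinate at a time; for a quadratic the exact
    # coordinate minimiser is available in closed form.
    x = x0.copy()
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

x_star = np.linalg.solve(A, b)           # analytical minimiser for reference
print(gradient_descent(np.zeros(2)), coordinate_descent(np.zeros(2)), x_star)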

Classification
Feature vector x ∈ R^D, class/label value y, N training samples (x_1, y_1), . . . , (x_N, y_N).
Assumption: Samples are independent and identically distributed from an unknown distribution p*(X, Y), (x_i, y_i) ∼ p*.
Classifier: Function h that maps x to a predicted label ŷ.
Learning Algorithm: Takes a dataset and returns a classifier h.
True Classification Error: P[Y ≠ h(X)]
A classifier h* that minimizes this error is Bayes Optimal. This is only possible if the true distribution is known.

K-Nearest Neighbours
Pick the modal class among the k nearest neighbours (use Euclidean or another distance metric). Is stable, reduces data noise. Use odd k for binary classification (to ensure a majority exists).
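
A minimal k-NN sketch with NumPy, assuming Euclidean distance and an odd k; the toy arrays X_train and y_train are illustrative.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distances from the query point to every training sample.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    labels = y_train[nearest]
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]           # modal class among the neighbours

# Illustrative toy data: two classes in 2D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1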

Naive Bayes
h*(x) = arg max_y p*(y|x). Idea: learn p*(y|x).
Bayes Rule: p*(y|x) = p*(x|y)p*(y) / p*(x) ∝ p*(x|y)p*(y)
arg max_y p*(y|x) = arg max_y p*(x|y)p*(y)
Estimate p*(y) from the class distribution of the training set.
Assumption (1): Conditional independence of the D features:
p*(x|y) ≈ ∏_{d=1}^D p*(x_d|y), giving us
h_NB(x) = arg max_y p*(y) ∏_{d=1}^D p*(x_d|y)

Decision Trees
Possible decisions:
Continuous: sort, use a threshold between consecutive data points.
Discrete: partition the set into 2 (one vs all, or an arbitrary partition).
Cost of a split:
• Number of data points at the old leaf: N.
• N_0, N_1 are the numbers of data points at the new leaves L_0, L_1.
• Cost = (N_0/N) Impurity(L_0) + (N_1/N) Impurity(L_1) − Impurity(L)
Impurity (a small sketch follows at the end of this section):
• Gini Impurity: 1 − Σ_y p(y)²
• Entropy: −Σ_y p(y) log(p(y)) (binary → base 2)
• Self Classification Error: 1 − max_y p(y)
Stopping Condition: Too few datapoints (use an arbitrary threshold) or too small an improvement of the cost (use an arbitrary threshold, or significance testing).
Random Trees: Bootstrap and aggregate (bagging). With N samples, generate K datasets (bootstraps) of size N by sampling with replacement. Train a new decision tree on each bootstrap. Predict by aggregating the results of the classifiers (taking the average or the modal result).
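
A small sketch of the three impurity measures and the split cost above, assuming class labels are given as an integer array; the example split is made up for illustration.

import numpy as np

def class_probs(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_probs(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_probs(labels)
    return -np.sum(p * np.log2(p))            # base 2 for binary labels

def self_error(labels):
    return 1.0 - np.max(class_probs(labels))

def split_cost(parent, left, right, impurity=gini):
    # (N0/N) Impurity(L0) + (N1/N) Impurity(L1) - Impurity(L)
    n = len(parent)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right) - impurity(parent)

parent = np.array([0, 0, 0, 1, 1, 1])          # illustrative labels
left, right = parent[:3], parent[3:]           # a perfect split
print(gini(parent), entropy(parent), self_error(parent), split_cost(parent, left, right))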

Error
Classification error: (1/N) Σ_{i=1}^N 1(y_i ≠ h(x_i))
A low training error may simply signify overfitting. Solution: partition the data into training/validation/test sets.
• Repeat: train on the training set, tune on the validation set.
• Test the model with the best validation error on the test set.
Accuracy: Opposite of error. A confusion matrix shows the counts of true label vs predicted label. Normalise to show probabilities.

Support Vector Machines
Binary classification using linear decision boundaries (separating hyperplane): h(x) = sgn(w^T x + b)
A good hyperplane maximises the minimal distance (margin) to any training example. The hyperplane is supported by vectors which are exactly on the margin, and it is entirely determined by these support vectors (removing other points does not affect the hyperplane). Let y_i = ±1 determine the class.
Margins: w^T x + b = ±1
The distance from the hyperplane to the margin is 1/∥w∥, which is not convex. max 1/∥w∥ = min ∥w∥ = min ½∥w∥², which is convex.
Linearly Separable Data:
min_{w,b} ½∥w∥² s.t. y_i(w^T x_i − b) ≥ 1 for i = 1 . . . N
Non-linearly Separable Data: Introduce slack variables ξ_i:
min_{w,b,ξ} ½∥w∥² + C Σ_i ξ_i s.t. y_i(w^T x_i − b) ≥ 1 − ξ_i, ξ_i ≥ 0 for i = 1 . . . N
Equivalent to min_{w,b} ½∥w∥² + C Σ_i max(1 − y_i(w^T x_i − b), 0)
C is a hyperparameter; larger C means a stricter margin.
Kernel Trick: Apply the Lagrangian to obtain a new formulation:
max_λ −½ Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j + Σ_i λ_i s.t. Σ_i λ_i y_i = 0, 0 ≤ λ_i ≤ C for i = 1 . . . N
The optimal λ provides the optimal solution:
w = Σ_i λ_i y_i x_i, b = w^T x_i − y_i (for any i where λ_i > 0)
Usually decision boundaries are not linear, so apply a non-linear feature transformation ϕ and replace x_i^T x_j by ϕ(x_i)^T ϕ(x_j).
Kernel: k : R^D × R^D → R. k is positive definite if for any finite collection of vectors x_1, . . . , x_N and ∀a ∈ R^N it holds that a^T K a ≥ 0, with K ∈ R^{N×N}, K_{ij} = k(x_i, x_j).
Representer Theorem: k is positive definite ⇐⇒ k(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) in some feature space ϕ(x).
Replace ϕ(x_i)^T ϕ(x_j) in the algorithm with k(x_i, x_j), which is cheaper to compute. It is not necessary to know ϕ (it can also not be determined from k). The kernelized SVM becomes:
max_λ −½ Σ_{i,j} λ_i λ_j y_i y_j k(x_i, x_j) + Σ_i λ_i s.t. Σ_i λ_i y_i = 0, 0 ≤ λ_i ≤ C for i = 1 . . . N
w = Σ_i λ_i y_i ϕ(x_i) would need knowledge of ϕ, but w is not needed:
b = Σ_j λ_j y_j k(x_j, x_i) − y_i (for any i where λ_i > 0)
h(x) = sgn(Σ_i λ_i y_i k(x_i, x) − b)
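
A minimal sketch of the kernelized decision function above, assuming the dual variables λ have already been obtained from some QP solver (here they are made-up values that only illustrate the formulas) and using an RBF kernel as an example choice.

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def svm_decision(x, X_sv, y_sv, lam, b, kernel=rbf_kernel):
    # h(x) = sgn( sum_i lambda_i y_i k(x_i, x) - b )
    s = sum(l * y * kernel(xi, x) for l, y, xi in zip(lam, y_sv, X_sv))
    return np.sign(s - b)

# Illustrative support vectors and (made-up, untrained) dual variables.
X_sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y_sv = np.array([-1.0, 1.0])
lam = np.array([0.5, 0.5])
# b from any support vector i with lambda_i > 0:
i = 0
b = sum(l * y * rbf_kernel(xj, X_sv[i]) for l, y, xj in zip(lam, y_sv, X_sv)) - y_sv[i]
print(svm_decision(np.array([0.9, 0.8]), X_sv, y_sv, lam, b))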

Regression
Given a dataset D = {(D_i, y_i) | D_i ∈ R^{1×d}, y_i ∈ R, 1 ≤ i ≤ n}, where D_i, y_i are the features and target for observation i:
Approximate the noisy function with f : R^d → R, f(x) = ϕ(x)^T β.
Types of functions:
• Affine: ϕ(x) = (1 x)^T ∈ R^{d+1}
• Polynomials of degree k: ϕ(x) ∈ R^{(k+1)^d}
• Sum of k Gaussians: ϕ(x) = (exp(−γ∥x − µ_1∥²) . . . exp(−γ∥x − µ_k∥²))^T ∈ R^k
Goal: min_β ∥y − Xβ∥² with ϕ : R^d → R^p, β ∈ R^p and
X = (ϕ(D_1^T)^T ; . . . ; ϕ(D_n^T)^T) ∈ R^{n×p}, y = (y_1, . . . , y_n)^T ∈ R^n,
with X the design matrix (row i is ϕ(D_i^T)^T).
If X^T X is invertible then (X^T X)^{-1} X^T y is the only minimizer. Otherwise there are infinitely many.
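
A least-squares sketch with NumPy, using the affine feature map ϕ(x) = (1, x)^T as the example basis; the toy data is illustrative and lstsq is used instead of the explicit inverse for numerical stability.

import numpy as np

def design_matrix(D):
    # Affine feature map phi(x) = (1, x)^T applied row-wise: X[i] = (1, D_i).
    n = D.shape[0]
    return np.hstack([np.ones((n, 1)), D])

rng = np.random.default_rng(0)
D = rng.uniform(-1, 1, size=(50, 2))                # 50 observations, d = 2
beta_true = np.array([0.5, 2.0, -1.0])
y = design_matrix(D) @ beta_true + 0.1 * rng.normal(size=50)

X = design_matrix(D)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # minimises ||y - X beta||^2
print(beta)                                         # close to beta_true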

Bias-Variance Tradeoff
MSE = (1/|I|) Σ_{i∈I} (y_i − ϕ(D_i^T)^T β)², with test set I (a subset of the data).
EPE: Train k models on different datasets. Calculate the MSE of the models on different test sets. Average the MSEs.
Bias: Difference between predicted values and actual values on the training set. High bias is underfitting.
Variance: Difference between trained models (if training the model on different datapoints produces a different curve). High variance is overfitting.
k-fold Cross-Validation: Divide the dataset I into k disjunctive chunks. Train the jth model on I \ I_j and compute the MSE for that model on I_j. The CV MSE is the average MSE across the models.
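
A k-fold cross-validation sketch, assuming the affine design matrix and least-squares fit from the regression sketch above; the fold split is a plain equal partition of shuffled indices.

import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)                  # k disjunctive chunks
    mses = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        residuals = y[test] - X[test] @ beta
        mses.append(np.mean(residuals ** 2))        # MSE of model j on fold I_j
    return np.mean(mses)                            # CV MSE = average over folds

# Illustrative data: y depends linearly on two features plus noise.
rng = np.random.default_rng(1)
D = rng.normal(size=(60, 2))
X = np.hstack([np.ones((60, 1)), D])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)
print(kfold_cv_mse(X, y, k=5))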

Regularization
X^T X is not invertible ⇐⇒ Σ^T Σ (using the SVD X = U Σ V^T) is not invertible ⇐⇒ there are r < p nonzero singular values. Then use the pseudoinverse (Σ^T Σ)^+ ∈ R^{p×p}, defined as the diagonal matrix diag(1/σ_1², . . . , 1/σ_r², 0, . . . , 0), and use the solution β^+ = V (Σ^T Σ)^+ Σ^T U^T y.

Feature Selection
If the number of features d ≫ the number of observations n → sparse/ridge regression (eliminate some features). Penalize less relevant features using a norm, with Lagrangian multiplier λ > 0:
min_β ∥y − Xβ∥² + λ∥β∥_p is the penalized objective.
The L0 "norm" counts the nonzero elements but is not a real norm.
Ridge Regression: Choose the squared L2 norm: β = (X^T X + λI)^{-1} X^T y.
Lasso Regression: Choose the L1 norm. Not differentiable → solved with coordinate descent:
β_k = (1/∥X_{·k}∥²)(c_k − λ) if c_k > λ
β_k = (1/∥X_{·k}∥²)(c_k + λ) if c_k < −λ
β_k = 0 if −λ ≤ c_k ≤ λ
with c_k = X_{·k}^T (y − Σ_{i≠k} X_{·i} β_i)
Ridge regression is faster; Lasso gives sparser regression vectors.
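
A sketch of ridge (closed form) and lasso (coordinate descent with the soft-threshold update above); the λ values, iteration count and toy data are illustrative choices.

import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso_cd(X, y, lam, sweeps=100):
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)                  # ||X_{.k}||^2 per column
    for _ in range(sweeps):
        for k in range(p):
            r_k = y - X @ beta + X[:, k] * beta[k]   # residual excluding feature k
            c_k = X[:, k] @ r_k
            if c_k > lam:
                beta[k] = (c_k - lam) / col_sq[k]
            elif c_k < -lam:
                beta[k] = (c_k + lam) / col_sq[k]
            else:
                beta[k] = 0.0                        # soft-thresholded to exact zero
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])  # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=40)
print(ridge(X, y, lam=1.0))
print(lasso_cd(X, y, lam=5.0))                       # small coefficients pushed to 0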

Neural Networks
Good at capturing signal-like data (images, video, speech, audio, language). Prominent network structure per data type:
• Unstructured → MLPs (other classifiers are usually better)
• Image → CNNs
Neuron: ŷ = ϕ(Σ_{i=0}^D w_i x_i) (bias: weight w_0 with constant input x_0 = 1)
A single layer neural network with non-linear activation units can approximate any continuous function. Use more layers to improve computation speed.
Non-linear activation functions:
• Sigmoid: y = 1/(1 + e^{−x}), y′ = y(1 − y) = y(x)y(−x)
• Hyperbolic Tangent: y = (e^x − e^{−x})/(e^x + e^{−x}), y′ = 1 − y²
• ReLU: y = max(0, x), y′ = 1(x > 0)
• Leaky ReLU: y = max(αx, x), y′ = 1 if x > 0 else α
• Exponential LU: y = x if x ≥ 0 else α(e^x − 1), y′ = 1 if x ≥ 0 else αe^x
α is a hyperparameter, 0 < α < 1.
The traditional activations (sigmoid, tanh) kill gradients, especially for many layers, due to multiplying small numbers during backprop (use Batch Normalization). The modern ones (ReLU family) can blow up activations (also normalize). ReLU neurons may die (never activate again) if too many activations go below zero.

Multilayer Perceptrons
Notation:
• # Units at layer i: H_i (H_0 is the number of features D)
• w^i_{k,j}: weight from layer i − 1 → i, and unit k → j
• w^i_j = (w^i_{0,j}, . . . , w^i_{H_{i−1},j})^T (weights to unit j in layer i)
• h^0 = (1, x)^T (bias + input vector)
• W^i = (w^i_1 . . . w^i_{H_i}) (all weights to layer i)
• Recursively define a^i = W^{i T} h^{i−1} and h^i = ϕ_i(a^i), where ϕ_i acts element-wise
• Then a^i sums the weighted inputs for each neuron at layer i
• h^i contains the output of each neuron at layer i
• ϕ_i can be a different activation function for each layer i
• The final layer does not apply an activation.
The weights w = W^1, W^2, . . . are the parameters to train.
ŷ = f(x; w) is the predicted output for x with weights w.
Loss = L(y, ŷ) describes how well ŷ approximates y.
Training loss = (1/N) Σ_{i=1}^N L(y_i, f(x_i; w)) is the average loss across the training set of N observations.
Single Output:
• ŷ = final layer
• Squared Error (Loss): L = (ŷ − y)², ∂L/∂ŷ = 2(ŷ − y)
  – Used for regression.
Multiple Outputs (Classes):
• Let h = h^L be the final (accumulation) layer with C units
• ŷ_i = softmax(h)_i = exp(h_i) / Σ_{j=1}^C exp(h_j) (base e)
  – Transforms the output into a probability distribution.
• Cross Entropy: L = −log(ŷ_y), ∂L/∂h_j = ŷ_j − 1 if j = y, else ŷ_j
  – You only need to compute the softmax at position y (except for the derivative).
  – Used for classification.
Then train the model via gradient descent: w ← w − η∇_w L_train.
The training loss gradient is the average of the sample loss gradients.
NB: ∇_w L_train is a vector of matrices of partial derivatives of the loss with respect to the weights.

Backpropagation
Let N = # layers; define helper variables:
• B_i^(k) = ∂L/∂h_i^(k) = Σ_j B_j^′(k+1) w_{i,j}^(k+1) for 1 ≤ k < N (at k = N − 1, use B_j^(N) in place of B_j^′(N))
• B_i^′(k) = ∂L/∂a_i^(k) = ϕ_k′(a_i^(k)) B_i^(k) for 1 ≤ k < N
• B_i^(N) = ∂L/∂ŷ_i (no activation at layer N, so there is no B_i^′(N))
Then the derivatives of the loss with respect to the weights are:
• ∂L/∂w_{i,j}^(k) = B_j^′(k) h_i^(k−1) for 1 ≤ k < N
• ∂L/∂w_{i,j}^(N) = B_j^(N) h_i^(N−1)
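
A minimal sketch of these formulas for a two-layer network with a single squared-error output, verified against a finite-difference gradient; the layer sizes and data are made up, and (for brevity) a bias is only prepended to the input.

import numpy as np

# Minimal 2-layer MLP (one hidden layer, linear output) with squared error,
# checking the backprop formulas against a numerical gradient.
rng = np.random.default_rng(3)
x, y = np.array([0.5, -1.0]), 1.5
h0 = np.concatenate([[1.0], x])            # h^0 = (1, x)^T
W1 = rng.normal(size=(3, 4))               # weights into the 4 hidden units
W2 = rng.normal(size=(4, 1))               # weights into the single output

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def forward(W1, W2):
    a1 = W1.T @ h0                         # a^1 = W^{1T} h^0
    h1 = sigmoid(a1)                       # h^1 = phi(a^1)
    y_hat = (W2.T @ h1)[0]                 # no activation at the final layer
    return a1, h1, y_hat

a1, h1, y_hat = forward(W1, W2)
L = (y_hat - y) ** 2

B2 = np.array([2 * (y_hat - y)])           # B^(N) = dL/dy_hat
dW2 = np.outer(h1, B2)                     # dL/dw^(N)_{i,j} = B^(N)_j h^(N-1)_i
B1 = W2 @ B2                               # B^(1)_i = sum_j B^(2)_j w^(2)_{i,j}
B1p = sigmoid(a1) * (1 - sigmoid(a1)) * B1 # B'^(1) = phi'(a^1) * B^(1)
dW1 = np.outer(h0, B1p)                    # dL/dw^(1)_{i,j} = B'^(1)_j h^(0)_i

# Numerical check of one entry of dW1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = (forward(W1p, W2)[2] - y) ** 2
print(dW1[0, 0], (Lp - L) / eps)           # should agree to ~1e-5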

Training Techniques
Data Normalization: For x_j the sample values of feature j:
Zero-mean-unit-variance: x_j ← (x_j − mean(x_j)) / std(x_j)
Min-max ([0, 1]): x_j ← (x_j − min(x_j)) / (max(x_j) − min(x_j)); subtract 0.5 for [−0.5, 0.5]
Stochastic Gradient Descent: Partition the data randomly, update with the loss on every mini-batch.
Momentum: Moving average of the gradient to stabilise.
Adaptive Stepsizes: Multiply the gradient by a positive semidefinite matrix for independent stepsizes for each neuron.
Early Stopping: Stop when the validation error starts increasing (or doesn't achieve a new minimum for K_fail epochs).
Regularization: Add a penalty term λ∥w∥²_2 or λ∥w∥_1 to L_train.
Dropout: Randomly remove inputs or units when training.
Batch Normalization: Normalize the output of the neurons.

Convolutional Neural Networks
Include inductive bias in the NN structure to specialise. Convolutions specialise in local correlations present in signal data.
w: K × K convolutional kernel (not the kernel trick), i.e. a matrix (K is odd!)
x: image (in matrix form)
y: convolutional layer (i.e. the processed image)
y[i, j] = Σ_{k=−⌊K/2⌋}^{⌊K/2⌋} Σ_{l=−⌊K/2⌋}^{⌊K/2⌋} x[i + k, j + l] w[k, l]
Basically: elementwise (Hadamard) multiplication of w and the K × K patch of x centered at (i, j), then sum all the elements.
Undefined if x[i ± ⌊K/2⌋, j] or x[i, j ± ⌊K/2⌋] do not exist; this removes ⌊K/2⌋ from each side of x. Solution: padding.
Convolutional Layers: If the image is a 3D feature map (e.g. 3rd dimension = colour channels), then the kernel is a K × K × C tensor. Still the same: elementwise multiplication with the 'overlapping' elements and sum (i.e. 2D convolution on C layers, add the results). This creates a layer with a single channel. Use multiple kernels and stack the layers → feature map with multiple channels.
Optionally: add bias terms b after each convolution.
Zero Padding: Pad the input with ⌊K/2⌋ zeros on each side.
Can apply a non-linearity ϕ elementwise, and apply a new convolutional layer (like the layers in an MLP). Higher layers in a CNN represent more abstract features. Without padding, information is compressed. Use padding, then pooling to reduce the number of features.
Pooling: Sliding window like a convolution, but with an aggregation function (e.g. max, average) instead of multiplication/summation. Applied independently for each channel.
• Window size K_p (can be even!)
• Stride/hop size S_p (a convolution slides one at a time)
Flattening: Convert a W × H × C feature map to a vector of length W · H · C so that MLP learning can be used.
Toeplitz Matrix: A convolution can be described by a matrix, e.g. for w ∈ R^{2×2} and x ∈ R^{3×3}, with flattened x and y:
y = ( w1 w2 0 w3 w4 0 0 0 0
      0 w1 w2 0 w3 w4 0 0 0
      0 0 0 w1 w2 0 w3 w4 0
      0 0 0 0 w1 w2 0 w3 w4 ) x ∈ R^4 (then y is reshaped to R^{2×2})
This allows us to use backprop. Never update the 0s, and the gradients for shared weights are the sum of the gradients from the nodes where they are used. Still easier to flatten instead.
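
A small "valid" 2D convolution sketch following the formula above (no padding, so ⌊K/2⌋ is lost on each side); the 3×3 kernel and 5×5 image are illustrative.

import numpy as np

def conv2d_valid(x, w):
    # y[i, j] = sum_{k,l} x[i+k, j+l] * w[k, l] over the K x K patch
    # (no padding: the output loses floor(K/2) rows/columns on each side).
    K = w.shape[0]
    H, W = x.shape
    y = np.zeros((H - K + 1, W - K + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + K, j:j + K] * w)   # Hadamard product, then sum
    return y

x = np.arange(25, dtype=float).reshape(5, 5)            # illustrative 5x5 'image'
w = np.array([[0.0, 1.0, 0.0],
              [1.0, -4.0, 1.0],
              [0.0, 1.0, 0.0]])                         # 3x3 Laplacian-style kernel
print(conv2d_valid(x, w))                               # 3x3 output

# With zero padding of floor(K/2) on each side the output keeps the input size:
x_pad = np.pad(x, 1)
print(conv2d_valid(x_pad, w).shape)                     # (5, 5)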

Recommender and PCA use SVD.

Recommending (rank-r matrix factorization problem)
Data matrix D ∈ R^{n×d} and rank r < min(n, d), then the objective is:
min_{X,Y} ∥D − Y X^T∥² s.t. X ∈ R^{d×r}, Y ∈ R^{n×r} (is nonconvex!)
Solution: use the truncated SVD, D = U Σ V^T, then Y X^T = U_{·R} Σ_{RR} V_{·R}^T with R = {1, . . . , r}. To recommend, replace missing values in D with the mean rating µ, compute the truncated SVD, and use Y X^T as the recommendations. The approximation with the mean rating is poor for sparse entries. Solution: minimise only on the observed entries, with O the set of observed entries:
min_{X,Y} ∥1_O ◦ (D − Y X^T)∥² = Σ_{(i,j)∈O} (D_{ij} − Y_i X_j^T)²  s.t. X ∈ R^{d×r}, Y ∈ R^{n×r} (nonconvex, use coordinate descent)
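
A mean-fill + truncated-SVD sketch with NumPy; the small ratings matrix, the NaN convention for missing entries and r = 2 are illustrative.

import numpy as np

def recommend_truncated_svd(D, r):
    # Replace missing entries (NaN) with the mean observed rating,
    # then keep the top-r singular triplets: Y X^T = U_R Sigma_RR V_R^T.
    filled = D.copy()
    mu = np.nanmean(D)
    filled[np.isnan(filled)] = mu
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]       # rank-r approximation

D = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [np.nan, 1.0, 5.0, 4.0]])
print(np.round(recommend_truncated_svd(D, r=2), 2))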

Dimensionality Reduction (PCA)
Can create features from linear combinations of other features. Such a feature with the highest variance is a good feature.
For existing features F_1, . . . , F_d: collect the mean value of each feature, µ_F = (µ_{F_1}, . . . , µ_{F_d})^T.
Centered data matrix: C = D − 1µ_F^T (1 is the vector of ones)
Then the direction of maximal variance is given by α:
max_{∥α∥=1} α^T C^T C α / n  (n observations)
µ_{F_{d+1}} = µ_F^T α, σ²_{F_{d+1}} = (1/n)∥Cα∥²
For a low dimensional representation, find r orthogonal directions of largest variance, given by the columns of Z:
max_Z tr(Z^T C^T C Z) s.t. Z ∈ R^{d×r}, Z^T Z = I
= min_Z ∥C^T C − Z Σ²_{RR} Z^T∥² s.t. Z ∈ R^{d×r}, Z^T Z = I
for C = U Σ V^T and R = {1, . . . , r}; then Z = V_{·R}.
All these objectives are nonconvex (hence why we use the SVD).
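
A PCA sketch via the SVD of the centered data matrix; the toy data and r = 1 are illustrative.

import numpy as np

def pca_directions(D, r):
    C = D - D.mean(axis=0)                   # centered data matrix C = D - 1 mu_F^T
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Z = Vt[:r].T                             # d x r, columns = directions of largest variance
    variances = s[:r] ** 2 / D.shape[0]      # (1/n) ||C z||^2 per direction
    return Z, variances

rng = np.random.default_rng(4)
t = rng.normal(size=200)
D = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])  # nearly 1D data in 2D
Z, var = pca_directions(D, r=1)
print(Z.ravel(), var)                        # direction roughly (1, 2)/sqrt(5)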

K-Means Clustering
n observations, r clusters, data D. Let P_n denote the set of all possible disjoint partitions of {1, . . . , n}. Let C_i denote a cluster (a set of points). Then we minimize the distance between points:
min_{{C_1,...,C_r}∈P_n} Σ_{s=1}^r (1/|C_s|) Σ_{j,i∈C_s} ∥D_j − D_i∥²
This is a discrete problem, so the gradient is undefined. Reformulate:
min Σ_{s=1}^r Σ_{i∈C_s} ∥D_i − X_{·s}^T∥²  s.t.  X_{·s} = (1/|C_s|) Σ_{i∈C_s} D_i^T,  {C_1, . . . , C_r} ∈ P_n
Minimizing the distance to the centroid (X_{·s} is the centroid of C_s).
Lloyd's Algorithm: Block coordinate descent (see the sketch after this section).
• Initialize random centroid locations (in a reasonable range)
• Repeat:
  Assign points to the closest centroid
  Centroid center ← average location of its points
• Until convergence (which is guaranteed)
Reformulate: Let Y ∈ {0, 1}^{n×r} where Y_{is} = 1 ⇐⇒ i ∈ C_s. Then |Y_i| = 1 ∀ i ∈ {1, . . . , n}.
1^{n×r} is the set of all binary matrices partitioning n points into r sets: 1^{n×r} = {Y ∈ {0, 1}^{n×r} : |Y_i| = 1 for i ∈ {1, . . . , n}}
Then the sth centroid is X_{·s} = (1/|Y_{·s}|) D^T Y_{·s}, and X gathers all centroids column-wise: X = D^T Y (Y^T Y)^{−1}. The following k-means objectives are equivalent and nonconvex:
min_Y ∥D − Y X^T∥² s.t. Y ∈ 1^{n×r}, X = D^T Y (Y^T Y)^{−1}
min_{X,Y} ∥D − Y X^T∥² s.t. Y ∈ 1^{n×r}, X ∈ R^{d×r}
min_{X,Y} Σ_{s=1}^r Σ_{i=1}^n Y_{is} ∥D_i^T − X_{·s}∥² s.t. Y ∈ 1^{n×r}, X ∈ R^{d×r}
So k-means is an instance of the recommender problem, but we cannot use the SVD. Given the cluster assignment Y, the objective
min_X ∥D − Y X^T∥² s.t. X ∈ R^{d×r}
is convex with solution X = D^T Y (Y^T Y)^{−1}. Given the centroid locations X, the objective
min_Y ∥D − Y X^T∥² s.t. Y ∈ 1^{n×r}
is nonconvex because 1^{n×r} is not a convex set, but it does have a solution: assign every point to its closest centroid.
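
A Lloyd's-algorithm sketch with NumPy; the initialisation scheme (random sample of the data points), the fixed iteration count and the toy blobs are illustrative choices.

import numpy as np

def lloyd_kmeans(D, r, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    X = D[rng.choice(len(D), size=r, replace=False)]     # initial centroids (r x d)
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(D[:, None, :] - X[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Update step: centroid <- average location of its points.
        for s in range(r):
            if np.any(assign == s):
                X[s] = D[assign == s].mean(axis=0)
    return assign, X

rng = np.random.default_rng(5)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))
blob2 = rng.normal(loc=[3, 3], scale=0.3, size=(30, 2))
D = np.vstack([blob1, blob2])
assign, centroids = lloyd_kmeans(D, r=2)
print(centroids)                                         # close to (0,0) and (3,3)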

Nonconvex Clustering
Feature transform by using the kernel trick for clustering. Reformulate the k-means objective:
max_Y tr(Z^T D D^T Z) s.t. Z = Y (Y^T Y)^{−1/2}, Y ∈ 1^{n×r}
where D D^T is defined using inner products. Use a feature transform ϕ and kernel matrix K = ϕ(D)ϕ(D)^T; equivalent objectives:
min_Y ∥ϕ(D) − Y X^T∥² s.t. X = ϕ(D)^T Y (Y^T Y)^{−1}, Y ∈ 1^{n×r}
max_Y tr(Z^T K Z) s.t. Z = Y (Y^T Y)^{−1/2}, Y ∈ 1^{n×r}
Eigendecomposition: K = K^T ∈ R^{n×n} =⇒ K = V Λ V^T, with V orthogonal and Λ = diag(λ_1, . . . , λ_n), |λ_1| ≥ · · · ≥ |λ_n|.
Since kernel matrices are positive semidefinite, we have the symmetric decomposition K = A A^T with A = V Λ^{1/2} = ϕ(D). Substitute A into the above objective and run k-means; ϕ is not needed. Drawback: not robust, and a full eigendecomposition is needed.

Consider a symmetric similarity matrix W describing the similarity of points according to some measure (distance, knn or other). Use W as a weighted adjacency matrix to make a graph (the graph must be connected!). Clusters designate maximal similarity, which is equivalent to a minimal cut:
max_Y Sim = tr(Y^T W Y (Y^T Y)^{−1}) s.t. Y ∈ 1^{n×r}
min_Y Cut = tr((1 − Y)^T W Y (Y^T Y)^{−1}) s.t. Y ∈ 1^{n×r}
Laplacian: L = I_W − W, with I_W the degree matrix of W.
Indicator vectors of connected components: eigenvectors of L.
Spectral Clustering: Uses only the r eigenvectors after the first one, which is faster than a full eigendecomposition.
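
A spectral-clustering sketch for r = 2: build a similarity graph, form the Laplacian L = I_W − W, and split on the sign of the eigenvector after the first one (for r > 2 one would instead run k-means on those r eigenvectors). The Gaussian similarity and its width are illustrative choices.

import numpy as np

rng = np.random.default_rng(6)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
blob2 = rng.normal(loc=[4, 4], scale=0.3, size=(20, 2))
D = np.vstack([blob1, blob2])

# Symmetric similarity matrix W (Gaussian of pairwise distances, illustrative width).
sq_dists = np.sum((D[:, None, :] - D[None, :, :]) ** 2, axis=2)
W = np.exp(-sq_dists / 2.0)
np.fill_diagonal(W, 0.0)

I_W = np.diag(W.sum(axis=1))          # degree matrix
L = I_W - W                           # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = eigvecs[:, 1]               # first eigenvector after the constant one
labels = (fiedler > 0).astype(int)
print(labels)                         # the two blobs get different labels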

Data Fusion
True Parameters from Noisy Observations
Unknown vector of parameters θ. Noisy observations {y̌_ℓ}, ℓ ∈ {1, . . . , R}, from R sensors. The parameters are random: Θ ∼ p(θ). The sensors are also random, but depend on the parameters they measure, so we consider Y_ℓ|θ ∼ p(y_ℓ|θ).
Assumption: p(y_1, . . . , y_R|θ) = ∏_{ℓ=1}^R p(y_ℓ|θ)
Idea: The true parameters are the most likely ones given the observations from the sensors. Use p̃ ≈ p, then
θ̂ = E[θ|y̌_1, . . . , y̌_R] = ∫ θ p̃(θ|y̌_1, . . . , y̌_R) dθ
Monte Carlo: Approximate by taking the experimental average over many samples of p(θ|y̌_1, . . . , y̌_R).

State Tracking (states are hidden)
States {x_n}, time instants {t_n}, observations (one source) {y̌_n}.
Markov Assumption: p(x_n|x_{0:n−1}) = p(x_n|x_{n−1}) and p(x_n|x_{n−1}, y_{0:n−1}) = p(x_n|x_{n−1})
The state only depends on the directly previous state, and does not depend on previous observations.
White Observation Noise: p(y_n|x_n, y_{0:n−1}) = p(y_n|x_n)
Observations do not depend on previous ones.
X_{n+1} = f_n(X_n) + W_n, W_n ∼ p(w_n)
Y_n = h_n(X_n) + V_n, V_n ∼ p(v_n)
Additive (process, observation) noise.
Chapman-Kolmogorov: p(x_n|y̌_{0:n−1}) = ∫ p(x_n|x_{n−1}) p(x_{n−1}|y̌_{0:n−1}) dx_{n−1}
Bayes Rule: p(x_n|y̌_{0:n}) ∝ p(y̌_n|x_n) p(x_n|y̌_{0:n−1})
MMSE Estimate: x̂_n = ∫ x_n p(x_n|y̌_{0:n}) dx_n
A: predicted posterior, B: state transition distribution, C: previous posterior, D: updated posterior, E: likelihood function.
p(x_0) → Repeat[C; B; A; (y̌_n + E); D] → x̂_n
Use p(x_0) as the previous posterior for the first loop.

good luck!
Eduardo Costa Martins